A Distribution-Free Theory of Nonparametric Regression

László Györfi, et al.

Springer


Springer Series in Statistics

Advisors: P. Bickel, P. Diggle, S. Fienberg, K. Krickeberg, I. Olkin, N. Wermuth, S. Zeger


László Györfi · Michael Kohler · Adam Krzyżak · Harro Walk

A Distribution-Free Theory of Nonparametric Regression

With 86 Figures


László Györfi
Department of Computer Science and Information Theory
Budapest University of Technology and Economics
1521 Stoczek, U.2.
Budapest, Hungary
[email protected]

Michael Kohler
Fachbereich Mathematik
Universität Stuttgart
Pfaffenwaldring 57
70569 Stuttgart, Germany
[email protected]

Adam Krzyżak
Department of Computer Science
Concordia University
1455 De Maisonneuve Boulevard West
Montreal, Quebec, H3G 1M8, Canada
[email protected]

Harro Walk
Fachbereich Mathematik
Universität Stuttgart
Pfaffenwaldring 57
70569 Stuttgart, Germany
[email protected]

Library of Congress Cataloging-in-Publication Data
A distribution-free theory of nonparametric regression / László Györfi . . . [et al.].
p. cm. — (Springer series in statistics)
Includes bibliographical references and index.
ISBN 0-387-95441-4 (alk. paper)
1. Regression analysis. 2. Nonparametric statistics. 3. Distribution (Probability theory)
I. Györfi, László. II. Series.
QA278.2 .D57 2002
519.5′36—dc21    2002021151

ISBN 0-387-95441-4 Printed on acid-free paper.

© 2002 Springer-Verlag New York, Inc.
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York, NY 10010, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed in the United States of America.

9 8 7 6 5 4 3 2 1 SPIN 10866288

Typesetting: Pages created by the authors using a Springer TeX macro package.

www.springer-ny.com

Springer-Verlag New York Berlin Heidelberg
A member of BertelsmannSpringer Science+Business Media GmbH


To our families:

Kati, Kati, and Jancsi
Judith, Iris, and Julius
Henryka, Jakub, and Tomasz
Hildegard


Preface

The regression estimation problem has a long history. Already in 1632 Galileo Galilei used a procedure which can be interpreted as fitting a linear relationship to contaminated observed data. Such fitting of a line through a cloud of points is the classical linear regression problem. A solution of this problem is provided by the famous principle of least squares, which was discovered independently by A. M. Legendre and C. F. Gauss and published in 1805 and 1809, respectively. The principle of least squares can also be applied to construct nonparametric regression estimates, where one does not restrict the class of possible relationships, and will be one of the approaches studied in this book.

Linear regression analysis, based on the concept of a regression function, was introduced by F. Galton in 1889, while a probabilistic approach in the context of multivariate normal distributions was already given by A. Bravais in 1846. The first nonparametric regression estimate of local averaging type was proposed by J. W. Tukey in 1947. The partitioning regression estimate he introduced, by analogy to the classical partitioning (histogram) density estimate, can be regarded as a special least squares estimate.

Some aspects of nonparametric estimation had already appeared in belletristic literature in 1930/31 in The Man Without Qualities by Robert Musil (1880–1942) where, in Section 103 (first book), methods of partitioning estimation are described: “... as happens so often in life, you ... find yourself facing a phenomenon about which you can’t quite tell whether it is a law or pure chance; that’s where things acquire a human interest. Then you translate a series of observations into a series of figures, which you divide into categories to see which numbers lie between this value and that,


and the next, and so on .... You then calculate the degree of aberration, the mean deviation, the degree of deviation from some arbitrary value ... the average value ... and so forth, and with the help of all these concepts you study your given phenomenon” (cited from page 531 of the English translation, Alfred A. Knopf Inc., Picador, 1995).

Besides its long history, the problem of regression estimation is of increasing importance today. Stimulated by the tremendous growth of information technology in the past 20 years, there is a growing demand for procedures capable of automatically extracting useful information from the massive high-dimensional databases that companies gather about their customers. One of the fundamental approaches for dealing with this “data-mining problem” is regression estimation. Usually there is little or no a priori information about the data, leaving the researcher with no other choice but a nonparametric approach.

This book presents a modern approach to nonparametric regression with random design. The starting point is a prediction problem where minimization of the mean squared error (or L2 risk) leads to the regression function. If the goal is to construct an estimate of this function which has mean squared prediction error close to the minimum mean squared error, then this goal naturally leads to the L2 error criterion used throughout this book.

We study almost all known regression estimates, such as classical local averaging estimates including kernel, partitioning, and nearest neighbor estimates, least squares estimates using splines, neural networks and radial basis function networks, penalized least squares estimates, local polynomial kernel estimates, and orthogonal series estimates. The emphasis is on the distribution-free properties of the estimates, and thus most consistency results presented in this book are valid for all distributions of the data. When it is impossible to derive distribution-free results, as is the case for rates of convergence, the emphasis is on results which require as few constraints on distributions as possible, on distribution-free inequalities, and on adaptation.

Our aim in writing this book was to produce a self-contained text intended for a wide audience, including graduate students in statistics, mathematics, computer science, and engineering, as well as researchers in these fields. We start off with elementary techniques and gradually introduce more difficult concepts as we move along. Chapters 1–6 require only a basic knowledge of probability. In Chapters 7 and 8 we use exponential inequalities for the sum of independent random variables and for the sum of martingale differences. These inequalities are proven in Appendix A. The remaining part of the book contains somewhat more advanced concepts, such as almost sure convergence together with the real analysis techniques given in Appendix A. The foundations of the least squares and penalized least squares estimates are given in Chapters 9 and 19, respectively.


Figure 1. The structure of the book.

The structure of the book is shown in Figure 1. This figure is a precedence tree which could assist an instructor in organizing a course based on this book. It shows the sequence of chapters needed to be covered in order to understand a particular chapter. The focus of the chapters in the upper-left box is on local averaging estimates, in the lower-left box on strong consistency results, in the upper-right box on least squares estimation, and in the lower-right box on penalized least squares.

We would like to acknowledge the contribution of many people who influenced the writing of this book. Luc Devroye, Gábor Lugosi, Eric Regener, and Alexandre Tsybakov made many invaluable suggestions leading to conceptual improvements and better presentation. A number of colleagues and friends have, often without realizing it, contributed to our understanding of nonparametrics. In particular we would like to thank in this respect Paul Algoet, Andrew Barron, Peter Bartlett, Lucien Birgé, Jan Beirlant, Alain Berlinet, Sándor Csibi, Miguel Delgado, Jürgen Dippon, Jerome Friedman, Włodzimierz Greblicki, Iain Johnstone, Jack Koplowitz, Tamás Linder, Andrew Nobel, Mirek Pawlak, Ewaryst Rafajłowicz, Igor Vajda, Sara van de Geer, Edward van der Meulen, and Sid Yakowitz. András Antos, András György, Michael Hamers, Kinga Máthé, Dániel Nagy, Márta Pintér, Dominik Schäfer, and Stefan Winter provided long lists of mistakes and typographical errors. Sándor Györi drew the figures and gave us advice and help on many LaTeX problems. John Kimmel was helpful, patient, and supportive at every stage.

In addition, we gratefully acknowledge the research support of the Budapest University of Technology and Economics, the Hungarian Academy of Sciences (MTA SZTAKI, AKP, and MTA IEKCS), the Hungarian Ministry of Education (FKFP and MOB), the University of Stuttgart, Deutsche Forschungsgemeinschaft, Stiftung Volkswagenwerk, Deutscher Akademischer Austauschdienst, Alexander von Humboldt Stiftung, Concordia University, Montreal, NSERC Canada, and FCAR Quebec.

Early versions of this text were tried out at a DMV seminar in Oberwolfach, Germany, and in various classes at the Carlos III University of Madrid, the University of Stuttgart, and at the International Centre for Mechanical Sciences in Udine. We would like to thank the students there for useful feedback which improved this book.

László Györfi, Budapest, Hungary
Michael Kohler, Stuttgart, Germany
Adam Krzyżak, Montreal, Canada
Harro Walk, Stuttgart, Germany

June 6, 2002


Contents

Preface

1 Why Is Nonparametric Regression Important?
   1.1 Regression Analysis and L2 Risk
   1.2 Regression Function Estimation and L2 Error
   1.3 Practical Applications
   1.4 Application to Pattern Recognition
   1.5 Parametric versus Nonparametric Estimation
   1.6 Consistency
   1.7 Rate of Convergence
   1.8 Adaptation
   1.9 Fixed versus Random Design Regression
   1.10 Bibliographic Notes
   Problems and Exercises

2 How to Construct Nonparametric Regression Estimates?
   2.1 Four Related Paradigms
   2.2 Curse of Dimensionality
   2.3 Bias–Variance Tradeoff
   2.4 Choice of Smoothing Parameters and Adaptation
   2.5 Bibliographic Notes
   Problems and Exercises


3 Lower Bounds
   3.1 Slow Rate
   3.2 Minimax Lower Bounds
   3.3 Individual Lower Bounds
   3.4 Bibliographic Notes
   Problems and Exercises

4 Partitioning Estimates
   4.1 Introduction
   4.2 Stone’s Theorem
   4.3 Consistency
   4.4 Rate of Convergence
   4.5 Bibliographic Notes
   Problems and Exercises

5 Kernel Estimates
   5.1 Introduction
   5.2 Consistency
   5.3 Rate of Convergence
   5.4 Local Polynomial Kernel Estimates
   5.5 Bibliographic Notes
   Problems and Exercises

6 k-NN Estimates
   6.1 Introduction
   6.2 Consistency
   6.3 Rate of Convergence
   6.4 Bibliographic Notes
   Problems and Exercises

7 Splitting the Sample
   7.1 Best Random Choice of a Parameter
   7.2 Partitioning, Kernel, and Nearest Neighbor Estimates
   7.3 Bibliographic Notes
   Problems and Exercises

8 Cross-Validation
   8.1 Best Deterministic Choice of the Parameter
   8.2 Partitioning and Kernel Estimates
   8.3 Proof of Theorem 8.1
   8.4 Nearest Neighbor Estimates
   8.5 Bibliographic Notes
   Problems and Exercises

9 Uniform Laws of Large Numbers


   9.1 Basic Exponential Inequalities
   9.2 Extension to Random L1 Norm Covers
   9.3 Covering and Packing Numbers
   9.4 Shatter Coefficients and VC Dimension
   9.5 A Uniform Law of Large Numbers
   9.6 Bibliographic Notes
   Problems and Exercises

10 Least Squares Estimates I: Consistency
   10.1 Why and How Least Squares?
   10.2 Consistency from Bounded to Unbounded Y
   10.3 Linear Least Squares Series Estimates
   10.4 Piecewise Polynomial Partitioning Estimates
   10.5 Bibliographic Notes
   Problems and Exercises

11 Least Squares Estimates II: Rate of Convergence
   11.1 Linear Least Squares Estimates
   11.2 Piecewise Polynomial Partitioning Estimates
   11.3 Nonlinear Least Squares Estimates
   11.4 Preliminaries to the Proof of Theorem 11.4
   11.5 Proof of Theorem 11.4
   11.6 Bibliographic Notes
   Problems and Exercises

12 Least Squares Estimates III: Complexity Regularization
   12.1 Motivation
   12.2 Definition of the Estimate
   12.3 Asymptotic Results
   12.4 Piecewise Polynomial Partitioning Estimates
   12.5 Bibliographic Notes
   Problems and Exercises

13 Consistency of Data-Dependent Partitioning Estimates
   13.1 A General Consistency Theorem
   13.2 Cubic Partitions with Data-Dependent Grid Size
   13.3 Statistically Equivalent Blocks
   13.4 Nearest Neighbor Clustering
   13.5 Bibliographic Notes
   Problems and Exercises

14 Univariate Least Squares Spline Estimates
   14.1 Introduction to Univariate Splines
   14.2 Consistency
   14.3 Spline Approximation


   14.4 Rate of Convergence
   14.5 Bibliographic Notes
   Problems and Exercises

15 Multivariate Least Squares Spline Estimates
   15.1 Introduction to Tensor Product Splines
   15.2 Consistency
   15.3 Rate of Convergence
   15.4 Bibliographic Notes
   Problems and Exercises

16 Neural Networks Estimates
   16.1 Neural Networks
   16.2 Consistency
   16.3 Rate of Convergence
   16.4 Bibliographic Notes
   Problems and Exercises

17 Radial Basis Function Networks
   17.1 Radial Basis Function Networks
   17.2 Consistency
   17.3 Rate of Convergence
   17.4 Increasing Kernels and Approximation
   17.5 Bibliographic Notes
   Problems and Exercises

18 Orthogonal Series Estimates
   18.1 Wavelet Estimates
   18.2 Empirical Orthogonal Series Estimates
   18.3 Connection with Least Squares Estimates
   18.4 Empirical Orthogonalization of Piecewise Polynomials
   18.5 Consistency
   18.6 Rate of Convergence
   18.7 Bibliographic Notes
   Problems and Exercises

19 Advanced Techniques from Empirical Process Theory
   19.1 Chaining
   19.2 Extension of Theorem 11.6
   19.3 Extension of Theorem 11.4
   19.4 Piecewise Polynomial Partitioning Estimates
   19.5 Bibliographic Notes
   Problems and Exercises

20 Penalized Least Squares Estimates I: Consistency


   20.1 Univariate Penalized Least Squares Estimates
   20.2 Proof of Lemma 20.1
   20.3 Consistency
   20.4 Multivariate Penalized Least Squares Estimates
   20.5 Consistency
   20.6 Bibliographic Notes
   Problems and Exercises

21 Penalized Least Squares Estimates II: Rate of Convergence
   21.1 Rate of Convergence
   21.2 Application of Complexity Regularization
   21.3 Bibliographic Notes
   Problems and Exercises

22 Dimension Reduction Techniques
   22.1 Additive Models
   22.2 Projection Pursuit
   22.3 Single Index Models
   22.4 Bibliographic Notes
   Problems and Exercises

23 Strong Consistency of Local Averaging Estimates
   23.1 Partitioning Estimates
   23.2 Kernel Estimates
   23.3 k-NN Estimates
   23.4 Bibliographic Notes
   Problems and Exercises

24 Semirecursive Estimates
   24.1 A General Result
   24.2 Semirecursive Kernel Estimate
   24.3 Semirecursive Partitioning Estimate
   24.4 Bibliographic Notes
   Problems and Exercises

25 Recursive Estimates
   25.1 A General Result
   25.2 Recursive Kernel Estimate
   25.3 Recursive Partitioning Estimate
   25.4 Recursive NN Estimate
   25.5 Recursive Series Estimate
   25.6 Pointwise Universal Consistency
   25.7 Bibliographic Notes
   Problems and Exercises


26 Censored Observations
   26.1 Right Censoring Regression Models
   26.2 Survival Analysis, the Kaplan-Meier Estimate
   26.3 Regression Estimation for Model A
   26.4 Regression Estimation for Model B
   26.5 Bibliographic Notes
   Problems and Exercises

27 Dependent Observations
   27.1 Stationary and Ergodic Observations
   27.2 Dynamic Forecasting: Autoregression
   27.3 Static Forecasting: General Case
   27.4 Time Series Problem: Cesàro Consistency
   27.5 Time Series Problem: Universal Prediction
   27.6 Estimating Smooth Regression Functions
   27.7 Bibliographic Notes
   Problems and Exercises

Appendix A: Tools
   A.1 A Denseness Result
   A.2 Inequalities for Independent Random Variables
   A.3 Inequalities for Martingales
   A.4 Martingale Convergences
   Problems and Exercises

Notation

Bibliography

Author Index

Subject Index


1 Why Is Nonparametric Regression Important?

In the present and following chapters we provide an overview of this book. In this chapter we introduce the problem of regression function estimation and describe important properties of regression estimates. An overview of various approaches to nonparametric regression estimates is provided in the next chapter.

1.1 Regression Analysis and L2 Risk

In regression analysis one considers a random vector (X, Y), where X is R^d-valued and Y is R-valued, and one is interested in how the value of the so-called response variable Y depends on the value of the observation vector X. This means that one wants to find a (measurable) function f: R^d → R, such that f(X) is a “good approximation of Y,” that is, f(X) should be close to Y in some sense, which is equivalent to making |f(X) − Y| “small.” Since X and Y are random vectors, |f(X) − Y| is random as well, therefore it is not clear what “small |f(X) − Y|” means. We can resolve this problem by introducing the so-called L2 risk or mean squared error of f,

$$\mathbf{E}|f(X) - Y|^2,$$

and requiring it to be as small as possible.

While it seems natural to use the expectation, it is not obvious why one wants to minimize the expectation of the squared distance between f(X) and Y and not, more generally, the Lp risk

$$\mathbf{E}|f(X) - Y|^p$$

for some p ≥ 1 (especially p = 1). There are two reasons for considering the L2 risk. First, as we will see in the sequel, this simplifies the mathematical treatment of the whole problem. For example, as is shown below, the function which minimizes the L2 risk can be derived explicitly. Second, and more important, trying to minimize the L2 risk leads naturally to estimates which can be computed rapidly. This will be described later in detail (see, e.g., Chapter 10).

So we are interested in a (measurable) function m*: R^d → R such that

$$\mathbf{E}|m^*(X) - Y|^2 = \min_{f:\mathbb{R}^d \to \mathbb{R}} \mathbf{E}|f(X) - Y|^2.$$

Such a function can be obtained explicitly as follows. Let

$$m(x) = \mathbf{E}\{Y \mid X = x\}$$

be the regression function. We will show that the regression function minimizes the L2 risk. Indeed, for an arbitrary f: R^d → R, one has

$$\mathbf{E}|f(X) - Y|^2 = \mathbf{E}|f(X) - m(X) + m(X) - Y|^2 = \mathbf{E}|f(X) - m(X)|^2 + \mathbf{E}|m(X) - Y|^2,$$

where we have used

$$\begin{aligned}
\mathbf{E}\{(f(X) - m(X))(m(X) - Y)\}
&= \mathbf{E}\bigl\{\mathbf{E}\{(f(X) - m(X))(m(X) - Y) \mid X\}\bigr\} \\
&= \mathbf{E}\bigl\{(f(X) - m(X))\,\mathbf{E}\{m(X) - Y \mid X\}\bigr\} \\
&= \mathbf{E}\bigl\{(f(X) - m(X))(m(X) - m(X))\bigr\} \\
&= 0.
\end{aligned}$$

Hence,

$$\mathbf{E}|f(X) - Y|^2 = \int_{\mathbb{R}^d} |f(x) - m(x)|^2 \,\mu(dx) + \mathbf{E}|m(X) - Y|^2, \qquad (1.1)$$

where µ denotes the distribution of X. The first term is called the L2 error of f. It is always nonnegative and is zero if f(x) = m(x). Therefore, m*(x) = m(x), i.e., the optimal approximation (with respect to the L2 risk) of Y by a function of X is given by m(X).
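The decomposition (1.1) can also be checked numerically. The following sketch is not from the book; the joint distribution, the competing predictor f, and the sample size are illustrative assumptions chosen only to show that the L2 risk of any f exceeds that of the regression function m by the L2 error of f (up to Monte Carlo noise).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative joint distribution: X uniform on [0, 1],
# Y = sin(2*pi*X) + noise, so the regression function is m(x) = sin(2*pi*x).
n = 200_000
X = rng.uniform(0.0, 1.0, n)
Y = np.sin(2 * np.pi * X) + rng.normal(0.0, 0.3, n)

def m(x):                      # the true regression function E{Y | X = x}
    return np.sin(2 * np.pi * x)

def f(x):                      # an arbitrary competing predictor
    return 2 * x - 1

risk_m = np.mean((m(X) - Y) ** 2)          # L2 risk of the regression function
risk_f = np.mean((f(X) - Y) ** 2)          # L2 risk of f
l2_error_f = np.mean((f(X) - m(X)) ** 2)   # L2 error of f (integral w.r.t. mu)

# Decomposition (1.1): risk_f should be approximately l2_error_f + risk_m.
print(risk_f, l2_error_f + risk_m)
```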

1.2 Regression Function Estimation and L2 Error

In applications the distribution of (X, Y) (and hence also the regression function) is usually unknown. Therefore it is impossible to predict Y using m(X). But it is often possible to observe data according to the distribution of (X, Y) and to estimate the regression function from these data.

To be more precise, denote by (X, Y), (X_1, Y_1), (X_2, Y_2), ... independent and identically distributed (i.i.d.) random variables with EY^2 < ∞. Let D_n be the set of data defined by

$$D_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}.$$

In the regression function estimation problem one wants to use the data D_n in order to construct an estimate m_n: R^d → R of the regression function m. Here m_n(x) = m_n(x, D_n) is a measurable function of x and the data. For simplicity, we will suppress D_n in the notation and write m_n(x) instead of m_n(x, D_n).

In general, estimates will not be equal to the regression function. To compare different estimates, we need an error criterion which measures the difference between the regression function and an arbitrary estimate m_n. In the literature, several distinct error criteria are used: first, the pointwise error,

$$|m_n(x) - m(x)| \quad \text{for some fixed } x \in \mathbb{R}^d,$$

second, the supremum norm error,

$$\|m_n - m\|_\infty = \sup_{x \in C} |m_n(x) - m(x)| \quad \text{for some fixed set } C \subseteq \mathbb{R}^d,$$

which is mostly used for d = 1 where C is a compact subset of R and, third, the Lp error,

$$\int_C |m_n(x) - m(x)|^p \, dx,$$

where the integration is with respect to the Lebesgue measure, C is a fixed subset of R^d, and p ≥ 1 is arbitrary (often p = 2 is used).

One of the key points we would like to make is that the motivation for introducing the regression function leads naturally to an L2 error criterion for measuring the performance of the regression function estimate. Recall that the main goal was to find a function f such that the L2 risk E|f(X) − Y|^2 is small. The minimal value of this L2 risk is E|m(X) − Y|^2, and it is achieved by the regression function m. Similarly to (1.1), one can show that the L2 risk E{|m_n(X) − Y|^2 | D_n} of an estimate m_n satisfies

$$\mathbf{E}\{|m_n(X) - Y|^2 \mid D_n\} = \int_{\mathbb{R}^d} |m_n(x) - m(x)|^2 \,\mu(dx) + \mathbf{E}|m(X) - Y|^2. \qquad (1.2)$$

Thus the L2 risk of an estimate m_n is close to the optimal value if and only if the L2 error

$$\int_{\mathbb{R}^d} |m_n(x) - m(x)|^2 \,\mu(dx) \qquad (1.3)$$

is close to zero. Therefore we will use the L2 error (1.3) in order to measure the quality of an estimate and we will study estimates for which this L2 error is small.

1.3 Practical Applications

In this section we describe several applications in order to illustrate the practical relevance of regression estimation.

Example 1.1. Loan management.

A bank is interested in predicting the return Y on a loan given to a customer. Available to the bank is the profile X of the customer including his credit history, assets, profession, income, age, etc. The predicted return affects the decision as to whether to issue or refuse a loan, as well as the conditions of the loan. For more details refer to Krahl et al. (1998).

Example 1.2. Profit prediction in marketing.

A company is interested in boosting its sales by mailing product advertisements to potential customers. Typically 50,000 people are selected. If selection is done randomly, then typically only one or two percent respond to the advertisement. This way, about 49,000 letters are wasted. What is more, many respondents out of this number will choose only discounted offers on which the company loses money, or they will buy a product at the regular price but later return it. The company makes money only on the remaining respondents. Clearly, the company is interested in predicting the profit (or loss) for a potential customer. It is easy to obtain a list of names and addresses of potential customers by choosing them randomly from the telephone book or by buying a list from another company. Furthermore, there are databases available which provide, for each name and address, attributes like sex, age, education, profession, etc., describing the person (or a small group of people to which he belongs). The company is interested in using the vector X of attributes for a particular customer to predict the profit Y.

Example 1.3. Boston housing values.

Harrison and Rubinfeld (1978) considered the effect of air pollution concentration on housing values in Boston. The data consisted of 506 samples of median home values Y in a neighborhood with attributes X such as nitrogen oxide concentration, crime rate, average number of rooms, percentage of nonretail businesses, etc. A regression estimate was fitted to the data and it was then used to determine the median value of homes as a function of air pollution measured by nitrogen oxide concentration. For more details refer to Harrison and Rubinfeld (1978) and Breiman et al. (1984).


Example 1.4. Wheat crop prediction.

The Ministry for Agriculture of Hungary supported a research project for estimating the total expected crop yield of corn and wheat in order to plan and schedule the export-import of these commodities. They tried to predict the corn and wheat yield per unit area based on measurements of the reflected light spectrum obtained from satellite images taken by the LANDSAT 7 satellite (Asmus et al. (1987)). The satellite computes the integrals of spectral density (the energy of the light) in the following spectrum bands (wavelengths in µm):
(1) [0.45, 0.52] blue;
(2) [0.52, 0.60] green;
(3) [0.63, 0.69] yellow;
(4) [0.76, 0.90] red;
(5) [1.55, 1.75] infrared;
(6) [2.08, 2.35] infrared; and
(7) [10.40, 12.50] infrared.
These are the components of the observation vector X which is used to predict crop yields for corn and wheat. The bands (2), (3), and (4) turned out to be the most relevant for this task.

Example 1.5. Fat-free weight.

A variety of health books suggest that the readers assess their health, at least in part, by considering the percentage of fat-free weight. Exact determination of this quantity requires knowledge of the body volume, which is not easily measurable. It can be computed from an underwater weighing: for this, a person has to undress and submerge in water in order to compute the increase of volume. This procedure is very inconvenient and so one wishes to estimate the body fat content Y from indirect measurements X, such as electrical impedance of the skin, weight, height, age, and sex.

Example 1.6. Survival analysis.

In survival analysis one is interested in predicting the survival time Y of a patient with a life-threatening disease given a description X of the case, such as type of disease, blood measurements, sex, age, therapy, etc. The result can be used to determine the appropriate therapy for a patient by maximizing the predicted survival time with respect to the therapy (see, e.g., Dippon, Fritz, and Kohler (2002) for an application in connection with breast cancer data). One specific feature in this application is that usually one cannot observe the survival time of a patient. Instead, one gets only the minimum of the survival time and a censoring time together with the information as to whether the survival time is less than the censoring time or not. We deal with regression function estimation from such censored data in Chapter 26.


Most of these applications are concerned with the prediction of Y from X. But some of them (see Examples 1.3 and 1.6) also deal with interpretation of the dependency of Y on X.

1.4 Application to Pattern Recognition

In pattern recognition, Y takes only finitely many values. For simplicity assume that Y takes two values, say 0 and 1. The aim is to predict the value of Y given the value of X (e.g., to predict whether a patient has a special disease or not, given some measurements of the patient like body temperature, blood pressure, etc.). The goal is to find a function g*: R^d → {0, 1} which minimizes the probability of g*(X) ≠ Y, i.e., to find a function g* such that

$$\mathbf{P}\{g^*(X) \neq Y\} = \min_{g:\mathbb{R}^d \to \{0,1\}} \mathbf{P}\{g(X) \neq Y\}, \qquad (1.4)$$

where g* is called the Bayes decision function, and P{g(X) ≠ Y} is the probability of misclassification.

The Bayes decision function can be obtained explicitly.

Lemma 1.1. The function

$$g^*(x) = \begin{cases} 1 & \text{if } \mathbf{P}\{Y = 1 \mid X = x\} \ge 1/2, \\ 0 & \text{if } \mathbf{P}\{Y = 1 \mid X = x\} < 1/2, \end{cases} \qquad (1.5)$$

is the Bayes decision function, i.e., g* satisfies (1.4).

Proof. Let g: R^d → {0, 1} be an arbitrary (measurable) function. Fix x ∈ R^d. Then

$$\mathbf{P}\{g(X) \neq Y \mid X = x\} = 1 - \mathbf{P}\{g(X) = Y \mid X = x\} = 1 - \mathbf{P}\{Y = g(x) \mid X = x\}.$$

Hence,

$$\mathbf{P}\{g(X) \neq Y \mid X = x\} - \mathbf{P}\{g^*(X) \neq Y \mid X = x\} = \mathbf{P}\{Y = g^*(x) \mid X = x\} - \mathbf{P}\{Y = g(x) \mid X = x\} \ge 0,$$

because

$$\mathbf{P}\{Y = g^*(x) \mid X = x\} = \max\bigl\{\mathbf{P}\{Y = 0 \mid X = x\}, \mathbf{P}\{Y = 1 \mid X = x\}\bigr\}$$

by the definition of g*. This proves

$$\mathbf{P}\{g^*(X) \neq Y \mid X = x\} \le \mathbf{P}\{g(X) \neq Y \mid X = x\}$$

for all x ∈ R^d, which implies

$$\mathbf{P}\{g^*(X) \neq Y\} = \int \mathbf{P}\{g^*(X) \neq Y \mid X = x\} \,\mu(dx) \le \int \mathbf{P}\{g(X) \neq Y \mid X = x\} \,\mu(dx) = \mathbf{P}\{g(X) \neq Y\}. \qquad \Box$$

P{Y = 1 | X = x} and P{Y = 0 | X = x} are the so-called a posteriori probabilities. Observe that

$$\mathbf{P}\{Y = 1 \mid X = x\} = \mathbf{E}\{Y \mid X = x\} = m(x).$$

A natural approach is to estimate the regression function m by an estimate m_n using data D_n = {(X_1, Y_1), ..., (X_n, Y_n)} and then to use a so-called plug-in estimate

$$g_n(x) = \begin{cases} 1 & \text{if } m_n(x) \ge 1/2, \\ 0 & \text{if } m_n(x) < 1/2, \end{cases} \qquad (1.6)$$

to estimate g*. The next theorem implies that if m_n is close to the real regression function m, then the error probability of decision g_n is near to the error probability of the optimal decision g*.

Theorem 1.1. Let $\bar{m}: \mathbb{R}^d \to \mathbb{R}$ be a fixed function and define the plug-in decision $\bar{g}$ by

$$\bar{g}(x) = \begin{cases} 1 & \text{if } \bar{m}(x) \ge 1/2, \\ 0 & \text{if } \bar{m}(x) < 1/2. \end{cases}$$

Then

$$0 \le \mathbf{P}\{\bar{g}(X) \neq Y\} - \mathbf{P}\{g^*(X) \neq Y\} \le 2 \int_{\mathbb{R}^d} |\bar{m}(x) - m(x)| \,\mu(dx) \le 2 \left( \int_{\mathbb{R}^d} |\bar{m}(x) - m(x)|^2 \,\mu(dx) \right)^{1/2}.$$

Proof. It follows from the proof of Lemma 1.1 that, for arbitrary x ∈ R^d,

$$\begin{aligned}
& \mathbf{P}\{\bar{g}(X) \neq Y \mid X = x\} - \mathbf{P}\{g^*(X) \neq Y \mid X = x\} \\
&\quad = \mathbf{P}\{Y = g^*(x) \mid X = x\} - \mathbf{P}\{Y = \bar{g}(x) \mid X = x\} \\
&\quad = \bigl( I_{\{g^*(x)=1\}} m(x) + I_{\{g^*(x)=0\}}(1 - m(x)) \bigr)
      - \bigl( I_{\{\bar{g}(x)=1\}} m(x) + I_{\{\bar{g}(x)=0\}}(1 - m(x)) \bigr) \\
&\quad = \bigl( I_{\{g^*(x)=1\}} m(x) + I_{\{g^*(x)=0\}}(1 - m(x)) \bigr)
      - \bigl( I_{\{g^*(x)=1\}} \bar{m}(x) + I_{\{g^*(x)=0\}}(1 - \bar{m}(x)) \bigr) \\
&\qquad + \bigl( I_{\{g^*(x)=1\}} \bar{m}(x) + I_{\{g^*(x)=0\}}(1 - \bar{m}(x)) \bigr)
      - \bigl( I_{\{\bar{g}(x)=1\}} \bar{m}(x) + I_{\{\bar{g}(x)=0\}}(1 - \bar{m}(x)) \bigr) \\
&\qquad + \bigl( I_{\{\bar{g}(x)=1\}} \bar{m}(x) + I_{\{\bar{g}(x)=0\}}(1 - \bar{m}(x)) \bigr)
      - \bigl( I_{\{\bar{g}(x)=1\}} m(x) + I_{\{\bar{g}(x)=0\}}(1 - m(x)) \bigr) \\
&\quad \le I_{\{g^*(x)=1\}}(m(x) - \bar{m}(x)) + I_{\{g^*(x)=0\}}(\bar{m}(x) - m(x)) \\
&\qquad + I_{\{\bar{g}(x)=1\}}(\bar{m}(x) - m(x)) + I_{\{\bar{g}(x)=0\}}(m(x) - \bar{m}(x))
\end{aligned}$$

(because

$$I_{\{\bar{g}(x)=1\}} \bar{m}(x) + I_{\{\bar{g}(x)=0\}}(1 - \bar{m}(x)) = \max\{\bar{m}(x), 1 - \bar{m}(x)\}$$

by definition of $\bar{g}$)

$$\le 2|\bar{m}(x) - m(x)|.$$

Hence

$$\begin{aligned}
0 &\le \mathbf{P}\{\bar{g}(X) \neq Y\} - \mathbf{P}\{g^*(X) \neq Y\} \\
&= \int \bigl( \mathbf{P}\{\bar{g}(X) \neq Y \mid X = x\} - \mathbf{P}\{g^*(X) \neq Y \mid X = x\} \bigr) \mu(dx) \\
&\le 2 \int |\bar{m}(x) - m(x)| \,\mu(dx).
\end{aligned}$$

The second assertion follows from the Cauchy–Schwarz inequality. $\Box$

In Theorem 1.1, the second inequality in particular is not tight. Therefore pattern recognition is easier than regression estimation (cf. Devroye, Györfi, and Lugosi (1996)).

It follows from Theorem 1.1 that the error probability of the plug-in decision g_n defined above satisfies

$$0 \le \mathbf{P}\{g_n(X) \neq Y \mid D_n\} - \mathbf{P}\{g^*(X) \neq Y\} \le 2 \int_{\mathbb{R}^d} |m_n(x) - m(x)| \,\mu(dx) \le 2 \left( \int_{\mathbb{R}^d} |m_n(x) - m(x)|^2 \,\mu(dx) \right)^{1/2}.$$

Thus estimates m_n with small L2 error automatically lead to estimates g_n with small misclassification probability. Observe, however, that for (1.6) to be a good approximation of (1.5) it is not important that m_n(x) be close to m(x). Instead it is only important that m_n(x) should be on the same side of the decision boundary as m(x), i.e., that m_n(x) > 1/2 whenever m(x) > 1/2 and m_n(x) < 1/2 whenever m(x) < 1/2. Nevertheless, one often constructs estimates by minimizing the L2 risk E{|m_n(X) − Y|^2 | D_n} and using the plug-in rule (1.6), because trying to minimize the L2 risk leads to estimates which can be computed efficiently.


This can be generalized to the case where Y takes M ≥ 2 distinct values, without loss of generality (w.l.o.g.) 1, ..., M (e.g., depending on whether a patient has a special type of disease or no disease). The goal is to find a function g*: R^d → {1, ..., M} such that

$$\mathbf{P}\{g^*(X) \neq Y\} = \min_{g:\mathbb{R}^d \to \{1,\dots,M\}} \mathbf{P}\{g(X) \neq Y\}, \qquad (1.7)$$

where g* is called the Bayes decision function. It can be computed using the a posteriori probabilities P{Y = k | X = x} (k ∈ {1, ..., M}):

$$g^*(x) = \arg\max_{1 \le k \le M} \mathbf{P}\{Y = k \mid X = x\} \qquad (1.8)$$

(cf. Problem 1.4). The a posteriori probabilities are the regression functions

$$\mathbf{P}\{Y = k \mid X = x\} = \mathbf{E}\{I_{\{Y = k\}} \mid X = x\} = m^{(k)}(x).$$

Given data D_n = {(X_1, Y_1), ..., (X_n, Y_n)}, estimates m_n^{(k)} of m^{(k)} can be constructed from the data set

$$D_n^{(k)} = \{(X_1, I_{\{Y_1 = k\}}), \dots, (X_n, I_{\{Y_n = k\}})\},$$

and one can use a plug-in estimate

$$g_n(x) = \arg\max_{1 \le k \le M} m_n^{(k)}(x) \qquad (1.9)$$

to estimate g*. If the estimates m_n^{(k)} are close to the a posteriori probabilities, then again the error of the plug-in estimate (1.9) is close to the optimal error (cf. Problem 1.5).

1.5 Parametric versus Nonparametric Estimation

The classical approach for estimating a regression function is the so-called parametric regression estimation. Here one assumes that the structure of the regression function is known and depends only on finitely many parameters, and one uses the data to estimate the (unknown) values of these parameters.

The linear regression estimate is an example of such an estimate. In linear regression one assumes that the regression function is a linear combination of the components of x = (x^{(1)}, ..., x^{(d)})^T, i.e.,

$$m(x^{(1)}, \dots, x^{(d)}) = a_0 + \sum_{i=1}^{d} a_i x^{(i)} \qquad ((x^{(1)}, \dots, x^{(d)})^T \in \mathbb{R}^d)$$

for some unknown a_0, ..., a_d ∈ R. Then one uses the data to estimate these parameters, e.g., by applying the principle of least squares, where


one chooses the coefficients a_0, ..., a_d of the linear function such that it best fits the given data:

$$(\hat{a}_0, \dots, \hat{a}_d) = \arg\min_{a_0, \dots, a_d \in \mathbb{R}} \frac{1}{n} \sum_{j=1}^{n} \left| Y_j - a_0 - \sum_{i=1}^{d} a_i X_j^{(i)} \right|^2.$$

Here X_j^{(i)} denotes the ith component of X_j and z = arg min_{x ∈ D} f(x) is the abbreviation for z ∈ D and f(z) = min_{x ∈ D} f(x). Finally one defines the estimate by

$$m_n(x) = \hat{a}_0 + \sum_{i=1}^{d} \hat{a}_i x^{(i)} \qquad ((x^{(1)}, \dots, x^{(d)})^T \in \mathbb{R}^d).$$

Parametric estimates usually depend only on a few parameters, therefore they are suitable even for small sample sizes n, if the parametric model is appropriately chosen. Furthermore, they are often easy to interpret. For instance, in a linear model (when m(x) is a linear function) the absolute value of the coefficient a_i indicates how much influence the ith component of X has on the value of Y, and the sign of a_i describes the nature of this influence (increasing or decreasing the value of Y).

However, parametric estimates have a big drawback. Regardless of the data, a parametric estimate cannot approximate the regression function better than the best function which has the assumed parametric structure. For example, a linear regression estimate will produce a large error for every sample size if the true underlying regression function is not linear and cannot be well approximated by linear functions.

For univariate X one can often use a plot of the data to choose a proper parametric estimate. But this is not always possible, as we now illustrate using simulated data. These data will be used throughout the book. They consist of n = 200 points such that X is standard normal restricted to [−1, 1], i.e., the density of X is proportional to the standard normal density on [−1, 1] and is zero elsewhere. The regression function is piecewise polynomial:

$$m(x) = \begin{cases}
(x+2)^2/2 & \text{if } -1 \le x < -0.5, \\
x/2 + 0.875 & \text{if } -0.5 \le x < 0, \\
-5(x-0.2)^2 + 1.075 & \text{if } 0 \le x < 0.5, \\
x + 0.125 & \text{if } 0.5 \le x \le 1.
\end{cases}$$

Given X, the conditional distribution of Y − m(X) is normal with mean zero and standard deviation

$$\sigma(X) = 0.2 - 0.1 \cos(2\pi X).$$
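A possible way to generate such simulated data is sketched below; the rejection-sampling step for the truncated normal design is an implementation choice of ours, not something specified in the text.

```python
import numpy as np

rng = np.random.default_rng(4)

def m(x):
    # the piecewise polynomial regression function defined above
    return np.where(x < -0.5, (x + 2) ** 2 / 2,
           np.where(x < 0.0, x / 2 + 0.875,
           np.where(x < 0.5, -5 * (x - 0.2) ** 2 + 1.075,
                    x + 0.125)))

def sample_truncated_normal(n):
    # standard normal restricted to [-1, 1], drawn by rejection sampling
    samples = []
    while len(samples) < n:
        z = rng.normal(size=n)
        samples.extend(z[np.abs(z) <= 1.0])
    return np.array(samples[:n])

n = 200
X = sample_truncated_normal(n)
sigma = 0.2 - 0.1 * np.cos(2 * np.pi * X)
Y = m(X) + rng.normal(size=n) * sigma
print(X[:3], Y[:3])
```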


Figure 1.1. Simulated data points.

Figure 1.2. Data points and regression function.

Figure 1.1 shows the data points. In this example the human eye is not able to see from the data points what the regression function looks like. In Figure 1.2 the data points are shown together with the regression function.

In Figure 1.3 a linear estimate is constructed for these simulated data. Obviously, a linear function does not approximate the regression function well.

Figure 1.3. Linear regression estimate.

Furthermore, for multivariate X, there is no easy way to visualize the data. Thus, especially for multivariate X, it is not clear how to choose a proper form of a parametric estimate, and a wrong form will lead to a bad estimate.

This inflexibility concerning the structure of the regression function is avoided by so-called nonparametric regression estimates. These methods, which do not assume that the regression function can be described by finitely many parameters, are introduced in Chapter 2 and are the main subject of this book.

1.6 Consistency

We will now define the modes of convergence of the regression estimates that we will study in this book.

The first and weakest property an estimate should have is that, as the sample size grows, it should converge to the estimated quantity, i.e., the error of the estimate should converge to zero for a sample size tending to infinity. Estimates which have this property are called consistent.

To measure the error of a regression estimate, we use the L2 error

$$\int |m_n(x) - m(x)|^2 \,\mu(dx).$$

The estimate m_n depends on the data D_n, therefore the L2 error is a random variable. We are interested in the convergence of the expectation of this random variable to zero as well as in the almost sure (a.s.) convergence of this random variable to zero.


Definition 1.1. A sequence of regression function estimates {m_n} is called weakly consistent for a certain distribution of (X, Y), if

$$\lim_{n\to\infty} \mathbf{E} \int (m_n(x) - m(x))^2 \,\mu(dx) = 0.$$

Definition 1.2. A sequence of regression function estimates {m_n} is called strongly consistent for a certain distribution of (X, Y), if

$$\lim_{n\to\infty} \int (m_n(x) - m(x))^2 \,\mu(dx) = 0 \quad \text{with probability one}.$$

It may be that a regression function estimate is consistent for a certain class of distributions of (X, Y), but not consistent for others. It is clearly desirable to have estimates that are consistent for a large class of distributions. In this monograph we are interested in properties of m_n that are valid for all distributions of (X, Y), that is, in distribution-free or universal properties. The concept of universal consistency is important in nonparametric regression because the mere use of a nonparametric estimate is normally a consequence of the partial or total lack of information about the distribution of (X, Y). Since in many situations we do not have any prior information about the distribution, it is essential to have estimates that perform well for all distributions. This very strong requirement of universal goodness is formulated as follows:

Definition 1.3. A sequence of regression function estimates {m_n} is called weakly universally consistent if it is weakly consistent for all distributions of (X, Y) with EY^2 < ∞.

Definition 1.4. A sequence of regression function estimates {m_n} is called strongly universally consistent if it is strongly consistent for all distributions of (X, Y) with EY^2 < ∞.

We will later give many examples of estimates that are weakly and strongly universally consistent.

1.7 Rate of Convergence

If an estimate is universally consistent, then, regardless of the true underlying distribution of (X, Y), the L2 error of the estimate converges to zero for a sample size tending to infinity. But this says nothing about how fast this happens. Clearly, it is desirable to have estimates for which the L2 error converges to zero as fast as possible.

To decide about the rate of convergence of an estimate m_n, we will look at the expectation of the L2 error,

$$\mathbf{E} \int |m_n(x) - m(x)|^2 \,\mu(dx). \qquad (1.10)$$


A natural question to ask is whether there exist estimates for which (1.10) converges to zero at some fixed, nontrivial rate for all distributions of (X, Y). Unfortunately, as we will see in Chapter 3, such estimates do not exist, i.e., for any estimate the rate of convergence may be arbitrarily slow. In order to get nontrivial rates of convergence, one has to restrict the class of distributions, e.g., by imposing some smoothness assumptions on the regression function.

In Chapter 3 we will define classes Fp of the distributions of (X,Y ) wherethe corresponding regression function satisfies some smoothness conditiondepending on a parameter p (e.g., m is p times continuously differentiable).We then use the classical minimax approach to define the optimal rate ofconvergence for such classes Fp. This means that we will try to minimizethe maximal value of (1.10) within the class Fp of the distributions of(X,Y ), i.e., we will look at

infmn

sup(X,Y )∈Fp

E∫

|mn(x) −m(x)|2µ(dx), (1.11)

where the infimum is taken over all estimates mn. We are interested inoptimal estimates mn, for which the maximal value of (1.10) within Fp,i.e.,

sup(X,Y )∈Fp

E∫

|mn(x) −m(x)|2µ(dx), (1.12)

is close to (1.11).To simplify our analysis, we will only look at the asymptotic behavior of

(1.11) and (1.12), i.e., we will determine the rate of convergence of (1.11)to zero for a sample size tending to infinity, and we will construct estimateswhich achieve (up to some constant factor) the same rate of convergence.

For classes Fp, wherem is p times continuously differentiable, the optimalrate of convergence will be n− 2p

2p+d .

1.8 Adaptation

Often, estimates which achieve the optimal minimax rate of convergence for a given class $\mathcal{F}^{p_0}$ of distributions (where, e.g., $m$ is $p_0$ times continuously differentiable) require the knowledge of $p_0$ and are adjusted perfectly to this class of distributions. Therefore they don't achieve the optimal rate of convergence for other classes $\mathcal{F}^p$, $p \neq p_0$.

If one could find out in an application to which classes of distributions the true underlying distribution belongs, then one could choose that class which has the best rate of convergence (which will be the smallest class in the case of nested classes), and could choose an estimate which achieves the optimal minimax rate of convergence within this class. This, however,


would require knowledge about the smoothness of the regression function. In applications such knowledge is typically not available and, unfortunately, it is not possible to use the data to decide about the smoothness of the regression function (at least, we do not know of any test which can decide how smooth the regression function is, e.g., whether $m$ is continuous or not).

Therefore, instead of looking at each class $\mathcal{F}^p$ of distributions separately, and constructing estimates which are optimal for this class only, one tries to construct estimates which achieve the optimal (or a nearly optimal) minimax rate of convergence simultaneously for many different classes of distributions. Such estimates are called adaptive and will be used throughout this book. Several possibilities for constructing adaptive estimates will be described in the next chapter.

1.9 Fixed versus Random Design Regression

The problem studied in this book is also called regression estimation with random design, which means that the $X_i$'s are random variables. We want to mention that there exists a related problem, called regression estimation with fixed design. This section is about similarities and differences between these two problems.

Regression function estimation with fixed design can be described as follows: one observes values of some function at some fixed (given) points with additive random errors, and wants to recover the true value of the function at these points. More precisely, given data $(x_1, Y_1), \dots, (x_n, Y_n)$, where $x_1, \dots, x_n$ are fixed (nonrandom) points in $\mathbf{R}^d$ and
\[
Y_i = f(x_i) + \sigma_i \cdot \epsilon_i \quad (i = 1, \dots, n) \tag{1.13}
\]
for some (unknown) function $f : \mathbf{R}^d \to \mathbf{R}$, some $\sigma_1, \dots, \sigma_n \in \mathbf{R}_+$, and some independent and identically distributed random variables $\epsilon_1, \dots, \epsilon_n$ with $\mathbf{E}\epsilon_1 = 0$ and $\mathbf{E}\epsilon_1^2 = 1$, one wants to estimate the values of $f$ at the so-called design points $x_1, \dots, x_n$. Typically, in this problem, one has $d = 1$, sometimes also $d = 2$ (image reconstruction). Often the $x_i$ are equidistant, e.g., in $[0,1]$, and one assumes that the variance $\sigma_i^2$ of the additive error (noise) $\sigma_i \cdot \epsilon_i$ is constant, i.e., $\sigma_1^2 = \dots = \sigma_n^2 = \sigma^2$.

Clearly, this problem has some similarity with the problem we study in this book. This becomes obvious when one rewrites the data in our model as
\[
Y_i = m(X_i) + \epsilon_i, \tag{1.14}
\]
where $\epsilon_i = \epsilon_i(X_i) = Y_i - m(X_i)$ satisfies $\mathbf{E}\{\epsilon_i | X_i\} = 0$.

It may seem that fixed design regression is a more general approach than random design and that one can handle random design regression estimation by imposing conditions on the design points and then applying


results for fixed design regression. We want to point out that this is not true, because the assumptions in both models are fundamentally different. First, in (1.14) the error $\epsilon_i$ depends on $X_i$, thus it is not independent of $X_i$ and its whole structure (i.e., kind of distribution) can change with $X_i$. Second, the design points $X_1, \dots, X_n$ in (1.14) are typically far from being uniformly distributed. And third, while in fixed design regression the design points are typically univariate or at most bivariate, the dimension $d$ of $X$ in random design regression is often much larger than two, which fundamentally changes the problem (cf. Section 2.2).

1.10 Bibliographic Notes

We have made a computer search for nonparametric regression, resulting in 3457 items. It is clear that we cannot cite all of them and we apologize at this point for the many good papers which we didn't cite. In the later chapters we refer only to publications on $L_2$ theory. Concerning nonparametric regression estimation including pointwise or uniform consistency properties, we refer to the following monographs (with further references therein): Bartlett and Anthony (1999), Bickel et al. (1993), Bosq (1996), Bosq and Lecoutre (1987), Breiman et al. (1984), Collomb (1980), Devroye, Gyorfi, and Lugosi (1996), Devroye and Lugosi (2001), Efromovich (1999), Eggermont and La Riccia (2001), Eubank (1999), Fan and Gijbels (1995), Hardle (1990), Hardle et al. (1998), Hart (1997), Hastie, Tibshirani and Friedman (2001), Horowitz (1998), Korostelev and Tsybakov (1993), Nadaraya (1989), Prakasa Rao (1983), Simonoff (1996), Thompson and Tapia (1990), Vapnik (1982; 1998), Vapnik and Chervonenkis (1974), Wand and Jones (1995). We refer also to the bibliography of Collomb (1985) and to the survey on fixed design regression of Gasser and Muller (1979).

For parametric methods we refer to Rao (1973), Seber (1977), Draper and Smith (1981) and Farebrother (1988) and the literature cited therein.

Lemma 1.1 and Theorem 1.1 are well-known in the literature. Concerning Theorem 1.1 see, e.g., Van Ryzin (1966), Wolverton and Wagner (1969a), Glick (1973), Csibi (1971), Gyorfi (1975; 1978), Devroye and Wagner (1976), Devroye (1982b), or Devroye and Gyorfi (1985).

The concept of (weak) universal consistency goes back to Stone (1977).

Problems and Exercises

Problem 1.1. Show that the regression function also has the following pointwise optimality property:
\[
\mathbf{E}\left\{ |m(X) - Y|^2 \,\middle|\, X = x \right\} = \min_{f} \mathbf{E}\left\{ |f(X) - Y|^2 \,\middle|\, X = x \right\}
\]
for $\mu$-almost all $x \in \mathbf{R}^d$.

Problem 1.2. Prove (1.2).

Problem 1.3. Let $(X,Y)$ be an $\mathbf{R}^d \times \mathbf{R}$-valued random variable with $\mathbf{E}|Y| < \infty$. Determine a function $f^* : \mathbf{R}^d \to \mathbf{R}$ which minimizes the $L_1$ risk, i.e., which satisfies
\[
\mathbf{E}|f^*(X) - Y| = \min_{f : \mathbf{R}^d \to \mathbf{R}} \mathbf{E}|f(X) - Y|.
\]

Problem 1.4. Prove that the decision rule (1.8) satisfies (1.7).

Problem 1.5. Show that the error probability of the plug-in decision rule (1.9) satisfies
\[
0 \le \mathbf{P}\{g_n(X) \neq Y | D_n\} - \mathbf{P}\{g^*(X) \neq Y\}
\le \sum_{k=1}^{M} \int_{\mathbf{R}^d} |m_n^{(k)}(x) - m^{(k)}(x)|\, \mu(dx)
\le \sum_{k=1}^{M} \left( \int_{\mathbf{R}^d} |m_n^{(k)}(x) - m^{(k)}(x)|^2\, \mu(dx) \right)^{\frac12}.
\]


2 How to Construct Nonparametric Regression Estimates?

In this chapter we give an overview of various ways to define nonparametric regression estimates. In Section 2.1 we introduce four related paradigms for nonparametric regression. For multivariate $X$ there are some special modifications of the resulting estimates due to the so-called "curse of dimensionality." These will be described in Section 2.2. The estimates depend on smoothing parameters. The choice of these parameters is important because of the so-called bias-variance tradeoff, which will be described in Section 2.3. Finally, in Section 2.4, several methods for choosing the smoothing parameters are introduced.

2.1 Four Related Paradigms

In this section we describe four paradigms of nonparametric regression: local averaging, local modeling, global modeling (or least squares estimation), and penalized modeling.

Recall that the data can be written as
\[
Y_i = m(X_i) + \epsilon_i,
\]
where $\epsilon_i = Y_i - m(X_i)$ satisfies $\mathbf{E}(\epsilon_i | X_i) = 0$. Thus $Y_i$ can be considered as the sum of the value of the regression function at $X_i$ and some error $\epsilon_i$, where the expected value of the error is zero. This motivates the construction of the estimates by local averaging, i.e., estimation of $m(x)$ by the average of those $Y_i$ where $X_i$ is "close" to $x$. Such an estimate can be


written as
\[
m_n(x) = \sum_{i=1}^{n} W_{n,i}(x) \cdot Y_i,
\]
where the weights $W_{n,i}(x) = W_{n,i}(x, X_1, \dots, X_n) \in \mathbf{R}$ depend on $X_1, \dots, X_n$. Usually the weights are nonnegative and $W_{n,i}(x)$ is "small" if $X_i$ is "far" from $x$.

An example of such an estimate is the partitioning estimate. Here one chooses a finite or countably infinite partition $\mathcal{P}_n = \{A_{n,1}, A_{n,2}, \dots\}$ of $\mathbf{R}^d$ consisting of Borel sets $A_{n,j} \subseteq \mathbf{R}^d$ and defines, for $x \in A_{n,j}$, the estimate by averaging $Y_i$'s with the corresponding $X_i$'s in $A_{n,j}$, i.e.,
\[
m_n(x) = \frac{\sum_{i=1}^{n} I_{\{X_i \in A_{n,j}\}} Y_i}{\sum_{i=1}^{n} I_{\{X_i \in A_{n,j}\}}} \quad \text{for } x \in A_{n,j}, \tag{2.1}
\]
where $I_A$ denotes the indicator function of set $A$, so
\[
W_{n,i}(x) = \frac{I_{\{X_i \in A_{n,j}\}}}{\sum_{l=1}^{n} I_{\{X_l \in A_{n,j}\}}} \quad \text{for } x \in A_{n,j}.
\]
Here and in the following we use the convention $\frac{0}{0} = 0$.
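To make (2.1) concrete, the following sketch (not part of the original text; the cubic partition, the cell width $h$, and the simulated data are illustrative assumptions) evaluates the partitioning estimate at a point, using the convention $0/0 = 0$.

```python
import numpy as np

def partitioning_estimate(x, X, Y, h):
    """Partitioning estimate (2.1) on a cubic partition with cells of side h.

    x : query point, shape (d,);  X : design points, shape (n, d);  Y : shape (n,)
    """
    # A point lies in the same cell as x iff each coordinate falls into
    # the same interval [k*h, (k+1)*h).
    same_cell = np.all(np.floor(X / h) == np.floor(x / h), axis=1)
    denom = same_cell.sum()
    return Y[same_cell].sum() / denom if denom > 0 else 0.0  # 0/0 := 0

# illustrative usage with simulated data
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 1))
Y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)
print(partitioning_estimate(np.array([0.3]), X, Y, h=0.1))
```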

The second example of a local averaging estimate is the Nadaraya–Watson kernel estimate. Let $K : \mathbf{R}^d \to \mathbf{R}_+$ be a function called the kernel function, and let $h > 0$ be a bandwidth. The kernel estimate is defined by
\[
m_n(x) = \frac{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) Y_i}{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)}, \tag{2.2}
\]
so
\[
W_{n,i}(x) = \frac{K\left(\frac{x - X_i}{h}\right)}{\sum_{j=1}^{n} K\left(\frac{x - X_j}{h}\right)}.
\]
If one uses the so-called naive kernel (or window kernel) $K(x) = I_{\{\|x\| \le 1\}}$, then
\[
m_n(x) = \frac{\sum_{i=1}^{n} I_{\{\|x - X_i\| \le h\}} Y_i}{\sum_{i=1}^{n} I_{\{\|x - X_i\| \le h\}}},
\]
i.e., one estimates $m(x)$ by averaging $Y_i$'s such that the distance between $X_i$ and $x$ is not greater than $h$.

For more general $K : \mathbf{R}^d \to \mathbf{R}_+$ one uses a weighted average of the $Y_i$, where the weight of $Y_i$ (i.e., the influence of $Y_i$ on the value of the estimate at $x$) depends on the distance between $X_i$ and $x$.
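As an illustration (added here, not from the original text; the Gaussian kernel and the bandwidth are assumptions made for the example), the kernel estimate (2.2) can be computed as follows.

```python
import numpy as np

def kernel_estimate(x, X, Y, h, kernel=None):
    """Nadaraya-Watson kernel estimate (2.2) at a single point x."""
    if kernel is None:
        # Gaussian kernel K(u) = exp(-||u||^2); any nonnegative kernel works
        kernel = lambda u: np.exp(-np.sum(u ** 2, axis=1))
    w = kernel((x - X) / h)              # unnormalized weights K((x - X_i)/h)
    denom = w.sum()
    return np.dot(w, Y) / denom if denom > 0 else 0.0   # convention 0/0 := 0

# illustrative usage
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 1))
Y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)
print(kernel_estimate(np.array([0.3]), X, Y, h=0.05))
```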

Our final example of local averaging estimates is the k-nearest neighbor (k-NN) estimate. Here one determines the $k$ nearest $X_i$'s to $x$ in terms of distance $\|x - X_i\|$ and estimates $m(x)$ by the average of the corresponding $Y_i$'s. More precisely, for $x \in \mathbf{R}^d$, let
\[
(X_{(1)}(x), Y_{(1)}(x)), \dots, (X_{(n)}(x), Y_{(n)}(x))
\]


[Figure 2.1. Examples of kernels: window (naive) kernel $K(x) = I_{\{\|x\| \le 1\}}$ and Gaussian kernel $K(x) = e^{-x^2}$.]

[Figure 2.2. Nearest neighbors to $x$: the points $X_{(1)}(x), \dots, X_{(6)}(x)$ ordered by their distance from $x$.]

be a permutation of
\[
(X_1, Y_1), \dots, (X_n, Y_n)
\]
such that
\[
\|x - X_{(1)}(x)\| \le \dots \le \|x - X_{(n)}(x)\|.
\]
The k-NN estimate is defined by
\[
m_n(x) = \frac{1}{k} \sum_{i=1}^{k} Y_{(i)}(x). \tag{2.3}
\]
Here the weight $W_{n,i}(x)$ equals $1/k$ if $X_i$ is among the $k$ nearest neighbors of $x$, and equals 0 otherwise.
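A short sketch of the k-NN estimate (2.3); the Euclidean norm, the tie-breaking by sort order, and the choice of $k$ are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def knn_estimate(x, X, Y, k):
    """k-nearest neighbor estimate (2.3): average the Y's of the k
    design points closest to x."""
    dist = np.linalg.norm(X - x, axis=1)   # ||x - X_i||
    nearest = np.argsort(dist)[:k]         # indices of the k nearest X_i
    return Y[nearest].mean()

# illustrative usage
rng = np.random.default_rng(2)
X = rng.uniform(size=(200, 2))
Y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.standard_normal(200)
print(knn_estimate(np.array([0.5, 0.5]), X, Y, k=10))
```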

The kernel estimate (2.2) can be considered as locally fitting a constant to the data. In fact, it is easy to see (cf. Problem 2.2) that it satisfies
\[
m_n(x) = \arg\min_{c \in \mathbf{R}} \frac{1}{n} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) (Y_i - c)^2. \tag{2.4}
\]
A generalization of this leads to the local modeling paradigm: instead of locally fitting a constant to the data, locally fit a more general function, which depends on several parameters. Let $g(\cdot, \{a_k\}_{k=1}^{l}) : \mathbf{R}^d \to \mathbf{R}$ be a


function depending on parameters $\{a_k\}_{k=1}^{l}$. For each $x \in \mathbf{R}^d$, choose values of these parameters by a local least squares criterion
\[
\{a_k(x)\}_{k=1}^{l} = \arg\min_{\{a_k\}_{k=1}^{l}} \frac{1}{n} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) \left(Y_i - g\left(X_i, \{a_k\}_{k=1}^{l}\right)\right)^2. \tag{2.5}
\]
Here we do not require that the minimum in (2.5) be unique. In case there are several points at which the minimum is attained we use an arbitrary rule (e.g., by flipping a coin) to choose one of these points. Evaluate the function $g$ for these parameters at the point $x$ and use this as an estimate of $m(x)$:
\[
m_n(x) = g\left(x, \{a_k(x)\}_{k=1}^{l}\right). \tag{2.6}
\]
If one chooses $g(x, c) = c$ $(x \in \mathbf{R}^d)$, then this leads to the Nadaraya–Watson kernel estimate.

The most popular example of a local modeling estimate is the local polynomial kernel estimate. Here one locally fits a polynomial to the data. For example, for $d = 1$, $X$ is real-valued and
\[
g\left(x, \{a_k\}_{k=1}^{l}\right) = \sum_{k=1}^{l} a_k x^{k-1}
\]
is a polynomial of degree $l - 1$ (or less) in $x$.
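A sketch of the local polynomial kernel estimate for $d = 1$ (the naive kernel, the degree, and the least-squares solver are illustrative choices, not fixed by the text): it minimizes the locally weighted criterion (2.5) with the polynomial model above and returns (2.6).

```python
import numpy as np

def local_polynomial_estimate(x, X, Y, h, degree=1):
    """Local polynomial kernel estimate for d = 1: solve the locally
    weighted least squares problem (2.5) and evaluate the fit at x (2.6)."""
    w = (np.abs((X - x) / h) <= 1).astype(float)    # naive kernel weights
    if w.sum() == 0:
        return 0.0
    B = np.vander(X, degree + 1, increasing=True)   # columns 1, x, x^2, ...
    sw = np.sqrt(w)                                  # sqrt-weights for least squares
    coef, *_ = np.linalg.lstsq(B * sw[:, None], Y * sw, rcond=None)
    return np.polyval(coef[::-1], x)                 # evaluate fitted polynomial at x

rng = np.random.default_rng(3)
X = rng.uniform(size=200)
Y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(200)
print(local_polynomial_estimate(0.3, X, Y, h=0.1, degree=1))
```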

A generalization of the partitioning estimate leads to global modeling or least squares estimates. Let $\mathcal{P}_n = \{A_{n,1}, A_{n,2}, \dots\}$ be a partition of $\mathbf{R}^d$ and let $\mathcal{F}_n$ be the set of all piecewise constant functions with respect to that partition, i.e.,
\[
\mathcal{F}_n = \left\{ \sum_{j} a_j I_{A_{n,j}} : a_j \in \mathbf{R} \right\}. \tag{2.7}
\]
Then it is easy to see (cf. Problem 2.3) that the partitioning estimate (2.1) satisfies
\[
m_n(\cdot) = \arg\min_{f \in \mathcal{F}_n} \frac{1}{n} \sum_{i=1}^{n} |f(X_i) - Y_i|^2. \tag{2.8}
\]
Hence it minimizes the empirical $L_2$ risk
\[
\frac{1}{n} \sum_{i=1}^{n} |f(X_i) - Y_i|^2 \tag{2.9}
\]
over $\mathcal{F}_n$. Least squares estimates are defined by minimizing the empirical $L_2$ risk over a general set of functions $\mathcal{F}_n$ (instead of (2.7)). Observe that it doesn't make sense to minimize (2.9) over all (measurable) functions $f$, because this may lead to a function which interpolates the data and hence is not a reasonable estimate. Thus one has to restrict the set of functions over


Figure 2.3. The estimate on the right seems to be more reasonable than the estimate on the left, which interpolates the data.

which one minimizes the empirical $L_2$ risk. Examples of possible choices of the set $\mathcal{F}_n$ are sets of piecewise polynomials with respect to a partition $\mathcal{P}_n$, or sets of smooth piecewise polynomials (splines). The use of spline spaces ensures that the estimate is a smooth function.

Instead of restricting the set of functions over which one minimizes, one can also add a penalty term to the functional to be minimized. Let $J_n(f) \ge 0$ be a penalty term penalizing the "roughness" of a function $f$. The penalized modeling or penalized least squares estimate $m_n$ is defined by
\[
m_n = \arg\min_{f} \left\{ \frac{1}{n} \sum_{i=1}^{n} |f(X_i) - Y_i|^2 + J_n(f) \right\}, \tag{2.10}
\]
where one minimizes over all measurable functions $f$. Again we do not require that the minimum in (2.10) be unique. In the case it is not unique, we randomly select one function which achieves the minimum.

A popular choice for $J_n(f)$ in the case $d = 1$ is
\[
J_n(f) = \lambda_n \int |f''(t)|^2\, dt, \tag{2.11}
\]
where $f''$ denotes the second derivative of $f$ and $\lambda_n$ is some positive constant. We will show in Chapter 20 that for this penalty term the minimum in (2.10) is achieved by a cubic spline with knots at the $X_i$'s, i.e., by a twice differentiable function which is equal to a polynomial of degree 3 (or less) between adjacent values of the $X_i$'s (a so-called smoothing spline). A generalization of (2.11) is
\[
J_{n,k}(f) = \lambda_n \int |f^{(k)}(t)|^2\, dt,
\]
where $f^{(k)}$ denotes the $k$-th derivative of $f$. For multivariate $X$ one can use
\[
J_{n,k}(f) = \lambda_n \int \sum_{i_1, \dots, i_k \in \{1, \dots, d\}} \left| \frac{\partial^k f}{\partial x_{i_1} \dots \partial x_{i_k}} \right|^2 dx,
\]
which leads to the so-called thin plate spline estimates.


2.2 Curse of Dimensionality

If $X$ takes values in a high-dimensional space (i.e., if $d$ is large), estimating the regression function is especially difficult. The reason for this is that in the case of large $d$ it is, in general, not possible to densely pack the space of $X$ with finitely many sample points, even if the sample size $n$ is very large. This fact is often referred to as the "curse of dimensionality." In the sequel we will illustrate this with an example.

Let $X, X_1, \dots, X_n$ be independent and identically distributed $\mathbf{R}^d$-valued random variables with $X$ uniformly distributed in the hypercube $[0,1]^d$. Denote the expected supremum-norm distance of $X$ to its nearest neighbor in $\{X_1, \dots, X_n\}$ by $d_\infty(d, n)$, i.e., set
\[
d_\infty(d, n) = \mathbf{E}\left\{ \min_{i=1,\dots,n} \|X - X_i\|_\infty \right\}.
\]
Here $\|x\|_\infty$ is the supremum norm of a vector $x = (x^{(1)}, \dots, x^{(d)})^T \in \mathbf{R}^d$ defined by
\[
\|x\|_\infty = \max_{l=1,\dots,d} \left| x^{(l)} \right|.
\]
Then
\[
d_\infty(d, n) = \int_0^\infty \mathbf{P}\left\{ \min_{i=1,\dots,n} \|X - X_i\|_\infty > t \right\} dt
= \int_0^\infty \left( 1 - \mathbf{P}\left\{ \min_{i=1,\dots,n} \|X - X_i\|_\infty \le t \right\} \right) dt.
\]

The bound
\[
\mathbf{P}\left\{ \min_{i=1,\dots,n} \|X - X_i\|_\infty \le t \right\} \le n \cdot \mathbf{P}\{\|X - X_1\|_\infty \le t\} \le n \cdot (2t)^d
\]
implies
\[
d_\infty(d, n) \ge \int_0^{1/(2 n^{1/d})} \left( 1 - n \cdot (2t)^d \right) dt
= \left[ t - \frac{n \cdot 2^d \cdot t^{d+1}}{d+1} \right]_{t=0}^{1/(2 n^{1/d})}
= \frac{1}{2 \cdot n^{1/d}} - \frac{1}{d+1} \cdot \frac{1}{2 \cdot n^{1/d}}
= \frac{d}{2(d+1)} \cdot \frac{1}{n^{1/d}}.
\]
Table 2.1 shows values of this lower bound for various values of $d$ and $n$. As one can see, for dimension $d = 10$ or $d = 20$ this lower bound is not


Table 2.1. Lower bounds for $d_\infty(d, n)$.

                     n = 100     n = 1000    n = 10,000   n = 100,000
  d_inf(1, n)   >=   0.0025      0.00025     0.000025     0.0000025
  d_inf(10, n)  >=   0.28        0.22        0.18         0.14
  d_inf(20, n)  >=   0.37        0.34        0.30         0.26

close to zero even if the sample size $n$ is very large. So for most values of $x$ one only has data points $(X_i, Y_i)$ available where $X_i$ is not close to $x$. But at such data points $m(X_i)$ will, in general, not be close to $m(x)$ even for a smooth regression function.
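The lower bound $\frac{d}{2(d+1)} n^{-1/d}$ is easy to evaluate numerically; the following snippet (an added illustration, not part of the original text) reproduces the entries of Table 2.1.

```python
def nn_distance_lower_bound(d, n):
    """Lower bound d/(2(d+1)) * n**(-1/d) for the expected sup-norm
    nearest-neighbor distance d_inf(d, n) on the unit cube."""
    return d / (2 * (d + 1)) * n ** (-1.0 / d)

for d in (1, 10, 20):
    row = [nn_distance_lower_bound(d, n) for n in (10**2, 10**3, 10**4, 10**5)]
    print(d, ["%.7f" % v for v in row])
```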

The only way to overcome the curse of dimensionality is to incorporate additional assumptions about the regression function besides the sample. This is implicitly done by nearly all multivariate estimation procedures, including projection pursuit, neural networks, radial basis function networks, trees, etc.

As we will see in Problem 2.4, a similar problem also occurs if one replaces the supremum norm by the Euclidean norm. Of course, the arguments above are no longer valid if the components of $X$ are not independent (e.g., if all components of $X$ are equal and hence all values of $X$ lie on a line in $\mathbf{R}^d$). But in this case they are (roughly speaking) still valid with $d$ replaced by the number of independent components of $X$ (or, more generally, the "intrinsic" dimension of $X$), which for large $d$ may still be a large number.

2.3 Bias–Variance Tradeoff

Let $m_n$ be an arbitrary estimate. For any $x \in \mathbf{R}^d$ we can write the expected squared error of $m_n$ at $x$ as
\[
\mathbf{E}|m_n(x) - m(x)|^2 = \mathbf{E}|m_n(x) - \mathbf{E} m_n(x)|^2 + |\mathbf{E} m_n(x) - m(x)|^2
= \mathrm{Var}(m_n(x)) + |\mathrm{bias}(m_n(x))|^2.
\]
Here $\mathrm{Var}(m_n(x))$ is the variance of the random variable $m_n(x)$ and $\mathrm{bias}(m_n(x))$ is the difference between the expectation of $m_n(x)$ and $m(x)$. This also leads to a similar decomposition of the expected $L_2$ error:
\[
\mathbf{E}\int |m_n(x) - m(x)|^2 \mu(dx) = \int \mathbf{E}|m_n(x) - m(x)|^2 \mu(dx)
= \int \mathrm{Var}(m_n(x))\, \mu(dx) + \int |\mathrm{bias}(m_n(x))|^2\, \mu(dx).
\]


[Figure 2.4. Bias–variance tradeoff: as a function of the bandwidth $h$, the variance term $\frac{1}{nh}$ decreases, the squared bias term $h^2$ increases, and their sum $\frac{1}{nh} + h^2$ is minimized at an intermediate value of $h$.]

The importance of these decompositions is that the integrated variance and the integrated squared bias depend in opposite ways on the wiggliness of an estimate. If one increases the wiggliness of an estimate, then usually the integrated bias will decrease, but the integrated variance will increase (so-called bias–variance tradeoff).

In Figure 2.4 this is illustrated for the kernel estimate, where one has, under some regularity conditions on the underlying distribution and for the naive kernel,
\[
\int_{\mathbf{R}^d} \mathrm{Var}(m_n(x))\, \mu(dx) = c_1 \frac{1}{n h^d} + o\left( \frac{1}{n h^d} \right)
\]
and
\[
\int_{\mathbf{R}^d} |\mathrm{bias}(m_n(x))|^2\, \mu(dx) = c_2 h^2 + o(h^2).
\]
Here $h$ denotes the bandwidth of the kernel estimate which controls the wiggliness of the estimate, $c_1$ is some constant depending on the conditional variance $\mathrm{Var}\{Y | X = x\}$, the regression function is assumed to be Lipschitz continuous, and $c_2$ is some constant depending on the Lipschitz constant.

The value $h^*$ of the bandwidth for which the sum of the integrated variance and the squared bias is minimal depends on $c_1$ and $c_2$. Since the underlying distribution, and hence also $c_1$ and $c_2$, are unknown in an application, it is important to have methods which choose the bandwidth automatically using only the data $D_n$. Such methods will be described in the next section.
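To make the tradeoff explicit (a short added derivation, not spelled out in the original text), one can minimize the two leading terms above in $h$:
\[
\frac{d}{dh}\left( \frac{c_1}{n h^d} + c_2 h^2 \right) = 0
\;\Longrightarrow\;
h^* = \left( \frac{d\, c_1}{2\, c_2\, n} \right)^{\frac{1}{d+2}},
\qquad
\frac{c_1}{n (h^*)^d} + c_2 (h^*)^2 = \mathrm{const}(c_1, c_2, d) \cdot n^{-\frac{2}{d+2}},
\]
which is the rate $n^{-\frac{2p}{2p+d}}$ with $p = 1$, matching the Lipschitz smoothness assumed above.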


2.4 Choice of Smoothing Parameters and Adaptation

Recall that we want to construct estimates $m_{n,p}$ such that the $L_2$ risk
\[
\mathbf{E}\{|m_{n,p}(X) - Y|^2 \,|\, D_n\} \tag{2.12}
\]
is small. Hence the smoothing parameter $p$ of an estimate (e.g., bandwidth $p = h$ of a kernel estimate or number $p = K$ of cells of a partitioning estimate) should be chosen to make (2.12) small.

For a fixed function $f : \mathbf{R}^d \to \mathbf{R}$, the $L_2$ risk $\mathbf{E}|f(X) - Y|^2$ can be estimated by the empirical $L_2$ risk (error on the sample)
\[
\frac{1}{n} \sum_{i=1}^{n} |f(X_i) - Y_i|^2. \tag{2.13}
\]
The resubstitution method also uses this estimate for $m_{n,p}$, i.e., it chooses the smoothing parameter $p$ that minimizes
\[
\frac{1}{n} \sum_{i=1}^{n} |m_{n,p}(X_i) - Y_i|^2. \tag{2.14}
\]
Usually this leads to overly optimistic estimates of the $L_2$ risk and is hence not useful. The reason for this behavior is that (2.14) favors estimates which are too well-adapted to the data and are not reasonable for new observations $(X,Y)$.

This problem doesn’t occur if one uses a new sample

(X1, Y1), . . . , (Xn, Yn)

to estimate (2.12), where

(X1, Y1), . . . , (Xn, Yn), (X1, Y1), . . . , (Xn, Yn),

are i.i.d., i.e., if one minimizes

1n

n∑

i=1

|mn,p(Xi) − Yi|2. (2.15)

Of course, in the regression function estimation problem one doesn’t havean additional sample. But this isn’t a real problem, because we can simplysplit the sample Dn into two parts: a learning sample

Dn1 = (X1, Y1), . . . , (Xn1 , Yn1)which we use to construct estimates

mn1,p(·) = mn1,p(·, Dn1)

depending on some parameter p, and a test sample

(Xn1+1, Yn1+1), . . . , (Xn, Yn)


which we use to choose the parameter $p$ of the estimate by minimizing
\[
\frac{1}{n - n_1} \sum_{i=n_1+1}^{n} |m_{n_1,p}(X_i) - Y_i|^2. \tag{2.16}
\]
In applications one often uses $n_1 = \frac{2}{3} n$ or $n_1 = \frac{n}{2}$.

If $n$ is large, especially if $n$ is so large that it is computationally difficult to construct an estimate $m_n$ using all the data, then this is a very reasonable method (cf. Chapter 7). But it has the drawback that it chooses one estimate from the family $\{m_{n_1,p} : p\}$ of estimates which depend only on $n_1 < n$ of the sample points. To avoid this problem one can take a parameter $p^*$ for which (2.16) is minimal and use it for an estimate $m_{n,p^*}$ which uses the whole sample $D_n$. But then one introduces some instability into the estimate: if one splits the sample $D_n$, in a different way, into a learning sample and a test sample, then one might get another parameter $p$ for which the error on the test sample is minimal and, hence, one might end up with another estimate $m_{n,p}$, which doesn't seem to be reasonable. This can be avoided if one repeats this procedure for all possible splits of the sample and averages (2.16) for all these splits. In general, this is computationally intractable, therefore one averages (2.16) only over some of all the possible splits.
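A minimal sketch of the splitting device just described (the kernel estimate, the bandwidth grid, and the split $n_1 = n/2$ are assumptions made for the illustration): fit on the learning sample, score each candidate parameter with (2.16) on the test sample, and keep the minimizer.

```python
import numpy as np

def choose_by_splitting(X, Y, candidates, estimate, n1):
    """Choose the smoothing parameter minimizing the test-sample risk (2.16).

    estimate(x, X, Y, p) evaluates the estimate with parameter p at point x.
    """
    Xl, Yl = X[:n1], Y[:n1]        # learning sample D_{n1}
    Xt, Yt = X[n1:], Y[n1:]        # test sample
    def test_risk(p):
        pred = np.array([estimate(x, Xl, Yl, p) for x in Xt])
        return np.mean((pred - Yt) ** 2)
    return min(candidates, key=test_risk)

# illustrative usage with the kernel estimate sketched in Section 2.1
# (kernel_estimate); the bandwidth grid is an arbitrary choice
# rng = np.random.default_rng(4)
# X = rng.uniform(size=(400, 1))
# Y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(400)
# h_star = choose_by_splitting(X, Y, [0.01, 0.02, 0.05, 0.1, 0.2],
#                              kernel_estimate, n1=200)
```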

For k-fold cross-validation these splits are chosen in a special deterministic way. Let $1 \le k \le n$. For simplicity we assume that $\frac{n}{k}$ is an integer. Divide the data into $k$ groups of equal size $\frac{n}{k}$ and denote the set of data consisting of all groups, except the $l$th one, by $D_{n,l}$:
\[
D_{n,l} = \left\{ (X_1, Y_1), \dots, (X_{\frac{n}{k}(l-1)}, Y_{\frac{n}{k}(l-1)}), (X_{\frac{n}{k} l + 1}, Y_{\frac{n}{k} l + 1}), \dots, (X_n, Y_n) \right\}.
\]
For each data set $D_{n,l}$ construct estimates $m_{n - \frac{n}{k}, p}(\cdot, D_{n,l})$. Then choose the parameter $p$ such that
\[
\frac{1}{k} \sum_{l=1}^{k} \frac{1}{n/k} \sum_{i=\frac{n}{k}(l-1)+1}^{\frac{n}{k} l} \left| m_{n - \frac{n}{k}, p}(X_i, D_{n,l}) - Y_i \right|^2 \tag{2.17}
\]
is minimal, and use this parameter $p^*$ for an estimate $m_{n,p^*}$ constructed from the whole sample $D_n$.

$n$-fold cross-validation, where $D_{n,l}$ is the whole sample except $(X_l, Y_l)$ and where one minimizes
\[
\frac{1}{n} \sum_{l=1}^{n} |m_{n-1,p}(X_l, D_{n,l}) - Y_l|^2,
\]
is often referred to as cross-validation. Here $m_{n-1,p}(\cdot, D_{n,l})$ is the estimate computed with parameter value $p$ and based upon the whole sample except $(X_l, Y_l)$ (so it is based upon $n-1$ of the $n$ data points) and $m_{n-1,p}(X_l, D_{n,l})$ is the value of this estimate at the point $X_l$, i.e., at that $x$-value of the sample which is not used in the construction of the estimate.
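A sketch of $n$-fold (leave-one-out) cross-validation for a bandwidth, in the same hedged spirit as the earlier examples (the kernel estimate and the bandwidth grid are illustrative assumptions).

```python
import numpy as np

def loo_cv_bandwidth(X, Y, bandwidths, estimate):
    """n-fold (leave-one-out) cross-validation: for each candidate bandwidth,
    average the squared error of the estimate fitted without (X_l, Y_l)
    and evaluated at X_l; return the minimizing bandwidth."""
    n = len(Y)
    def cv_score(h):
        err = 0.0
        for l in range(n):
            mask = np.arange(n) != l                 # D_{n,l}: leave out (X_l, Y_l)
            pred = estimate(X[l], X[mask], Y[mask], h)
            err += (pred - Y[l]) ** 2
        return err / n
    return min(bandwidths, key=cv_score)

# e.g. h_star = loo_cv_bandwidth(X, Y, [0.02, 0.05, 0.1, 0.2], kernel_estimate)
```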


There is an important difference between the estimates (2.16) and (2.17) of the $L_2$ risk (2.12). Formula (2.16) estimates the $L_2$ risk of the estimate $m_{n_1,p}(\cdot, D_{n_1})$, i.e., the $L_2$ risk of an estimate constructed with the data $D_{n_1}$, while (2.17) estimates the average $L_2$ risk of an estimate constructed with $n - \frac{n}{k}$ of the data points in $D_n$.

For least squares and penalized least squares estimates we will also study another method called complexity regularization (cf. Chapter 12) for choosing the smoothing parameter. The idea there is to derive an upper bound on the $L_2$ error of the estimate and to choose the parameter such that this upper bound is minimal. The upper bound will be of the form
\[
\frac{1}{n} \sum_{i=1}^{n} \left( |m_{n,p}(X_i) - Y_i|^2 - |m(X_i) - Y_i|^2 \right) + \mathrm{pen}_n(m_{n,p}), \tag{2.18}
\]
where $\mathrm{pen}_n(m_{n,p})$ is a penalty term penalizing the complexity of the estimate. Observe that minimization of (2.18) with respect to $p$ is equivalent to minimization of
\[
\frac{1}{n} \sum_{i=1}^{n} |m_{n,p}(X_i) - Y_i|^2 + \mathrm{pen}_n(m_{n,p}),
\]
and that the latter term depends only on $m_{n,p}$ and the data. If $m_{n,p}$ is defined by minimizing the empirical $L_2$ risk over some linear vector space $\mathcal{F}_{n,p}$ of functions with dimension $K_p$, then the penalty will be of the form
\[
\mathrm{pen}_n(m_{n,p}) = c \cdot \frac{K_p}{n} \quad \text{or} \quad \mathrm{pen}_n(m_{n,p}) = c \cdot \log(n) \cdot \frac{K_p}{n}.
\]
In contrast to the definition of penalized least squares estimates (cf. (2.10) and (2.11)) the penalty here depends not directly on $m_{n,p}$ but on the class of functions $\mathcal{F}_n$ to which $m_{n,p}$ belongs.

2.5 Bibliographic Notes

The description in Section 2.1 concerning the four paradigms in nonparametric regression is based on Friedman (1991). The partitioning estimate was introduced under the name regressogram by Tukey (1947; 1961). The kernel estimate is due to Nadaraya (1964; 1970) and Watson (1964). Nearest neighbor estimates were introduced in pattern recognition by Fix and Hodges (1951) and also used in density estimation and regression estimation by Loftsgaarden and Quesenberry (1965), Royall (1966), Cover and Hart (1967), Cover (1968a), and Stone (1977), respectively.

The principle of least squares, which is behind the global modeling estimates, is much older. It was independently proposed by A. M. Legendre in 1805 and by C. F. Gauss with a publication in 1809 (but of course applied in a parametric setting). Further developments are due to P. S. Laplace


1816, P. L. Chebyshev 1859, F. R. Helmert 1872, J. P. Gram 1879, and T. N. Thiele 1903 (with the aspect of a suitable termination of an orthogonal series expansion). F. Galton's 1889 work has been continued by F. Y. Edgeworth, K. Pearson, G. U. Yule, and R. A. Fisher in the last decade of the nineteenth century and in the first decade of the twentieth century. For historical details we refer to Hald (1998), Farebrother (1999), and Stigler (1999).

The principle of penalized modeling, in particular, smoothing splines, goes back to Whittaker (1923), Schoenberg (1964), and Reinsch (1967); see Wahba (1990) for additional references.

The phrase "curse of dimensionality" is due to Bellman (1961).

The concept of cross-validation in statistics was introduced by Lunts and Brailovsky (1967), Allen (1974) and M. Stone (1974). Complexity regularization was introduced under the name "structural risk minimization" by Vapnik, see Vapnik (1982; 1998) and the references therein. There are many other ways of using the data for choosing a smoothing parameter, see, e.g., Chapter 4 in Fan and Gijbels (1995) and the references therein.

Problems and Exercises

Problem 2.1. Let $z_1, \dots, z_n \in \mathbf{R}$ and set $\bar z = \frac{1}{n} \sum_{i=1}^{n} z_i$.
(a) Show, for any $c \in \mathbf{R}$,
\[
\frac{1}{n} \sum_{i=1}^{n} |c - z_i|^2 = |c - \bar z|^2 + \frac{1}{n} \sum_{i=1}^{n} |\bar z - z_i|^2.
\]
(b) Conclude from (a):
\[
\frac{1}{n} \sum_{i=1}^{n} |\bar z - z_i|^2 = \min_{c \in \mathbf{R}} \frac{1}{n} \sum_{i=1}^{n} |c - z_i|^2.
\]

Problem 2.2. Prove that the Nadaraya–Watson kernel estimate defined by (2.2) satisfies (2.4).

Hint: Let $m_n(x)$ be the Nadaraya–Watson kernel estimate. Show that, for any $c \in \mathbf{R}$,
\[
\sum_{i=1}^{n} K\left( \frac{x - X_i}{h} \right) (Y_i - c)^2
= \sum_{i=1}^{n} K\left( \frac{x - X_i}{h} \right) (Y_i - m_n(x))^2
+ \sum_{i=1}^{n} K\left( \frac{x - X_i}{h} \right) (m_n(x) - c)^2.
\]

Problem 2.3. Prove that the partitioning estimate defined by (2.1) satisfies (2.8).


Hint: Let $\mathcal{F}$ be defined by (2.7) and let $m_n$ be the partitioning estimate defined by (2.1). Show by the aid of Problem 2.1 that, for any $f \in \mathcal{F}$,
\[
\sum_{i=1}^{n} |f(X_i) - Y_i|^2 = \sum_{i=1}^{n} |f(X_i) - m_n(X_i)|^2 + \sum_{i=1}^{n} |m_n(X_i) - Y_i|^2.
\]

Problem 2.4. Let $X, X_1, \dots, X_n$ be independent and uniformly distributed on $[0,1]^d$. Prove
\[
\mathbf{E}\left\{ \min_{i=1,\dots,n} \|X - X_i\| \right\} \ge \frac{d}{d+1} \left( \frac{\Gamma\left( \frac{d}{2} + 1 \right)^{1/d}}{\sqrt{\pi}} \right) \cdot \frac{1}{n^{1/d}}.
\]

Hint: The volume of a ball in $\mathbf{R}^d$ with radius $t$ is given by
\[
\frac{\pi^{d/2}}{\Gamma\left( \frac{d}{2} + 1 \right)} \cdot t^d,
\]
where $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt$ $(x > 0)$ satisfies $\Gamma(x+1) = x \cdot \Gamma(x)$, $\Gamma(1) = 1$, and $\Gamma(1/2) = \sqrt{\pi}$. Show that this implies
\[
\mathbf{P}\left\{ \min_{i=1,\dots,n} \|X - X_i\| \le t \right\} \le n \cdot \frac{\pi^{d/2}}{\Gamma\left( \frac{d}{2} + 1 \right)} \cdot t^d.
\]


3 Lower Bounds

3.1 Slow Rate

Recall that the nonparametric regression problem is formulated as follows: Given the observation $X$ and the training data
\[
D_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}
\]
of independent and identically distributed random variables, estimate the random variable $Y$ by a regression function estimate
\[
m_n(X) = m_n(X, D_n).
\]
The error criterion is the $L_2$ error
\[
\|m_n - m\|^2 = \int (m_n(x) - m(x))^2 \mu(dx).
\]
Obviously, the average $L_2$ error $\mathbf{E}\|m_n - m\|^2$ is completely determined by the distribution of the pair $(X,Y)$ and the regression function estimator $m_n$. We shall see in Chapters 4, 5, 6, etc., that there exist universally consistent regression estimates. The next question is whether there are regression function estimates with $\mathbf{E}\|m_n - m\|^2$ tending to 0 with a guaranteed rate of convergence. Disappointingly, such estimates do not exist. As our next theorem indicates, it is impossible to obtain nontrivial rate of convergence results without imposing strong restrictions on the distribution of $(X,Y)$, because even when the distribution of $X$ is good and $Y = m(X)$, i.e., $Y$ is noiseless, the rate of convergence of any estimate can be arbitrarily slow.


Theorem 3.1. Let $\{a_n\}$ be a sequence of positive numbers converging to zero. For every sequence of regression estimates, there exists a distribution of $(X,Y)$, such that $X$ is uniformly distributed on $[0,1]$, $Y = m(X)$, $m$ is $\pm 1$ valued, and
\[
\limsup_{n\to\infty} \frac{\mathbf{E}\|m_n - m\|^2}{a_n} \ge 1.
\]

Proof. Without loss of generality we assume that $1/4 \ge a_1 \ge a_2 \ge \dots > 0$, otherwise replace $a_n$ by $\min\{1/4, a_1, \dots, a_n\}$. Let $\{p_j\}$ be a probability distribution and let $\mathcal{P} = \{A_j\}$ be a partition of $[0,1]$ such that $A_j$ is an interval of length $p_j$. We consider regression functions indexed by a parameter
\[
c = (c_1, c_2, \dots),
\]
where $c_j \in \{-1, 1\}$. Denote the set of all such parameters by $\mathcal{C}$. For $c = (c_1, c_2, \dots) \in \mathcal{C}$ define $m^{(c)} : [0,1] \to \{-1, 1\}$ by
\[
m^{(c)}(x) = c_j \quad \text{if } x \in A_j,
\]
i.e., $m^{(c)}$ is piecewise constant with respect to the partition $\mathcal{P}$ and takes the value $c_j$ on $A_j$.

Fix a sequence of regression estimates $\{m_n\}$. We will show that there exists a distribution of $(X,Y)$ such that $X$ is uniformly distributed on $[0,1]$, $Y = m^{(c)}(X)$ for some $c \in \mathcal{C}$, and
\[
\limsup_{n\to\infty} \frac{\mathbf{E}\int |m_n(x) - m^{(c)}(x)|^2 \mu(dx)}{a_n} \ge 1.
\]

Let $\bar m_n$ be the projection of $m_n$ onto the set of all functions which are piecewise constant with respect to $\mathcal{P}$, i.e., for $x \in A_j$, set
\[
\bar m_n(x) = \frac{1}{p_j} \int_{A_j} m_n(z)\, \mu(dz).
\]

[Figure 3.1. Regression function for $p_1 = \frac18$, $p_2 = \frac48$, $p_3 = \frac18$, $p_4 = \frac28$, and $c_1 = c_4 = 1$, $c_2 = c_3 = -1$.]


Then
\[
\int_{A_j} |m_n(x) - m^{(c)}(x)|^2 \mu(dx)
= \int_{A_j} |m_n(x) - \bar m_n(x)|^2 \mu(dx) + \int_{A_j} |\bar m_n(x) - m^{(c)}(x)|^2 \mu(dx)
\ge \int_{A_j} |\bar m_n(x) - m^{(c)}(x)|^2 \mu(dx).
\]
Set
\[
\hat c_{nj} = \begin{cases} 1 & \text{if } \int_{A_j} m_n(z)\, \mu(dz)/p_j \ge 0, \\ -1 & \text{otherwise.} \end{cases}
\]
Fix $x \in A_j$. If $\hat c_{nj} = 1$ and $c_j = -1$, then $\bar m_n(x) \ge 0$ and $m^{(c)}(x) = -1$, which implies
\[
|\bar m_n(x) - m^{(c)}(x)|^2 \ge 1.
\]
If $\hat c_{nj} = -1$ and $c_j = 1$, then $\bar m_n(x) < 0$ and $m^{(c)}(x) = 1$, which again implies
\[
|\bar m_n(x) - m^{(c)}(x)|^2 \ge 1.
\]
It follows that
\[
\int_{A_j} |m_n(x) - m^{(c)}(x)|^2 \mu(dx) \ge \int_{A_j} |\bar m_n(x) - m^{(c)}(x)|^2 \mu(dx)
\ge I_{\{\hat c_{nj} \neq c_j\}} \cdot \int_{A_j} 1\, \mu(dx)
= I_{\{\hat c_{nj} \neq c_j\}} \cdot p_j
\ge I_{\{\hat c_{nj} \neq c_j\}} \cdot I_{\{\mu_n(A_j) = 0\}} \cdot p_j,
\]
and so
\[
\mathbf{E}\int |m_n(x) - m^{(c)}(x)|^2 \mu(dx) \ge \sum_{j=1}^{\infty} \mathbf{P}\{\hat c_{nj} \neq c_j,\, \mu_n(A_j) = 0\}\, p_j =: R_n(c).
\]

Now we randomize $c$. Let $C_1, C_2, \dots$ be a sequence of independent and identically distributed random variables independent of $X_1, X_2, \dots$ which satisfy
\[
\mathbf{P}\{C_1 = 1\} = \mathbf{P}\{C_1 = -1\} = \frac12,
\]
and set
\[
C = (C_1, C_2, \dots).
\]


On the one hand,
\[
R_n(c) \le \sum_{j=1}^{\infty} \mathbf{P}\{\mu_n(A_j) = 0\}\, p_j = \sum_{j=1}^{\infty} (1 - p_j)^n p_j,
\]
and, on the other hand,
\[
\mathbf{E} R_n(C) = \sum_{j=1}^{\infty} \mathbf{P}\{\hat c_{nj} \neq C_j,\, \mu_n(A_j) = 0\}\, p_j
= \sum_{j=1}^{\infty} \mathbf{E}\left\{ \mathbf{P}\{\hat c_{nj} \neq C_j,\, \mu_n(A_j) = 0 \,|\, X_1, \dots, X_n\} \right\} p_j
= \sum_{j=1}^{\infty} \mathbf{E}\left\{ I_{\{\mu_n(A_j) = 0\}} \mathbf{P}\{\hat c_{nj} \neq C_j \,|\, X_1, \dots, X_n\} \right\} p_j
= \sum_{j=1}^{\infty} \mathbf{E}\left\{ I_{\{\mu_n(A_j) = 0\}} \right\} \frac12\, p_j
= \frac12 \sum_{j=1}^{\infty} (1 - p_j)^n p_j,
\]
where we used the fact that for $\mu_n(A_j) = 0$ the random variables $\hat c_{nj}$ and $C_j$ are independent given $X_1, \dots, X_n$. Thus
\[
\frac{R_n(c)}{\mathbf{E} R_n(C)} \le 2.
\]
We can apply the Fatou lemma:
\[
\mathbf{E}\left\{ \limsup_{n\to\infty} \frac{R_n(C)}{\mathbf{E} R_n(C)} \right\} \ge \limsup_{n\to\infty} \frac{\mathbf{E} R_n(C)}{\mathbf{E} R_n(C)} = 1,
\]
which implies that there exists $c \in \mathcal{C}$ such that
\[
\limsup_{n\to\infty} \frac{R_n(c)}{\mathbf{E} R_n(C)} \ge 1.
\]
Summarizing the above results we get that there exists a distribution of $(X,Y)$ such that $X$ is uniformly distributed on $[0,1]$, $Y = m^{(c)}(X)$ for some $c \in \mathcal{C}$, and
\[
\limsup_{n\to\infty} \frac{\mathbf{E}\int |m_n(x) - m(x)|^2 \mu(dx)}{\frac12 \sum_{j=1}^{\infty} (1 - p_j)^n p_j} \ge 1.
\]


What is left is to show that for $\{a_n\}$ satisfying $1/4 \ge a_1 \ge a_2 \ge \dots > 0$ and $a_n \to 0$, there is a distribution $\{p_j\}$ such that
\[
\frac12 \sum_{j=1}^{\infty} (1 - p_j)^n p_j \ge a_n
\]
for all $n$. This is shown by the following lemma:

Lemma 3.1. Let $\{a_n\}$ be a sequence of positive numbers converging to zero with $1/2 \ge a_1 \ge a_2 \ge \dots$. Then there is a distribution $\{p_j\}$ such that
\[
\sum_{j=1}^{\infty} (1 - p_j)^n p_j \ge a_n
\]
for all $n$.

Proof. Set
\[
p_1 = 1 - 2 a_1
\]
and choose integers $k_n$,
\[
1 = k_1 < k_2 < \dots,
\]
and $p_2, p_3, \dots$ such that, for $i > k_n$,
\[
p_i \le \frac{1}{2n} \quad \text{and} \quad \sum_{i=k_n+1}^{k_{n+1}} p_i = 2(a_n - a_{n+1}).
\]
Then
\[
\sum_{j=1}^{\infty} p_j = p_1 + \sum_{n=1}^{\infty} 2(a_n - a_{n+1}) = 1
\]
and
\[
\sum_{j=1}^{\infty} (1 - p_j)^n p_j \ge \left( 1 - \frac{1}{2n} \right)^n \sum_{p_j \le 1/2n} p_j
\ge \frac12 \sum_{p_j \le 1/2n} p_j
\]
(because $(1 - 1/(2n))^n \ge 1/2$, cf. Problem 3.2)
\[
\ge \frac12 \sum_{j=k_n+1}^{\infty} p_j
\]


\[
= \frac12 \sum_{i=n}^{\infty} 2(a_i - a_{i+1})
= a_n.
\]
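The construction in Lemma 3.1 is easy to carry out numerically. The sketch below (an added illustration; the particular slow sequence $a_n$ and the truncation to finitely many blocks are assumptions) builds an initial segment of $\{p_j\}$ and prints the left- and right-hand sides of the lemma's inequality for a few values of $n$.

```python
import numpy as np

def slow_rate_distribution(a, n_blocks):
    """Initial segment of the probabilities {p_j} from the proof of Lemma 3.1.

    a : nonincreasing array with a[0] <= 1/2 (a[k] plays the role of a_{k+1}).
    Block n carries total mass 2*(a_n - a_{n+1}), split into pieces of size
    at most 1/(2n); the truncated tail mass 2*a_{n_blocks} is omitted."""
    p = [1 - 2 * a[0]]                              # p_1 = 1 - 2 a_1
    for n in range(1, n_blocks):
        mass = 2 * (a[n - 1] - a[n])
        if mass <= 0:
            continue
        pieces = int(np.ceil(mass * 2 * n))         # each piece <= 1/(2n)
        p.extend([mass / pieces] * pieces)
    return np.array(p)

a = 0.4 / np.sqrt(np.arange(1, 2001))               # an illustrative slow sequence
p = slow_rate_distribution(a, n_blocks=2000)
for n in (1, 5, 20):
    print(n, np.sum((1 - p) ** n * p), ">=", a[n - 1])
```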

Pattern recognition is easier than regression estimation (cf. Theorem 1.1), therefore lower bounds for pattern recognition imply lower bounds for regression estimation, so Theorem 3.1 can be sharpened as follows: Let $\{a_n\}$ be a sequence of positive numbers converging to zero with $1/64 \ge a_1 \ge a_2 \ge \dots$. For every sequence of regression estimates, there exists a distribution of $(X,Y)$, such that $X$ is uniformly distributed on $[0,1]$, $Y = m(X)$, and
\[
\mathbf{E}\|m_n - m\|^2 \ge a_n
\]
for all $n$.

The proof of this statement applies the combination of Theorem 7.2 and Problem 7.2 in Devroye, Gyorfi, and Lugosi (1996): Consider the classification problem. Let $\{a'_n\}$ be a sequence of positive numbers converging to zero with $1/16 \ge a'_1 \ge a'_2 \ge \dots$. For every sequence of classification rules $\{g_n\}$, there exists a distribution of $(X,Y)$, such that $X$ is uniformly distributed on $[0,1]$, $Y = m(X)$, and
\[
\mathbf{P}\{g_n(X) \neq Y\} \ge a'_n
\]
for all $n$. Using this, for an arbitrary regression estimate $m_n$, introduce the classification rule $g_n$ such that $g_n(X) = 1$ if $m_n(X) \ge 1/2$ and $0$ otherwise. Apply the above-mentioned result for $a'_n = 4 a_n$. Then
\[
\mathbf{E}\|m_n - m\|^2 = \mathbf{E}(m_n(X) - m(X))^2 = \mathbf{E}(m_n(X) - Y)^2
\ge \mathbf{E}(g_n(X) - Y)^2/4 = \mathbf{P}\{g_n(X) \neq Y\}/4 \ge a'_n/4 = a_n.
\]

3.2 Minimax Lower Bounds

Theorem 3.1 shows that universally good regression estimates do not exist even in the case of a nice distribution of $X$ and noiseless $Y$. Rate of convergence studies for particular estimates must necessarily be accompanied by conditions on $(X,Y)$. Under certain regularity conditions it is possible to obtain upper bounds for the rates of convergence to 0 for $\mathbf{E}\|m_n - m\|^2$ of certain estimates. Then it is natural to ask what the fastest achievable rate is for the given class of distributions.


Let $\mathcal{D}$ be a class of distributions of $(X,Y)$. Given data
\[
D_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\},
\]
an arbitrary regression estimate is denoted by $m_n$.

In the classical minimax approach one tries to minimize the maximal error within a class of distributions. If we use $\mathbf{E}\|m_n - m\|^2$ as error, then this means that one tries to minimize
\[
\sup_{(X,Y)\in\mathcal{D}} \mathbf{E}\{(m_n(X) - m(X))^2\}.
\]
In the sequel we will derive asymptotic lower bounds of
\[
\inf_{m_n} \sup_{(X,Y)\in\mathcal{D}} \mathbf{E}\{(m_n(X) - m(X))^2\}
\]
for special classes $\mathcal{D}$ of distributions. Here the infimum is taken over all estimates $m_n$, i.e., over all measurable functions of the data.

Definition 3.1. The sequence of positive numbers $\{a_n\}$ is called the lower minimax rate of convergence for the class $\mathcal{D}$ if
\[
\liminf_{n\to\infty} \inf_{m_n} \sup_{(X,Y)\in\mathcal{D}} \frac{\mathbf{E}\|m_n - m\|^2}{a_n} = C_1 > 0.
\]

Definition 3.2. The sequence of positive numbers $\{a_n\}$ is called the optimal rate of convergence for the class $\mathcal{D}$ if it is a lower minimax rate of convergence and there is an estimate $m_n$ such that
\[
\limsup_{n\to\infty} \sup_{(X,Y)\in\mathcal{D}} \frac{\mathbf{E}\|m_n - m\|^2}{a_n} = C_0 < \infty.
\]

We will derive rate of convergence results for classes of distributions where the regression function satisfies the following smoothness condition:

Definition 3.3. Let $p = k + \beta$ for some $k \in \mathbf{N}_0$ and $0 < \beta \le 1$, and let $C > 0$. A function $f : \mathbf{R}^d \to \mathbf{R}$ is called $(p, C)$-smooth if for every $\alpha = (\alpha_1, \dots, \alpha_d)$, $\alpha_i \in \mathbf{N}_0$, $\sum_{j=1}^{d} \alpha_j = k$, the partial derivative $\frac{\partial^k f}{\partial x_1^{\alpha_1} \dots \partial x_d^{\alpha_d}}$ exists and satisfies
\[
\left| \frac{\partial^k f}{\partial x_1^{\alpha_1} \dots \partial x_d^{\alpha_d}}(x) - \frac{\partial^k f}{\partial x_1^{\alpha_1} \dots \partial x_d^{\alpha_d}}(z) \right| \le C \cdot \|x - z\|^{\beta} \quad (x, z \in \mathbf{R}^d).
\]
Let $F^{(p,C)}$ be the set of all $(p, C)$-smooth functions $f : \mathbf{R}^d \to \mathbf{R}$.

Clearly, if $\mathcal{D}' \subseteq \mathcal{D}$ and $\{a_n\}$ is a lower minimax rate of convergence for $\mathcal{D}'$, then it is also a lower minimax rate of convergence for $\mathcal{D}$. Thus, to determine lower minimax rates of convergence, it might be useful to restrict the class of distributions. It turns out that it suffices to look at classes of distributions where $X$ is uniformly distributed on $[0,1]^d$ and $Y - m(X)$ has a normal distribution.


Definition 3.4. Let $\mathcal{D}^{(p,C)}$ be the class of distributions of $(X,Y)$ such that:
(i) $X$ is uniformly distributed on $[0,1]^d$;
(ii) $Y = m(X) + N$, where $X$ and $N$ are independent and $N$ is standard normal; and
(iii) $m \in F^{(p,C)}$.

In the next theorem we derive a lower minimax rate of convergence for this class of distributions.

Theorem 3.2. For the class $\mathcal{D}^{(p,C)}$, the sequence
\[
a_n = n^{-\frac{2p}{2p+d}}
\]
is a lower minimax rate of convergence. In particular,
\[
\liminf_{n\to\infty} \inf_{m_n} \sup_{(X,Y)\in\mathcal{D}^{(p,C)}} \frac{\mathbf{E}\|m_n - m\|^2}{C^{\frac{2d}{2p+d}} n^{-\frac{2p}{2p+d}}} \ge C_1 > 0
\]
for some constant $C_1$ independent of $C$.

For bounded $X$, in the subsequent chapters, we will discuss estimates which achieve the lower minimax rate of convergence in Theorem 3.2 for Lipschitz continuous $m$ (cf. Theorems 4.3, 5.2, and 6.2) and for $m$ from a higher smoothness class (cf. Corollary 11.2, Theorem 14.5, and Corollary 19.1). Therefore this is the optimal rate of convergence for this class.

The proof of Theorem 3.2 applies the following lemma:

Lemma 3.2. Let $u$ be an $l$-dimensional real vector, let $C$ be a zero mean random variable taking values in $\{-1, +1\}$, and let $N$ be an $l$-dimensional standard normal random variable, independent of $C$. Set
\[
Z = C u + N.
\]
Then the error probability of the Bayes decision for $C$ based on $Z$ is
\[
L^* := \min_{g : \mathbf{R}^l \to \mathbf{R}} \mathbf{P}\{g(Z) \neq C\} = \Phi(-\|u\|),
\]
where $\Phi$ is the standard normal distribution function.

Proof. Let $\varphi$ be the density of an $l$-dimensional standard normal random variable. The conditional density of $Z$, given $C = 1$, is $\varphi(z - u)$ and, given $C = -1$, is $\varphi(z + u)$. For an arbitrary decision rule $g : \mathbf{R}^l \to \mathbf{R}$ one obtains
\[
\mathbf{P}\{g(Z) \neq C\}
= \mathbf{P}\{C = 1\}\mathbf{P}\{g(Z) \neq C | C = 1\} + \mathbf{P}\{C = -1\}\mathbf{P}\{g(Z) \neq C | C = -1\}
= \frac12 \mathbf{P}\{g(Z) = -1 | C = 1\} + \frac12 \mathbf{P}\{g(Z) = 1 | C = -1\}
= \frac12 \int I_{\{g(z) = -1\}} \varphi(z - u)\, dz + \frac12 \int I_{\{g(z) = +1\}} \varphi(z + u)\, dz
\]


[Figure 3.2. (a) $z$ is closer to $u$ than to $-u$, i.e., $(z, u) > 0$; (b) $z$ is closer to $-u$ than to $u$, i.e., $(z, u) < 0$.]

\[
= \frac12 \int \left( I_{\{g(z) = -1\}} \varphi(z - u) + I_{\{g(z) = +1\}} \varphi(z + u) \right) dz.
\]
The above expression is minimized by the Bayes decision rule
\[
g^*(z) = \begin{cases} 1 & \text{if } \varphi(z - u) > \varphi(z + u), \\ -1 & \text{otherwise.} \end{cases}
\]
This together with
\[
\varphi(z - u) > \varphi(z + u) \iff z \text{ is closer to } u \text{ than to } -u \iff (u, z) > 0
\]
proves that the Bayes decision rule is given by
\[
g^*(z) = \begin{cases} 1 & \text{if } (u, z) > 0, \\ -1 & \text{if } (u, z) \le 0, \end{cases}
\]
where $(u, z)$ denotes the inner product of $u$ and $z$. Hence
\[
L^* = \mathbf{P}\{g^*(Z) \neq C\}
= \mathbf{P}\{g^*(Z) = -1, C = 1\} + \mathbf{P}\{g^*(Z) = 1, C = -1\}
= \mathbf{P}\{(u, Z) \le 0, C = 1\} + \mathbf{P}\{(u, Z) > 0, C = -1\}
= \mathbf{P}\{\|u\|^2 + (u, N) \le 0, C = 1\} + \mathbf{P}\{-\|u\|^2 + (u, N) > 0, C = -1\}
\]
(because of $Z = Cu + N$)
\[
= \frac12 \mathbf{P}\{(u, N) \le -\|u\|^2\} + \frac12 \mathbf{P}\{(u, N) > \|u\|^2\}.
\]
For $u = 0$, one obtains
\[
L^* = \frac12 \cdot 1 + \frac12 \cdot 0 = \Phi(-\|u\|).
\]
For $u \neq 0$,
\[
\frac{(u, N)}{\|u\|} = \left( \frac{u}{\|u\|}, N \right)
\]
is a one-dimensional standard normal variable, which implies
\[
L^* = \frac12 \mathbf{P}\left\{ \left( \frac{u}{\|u\|}, N \right) \le -\|u\| \right\} + \frac12 \mathbf{P}\left\{ \left( \frac{u}{\|u\|}, N \right) > \|u\| \right\} = \Phi(-\|u\|).
\]

Proof of Theorem 3.2. First we define (depending on $n$) subclasses of distributions of $(X,Y)$ contained in $\mathcal{D}^{(p,C)}$. Set
\[
M_n = \left\lceil (C^2 n)^{\frac{1}{2p+d}} \right\rceil.
\]
Partition $[0,1]^d$ by $M_n^d$ cubes $\{A_{n,j}\}$ of side length $1/M_n$ and with centers $\{a_{n,j}\}$. Choose a function $\bar g : \mathbf{R}^d \to \mathbf{R}$ such that the support of $\bar g$ is a subset of $[-\frac12, \frac12]^d$, $\int \bar g^2(x)\, dx > 0$, and $\bar g \in F^{(p, 2^{\beta-1})}$. Define $g : \mathbf{R}^d \to \mathbf{R}$ by $g(x) = C \cdot \bar g(x)$. Then:
(I) the support of $g$ is a subset of $[-\frac12, \frac12]^d$;
(II) $\int g^2(x)\, dx = C^2 \int \bar g^2(x)\, dx$ and $\int g^2(x)\, dx > 0$; and
(III) $g \in F^{(p, C 2^{\beta-1})}$.
The class of regression functions is indexed by a vector
\[
c_n = (c_{n,1}, \dots, c_{n,M_n^d})
\]
of $+1$ or $-1$ components, so the "worst regression function" will depend on the sample size $n$. Denote the set of all such vectors by $\mathcal{C}_n$. For $c_n = (c_{n,1}, \dots, c_{n,M_n^d}) \in \mathcal{C}_n$ define the function
\[
m^{(c_n)}(x) = \sum_{j=1}^{M_n^d} c_{n,j} g_{n,j}(x),
\]
where
\[
g_{n,j}(x) = M_n^{-p} g(M_n(x - a_{n,j})).
\]
Next we show that, because of (III),
\[
m^{(c_n)} \in F^{(p,C)}.
\]
Let $\alpha = (\alpha_1, \dots, \alpha_d)$, $\alpha_i \in \mathbf{N}_0$, and $\sum_{j=1}^{d} \alpha_j = k$. Set $D^{\alpha} = \frac{\partial^k}{\partial x_1^{\alpha_1} \dots \partial x_d^{\alpha_d}}$. If $x, z \in A_{n,i}$ then (I) and (III) imply
\[
|D^{\alpha} m^{(c_n)}(x) - D^{\alpha} m^{(c_n)}(z)|
= |c_{n,i}| \cdot |D^{\alpha} g_{n,i}(x) - D^{\alpha} g_{n,i}(z)|
\le C 2^{\beta-1} M_n^{-p} M_n^{k} \|M_n(x - a_{n,i}) - M_n(z - a_{n,i})\|^{\beta}
\le C 2^{\beta-1} \|x - z\|^{\beta}
\le C \|x - z\|^{\beta}.
\]
Now assume that $x \in A_{n,i}$ and $z \in A_{n,j}$ for $i \neq j$. Choose $\bar x$, $\bar z$ on the line between $x$ and $z$ such that $\bar x$ is on the boundary of $A_{n,i}$, $\bar z$ is on the


boundary of $A_{n,j}$, and $\|x - \bar x\| + \|z - \bar z\| \le \|x - z\|$. Then
\[
|D^{\alpha} m^{(c_n)}(x) - D^{\alpha} m^{(c_n)}(z)|
= |c_{n,i} D^{\alpha} g_{n,i}(x) - c_{n,j} D^{\alpha} g_{n,j}(z)|
\le |c_{n,i} D^{\alpha} g_{n,i}(x)| + |c_{n,j} D^{\alpha} g_{n,j}(z)|
= |c_{n,i}| \cdot |D^{\alpha} g_{n,i}(x) - D^{\alpha} g_{n,i}(\bar x)| + |c_{n,j}| \cdot |D^{\alpha} g_{n,j}(z) - D^{\alpha} g_{n,j}(\bar z)|
\]
(because of $D^{\alpha} g_{n,i}(\bar x) = D^{\alpha} g_{n,j}(\bar z) = 0$)
\[
\le C 2^{\beta-1} \left( \|x - \bar x\|^{\beta} + \|z - \bar z\|^{\beta} \right)
\]
(as in the first case)
\[
= C 2^{\beta} \left( \frac12 \|x - \bar x\|^{\beta} + \frac12 \|z - \bar z\|^{\beta} \right)
\le C 2^{\beta} \left( \frac{\|x - \bar x\|}{2} + \frac{\|z - \bar z\|}{2} \right)^{\beta}
\]
(by Jensen's inequality)
\[
\le C \|x - z\|^{\beta}.
\]
Hence, each distribution of $(X,Y)$ with $Y = m^{(c)}(X) + N$ for some $c \in \mathcal{C}_n$ is contained in $\mathcal{D}^{(p,C)}$, and it suffices to show
\[
\liminf_{n\to\infty} \inf_{m_n} \sup_{(X,Y):\, Y = m^{(c)}(X)+N,\, c \in \mathcal{C}_n} \frac{M_n^{2p}}{C^2}\, \mathbf{E}\|m_n - m^{(c)}\|^2 > 0.
\]

Let $m_n$ be an arbitrary estimate. By definition, $\{g_{n,j} : j\}$ is an orthogonal system in $L_2$, therefore the projection $\bar m_n$ of $m_n$ onto $\{m^{(c)} : c \in \mathcal{C}_n\}$ is given by
\[
\bar m_n(x) = \sum_{j=1}^{M_n^d} \hat c_{n,j} g_{n,j}(x),
\]
where
\[
\hat c_{n,j} = \frac{\int_{A_{n,j}} m_n(x) g_{n,j}(x)\, dx}{\int_{A_{n,j}} g_{n,j}^2(x)\, dx}.
\]
Let $c \in \mathcal{C}_n$ be arbitrary. Then
\[
\|m_n - m^{(c)}\|^2 \ge \|\bar m_n - m^{(c)}\|^2
= \sum_{j=1}^{M_n^d} \int_{A_{n,j}} (\hat c_{n,j} g_{n,j}(x) - c_{n,j} g_{n,j}(x))^2\, dx
= \sum_{j=1}^{M_n^d} \int_{A_{n,j}} (\hat c_{n,j} - c_{n,j})^2 g_{n,j}^2(x)\, dx
= \int g^2(x)\, dx \sum_{j=1}^{M_n^d} (\hat c_{n,j} - c_{n,j})^2 \frac{1}{M_n^{2p+d}}.
\]

Let $\bar c_{n,j}$ be $1$ if $\hat c_{n,j} \ge 0$ and $-1$ otherwise. Because of
\[
|\hat c_{n,j} - c_{n,j}| \ge |\bar c_{n,j} - c_{n,j}|/2,
\]
we get
\[
\|m_n - m^{(c)}\|^2 \ge \int g^2(x)\, dx \cdot \frac14 \frac{1}{M_n^{2p+d}} \sum_{j=1}^{M_n^d} (\bar c_{n,j} - c_{n,j})^2
\ge \int g^2(x)\, dx \cdot \frac{1}{M_n^{2p+d}} \sum_{j=1}^{M_n^d} I_{\{\bar c_{n,j} \neq c_{n,j}\}}
= \frac{C^2}{M_n^{2p}} \cdot \int \bar g^2(x)\, dx \cdot \frac{1}{M_n^d} \sum_{j=1}^{M_n^d} I_{\{\bar c_{n,j} \neq c_{n,j}\}}.
\]
Thus for the proof we need
\[
\liminf_{n\to\infty} \inf_{\bar c_n} \sup_{c_n} \frac{1}{M_n^d} \sum_{j=1}^{M_n^d} \mathbf{P}\{\bar c_{n,j} \neq c_{n,j}\} > 0.
\]
Now we randomize $c_n$. Let $C_{n,1}, \dots, C_{n,M_n^d}$ be a sequence of i.i.d. random variables independent of $(X_1, N_1), (X_2, N_2), \dots$, which satisfy
\[
\mathbf{P}\{C_{n,1} = 1\} = \mathbf{P}\{C_{n,1} = -1\} = \frac12.
\]
Set
\[
C_n = (C_{n,1}, \dots, C_{n,M_n^d}).
\]

Then
\[
\inf_{\bar c_n} \sup_{c_n} \frac{1}{M_n^d} \sum_{j=1}^{M_n^d} \mathbf{P}\{\bar c_{n,j} \neq c_{n,j}\} \ge \inf_{\bar c_n} \frac{1}{M_n^d} \sum_{j=1}^{M_n^d} \mathbf{P}\{\bar c_{n,j} \neq C_{n,j}\},
\]
where $\bar c_{n,j}$ can be interpreted as a decision on $C_{n,j}$ using $D_n$. Its error probability is minimal for the Bayes decision $\hat C_{n,j}$, which is $1$ if $\mathbf{P}\{C_{n,j} = 1 | D_n\} \ge \frac12$ and $-1$ otherwise, therefore
\[
\inf_{\bar c_n} \frac{1}{M_n^d} \sum_{j=1}^{M_n^d} \mathbf{P}\{\bar c_{n,j} \neq C_{n,j}\} \ge \frac{1}{M_n^d} \sum_{j=1}^{M_n^d} \mathbf{P}\{\hat C_{n,j} \neq C_{n,j}\}
= \mathbf{P}\{\hat C_{n,1} \neq C_{n,1}\}
= \mathbf{E}\{\mathbf{P}\{\hat C_{n,1} \neq C_{n,1} | X_1, \dots, X_n\}\}.
\]


Let $X_{i_1}, \dots, X_{i_l}$ be those $X_i \in A_{n,1}$. Then
\[
(Y_{i_1}, \dots, Y_{i_l}) = C_{n,1} \cdot (g_{n,1}(X_{i_1}), \dots, g_{n,1}(X_{i_l})) + (N_{i_1}, \dots, N_{i_l}),
\]
while
\[
\{Y_1, \dots, Y_n\} \setminus \{Y_{i_1}, \dots, Y_{i_l}\}
\]
depends only on $C \setminus \{C_{n,1}\}$ and on $X_r$'s and $N_r$'s with $r \notin \{i_1, \dots, i_l\}$, and therefore is independent of $C_{n,1}$ given $X_1, \dots, X_n$. Now conditioning on $X_1, \dots, X_n$, the error of the conditional Bayes decision for $C_{n,1}$ based on $(Y_1, \dots, Y_n)$ depends only on $(Y_{i_1}, \dots, Y_{i_l})$, hence Lemma 3.2 implies
\[
\mathbf{P}\{\hat C_{n,1} \neq C_{n,1} | X_1, \dots, X_n\}
= \Phi\left( -\sqrt{\sum_{r=1}^{l} g_{n,1}^2(X_{i_r})} \right)
= \Phi\left( -\sqrt{\sum_{i=1}^{n} g_{n,1}^2(X_i)} \right),
\]
where $\Phi$ is the standard normal distribution function. The second derivative of $\Phi(-\sqrt{x})$ is positive, therefore $\Phi(-\sqrt{x})$ is convex, so by Jensen's inequality
\[
\mathbf{P}\{\hat C_{n,1} \neq C_{n,1}\}
= \mathbf{E}\left\{ \Phi\left( -\sqrt{\sum_{i=1}^{n} g_{n,1}^2(X_i)} \right) \right\}
\ge \Phi\left( -\sqrt{\mathbf{E}\left\{ \sum_{i=1}^{n} g_{n,1}^2(X_i) \right\}} \right)
= \Phi\left( -\sqrt{n \mathbf{E} g_{n,1}^2(X_1)} \right)
= \Phi\left( -\sqrt{n M_n^{-(2p+d)} \int g^2(x)\, dx} \right)
\ge \Phi\left( -\sqrt{\int \bar g^2(x)\, dx} \right) > 0.
\]

3.3 Individual Lower Bounds

In some sense, the lower bounds in Section 3.2 are not satisfactory. They do not tell us anything about the way the error decreases as the sample size is increased for a given regression problem. These bounds, for each $n$,


give information about the maximal error within the class, but not about the behavior of the error for a single fixed distribution as the sample size $n$ increases. In other words, the "bad" distribution, causing the largest error for an estimator, may be different for each $n$. For example, the lower bound for the class $\mathcal{D}^{(p,C)}$ does not exclude the possibility that there exists a sequence of estimators $\{m_n\}$ such that for every distribution in $\mathcal{D}^{(p,C)}$, the expected error $\mathbf{E}\|m_n - m\|^2$ decreases at an exponential rate in $n$.

In this section, we are interested in "individual" minimax lower bounds that describe the behavior of the error for a fixed distribution of $(X,Y)$ as the sample size $n$ grows.

Definition 3.5. A sequence of positive numbers $\{a_n\}$ is called the individual lower rate of convergence for a class $\mathcal{D}$ of distributions of $(X,Y)$, if
\[
\inf_{\{m_n\}} \sup_{(X,Y)\in\mathcal{D}} \limsup_{n\to\infty} \frac{\mathbf{E}\|m_n - m\|^2}{a_n} > 0,
\]
where the infimum is taken over all sequences $\{m_n\}$ of the estimates.

In this definition the $\limsup_{n\to\infty}$ can be replaced by $\liminf_{n\to\infty}$; here we consider $\limsup_{n\to\infty}$ for the sake of simplicity.

We will show that for every sequence $\{b_n\}$ tending to zero, $\{b_n n^{-\frac{2p}{2p+d}}\}$ is an individual lower rate of convergence of the class $\mathcal{D}^{(p,C)}$. Hence, there exist individual lower rates of these classes which are arbitrarily close to the optimal lower rates.

Theorem 3.3. Let $\{b_n\}$ be an arbitrary positive sequence tending to zero. Then the sequence
\[
b_n a_n = b_n n^{-\frac{2p}{2p+d}}
\]
is an individual lower rate of convergence for the class $\mathcal{D}^{(p,C)}$.

For the sequence $\{\sqrt{b_n}\}$ Theorem 3.3 implies that for all $\{m_n\}$ there is $(X,Y) \in \mathcal{D}^{(p,C)}$ such that
\[
\limsup_{n\to\infty} \frac{\mathbf{E}\|m_n - m\|^2}{\sqrt{b_n}\, a_n} > 0,
\]
thus
\[
\limsup_{n\to\infty} \frac{\mathbf{E}\|m_n - m\|^2}{b_n a_n} = \infty.
\]

Proof of Theorem 3.3. The proof is an extension of the proof of Theorem 3.2, but is a little involved. We therefore recommend skipping it during the first reading. First we define a subclass of distributions of $(X,Y)$ contained in $\mathcal{D}^{(p,C)}$. We pack infinitely many disjoint cubes into $[0,1]^d$ in the following way: For a given probability distribution $\{p_j\}$, let $\{B_j\}$ be a partition of $[0,1]$ such that $B_j$ is an interval of length $p_j$. We pack disjoint


[Figure 3.3. Two dimensional partition.]

cubes of volume $p_j^d$ into the rectangle
\[
B_j \times [0,1]^{d-1}.
\]
Denote these cubes by
\[
A_{j,1}, \dots, A_{j,S_j},
\]
where
\[
S_j = \left\lfloor \frac{1}{p_j} \right\rfloor^{d-1}.
\]
Let $a_{j,k}$ be the center of $A_{j,k}$. Choose a function $g : \mathbf{R}^d \to \mathbf{R}$ such that:
(I) the support of $g$ is a subset of $[-\frac12, \frac12]^d$;
(II) $\int g^2(x)\, dx > 0$; and
(III) $g \in F^{(p, C 2^{\beta-1})}$.
The class of regression functions is indexed by a vector
\[
c = (c_{1,1}, c_{1,2}, \dots, c_{1,S_1}, c_{2,1}, c_{2,2}, \dots, c_{2,S_2}, \dots)
\]
of $+1$ or $-1$ components. Denote the set of all such vectors by $\mathcal{C}$. For $c \in \mathcal{C}$ define the function
\[
m^{(c)}(x) = \sum_{j=1}^{\infty} \sum_{k=1}^{S_j} c_{j,k} g_{j,k}(x),
\]
where
\[
g_{j,k}(x) = p_j^{p}\, g(p_j^{-1}(x - a_{j,k})).
\]


As in the proof of Theorem 3.2, (III) implies that
\[
m^{(c)} \in F^{(p,C)}.
\]
Hence, each distribution of $(X,Y)$ with $Y = m^{(c)}(X) + N$ for some $c \in \mathcal{C}$ is contained in $\mathcal{D}^{(p,C)}$, which implies
\[
\inf_{\{m_n\}} \sup_{(X,Y)\in\mathcal{D}^{(p,C)}} \limsup_{n\to\infty} \frac{\mathbf{E}\|m_n - m\|^2}{b_n a_n}
\ge \inf_{\{m_n\}} \sup_{(X,Y):\, Y = m^{(c)}(X)+N,\, c\in\mathcal{C}} \limsup_{n\to\infty} \frac{\mathbf{E}\|m_n - m^{(c)}\|^2}{b_n a_n}. \tag{3.1}
\]
Let $m_n$ be an arbitrary estimate. By definition, $\{g_{j,k} : j, k\}$ is an orthogonal system in $L_2$, therefore the projection $\bar m_n$ of $m_n$ onto $\{m^{(c)} : c \in \mathcal{C}\}$ is given by
\[
\bar m_n(x) = \sum_{j,k} \hat c_{n,j,k} g_{j,k}(x),
\]
where
\[
\hat c_{n,j,k} = \frac{\int_{A_{j,k}} m_n(x) g_{j,k}(x)\, dx}{\int_{A_{j,k}} g_{j,k}^2(x)\, dx}.
\]
Let $c \in \mathcal{C}$ be arbitrary. Then
\[
\|m_n - m^{(c)}\|^2 \ge \|\bar m_n - m^{(c)}\|^2
= \sum_{j,k} \int_{A_{j,k}} (\hat c_{n,j,k} g_{j,k}(x) - c_{j,k} g_{j,k}(x))^2\, dx
= \sum_{j,k} \int_{A_{j,k}} (\hat c_{n,j,k} - c_{j,k})^2 g_{j,k}^2(x)\, dx
= \int g^2(x)\, dx \sum_{j,k} (\hat c_{n,j,k} - c_{j,k})^2 p_j^{2p+d}.
\]
Let $\bar c_{n,j,k}$ be $1$ if $\hat c_{n,j,k} \ge 0$ and $-1$ otherwise. Because of
\[
|\hat c_{n,j,k} - c_{j,k}| \ge |\bar c_{n,j,k} - c_{j,k}|/2,
\]
we get
\[
\|m_n - m^{(c)}\|^2 \ge \int g^2(x)\, dx\, \frac14 \sum_{j,k} (\bar c_{n,j,k} - c_{j,k})^2 p_j^{2p+d}
\ge \int g^2(x)\, dx \sum_{j,k} I_{\{\bar c_{n,j,k} \neq c_{j,k}\}} p_j^{2p+d}.
\]
This proves
\[
\mathbf{E}\|m_n - m^{(c)}\|^2 \ge \int g^2(x)\, dx \cdot R_n(c), \tag{3.2}
\]


where
\[
R_n(c) = \sum_{j:\, n p_j^{2p+d} \le 1} \sum_{k=1}^{S_j} p_j^{2p+d} \cdot \mathbf{P}\{\bar c_{n,j,k} \neq c_{j,k}\}. \tag{3.3}
\]
Relations (3.1) and (3.2) imply
\[
\inf_{\{m_n\}} \sup_{(X,Y)\in\mathcal{D}^{(p,C)}} \limsup_{n\to\infty} \frac{\mathbf{E}\|m_n - m\|^2}{b_n a_n}
\ge \int g^2(x)\, dx\, \inf_{\{m_n\}} \sup_{c\in\mathcal{C}} \limsup_{n\to\infty} \frac{R_n(c)}{b_n a_n}. \tag{3.4}
\]
To bound the last term, we fix a sequence $\{m_n\}$ of estimates and choose $c \in \mathcal{C}$ randomly. Let $C_{1,1}, \dots, C_{1,S_1}, C_{2,1}, \dots, C_{2,S_2}, \dots$ be a sequence of independent and identically distributed random variables independent of $(X_1, N_1), (X_2, N_2), \dots$, which satisfy
\[
\mathbf{P}\{C_{1,1} = 1\} = \mathbf{P}\{C_{1,1} = -1\} = \frac12.
\]
Set
\[
C = (C_{1,1}, \dots, C_{1,S_1}, C_{2,1}, \dots, C_{2,S_2}, \dots).
\]
Next we derive a lower bound for
\[
\mathbf{E} R_n(C) = \sum_{j:\, n p_j^{2p+d} \le 1} \sum_{k=1}^{S_j} p_j^{2p+d} \cdot \mathbf{P}\{\bar c_{n,j,k} \neq C_{j,k}\},
\]
where $\bar c_{n,j,k}$ can be interpreted as a decision on $C_{j,k}$ using $D_n$. Its error probability is minimal for the Bayes decision $\hat C_{n,j,k}$, which is $1$ if $\mathbf{P}\{C_{j,k} = 1 | D_n\} \ge \frac12$ and $-1$ otherwise, therefore
\[
\mathbf{P}\{\bar c_{n,j,k} \neq C_{j,k}\} \ge \mathbf{P}\{\hat C_{n,j,k} \neq C_{j,k}\}.
\]
Let $X_{i_1}, \dots, X_{i_l}$ be those $X_i \in A_{j,k}$. Then
\[
(Y_{i_1}, \dots, Y_{i_l}) = C_{j,k} \cdot (g_{j,k}(X_{i_1}), \dots, g_{j,k}(X_{i_l})) + (N_{i_1}, \dots, N_{i_l}),
\]
while
\[
\{Y_1, \dots, Y_n\} \setminus \{Y_{i_1}, \dots, Y_{i_l}\}
\]
depends only on $C \setminus \{C_{j,k}\}$ and on $X_r$'s and $N_r$'s with $r \notin \{i_1, \dots, i_l\}$, and therefore is independent of $C_{j,k}$ given $X_1, \dots, X_n$. Now conditioning on $X_1, \dots, X_n$, the error of the conditional Bayes decision for $C_{j,k}$ based on $(Y_1, \dots, Y_n)$ depends only on $(Y_{i_1}, \dots, Y_{i_l})$, hence Lemma 3.2 implies
\[
\mathbf{P}\{\hat C_{n,j,k} \neq C_{j,k} | X_1, \dots, X_n\} = \Phi\left( -\sqrt{\sum_{r=1}^{l} g_{j,k}^2(X_{i_r})} \right)
\]


\[
= \Phi\left( -\sqrt{\sum_{i=1}^{n} g_{j,k}^2(X_i)} \right).
\]
Since $\Phi(-\sqrt{x})$ is convex, by Jensen's inequality
\[
\mathbf{P}\{\hat C_{n,j,k} \neq C_{j,k}\} = \mathbf{E}\{\mathbf{P}\{\hat C_{n,j,k} \neq C_{j,k} | X_1, \dots, X_n\}\}
= \mathbf{E}\left\{ \Phi\left( -\sqrt{\sum_{i=1}^{n} g_{j,k}^2(X_i)} \right) \right\}
\ge \Phi\left( -\sqrt{\mathbf{E}\left\{ \sum_{i=1}^{n} g_{j,k}^2(X_i) \right\}} \right)
= \Phi\left( -\sqrt{n \mathbf{E} g_{j,k}^2(X_1)} \right)
= \Phi\left( -\sqrt{n p_j^{2p+d} \int g^2(x)\, dx} \right)
\]
independently of $k$, thus
\[
\mathbf{E} R_n(C) \ge \sum_{j:\, n p_j^{2p+d} \le 1} \sum_{k=1}^{S_j} p_j^{2p+d}\, \Phi\left( -\sqrt{n p_j^{2p+d} \int g^2(x)\, dx} \right)
\ge \Phi\left( -\sqrt{\int g^2(x)\, dx} \right) \sum_{j:\, n p_j^{2p+d} \le 1} S_j p_j^{2p+d}
\ge K_1 \cdot \sum_{j:\, n p_j^{2p+d} \le 1} p_j^{2p+1}, \tag{3.5}
\]
where
\[
K_1 = \Phi\left( -\sqrt{\int g^2(x)\, dx} \right) \left( \frac12 \right)^{d-1}.
\]
Since $b_n$ and $a_n$ tend to zero we can take a subsequence $\{n_t\}_{t\in\mathbf{N}}$ of $\{n\}_{n\in\mathbf{N}}$ with
\[
b_{n_t} \le 2^{-t} \quad \text{and} \quad a_{n_t}^{1/2p} \le 2^{-t}.
\]
Define $q_t$ such that
\[
\frac{2^{-t}}{q_t} = \left\lceil \frac{2^{-t}}{a_{n_t}^{1/2p}} \right\rceil,
\]

Page 66: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

3.3. Individual Lower Bounds 49

and choose pj as

q1, . . . , q1, q2, . . . , q2, . . . , qt, . . . , qt, . . . ,

where qt is repeated 2−t/qt times. So because of an = n− 2p2p+d ,

j:np2p+dj

≤1

p2p+1j =

t:nq2p+dt ≤1

2−t

qtq2p+1t

≥∑

t:nq2p+dt ≤1

bntq2pt

=∑

t:2−ta−1/2pnt

≥2−ta−1/2pn

bnt

2−t

⌈2−t

a1/2pnt

2p

≥∑

t:ant ≤an

bnt

2−t

2−t

a1/2pnt

+ 1

2p

=∑

t:nt≥n

bnt

(a1/2pnt

1 + 2ta1/2pnt

)2p

≥∑

t:nt≥n

bntant

22p

by a1/2pnt ≤ 2−t and, especially, for n = ns (3.5) implies

ERns(C) ≥ K1

j:nsp2p+dj

≤1

p2p+1j ≥ K1

22p

t≥s

bntant ≥ K1

22pbnsans . (3.6)

Using (3.6) one gets

infmn

supc∈C

lim supn→∞

Rn(c)bnan

≥ infmn

supc∈C

lim sups→∞

Rns(c)bnsans

≥ K1

22pinf

mnsupc∈C

lim sups→∞

Rns(c)ERns

(C)

≥ K1

22pinf

mnE

lim sups→∞

Rns(C)

ERns(C)

.

Because of (3.5) and the fact that, for all c ∈ C,

Rn(c) ≤∑

j:np2p+dj

≤1

Sjp2p+dj ≤

j:np2p+dj

≤1

p2p+1j ,

Page 67: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

50 3. Lower Bounds

the sequence Rns(C)/ERns

(C) is uniformly bounded, so we can applyFatou’s lemma to get

infmn

supc∈C

lim supn→∞

Rn(c)bnan

≥ K1

22pinf

mnlim sup

s→∞E(Rns(C)ERns

(C)

)=K1

22p> 0.

This together with (3.4) implies the assertion.

3.4 Bibliographic Notes

Versions of Theorem 3.1 appeared earlier in the literature. First, Cover(1968b) showed that for any sequence of classification rules, for sequencesan converging to zero at arbitrarily slow algebraic rates (i.e., as 1/nδ forarbitrarily small δ > 0), there exists a distribution such that the error prob-ability ≥ L∗ + an infinitely often. Devroye (1982b) strengthened Cover’sresult allowing sequences tending to zero arbitrarily slowly. Lemma 3.1 isdue to Devroye and Gyorfi (1985). Theorem 3.2 has been proved by Stone(1982). For related results on the general minimax theory of statisticalestimates see Ibragimov and Khasminskii (1980; 1981; 1982), Bretagnolleand Huber (1979), Birge (1983), and Korostelev and Tsybakov (1993). Theconcept of individual lower rates has been introduced in Birge (1986) con-cerning density estimation. Theorem 3.3 is from Antos, Gyorfi, and Kohler(2000). It uses the ideas from Antos and Lugosi (1998), who proved therelated results in pattern recognition.

Problems and Exercises

Problem 3.1. Prove a version of Theorem 3.1 when m is arbitrarily many timesdifferentiable on [0, 1).

Hint: Let g be an arbitrarily many times differentiable function on R suchthat it is zero outside [0, 1]. In the proof of Theorem 3.1, let aj denote the leftend point of Aj . Put

m(x) =∞∑

j=1

g

(x − aj

pj

).

Then m is arbitrarily many times differentiable on [0, 1). Follow the line of theproof of Theorem 3.1.

Problem 3.2. Show that(1 − 1

2n

)n

≥ 12.

Hint: For n ≥ 2,(1 − 1

2n

)n

≥(

11 + 1/(2n − 1)

)n

≥( 1

e1/(2n−1)

)n

≥ 1e2/3 ≥ 1

2.

Page 68: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

Problems and Exercises 51

Problem 3.3. Sometimes rate of convergence results for nonparametric regres-sion estimation are only shown for bounded |Y |, so it is reasonable to considerminimax lower bounds for such classes. Let D∗(p,C) be the class of distributionsof (X, Y ) such that:

(I’) X is uniformly distributed on [0, 1]d;(II’) Y ∈ 0, 1 a.s. and PY = 1|X = x = 1 − PY = 0|X = x = m(x)

for all x ∈ [0, 1]d; and(III’) m ∈ F (p,C) and m(x) ∈ [0, 1] for all x ∈ [0, 1]d.

Prove that for the class D∗(p,C) the sequence

an = n− 2p

2p+d

is a lower minimax rate of convergence.

Problem 3.4. Often minimax rates of convergence are defined using tail prob-abilities instead of expectations. For example, in Stone (1982) an is called thelower rate of convergence for the class D if

lim infn→∞

infmn

sup(X,Y )∈D

P

‖mn − m‖2

an≥ c

= C0 > 0. (3.7)

Show that (3.7) implies that an is a lower minimax rate of convergenceaccording to Definition 3.1.

Problem 3.5. Show that (3.7) holds for D = D(p,C) and

an = n− 2p

2p+d .

Hint: Stone (1982).

Problem 3.6. Show that (3.7) holds for D = D∗(p,C) (as defined in Problem3.3) and

an = n− 2p

2p+d .

Problem 3.7. Give a definition of individual lower rates of convergence usingtail probabilities.

Problem 3.8. Let bn be an arbitrary positive sequence tending to zero. Provethat for your definition in Problem 3.7 the sequence

bnan = bnn− 2p

2p+d

is an individual lower rate of convergence for the class D(p,C).

Page 69: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

4Partitioning Estimates

4.1 Introduction

Let Pn = An,1, An,2, . . . be a partition of Rd and for each x ∈ Rd

let An(x) denote the cell of Pn containing x. The partitioning estimate(histogram) of the regression function is defined as

mn(x) =∑n

i=1 YiIXi∈An(x)∑ni=1 IXi∈An(x)

with 0/0 = 0 by definition. This means that the partitioning estimate isa local averaging estimate such for a given x we take the average of thoseYi’s for which Xi belongs to the same cell into which x falls.

The simplest version of this estimate is obtained for d = 1 and when thecells An,j are intervals of size h = hn. Figures 4.1 – 4.3 show the estimatesfor various choices of h for our simulated data introduced in Chapter 1. Inthe first figure h is too small (undersmoothing, large variance), in the secondchoice it is about right, while in the third it is too large (oversmoothing,large bias).

For d > 1 one can use, e.g., a cubic partition, where the cells An,j arecubes of volume hd

n, or a rectangle partition which consists of rectanglesAn,j with side lengths hn1, . . . , hnd. For the sake of illustration we generatedtwo-dimensional data when the actual distribution is a correlated normaldistribution. The partition in Figure 4.4 is cubic, and the partition in Figure4.5 is made of rectangles.

Page 70: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

4.1. Introduction 53

−1 −0.5 0.5 1

0.5

Figure 4.1. Undersmoothing: h = 0.03, L2 error = 0.062433.

Cubic and rectangle partitions are particularly attractive from the com-putational point of view, because the set An(x) can be determined for eachx in constant time, provided that we use an appropriate data structure. Inmost cases, partitioning estimates are computationally superior to the othernonparametric estimates, particularly if the search for An(x) is organizedusing binary decision trees (cf. Friedman (1977)).

The partitions may depend on the data. Figure 4.6 shows such apartition, where each cell contains an equal number of points. This par-

−1 −0.5 0.5 1

0.5

Figure 4.2. Good choice: h = 0.1, L2 error = 0.003642.

Page 71: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

54 4. Partitioning Estimates

−1 −0.5 0.5 1

0.5

Figure 4.3. Oversmoothing: h = 0.5, L2 error = 0.013208.

tition consists of so-called statistically equivalent blocks. Data-dependentpartitions are dealt with in Chapter 13.

Another advantage of the partitioning estimate is that it can be repre-sented or compressed very efficiently. Instead of storing all data Dn, oneshould only know the estimate for each nonempty cell, i.e., for cells An,j

for which µn(An,j) > 0, where µn denotes the empirical distribution. Thenumber of nonempty cells is much smaller than n (cf. Problem 4.8).

Figure 4.4. Cubic partition.

Page 72: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

4.2. Stone’s Theorem 55

Figure 4.5. Rectangle partition.

4.2 Stone’s Theorem

In the next section we will prove the weak universal consistency of par-titioning estimates. In the proof we will use Stone’s theorem (Theorem4.1 below) which is a powerful tool for proving weak consistency for localaveraging regression function estimates. It will also be applied to provethe weak universal consistency of kernel and nearest neighbor estimates inChapters 5 and 6.

Local averaging regression function estimates take the form

mn(x) =n∑

i=1

Wni(x) · Yi,

where the weights Wn,i(x) = Wn,i(x,X1, . . . , Xn) ∈ R are depending onX1, . . . , Xn.

Figure 4.6. Statistically equivalent blocks.

Page 73: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

56 4. Partitioning Estimates

Usually the weights are nonnegative and Wn,i(x) is “small” if Xi is“far” from x. The most common examples for weights are the weights forpartitioning, kernel, and nearest neighbor estimates (cf. Section 2.1). Thenext theorem states conditions on the weights which guarantee the weakuniversal consistency of the local averaging estimates.

Theorem 4.1. (Stone’s theorem). Assume that the following condi-tions are satisfied for any distribution of X:

(i) There is a constant c such that for every nonnegative measurablefunction f satisfying Ef(X) < ∞ and any n,

E

n∑

i=1

|Wn,i(X)|f(Xi)

≤ cEf(X).

(ii) There is a D ≥ 1 such that

P

n∑

i=1

|Wn,i(X)| ≤ D

= 1,

for all n.(iii) For all a > 0,

limn→∞

E

n∑

i=1

|Wn,i(X)|I‖Xi−X‖>a

= 0.

(iv)n∑

i=1

Wn,i(X) → 1

in probability.(v)

limn→∞

E

n∑

i=1

Wn,i(X)2

= 0.

Then the corresponding regression function estimate mn is weakly univer-sally consistent, i.e.,

limn→∞

E∫

(mn(x) −m(x))2µ(dx)

= 0

for all distributions of (X,Y ) with EY 2 < ∞.

For nonnegative weights and noiseless data (i.e., Y = m(X) ≥ 0) con-dition (i) says that the mean value of the estimate is bounded above bysome constant times the mean value of the regression function. Condi-tions (ii) and (iv) state that the sum of the weights is bounded and isasymptotically 1. Condition (iii) ensures that the estimate at a point x is

Page 74: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

4.2. Stone’s Theorem 57

asymptotically influenced only by the data close to x. Condition (v) statesthat asymptotically all weights become small.

Proof of Theorem 4.1. Because of (a + b + c)2 ≤ 3a2 + 3b2 + 3c2 wehave that

Emn(X) −m(X)2 ≤ 3E

(n∑

i=1

Wn,i(X)(Yi −m(Xi))

)2

+ 3E

(n∑

i=1

Wn,i(X)(m(Xi) −m(X))

)2

+ 3E

((n∑

i=1

Wn,i(X) − 1

)

m(X)

)2

= 3In + 3Jn + 3Ln.

By the Cauchy–Schwarz inequality, and condition (ii),

Jn ≤ E

(n∑

i=1

√|Wn,i(X)|

√|Wn,i(X)| |m(Xi) −m(X)|

)2

≤ E

(n∑

i=1

|Wn,i(X)|)(

n∑

i=1

|Wn,i(X)|(m(Xi) −m(X))2)

≤ DE

n∑

i=1

|Wn,i(X)|(m(Xi) −m(X))2

= DJ ′n.

Because of Theorem A.1, for ε > 0 we can choose m bounded and uniformlycontinuous such that

E(m(X) − m(X))2 < ε.

Then

J ′n ≤ 3E

n∑

i=1

|Wn,i(X)|(m(Xi) − m(Xi))2

+ 3E

n∑

i=1

|Wn,i(X)|(m(Xi) − m(X))2

+ 3E

n∑

i=1

|Wn,i(X)|(m(X) −m(X))2

= 3Jn1 + 3Jn2 + 3Jn3.

Page 75: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

58 4. Partitioning Estimates

For arbitrary δ > 0,

Jn2 = E

n∑

i=1

|Wn,i(X)| · (m(Xi) − m(X))2I‖Xi−X‖>δ

+E

n∑

i=1

|Wn,i(X)| · (m(Xi) − m(X))2I‖Xi−X‖≤δ

≤ E

n∑

i=1

|Wn,i(X)| · (2m(Xi)2 + 2m(X)2)I‖Xi−X‖>δ

+E

n∑

i=1

|Wn,i(X)| · (m(Xi) − m(X))2I‖Xi−X‖≤δ

≤ 4 · supu∈Rd

|m(u)|2 · E

n∑

i=1

|Wn,i(X)| · I‖Xi−X‖>δ

+D ·(

supu,v∈Rd : ‖u−v‖≤δ

|m(u) − m(v)|)2

.

By (iii),

lim supn→∞

Jn2 ≤ D ·(

supu,v∈Rd : ‖u−v‖≤δ

|m(u) − m(v)|)2

.

Using m uniformly continuous we get, with δ → 0,

Jn2 → 0.

By (ii),

Jn3 ≤ DE(m(X) −m(X))2 < Dε,

moreover, by (i),

lim supn→∞

Jn1 ≤ cE(m(X) −m(X))2 ≤ cε,

so

lim supn→∞

J ′n ≤ 3cε+ 3Dε.

Put

σ2(x) = E(Y −m(X))2|X = x,then EY 2 < ∞ implies that Eσ2(X) < ∞, and

In = E

(n∑

i=1

Wn,i(X)(Yi −m(Xi))

)2

Page 76: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

4.2. Stone’s Theorem 59

=n∑

i=1

n∑

j=1

EWn,i(X)Wn,j(X)(Yi −m(Xi))(Yj −m(Xj)).

For i = j,

E Wn,i(X)Wn,j(X)(Yi −m(Xi))(Yj −m(Xj))= E E Wn,i(X)Wn,j(X)(Yi −m(Xi))(Yj −m(Xj))|X1, . . . , Xn, Yi= E Wn,i(X)Wn,j(X)(Yi −m(Xi))E (Yj −m(Xj))|X1, . . . , Xn, Yi= E Wn,i(X)Wn,j(X)(Yi −m(Xi))(m(Xj) −m(Xj))= 0,

hence,

In = E

n∑

i=1

Wn,i(X)2(Yi −m(Xi))2

= E

n∑

i=1

Wn,i(X)2σ2(Xi)

.

If σ2(x) is bounded then (v) implies that In → 0. For general σ2(x) andε > 0, Theorem A.1 implies that there exists bounded σ2(x) ≤ L such that

E|σ2(X) − σ2(X)| < ε.

Then, by (ii),

In ≤ E

n∑

i=1

Wn,i(X)2σ2(Xi)

+ E

n∑

i=1

Wn,i(X)2|σ2(Xi) − σ2(Xi)|

≤ LE

n∑

i=1

Wn,i(X)2

+DE

n∑

i=1

|Wn,i(X)||σ2(Xi) − σ2(Xi)|

,

therefore, by (i) and (v),

lim supn→∞

In ≤ cDE|σ2(X) − σ2(X)| < cDε.

Concerning the third term

Ln = E

((n∑

i=1

Wn,i(X) − 1

)

m(X)

)2

→ 0

by conditions (ii), (iv), and by the dominated convergence theorem.

From the proof it is clear that under conditions (ii), (iii), (iv), and (v)alone weak consistency holds if the regression function is uniformly contin-uous and the conditional variance function σ2(x) is bounded. Condition (i)makes the extension possible. For nonnegative weights conditions (i), (iii),and (v) are necessary (see Problems 4.1 – 4.4).

Page 77: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

60 4. Partitioning Estimates

Definition 4.1. The weights Wn,i are called normal if∑n

i=1Wn,i(x) =1. The weights Wn,i are called subprobability weights if they are nonneg-ative and sum up to ≤ 1. They are called probability weights if they arenonnegative and sum up to 1.

Obviously for subprobability weights condition (ii) is satisfied, and forprobability weights conditions (ii) and (iv) are satisfied.

4.3 Consistency

The purpose of this section is to prove the weak universal consistency ofthe partitioning estimates. This is the first such result that we mention.Later we will prove the same property for other estimates, too. The nexttheorem provides sufficient conditions for the weak universal consistency ofthe partitioning estimate. The first condition ensures that the cells of theunderlying partition shrink to zero inside a bounded set, so the estimateis local in this sense. The second condition means that the number of cellsinside a bounded set is small with respect to n, which implies that withlarge probability each cell contains many data points.

Theorem 4.2. If for each sphere S centered at the origin

limn→∞

maxj:An,j∩S =∅

diam(An,j) = 0 (4.1)

and

limn→∞

|j : An,j ∩ S = ∅|n

= 0 (4.2)

then the partitioning regression function estimate is weakly universallyconsistent.

For cubic partitions,

limn→∞

hn = 0 and limn→∞

nhdn = ∞

imply (4.1) and (4.2).In order to prove Theorem 4.2 we will verify the conditions of Stone’s

theorem. For this we need the following technical lemma. An integer-valued random variable B(n, p) is said to be binomially distributed withparameters n and 0 ≤ p ≤ 1 if

PB(n, p) = k =(n

k

)pk(1 − p)n−k, k = 0, 1, . . . , n.

Page 78: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

4.3. Consistency 61

Lemma 4.1. Let the random variable B(n, p) be binomially distributedwith parameters n and p. Then:

(i)

E

11 +B(n, p)

≤ 1

(n+ 1)p,

(ii)

E

1B(n, p)

IB(n,p)>0

≤ 2

(n+ 1)p.

Proof. Part (i) follows from the following simple calculation:

E

11 +B(n, p)

=

n∑

k=0

1k + 1

(n

k

)pk(1 − p)n−k

=1

(n+ 1)p

n∑

k=0

(n+ 1k + 1

)pk+1(1 − p)n−k

≤ 1(n+ 1)p

n+1∑

k=0

(n+ 1k

)pk(1 − p)n−k+1

=1

(n+ 1)p(p+ (1 − p))n+1

=1

(n+ 1)p.

For (ii) we have

E

1B(n, p)

IB(n,p)>0

≤ E

2

1 +B(n, p)

≤ 2

(n+ 1)p

by (i).

Proof of Theorem 4.2. The proof proceeds by checking the conditionsof Stone’s theorem (Theorem 4.1). Note that if 0/0 = 0 by definition, then

Wn,i(x) = IXi∈An(x)/n∑

l=1

IXl∈An(x).

To verify (i), it suffices to show that there is a constant c > 0, such thatfor any nonnegative function f with Ef(X) < ∞,

E

n∑

i=1

f(Xi)IXi∈An(X)∑nl=1 IXl∈An(X)

≤ cEf(X).

Page 79: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

62 4. Partitioning Estimates

Observe that

E

n∑

i=1

f(Xi)IXi∈An(X)∑nl=1 IXl∈An(X)

=n∑

i=1

E

f(Xi)IXi∈An(X)

1 +∑

l =i IXl∈An(X)

= nE

f(X1)IX1∈An(X)1

1 +∑

l =1 IXl∈An(X)

= nEEf(X1)IX1∈An(X)

11 +

∑nl=2 IXl∈An(X)

∣∣∣∣X,X1

= nEf(X1)IX1∈An(X)E

1

1 +∑n

l=2 IXl∈An(X)

∣∣∣∣X,X1

= nEf(X1)IX1∈An(X)E

1

1 +∑n

l=2 IXl∈An(X)

∣∣∣∣X

by the independence of the random variables X,X1, . . . , Xn. Using Lemma4.1, the expected value above can be bounded by

nEf(X1)IX1∈An(X)

1nµ(An(X))

=∑

j

PX ∈ Anj∫

Anj

f(u)µ(du)1

µ(Anj)

=∫

Rd

f(u)µ(du) = Ef(X).

Therefore, the condition is satisfied with c = 1. The weights are sub-probability weights, so (ii) is satisfied. To see that condition (iii) is satisfiedfirst choose a ball S centered at the origin, and then by condition (4.1) alarge n such that for An,j ∩ S = ∅ we have diam(An,j) < a. Thus X ∈ Sand ‖Xi −X‖ > a imply Xi /∈ An(X), therefore

IX∈S

n∑

i=1

Wn,i(X)I‖Xi−X‖>a

= IX∈S

∑ni=1 IXi∈An(X),‖X−Xi‖>a

nµn(An(X))

= IX∈S

∑ni=1 IXi∈An(X),Xi /∈An(X),‖X−Xi‖>a

nµn(An(X))= 0.

Page 80: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

4.3. Consistency 63

Thus

lim supn

En∑

i=1

Wn,i(X)I‖Xi−X‖>a ≤ µ(Sc).

Concerning (iv) note that

P

n∑

i=1

Wn,i(X) = 1

= P µn(An(X)) = 0=

j

P X ∈ An,j , µn(An,j) = 0

=∑

j

µ(An,j)(1 − µ(An,j))n

≤∑

j:An,j∩S=∅µ(An,j) +

j:An,j∩S =∅µ(An,j)(1 − µ(An,j))n.

Elementary inequalities

x(1 − x)n ≤ xe−nx ≤ 1en

(0 ≤ x ≤ 1)

yield

P

n∑

i=1

Wn,i(X) = 1

≤ µ(Sc) +1en

|j : An,j ∩ S = ∅| .

The first term on the right-hand side can be made arbitrarily small by thechoice of S, while the second term goes to zero by (4.2). To prove thatcondition (v) holds, observe that

n∑

i=1

Wn,i(x)2 =

1∑n

l=1IXl∈An(x)

if µn(An(x)) > 0,

0 if µn(An(x)) = 0.

Then we have

E

n∑

i=1

Wn,i(X)2

≤ PX ∈ Sc +∑

j:An,j∩S =∅EIX∈An,j

1nµn(An,j)

Iµn(An,j)>0

≤ µ(Sc) +∑

j:An,j∩S =∅µ(An,j)

2nµ(An,j)

(by Lemma 4.1)

= µ(Sc) +2n

|j : An,j ∩ S = ∅| .

Page 81: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

64 4. Partitioning Estimates

A similar argument to the previous one concludes the proof.

4.4 Rate of Convergence

In this section we bound the rate of convergence of E‖mn −m‖2 for cubicpartitions and regression functions which are Lipschitz continuous.

Theorem 4.3. For a cubic partition with side length hn assume that

Var(Y |X = x) ≤ σ2, x ∈ Rd,

|m(x) −m(z)| ≤ C‖x− z‖, x, z ∈ Rd, (4.3)

and that X has a compact support S. Then

E‖mn −m‖2 ≤ cσ2 + supz∈S |m(z)|2

n · hdn

+ d · C2 · h2n,

where c depends only on d and on the diameter of S, thus for

hn = c′(σ2 + supz∈S |m(z)|2

C2

)1/(d+2)

n− 1d+2

we get

E‖mn −m‖2 ≤ c′′(σ2 + sup

z∈S|m(z)|2

)2/(d+2)

C2d/(d+2)n−2/(d+2).

Proof. Set

mn(x) = Emn(x)|X1, . . . , Xn =∑n

i=1m(Xi)IXi∈An(x)nµn(An(x))

.

Then

E(mn(x) −m(x))2|X1, . . . , Xn= E(mn(x) − mn(x))2|X1, . . . , Xn + (mn(x) −m(x))2. (4.4)

We have

E(mn(x) − mn(x))2|X1, . . . , Xn

= E

(∑ni=1(Yi −m(Xi))IXi∈An(x)

nµn(An(x))

)2 ∣∣∣X1, . . . , Xn

=∑n

i=1 Var(Yi|Xi)IXi∈An(x)(nµn(An(x)))2

≤ σ2

nµn(An(x))Inµn(An(x))>0.

Page 82: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

4.4. Rate of Convergence 65

By Jensen’s inequality

(mn(x) −m(x))2 =(∑n

i=1(m(Xi) −m(x))IXi∈An(x)nµn(An(x))

)2

Inµn(An(x))>0

+m(x)2Inµn(An(x))=0

≤∑n

i=1(m(Xi) −m(x))2IXi∈An(x)nµn(An(x))

Inµn(An(x))>0

+m(x)2Inµn(An(x))=0

≤ d · C2h2nInµn(An(x))>0 +m(x)2Inµn(An(x))=0

(by (4.3) and maxz∈An(x)

‖x− z‖ ≤ d · h2n)

≤ d · C2h2n +m(x)2Inµn(An(x))=0.

Without loss of generality assume that S is a cube and the union ofAn,1, . . . , An,ln is S. Then

ln ≤ c

hdn

for some constant c proportional to the volume of S and, by Lemma 4.1and (4.4),

E∫

(mn(x) −m(x))2µ(dx)

= E∫

(mn(x) − mn(x))2µ(dx)

+ E∫

(mn(x) −m(x))2µ(dx)

=ln∑

j=1

E

An,j

(mn(x) − mn(x))2µ(dx)

+ln∑

j=1

E

An,j

(mn(x) −m(x))2µ(dx)

≤ln∑

j=1

Eσ2µ(An,j)nµn(An,j)

Iµn(An,j)>0

+ dC2h2

n

+ln∑

j=1

E

An,j

m(x)2µ(dx)Iµn(An,j)=0

≤ln∑

j=1

2σ2µ(An,j)nµ(An,j)

+ dC2h2n +

ln∑

j=1

An,j

m(x)2µ(dx)Pµn(An,j) = 0

Page 83: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

66 4. Partitioning Estimates

≤ ln2σ2

n+ dC2h2

n + supz∈S

m(z)2

ln∑

j=1

µ(An,j)(1 − µ(An,j))n

≤ ln2σ2

n+ dC2h2

n + lnsupz∈S m(z)2

nsup

jnµ(An,j)e−nµ(An,j)

≤ ln2σ2

n+ dC2h2

n + lnsupz∈S m(z)2e−1

n

(since supz ze−z = e−1)

≤ (2σ2 + supz∈S m(z)2e−1)cnhd

n

+ dC2h2n.

According to Theorem 4.3 the cubic partition estimate has optimal ratein the class D(1,C) (cf. Theorem 3.2) only under condition that X hasdistribution with compact support.

Unfortunately, the partitioning estimate is not optimal for smootherregression functions. For example, let, for d = 1,

m(x) = x,

X be uniformly distributed on [0, 1] and let

Y = m(X) +N,

where N is standard normal and X and N are independent. Put

mn(x) =

∫An(x)m(z) dz

hn.

Assume that lnhn = 1 for some integer ln. Then∫ 1

0(mn(x) −m(x))2 dx

=∫ 1

0(mn(x) − mn(x))2 dx+

∫ 1

0(mn(x) −m(x))2 dx

≥∫ 1

0(mn(x) −m(x))2 dx

=ln∑

j=1

∫ jhn

(j−1)hn

(mn(x) −m(x))2 dx

=ln∑

j=1

∫ jhn

(j−1)hn

((j − 1/2)hn − x)2 dx

= lnh3n/12

= h2n/12.

Page 84: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

4.5. Bibliographic Notes 67

Then, according to the decomposition (4.4),

E∫

(mn(x) −m(x))2µ(dx)

= E∫

(mn(x) − mn(x))2µ(dx)

+ E∫

(mn(x) −m(x))2µ(dx)

≥ln∑

j=1

E

µ(An,j)nµn(An,j)

Iµn(An,j)>0

+∫ 1

0(mn(x) −m(x))2 dx

≥ln∑

j=1

µ(An,j)(1 − (1 − µ(An,j))n)2

nµ(An,j)+ h2

n/12

(by Problem 4.10)

=lnn

(1 − (1 − hn)n)2 + h2n/12

=1nhn

(1 + o(1)) + h2n/12,

if nhn → ∞ and hn → 0 as n → ∞. The above term is minimal forhn = c′n− 1

3 , hence

E‖mn −m‖2 ≥ c′′n− 23 ,

which is not the optimal rate, because this distribution of (X,Y ) is inD(p,C) for all p ≥ 1, with the optimal rate of n− 2p

2p+1 for d = 1 (cf. Theorem3.2).

The optimality of the partitioning estimates can be extended using a localpolynomial partitioning estimate where, within each cell, the estimate is apolynomial (cf. Section 11.2).

4.5 Bibliographic Notes

Theorem 4.1 is due to Stone (1977). The partitioning estimate, called aregressogram, was introduced by Tukey (1947; 1961) and studied by Col-lomb (1977), Bosq and Lecoutre (1987), and Lecoutre (1980). Concerningits consistency, see Devroye and Gyorfi (1983) and Gyorfi (1991). In gen-eral, the behavior of ‖mn − m‖2 cannot be characterized by the rate ofconvergence of its expectation E‖mn − m‖2. However, here the randomvariable is close to its expectation in the sense that the ratio of the two isclose to 1 with large probability. In this respect, Beirlant and Gyorfi (1998)proved that under some conditions and for cubic partitions

nhd/2n

(‖mn −m‖2 − E‖mn −m‖2) /σ0D→ N(0, 1),

Page 85: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

68 4. Partitioning Estimates

which means that

‖mn −m‖2 − E‖mn −m‖2

is of order 1nh

d/2n

. This should be compared with the rate of convergence of

E‖mn −m‖2 which is at least 1nhd

n(cf. Problem 4.9). This implies that

‖mn −m‖2

E‖mn −m‖2 ≈ 1,

thus the L2 error is relatively stable. The relative stability holds undermore general conditions (cf. Gyorfi, Schafer, and Walk (2002)).

Problems and Exercises

Problem 4.1. Assume that there is a constant c such that for every nonnegativemeasurable function f , satisfying Ef(X) < ∞,

lim supn→∞

E

n∑

i=1

|Wn,i(X)|f(Xi)

≤ cEf(X).

Prove that (i) in Theorem 4.1 is satisfied (Proposition 7 in Stone (1977)).Hint: Apply an indirect proof.

Problem 4.2. Assume that the weights Wn,i are nonnegative and the estimateis weakly universally consistent. Prove that (i) in Theorem 4.1 is satisfied (Stone(1977)).

Hint: Apply Problem 4.1 for

Y = f(X) = m(X),

and show that

limn→∞

(

E

n∑

i=1

|Wn,i(X)|f(Xi)

− Ef(X)

)2

= 0.

Problem 4.3. Assume that a local averaging estimate is weakly universallyconsistent. Prove that (v) in Theorem 4.1 is satisfied (Stone (1977)).

Hint: Consider the case m = 0 and Y ±1 valued.

Problem 4.4. Assume that the weights Wn,i are nonnegative and that thecorresponding local averaging estimate is weakly universally consistent. Provethat (iii) in Theorem 4.1 is satisfied (Stone (1977)).

Hint: For any fixed x0 and a > 0 let f be a nonnegative continuous functionwhich is 0 on Sx0,a/3 and is 1 on Sc

x0,2a/3. Choose Y = f(X) = m(X), then

IX∈Sx0,a/3

n∑

i=1

Wn,i(X)f(Xi) ≥ IX∈Sx0,a/3

n∑

i=1

Wn,i(X)I‖Xi−X‖>a → 0

Page 86: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

Problems and Exercises 69

in probability, therefore, for any compact set B,

IX∈B

n∑

i=1

Wni(X)I‖Xi−X‖>a → 0

in probability.

Problem 4.5. Noiseless observations. Call the observations noiseless if Yi =m(Xi), so that it is the problem of function interpolation for a random design.Prove that under the conditions (i) – (iv) of Theorem 4.1 the regression estimateis weakly consistent for noiseless observations.

Hint: Check the proof of Theorem 4.1.

Problem 4.6. Let a rectangle partition consist of rectangles with side lengthshn1, . . . , hnd. Prove weak universal consistency for

limn→∞

hnj = 0 (j = 1, . . . , d) and limn→∞

nhn1 . . . hnd = ∞.

Problem 4.7. Prove the extension of Theorem 4.3 for rectangle partitions:

E‖mn − m‖2 ≤ c

n∏d

j=1 hnj

+ C2d∑

j=1

h2nj .

Problem 4.8. A cell A is called empty if µn(A) = 0. Let Mn be the number ofnonempty cells for Pn. Prove that under the condition (4.2), Mn/n → 0 a.s.

Hint: For a sufficiently large sphere S consider 1n

∑n

i=1 IXi ∈S for n → ∞.

Problem 4.9. Prove that, for σ2(x) ≥ c1 > 0,

E‖mn − m‖2 ≥ c2

nhdn

with some constant c2 (cf. Beirlant and Gyorfi (1998)).Hint: Take the lower bound of the first term in the right-hand side in the

decomposition (4.4).

Problem 4.10. Let the random variable B(n, p) be binomially distributed withparameters n and p. Then

E

1

B(n, p)IB(n,p)>0

≥ 1

np(1 − (1 − p)n)2.

Hint: Apply the Jensen inequality.

Page 87: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

5Kernel Estimates

5.1 Introduction

The kernel estimate of a regression function takes the form

mn(x) =

∑ni=1 YiK

(x−Xi

hn

)

∑ni=1K

(x−Xi

hn

) ,

if the denominator is nonzero, and 0 otherwise. Here the bandwidth hn > 0depends only on the sample size n, and the function K : Rd → [0,∞) iscalled a kernel. (See Figure 5.1 for some examples.) Usually K(x) is “large”if ‖x‖ is “small,” therefore the kernel estimate again is a local averagingestimate.

Figures 5.2–5.5 show the kernel estimate for the naive kernel (K(x) =I‖x‖≤1) and for the Epanechnikov kernel (K(x) = (1 − ‖x‖2)+) usingvarious choices for hn for our simulated data introduced in Chapter 1.

K(x) = I||x||≤1

x

K(x) = (1 − x2)+

x

K(x) = e−x2

x

Figure 5.1. Examples for univariate kernels.

Page 88: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

5.2. Consistency 71

−1 −0.5 0.5 1

0.5

Figure 5.2. Kernel estimate for the naive kernel: h = 0.1, L2 error = 0.004066.

Figure 5.6 shows the L2 error as a function of h.

5.2 Consistency

In this section we use Stone’s theorem (Theorem 4.1) in order to prove theweak universal consistency of kernel estimates under general conditions onh and K.

−1 −0.5 0.5 1

0.5

Figure 5.3. Undersmoothing for the Epanechnikov kernel: h = 0.03, L2 error =0.031560.

Page 89: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

72 5. Kernel Estimates

−1 −0.5 0.5 1

0.5

Figure 5.4. Kernel estimate for the Epanechnikov kernel: h = 0.1, L2 error =0.003608.

Theorem 5.1. Assume that there are balls S0,r of radius r and balls S0,R

of radius R centered at the origin (0 < r ≤ R), and constant b > 0 suchthat

Ix∈S0,R ≥ K(x) ≥ bIx∈S0,r

(boxed kernel), and consider the kernel estimate mn. If hn → 0 and nhdn →

∞, then the kernel estimate is weakly universally consistent.

−1 −0.5 0.5 1

0.5

Figure 5.5. Oversmoothing for the Epanechnikov kernel: h = 0.5, L2 error =0.012551.

Page 90: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

5.2. Consistency 73

0.1

0.2

0.1 0.25 h

Error

Figure 5.6. The L2 error for the Epanechnikov kernel as a function of h.

As one can see in Figure 5.7, the weak consistency holds for a boundedkernel with compact support such that it is bounded away from zero at theorigin. The bandwidth must converge to zero but not too fast.

Proof. Put

Kh(x) = K(x/h).

We check the conditions of Theorem 4.1 for the weights

Wn,i(x) =Kh(x−Xi)∑n

j=1Kh(x−Xj).

Condition (i) means that

E

∑ni=1Kh(X −Xi)f(Xi)∑n

j=1Kh(X −Xj)

≤ cEf(X)

K(x)

x

1

b

−r r−R R

Figure 5.7. Boxed kernel.

Page 91: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

74 5. Kernel Estimates

x

r

z

r2

Figure 5.8. If x ∈ Sz,r/2, then Sz,r/2 ⊆ Sx,r.

with c > 0. Because of

E

∑ni=1Kh(X −Xi)f(Xi)∑n

j=1Kh(X −Xj)

= nE

Kh(X −X1)f(X1)∑n

j=1Kh(X −Xj)

= nE

Kh(X −X1)f(X1)

Kh(X −X1) +∑n

j=2Kh(X −Xj)

= n

∫f(u)

[

E

∫Kh(x− u)

Kh(x− u) +∑n

j=2Kh(x−Xj)µ(dx)

]

µ(du)

it suffices to show that, for all u and n,

E

∫Kh(x− u)

Kh(x− u) +∑n

j=2Kh(x−Xj)µ(dx)

≤ c

n.

The compact support of K can be covered by finitely many balls, withtranslates of S0,r/2, where r > 0 is the constant appearing in the conditionon the kernel K, and with centers xi, i = 1, 2, . . . ,M . Then, for all x andu,

Kh(x− u) ≤M∑

k=1

Ix∈u+hxk+S0,rh/2.

Furthermore, x ∈ u+ hxk + S0,rh/2 implies that

u+ hxk + S0,rh/2 ⊂ x+ S0,rh

Page 92: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

5.2. Consistency 75

(cf. Figure 5.8). Now, by these two inequalities,

E

∫Kh(x− u)

Kh(x− u) +∑n

j=2Kh(x−Xj)µ(dx)

≤M∑

k=1

E

u+hxk+S0,rh/2

Kh(x− u)Kh(x− u) +

∑nj=2Kh(x−Xj)

µ(dx)

≤M∑

k=1

E

u+hxk+S0,rh/2

11 +

∑nj=2Kh(x−Xj)

µ(dx)

≤ 1b

M∑

k=1

E

u+hxk+S0,rh/2

11 +

∑nj=2 IXj∈x+S0,rh

µ(dx)

≤ 1b

M∑

k=1

E

u+hxk+S0,rh/2

11 +

∑nj=2 IXj∈u+hxk+S0,rh/2

µ(dx)

=1b

M∑

k=1

E

µ(u+ hxk + S0,rh/2)

1 +∑n

j=2 IXj∈u+hxk+S0,rh/2

≤ 1b

M∑

k=1

µ(u+ hxk + S0,rh/2)nµ(u+ hxk + S0,rh/2)

(by Lemma 4.1)

≤ M

nb.

The condition (ii) holds since the weights are subprobability weights.Concerning (iii) notice that, for hnR < a,

n∑

i=1

|Wn,i(X)|I‖Xi−X‖>a =∑n

i=1Khn(X −Xi)I‖Xi−X‖>a∑ni=1Khn(X −Xi)

= 0.

In order to show (iv), mention that

1 −n∑

i=1

Wn,i(X) = I∑n

i=1Khn (X−Xi)=0,

therefore,

P

1 =n∑

i=1

Wn,i(X)

= P

n∑

i=1

Khn(X −Xi) = 0

≤ P

n∑

i=1

IXi ∈SX,rhn = 0

= P µn(SX,rhn) = 0

Page 93: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

76 5. Kernel Estimates

=∫

(1 − µ(Sx,rhn))nµ(dx).

Choose a sphere S centered at the origin, then

P

1 =n∑

i=1

Wn,i(X)

≤∫

S

e−nµ(Sx,rhn )µ(dx) + µ(Sc)

=∫

S

nµ(Sx,rhn)e−nµ(Sx,rhn ) 1nµ(Sx,rhn

)µ(dx) + µ(Sc)

= maxu

ue−u

S

1nµ(Sx,rhn

)µ(dx) + µ(Sc).

By the choice of S, the second term can be small. For the first term we canfind z1, . . . , zMn

such that the union of Sz1,rhn/2, . . . , SzMn ,rhn/2 covers S,and

Mn ≤ c

hdn

.

Then∫

S

1nµ(Sx,rhn)

µ(dx) ≤Mn∑

j=1

∫ Ix∈Szj,rhn/2

nµ(Sx,rhn)µ(dx)

≤Mn∑

j=1

∫ Ix∈Szj,rhn/2

nµ(Szj ,rhn/2)µ(dx)

≤ Mn

n

≤ c

nhdn

→ 0. (5.1)

Concerning (v), since K(x) ≤ 1 we get that, for any δ > 0,

n∑

i=1

Wn,i(X)2 =∑n

i=1Khn(X −Xi)2

(∑n

i=1Khn(X −Xi))2

≤∑n

i=1Khn(X −Xi)

(∑n

i=1Khn(X −Xi))2

≤ minδ,

1∑n

i=1Khn(X −Xi)

≤ minδ,

1∑n

i=1 bIXi∈SX,rhn

Page 94: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

5.3. Rate of Convergence 77

≤ δ +1

∑ni=1 bIXi∈SX,rhn

I∑n

i=1IXi∈SX,rhn

>0,

therefore it is enough to show that

E

1∑n

i=1 IXi∈SX,rhn I∑n

i=1IXi∈SX,rhn

>0

→ 0.

Let S be as above, then

E

1∑n

i=1 IXi∈SX,rhn I∑n

i=1IXi∈SX,rhn

>0

≤ E

1∑n

i=1 IXi∈SX,rhn I∑n

i=1IXi∈SX,rhn

>0IX∈S

+ µ(Sc)

≤ 2E

1(n+ 1)µ(SX,hn

)IX∈S

+ µ(Sc)

(by Lemma 4.1)

→ µ(Sc)

as above.

5.3 Rate of Convergence

In this section we bound the rate of convergence of E‖mn −m‖2 for a naivekernel and a Lipschitz continuous regression function.

Theorem 5.2. For a kernel estimate with a naive kernel assume that

Var(Y |X = x) ≤ σ2, x ∈ Rd,

and

|m(x) −m(z)| ≤ C‖x− z‖, x, z ∈ Rd,

and X has a compact support S∗. Then

E‖mn −m‖2 ≤ cσ2 + supz∈S∗ |m(z)|2

n · hdn

+ C2h2n,

where c depends only on the diameter of S∗ and on d, thus for

hn = c′(σ2 + supz∈S∗ |m(z)|2

C2

)1/(d+2)

n− 1d+2

we have

E‖mn −m‖2 ≤ c′′(σ2 + sup

z∈S∗|m(z)|2

)2/(d+2)

C2d/(d+2)n−2/(d+2).

Page 95: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

78 5. Kernel Estimates

Proof. We proceed similarly to Theorem 4.3. Put

mn(x) =

∑ni=1m(Xi)IXi∈Sx,hn

nµn(Sx,hn),

then we have the decomposition (4.4). If Bn(x) = nµn(Sx,hn) > 0, then

E(mn(x) − mn(x))2|X1, . . . , Xn

= E

(∑ni=1(Yi −m(Xi))IXi∈Sx,hn

nµn(Sx,hn)

)2

|X1, . . . , Xn

=

∑ni=1 Var(Yi|Xi)IXi∈Sx,hn

(nµn(Sx,hn))2

≤ σ2

nµn(Sx,hn)IBn(x).

By Jensen’s inequality and the Lipschitz property of m,

(mn(x) −m(x))2

=(∑n

i=1(m(Xi) −m(x))IXi∈Sx,hn

nµn(Sx,hn)

)2

IBn(x) +m(x)2IBn(x)c

≤∑n

i=1(m(Xi) −m(x))2IXi∈Sx,hn

nµn(Sx,hn)IBn(x) +m(x)2IBn(x)c

≤ C2h2nIBn(x) +m(x)2IBn(x)c

≤ C2h2n +m(x)2IBn(x)c .

Using this, together with Lemma 4.1,

E∫

(mn(x) −m(x))2µ(dx)

= E∫

(mn(x) − mn(x))2µ(dx)

+ E∫

(mn(x) −m(x))2µ(dx)

≤∫

S∗E

σ2

nµn(Sx,hn)Iµn(Sx,hn )>0

µ(dx) + C2h2

n

+∫

S∗Em(x)2Iµn(Sx,hn )=0

µ(dx)

≤∫

S∗

2σ2

nµ(Sx,hn)µ(dx) + C2h2

n +∫

S∗m(x)2(1 − µ(Sx,hn

))nµ(dx)

≤∫

S∗

2σ2

nµ(Sx,hn)µ(dx) + C2h2

n + supz∈S∗

m(z)2∫

S∗e−nµ(Sx,hn )µ(dx)

Page 96: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

5.3. Rate of Convergence 79

≤ 2σ2∫

S∗

1nµ(Sx,hn

)µ(dx) + C2h2

n

+ supz∈S∗

m(z)2 maxu

ue−u

S∗

1nµ(Sx,hn)

µ(dx).

Now we refer to (5.1) such that there the set S is a sphere containing S∗.Combining these inequalities the proof is complete.

According to Theorem 5.2, the kernel estimate is of optimum rate forthe class D(1,C) (cf. Definition 3.2 and Theorem 3.2). In Theorem 5.2 theonly condition on X is that it has compact support, there is no densityassumption.

In contrast to the partitioning estimate, the kernel estimate can “track”the derivative of a differentiable regression function. Using a nonnegativesymmetric kernel, in the pointwise theory the kernel regression estimate isof the optimum rate of convergence for the class D(2,C) (cf. Hardle (1990)).Unfortunately, in the L2 theory, this is not the case.

In order to show this, consider the following example:- X is uniform on [0, 1];- m(x) = x; and- Y = X +N , where N is standard normal and is independent of X.

This example belongs to D(p,C) for any p ≥ 1. In Problem 5.1 we showthat for a naive kernel and for hn → 0 and nhn → ∞,

E∫ 1

0(mn(x) −m(x))2µ(dx) ≥ 1 + o(1)

21nhn

+1 + o(1)

54h3

n, (5.2)

where the lower bound is minimized by hn = cn− 14 , and then

E∫ 1

0(mn(x) −m(x))2µ(dx) ≥ c′n− 3

4 ,

therefore the kernel estimate is not optimal for D(2,C).The main point of this example is that because of the end points of the

uniform density the squared bias is of order h3n and not h4

n. From this,one can expect that the kernel regression estimate is optimal for the classD(1.5,C).

Theorem 5.3. For a naive kernel the kernel estimate is of an optimal rateof convergence for the class D(1.5,C).

Proof. See Problem 5.2.

In the pointwise theory, the kernel estimate with a nonnegative kernelcan have an optimal rate only for the class D(p,C) with p ≤ 2 (cf. Hardle(1990)). If p > 2, then that theory suggests a higher-order kernels whichcan take negative values, too. A kernel is called a higher-order kernel, i.e.,

Page 97: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

80 5. Kernel Estimates

K is a kernel of order k, if∫K(x)xj dx = 0

for 1 ≤ j ≤ k − 1, and

0 <∫K(x)xk dx < ∞.

So the naive kernel is of order 2. An example for a forth-order kernel is

K(x) =38(3 − 5x2)I|x|≤1. (5.3)

Unfortunately, in the L2 theory, we lose the consistency if the kernelcan take on negative values. For example, let X be uniformly distributedon [0, 1], Y is ±1 valued with EY = 0, and X and Y are independent.These conditions imply that m = 0. In Problem 5.4 we show, for the kerneldefined by (5.3), that

E∫

(mn(x) −m(x))2µ(dx) = ∞. (5.4)

5.4 Local Polynomial Kernel Estimates

In the pointwise theory, the optimality of a kernel estimate can be extendedusing a local polynomial kernel estimate. Similarly to the partitioning es-timate, notice that the kernel estimate can be written as a solution of thefollowing minimization problem:

mn(x) = arg minc

n∑

i=1

(Yi − c)2Khn(x−Xi)

(cf. Problem 2.2). To generalize this, choose functions φ0, . . . , φM on Rd,and define the estimate by

mn(x) =M∑

l=0

cl(x)φl(x), (5.5)

where

(c0(x), . . . , cM (x)) = arg min(c0,...,cM )

n∑

i=1

(

Yi −M∑

l=0

clφl(Xi)

)2

Khn(x−Xi).

(5.6)The most popular example for estimates of this kind is the local polyno-

mial kernel estimate, where the φl(x)’s are monomials of the components ofx. For simplicity we consider only d = 1. Then φl(x) = xl (l = 0, 1, . . .M),and the estimate mn is defined by locally fitting (via (5.5) and (5.6)) a

Page 98: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

5.4. Local Polynomial Kernel Estimates 81

polynomial to the data. If M = 0, then mn is the standard kernel estimate.If M = 1, then mn is the so-called locally linear kernel estimate.

The locally polynomial estimate has no global consistency properties.Similarly to the piecewise linear partitioning estimate, the locally linearkernel estimate mn is not weakly universally consistent. This can be shownby the same example: let X be uniformly distributed on [0, 1], Y is ±1valued with EY = 0, and X and Y are independent. These conditionsimply that m = 0. Then for the naive kernel

E∫

(mn(x) −m(x))2µ(dx) = ∞ (5.7)

(cf. Problem 5.8).The main point in the above counterexample for consistency is that,

due to interpolation effects, the locally linear kernel estimate can take ar-bitrarily large values even for bounded data. These interpolation effectsoccur only with very small probability but, nevertheless, they force theexpectation of the L2 error to be infinity.

This problem can be avoided if one minimizes in (5.6) only over coeffi-cients which are bounded in absolute value by some constant depending onn and converging to infinity.

Theorem 5.4. Let M ∈ N0. For n ∈ N choose βn, hn > 0 such that

βn → ∞, hnβn → 0

andnhn

β2n log n

→ ∞.

Let K be the naive kernel. Define the estimate mn by

mn(x) =M∑

l=0

cl(x)xl,

where c0(x), . . . , cM (x) ∈ [−βn, βn] is chosen such that

n∑

i=1

(

Yi −M∑

l=0

cl(x)X li

)2

Khn(x−Xi)

≤ minc0(x),...,cM (x)∈[−βn,βn]

n∑

i=1

(

Yi −M∑

l=0

clXli

)2

Khn(x−Xi) +1n.

Then

E∫

(mn(x) −m(x))2µ(dx)

→ 0

for all distributions of (X,Y ) with X bounded a.s. and EY 2 < ∞.

Page 99: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

82 5. Kernel Estimates

Proof. See Kohler (2002b).

The assumption that X is bounded a.s., can be avoided if one sets theestimate to zero outside an interval which depends on the sample size nand tends to R for n tending to infinity (cf. Kohler (2002b)).

5.5 Bibliographic Notes

Kernel regression estimates were originally derived from the kernel esti-mate in density estimation studied by Parzen (1962), Rosenblatt (1956),Akaike (1954), and Cacoullos (1965). They were introduced in regressionestimation by Nadaraya (1964; 1970) and Watson (1964). Statistical anal-ysis of the kernel regression function estimate can be found in Nadaraya(1964; 1970), Rejto and Revesz (1973), Devroye and Wagner (1976; 1980a;1980b), Greblicki (1974; 1978b; 1978a), Krzyzak (1986; 1990), Krzyzakand Pawlak (1984b), Devroye (1978b), Devroye and Krzyzak (1989), andPawlak (1991), etc. Theorem 5.1 is due to Devroye and Wagner (1980a)and to Spiegelman and Sacks (1980).

Several authors studied the pointwise properties of the kernel estimates,i.e., the pointwise optimality of the locally polynomial kernel estimates un-der some regularity conditions on m and µ: Stone (1977; 1980), Katkovnik(1979; 1983; 1985), Korostelev and Tsybakov (1993), Cleveland (1979),Hardle (1990), Fan and Gijbels (1992; 1995), Fan (1993), Tsybakov (1986),and Fan, Hu, and Truong (1994). Kernel regression estimate without band-width, called Hilbert kernel estimate, was investigated by Devroye, Gyorfi,and Krzyzak (1998).

The counterexample for the consistency of the local polynomial kernelestimate in Problem 5.8 is due to Devroye (personal communication, 1998).Under regularity conditions on the distribution of X (in particular, for Xuniformly distributed on [0, 1]), Stone (1982) showed that the L2 errorof the local polynomial kernel estimate converges in probability to zerowith the rate n−2p/(2p+1) if the regression function is (p, C)-smooth. So theabove-mentioned counterexample is true only for the expected L2 error. Itis an open problem whether the result of Stone (1982) holds without anyregularity assumptions on the distribution of X besides boundedness, andwhether the expected L2 error of the estimate in Theorem 5.4 convergesto zero with the optimal rate of convergence if the regression function is(p, C)-smooth.

Problems and Exercises

Problem 5.1. Prove (5.2).Hint:

Page 100: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

Problems and Exercises 83

Step (a).

E(mn(x) − m(x))2 ≥ E(mn(x) − Emn(x)|X1, . . . Xn)2 + (Emn(x) − m(x))2.

Step (b).

E(mn(x) − Emn(x)|X1, . . . Xn)2

=1

nµ([x − hn, x + hn])Pµn([x − hn, x + hn]) > 02.

Step (c).

E∫ 1

0

(mn(x) − Emn(x)|X1, . . . Xn)2µ(dx)

≥∫ 1−hn

hn

12nhn

(1 − (1 − 2hn)n)2 dx

≥ 1 + o(1)2nhn

.

Step (d).

(Emn(x) − m(x))2

=

(nE

X1I|X1−x|≤hn

E

1

1 +∑n

i=2 I|Xi−x|≤hn

− x

)2

.

Step (e). Fix 0 ≤ x ≤ hn/2, then

EX1I|X1−x|≤hn

=

(x + hn)2

2,

Step (f).

E

1

1 +∑n

i=2 I|Xi−x|≤hn

≥ 1

1 + (n − 1)(x + hn).

Step (g). For large nhn,

nEX1I|X1−x|≤hn

E

1

1 +∑n

i=2 I|Xi−x|≤hn

≥ x + hn

3≥ x.

Step (h). For large nhn,∫ 1

0

(Emn(x) − m(x))2µ(dx) ≥∫ hn/2

0

(x + hn

3− x

)2

dx

=h3

n

54.

Problem 5.2. Prove Theorem 5.3.Hint: For the class D(1.5,C), m is differentiable and

‖m′(x) − m′(z)‖ ≤ C‖x − z‖1/2.

Show that

E‖mn − m‖2 ≤ c

nhdn

+ C2h3n,

Page 101: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

84 5. Kernel Estimates

thus, for hn = c′n− 1d+3 ,

E‖mn − m‖2 ≤ c′′n− 3d+3 .

With respect to the proof of Theorem 5.2 the difference is that we show

E

∫(mn(x) − m(x))2Iµn(Sx,hn )>0µ(dx)

≤ C′h3

n + C′′h2n

1nhd

n,

i.e.,

E

∫ (∑n

i=1(m(Xi) − m(x))IXi∈Sx,hn nµn(Sx,hn)

)2

µ(dx)

≤ C′h3n + C′′h2

n1

nhdn

.

Step (a).

E

∫ (∑n

i=1(m(Xi) − m(x))IXi∈Sx,hn nµn(Sx,hn)

)2

µ(dx)

≤ C′′h2n

1nhd

n+ 2

∫ (∫

Sx,hn(m(u) − m(x))µ(du))2

µ(Sx,hn)2µ(dx).

Step (b). By the mean value theorem, for a convex linear combination u′ of xand u,

m(u) − m(x) = (m(u′)′, u − x),

therefore,(∫

Sx,hn(m(u) − m(x))µ(du)

µ(Sx,hn)

)2

≤ 2

(

m(x)′,

∫Sx,hn

uµ(du)

µ(Sx,hn)− x

)2

+ 2C2h3n.

Step (c). Let B be the set of points of [0, 1]d which are further from the borderof [0, 1]d than hn. Then

[0,1]d

(

m(x)′,

∫Sx,hn

uµ(du)

µ(Sx,hn)− x

)2

µ(dx) ≤ 2d maxx

‖m(x)′‖2h3n.

Problem 5.3. Prove that the kernel defined by (5.3) is of order 4.

Problem 5.4. Prove (5.4).Hint: Let A be the event that X1, X2 ∈ [0, 4h], 2

√35h ≤ |X1 − X2| ≤ 2h,

X3, . . . , Xn ∈ [6h, 1], and Y1 = Y2. Then PA > 0 and

E∫

(mn(x) − m(x))2µ(dx) ≥(

E

∫|mn(x)|dx|A

PA

)2

.

For the event A and for x ∈ [0, 4h], the quantity |mn(x)| is a ratio of |K((x −X1)/h)−K((x−X2)/h)| and |K((x− X1)/h)+K((x−X2)/h)|, such that thereare two x’s for which the denominator is 0 and the numerator is positive, so theintegral of |m(x)| is ∞.

Page 102: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

Problems and Exercises 85

Problem 5.5. The kernel estimate can be generalized to product kernels. LetK∗

j (j = 1, 2, . . . , d) be kernels defined on R and hn1, . . . , hnd bandwidthsequences. Put

Kn(x − z) =d∏

j=1

K∗j

(xj − zj

hnj

).

Formulate consistency conditions on hn1, . . . , hnd.

Problem 5.6. Extend Theorem 5.2 to boxed kernels.

Problem 5.7. Extend Theorem 5.2 to product kernels:

E‖mn − m‖2 ≤ c

n∏d

j=1 hnj

+ C2d∑

j=1

h2nj .

Problem 5.8. Prove (5.7).Hint:

Step (a). Let A be the event that X1, X2 ∈ [h/2, 3h/2], X3, . . . , Xn ∈ [5h/2, 1],and Y1 = Y2. Then PA > 0 and

E∫

(mn(x)−m(x))2µ(dx) = E∫

mn(x)2dx ≥(

E

∫|mn(x)|dx|A

PA

)2

.

Step (b). Given A, on [h/2, 3h/2], the locally linear kernel estimate mn hasthe form

mn(x) =±2∆

(x − c),

where ∆ = |X1 − X2| and h/2 ≤ c ≤ 3h/2. Then

E

∫|mn(x)|dx|A

≥ E

2∆

∫ 3h/2

h/2

|x − h|dx|A

.

Step (c).

E 1

∆|A

= E 1

= ∞.

Page 103: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

6k-NN Estimates

6.1 Introduction

We fix x ∈ Rd, and reorder the data (X1, Y1), . . . , (Xn, Yn) according toincreasing values of ‖Xi − x‖. The reordered data sequence is denoted by

(X(1,n)(x), Y(1,n)(x)), . . . , (X(n,n)(x), Y(n,n)(x))

or by

(X(1,n), Y(1,n)), . . . , (X(n,n), Y(n,n))

if no confusion is possible. X(k,n)(x) is called the kth nearest neighbor(k-NN) of x.

The kn-NN regression function estimate is defined by

mn(x) =1kn

kn∑

i=1

Y(i,n)(x).

If Xi and Xj are equidistant from x, i.e., ‖Xi − x‖ = ‖Xj − x‖, then wehave a tie. There are several rules for tie breaking. For example, Xi mightbe declared “closer” if i < j, i.e., the tie breaking is done by indices. For thesake of simplicity we assume that ties occur with probability 0. In principle,this is an assumption on µ, so the statements are formally not universal,but adding a component to the observation vector X we can automaticallysatisfy this condition as follows: Let (X,Z) be a random vector, whereZ is independent of (X,Y ) and uniformly distributed on [0, 1]. We alsoartificially enlarge the data set by introducing Z1, Z2, . . . , Zn, where the

Page 104: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

6.1. Introduction 87

x

X(1,6)(x)

X(2,6)(x)X(3,6)(x)

X(4,6)(x)X(5,6)(x)

X(6,6)(x)

Figure 6.1. Illustration of nearest neighbors.

Zi’s are i.i.d. uniform [0, 1] as well. Thus, each (Xi, Zi) is distributed as(X,Z). Then ties occur with probability 0. In the sequel we shall assumethatX has such a component and, therefore, for each x the random variable‖X − x‖2 is absolutely continuous, since it is a sum of two independentrandom variables such that one of the two is absolutely continuous.

Figures 6.2 – 6.4 show kn-NN estimates for various choices of kn for oursimulated data introduced in Chapter 1. Figure 6.5 shows the L2 error asa function of kn.

−1 −0.5 0.5 1

0.5

Figure 6.2. Undersmoothing: kn = 3, L2 error =0.011703.

Page 105: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

88 6. k-NN Estimates

−1 −0.5 0.5 1

0.5

Figure 6.3. Good choice: kn = 12, L2 error =0.004247.

6.2 Consistency

In this section we use Stone’s theorem (Theorem 4.1) in order to proveweak universal consistency of the k-NN estimate. The main result is thefollowing theorem:

Theorem 6.1. If kn → ∞, kn/n → 0, then the kn-NN regression functionestimate is weakly consistent for all distributions of (X,Y ) where ties occurwith probability zero and EY 2 < ∞.

−1 −0.5 0.5 1

0.5

Figure 6.4. Oversmoothing: kn = 50, L2 error =0.009931.

Page 106: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

6.2. Consistency 89

10 50 100

0.1

0.2

k

Error

Figure 6.5. L2 error of the k-NN estimate as a function of k.

According to Theorem 6.1 the number of nearest neighbors (kn), overwhich one averages in order to estimate the regression function, shouldon the one hand converge to infinity but should, on the other hand, besmall with respect to the sample size n. To verify the conditions of Stone’stheorem we need several lemmas.

We will use Lemma 6.1 to verify condition (iii) of Stone’s theorem. Denotethe probability measure for X by µ, and let Sx,ε be the closed ball centeredat x of radius ε > 0. The collection of all x with µ(Sx,ε) > 0 for all ε > 0is called the support of X or µ. This set plays a key role because of thefollowing property:

Lemma 6.1. If x ∈ support(µ) and limn→∞ kn/n = 0, then

‖X(kn,n)(x) − x‖ → 0

with probability one.

Proof. Take ε > 0. By definition, x ∈ support(µ) implies that µ(Sx,ε) > 0.Observe that

‖X(kn,n)(x) − x‖ > ε =

1n

n∑

i=1

IXi∈Sx,ε <kn

n

.

By the strong law of large numbers,

1n

n∑

i=1

IXi∈Sx,ε → µ(Sx,ε) > 0

with probability one, while, by assumption,

kn

n→ 0.

Therefore, ‖X(kn,n)(x) − x‖ → 0 with probability one.

Page 107: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

90 6. k-NN Estimates

The next two lemmas will enable us to establish condition (i) of Stone’stheorem.

Lemma 6.2. Let

Ba(x′) =x : µ(Sx,‖x−x′‖) ≤ a

.

Then, for all x′ ∈ Rd,

µ(Ba(x′)) ≤ γda,

where γd depends on the dimension d only.

Proof. Let Cj ⊂ Rd be a cone of angle π/3 and centered at 0. It is aproperty of cones that if u, u′ ∈ Cj and ‖u‖ < ‖u′‖, then ‖u − u′‖ < ‖u′‖(cf. Figure 6.6). Let C1, . . . , Cγd

be a collection of such cones with differentcentral directions such that their union covers Rd:

γd⋃

j=1

Cj = Rd.

Then

µ(Ba(x′)) ≤γd∑

i=1

µ(x′ + Ci ∩Ba(x′)).

Let x∗ ∈ x′ + Ci ∩ Ba(x′). Then, by the property of cones mentionedabove, we have

µ(x′ + Ci ∩ Sx′,‖x′−x∗‖ ∩Ba(x′)) ≤ µ(Sx∗,‖x′−x∗‖) ≤ a,

where we use the fact that x∗ ∈ Ba(x′). Since x∗ is arbitrary,

µ(x′ + Ci ∩Ba(x′)) ≤ a,

O

u

u′‖u− u′‖

‖u′‖

‖u‖

Figure 6.6. The cone property.

Page 108: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

6.2. Consistency 91

which completes the proof of the lemma.

An immediate consequence of the lemma is that the number of pointsamong X1, . . . , Xn, such that X is one of their k nearest neighbors, is notmore than a constant times k.

Corollary 6.1. Assume that ties occur with probability zero. Then

n∑

i=1

IX is among the k NNs of Xi in X1,...,Xi−1,X,Xi+1,...,Xn ≤ kγd

a.s.

Proof. Apply Lemma 6.2 with a = k/n and let µ be the empiricalmeasure µn of X1, . . . , Xn, i.e., for each Borel set A ⊆ Rd, µn(A) =(1/n)

∑ni=1 IXi∈A. Then

Bk/n(X) =x : µn(Sx,‖x−X‖) ≤ k/n

and

Xi ∈ Bk/n(X)

⇔ µn(SXi,‖Xi−X‖) ≤ k/n

⇔ X is among the k NNs of Xi in X1, . . . , Xi−1, X,Xi+1, . . . , Xna.s., where for the second ⇔ we applied the condition that ties occur withprobability zero. This, together with Lemma 6.2, yields

n∑

i=1

IX is among the k NNs of Xi in X1,...,Xi−1,X,Xi+1,...,Xn

=n∑

i=1

IXi∈Bk/n(X)

= n · µn(Bk/n(X))

≤ kγd

a.s.

Lemma 6.3. Assume that ties occur with probability zero. Then for anyintegrable function f , any n, and any k ≤ n,

k∑

i=1

E|f(X(i,n)(X))| ≤ kγdE|f(X)|,

where γd depends upon the dimension only.

Page 109: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

92 6. k-NN Estimates

Proof. If f is a nonnegative function,k∑

i=1

Ef(X(i,n)(X))

= E

n∑

i=1

IXi is among the k NNs of X in X1,...,Xnf(Xi)

= E

f(X)n∑

i=1

IX is among the k NNs of Xi in X1,...,Xi−1,X,Xi+1,...,Xn

(by exchanging X and Xi)

≤ Ef(X)kγd,by Corollary 6.1. This concludes the proof of the lemma.

Proof of Theorem 6.1. We proceed by checking the conditions of Stone’sweak convergence theorem (Theorem 4.1) under the condition that tiesoccur with probability zero. The weight Wn,i(X) in Theorem 4.1 equals1/kn if Xi is among the kn nearest neighbors of X, and equals 0 otherwise,thus the weights are probability weights, and (ii) and (iv) are automaticallysatisfied. Condition (v) is obvious since kn → ∞. For condition (iii) observethat, for each ε > 0,

E

n∑

i=1

Wn,i(X)I‖Xi−X‖>ε

=∫

E

n∑

i=1

Wn,i(x)I‖Xi−x‖>ε

µ(dx)

=∫

E

1kn

kn∑

i=1

I‖X(i,n)(x)−x‖>ε

µ(dx) → 0

holds whenever∫

P‖X(kn,n)(x) − x‖ > ε

µ(dx) → 0, (6.1)

whereX(kn,n)(x) denotes the knth nearest neighbor of x amongX1, . . . , Xn.For x ∈ support(µ), kn/n → 0, together with Lemma 6.1, implies

P‖X(kn,n)(x) − x‖ > ε

→ 0 (n → ∞).

This together with the dominated convergence theorem implies (6.1). Fi-nally, we consider condition (i). It suffices to show that for any nonnegativemeasurable function f with Ef(X) < ∞, and any n,

E

n∑

i=1

1knIXi is among the kn NNs of Xf(Xi)

≤ c · E f(X)

Page 110: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

6.3. Rate of Convergence 93

for some constant c. But we have shown in Lemma 6.3 that this inequalityalways holds with c = γd. Thus, condition (i) is verified.

6.3 Rate of Convergence

In this section we bound the rate of convergence of E‖mn − m‖2 for akn-nearest neighbor estimate.

Theorem 6.2. Assume that X is bounded,

σ2(x) = Var(Y |X = x) ≤ σ2 (x ∈ Rd)

and

|m(x) −m(z)| ≤ C‖x− z‖ (x, z ∈ Rd).

Assume that d ≥ 3. Let mn be the kn-NN estimate. Then

E‖mn −m‖2 ≤ σ2

kn+ c1 · C2

(kn

n

)2/d

,

thus for kn = c′(σ2/C2

)d/(2+d)n

2d+2 ,

E‖mn −m‖2 ≤ c′′σ4

d+2C2d

2+dn− 2d+2 .

For the proof of Theorem 6.2 we need the rate of convergence of nearestneighbor distances.

Lemma 6.4. Assume that X is bounded. If d ≥ 3, then

E‖X(1,n)(X) −X‖2 ≤ c

n2/d.

Proof. For fixed ε > 0,

P‖X(1,n)(X) −X‖ > ε = E(1 − µ(SX,ε))n.Let A1, . . . , AN(ε) be a cubic partition of the bounded support of µ suchthat the Aj ’s have diameter ε and

N(ε) ≤ c

εd.

If x ∈ Aj , then Aj ⊂ Sx,ε, therefore

E(1 − µ(SX,ε))n =N(ε)∑

j=1

Aj

(1 − µ(Sx,ε))nµ(dx)

≤N(ε)∑

j=1

Aj

(1 − µ(Aj))nµ(dx)

Page 111: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

94 6. k-NN Estimates

=N(ε)∑

j=1

µ(Aj)(1 − µ(Aj))n.

Obviously,

N(ε)∑

j=1

µ(Aj)(1 − µ(Aj))n ≤N(ε)∑

j=1

maxz

z(1 − z)n

≤N(ε)∑

j=1

maxz

ze−nz

=e−1N(ε)

n.

If L stands for the diameter of the support of µ, then

E‖X(1,n)(X) −X‖2 =∫ ∞

0P‖X(1,n)(X) −X‖2 > ε dε

=∫ L2

0P‖X(1,n)(X) −X‖ > √

ε dε

≤∫ L2

0min

1,e−1N(

√ε)

n

≤∫ L2

0min

1,

c

enε−d/2

=∫ (c/(en))2/d

01 dε+

c

en

∫ L2

(c/(en))2/d

ε−d/2dε

≤ c

n2/d

for d ≥ 3.

Proof of Theorem 6.2. We have the decomposition

E(mn(x) −m(x))2 = E(mn(x) − Emn(x)|X1, . . . , Xn)2+E(Emn(x)|X1, . . . , Xn −m(x))2

= I1(x) + I2(x).

The first term is easier:

I1(x) = E

(1kn

kn∑

i=1

(Y(i,n)(x) −m(X(i,n)(x))

))2

= E

1k2

n

kn∑

i=1

σ2(X(i,n)(x))

Page 112: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

6.3. Rate of Convergence 95

≤ σ2

kn.

For the second term

I2(x) = E

(1kn

kn∑

i=1

(m(X(i,n)(x)) −m(x))

)2

≤ E

(1kn

kn∑

i=1

|m(X(i,n)(x)) −m(x)|)2

≤ E

(1kn

kn∑

i=1

C‖X(i,n)(x) − x‖)2

.

Put N = kn nkn

. Split the data X1, . . . , Xn into kn + 1 segments suchthat the first kn segments have length n

kn, and let Xx

j be the first nearestneighbor of x from the jth segment. Then Xx

1 , . . . , Xxkn

are kn differentelements of X1, . . . , Xn, which implies

kn∑

i=1

‖X(i,n)(x) − x‖ ≤kn∑

j=1

‖Xxj − x‖,

therefore, by Jensen’s inequality,

I2(x) ≤ C2E

1kn

kn∑

j=1

‖Xxj − x‖

2

≤ C2 1kn

kn∑

j=1

E

‖Xxj − x‖2

= C2E

‖Xx1 − x‖2

= C2E

‖X(1, nkn

)(x) − x‖2.

Thus, by Lemma 6.4,

1C2

⌊ nkn

⌋2/d∫I2(x)µ(dx) ≤

⌊ nkn

⌋2/d

E

‖X(1, nkn

)(X) −X‖2

≤ const.

For d ≤ 2 the rate of convergence of Theorem 6.2 holds under additionalconditions on µ (cf. Problem 6.7).

According to Theorem 6.2, the nearest neighbor estimate is of optimumrate for the class D(1,C) (cf. Definition 3.2 and Theorem 3.2). In Theorem

Page 113: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

96 6. k-NN Estimates

6.2 the only condition on X for d ≥ 3 is that it has compact support, thereis no density assumption.

Similarly to the partitioning estimate, the nearest neighbor estimate can-not “track” the derivative of a differentiable regression function. In thepointwise theory the nearest neighbor regression estimate has the optimumrate of convergence for the class D(2,C) (cf. Hardle (1990)). Unfortunately,in the L2 theory, this is not the case.

In order to show this consider the following example:- X is uniform on [0, 1];- m(x) = x; and- Y = X +N , where N is standard normal and is independent of X.

This example belongs to D(p,C) for any p ≥ 1. In Problem 6.2 we will seethat, for kn/n → 0 and kn → ∞,

E∫ 1

0(mn(x) −m(x))2µ(dx) ≥ 1

kn+

124

(kn

n+ 1

)3

, (6.2)

where the lower bound is minimized by kn = cn3/4, and thus

E∫ 1

0(mn(x) −m(x))2µ(dx) ≥ c′n− 3

4 ,

therefore the nearest neighbor estimate is not optimal for D(2,C).The main point of this example is that because of the end points of the

uniform density the squared bias is of order(

kn

n

)3and not

(kn

n

)4. From

this one may conjecture that the nearest neighbor regression estimate isoptimal for the class D(1.5,C).

6.4 Bibliographic Notes

The consistency of the kn-nearest neighbor classification, and the cor-responding regression and density estimation has been studied by manyresearchers. See Beck (1979), Bhattacharya and Mack (1987), Bickel andBreiman (1983), Cheng (1995), Collomb (1979; 1980; 1981), Cover (1968a),Cover and Hart (1967), Devroye (1978a; 1981; 1982b), Devroye and Gyorfi(1985), Devroye et al. (1994), Fix and Hodges (1951; 1952), Guerre (2000)Gyorfi and Gyorfi (1975), Mack (1981), Stone (1977), Stute (1984), andZhao (1987). Theorem 6.1 is due to Stone (1977). Various versions ofLemma 6.2 appeared in Fritz (1974), Stone (1977), Devroye and Gyorfi(1985). Lemma 6.4 is a special case of the result of Kulkarni and Posner(1995).

Page 114: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

Problems and Exercises 97

Problems and Exercises

Problem 6.1. Prove that for d ≤ 2 Lemma 6.4 is not distribution-free, i.e.,construct a distribution of X for which Lemma 6.4 does not hold.

Hint: Put d = 1 and assume a density f(x) = 3x2, then F (x) = x3 and

E‖X(1,n)(X) − X‖2 ≥∫ 1/4

0

∫ √ε

0

(1 − [F (x +√

ε) − F (x − √ε)])nf(x) dx dε

≥ C

n5/3 .

Problem 6.2. Prove (6.2).Hint:

Step (a).

E(mn(x) − m(x))2

≥ E(mn(x) − Emn(x)|X1, . . . Xn)2 + (Emn(x) − m(x))2.

Step (b).

E(mn(x) − Emn(x)|X1, . . . Xn)2 =1kn

.

Step (c). Observe that the function

Emn(x)|X1, . . . , Xn =1kn

kn∑

i=1

X(i,n)(x)

is a monotone increasing function of x, therefore

Emn(x)|X1, . . . , Xn ≥ 1kn

kn∑

i=1

X(i,n)(0).

Let X∗1 , . . . , X∗

n be the ordered sample of X1, . . . , Xn, then X(i,n)(0) = X∗i , and

so

Emn(x) ≥ E

1kn

kn∑

i=1

X∗i

= αkn .

Thus∫ 1

0

(Emn(x) − m(x))2µ(dx) ≥ α3kn

3.

Step (d).

αkn =12

kn

n + 1.

Problem 6.3. Prove that for fixed k the k-NN regression estimate is weaklyconsistent for noiseless observations.

Hint: See Problem 4.5.

Page 115: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

98 6. k-NN Estimates

Problem 6.4. Let mn(x) be the k-NN regression estimate. Prove that, for fixedk,

limn→∞

E∫

(mn(x) − m(x))2µ(dx) =E(Y − m(X))2

k

for all distributions of (X, Y ) with EY 2 < ∞.Hint: Use the decomposition

mn(x) =1k

k∑

i=1

m(X(i,n)(x)) +1k

k∑

i=1

(Y(i,n)(x) − m(X(i,n)(x))).

Handle the first term by Problem 6.3. Show that

E∫ (

1k

k∑

i=1

(Y(i,n)(x) − m(X(i,n)(x)))

)2

µ(dx) =1k2

k∑

i=1

Eσ2(X(i,n)(X))

→ E(Y − m(X))2

k.

Problem 6.5. Let gn be the k-NN classification rule for M classes:

gn(x) = arg max1≤j≤M

k∑

i=1

IY(i,n)(x)=j.

Show that, for kn → ∞ and kn/n → 0,

limn→∞

Pgn(X) = Y = Pg∗(X) = Y

for all distributions of (X, Y ), where g∗ is the Bayes decision rule (Devroye,Gyorfi, and Lugosi (1996)).

Hint: Apply Problem 1.5 and Theorem 6.1.

Problem 6.6. Let gn be the 1-NN classification rule. Prove that

limn→∞

Pgn(X) = Y = 1 −M∑

j=1

Em(j)(X)2

for all distributions of (X, Y ), where m(j)(X) = PY = j|X (Cover and Hart(1967), Stone (1977)).

Hint:

Step (a). Show that

Pgn(X) = Y = 1 −M∑

j=1

PY = j, gn(X) = j

= 1 −M∑

j=1

Em(j)(X)m(j)(X(1,n)(X)).

Step (b). Problem 6.3 implies that

limn→∞

E(m(j)(X) − m(j)(X(1,n)(X)))2 = 0.

Page 116: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

Problems and Exercises 99

Problem 6.7. For d ≤ 2 assume that there exist ε0 > 0, a nonnegative functiong such that for all x ∈ Rd, and 0 < ε ≤ ε0,

µ(Sx,ε) > g(x)εd (6.3)

and∫

1g(x)2/d

µ(dx) < ∞.

Prove the rate of convergence given in Theorem 6.2.Hint: Prove that under the conditions of the problem

E‖X(1,n)(X) − X‖2 ≤ c

n2/d.

Formula (6.3) implies that for almost all x mod µ and ε0 < ε < L,

µ(Sx,ε) ≥ µ(Sx,ε0) ≥ g(x)εd0 ≥ g(x)

(ε0L

)d

εd,

hence we can assume w.l.o.g. that (6.3) holds for all 0 < ε < L. In this case, weget, for fixed L > ε > 0,

P‖X(1,n)(X) − X‖ > ε = E(1 − µ(SX,ε))n≤ E

e−nµ(SX,ε)

≤ E

e−ng(X)εd

,

therefore,

E‖X(1,n)(X) − X‖2 =∫ L2

0

P‖X(1,n)(X) − X‖ >√

ε dε

≤∫ L2

0

E

e−ng(X)εd/2

≤∫ ∫ ∞

0

e−ng(x)εd/2dε µ(dx)

=∫

1n2/dg(x)2/d

∫ ∞

0

e−zd/2dz µ(dx)

=c

n2/d.

Page 117: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

7Splitting the Sample

In the previous chapters the parameters of the estimates with the opti-mal rate of convergence depend on the unknown distribution of (X,Y ),especially on the smoothness of the regression function. In this and inthe following chapter we present data-dependent choices of the smoothingparameters. We show that for bounded Y the estimates with parameterschosen in such an adaptive way achieve the optimal rate of convergence.

7.1 Best Random Choice of a Parameter

Let Dn = (X1, Y1), . . . , (Xn, Yn) be the sample as before. Assume a finiteset Qn of parameters such that for every parameter h ∈ Qn there is aregression function estimate m(h)

n (·) = m(h)n (·, Dn). Let h = h(Dn) ∈ Qn

be such that∫

|m(h)n (x) −m(x)|2µ(dx) = min

h∈Qn

∫|m(h)

n (x) −m(x)|2µ(dx),

where h is called the best random choice of the parameter. Obviously, h isnot an estimate, it depends on the unknown m and µ.

This best random choice can be approximated by splitting the data. LetDnl

= (X1, Y1), . . . , (Xnl, Ynl

) be the learning (training) data of sizenl and Dn\Dnl

the testing data of size nt (n = nl + nt ≥ 2). For everyparameter h ∈ Qn letm(h)

nl (·) = m(h)nl (·, Dnl

) be an estimate ofm dependingonly on the learning data Dnl

of the sample Dn. Use the testing data to

Page 118: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

7.1. Best Random Choice of a Parameter 101

choose a parameter H = H(Dn) ∈ Qn:

1nt

nl+nt∑

i=nl+1

|m(H)nl

(Xi) − Yi|2 = minh∈Qn

1nt

nl+nt∑

i=nl+1

|m(h)nl

(Xi) − Yi|2. (7.1)

Define the estimate by

mn(x) = mn(x,Dn) = m(H)nl

(x,Dnl). (7.2)

We show that H approximates the best random choice h in the sense thatE∫ |mn(x) −m(x)|2µ(dx) approximates E

∫ |m(h)nl (x) −m(x)|2µ(dx).

Theorem 7.1. Let 0 < L < ∞. Assume

|Y | ≤ L a.s. (7.3)

and

maxh∈Qn

‖m(h)nl

‖∞ ≤ L a.s. (7.4)

Then, for any δ > 0,

E∫

|mn(x) −m(x)|2µ(dx)

≤ (1 + δ)E∫

|m(h)nl

(x) −m(x)|2µ(dx) + c1 + log(|Qn|)

nt, (7.5)

where h = h(Dnl) and c = L2(16/δ + 35 + 19δ).

The only assumption on the underlying distribution in Theorem 7.1 isthe boundedness of |Y | (cf. (7.3)). It can be applied to any estimate whichis bounded in supremum norm by the same bound as the data (cf. (7.4)).We can always truncate an estimate at ±L, which implies that (7.4) holds.If (7.3) holds, then the regression function will be bounded in absoultevalue by L, too, and hence the L2 error of the truncated estimate will beless than or equal to the L2 error of the original estimate, so the truncationhas no negative consequence in view of the error of the estimate.

In the next section we will apply this theorem to partitioning, kernel, andnearest neighbor estimates. We will choose Qn and nt such that the secondterm on the right-hand side of (7.5) is less than the first term. This impliesthat the expected L2 error of the estimate is bounded by some constanttimes the expected L2 error of an estimate, which is applied to a data setof size nl (rather than n) and where the parameter is chosen in an optimalway for this data set. Observe that this is not only true asymptotically, buttrue for each finite sample size.

Proof of Theorem 7.1. An essential tool in the proof will be Bernstein’sinequality together with majorization of a variance by some constant timesthe corresponding expectation. This will yield the denominator nt in the

Page 119: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

102 7. Splitting the Sample

result instead of√nt attainable by the use of Hoeffding’s inequality (cf.

Problem 7.2).We will use the error decomposition

E∫

|mn(x) −m(x)|2µ(dx)∣∣∣Dnl

= E∫

|m(H)nl

(x) −m(x)|2µ(dx)∣∣∣Dnl

= E

|m(H)nl

(X) − Y |2∣∣∣Dnl

− E|m(X) − Y |2

=: T1,n + T2,n,

where

T1,n = E

|m(H)nl

(X) − Y |2∣∣∣Dnl

− E|m(X) − Y |2 − T2,n

and

T2,n = (1 + δ)1nt

nl+nt∑

i=nl+1

(|m(H)

nl(Xi) − Yi|2 − |m(Xi) − Yi|2

).

Because of (7.1),

T2,n ≤ (1 + δ)1nt

nl+nt∑

i=nl+1

(|m(h)

nl(Xi) − Yi|2 − |m(Xi) − Yi|2

),

hence,

ET2,n|Dnl ≤ (1 + δ)

(E

|m(h)nl

(X) − Y |2∣∣∣∣Dnl

− E|m(X) − Y |2

)

= (1 + δ)∫

|m(h)nl

(x) −m(x)|2µ(dx).

In the sequel we will show

E T1,n|Dnl ≤ c

(1 + log |Qn|)nt

, (7.6)

which, together with the inequality above, implies the assertion. Let s > 0be arbitrary. Then

PT1,n ≥ s|Dnl

= P

(1 + δ)(E|m(H)

nl(X) − Y |2|Dnl

− E|m(X) − Y |2

− 1nt

n∑

i=nl+1

|m(H)nl

(Xi) − Yi|2 − |m(Xi) − Yi|2)

Page 120: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

7.1. Best Random Choice of a Parameter 103

≥ s+ δ(E|m(H)

nl(X) − Y |2|Dnl

− E|m(X) − Y |2) ∣∣∣∣Dnl

≤ P

∃h ∈ Qn : E|m(h)nl

(X) − Y |2|Dnl − E|m(X) − Y |2

− 1nt

n∑

i=nl+1

|m(h)

nl(Xi) − Yi|2 − |m(Xi) − Yi|2

≥ 11 + δ

(s+ δE

|m(h)

nl(X) − Y |2 − |m(X) − Y |2

∣∣∣∣Dnl

) ∣∣∣∣Dnl

≤ |Qn| maxh∈Qn

P

E

|m(h)nl

(X) − Y |2|Dnl

− E|m(X) − Y |2

− 1nt

n∑

i=nl+1

|m(h)

nl(Xi) − Yi|2 − |m(Xi) − Yi|2

≥ 11 + δ

(s+ δE

|m(h)

nl(X) − Y |2 − |m(X) − Y |2

∣∣∣∣Dnl

) ∣∣∣∣Dnl

.

Fix h ∈ Qn. Set

Z = |m(h)nl

(X) − Y |2 − |m(X) − Y |2

and

Zi = |m(h)nl

(Xnl+i) − Ynl+i|2 − |m(Xnl+i) − Ynl+i|2 (i = 1, . . . , n− nl).

Using Bernstein’s inequality (see Lemma A.2) and

σ2 := VarZ|Dnl

≤ EZ2|Dnl

= E∣∣∣(m(h)

nl(X) − Y ) − (m(X) − Y )

∣∣∣2

×∣∣∣(m(h)

nl(X) − Y ) + (m(X) − Y )

∣∣∣2∣∣∣∣Dnl

≤ 16L2∫

|m(h)nl

(x) −m(x)|2µ(dx)

= 16L2EZ|Dnl (7.7)

we get

Page 121: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

104 7. Splitting the Sample

P

E|m(h)

nl(X) − Y |2|Dnl

− E|m(X) − Y |2

− 1nt

n∑

i=nl+1

|m(h)nl

(Xi) − Yi|2 − |m(Xi) − Yi|2

≥ 11 + δ

(s+ δE

|m(h)

nl(X) − Y |2 − |m(X) − Y |2

∣∣∣∣Dnl

) ∣∣∣∣Dnl

= P

EZ|Dnl − 1

nt

nt∑

i=1

Zi ≥ 11 + δ

(s+ δ · EZ|Dnl)

∣∣∣∣Dnl

≤ P

EZ|Dnl − 1

nt

nt∑

i=1

Zi ≥ 11 + δ

(s+ δ · σ2

16L2

) ∣∣∣∣Dnl

≤ exp

−nt

1(1+δ)2

(s+ δ σ2

16L2

)2

2σ2 + 23

8L2

1+δ

(s+ δ σ2

16L2

)

.

Here we don’t need the factor 2 before the exponential term because wedon’t have absolute value inside the probability (cf. proof of Lemma A.2).Next we observe

1(1+δ)2

(s+ δ σ2

16L2

)2

2σ2 + 23

8L2

1+δ

(s+ δ σ2

16L2

)

≥ s2 + 2sδ σ2

16L2

163 L

2(1 + δ)s+ σ2(2(1 + δ)2 + 1

3δ(1 + δ)) .

An easy but tedious computation (cf. Problem 7.1) shows

s2 + 2sδ σ2

16L2

163 L

2(1 + δ)s+ σ2(2(1 + δ)2 + 1

3δ(1 + δ)) ≥ s

c, (7.8)

where c = L2(16/δ + 35 + 19δ). Using this we get that

P T1,n ≥ s|Dnl ≤ |Qn| exp

(−nt

s

c

).

It follows, for arbitrary u > 0,

E T1,n|Dnl ≤ u+

∫ ∞

u

P T1,n > s|Dnl ds

≤ u+|Qn| cnt

exp(−nt u

c

).

Setting u = c log(|Qn|)nt

, this implies (7.6), which in turn implies the assertion.

Page 122: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

7.2. Partitioning, Kernel, and Nearest Neighbor Estimates 105

7.2 Partitioning, Kernel, and Nearest NeighborEstimates

In Theorems 4.3, 5.2, and 6.2 we showed that partitioning, kernel, and near-est neighbor estimates are able to achieve the minimax lower bound for theestimation of (p, C)-smooth regression functions if p = 1 and if the param-eters are chosen depending on C (the Lipschitz constant of the regressionfunction). Obviously, the value of C will be unknown in an application,therefore, one cannot use estimates where the parameters depend on Cin applications. In the sequel we show that, in the case of bounded data,one can also derive similar bounds for estimates where the parameters arechosen by splitting the sample.

We start with the kernel estimate. Let m(h)n be the kernel estimate with

naive kernel and bandwidth h. We choose the finite set Qn of bandwidthssuch that we can approach the choice of the bandwidth in Theorem 5.2 upto some factor less than some constant, e.g., up to factor 2. This can bedone, e.g., by setting

Qn =2k : k ∈ −n,−(n− 1), . . . , 0, . . . , n− 1, n .

Theorems 7.1 and 5.2 imply

Corollary 7.1. Assume that X is bounded,

|m(x) −m(z)| ≤ C · ‖x− z‖ (x, z ∈ Rd)

and |Y | ≤ L a.s. Set

nl =⌈n

2

⌉and nt = n− nl.

Let mn be the kernel estimate with naive kernel and bandwidth h ∈Qn chosen as in Theorem 7.1, where Qn is defined as above. Then(log n)(d+2)/(2d)n−1/2 ≤ C implies, for n ≥ 2,

E∫

|mn(x) −m(x)|2µ(dx) ≤ c1C2d/(d+2)n−2/(d+2)

for some constant c1 which depends only on L, d, and the diameter of thesupport of X.

Proof. Without loss of generality we can assume C ≤ n1/d (otherwise,the assertion is trivial because of boundedness of Y ). Theorems 7.1 and 5.2imply

E∫

|mn(x) −m(x)|2µ(dx)

≤ 2 minh∈Qn

E∫

|m(h)nl

(x) −m(x)|2µ(dx) + c · 1 + log(|Qn|)nt

Page 123: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

106 7. Splitting the Sample

≤ 2 minh∈Qn

(c · 2L2

nlhd+ C2h2

)+ c · 1 + log(2n+ 1)

nt

≤ 2(c · 2L2

nlhdn

+ C2h2n

)+ c · 1 + log(2n+ 1)

nt,

where hn ∈ Qn is chosen such that

C−2/(d+2)n−1/(d+2) ≤ hn ≤ 2C−2/(d+2)n−1/(d+2).

The choices of hn, nl, and nt together with C ≥ (log n)(d+2)/(2d)n−1/2

imply

E∫

|mn(x) −m(x)|2µ(dx)

≤ c · C2d/(d+2)n−2/(d+2) + 4c · 1 + log(2n+ 1)n

≤ c1 · C2d/(d+2)n−2/(d+2).

Similarly, one can show the following result concerning the partitioningestimate:

Corollary 7.2. Assume that X is bounded,

|m(x) −m(z)| ≤ C · ‖x− z‖ (x, z ∈ Rd)

and |Y | ≤ L a.s. Set

nl =⌈n

2

⌉and nt = n− nl.

Let mn be the partitioning estimate with cubic partition and grid sizeh ∈ Qn chosen as in Theorem 7.1, where Qn is defined as above. Then(log n)(d+2)/(2d)n−1/2 ≤ C implies, for n ≥ 2,

E∫

|mn(x) −m(x)|2µ(dx) ≤ c2C2d/(d+2)n−2/(d+2)

for some constant c2 which depends only on L, d, and the diameter of thesupport of X.

Proof. See Problem 7.6

Finally we consider the k-nearest neighbor estimates. Here we can setQn = 1, . . . , n, so the optimal value from Theorem 6.2 is contained inQn. Immediately from Theorems 7.1 and 6.2 we can conclude

Corollary 7.3. Assume that X is bounded,

|m(x) −m(z)| ≤ C · ‖x− z‖ (x, z ∈ Rd)

and |Y | ≤ L a.s. Set

nl =⌈n

2

⌉and nt = n− nl.

Page 124: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

7.2. Partitioning, Kernel, and Nearest Neighbor Estimates 107

Let mn be the k-nearest neighbor estimate with k ∈ Qn = 1, . . . , nl chosenas in Theorem 7.1. Then (log n)(d+2)/(2d)n−1/2 ≤ C together with d ≥ 3implies, for n ≥ 2,

E∫

|mn(x) −m(x)|2µ(dx) ≤ c3C2d/(d+2)n−2/(d+2)

for some constant c3 which depends only on L, d, and the diameter of thesupport of X.

Here we use for each component of X the same smoothing parameter.But the results can be extended to optimal scaling, where one uses foreach component a different smoothing parameter. Here Problems 4.7 and5.7 characterize the rate of convergence, and splitting of the data can beused to approximate the optimal scaling parameters, which depend on theunderlying distribution (cf. Problem 7.7).

In Corollaries 7.1–7.3 the expected L2 error of the estimates is boundedfrom above up to a constant by the corresponding minimax lower boundfor (p, C)-smooth regression functions, if p = 1. We would like to mentiontwo important aspects of these results: First, the definition of the estimatesdoes not depend on C, therefore they adapt automatically to the unknownsmoothness of the regression function measured by the Lipschitz constantC. Second, the bounds are valid for finite sample size. So we are able toapproach the minimax lower bound not only asymptotically but even forfinite sample sizes (observe that in the proof of Theorem 3.2 we have infact shown that the lower bound is valid for finite sample size).

Approaching the minimax lower bound for fixed sample size by someconstant does not imply that one can get asymptotically the minimax rateof convergence with the optimal constant in front of n−2p/(2p+d). But aswe show in the next theorem, this goal can also be reached by splitting thesample:

Theorem 7.2. Under the conditions of Theorem 7.1 assume that

log |Qn| ≤ c log n

and

E

minh∈Qn

∫|m(h)

n (x) −m(x)|2µ(dx)

≤ Copt(1 + o(1))n−γ

for some 0 < γ < 1. Choose γ < γ′ < 1 and set

nt =⌈nγ′⌉

and nl = n− nt.

Then

E∫

|mn(x) −m(x)|2µ(dx) ≤ Copt(1 + o(1))n−γ .

Page 125: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

108 7. Splitting the Sample

Proof. Theorem 7.1 implies that

E∫

|mn(x) −m(x)|2µ(dx)

≤ (1 + δ)E

minh∈Qn

∫|m(h)

nl(x) −m(x)|2µ(dx)

+ c

1 + log(|Qn|)nt

≤ (1 + δ)Copt(1 + o(1))n−γl + c

1 + c log nnt

≤ (1 + δ)Copt(1 + o(1))(1 − o(1))−γn−γ + c1 + c log n

nγ′

(since nl = n− nγ′ and n− nγ′= (1 − n−(1−γ′)) · n )

= (1 + δ)Copt(1 + o(1))n−γ .

Since δ > 0 is arbitrary we get that

E∫

|mn(x) −m(x)|2µ(dx) ≤ Copt(1 + o(1))n−γ .

7.3 Bibliographic Notes

The bound (7.7) on the variance can be improved (see Barron (1991) orProblem 7.3).

In the proof of Theorem 7.1 we used the union bound together withBernstein’s inequality to bound the deviation between an expectation and(1 + δ) times the corresponding sample mean. By using instead Jensen’sinequality, together with the bound on the exponential moment derivedin the proof of Bernstein’s inequality, one can improve the constants inTheorem 7.1 (cf. Hamers and Kohler (2001)).

In the context of pattern recognition and density estimation splitting thesample was investigated in Devroye (1988) and Devroye and Lugosi (2001),respectively.

In this chapter we tried to choose one estimate from a given finite col-lection of the estimates that is at least as good as the best of the originalones, plus a small residual. It might not always be optimal to choose oneestimate from the original set of estimates. Instead, it might be usefulto construct a new estimator as a function of original estimates, such asa convex combination (see, e.g., Niemirovsky (2000), and the referencestherein).

Page 126: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

Problems and Exercises 109

Problems and Exercises

Problem 7.1. Prove (7.8).

Hint: Set

a = s2, b = 2sδ/(16L2), c = 16L2(1 + δ)s/3, and d = 2(1 + δ)2 + δ(1 + δ)/3.

Then the left-hand side of (7.8) is equal to f(σ2) where

f(u) =a + b · u

c + d · u(u > 0).

Compute the derivative f ′ of f and show

f ′(u) = 0 for all u > 0.

Use this to determine

minu>0

f(u)

by considering

f(0) and limu→∞

f(u).

Problem 7.2. Prove a weaker version of Theorem 7.1: under the conditions ofTheorem 7.1,

E∫

|mn(x) − m(x)|2µ(dx)

≤ E∫

|m(h)nl

(x) − m(x)|2µ(dx) + 8√

2L2

√log(2|Qn|)

nt.

Hint:

∫|mn(x) − m(x)|2µ(dx) −

∫|m(h)

nl(x) − m(x)|2µ(dx)

= E

(m(H)nl

(X) − Y )2∣∣∣Dn

− E

(m(h)

nl(X) − Y )2

∣∣∣Dn

= E

(m(H)nl

(X) − Y )2∣∣∣Dn

− 1

nt

nl+nt∑

i=nl+1

(m(H)nl

(Xi) − Yi)2

+1nt

nl+nt∑

i=nl+1

(m(H)nl

(Xi) − Yi)2 − 1nt

nl+nt∑

i=nl+1

(m(h)nl

(Xi) − Yi)2

+1nt

nl+nt∑

i=nl+1

(m(h)nl

(Xi) − Yi)2 − E(m(h)nl

(X) − Y )2∣∣∣Dn

≤ E

(m(H)nl

(X) − Y )2∣∣∣Dn

− 1

nt

nl+nt∑

i=nl+1

(m(H)nl

(Xi) − Yi)2

Page 127: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

110 7. Splitting the Sample

+1nt

nl+nt∑

i=nl+1

(m(h)nl

(Xi) − Yi)2 − E

(m(h)nl

(X) − Y )2∣∣∣Dn

≤ 2 maxh∈Qn

∣∣∣∣∣

1nt

nl+nt∑

i=nl+1

(m(h)nl

(Xi) − Yi)2 − E

(m(h)nl

(X) − Y )2∣∣∣Dn

∣∣∣∣∣

= 2 maxh∈Qn

∣∣∣∣∣

1nt

nl+nt∑

i=nl+1

(m(h)nl

(Xi) − Yi)2 − E

(m(h)nl

(X) − Y )2∣∣∣Dnl

∣∣∣∣∣.

Use Hoeffding’s inequality (cf. Lemma A.3) to conclude

P

∫|mn(x) − m(x)|2µ(dx) −

∫|m(h)

nl(x) − m(x)|2µ(dx) > ε|Dnl

≤ 2|Qn|e−ntε2

32L4 .

Compare also Problem 8.2.

Problem 7.3. See Barron (1991).(a) Show that for any random variable V with values in some interval of length

B one has

VarV ≤ B2

4.

(b) Show that the inequality (7.7) can be improved as follows: Assume |Y | ≤ La.s. and let f be a function f : Rd → [−L, L]. Set

Z = |f(X) − Y |2 − |m(X) − Y |2.Then

σ2 = VarZ ≤ 8L2EZ.

Hint: Use

Z = −2(Y − m(X)) · (f(X) − m(X)) + (f(X) − m(X))2

and

E((Y − m(X)) · (f(X) − m(X)))2

= E(f(X) − m(X))2E

(Y − m(X))2|X

≤ L2E|f(X) − m(X)|2

,

where the last inequality follows from (a).

Problem 7.4. Use Problem 7.3 to improve the constant c in Theorem 7.1.

Problem 7.5. Show that if the assumptions of Theorem 7.2 are satisfied and, inaddition,

E

min

h∈Qn

∫|m(h)

n (x) − m(x)|2µ(dx)

≥ Copt(1 + o(1))n−γ ,

Page 128: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

Problems and Exercises 111

then

limn→∞

E∫ |mn(x) − m(x)|2µ(dx)

E

minh∈Qn

∫ |m(h)nl (x) − m(x)|2µ(dx)

= 1.

Problem 7.6. Prove Corollary 7.2.Hint: Proceed as in the proof of Corollary 7.1, but use the bounds from

Theorem 4.3 instead of those from Theorem 5.2.

Problem 7.7. (a) Use splitting the data to choose the side lengths of rectangularpartitioning such that the resulting estimate approaches the rate of convergencein Problem 4.7.

(b) Use splitting the data to choose the scaling for product kernel estimatessuch that the resulting estimate approaches the rate of convergence in Problem5.7.

(c) Our results in Chapter 6 concerning nearest neighbor estimates used theEuclidean distance. Obviously, all the results of this chapter hold with scaling,i.e., for norms defined by

‖x‖2 =d∑

j=1

cj |x(j)|2,

where c1, . . . , cd > 0 are the scaling factors. Use splitting the data to choose thescaling factors together with k for k-NN estimates based on such norms.

Page 129: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

8Cross-Validation

8.1 Best Deterministic Choice of the Parameter

Let Dn = (X1, Y1), . . . , (Xn, Yn) be the sample as before. Assume a finiteset Qn of parameters such that for every parameter h ∈ Qn there is aregression function estimate m(h)

n (·) = m(h)n (·, Dn). Let hn ∈ Qn be such

that

E∫

|m(hn)n (x) −m(x)|2µ(dx)

= min

h∈Qn

E∫

|m(h)n (x) −m(x)|2µ(dx)

,

where hn is called the best deterministic choice of the parameter. Obviously,hn is not an estimate, it depends on the unknown distribution of (X,Y ),in particular on m and µ.

This best deterministic choice can be approximated by cross-validation.For every parameter h ∈ Qn let m(h)

n and m(h)n,i be the regression estimates

from Dn and Dn\(Xi, Yi), respectively, where

Dn\(Xi, Yi) = (X1, Y1), . . . , (Xi−1, Yi−1), (Xi+1, Yi+1), . . . , (Xn, Yn).The cross-validation selection of h is

H = Hn = arg minh∈Qn

1n

n∑

i=1

(m(h)n,i (Xi) − Yi)2.

Define the cross-validation regression estimate by

mn(x) = m(H)n (x). (8.1)

Page 130: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

8.2. Partitioning and Kernel Estimates 113

Throughout this chapter we use the notation

∆(h)n = E

∫|m(h)

n (x) −m(x)|2µ(dx).

In the sequel we show that Hn approximates the best deterministic choiceh = hn−1 for sample size n− 1 in the sense that E

∆(Hn)

n−1

approximates

∆(hn−1)n−1 with an asymptotically small correction term.

8.2 Partitioning and Kernel Estimates

Theorem 8.1 yields relations between E

∆(Hn)n−1

and ∆(hn−1)

n−1 .

Theorem 8.1. Let |Y | ≤ L < ∞. Choose m(h)n of the form

m(h)n (x) =

∑nj=1 YjKh(x,Xj)

∑nj=1Kh(x,Xj)

where the binary valued function Kh : Rd ×Rd → 0, 1 with Kh(x, x) = 1fulfills the covering assumption (C) that a constant ρ > 0 depending onlyon Kh;h ∈ ∪nQn exists with

∫Kh(x, z)∫Kh(x, t)µ(dt)

µ(dx) ≤ ρ

for all z ∈ Rd, all h ∈ ∪nQn, and all probability measures µ.(a)

E

∆(Hn)n−1

≤ ∆(hn−1)

n−1 + c

√log(|Qn|)

n

for some constant c depending only on L and ρ.(b) For any δ > 0

E

(Hn)n−1

≤ (1 + δ) (hn−1)

n−1 +c|Qn|n

log n,

where c depends only on δ, L, and ρ.

The proof, which is difficult and long, will be given in Section 8.3. Werecommend skipping it during the first reading.

The covering assumption (C) in Theorem 8.1 is fulfilled for kernel es-timates using naive kernel and partitioning estimates (see below). Beforewe consider the application of Theorem 8.1 to these estimates in detail, wegive some comments concerning convergence order.

Neglecting log n, the correction terms in parts (a) and (b) are both ofthe order n−1/2 if |Qn| = O(n1/2). One is interested that the correction

Page 131: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

114 8. Cross-Validation

term is less than (hn−1)n−1 . For Lipschitz-continuous m one has

(hn−1)n−1 = O

(n−2/(d+2)

)

in naive kernel estimation and cubic partitioning estimation according toTheorems 5.2 and 4.3, respectively, which is optimal according to Theorem3.2. In this case, for d ≥ 3 and log(|Qn|) = O(log n), i.e., |Qn| ≤ ns forsome s > 0, or for log(|Qn|) = O(nt) for some 0 < t < (d− 2)/(d+2), part(a) yields the desired result, and for d ≥ 1 and log(|Qn|) ≤ c∗ log n withc∗ < d/(d+ 2), part (b) yields the desired result. The latter also holds if

(hn−1)n−1 = O(n−γ)

with γ < 1 near to 1, if c∗ is chosen sufficiently small.Now we give more detailed applications of Theorem 8.1 to kernel and

partitioning estimates. Let Ph be a partition of Rd, and denote by m(h)n

the partitioning estimate for this partition and sample size n. Because ofthe proof of Theorem 4.2 the covering assumption (C) is satisfied withρ = 1:

∫Iz∈An(x)µ(An(x))

µ(dx) =∫Iz∈An(x)µ(An(z))

µ(dx)

=µ(An(z))µ(An(z))

≤ 1.

Or, let

m(h)n (x) =

∑nj=1 YjK

(x−Xj

h

)

∑nj=1K

(x−Xj

h

) .

be the kernel estimate with bandwidth h and naive kernel K. Then accord-ing to Lemma 23.6 the covering assumption is satisfied with a ρ dependingon d only.

For these estimates Theorem 8.1, together with Theorems 5.2 and 4.3,implies

Corollary 8.1. Assume that X is bounded,

|m(x) −m(z)| ≤ C · ‖x− z‖ (x, z ∈ Rd)

and |Y | ≤ L a.s.Let mn be the partitioning estimate with cubic partitioning and grid size

h ∈ Qn chosen as in Theorem 8.1, or let mn be the kernel estimate withnaive kernel and bandwidth h ∈ Qn chosen as in Theorem 8.1. Let d ≥ 3,

Qn =2k : k ∈ −n,−(n− 1), . . . , 0, . . . , n− 1, n

Page 132: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

8.3. Proof of Theorem 8.1 115

and

(log n)(d+2)/(4d)n−(d−2)/(4d) ≤ C,

or, let d ≥ 1,

Qn =

2−n1/4+k : k ∈ 1, 2, . . . , 2n1/4

and

(log n)(d+2)/(2d)n−(3d−2)/(8d) ≤ C.

Then, in each of the four cases,

E

∆(Hn)n−1

≤ c1C

2d/(d+2)n−2/(d+2)

for some constant c1 which depends only on L, d, and the diameter of thesupport of X.

Proof. See Problem 8.1

As in the previous chapter the results can be extended to optimal scalingand adapting to the optimal constant in front of n−2/(d+2). We leave thedetails to the reader (cf. Problems 8.4 and 8.5).

8.3 Proof of Theorem 8.1

Proof of (a). For ε > 0, we show

P∆(Hn)n−1 − ∆(hn−1)

n−1 > ε ≤ 2|Qn|e−nε2/(128L4) + 2|Qn|e−nε2/(128L4(1+4ρ)2),

from which (a) follows (cf. Problem 8.2). Observe that, for each h > 0,

∆(h)n−1 = E

1n

n∑

i=1

(m

(h)n,i (Xi) −m(Xi)

)2

= E

1n

n∑

i=1

((m(h)

n,i (Xi) − Yi)2 − (m(Xi) − Yi)2)

. (8.2)

Therefore, because of the definition of H = Hn and h = hn−1,

∆(H)n−1 − ∆(h)

n−1

= ∆(H)n−1 − 1

n

n∑

i=1

((m(H)

n,i (Xi) − Yi)2 − (m(Xi) − Yi)2)

+1n

n∑

i=1

((m(H)

n,i (Xi) − Yi)2 − (m(Xi) − Yi)2)

Page 133: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

116 8. Cross-Validation

− 1n

n∑

i=1

((m(h)

n,i (Xi) − Yi)2 − (m(Xi) − Yi)2)

+1n

n∑

i=1

((m(h)

n,i (Xi) − Yi)2 − (m(Xi) − Yi)2)

− ∆(h)n−1

≤ ∆(H)n−1 − 1

n

n∑

i=1

((m(H)

n,i (Xi) − Yi)2 − (m(Xi) − Yi)2)

+1n

n∑

i=1

((m(h)

n,i (Xi) − Yi)2 − (m(Xi) − Yi)2)

− ∆(h)n−1

≤ 2 maxh∈Qn

∣∣∣∣∣∆(h)

n−1 − 1n

n∑

i=1

((m(h)

n,i (Xi) − Yi)2 − (m(Xi) − Yi)2)∣∣∣∣∣

= 2 maxh∈Qn

∣∣∣∣∣1n

n∑

i=1

((m(h)

n,i (Xi) − Yi)2 − (m(Xi) − Yi)2

− E(m(h)n,i (Xi) − Yi)2 − (m(Xi) − Yi)2

)∣∣∣∣∣.

Consequently,

P∆(H)n−1 − ∆(h)

n−1 > ε

≤ P

maxh∈Qn

∣∣∣∣∣1n

n∑

i=1

((m(h)

n,i (Xi) − Yi)2 − (m(Xi) − Yi)2

− E(m(h)n,i (Xi) − Yi)2 − (m(Xi) − Yi)2

)∣∣∣∣∣> ε/2

≤∑

h∈Qn

P

∣∣∣∣∣1n

n∑

i=1

((m(h)

n,i (Xi) − Yi)2 − (m(Xi) − Yi)2

− E(m(h)n,i (Xi) − Yi)2 − (m(Xi) − Yi)2

)∣∣∣∣∣> ε/2

≤∑

h∈Qn

P

∣∣∣∣∣1n

n∑

i=1

((m(Xi) − Yi)2 − E(m(Xi) − Yi)2

)∣∣∣∣∣> ε/4

+∑

h∈Qn

P

∣∣∣∣∣1n

n∑

i=1

((m(h)

n,i (Xi) − Yi)2 − E(m(h)n,i (Xi) − Yi)2

)∣∣∣∣∣> ε/4

.

Page 134: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

8.3. Proof of Theorem 8.1 117

By Hoeffding’s inequality (Lemma A.3) the first sum on the right-hand sideis upper bounded by

2|Qn|e−nε2/(128L4).

For the term indexed by h of the second sum we use McDiarmid’s inequality(Theorem A.2). Fix 1 ≤ l ≤ n. Let

D′n = (X ′

1, Y′1), . . . , (X ′

l , Y′l ), . . . , (X ′

n, Y′n)

= (X1, Y1), . . . , (X ′l , Y

′l ), . . . , (Xn, Yn),

and define m′n,j as mn,j with Dn replaced by D′

n (j = 1, . . . , n). We willshow that

∣∣∣∣∣∣

n∑

j=1

(m(h)n,j(Xj) − Yj)2 −

n∑

j=1

(m(h)′n,j (X ′

j) − Y ′j )2

∣∣∣∣∣∣≤ 4L2(1 + 4ρ). (8.3)

Then McDiarmid’s inequality yields the bound for the second sum

2|Qn|e−nε2/(128L4(1+4ρ)2).

In order to prove (8.3), because of the symmetry, we can assume that l = 1.We obtain

∣∣∣∣∣

n∑

i=1

(m(h)n,i (Xi) − Yi)2 −

n∑

i=1

(m(h)′n,i (X ′

i) − Y ′i )2

∣∣∣∣∣

≤ 4L2 + 4Ln∑

i=2

|m(h)n,i (Xi) −m

(h)′n,i (Xi)|. (8.4)

In view of a bound forn∑

i=2

|m(h)n,i (Xi) −m

(h)′n,i (Xi)|

we write

m(h)n,i (Xi) =

∑j∈2,...,n\iKh(Xi, Xj)Yj +Kh(Xi, X1)Y1∑

j∈2,...,n\iKh(Xi, Xj) +Kh(Xi, X1)

and

m(h)′n,i (Xi) =

∑j∈2,...,n\iKh(Xi, Xj)Yj +Kh(Xi, X

′1)Y

′1∑

j∈2,...,n\iKh(Xi, Xj) +Kh(Xi, X ′1)

and distinguish the four cases:

(1) Kh(Xi, X1) = Kh(Xi, X′1) = 0;

(2) Kh(Xi, X1) = 1, Kh(Xi, X′1) = 0;

(3) Kh(Xi, X1) = 0, Kh(Xi, X′1) = 1;

(4) Kh(Xi, X1) = Kh(Xi, X′1) = 1.

Page 135: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

118 8. Cross-Validation

In the first case,

|m(h)n,i (Xi) −m

(h)′n,i (Xi)| = 0.

In the second case,

|m(h)n,i (Xi) −m

(h)′n,i (Xi)| =

∣∣∣∣Y1 −

∑j∈2,...,n\i Kh(Xi,Xj)Yj∑

j∈2,...,n\i Kh(Xi,Xj)

∣∣∣∣

∑j∈2,...,n\iKh(Xi, Xj) + 1

≤ 2L∑

j∈2,...,n\iKh(Xi, Xj) + 1.

The same bound can be obtained in the third case and in the fourth case,in the fourth case because of

|m(h)n,i (Xi) −m

(h)′n,i (Xi)| =

|Y1 − Y ′1 |

∑j∈2,...,n\iKh(Xi, Xj) + 1

≤ 2L∑

j∈2,...,n\iKh(Xi, Xj) + 1.

Using these bounds, which in each case may be multiplied by Kh(Xi, X1)+Kh(Xi, X

′1), we obtain

n∑

i=2

|m(h)n,i (Xi) −m

(h)′n,i (Xi)| ≤ 2L

n∑

i=2

Kh(Xi, X1) +Kh(Xi, X′1)∑n

j=2Kh(Xi, Xj)

= 2L∫

Kh(x,X1)∫Kh(x, t)µn−1(dt)

µn−1(dx)

+2L∫

Kh(x,X ′1)∫

Kh(x, t)µn−1(dt)µn−1(dx)

≤ 4Lρ,

where for the last inequality the covering assumption (C) is used for theempirical measure µn−1 for the sample X2, . . . , Xn.

Proof of (b). By the definition of H = Hn and h = hn−1,

(H)n−1 − (1 + δ) (h)

n−1

= (H)n−1 − 1 + δ

n

n∑

i=1

((m(H)

n,i (Xi) − Yi)2 − (m(Xi) − Yi)2)

+1 + δ

n

n∑

i=1

((m(H)

n,i (Xi) − Yi)2 − (m(Xi) − Yi)2)

−1 + δ

n

n∑

i=1

((m(h)

n,i (Xi) − Yi)2 − (m(Xi) − Yi)2)

Page 136: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

8.3. Proof of Theorem 8.1 119

+1 + δ

n

n∑

i=1

((m(h)

n,i (Xi) − Yi)2 − (m(Xi) − Yi)2)

− (1 + δ) (h)n−1

≤ (H)n−1 − 1 + δ

n

n∑

i=1

((m(H)

n,i (Xi) − Yi)2 − (m(Xi) − Yi)2)

+(1 + δ)

[1n

n∑

i=1

((m(h)

n,i (Xi) − Yi)2 − (m(Xi) − Yi)2)

− (h)n−1

]

.

Then (8.2) yields

E (H)n−1

≤ (1 + δ) (h)n−1

+∫ ∞

0P

(H)n−1 −1 + δ

n

n∑

i=1

((m(H)

n,i (Xi) − Yi)2

−(m(Xi) − Yi)2)

≥ s

ds.

Now by Chebyshev’s inequality and

(m(h)n,i (Xi) − Yi)2 − (m(Xi) − Yi)2

= (m(h)n,i (Xi) −m(Xi))2 + 2(m(h)

n,i (Xi) −m(Xi)) · (m(Xi) − Yi)

we get

P

(h)n−1 − 1 + δ

n

n∑

i=1

((m(h)

n,i (Xi) − Yi)2 − (m(Xi) − Yi)2)

≥ s

≤ (1 + δ)2Var

∑ni=1

((m(h)

n,i (Xi) − Yi)2 − (m(Xi) − Yi)2)

n2(s+ δ(h)n−1)2

≤ (1 + δ)2 ·

Var∑n

i=1

(m

(h)n,i (Xi) −m(Xi)

)2

n2(s+ δ(h)n−1)2

+4E

∑ni=1

(m

(h)n,i (Xi) −m(Xi)

)(m(Xi) − Yi)

2

n2(s+ δ(h)n−1)2

≤ cn(h)

n−1 +1

n2(s+ δ(h)n−1)2

log n,

Page 137: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

120 8. Cross-Validation

with some c > 0, by Lemmas 8.2 and 8.3 below. Thus

E (H)n−1

≤ (1 + δ) (h)n−1 +

h∈Qn

∫ ∞

0min

1, cn(h)

n−1 +1

n2(s+ δ(h)n−1)2

log n

ds

≤ (1 + δ) (h)n−1 +

h∈Qn

∫ ∞

0c

(h)n−1

n(s+ δ(h)n−1)2

log n ds

+∑

h∈Qn

∫ ∞

0min

1, c

log nn2s2

ds

= (1 + δ) (h)n−1 +

c

δn|Qn| log n+

2√c

n|Qn|

√log n.

This yields the assertion.

In the proof we have used Lemmas 8.2 and 8.3 below. The proof ofLemma 8.2 is based on Lemma 8.1.

Lemma 8.1. Let (X1, Y1), . . . , (Xn, Yn), (Xn, Yn) be i.i.d. Then c > 0exists with

En−1∑

i=1

|m(h)n,i (Xi) −m(Xi)|2Kh(Xi, Xn)

1 +∑

j∈1,...,n−1\iKh(Xi, Xj)

≤ c (log n)[E|m(h)

n,1(X1) −m(X1)|2 +1n

]

for each h ∈ Qn, n ∈ 2, 3, . . ..

Proof. Set mn,i = m(h)n,i and Rh(x) = t ∈ Rd;Kh(x, t) = 1. Then the

left-hand side equals

(n− 1)∫µ(Rh(x1))E|mn,1(x1) −m(x1)|2 1

1 +∑n−1

j=2 Kh(x1, Xj)µ(dx1).

For x1 ∈ Rd, we note

|mn,1(x1) −m(x1)|2

= |n∑

l=2

(Yl −m(x1))Kh(x1, Xl)1 +

∑j∈2,...,n\lKh(x1, Xj)

−m(x1)I[Kh(x1,Xl)=0(l=2,...,n)]|2

= |n∑

l=2

(Yl −m(x1))Kh(x1, Xl)1 +

∑j∈2,...,n\lKh(x1, Xj)

|2 +m(x1)2I[Kh(x1,Xl)=0(l=2,...,n)]

=n∑

l=2

|Yl −m(x1)|2Kh(x1, Xl)[1 +

∑j∈2,...,n\lKh(x1, Xj)]2

Page 138: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

8.3. Proof of Theorem 8.1 121

+∑

l,l′∈2,...,nl =l′

(Yl −m(x1))Kh(x1, Xl)(Yl′ −m(x1))Kh(x1, Xl′)[2 +

∑j∈2,...,n\l,l′Kh(x1, Xl)]2

+m(x1)2I[Kh(x1,Xl)=0(l=2,...,n)]. (8.5)

We shall show the existence of a c > 0 with

(n− 1)µ(Rh(x1))

×En∑

l=2

|Yl −m(x1)|2Kh(x1, Xl)[1 +

j∈2,...,n\lKh(x1, Xj)]2

12 +

j∈2,...,n−1\lKh(x1, Xj)

≤ cEn∑

l=2

|Yl −m(x1)|2Kh(x1, Xl)[1 +

∑j∈2,...,n\lKh(x1, Xj)]2

(8.6)

and

(n− 1)µ(Rh(x1))E∑

l =l′

(Yl −m(x1))Kh(x1, Xl)(Yl′ −m(x1))Kh(x1, Xl′)[2 +

∑j∈2,...,n\l,l′Kh(x1, Xl)]2

× 13 +

∑j∈2,...,n\l,l′Kh(x1, Xj)

≤ cE∑

l =l′

(Yl −m(x1))Kh(x1, Xl)(Yl′ −m(x1))Kh(x1, Xl′)[2 +

∑j∈2,...,n\l,l′Kh(x1, Xl)]2

(8.7)

and

(n− 1)µ(Rh(x1))m(x1)2EI[Kh(x1,Xl)=0 (l=2,...,n)]1

1 +∑n−1

j=2 Kh(x1, Xj)

≤ c (log n)[m(x1)2EI[Kh(x1,Xl)=0 (l=2,...,n)] +

1n

](8.8)

for all x1 ∈ Rd. These results together with (8.5) yield the assertion. Setp = µ(Rh(x1)) and let B(n, p) be a binomially (n, p)–distributed randomvariable. In view of (8.6) and (8.7) it remains to show

(n− 1)pE1

[1 +B(n− 2, p)]3≤ cE

1[1 +B(n− 2, p)]2

(8.9)

for n ∈ 3, 4, . . . and, because of

E(Yl −m(x1)Kh(x1, Xl)(Yl′ −m(x1))Kh(x1, Xl′)

= [E(Yl −m(x1))Kh(x1, Xl)]2 ≥ 0 (l = l′),

we would like to get

(n− 1)pE1

[2 +B(n− 3, p)]3≤ cE

1[2 +B(n− 3, p)]2

(8.10)

Page 139: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

122 8. Cross-Validation

for n ∈ 4, 5, . . ., respectively. Each of these relations is easily verified viathe equivalent relations

np

n∑

k=0

1(1 + k)3

(n

k

)pk(1 − p)n−k ≤ c′

n∑

k=0

1(1 + k)2

(n

k

)pk(1 − p)n−k

for some c′ > 0. The latter follows fromn∑

k=0

(n+ 3k + 3

)pk+3(1 − p)n+3−(k+3) ≤

n∑

k=0

(n+ 2k + 2

)pk+2(1 − p)n+2−(k+2),

i.e.,

1 − p+ (n+ 2)p ≤ (1 − p)2 + (n+ 3)p(1 − p) + (n+ 3)(n+ 2)p2/2.

The left-hand side of (8.8) equals

(n− 1)m(x1)2p (1 − p)n−1.

Distinguishing the cases p ≤ (log n)/n and p > (log n)/n, we obtain theupper bounds

m(x1)2(log n)(1 − p)n−1

and (noticing (1 − p)n−1 ≤ e−(n−1)p ≤ e · e−np and monotonicity of s · e−s

for s ≥ 1)

L2enpe−np ≤ L2elog nn

,

respectively. Thus (8.8) is obtained.

Lemma 8.2. There is a constant c∗ > 0 such that

Var

(n∑

i=1

|m(h)n,i (Xi) −m(Xi)|2

)

≤ c∗(log n)[nE|m(h)n,1(X1) −m(X1)|2 + 1]

for each h ∈ Qn, n ∈ 2, 3, . . ..

Proof. Set mn,i = m(h)n,i . For each h ∈ Qn, n ∈ 2, 3, . . ., and for i ∈

1, . . . , n,

En−1∑

i=1

|mn,i(Xi) −m(Xi)|2Kh(Xi, Xn)1 +

∑j∈1,...,n−1\iKh(Xi, Xj)

≤ 2En∑

i=1

|mn,i(Xi) −m(Xi)|2Kh(Xi, Xn)1 +

∑j∈1,...,n\iKh(Xi, Xj)

=2nE

n∑

l=1

n∑

i=1

|mn,i(Xi) −m(Xi)|2Kh(Xi, Xl)1 +

∑j∈1,...,n\iKh(Xi, Xj)

Page 140: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

8.3. Proof of Theorem 8.1 123

≤ 2nE

n∑

i=1

|mn,i(Xi) −m(Xi)|2

= 2E|mn,1(X1) −m(X1)|2. (8.11)

We use Theorem A.3 with m = d+ 1, Zi = (Xi, Yi), i = 1, . . . , n,

f(Z1, . . . , Zn) =n∑

i=1

|mn,i(Xi) −m(Xi)|2

and the notations there. Further we notice that (X1, Y1), . . . , (Xn, Yn) areindependent and identically distributed (d + 1)-dimensional random vec-tors. mn,i shall be obtained from mn,i via replacing (Xn, Yn) by its copy(Xn, Yn) there, where (X1, Y1), . . . , (Xn, Yn), (Xn, Yn) are independent (i =1, . . . , n− 1). We set

Vn =n∑

i=1

(mn,i(Xi) −m(Xi))2

−[

n−1∑

i=1

(mn,i(Xi) −m(Xi))2 + (mn,n(Xn) −m(Xn)2]

.

It suffices to show

EV 2n ≤ c∗(log n)

[E|mn,1(X1) −m(X1)|2 +

1n

].

Let

Ui = mn,i(Xi) −m(Xi)

=∑

l∈1,...,n\i

YlKh(Xi, Xl)1 +

∑j∈1,...,n\i,lKh(Xi, Xj)

−m(Xi)

and

Wi = mn,i(Xi) −m(Xi)

=∑

l∈1,...,n−1\i

YlKh(Xi, Xl)1 +

∑j∈1,...,n−1\i,lKh(Xi, Xj) +Kh(Xi, Xn)

+YnKh(Xi, Xn)

1 +∑

j∈1,...,n−1\iKh(Xi, Xj)−m(Xi)

for i = 1, . . . , n− 1. Thus

Vn =n−1∑

i=1

(U2

i −W 2i

)+ |mn,n(Xn) −m(Xn)|2 − |mn,n(Xn) −m(Xn)|2.

Page 141: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

124 8. Cross-Validation

Therefore

V 2n ≤ 3|

n−1∑

i=1

(U2i −W 2

i )|2 +3|mn,n(Xn)−m(Xn)|4 +3|mn,n(Xn)−m(Xn)|4.(8.12)

We obtain

|Ui| ≤ 2L, |Wi| ≤ 2L,

|Ui −Wi| ≤ 2LKh(Xi, Xn) +Kh(Xi, Xn)

1 +∑

j∈1,...,n−1\iKh(Xi, Xj)

for i ∈ 1, . . . , n− 1, thusn−1∑

i=1

|Ui −Wi| ≤ 4Lρ

(by covering assumption (C)), then via the Cauchy–Schwarz inequality(

n−1∑

i=1

(U2i −W 2

i )

)2

≤ 2

(n−1∑

i=1

|Ui||Ui −Wi|)2

+ 2

(n−1∑

i=1

|Wi||Ui −Wi|)2

≤ 8Lρn−1∑

i=1

U2i |Ui −Wi| + 8Lρ

n−1∑

i=1

W 2i |Ui −Wi|

and

E

(n−1∑

i=1

(U2i −W 2

i )

)2

≤ 16L2ρE

n−1∑

i=1

(U2i +W 2

i )Kh(Xi, Xn) +Kh(Xi, Xn)

1 +∑

j∈1,...,n−1\iKh(Xi, Xj)

.

Via (8.12) and a symmetry relation with respect to (Xn, Yn) and (Xn, Yn)we obtain

EV 2n

≤ 96L2ρE

n−1∑

i=1

|mn,i(Xi) −m(Xi)|2 Kh(Xi, Xn) +Kh(Xi, Xn)1 +

j∈1,...,n−1\iKh(Xi, Xj)

+24L2E(mn,n(Xn) −m(Xn)2

.

Now the assertion follows from (8.11) and Lemma 8.1.

Page 142: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

8.3. Proof of Theorem 8.1 125

Lemma 8.3. There is a constant c∗∗ > 0 such that

E

(n∑

i=1

(m(h)n,i (Xi) −m(Xi))(m(Xi) − Yi)

)2

≤ c∗∗nE|m(h)n,1(X1) −m(X1)|2

for each h ∈ Qn, n ∈ 2, 3, . . ..

Proof. Set mn,i = m(h)n,i and

m∗n(x) =

∑nj=1m(Xj)Kh(x,Xj)∑n

j=1Kh(x,Xj),

then

E|mn(x) −m(x)|2

= E

(∑ni=1(Yi −m(Xi))Kh(x,Xi)∑n

i=1Kh(x,Xi)+m∗

n(x) −m(x))2

= E

(n∑

i=1

(Yi −m(Xi))Kh(x,Xi)1 +

∑l∈1,...,n\iKh(x,Xl)

)2

+ E

|m∗n(x) −m(x)|2

= nE

((Y1 −m(X1))Kh(x,X1)

1 +∑n

l=2Kh(x,Xl)

)2

+ E|m∗

n(x) −m(x)|2 . (8.13)

We notice

E

n∑

i=1

[(mn,i(Xi) −m(Xi))(m(Xi) − Yi)]2

≤ 4L2 nE|mn,1(X1) −m(X1)|2.Further, noticing

EYl −m(Xl)|X1, . . . , Xn, Y1, . . . , Yl−1, Yl+1, . . . , Yn = 0

(l = 1, . . . , n), we obtain

E∑

i=j

(mn,i(Xi) −m(Xi))(m(Xi) − Yi)

×(mn,j(Xj) −m(Xj))(m(Xj) − Yj)

=∑

i=j

EYjYiKh(Xi, Xj)Kh(Xj , Xi)(m(Xi) − Yi)(m(Xj) − Yj)[

1 +∑

l∈1,...,n\i,jKh(Xi, Xl)

][

1 +∑

l∈1,...,n\i,jKh(Xj , Xl)

]

Page 143: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

126 8. Cross-Validation

= n(n− 1)E

(Y1 −m(X1))2(Y2 −m(X2))2

× Kh(X2, X1)Kh(X1, X2)[1 +

∑nl=3Kh(X2, Xl)] [1 +

∑nl=3Kh(X1, Xl)]

≤ n(n− 1)E

(Y1 −m(X1))2(Y2 −m(X2))2Kh(X2, X1)

[1 +∑n

l=3Kh(X2, Xl)]2

(by ab ≤ a2/2 + b2/2 and symmetry)

≤ 4L2n(n− 1)E(Y1 −m(X1))2Kh(X2, X1)

[1 +∑n

l=3Kh(X2, Xl)]2

= 4L2n(n− 1)∫

E(Y1 −m(X1))2Kh(x,X1)[1 +

∑n−1l=2 Kh(x,Xl)

]2 µ(dx)

≤ 4L2n

∫E|mn−1(x) −m(x)|2µ(dx) (by (8.13))

= 4L2nE|mn,1(X1) −m(X1)|2

.

8.4 Nearest Neighbor Estimates

Theorem 8.1 cannot be applied for a nearest neighbor estimate. Let m(k)n

be the k-NN estimate for sample size n ≥ 2. Then h = k can be consideredas a parameter, and we choose Qn = 1, . . . , n. Let mn denote the cross-validation nearest neighbor estimate, i.e., put

H = Hn = arg minh

1n

n∑

i=1

(m(h)n,i (Xi) − Yi)2

and

mn = m(H)n .

For the nearest neighbor estimate again we have covering (Corollary 6.1)with ρ = γd.

Theorem 8.2. Assume that |Y | ≤ L. Then, for the cross-validationnearest neighbor estimate mn,

E∆(Hn)n−1 ≤ ∆(hn−1)

n−1 + c

√log nn

for some constant c depending only on L and γd.

Page 144: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

8.5. Bibliographic Notes 127

Proof. See Problem 8.3.

Theorems 8.2 and 6.2 imply

Corollary 8.2. Assume that X is bounded,

|m(x) −m(z)| ≤ C · ‖x− z‖ (x, z ∈ Rd)

and |Y | ≤ L a.s. Let mn be the k-nearest neighbor estimate with k chosenas in Theorem 8.2. Then for d ≥ 3 and

(log n)(d+2)/(4d)n−(d−2)/(4d) ≤ C,

one has

E∆(Hn)n−1 ≤ c1C

2d/(d+2)n−2/(d+2)

for some constant c1 which depends only on L, d, and the diameter of thesupport of X.

Proof. See Problem 8.6.

8.5 Bibliographic Notes

The concept of cross-validation in statistics was introduced by Lunts andBrailovsky (1967), Allen (1974), and M. Stone (1974), for regression estima-tion by Clark (1975), and Wahba and Wold (1975). Further literature canbe found, e.g., in Hardle (1990) and Simonoff (1996). Cross-validation forkernel and nearest neighbor estimates has been studied by Chiu (1991), Hall(1984), Hardle, Hall, and Marron (1985; 1988), Hardle and Kelly (1987),Hardle and Marron (1985; 1986), Li (1984) and Wong (1983), under certainoptimality aspects. Assuming bounded Qn, a consequence of the resultsof Hall (1984) and Hardle and Marron (1985; 1986) is that, for X withcontinuous density and for continuous m,

∆(Hn)n−1

∆(hn−1)n−1

→ 1 a.s.

A corresponding result of stochastic convergence for fixed design and near-est neighbor regression is due to Li (1987). Part (b) of Theorem 8.1 wasobtained by Walk (2002b). Theorem 8.2 is a slight modification of Devroyeand Wagner (1979).

Problems and Exercises

Problem 8.1. Prove Corollary 8.1.Hint: Proceed as in the proof of Corollary 7.1.

Page 145: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

128 8. Cross-Validation

Problem 8.2. Prove that, for a random variable Z,

PZ > ε ≤ Ce−cnε2 for all ε > 0

(C > 1, c > 0) implies that

EZ ≤√

1 + log C

cn.

Hint: Without loss of generality assume Z ≥ 0 (otherwise replace Z by Z ·IZ≥0). Then

EZ ≤√

EZ2

=

√∫ ∞

0

PZ2 > ε dε

√√√√

∫ log Ccn

0

1 dε +∫ ∞

log Ccn

Ce−cnε dε

=

√1 + log C

cn.

Problem 8.3. Prove Theorem 8.2.Hint:

Step (1). With h = k, follow the line of the proof of Theorem 8.1 until (8.4):∣∣∣∣∣

n∑

j=1

(m(h)n,j(Xj) − Yj)2 −

n∑

j=1

(m(h)′n,j (X ′

j) − Y ′j )2

∣∣∣∣∣

≤ 4L2 + 4L

n∑

j=2

|m(h)n,j(Xj) − m

(h)′n,j (Xj)|.

Step (2). Apply Corollary 6.1 to show thatn∑

j=2

|m(k)n,j(Xj) − m

(k)′n,j (Xj)| ≤ 4Lγd.

According to Corollary 6.1,n∑

j=2

|m(k)n,j(Xj) − m

(k)′n,j (Xj)|

=n∑

j=2

∣∣∣∣∣1k

Y1IX1 among X1,...,Xj−1,Xj+1,...,Xn is one of the k NNs of Xj

+1k

l∈2,...,n−j

YlIXl among X1,...,Xj−1,Xj+1,...,Xn is one of the k NNs of Xj

Page 146: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

Problems and Exercises 129

− 1k

Y ′1IX′

1 among X′1,...,Xj−1,Xj+1,...,Xn is one of the k NNs of Xj

− 1k

l∈2,...,n−j

YlIXl among X′1,...,Xj−1,Xj+1,...,Xn is one of the k NNs of Xj

∣∣∣∣∣

≤ 1k

L

n∑

j=2

IX1 among X1,...,Xj−1,Xj+1,...,Xn is one of the k NNs of Xj

+1k

L

n∑

j=2

IX′1 among X′

1,...,Xj−1,Xj+1,...,Xn is one of the k NNs of Xj

+1k

L

n∑

j=2

l∈2,...,n−j∣∣∣∣∣IXl among X1,...,Xj−1,Xj+1,...,Xn is one of the k NNs of Xj

−IXl among X′1,...,Xj−1,Xj+1,...,Xn is one of the k NNs of Xj

∣∣∣∣∣

≤ Lγd + Lγd + 2Lγd.

Problem 8.4. (a) Use cross-validation to choose the side lengths of rectangularpartitioning such that the resulting estimate approaches the rate of convergencein Problem 4.7.

(b) Use cross-validation to choose the scaling for product kernel estimates suchthat the resulting estimate approaches the rate of convergence in Problem 5.7.

Hint: Proceed as in Problem 7.7.

Problem 8.5. Formulate and prove a version of Theorem 7.2 which uses cross-validation instead of splitting of the data.

Problem 8.6. Prove Corollary 8.2.Hint: Use Theorems 8.2 and 6.2.

Problem 8.7. Show that, under the conditions of part (a) of Theorem 8.1,

E∫

|m(Hn)n (x) − m(x)|2µ(dx) ≤ ∆(hn)

n + c · |Qn|√n

for some constant c depending only on L and ρ.Hint: Use Theorem A.3 for the treatment of

∫|m(h)

n (x) − m(x)|2µ(dx) − ∆(h)n (h ∈ Qn),

further Problem 8.2 and Theorem 8.1 (a).

Page 147: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

9Uniform Laws of Large Numbers

In the least squares estimation problem we minimize the empirical L2 risk

1n

n∑

i=1

|f(Xi) − Yi|2

over a set of functions Fn depending on the sample size n. One of themain steps in proving the consistency of such estimates is to show that theempirical L2 risk is uniformly (over Fn) close to the L2 risk. More precisely,we will need to show

supf∈Fn

∣∣∣∣∣1n

n∑

i=1

|f(Xi) − Yi|2 − E|f(X) − Y |2∣∣∣∣∣→ 0 (n → ∞) a.s. (9.1)

Set Z = (X,Y ), Zi = (Xi, Yi) (i = 1, . . . , n), gf (x, y) = |f(x) − y|2 forf ∈ Fn and Gn = gf : f ∈ Fn. Then (9.1) can be written as

supg∈Gn

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)∣∣∣∣∣→ 0 (n → ∞) a.s.

Thus we are interested in bounding the distance between an average andits expectation uniformly over a set of functions. In this chapter we discusstechniques for doing this.

Page 148: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

9.1. Basic Exponential Inequalities 131

9.1 Basic Exponential Inequalities

For the rest of this chapter let Z, Z1, Z2, . . . be independent and identicallydistributed random variables with values in Rd, and for n ∈ N let Gn be aclass of functions g : Rd → R. We derive sufficient conditions for

supg∈Gn

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)∣∣∣∣∣→ 0 (n → ∞) a.s. (9.2)

For each fixed function g with E|g(Z)| < ∞ the strong law of large numbersimplies

limn→∞

∣∣∣∣∣∣

1n

n∑

j=1

g(Zj) − E g(Z)∣∣∣∣∣∣= 0 a.s.

We will use Hoeffding’s inequality (Lemma A.3) to extend it to sets offunctions. Recall that if g is a function g : Rd → [0, B], then by Hoeffding’sinequality

P

∣∣∣∣∣∣

1n

n∑

j=1

g(Zj) − Eg(Z)∣∣∣∣∣∣> ε

≤ 2e− 2nε2

B2 ,

which together with the union bound implies

P

supg∈Gn

∣∣∣∣∣∣

1n

n∑

j=1

g(Zj) − Eg(Z)∣∣∣∣∣∣> ε

≤ 2|Gn|e− 2nε2

B2 . (9.3)

Thus, for finite classes Gn satisfying

∞∑

n=1

|Gn|e− 2nε2

B2 < ∞ (9.4)

for all ε > 0, (9.2) follows from (9.3) and the Borel–Cantelli lemma (seethe proof of Lemma 9.1 below for details).

In our applications (9.4) is never satisfied because the cardinality of Gn isalways infinite. But sometimes it is possible to choose finite sets Gn,ε whichsatisfy (9.4) and

supg∈Gn

∣∣∣∣∣∣

1n

n∑

j=1

g(Zj) − E g(Z)∣∣∣∣∣∣> ε

sup

g∈Gn,ε

∣∣∣∣∣∣

1n

n∑

j=1

g(Zj) − E g(Z)∣∣∣∣∣∣> ε′

(9.5)

Page 149: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

132 9. Uniform Laws of Large Numbers

ε

g

gj

Figure 9.1. Sup norm distance between function g and member gj of cover is lessthan ε.

for some ε′ depending on ε (but not on n). Clearly, then (9.2) follows from(9.5) and (9.4) (for Gn,ε instead of Gn) with the above argument applied toGn,ε.

To construct classes Gn,ε which satisfy (9.5) one can use covers withrespect to the supremum norm:

Definition 9.1. Let ε > 0 and let G be a set of functions Rd → R. Everyfinite collection of functions g1, . . . , gN : Rd → R with the property that forevery g ∈ G there is a j = j(g) ∈ 1, . . . , N such that

‖g − gj‖∞ := supz

|g(z) − gj(z)| < ε

is called an ε-cover of G with respect to ‖ · ‖∞.

If g1, . . . , gN is an ε-cover of G with respect to ‖·‖∞, then G is a subsetof the union of all ‖ · ‖∞-balls with center gi and radius ε (i = 1, . . . , n).The fewer balls are needed to cover G, the smaller G is in some sense.

Definition 9.2. Let ε > 0 and let G be a set of functions Rd → R. LetN (ε,G, ‖ · ‖∞) be the size of the smallest ε-cover of G w.r.t. ‖ · ‖∞. TakeN (ε,G, ‖ · ‖∞) = ∞ if no finite ε-cover exists. Then N (ε,G, ‖ · ‖∞) iscalled an ε-covering number of G w.r.t. ‖ · ‖∞ and will be abbreviated toN∞(ε,G).

Lemma 9.1. For n ∈ N let Gn be a set of functions g : Rd → [0, B] andlet ε > 0. Then

P

supg∈Gn

∣∣∣∣∣∣

1n

n∑

j=1

g(Zj) − E g(Z)∣∣∣∣∣∣> ε

≤ 2N∞ (ε/3,Gn) e− 2nε2

9B2 .

Furthermore, if∞∑

n=1

N∞ (ε/3,Gn) e− 2nε2

9B2 < ∞ (9.6)

Page 150: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

9.1. Basic Exponential Inequalities 133

for all ε > 0, then (9.2) holds.

Since in the above probability the supremum is taken over a possibleuncountable set, there may be some measurability problems. In the bookof van der Vaart and Wellner (1996) this issue is handled very elegantlyusing the notion of outer probability. In most of our applications it willsuffice to consider countable sets of functions, therefore, here and in thesequel, we shall completely ignore this problem.

Proof. Let Gn, ε3

be an ε3–cover of Gn w.r.t. ‖ · ‖∞ of minimal cardinality.

Fix g ∈ Gn. Then there exists g ∈ Gn, ε3

such that ‖g− g‖∞ < ε3 . Using this

one gets∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)∣∣∣∣∣

≤∣∣∣∣∣1n

n∑

i=1

g(Zi) − 1n

n∑

i=1

g(Zi)

∣∣∣∣∣+

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)∣∣∣∣∣

+ |Eg(Z) − Eg(Z)|

≤ ‖g − g‖∞ +

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)∣∣∣∣∣+ ‖g − g‖∞

≤ 23ε+

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)∣∣∣∣∣.

Hence,

P

supg∈Gn

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)∣∣∣∣∣> ε

≤ P

supg∈Gn, ε

3

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)∣∣∣∣∣>ε

3

(9.3)≤ 2

∣∣Gn, ε

3

∣∣ e− 2n( ε

3 )2

B2 = 2N∞

( ε3,Gn

)e− 2nε2

9 B2 .

If (9.6) holds, then this implies

∞∑

n=1

P

supg∈Gn

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)∣∣∣∣∣>

1k

< ∞

for each k ∈ N . Using the Borel–Cantelli lemma one concludes that, foreach k ∈ N ,

lim supn→∞

supg∈Gn

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)∣∣∣∣∣≤ 1k

a.s.,

Page 151: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

134 9. Uniform Laws of Large Numbers

hence also with probability one

lim supn→∞

supg∈Gn

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)∣∣∣∣∣≤ 1k

for all k ∈ N ,

which implies (9.2).

9.2 Extension to Random L1 Norm Covers

Supremum norm covers are often too large to satisfy (9.6). But clearly, (9.6)is not necessary for the proof of (9.2). To motivate a weaker condition letus again consider (9.5). For the sake of simplicity ignore the expected valueEg(Z) (think of it as an average 1

n

∑2nj=n+1 g(Zj), where Z1, . . . , Z2n are

i.i.d.). So instead of a data-independent set Gn,ε we now try to construct aset Gn,ε depending on Z1, . . . , Z2n which satisfies

supg∈Gn

∣∣∣∣∣∣

1n

n∑

j=1

g(Zj) − 1n

2n∑

j=n+1

g(Zj)

∣∣∣∣∣∣> ε

sup

g∈Gn,ε

∣∣∣∣∣∣

1n

n∑

j=1

g(Zj) − 1n

2n∑

j=n+1

g(Zj)

∣∣∣∣∣∣> ε′

for some ε′ depending on ε (but not on n). Then it is clear that all we needis a data-dependent cover which can approximate each g ∈ Gn with respectto

‖g − h‖2n =12n

2n∑

j=1

|g(Zj) − h(Zj)|.

To formulate this idea introduce the following covering numbers:

Definition 9.3. Let ε > 0, let G be a set of functions Rd → R, 1 ≤ p < ∞,and let ν be a probability measure on Rd. For a function f : Rd → R set

‖f‖Lp(ν) :=∫

|f(z)|pdν 1

p

.

(a) Every finite collection of functions g1, . . . , gN : Rd → R with theproperty that for every g ∈ G there is a j = j(g) ∈ 1, . . . , N such that

‖g − gj‖Lp(ν) < ε

is called an ε-cover of G with respect to ‖ · ‖Lp(ν).(b) Let N (ε,G, ‖ · ‖Lp(ν)) be the size of the smallest ε-cover of G w.r.t.‖ · ‖Lp(ν). Take N (ε,G, ‖ · ‖Lp(ν)) = ∞ if no finite ε-cover exists. ThenN (ε,G, ‖ · ‖Lp(ν)) is called an ε-covering number of G w.r.t. ‖ · ‖Lp(ν).

Page 152: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

9.2. Extension to Random L1 Norm Covers 135

ε

Figure 9.2. Example of ε-cover.

(c) Let zn1 = (z1, . . . , zn) be n fixed points in Rd. Let νn be the

corresponding empirical measure, i.e.,

νn(A) =1n

n∑

i=1

IA(zi) (A ⊆ Rd).

Then

‖f‖Lp(νn) =

1n

n∑

i=1

|f(zi)|p 1

p

and any ε-cover of G w.r.t. ‖ · ‖Lp(νn) will be called an Lp ε-cover of G onzn1 and the ε-covering number of G w.r.t. ‖ · ‖Lp(νn) will be denoted by

Np(ε,G, zn1 ).

In other words, Np(ε,G, zn1 ) is the minimal N ∈ N such that there exist

functions g1, . . . , gN : Rd → R with the property that for every g ∈ G thereis a j = j(g) ∈ 1, . . . , N such that

1n

n∑

i=1

|g(zi) − gj(zi)|p 1

p

< ε.

If Zn1 = (Z1, . . . , Zn) is a sequence of i.i.d. random variables, then

N1(ε,G, Zn1 ) is a random variable, whose expected value plays a central

role in our problem. With these notations we can extend the first half of

Page 153: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

136 9. Uniform Laws of Large Numbers

Lemma 9.1 from fixed supremum norm covers to L1 covers on a randomset of points.

Theorem 9.1. Let G be a set of functions g : Rd → [0, B]. For any n, andany ε > 0,

P

supg∈G

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)∣∣∣∣∣> ε

≤ 8E N1(ε/8,G, Zn1 ) e−nε2/(128B2).

In the proof we use symmetrization and covering arguments, which we willalso later apply in many other proofs, e.g., in the proofs of Theorems 11.2,11.4, and 11.6.

Proof. The proof will be divided into four steps.Step 1. Symmetrization by a ghost sample.Replace the expectation inside the above probability by an empirical meanbased on a ghost sample Z

′n1 = (Z ′

1, . . . , Z′n) of i.i.d. random variables

distributed as Z and independent of Zn1 . Let g∗ be a function g ∈ G such

that∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)

∣∣∣∣∣> ε,

if there exists any such function, and let g∗ be an other arbitrary functioncontained in G, if such a function doesn’t exist. Note that g∗ depends onZn

1 and that Eg∗(Z)|Zn1 is the expectation of g∗(Z) with respect to Z.

Application of Chebyshev’s inequality yields

P

∣∣∣∣∣Eg∗(Z)|Zn

1 − 1n

n∑

i=1

g∗(Z ′i)

∣∣∣∣∣>ε

2

∣∣∣∣∣Zn

1

≤ Varg∗(Z)|Zn1

n( ε2 )2

≤B2

4

n · ε2

4

=B2

nε2,

where we have used 0 ≤ g∗(Z) ≤ B which implies

Varg∗(Z)|Zn1 = Var

g∗(Z) − B

2

∣∣∣∣∣Zn

1

≤ E

∣∣∣∣g

∗(Z) − B

2

∣∣∣∣

2∣∣∣∣∣Zn

1

≤ B2

4.

Thus, for n ≥ 2B2

ε2 , we have

P

∣∣∣∣∣Eg∗(Z)|Zn

1 − 1n

n∑

i=1

g∗(Z ′i)

∣∣∣∣∣≤ ε

2

∣∣∣∣Z

n1

≥ 12. (9.7)

Page 154: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

9.2. Extension to Random L1 Norm Covers 137

Hence

P

supg∈G

∣∣∣∣∣1n

n∑

i=1

g(Zi) − 1n

n∑

i=1

g(Z ′i)

∣∣∣∣∣>ε

2

≥ P

∣∣∣∣∣1n

n∑

i=1

g∗(Zi) − 1n

n∑

i=1

g∗(Z ′i)

∣∣∣∣∣>ε

2

≥ P

∣∣∣∣∣1n

n∑

i=1

g∗(Zi) − Eg∗(Z)|Zn1

∣∣∣∣∣> ε,

∣∣∣∣∣1n

n∑

i=1

g∗(Z ′i) − Eg∗(Z)|Zn

1 ∣∣∣∣∣≤ ε

2

.

The last probability can be determined by computing in a first step thecorresponding probability conditioned on Zn

1 , and by averaging in a secondstep the result with respect to Zn

1 . Whether∣∣∣∣∣1n

n∑

i=1

g∗(Zi) − Eg∗(Z)|Zn1

∣∣∣∣∣> ε

holds or not, depends only on Zn1 . If it holds, then the probability above

conditioned on Zn1 is equal to

P

∣∣∣∣∣1n

n∑

i=1

g∗(Z ′i) − Eg∗(Z)|Zn

1 ∣∣∣∣∣≤ ε

2

∣∣∣∣∣Zn

1

,

otherwise it is zero. Using this, we get

P ∣∣∣∣∣1n

n∑

i=1

g∗(Zi) − Eg∗(Z)|Zn1

∣∣∣∣∣> ε,

∣∣∣∣∣1n

n∑

i=1

g∗(Z ′i) − Eg∗(Z)|Zn

1 ∣∣∣∣∣≤ ε

2

= E

I| 1n

∑n

i=1g∗(Zi)−Eg∗(Z)|Zn

1 |>ε

×P

∣∣∣∣∣1n

n∑

i=1

g∗(Z ′i) − Eg∗(Z)|Zn

1 ∣∣∣∣∣≤ ε

2

∣∣∣∣∣Zn

1

≥ 12P

∣∣∣∣∣1n

n∑

i=1

g∗(Zi) − Eg∗(Z)|Zn1

∣∣∣∣∣> ε

Page 155: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

138 9. Uniform Laws of Large Numbers

=12P

supg∈G

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)

∣∣∣∣∣> ε

,

where the last inequality follows from (9.7). Thus we have shown that, forn ≥ 2B2

ε2 ,

P

supg∈G

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)∣∣∣∣∣> ε

≤ 2P

supg∈G

∣∣∣∣∣1n

n∑

i=1

g(Zi) − 1n

n∑

i=1

g(Z ′i)

∣∣∣∣∣>ε

2

.

Step 2. Introduction of additional randomness by random signs.Let U1, . . . , Un be independent and uniformly distributed over −1, 1 andindependent of Z1, . . . , Zn, Z ′

1, . . . , Z′n. Because of the independence and

identical distribution of Z1, . . . , Zn, Z ′1, . . . , Z

′n, the joint distribution of

Zn1 ,Z ′n

1 , is not affected if one randomly interchanges the correspondingcomponents of Zn

1 and Z ′n1 . Hence

P

supg∈G

∣∣∣∣∣1n

n∑

i=1

(g(Zi) − g(Z ′i))

∣∣∣∣∣>ε

2

= P

supg∈G

∣∣∣∣∣1n

n∑

i=1

Ui (g(Zi) − g(Z ′i))

∣∣∣∣∣>ε

2

≤ P

supg∈G

∣∣∣∣∣1n

n∑

i=1

Uig(Zi)

∣∣∣∣∣>ε

4

+ P

supg∈G

∣∣∣∣∣1n

n∑

i=1

Uig(Z ′i)

∣∣∣∣∣>ε

4

= 2P

supg∈G

∣∣∣∣∣1n

n∑

i=1

Uig(Zi)

∣∣∣∣∣>ε

4

.

Step 3. Conditioning and introduction of a covering.Next we condition in the last probability on Zn

1 , which is equivalent tofixing z1, . . . , zn ∈ Rd and to considering

P

∃g ∈ G :

∣∣∣∣∣1n

n∑

i=1

Uig(zi)

∣∣∣∣∣>ε

4

. (9.8)

Let G ε8

be an L1ε8–cover of G on zn

1 . Fix g ∈ G. Then there exists g ∈ G ε8

such that

1n

n∑

i=1

|g(zi) − g(zi)| < ε

8. (9.9)

Page 156: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

9.2. Extension to Random L1 Norm Covers 139

W.l.o.g. we may assume 0 ≤ g(z) ≤ B (otherwise, truncate g at 0 and B).Then (9.9) implies

∣∣∣∣∣1n

n∑

i=1

Uig(zi)

∣∣∣∣∣

=

∣∣∣∣∣1n

n∑

i=1

Uig(zi) +1n

n∑

i=1

Ui(g(zi) − g(zi))

∣∣∣∣∣

≤∣∣∣∣∣1n

n∑

i=1

Uig(zi)

∣∣∣∣∣+

1n

n∑

i=1

|g(zi) − g(zi)|

<

∣∣∣∣∣1n

n∑

i=1

Uig(zi)

∣∣∣∣∣+ε

8.

Using this we can bound the probability in (9.8) by

P

∃g ∈ G ε8

:

∣∣∣∣∣1n

n∑

i=1

Uig(zi)

∣∣∣∣∣+ε

8>ε

4

≤ |G ε8| maxg∈G ε

8

P

∣∣∣∣∣1n

n∑

i=1

Uig(zi)

∣∣∣∣∣>ε

8

.

Choose G ε8

as an L1ε8 -cover on zn

1 of minimal size. Then we have

P

∃g ∈ G :

∣∣∣∣∣1n

n∑

i=1

Uig(zi)

∣∣∣∣∣>ε

4

≤ N1

( ε8,G, zn

1

)maxg∈G ε

8

P

∣∣∣∣∣1n

n∑

i=1

Uig(zi)

∣∣∣∣∣>ε

8

.

Step 4. Application of Hoeffding’s inequality.In this step we bound

P

∣∣∣∣∣1n

n∑

i=1

Uig(zi)

∣∣∣∣∣>ε

8

,

where z1, . . . , zn ∈ Rd, g : Rd → R, and 0 ≤ g(z) ≤ B.Since U1g(z1), . . . , Ung(zn) are independent random variables with

−B ≤ Uig(zi) ≤ B (i = 1, . . . , n),

we have, by Hoeffding’s inequality,

P

∣∣∣∣∣1n

n∑

i=1

Uig(zi)

∣∣∣∣∣>ε

8

≤ 2 exp

(

−2n(

ε8

)2

(2B)2

)

≤ 2 exp(

− n ε2

128B2

).

In the case of n ≥ 2B2/ε2 the assertion is now implied by the four steps.For n < 2B2/ε2 the bound on the probability trivially holds, because theright-hand side is greater than one.

Page 157: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

140 9. Uniform Laws of Large Numbers

9.3 Covering and Packing Numbers

From Theorem 9.1 one can derive (9.2) by an application of the Borel–Cantelli lemma, if one has suitable upper bounds for the L1 coveringnumbers. These upper bounds will be derived in the sequel.

We first introduce the concept of Lp packing numbers and study therelationship between packing and covering numbers. These results will beused to obtain upper bounds on covering numbers.

Definition 9.4. Let ε > 0, let G be a set of functions Rd → R, 1 ≤ p <∞, and let ν be a probability measure on Rd. Recall that, for a functionf : Rd → R,

‖f‖Lp(ν) :=∫

|f(z)|pdν 1

p

.

(a) Every finite collection of functions g1, . . . , gN ∈ G with

‖gj − gk‖Lp(ν) ≥ ε

for all 1 ≤ j < k ≤ N is called an ε-packing of G w.r.t. ‖ · ‖Lp(ν).(b) Let M(ε,G, ‖ · ‖Lp(ν)) be the size of the largest ε-packing of G w.r.t.‖ · ‖Lp(ν). Take M(ε,G, ‖ · ‖Lp(ν)) = ∞ if there exists an ε-packing of Gw.r.t. ‖ · ‖Lp(ν) of size N for every N ∈ N . Then M(ε,G, ‖ · ‖Lp(ν)) iscalled an ε-packing number of G w.r.t. ‖ · ‖Lp(ν).(c) Let zn

1 = (z1, . . . , zn) be n fixed points in Rd. Let νn be thecorresponding empirical measure, i.e.,

νn(A) =1n

n∑

i=1

IA(zi) (A ⊆ Rd).

Then

‖f‖Lp(νn) =

1n

n∑

i=1

|f(zi)|p 1

p

and any ε-packing of G w.r.t. ‖ · ‖Lp(νn) will be called an Lp ε-packing ofG on zn

1 and the ε-packing number of G w.r.t. ‖ · ‖Lp(νn) will be denoted by

Mp(ε,G, zn1 ).

In other words, Mp(ε,G, zn1 ) is the maximal N ∈ N such that there exist

functions g1, . . . , gN ∈ G with

1n

n∑

i=1

|gj(zi) − gk(zi)|p 1

p

≥ ε

for all 1 ≤ j < k ≤ N .

Page 158: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

9.3. Covering and Packing Numbers 141

ε/2

Figure 9.3. ε-packing.

As the next lemma shows, the Lp packing numbers are closely related tothe Lp covering numbers.

Lemma 9.2. Let G be a class of functions on Rd and let ν be a probabilitymeasure on Rd, p ≥ 1 and ε > 0. Then

M(2ε,G, ‖ · ‖Lp(ν)) ≤ N (ε,G, ‖ · ‖Lp(ν)) ≤ M(ε,G, ‖ · ‖Lp(ν)).

In particular,

Mp(2ε,G, zn1 ) ≤ Np(ε,G, zn

1 ) ≤ Mp(ε,G, zn1 )

for all z1, . . . , zn ∈ Rd.

Proof. Let f1, . . . , fl be a 2ε-packing of G w.r.t. ‖ · ‖Lp(ν). Then any set

Uε(g) =h : Rd → R : ‖h− g‖Lp(ν) < ε

can contain at most one of the fi’s. This proves the first inequality.In the proof of the second inequality, we assume M(ε,G, ‖ · ‖Lp(ν)) < ∞

(otherwise the proof is trivial). Let g1, . . . , gl be an ε-packing of G w.r.t.‖ · ‖Lp(ν) of maximal cardinality l = M(ε,G, ‖ · ‖Lp(ν)). Let h ∈ G bearbitrary. Then h, g1, . . . , gl is a subset of G of cardinality l + 1, hencecannot be an ε-packing of G with respect to ‖ · ‖Lp(ν). Thus there existsj ∈ 1, . . . , l such that

‖h− gj‖Lp(ν) < ε.

Page 159: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

142 9. Uniform Laws of Large Numbers

This proves that g1, . . . , gl is an ε-cover of G with respect to ‖ · ‖Lp(ν),which implies

N (ε,G, ‖ · ‖Lp(ν)) ≤ l = M(ε,G, ‖ · ‖Lp(ν)).

Next we use the above lemma to bound L2 ε-covering numbers on pointszn1 of balls in linear vector spaces.

Lemma 9.3. Let F be a set of functions f : Rd → R. Assume that F isa linear vector space of dimension D. Then one has, for arbitrary R > 0,ε > 0, and z1, . . . , zn ∈ Rd,

N2

(

ε,

f ∈ F :1n

n∑

i=1

|f(zi)|2 ≤ R2

, zn1

)

≤(

4R+ ε

ε

)D

.

Proof. For f, g : Rd → R set

< f, g >n :=1n

n∑

i=1

f(zi)g(zi) and ‖f‖2n := < f, f >n .

Let f1, . . . , fN be an arbitrary ε-packing of f ∈ F : ‖f‖n ≤ R w.r.t.‖ · ‖n, i.e., f1, . . . , fN ∈ f ∈ F : ‖f‖n ≤ R satisfy

‖fi − fj‖n ≥ ε for all 1 ≤ i < j ≤ N.

Because of Lemma 9.2 it suffices to show

N ≤(

4R+ ε

ε

)D

. (9.10)

In order to show (9.10) let B1, . . . , BD be a basis of the linear vector spaceF . Then, for any a1, b1, . . . , aD, bD ∈ R,

‖D∑

j=1

ajBj −D∑

j=1

bjBj‖2n

= <

D∑

j=1

(aj − bj)Bj ,

D∑

j=1

(aj − bj)Bj >n = (a− b)TB(a− b),

where

B = (< Bj , Bk >n)1≤j,k≤D and (a− b) = (a1 − b1, . . . , aD − bD)T .

Because of

aTBa = ‖D∑

j=1

ajBj‖2n ≥ 0 (a ∈ RD),

Page 160: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

9.4. Shatter Coefficients and VC Dimension 143

the symmetric matrix B is positive semidefinite, hence there exists asymmetric matrix B1/2 such that

B = B1/2 ·B1/2.

We have

‖D∑

j=1

ajBj −D∑

j=1

bjBj‖2n = (a− b)TB1/2B1/2(a− b) = ‖B1/2(a− b)‖2,

where ‖ · ‖ is the Euclidean norm in RD.Because fi ∈ F we get

fi =D∑

j=1

a(i)j Bj

for some

a(i) = (a(i)1 , . . . , a

(i)D )T ∈ RD (i = 1, . . . , N).

It follows

‖B1/2a(i)‖ = ‖fi‖n ≤ R

and

‖B1/2a(i) −B1/2a(j)‖ = ‖fi − fj‖n ≥ ε

for all 1 ≤ i < j ≤ N . Hence the N balls in Rd with centers B1/2a(1), . . . ,B1/2a(N) and radius ε/4 are disjoint subsets of the ball with center zeroand radius R+ ε/4. By comparing the volumes of the balls one gets

N · cD ·( ε

4

)D

≤ cD ·(R+

ε

4

)D

,

where cD is the volume of a ball with radius one in RD. This implies theassertion.

9.4 Shatter Coefficients and VC Dimension

In this section we derive bounds for Lp covering numbers of sets of func-tions, which are not necessarily subsets of some finite-dimensional vectorspace of functions. Therefore we need the following definition:

Definition 9.5. Let A be a class of subsets of Rd and let n ∈ N .(a) For z1, . . . , zn ∈ Rd define

s (A, z1, . . . , zn) = |A ∩ z1, . . . , zn : A ∈ A| ,that is, s(A, z1, . . . , zn) is the number of different subsets of z1, . . . , znof the form A ∩ z1, . . . , zn, A ∈ A.

Page 161: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

144 9. Uniform Laws of Large Numbers

Figure 9.4. Three points can be shattered by half-spaces on the plane.

(b) Let G be a subset of Rd of size n. One says that A shatters G ifs(A, G) = 2n, i.e., if each subset of G can be represented in the form A∩Gfor some A ∈ A.(c) The nth shatter coefficient of A is

S(A, n) = maxz1,...,zn⊆Rd

s (A, z1, . . . , zn) .

That is, the shatter coefficient is the maximal number of different subsetsof n points that can be picked out by sets from A.

Clearly, s (A, z1, . . . , zn) ≤ 2n and S(A, n) ≤ 2n. If S(A, n) < 2n thens (A, z1, . . . , zn) < 2n for all z1, . . . , zn ∈ Rd. If s (A, z1, . . . , zn) < 2n,then z1, . . . , zn has a subset such that there is no set in A that containsexactly that subset of z1, . . . , zn.

It is easy to see that S(A, k) < 2k implies S(A, n) < 2n for all n > k.The last time when S(A, k) = 2k is important:

Definition 9.6. Let A be a class of subsets of Rd with A = ∅. The VCdimension (or Vapnik–Chervonenkis dimension) VA of A is defined by

VA = sup n ∈ N : S(A, n) = 2n ,i.e., the VC dimension VA is the largest integer n such that there exists aset of n points in Rd which can be shattered by A.

Example 9.1. The class of all intervals in R of the form (−∞, b] (b ∈R) fails to pick out the largest of any two distinct points, hence its VCdimension is 1. The class of all intervals in R of the form (a, b] (a, b ∈ R)shatters every two-point set but cannot pick out the largest and the smallestpoint of any set of three distinct points. Thus its VC dimension is 2.

Our next two theorems, which we will use later to derive bounds on Lp

packing numbers, state the suprising fact that either S(A, n) = 2n for alln (in which case VA = ∞) or S(A, n) is bounded by some polynomial in nof degree VA < ∞.

Page 162: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

9.4. Shatter Coefficients and VC Dimension 145

Theorem 9.2. Let A be a set of subsets of Rd with VC dimension VA.Then, for any n ∈ N ,

S(A, n) ≤VA∑

i=0

(ni

).

Proof. Let z1, . . . , zn be any n distinct points. Clearly, it suffices to showthat

s(A, z1, . . . , zn) = |A ∩ z1, . . . , zn : A ∈ A| ≤VA∑

i=0

(ni

).

Denote by F1, . . . , Fk the collection of all k =(

nVA+1

)subsets of

z1, . . . , zn of size VA + 1. By the definition of VC dimension, A shat-ters none of the sets Fi, i.e., for each i ∈ 1, . . . , k there exists Hi ⊆ Fi

such that

A ∩ Fi = Hi for all A ∈ A. (9.11)

Now Fi ⊆ z1, . . . , zn implies A∩Fi = (A∩ z1, . . . , zn) ∩Fi and, hence,(9.11) can be rewritten as

(A ∩ z1, . . . , zn) ∩ Fi = Hi for all A ∈ A. (9.12)

Set

C0 = C ⊆ z1, . . . , zn : C ∩ Fi = Hi for each i .Then (9.12) implies

A ∩ z1, . . . , zn : A ∈ A ⊆ C0,

hence it suffices to prove

|C0| ≤VA∑

i=0

(ni

).

This is easy in one special case: If Hi = Fi for every i, then

C ∩ Fi = Hi ⇔ C ∩ Fi = Fi ⇔ Fi ⊆ C,

which implies that C0 consists of all subsets of z1, . . . , zn of cardinalityless than VA + 1, and hence

|C0| =VA∑

i=0

(ni

).

We will reduce the general case to the special case just treated.For each i define

H ′i = (Hi ∪ z1) ∩ Fi,

Page 163: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

146 9. Uniform Laws of Large Numbers

that is, augment Hi by z1 provided z1 is contained in Fi. Set

C1 = C ⊆ z1, . . . , zn : C ∩ Fi = H ′i for each i .

We will show that the cardinality of C1 is not less than the cardinality ofC0. (Notice: This is not equivalent to C0 ⊆ C1.)

The sets C0 and C1 can be written as disjoint unions

C0 = (C0 ∩ C1) ∪ (C0 \ C1) , C1 = (C0 ∩ C1) ∪ (C1 \ C0) ,

therefore is suffices to prove

|C0 \ C1| ≤ |C1 \ C0| .To do this, we will show that the map

C → C \ z1is one-to-one from C0 \ C1 into C1 \ C0.

Let C ∈ C0 \C1. Then C ∩Fi = Hi for each i, but C ∩Fi0 = H ′i0

for somei0. Clearly, this implies

H ′i0 = (Hi0 ∪ z1) ∩ Fi0 = Hi0 ,

and hence z1 does not belong to Hi0 , but belongs to Fi0 , H′i0

, and C.Because z1 ∈ C for all C ∈ C0 \ C1, stripping of the point z1 defines a

one-to-one map.It remains to show that C \ z1 ∈ C1 \ C0. Because of C ∩Fi0 = H ′

i0and

z1 ∈ Hi0 ,

(C \ z1) ∩ Fi0 = (C ∩ Fi0) \ z1 = H ′i0 \ z1 = Hi0 ,

hence C \ z1 ∈ C0. In addition, C \ z1 ∈ C1, because if z1 ∈ Fi, thenC ∈ C0 implies

(C \ z1) ∩ Fi = C ∩ Fi = Hi = H ′i;

and if z1 ∈ Fi, then z1 is also contained in H ′i, but certainly not in C \z1,

hence

(C \ z1) ∩ Fi = H ′i.

This proves that the cardinality of C1 is not less than the cardinality ofC0. By repeating this procedure (n−1) times (with z2, z3, . . . taken insteadof z1, and starting with C1, C2, . . . instead of C0) one generates classes C2,C3, . . . , Cn with

|C0| ≤ |C1| ≤ · · · ≤ |Cn|.The sets Hi in the definition of Cn satisfy Hi = Fi and, hence the assertionfollows from the special case already considered.

Theorem 9.3. Let A be a set of subsets of Rd with VC dimensionVA < ∞. Then, for all n ∈ N ,

S(A, n) ≤ (n+ 1)VA ,

Page 164: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

9.4. Shatter Coefficients and VC Dimension 147

Figure 9.5. Subgraph of a function.

and, for all n ≥ VA,

S(A, n) ≤(e n

VA

)VA

.

Proof. Theorem 9.2, together with the binomial theorem, imply

S(A, n) ≤VA∑

i=0

(ni

)=

VA∑

i=0

n!(n− i)!

· 1i!

≤VA∑

i=0

ni ·(VAi

)= (n+ 1)VA .

If VA/n ≤ 1 then, again by Theorem 9.2 and the binomial theorem,

(VAn

)VA

S(A, n) ≤(VAn

)VA VA∑

i=0

(ni

)≤

VA∑

i=0

(VAn

)i (ni

)

≤n∑

i=0

(VAn

)i (ni

)=

(1 +

VAn

)n

≤ eVA .

Next we use these results to derive upper bounds on Lp packing numbers.Let G be a class of functions on Rd taking their values in [0, B]. To boundthe Lp packing number of G we will use the VC dimension of the set

G+ :=(z, t) ∈ Rd × R ; t ≤ g(z) ; g ∈ G

of all subgraphs of functions of G.

Page 165: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

148 9. Uniform Laws of Large Numbers

Theorem 9.4. Let G be a class of functions g : Rd → [0, B] with VG+ ≥ 2,let p ≥ 1, let ν be a probability measure on Rd, and let 0 < ε < B

4 . Then

M (ε,G, ‖ · ‖Lp(ν)

) ≤ 3(

2eBp

εplog

3eBp

εp

)VG+

.

Proof. The proof is divided into four steps. In the first three steps weprove the assertion in the case p = 1, in the fourth step we reduce thegeneral case to the case p = 1.Step 1. We first relate the packing number M (

ε,G, ‖ · ‖L1(ν))

to a shattercoefficient of G+. Set m = M (

ε,G, ‖ · ‖L1(ν))

and let G = g1, . . . , gm bean ε-packing of G w.r.t. ‖ · ‖L1(ν).

Let Q1, . . . , Qk ∈ Rd be k independent random variables with commondistribution ν. Generate k independent random variables T1, . . . , Tk uni-formly distributed on [0, B], which are also independent from Q1, . . . , Qk.Denote Ri = (Qi, Ti) (i = 1, . . . , k), Gf = (z, t) : t ≤ f(z) forf : Rd → [0, B], and Rk

1 = R1, . . . , Rk.Then

S(G+, k) = maxz1,...,zk∈Rd×R

s(G+, z1, . . . , zk)

≥ Es(G+, R1, . . . , Rk)

≥ Es(Gf : f ∈ G, R1, . . . , Rk)

≥ Es(Gf : f ∈ G, Gf ∩Rk1 = Gg ∩Rk

1 for all g ∈ G, g = f, Rk1)

= E

f∈G

IGf ∩Rk1 =Gg∩Rk

1 for all g∈G,g =f

=∑

f∈G

PGf ∩Rk1 = Gg ∩Rk

1 for all g ∈ G, g = f

=∑

f∈G

(1 − P∃g ∈ G, g = f,Gf ∩Rk1 = Gg ∩Rk

1)

≥∑

f∈G

(1 −m maxg∈G,g =f

PGf ∩Rk1 = Gg ∩Rk

1). (9.13)

Fix f, g ∈ G, f = g. By the independence and identical distribution ofR1, . . . , Rk,

PGf ∩Rk1 = Gg ∩Rk

1= PGf ∩ R1 = Gg ∩ R1, . . . , Gf ∩ Rk = Gg ∩ Rk= (PGf ∩ R1 = Gg ∩ R1)k.

Page 166: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

9.4. Shatter Coefficients and VC Dimension 149

Now

PGf ∩ R1 = Gg ∩ R1= 1 − E PGf ∩ R1 = Gg ∩ R1

∣∣Q1

= 1 − EPf(Q1) < T1 ≤ g(Q1) or g(Q1) < T1 ≤ f(Q1)

∣∣Q1

= 1 − E |f(Q1) − g(Q1)|

B

= 1 − 1B

∫|f(x) − g(x)|ν(dx)

≤ 1 − ε

B

since f and g are ε-separated. Hence

PGf ∩Rk1 = Gg ∩Rk

1 ≤(1 − ε

B

)k

≤ exp(

−εk

B

)

which, together with (9.13), implies

S(G+, k) ≥∑

f∈G

(1 −m max

g∈G,g =fPGf ∩Rk

1 = Gg ∩Rk1)

≥ m

(1 −m exp

(−εk

B

)). (9.14)

Set k =⌊

Bε log(2m)

⌋. Then

1 −m exp(

−εk

B

)

≥ 1 −m exp(

− ε

B

(B

εlog(2m) − 1

))= 1 − 1

2exp

( εB

)

≥ 1 − 12

exp(

14

)≥ 1 − 1

2· 1.3 ≥ 1

3,

hence

M (ε,G, ‖ · ‖L1(ν)

)= m ≤ 3S(G+, k), (9.15)

where k =⌊

Bε log

(2M (

ε,G, ‖ · ‖L1(ν)))⌋

.Step 2. Application of Theorem 9.3.If

k =⌊B

εlog

(2M (

ε,G, ‖ · ‖L1(ν)))⌋

≤ VG+

then

Page 167: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

150 9. Uniform Laws of Large Numbers

M (ε,G, ‖ · ‖L1(ν)

)

≤ 12

exp(ε(VG+ + 1)

B

)≤ e

2exp(VG+) ≤ 3

(2eBε

log3eBε

)VG+

,

where we have used 0 < ε ≤ B/4. Therefore it suffices to prove the assertionin the case

k > VG+ .

In this case we can apply Theorem 9.3 to (9.15) and conclude

M (ε,G, ‖ · ‖L1(ν)

) ≤ 3(ek

VG+

)VG+

≤ 3(eB

εVG+log

(2M (

ε,G, ‖ · ‖L1(ν))))VG+

.

Step 3. Let a ∈ R+, b ∈ N , with a ≥ e and b ≥ 2. We will show that

x ≤ 3ab

log(2x)b

implies

x ≤ 3(2a log(3a))b. (9.16)

Setting a = eBε and b = VG+ this, together with Step 2, implies the assertion

in the case p = 1.Note that

x ≤ 3ab

log(2x)b

is equivalent to

(2x)1/b ≤ 61/b a

blog(2x) = 61/ba log((2x)1/b).

Set u = (2x)1/b and c = 61/ba. Then e ≤ a ≤ c and the last inequality canbe rewritten

u ≤ c log(u). (9.17)

We will show momentarily that this implies

u ≤ 2c log(c). (9.18)

From (9.18) one easily concludes (9.16). Indeed,

x =12ub ≤ 1

2(2c log c)b =

12(2 · 61/ba log(61/ba))b ≤ 3(2a log(3a))b,

where the last inequality follows from 61/b ≤ 3 for b ≥ 2.In conclusion we will show that (9.17) implies (9.18). Set f1(u) = u and

f2(u) = c log(u). Then it suffices to show

f1(u) > f2(u)

Page 168: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

9.4. Shatter Coefficients and VC Dimension 151

for u > 2c log(c). Because

f ′1(u) = 1 ≥ 1

2 log(e)≥ 1

2 log(c)=

c

2c log(c)≥ c

u= f ′

2(u)

for u > 2c log(c), this is equivalent to

f1(2c log(c)) > f2(2c log(c)).

This in turn is equivalent to

2c log(c) > c log(2c log(c)) ⇔ 2c log(c) > c log(2) + c log(c) + c log(log(c))

⇔ c log(c) − c log(2) − c log(log(c)) > 0

⇔ log(

c

2 log(c)

)> 0

⇔ c > 2 log(c). (9.19)

Set g1(v) = v and g2(v) = 2 log(v). Then

g1(e) = e > 2 log(e) = g2(e)

and for v ≥ e one has

g′1(v) = 1 ≥ 2

v= g′

2(v).

This proves

g1(v) > g2(v)

for v ≥ e, which together with c ≥ e implies (9.19). Steps 1 to 3 imply theassertion in the case p = 1.

Step 4. Let 1 < p < ∞. Then for any gj , gk ∈ G,

‖gj − gk‖pLp(ν) ≤ Bp−1‖gj − gk‖L1(ν).

Therefore any ε-packing of G w.r.t. ‖ · ‖Lp(ν) is also an εp

Bp−1 packing ofG w.r.t. ‖ · ‖L1(ν) which, together with the results of the first three steps,implies

M (ε,G, ‖ · ‖Lp(ν)

) ≤ M(

εp

Bp−1 ,G, ‖ · ‖L1(ν)

)

≤ 3(

2eBεp/Bp−1 log

3eBεp/Bp−1

)VG+

.

= 3(

2eBp

εplog

3eBp

εp

)VG+

.

The proof is complete.

In order to derive, via Lemma 9.2 and Theorem 9.4, upper bounds on Lp

packing numbers, all we need now is an upper bound on the VC dimension

Page 169: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

152 9. Uniform Laws of Large Numbers

VG+ . We have

G+ =

(z, t) ∈ Rd × R : t ≤ g(z)

: g ∈ G

⊆ (z, t) ∈ Rd × R : α · t+ g(z) ≥ 0

: g ∈ G, α ∈ R

.

If G is a linear vector space of dimensionK, then α·t+g(z) : g ∈ G, α ∈ Ris a linear vector space of dimension r = K+1 and by the following theoremwe get VG+ ≤ r.

Theorem 9.5. Let G be an r-dimensional vector space of real functionson Rd, and set

A =

z : g(z) ≥ 0 : g ∈ G.

Then

VA ≤ r.

Proof. It suffices to show that no set of size r+1 can be shattered by setsof the form z : g(z) ≥ 0, g ∈ G.

Choose any collection z1, . . . , zr+1 of distinct points from Rd. Definethe linear mapping L : G → Rr+1 by

L(g) = (g(z1), . . . , g(zr+1))T (g ∈ G).

Denote the image of G by LG. Clearly, LG is a linear subspace of the (r+1)-dimensional space Rr+1, and the dimension of LG is less than or equal tothe dimension of G, i.e., it is at most r. Hence there exists a nonzero vector

γ = (γ1, . . . , γr+1)T ∈ Rr+1,

that is orthogonal to LG, i.e., that satisfies

γ1g(z1) + · · · + γr+1g(zr+1) = 0 for all g ∈ G. (9.20)

Replacing γ by −γ if necessary, we may assume that at least one of the γi’sis negative. Equation (9.20) implies

i:γi≥0

γig(zi) =∑

i:γi<0

(−γi)g(zi) for all g ∈ G. (9.21)

Suppose there is a g ∈ G for which z : g(z) ≥ 0 picks out preciselythose points zi with γi ≥ 0. For this g, the left-hand side of (9.21) would benonnegative (because γi ≥ 0 implies g(zi) ≥ 0 and hence γig(zi) ≥ 0), butthe right-hand side would be negative (because γi < 0 implies g(zi) < 0and hence −γig(zi) < 0). This is a contradiction, so z1, . . . , zr+1 cannotbe shattered and the proof is complete.

For the sake of illustration let us compare the results which we canderive from Theorem 9.4 with Lemma 9.3. Let F be a linear vector space

Page 170: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

9.5. A Uniform Law of Large Numbers 153

of dimension D consisting of functions f : Rd → R and let ε, R > 0 andz1, . . . , zn ∈ Rd. Then Lemma 9.2, and Theorems 9.4 and 9.5 imply

N2 (ε, f ∈ F : ‖f‖∞ ≤ R , zn1 ) ≤ 3

(2e(2R)2

ε2· log

3e(2R)2

ε2

)D+1

,

(9.22)while by Lemma 9.3 we have

N2

(

ε,

f ∈ F :1n

n∑

i=1

|f(zi)|2 ≤ R2

, zn1

)

≤(

4R+ ε

ε

)D

. (9.23)

Because of

f ∈ F : ‖f‖∞ ≤ R ⊆

f ∈ F :1n

n∑

i=1

|f(zi)|2 ≤ R2

,

formula (9.23) implies

N2 (ε, f ∈ F : ‖f‖∞ ≤ R , zn1 ) ≤

(4R+ ε

ε

)D

,

so in this special case we get a bound similar to (9.22).The advantage of (9.23) in comparison to (9.22) is that in some special

cases (cf. Chapter 19) we will be able to apply it for bounds R on theempirical L2 norm of the functions of similar size as ε, in which case thecovering number will be bounded by

constD+1.

On the other hand, in all of our applications the bound in (9.22) will alwaysbe much larger than the above term, because the bound on the supremumnorm of the functions will always be much larger than ε.

The advantage of Theorem 9.4 is that it can (and will) be applied tomuch more general situations than Lemma 9.3.

9.5 A Uniform Law of Large Numbers

Let Z,Z1, Z2, . . . be independent and identically distributed random vari-ables with values in Rd. Let G be a class of functions g : Rd → R such thatEg(Z) exists for each g ∈ G. One says that G satisfies the uniform law oflarge numbers (ULLN), if

supg∈G

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)

∣∣∣∣∣→ 0 (n → ∞) a.s.

In order to illustrate our previous results we apply them to derive thefollowing ULLN.

Page 171: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

154 9. Uniform Laws of Large Numbers

g1g2g3

Figure 9.6. Envelope of g1, g2, g3.

Theorem 9.6. Let G be a class of functions g : Rd → R and let

G : Rd → R, G(x) := supg∈G

|g(x)| (x ∈ Rd)

be an envelope of G. Assume EG(Z) < ∞ and VG+ < ∞. Then

supg∈G

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)

∣∣∣∣∣→ 0 (n → ∞) a.s.

Proof. For L > 0 set

GL :=g · IG≤L : g ∈ G .

For any g ∈ G,∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)

∣∣∣∣∣

≤∣∣∣∣∣1n

n∑

i=1

g(Zi) − 1n

n∑

i=1

g(Zi) · IG(Zi)≤L

∣∣∣∣∣

+

∣∣∣∣∣1n

n∑

i=1

g(Zi) · IG(Zi)≤L − Eg(Z) · IG(Z)≤L

∣∣∣∣∣

+∣∣E

g(Z) · IG(Z)≤L

− E g(Z)∣∣

≤∣∣∣∣∣1n

n∑

i=1

g(Zi) · IG(Zi)≤L − Eg(Z) · IG(Z)≤L

∣∣∣∣∣

Page 172: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

9.5. A Uniform Law of Large Numbers 155

+1n

n∑

i=1

G(Zi) · IG(Zi)>L + EG(Z) · IG(Z)>L

,

which implies

supg∈G

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)

∣∣∣∣∣

≤ supg∈GL

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)

∣∣∣∣∣

+1n

n∑

i=1

G(Zi) · IG(Zi)>L + EG(Z) · IG(Z)>L

.

By EG(Z) < ∞ and the strong law of large numbers we get

1n

n∑

i=1

G(Zi) · IG(Zi)>L → EG(Z) · IG(Z)>L

(n → ∞) a.s.

and

EG(Z) · IG(Z)>L

→ 0 (L → ∞).

Hence, it suffices to show

supg∈GL

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)

∣∣∣∣∣→ 0 (n → ∞) a.s. (9.24)

Let ε > 0 be arbitrary. The functions in GL are bounded in absolute valueby L. Application of Theorem 9.1, Lemma 9.2, and Theorem 9.4 yields

P

supg∈GL

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)

∣∣∣∣∣> ε

≤ 8E

N1

( ε8,GL, Z

n1

)· exp

(− nε2

128(2L)2

)

≤ 8E

M1

( ε8,GL, Z

n1

)· exp

(− nε2

512L2

)

≤ 24(

2e(2L)ε/8

· log3e(2L)ε/8

)VG+L · exp

(− nε2

512L2

).

If (x1, y1), . . . , (xk, yk) is shattered by G+L , then |yl| ≤ G(xl) (l = 1, . . . , k)

and, hence, is also shattered by G+. Therefore VG+L

≤ VG+ which, togetherwith the above results, yields

Page 173: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

156 9. Uniform Laws of Large Numbers

P

supg∈GL

∣∣∣∣∣1n

n∑

i=1

g(Zi) − Eg(Z)

∣∣∣∣∣> ε

≤ 24(

2e(2L)ε/8

· log3e(2L)ε/8

)VG+

· exp(

− nε2

512L2

).

The right-hand side is summable for each ε > 0 which, together with theBorel–Cantelli lemma, implies (9.24).

9.6 Bibliographic Notes

Theorem 9.1 is due to Pollard (1984). The symmetrization technique usedin the proof of Pollard’s inequality follows the ideas of Vapnik and Chervo-nenkis (1971), which were later extended by Dudley (1978), Pollard (1984),and Gine (1996). Theorem 9.1 is a generalization of the celebrated Vapnik-Chervonenkis inequality for uniform deviations of relative frequencies fromtheir probabilities, see Vapnik and Chervonenkis (1971).

Various extensions of this inequality were provided in Devroye (1982a),Alexander (1984), Massart (1990), Talagrand (1994), van der Vaart andWellner (1996), and Vapnik (1998). Various aspects of the empirical pro-cess theory and computational learning are discussed in Alexander (1984),Dudley (1984), Shorack and Wellner (1986), Pollard (1989; 1990), Ledouxand Talagrand (1991), Talagrand (1994), Gine (1996), Devroye, Gyorfi,and Lugosi (1996), Ledoux (1996), Gaenssler and Ross (1999), van de Geer(2000), van der Vaart and Wellner (1996), Vapnik (1998), and Bartlettand Anthony (1999). Theorem 9.2, known in the literature as the Sauerlemma, has been proved independently by Vapnik and Chervonenkis (1971),Sauer (1972), and Shelah (1972) and its extensions were studied by Szarekand Talagrand (1997), Alesker (1997), and Alon, Ben-David, and Haussler(1997).

The inequality of Theorem 9.4 is, for p = 1, due to Haussler (1992). The-orem 9.5 was proved by Steele (1975) and Dudley (1978). There are muchmore general versions of uniform laws of large numbers in the literaturethan Theorem 9.6, see, e.g., van de Geer (2000).

Problems and Exercises

Problem 9.1. (a) Let A be the class of all intervals in R of the form (−∞, b](b ∈ R). Show

S(A, n) = n + 1.

(b) Let A be the class of all intervals in R of the form (a, b] (a, b ∈ R). Show

S(A, n) =n · (n + 1)

2.

Page 174: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

Problems and Exercises 157

(c) Generalize (a) and (b) to the multivariate case.

Problem 9.2. (a) Show that the VC dimension of the set of all intervals in Rd,of the form

(−∞, x1] × · · · × (−∞, xd] (x1, . . . , xd ∈ R)

is d.(b) Show that the VC dimension of the set of all intervals in Rd, of the form

(x1, y1] × · · · × (xd, yd] (x1, y1, . . . , xd, yd ∈ R),

is 2 · d.

Problem 9.3. (a) Determine the VC dimension of the set of all balls in R2.(b) Use Lemma 9.5 to derive an upper bound for the VC dimension of the set ofall balls in Rd.

Problem 9.4. Let A be a class of sets A ⊆ Rd. Show for any p ≥ 1, anyz1, . . . , zn ∈ Rd, and any 0 < ε < 1,

Np (ε, IA : A ∈ A , zn1 ) ≤ s (A, z1, . . . , zn) .

Hint: Use

1n

n∑

i=1

|g1(zi) − g2(zi)|p1/p

≤ maxi=1,...,n

|g1(zi) − g2(zi)|.

Problem 9.5. Let Z, Z1, Z2, . . . be i.i.d. real-valued random variables. Let F bethe distribution function of Z given by

F (t) = PZ ≤ t = EI(−∞,t](Z),

and let Fn be the empirical distribution function of Z1, . . . , Zn given by

Fn(t) =#1 ≤ i ≤ n : Zi ≤ t

n=

1n

n∑

i=1

I(−∞,t](Zi).

(a) Show, for any 0 < ε < 1,

P

supt∈R

|Fn(t) − F (t)| ≥ ε

≤ 8 · (n + 1) · exp

(−n · ε2

128

).

(b) Conclude from a) the Glivenko–Cantelli theorem:

supt∈R

|Fn(t) − F (t)| → 0 (n → ∞) a.s.

(c) Generalize (a) and (b) to multivariate (empirical) distribution functions.Hint: Apply Theorem 9.1 and Problems 9.1 and 9.4.

Page 175: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

10Least Squares Estimates I: Consistency

In this chapter we show how one can use the techniques introduced inChapter 9 to derive sufficient conditions for the consistency of various leastsquares estimates.

10.1 Why and How Least Squares?

We know from Section 1.1 that the regression function m satisfies

E(m(X) − Y )2

= inf

fE(f(X) − Y )2

,

where the infimum is taken over all measurable functions f : Rd → R, thusE(m(X) − Y )2

can be computed by minimizing E

(f(X) − Y )2

over

all measurable functions. Clearly, this is impossible in the regression func-tion estimation problem, because the functional to be optimized dependson the unknown distribution of (X,Y ).

The idea of the least squares principle is to estimate the L2 risk

E(f(X) − Y )2

by the empirical L2 risk

1n

n∑

j=1

|f(Xj) − Yj |2 (10.1)

and to choose as a regression function estimate a function that minimizesthis empirical L2 risk.

Page 176: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

10.1. Why and How Least Squares? 159

If X1, . . . , Xn are all distinct (which happens with probability 1 if X hasa density), then minimizing (10.1) leads to an estimate interpolating thedata (X1, Y1), . . . , (Xn, Yn) and having empirical L2 risk 0. Obviously, suchan estimate will not be consistent in general.

Therefore one first chooses a “suitable” class of functions Fn (maybedepending on the data but, at least, depending on the sample size n) andthen selects a function from this class which minimizes the empirical L2risk, i.e., one defines the estimate mn by

mn(·) = arg minf∈Fn

1n

n∑

j=1

|f(Xj) − Yj |2, (10.2)

which means, by definition,

mn ∈ Fn and1n

n∑

j=1

|mn(Xj) − Yj |2 = minf∈Fn

1n

n∑

j=1

|f(Xj) − Yj |2.

Here we assumed the existence of minimizing functions, though not neces-sarily their uniqueness. In cases where the minima do not exist, the sameanalysis can be carried out with functions whose error is arbitrarily close tothe infimum but, for the sake of simplicity, we maintain the assumption ofexistence throughout the book. We will show later (see (10.4) and (10.5))that in most of our applications the minima indeed exist.

The class of candidate functions grows as the sample size n grows. Thisis the “method of sieves,” introduced by Grenander (1981).

The choice of Fn has two effects on the error of the estimate. On onehand, if Fn is not too “massive” (and we will later give precise conditions onFn using the concepts introduced in Chapter 9), then the empirical L2 riskwill be close to the L2 risk uniformly over Fn. Thus the error introducedby minimizing the empirical L2 risk instead of the L2 risk will be small. Onthe other hand, because of the requirement that our estimate is containedin Fn, it cannot be better (with respect to the L2 error) than for the bestfunction in Fn. This is formulated in the following lemma:

Lemma 10.1. Let Fn = Fn(Dn) be a class of functions f : Rd → Rdepending on the data Dn = (X1, Y1), . . . , (Xn, Yn). If mn satisfies (10.2)then

∫|mn(x) −m(x)|2µ(dx)

≤ 2 supf∈Fn

∣∣∣∣∣∣

1n

n∑

j=1

|f(Xj) − Yj |2 − E(f(X) − Y )2

∣∣∣∣∣∣

+ inff∈Fn

∫|f(x) −m(x)|2µ(dx).

Page 177: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

160 10. Least Squares Estimates I: Consistency

Proof. It follows from Section 1.1 (cf. (1.1)) that∫

|mn(x) −m(x)|2µ(dx) = E|mn(X) − Y |2|Dn − E|m(X) − Y |2

=(E|mn(X) − Y |2|Dn − inf

f∈Fn

E|f(X) − Y |2)

+(

inff∈Fn

E|f(X) − Y |2 − E|m(X) − Y |2). (10.3)

By (1.1)

inff∈Fn

E|f(X) − Y |2 − E|m(X) − Y |2 = inff∈Fn

∫|f(x) −m(x)|2µ(dx).

Thus all we need is an upper bound for the first term. By (10.2), one gets

E

|mn(X) − Y |2∣∣∣Dn

− inf

f∈Fn

E|f(X) − Y |2

= supf∈Fn

(

E

|mn(X) − Y |2∣∣∣Dn

− 1n

n∑

j=1

|mn(Xj) − Yj |2

+1n

n∑

j=1

|mn(Xj) − Yj |2 − 1n

n∑

j=1

|f(Xj) − Yj |2

+1n

n∑

j=1

|f(Xj) − Yj |2 − E|f(X) − Y |2

)

≤ supf∈Fn

(

E

|mn(X) − Y |2∣∣∣Dn

− 1n

n∑

j=1

|mn(Xj) − Yj |2

+1n

n∑

j=1

|f(Xj) − Yj |2 − E|f(X) − Y |2

)

≤ 2 supf∈Fn

∣∣∣∣∣∣

1n

n∑

j=1

|f(Xj) − Yj |2 − E|f(X) − Y |2

∣∣∣∣∣∣.

Often the first term on the right-hand side of (10.3), i.e.,

E|mn(X) − Y |2|Dn − inff∈Fn

E|f(X) − Y |2,

is called the estimation error, and the second term, i.e.,

inff∈Fn

E|f(X) − Y |2 − E|m(X) − Y |2 = inff∈Fn

∫|f(x) −m(x)|2µ(dx),

Page 178: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

10.1. Why and How Least Squares? 161

m∗

mn

mapproximationerror

estimationerror

Figure 10.1. Approximation and estimation errors.

is called the approximation error of the estimator. The estimation errormeasures the distance between the L2 risk of the estimate and the L2 riskof the best function in Fn. The approximation error measures how well theregression function can be approximated by functions of Fn in L2 (compareFigure 10.1).

In order to get universally consistent estimates it suffices to show thatboth terms converge to 0 for all distributions of (X,Y ) with EY 2 < ∞.

For the approximation error this is often quite simple: If, for example,Fn is nested, that is, Fn ⊆ Fn+1 for all n, then

limn→∞

inff∈Fn

∫|f(x) −m(x)|2µ(dx) = 0

for all measures µ and all m ∈ L2 simply means that⋃∞

n=1 Fn is dense inL2 = L2(µ) for all distributions µ. This is true, for example, if

⋃∞n=1 Fn is

dense in C∞0 (Rd) with respect to the sup norm ‖·‖∞ since C∞

0 (Rd) is densein L2(µ) for all distributions µ and

∫ |f(x)−m(x)|2µ(dx) ≤ ‖f −m‖2∞ (cf.

Corollary A.1).The estimation error is more difficult. The main tools for analyzing it are

exponential distribution-free inequalities for a uniform distance of the L2risk from empirical L2 risk over the class Fn, e.g., the inequalities whichwe have introduced in Chapter 9.

The inequalities we will use require the uniform boundedness of|f(X) − Y |2 over Fn. We will see in the next section that it suffices toshow the convergence of the estimation error to 0 for bounded Y . So letus assume for the rest of this section that Y is bounded, i.e., |Y | ≤ L a.s.for some L > 0. Then, in order to ensure that |f(X) − Y |2 is uniformlybounded over Fn, one can simply choose Fn such that all functions in Fn

Page 179: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

162 10. Least Squares Estimates I: Consistency

are bounded by a constant βn > 0, depending on the sample size n andconverging to infinity (this is needed for the approximation error). A typicalresult which one can get in this way is the following theorem:

Theorem 10.1. Let ψ1, ψ2, . . . : Rd → R be bounded functions with|ψj(x)| ≤ 1. Assume that the set of functions

∞⋃

K=1

K∑

j=1

ajψj(x) : a1, . . . , aK ∈ R

is dense in L2(µ) for any probability measure µ on Rd. Define the regressionfunction estimate mn as a function minimizing the empirical L2 risk

1n

n∑

i=1

(f(Xi) − Yi)2

over functions f(x) =∑Kn

j=1 ajψj(x) with∑Kn

j=1 |aj | ≤ βn. If EY 2 < ∞,and Kn and βn satisfy

Kn → ∞, βn → ∞,Knβ

4n log βn

n→ 0 and

β4n

n1−δ→ 0

for some δ > 0, then∫

(mn(x) − m(x))2µ(dx) → 0 with probability one,i.e., the estimate is strongly universally consistent.

For the proof see Problem 10.2.Unfortunately, the assumption that all functions in Fn are bounded by

a constant βn > 0, makes the computation of the estimator difficult. Inmost of our applications Fn will be defined as a set of linear combinationsof some basis functions. Uniform boundedness in this case means that onehas to restrict the values of the coefficients of these linear combinations,i.e., one would choose

Fn =

Kn∑

j=1

ajfj,n :Kn∑

j=1

|aj | ≤ βn

for some bounded basis functions f1,n, . . . , fKn,n. To compute a func-tion which minimizes the empirical L2 risk over such a class one has tosolve a quadratic minimization problem with inequality constraints for thecoefficients aj . There is no known fast algorithm which can do this.

If one doesn’t require the uniform boundedness of Fn then the compu-tation of a function which minimizes the empirical L2 risk is much easier,since for arbitrary functions f1,n, . . . , fKn,n and

Fn =

Kn∑

j=1

ajfj,n : aj ∈ R (j = 1, . . . ,Kn)

Page 180: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

10.1. Why and How Least Squares? 163

f

L

−L

TLf

Figure 10.2. Truncation of function.

equation (10.2) is equivalent to mn =∑Kn

j=1 ajfj,n with

‖Y − Ba‖22 = inf

b∈RKn‖Y − Bb‖2

2, (10.4)

where ‖ · ‖2 is the Euclidean norm in RKn ,

Y = (Y1, . . . , Yn)T , B = (fj,n(Xi))i=1,...,n, j=1,...,Kn,

and

a = (a1, . . . , aKn)T .

It is well-known from numerical mathematics (cf. Stoer (1993), Chapter4.8.1) that (10.4) is equivalent to

BTBa = BTY (10.5)

and that a solution of (10.5) always exists. Thus all one has to do to com-pute a function which minimizes the empirical L2 risk is to solve a systemof linear equations.

Therefore, we do not require uniform boundedness of Fn. To ensure theconsistency of the estimator (cf. Problem 10.3) we truncate it after thecomputation, i.e., we choose mn such that

mn ∈ Fn and1n

n∑

j=1

|mn(Xj) − Yj |2 = minf∈Fn

1n

n∑

j=1

|f(Xj) − Yj |2 (10.6)

and define the estimate mn by truncation of mn,

mn(x) = Tβnmn(x), (10.7)

where TL is the truncation operator

TLu =u if |u| ≤ L,L sign(u) otherwise,

Page 181: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

164 10. Least Squares Estimates I: Consistency

(compare Figure 10.2). The next lemma shows that this estimate behavessimilarly to an estimate defined by empirical L2 risk minimization over aclass of truncated functions

TβnFn = Tβn

f : f ∈ Fn.Lemma 10.2. Let Fn = Fn(Dn) be a class of functions f : Rd → R. Ifmn satisfies (10.6) and (10.7) and |Y | ≤ βn a.s., then

∫|mn(x) −m(x)|2µ(dx)

≤ 2 supf∈Tβn Fn

∣∣∣∣∣∣

1n

n∑

j=1

|f(Xj) − Yj |2 − E(f(X) − Y )2

∣∣∣∣∣∣

+ inff∈Fn, ‖f‖∞≤βn

∫|f(x) −m(x)|2µ(dx).

Proof. Using the decomposition from the proof of Lemma 10.1, with Fn

replaced by f ∈ Fn : ‖f‖∞ ≤ βn, one gets∫

|mn(x) −m(x)|2µ(dx)

= E|mn(X) − Y |2|Dn − inff∈Fn,‖f‖∞≤βn

E|f(X) − Y |2

+ inff∈Fn, ‖f‖∞≤βn

∫|f(x) −m(x)|2µ(dx).

Now

E

|mn(X) − Y |2∣∣∣Dn

− inf

f∈Fn,‖f‖∞≤βn

E|f(X) − Y |2

≤ supf∈Fn,‖f‖∞≤βn

E

|mn(X) − Y |2∣∣∣Dn

− 1n

n∑

j=1

|mn(Xj) − Yj |2

+1n

n∑

j=1

|mn(Xj) − Yj |2 − 1n

n∑

j=1

|mn(Xj) − Yj |2

+1n

n∑

j=1

|mn(Xj) − Yj |2 − 1n

n∑

j=1

|f(Xj) − Yj |2

+1n

n∑

j=1

|f(Xj) − Yj |2 − E|f(X) − Y |2

.

Because of (10.6) the third term on the right-hand side is less than or equalto 0. The same is true for the second term, since if u, v ∈ R, |v| ≤ βn, and

Page 182: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

10.2. Consistency from Bounded to Unbounded Y 165

u = Tβnu, then |u − v| ≤ |u − v|. Therefore the assertion follows from

mn ∈ TβnFn and f ∈ Fn : ‖f‖∞ ≤ βn ⊆ TβnFn.

10.2 Consistency from Bounded to Unbounded Y

The aim of this section is to prove Theorem 10.2, which extends Lemma10.2 to unbounded Y . To formulate this theorem we use the notation

YL = TLY

and

Yi,L = TLYi.

Theorem 10.2. Let Fn = Fn(Dn) be a class of functions f : Rd → Rand assume that the estimator mn satisfies (10.6) and (10.7).(a) If

limn→∞

βn = ∞, (10.8)

limn→∞

inff∈Fn, ‖f‖∞≤βn

∫|f(x) −m(x)|2µ(dx) = 0 a.s., (10.9)

limn→∞

supf∈Tβn Fn

∣∣∣∣∣∣

1n

n∑

j=1

|f(Xj) − Yj,L|2 − E(f(X) − YL)2

∣∣∣∣∣∣= 0 (10.10)

a.s. for all L > 0, then

limn→∞

∫|mn(x) −m(x)|2µ(dx) = 0 a.s.

(b) If (10.8) is fulfilled and

limn→∞

E

inff∈Fn, ‖f‖∞≤βn

∫|f(x) −m(x)|2µ(dx)

= 0, (10.11)

limn→∞

E

sup

f∈Tβn Fn

∣∣∣∣∣∣

1n

n∑

j=1

|f(Xj) − Yj,L|2 − E(f(X) − YL)2

∣∣∣∣∣∣

= 0

(10.12)for all L > 0, then

limn→∞

E∫

|mn(x) −m(x)|2µ(dx)

= 0.

Observe that in the above theorem Fn may depend on the data, and hence

inff∈Fn,‖f‖∞≤βn

∫|f(x) −m(x)|2µ(dx)

Page 183: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

166 10. Least Squares Estimates I: Consistency

is a random variable.

Proof. (a) Because of∫

Rd

|mn(x) −m(x)|2µ(dx) = E |mn(X) − Y |2∣∣Dn

− E|m(X) − Y |2

it suffices to showE |mn(X) − Y |2∣∣Dn

12 −

E|m(X) − Y |212 → 0 a.s. (10.13)

We use the decomposition

0 ≤ E |mn(X) − Y |2∣∣Dn

12 −

E|m(X) − Y |212

=(E |mn(X) − Y |2∣∣Dn

12 − inf

f∈Fn,‖f‖∞≤βn

E|f(X) − Y |2

12

)

+(

inff∈Fn,‖f‖∞≤βn

E|f(X) − Y |2

12 −

E|m(X) − Y |212

). (10.14)

It follows from (10.9), by the triangle inequality, that

inff∈Fn,‖f‖∞≤βn

E|f(X) − Y |2

12 −

E|m(X) − Y |212

≤ inff∈Fn,‖f‖∞≤βn

∣∣∣E|f(X) − Y |2

12 −

E|m(X) − Y |212

∣∣∣

≤ inff∈Fn,‖f‖∞≤βn

E|(f(X) − Y ) − (m(X) − Y )|2

12

= inff∈Fn,‖f‖∞≤βn

∫|f(x) −m(x)|2µ(dx)

12

→ 0 (n → ∞) a.s.

Therefore for (10.13) we have to show that

lim supn→∞

(E |mn(X) − Y |2∣∣Dn

12

− inff∈Fn,‖f‖∞≤βn

E|f(X) − Y |2

12

)

≤ 0 a.s. (10.15)

To this end, let L > 0 be arbitrary. Because of (10.8) we can assume w.l.o.g.that βn > L. Then

E |mn(X) − Y |2∣∣Dn

12 − inf

f∈Fn,‖f‖∞≤βn

E|f(X) − Y |2

12

= supf∈Fn,‖f‖∞≤βn

E |mn(X) − Y |2∣∣Dn

12 −

E|f(X) − Y |212

Page 184: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

10.2. Consistency from Bounded to Unbounded Y 167

≤ supf∈Fn,‖f‖∞≤βn

E |mn(X) − Y |2∣∣Dn

12

−E |mn(X) − YL|2∣∣Dn

12

+E |mn(X) − YL|2∣∣Dn

12 −

1n

n∑

i=1

|mn(Xi) − Yi,L|2 1

2

+

1n

n∑

i=1

|mn(Xi) − Yi,L|2 1

2

1n

n∑

i=1

|mn(Xi) − Yi,L|2 1

2

+

1n

n∑

i=1

|mn(Xi) − Yi,L|2 1

2

1n

n∑

i=1

|mn(Xi) − Yi|2 1

2

+

1n

n∑

i=1

|mn(Xi) − Yi|2 1

2

1n

n∑

i=1

|f(Xi) − Yi|2 1

2

+

1n

n∑

i=1

|f(Xi) − Yi|2 1

2

1n

n∑

i=1

|f(Xi) − Yi,L|2 1

2

+

1n

n∑

i=1

|f(Xi) − Yi,L|2 1

2

− E|f(X) − YL|2

12

+E|f(X) − YL|2

12 −

E|f(X) − Y |212

.

Now we give upper bounds for the terms in each row on the right-hand sideof the last inequality: The second and seventh term are bounded above by

supf∈Tβn Fn

∣∣∣∣∣∣

1n

n∑

i=1

|f(Xi) − Yi,L|2 1

2

− E|f(X) − YL|2

12

∣∣∣∣∣∣

(observe that mn ∈ TβnFn). Because of (10.6) the fifth term is boundedabove by zero. For the third term observe that, if u, v ∈ R with |v| ≤ βn

and u = Tβn u, then |u− v| ≤ |u− v|. Therefore the third term is also notgreater than zero.

Using these upper bounds and the triangle inequality for the remainingterms one gets

Page 185: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

168 10. Least Squares Estimates I: Consistency

E |mn(X) − Y |2∣∣Dn

12 − inf

f∈Fn,‖f‖∞≤βn

E|f(X) − Y |2

12

≤ 2 · E|Y − YL|212 + 2 ·

1n

n∑

i=1

|Yi − Yi,L|2 1

2

+ 2 · supf∈Tβn Fn

∣∣∣∣∣∣

1n

n∑

i=1

|f(Xi) − Yi,L|2 1

2

− E|f(X) − YL|2

12

∣∣∣∣∣∣.

Equation (10.10), the uniform continuity of x → √x on [0,∞), and the

strong law of large numbers imply

lim supn→∞

(E |mn(X) − Y |2∣∣Dn

12 − inf

f∈Fn,‖f‖∞≤βn

E|f(X) − Y |2

12

)

≤ 4 · E|Y − YL|212 a.s.

One gets the assertion with L → ∞.(b) Because of

Rd

|mn(x) −m(x)|2µ(dx)

= E|mn(X) − Y |2|Dn − E|m(X) − Y |2=

((E|mn(X) − Y |2|Dn)

12 − (

E|m(X) − Y |2)12)

×((

E|mn(X) − Y |2|Dn)12 +

(E|m(X) − Y |2)

12)

=((

E|mn(X) − Y |2|Dn)12 − (

E|m(X) − Y |2)12)2

+2(E|m(X) − Y |2)

12((

E|mn(X) − Y |2|Dn)12

− (E|m(X) − Y |2)

12)

it suffices to show that

E((

E|mn(X) − Y |2|Dn)12 − (

E|m(X) − Y |2)12)2

→ 0 (n → ∞).

We use the same error decomposition as in (a):

E((

E|mn(X) − Y |2|Dn)12 − (

E|m(X) − Y |2)12)2

≤ 2E(E|mn(X) − Y |2|Dn)

12 − inf

f∈Fn,‖f‖∞≤βn

(E|f(X) − Y |2)

12

2

+ 2E

inff∈Fn,‖f‖∞≤βn

(E|f(X) − Y |2)

12 − (

E|m(X) − Y |2)12

2

.

Page 186: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

10.2. Consistency from Bounded to Unbounded Y 169

By the triangle inequality and (10.11), one gets, for the second term on theright-hand side of the last inequality,

2E(

inff∈Fn,‖f‖∞≤βn

(E|f(X) − Y |2)

12 − (

E|m(X) − Y |2)12

)2

≤ 2E(

inff∈Fn,‖f‖∞≤βn

(E|f(X) −m(X)|2)

12

)2

= 2E

inff∈Fn,‖f‖∞≤βn

E|f(X) −m(X)|2

→ 0 (n → ∞).

Thus it suffices to show that

E(

E|mn(X) − Y |2|Dn)12

− inff∈Fn,‖f‖∞≤βn

(E|f(X) − Y |2)

12

2

→ 0 (n → ∞). (10.16)

On one hand,

(E|mn(X) − Y |2|Dn)

12 − inf

f∈Fn,‖f‖∞≤βn

(E|f(X) − Y |2)

12

≥ (E|m(X) − Y |2)

12 − inf

f∈Fn,‖f‖∞≤βn

(E|f(X) − Y |2)

12

≥ − inff∈Fn,‖f‖∞≤βn

(∫

Rd

|f(x) −m(x)|2µ(dx)) 1

2

.

On the other hand, it follows from the proof of part (a) that

(E|mn(X) − Y |2|Dn)

12 − inf

f∈Fn,‖f‖∞≤βn

(E|f(X) − Y |2)

12

≤ 2(E|Y − YL|2)

12 + 2

(1n

n∑

i=1

|Yi − Yi,L|2) 1

2

+ 2 supf∈Tβn Fn

∣∣∣∣∣∣

(1n

n∑

i=1

|f(Xi) − Yi,L|2) 1

2

− (E|f(X) − YL|2)

12

∣∣∣∣∣∣.

The inequalities (a+ b+ c)2 ≤ 3a2 + 3b2 + 3c2 (a, b, c ∈ R) and

(√a−

√b)2 ≤ |√a−

√b| · |√a+

√b| = |a− b| (a, b ∈ R+)

imply

E(E|mn(X) − Y |2|Dn)

12 − inf

f∈Fn,‖f‖∞≤βn

(E|f(X) − Y |2)

12

2

Page 187: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

170 10. Least Squares Estimates I: Consistency

≤ E

inff∈Fn,‖f‖∞≤βn

Rd

|f(x) −m(x)|2µ(dx)

+12E|Y − YL|2 + 12E

1n

n∑

i=1

|Yi − Yi,L|2

+12E

supf∈Tβn Fn

∣∣∣∣∣1n

n∑

i=1

|f(Xi) − Yi,L|2 − E|f(X) − YL|2∣∣∣∣∣

→ 24E|Y − YL|2 (n → ∞),

where we have used (10.11), (10.12), and the strong law of large numbers.With L → ∞ the assertion follows.

10.3 Linear Least Squares Series Estimates

For the sake of illustration we formulate an analogue of Theorem 10.1:

Theorem 10.3. Let ψ1, ψ2, . . . : Rd → R be bounded functions. Assumethat the set of functions

∞⋃

k=1

k∑

j=1

ajψj(x) : a1, . . . , ak ∈ R

(10.17)

is dense in L2(µ) for any probability measure µ on Rd. Define the regressionfunction estimate mn as a function minimizing the empirical L2 risk

1n

n∑

i=1

(f(Xi) − Yi)2

over Fn =∑Kn

j=1 ajψj(x) : a1, . . . , aKn∈ R

, and put

mn(x) = Tβnmn(x).

(a) If EY 2 < ∞, and Kn and βn satisfy

Kn → ∞, βn → ∞, andKnβ

4n log βn

n→ 0, (10.18)

then

E∫

(mn(x) −m(x))2µ(dx) → 0,

i.e., the estimate is weakly universally consistent.(b) If EY 2 < ∞, and Kn and βn satisfy (10.18) and, in addition,

β4n

n1−δ→ 0 (10.19)

Page 188: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

10.3. Linear Least Squares Series Estimates 171

for some δ > 0, then∫

(mn(x) −m(x))2µ(dx) → 0 a.s.,

i.e., the estimate is strongly universally consistent.

In the proof we will use the denseness of (10.17) in L2(µ) in order to showthat the approximation error converges to zero as postulated in (10.9) and(10.11). In order to show that the estimation error converges to zero weuse that the “complexity” of the space Fn of functions (measured by itsvector space dimension Kn) is restricted by (10.18).

Proof. Because of Theorem 10.2 it suffices to show that (10.18) implies(10.11) and (10.12), and that (10.18) and (10.19) imply (10.9) and (10.10).

Proof of (10.9) and (10.11). Let ε > 0. By assumption,

∞⋃

k=1

k∑

j=1

ajψj(x) : a1, . . . , ak ∈ R

is dense in L2(µ), where µ denotes the distribution of X. It follows fromEY 2 < ∞ that m ∈ L2(µ). Hence there exist k∗ ∈ N and a∗

1, . . . , a∗k∗ ∈ R

such that

Rd

∣∣∣∣∣∣

k∗∑

j=1

a∗jψj(x) −m(x)

∣∣∣∣∣∣

2

µ(dx) < ε.

Since ψ1, ψ2 . . . are bounded,

supx∈Rd

∣∣∣∣∣∣

k∗∑

j=1

a∗jψj(x)

∣∣∣∣∣∣< ∞.

Using Kn → ∞ (n → ∞) and βn → ∞ (n → ∞) one concludes that, forall n ≥ n0(ε),

k∗∑

j=1

a∗jψj ∈

f ∈ Fn : ‖f‖∞ ≤ βn

.

Hence

inff∈Fn,‖f‖∞≤βn

Rd

|f(x) −m(x)|2µ(dx) < ε

for n ≥ n0(ε). Since ε > 0 was arbitrary, this implies (10.9) and (10.11).

Proof of (10.10) and (10.12). Let L > 0 be arbitrary. Because ofβn → ∞ (n → ∞) we may assume w.l.o.g. L < βn. Set

Z = (X,Y ), Z1 = (X1, Y1), . . . , Zn = (Xn, Yn),

Page 189: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

172 10. Least Squares Estimates I: Consistency

and

Hn =h : Rd × R → R : ∃f ∈ TβnFn such that h(x, y) = |f(x) − TLy|2

.

Observe that the functions in Hn satisfy

0 ≤ h(x, y) ≤ 2β2n + 2L2 ≤ 4β2

n.

By Theorem 9.1 one has, for arbitrary ε > 0,

P

supf∈Tβn Fn

∣∣∣∣∣1n

n∑

i=1

|f(Xi) − Yi,L|2 − E|f(X) − YL|2∣∣∣∣∣> ε

= P

suph∈Hn

∣∣∣∣∣1n

n∑

i=1

h(Zi) − Eh(Z)∣∣∣∣∣> ε

≤ 8EN1

( ε8,Hn, Z

n1

)e

− nε2

128(4β2n)2 . (10.20)

Next we bound the covering number in (10.20). By Lemma 9.2 we canbound it by the corresponding packing number

N1

( ε8,Hn, Z

n1

)≤ M1

( ε8,Hn, Z

n1

).

Let hi(x, y) = |fi(x) − TLy|2 ((x, y) ∈ Rd × R) for some fi ∈ Fn. Then

1n

n∑

i=1

|h1(Zi) − h2(Zi)|

=1n

n∑

i=1

∣∣|f1(Xi) − TLYi|2 − |f2(Xi) − TLYi|2

∣∣

=1n

n∑

i=1

|f1(Xi) − f2(Xi)| · |f1(Xi) − TLYi + f2(Xi) − TLYi|

≤ 4βn1n

n∑

i=1

|f1(Xi) − f2(Xi)|.

Thus, if h1, . . . , hl is an ε8 -packing of Hn on Zn

1 , then f1, . . . , fl is anε/8(4βn)-packing of TβnFn on Xn

1 . Then

M1

( ε8,Hn, Z

n1

)≤ M1

32βn, TβnFn, X

n1

). (10.21)

By Theorem 9.4 we can bound the latter term

M1

32βn, TβnFn, X

n1

)

≤ 3

(2e(2βn)

ε32βn

log

(3e(2βn)

ε32βn

))VTβn

F+n

Page 190: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

10.3. Linear Least Squares Series Estimates 173

= 3(

128eβ2n

εlog

(192eβ2

n

ε

))VTβn

F+n

. (10.22)

Let (x, y) ∈ Rd × R. If y > βn, then (x, y) is contained in none of the setsTβn

F+n and, if y ≤ −βn, then (x, y) is contained in every set of Tβn

F+n .

Hence, if TβnF+n shatters a set of points, then the y-coordinates of these

points are all bounded in absolute value by βn and F+n also shatters this

set of points. This proves

VTβn F+n

≤ VF+n, (10.23)

where VF+n

can be bounded by Theorem 9.5. Observe that

F+n = (x, t) : t ≤ f(x) : f ∈ Fn

⊆ (x, t) : f(x) + a0t ≥ 0 : f ∈ Fn, a0 ∈ R .Nowf(x)+a0t : f ∈ Fn, a0 ∈ R

=

Kn∑

j=1

ajψj(x) + a0t : a0, . . . , aKn ∈ R

is a linear vector space of dimension Kn + 1, thus Theorem 9.5 implies

VF+n

≤ Kn + 1. (10.24)

Formulae (10.20)–(10.24) imply

P

supf∈Tβn Fn

∣∣∣∣∣1n

n∑

i=1

|f(Xi) − Yi,L|2 − E|f(X) − YL|2∣∣∣∣∣> ε

≤ 24(

128eβ2n

εlog

(192eβ2

n

ε

))Kn+1

e− nε2

2048β4n

≤ 24(

192eβ2n

ε

)2(Kn+1)

e− nε2

2048β4n , (10.25)

where we have used log(x) ≤ x−1 ≤ x (x ∈ R+). Now, assume that (10.18)and (10.19) hold. Then

∞∑

n=1

P

supf∈Tβn Fn

∣∣∣∣∣1n

n∑

i=1

|f(Xi) − Yi,L|2 − E|f(X) − YL|2∣∣∣∣∣> ε

≤∞∑

n=1

24 · exp(

2(Kn + 1) log192eβ2

n

ε− nε2

2048β4n

)

=∞∑

n=1

24 · exp

(

−nδ n1−δ

β4n

(ε2

2048− 2(Kn + 1)β4

n log 192eβ2n

ε

n

))

< ∞,

Page 191: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

174 10. Least Squares Estimates I: Consistency

where we have used that (10.18) and (10.19) imply

n1−δ

β4n

→ ∞ (n → ∞)

and

2(Kn + 1)β4n log 192eβ2

n

ε

n→ 0 (n → ∞).

This, together with the Borel–Cantelli lemma, proves (10.10).Let Z be a nonnegative random variable and let ε > 0. Then

EZ =∫ ∞

0PZ > t dt ≤ ε+

∫ ∞

ε

PZ > t dt.

Using this and (10.25) one gets

E

supf∈TβnFn

∣∣∣∣∣1n

n∑

i=1

|f(Xi) − Yi,L|2 − E|f(X) − YL|2

∣∣∣∣∣

≤ ε+∫ ∞

ε

24 ·(

192eβ2n

t

)2(Kn+1)

· exp(

− n · t22048β4

n

)dt

≤ ε+ 24 ·(

192eβ2n

ε

)2(Kn+1)

·[−2048β4

n

n · ε · exp(

− n · ε · t2048β4

n

)]∞

t=ε

= ε+ 24 ·(

192eβ2n

ε

)2(Kn+1)

· 2048β4n

n · ε · exp(

− n · ε22048β4

n

)

= ε+ 24 · 2048β4n

n · ε · exp(

2(Kn + 1) · log(

192eβ2n

ε

)− n · ε2

2048β4n

)

→ ε (n → ∞),

if (10.18) holds. With ε → 0 one gets (10.12).

10.4 Piecewise Polynomial Partitioning Estimates

Let Pn = An,1, An,2, . . . be a partition of Rd and let mn be thecorresponding partitioning estimate, i.e.,

mn(x) =∑n

i=1 Yi · IXi∈An(x)∑ni=1 IXi∈An(x)

,

where An(x) denotes the cell An,j ∈ Pn which contains x. As we havealready seen in Chapter 2, mn satisfies

mn(·) = arg minf∈Fn

1n

n∑

i=1

|f(Xi) − Yi|2,

Page 192: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

10.4. Piecewise Polynomial Partitioning Estimates 175

−1 −0.5 0.5 1

0.5

Figure 10.3. Piecewise polynomial partitioning estimate, degree M = 1, h = 0.1,L2 error = 0.472119.

where Fn is the set of all piecewise constant functions with respect to Pn.Hence the partitioning estimate fits (via the principle of least squares) apiecewise constant function to the data.

As we have seen in Chapter 4 the partitioning estimate does not achievethe optimal (or even a nearly optimal) rate of convergence if m is (p, C)-smooth for some p > 1.

A straightforward generalization of fitting a piecewise constant functionto the data is to fit (via the principle of least squares) a piecewise poly-nomial of some fixed degree M > 0 to the data. We will call the resultingestimate a piecewise polynomial partitioning estimate. Figures 10.3–10.8show application of this estimate to our standard data example.

In this section we show how one can use the results of this chapter toderive the consistency of such estimates. In the next chapter we will seethat these estimates are able to achieve (at least up to a logarithmic factor)the optimal rate of convergence if the regression function is (p, C)-smooth,even if p > 1.

For simplicity we assume X ∈ [0, 1] a.s., the case of unbounded andmultivariate X will be handled in Problems 10.6 and 10.7.

Let M ∈ N0 and for n ∈ N let Kn ∈ N , βn ∈ R+ and let Pn =An,1, . . . , An,Kn be a partition of [0, 1] consisting of Kn cells. Let GM bethe set of all polynomials of degree M (or less), and set

GM Pn =

Kn∑

j=1

pjIAn,j: pj ∈ GM (j = 1, . . . ,Kn)

,

Page 193: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

176 10. Least Squares Estimates I: Consistency

−1 −0.5 0.5 1

0.5

Figure 10.4. Piecewise polynomial partitioning estimate, degree M = 1, h = 0.5,L2 error = 0.002786.

−1 −0.5 0.5 1

0.5

Figure 10.5. Piecewise polynomial partitioning estimate, degree M = 1, h = 0.8,L2 error = 0.013392.

where GM Pn is the set of all piecewise polynomials of degree M (or less)w.r.t. Pn.

Define the piecewise polynomial partitioning estimate by

mn(·) = arg minf∈GM Pn

1n

n∑

i=1

|f(Xi) − Yi|2 and mn(·) = Tβnmn(·).(10.26)

Page 194: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

10.4. Piecewise Polynomial Partitioning Estimates 177

−1 −0.5 0.5 1

0.5

Figure 10.6. Piecewise polynomial partitioning estimate, degree M = 2, h = 0.25,L2 error = 0.004330.

−1 −0.5 0.5 1

0.5

Figure 10.7. Piecewise polynomial partitioning estimate, degree M = 2, h = 0.5,L2 error = 0.000968.

Theorem 10.4. Let M , βn, Kn, Pn be as above and define the estimatemn by (10.26).(a) If

Kn → ∞, βn → ∞, andKnβ

4n log βn

n→ 0, (10.27)

Page 195: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

178 10. Least Squares Estimates I: Consistency

−1 −0.5 0.5 1

0.5

Figure 10.8. Piecewise polynomial partitioning estimate, degree M = 2, h = 0.8,L2 error = 0.012788.

and

maxj=1,...Kn

diam (An,j) → 0 (n → ∞), (10.28)

then

E∫

|mn(x) −m(x)|2µ(dx) → 0 (n → ∞)

for all distributions of (X,Y ) with X ∈ [0, 1] a.s. and EY 2 < ∞.(b) If Kn and βn satisfy (10.27) and (10.28) and, in addition,

β4n

n1−δ→ 0 (n → ∞) (10.29)

for some δ > 0, then∫

|mn(x) −m(x)|2µ(dx) → 0 (n → ∞) a.s.

for all distributions of (X,Y ) with X ∈ [0, 1] a.s. and EY 2 < ∞.

For degree M = 0 the estimate in Theorem 10.4 is a truncated version ofthe partitioning estimate in Theorem 4.2. The conditions in Theorem 4.2are weaker than in Theorem 10.4, and in Theorem 4.2 there is no truncationof the estimate required. But Theorem 10.4 is applicable to more generalestimates. It follows from Problem 10.3 that in this more general contexttruncation of the estimate is necessary.

Page 196: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

10.4. Piecewise Polynomial Partitioning Estimates 179

Proof. Because of Theorem 10.2 it suffices to show that (10.27) and(10.28) imply

inff∈GM Pn,‖f‖∞≤βn

∫|f(x) −m(x)|2µ(dx) → 0 (n → ∞) (10.30)

and

E

supf∈Tβn GM Pn

∣∣∣∣∣1n

n∑

i=1

|f(Xi) − Yi,L|2 − E|f(X) − YL|2∣∣∣∣∣

→ 0

(n → ∞) (10.31)

for all L > 0, and that (10.27), (10.28), and (10.29) imply

supf∈Tβn GM Pn

∣∣∣∣∣1n

n∑

i=1

|f(Xi) − Yi,L|2 − E|f(X) − YL|2∣∣∣∣∣→ 0

(n → ∞) a.s. (10.32)

for all L > 0. The proof of (10.31) and (10.32) is left to the reader (cf.Problem 10.5). In order to show (10.30) let ε > 0 be arbitrary. By TheoremA.1 there exists a continuous function m such that

∫|m(x) −m(x)|2µ(dx) ≤ ε

4.

Since m is uniformly continuous on the compact interval [0, 1], there existsa δ > 0 such that |m(x) − m(z)| < √

ε/2 for all x, z ∈ [0, 1], |x − z| < δ.Choose arbitrary points zn,j ∈ An,j (j = 1, . . . ,Kn) and set

fn =Kn∑

j=1

m(zn,j) · IAn,j ∈ G0 Pn ⊆ GM Pn.

Then z ∈ An,j and diam(An,j) < δ imply

|fn(z) − m(z)| = |m(zn,j) − m(z)| < √ε/2.

Using this one gets, for n sufficiently large (i.e., for n so large that βn ≥maxz∈[0,1] |m(z)| and maxj=1,...,Kn diam(An,j) < δ),

inff∈GM Pn,‖f‖∞≤βn

∫|f(x) −m(x)|2µ(dx)

≤ 2 inff∈GM Pn,‖f‖∞≤βn

∫|f(x) − m(x)|2µ(dx) + 2 · ε

4

≤ 2∫

|fn(x) − m(x)|2µ(dx) +ε

2

≤ 2 supx∈[0,1]

|fn(x) − m(x)|2 +ε

2

≤ 2 · ε4

2= ε.

Page 197: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

180 10. Least Squares Estimates I: Consistency

With ε → 0 this implies (10.30).

It is possible to modify the estimate such that it is weakly and stronglyuniversally consistent (cf. Problem 10.6). Multivariate piecewise polyno-mial partitioning estimates can be defined by using piecewise multivariatepolynomials (cf. Problem 10.7).

10.5 Bibliographic Notes

It is well-known that solving the set of normal equations of linear leastsquares estimates (i.e., (10.5)) may cause serious numerical problemsdue to ill–conditioning. Numerical methods for solving these equationsare discussed, e.g., in Daniel and Wood (1980), Farebrother (1988), andMaindonald (1984).

For least squares estimates one minimizes the empirical L2 risk. Asymp-totic properties of more general empirical risk minimization problems werestudied by several authors such as Vapnik and Chervonenkis (1971), Vap-nik (1982; 1998), and Haussler (1992). Minimization of the empirical L2risk has also become known in the statistics literature as “minimum con-trast estimation,” e.g., see Nemirovsky et al. (1985), van de Geer (1990),and Birge and Massart (1993). Consistency of least squares and other min-imum contrast estimates under general conditions was investigated, e.g., inNemirovsky et al. (1983), and Nemirovsky et al. (1984).

In the context of pattern recognition many nice results concerning em-pirical risk minimization can be found in the book of Devroye, Gyorfi, andLugosi (1996).

Theorem 10.1 is due to Lugosi and Zeger (1995). Consistency of sievesestimates has been studied, e.g., by Geman and Hwang (1982) and vande Geer and Wegkamp (1996). The latter article also contains necessaryconditions for the consistency of least squares estimates.

Problems and Exercises

Problem 10.1. Let Fn be a class of functions f : Rd → R and define theestimator mn by

mn(·) = arg minf∈Fn

1n

n∑

i=1

|f(Xi) − Yi|2.

Assume EY 2 < ∞. Show that without truncation of the estimate the followingmodification of Theorem 10.2 is valid:

inff∈Fn

∫|f(x) − m(x)|2µ(dx) → 0 (n → ∞)

Page 198: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

Problems and Exercises 181

and

supf∈Fn

∣∣∣∣∣1n

n∑

i=1

|f(Xi) − TLYi|2 − E|f(X) − TLY |2

∣∣∣∣∣→ 0 (n → ∞) a.s.

for all L > 0 imply∫

|mn(x) − m(x)|2µ(dx) → 0 (n → ∞) a.s.

Problem 10.2. Prove Theorem 10.1.Hint: Proceed as in the proof of Theorem 10.3, but apply Problem 10.1 instead

of Theorem 10.2

Problem 10.3. (Devroye, personal communication, 1998). Let n ∈ N be fixed.Let Fn be the set of all functions which are piecewise linear on a partition of[0, 1] consisting of intervals. Assume that [0, h1] is one of these intervals (h1 > 0).Show that the piecewise linear estimate

mn(·) = arg minf∈Fn

1n

n∑

i=1

|f(Xi) − Yi|2

satisfies

E∫

|mn(x) − m(x)|2µ(dx) = ∞

if X is uniformly distributed on [0, 1], Y is −1, 1-valued with EY = 0, and X,Y are independent.

Hint:

Step (a). Let A be the event that X1, X2 ∈ [0, h1], X3, . . . , Xn ∈ [h1, 1], andY1 = Y2. Then PA > 0 and

E∫

(mn(x)−m(x))2µ(dx) = E∫

mn(x)2dx ≥(

E

∫|mn(x)|dx

∣∣∣∣A

PA)2

.

Step (b). Given A, on [0, h1] the piecewise linear estimate mn has the form

mn(x) =±2∆

(x − c),

where ∆ = |X1 − X2| and 0 ≤ c ≤ h1. Then

E

∫|mn(x)|dx

∣∣∣∣A

≥ E

2∆

∫ h1

0

|x − h1/2|dx

∣∣∣∣A

.

Step (c).

E

1∆

∣∣∣∣A

= E 1

= ∞.

Problem 10.4. Let β > 0, let G be a set of functions g : Rd → [−β, β], and letH be the set of all functions h : Rd × R → R defined by

h(x, y) = |g(x) − Tβy|2 ((x, y) ∈ Rd × R)

Page 199: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

182 10. Least Squares Estimates I: Consistency

for some g ∈ G. Show that for any ε > 0 and any (x, y)n1 = ((x1, y1), . . . , (xn, yn)) ∈

(Rd × [−β, β])n,

N1 (ε, H, (x, y)n1 ) ≤ N1

4β, G, xn

1

).

Hint: Choose an L1 cover of G on xn1 of minimal size. Show that you can

assume w.l.o.g. that the functions in this cover are bounded in absolute value byβ. Use this cover as in the proof of Theorem 10.3 to construct an L1 cover of Hon (x, y)n

1 .

Problem 10.5. Show that under the assumptions of Theorem 10.4, (10.31) and(10.32) hold.

Hint: Proceed as in the proof of Theorem 10.3.

Problem 10.6. Modify the estimate in Theorem 10.4 in such a way that it isweakly and strongly universally consistent.

Hint: Choose An ∈ R+ such that An tends “not too quickly” to infinity. DefineFn as the set of all piecewise polynomials of degree M (or less) with respect to anequidistant partition of [−An, An] consisting of Kn intervals, and extend thesefunctions on the whole R by setting them to zero outside [−An, An].

Problem 10.7. Construct a multivariate version of the estimate in Problem 10.6and show that it is strongly universally consistent.

Hint: Use functions which are equal to a multivariate polynomial of degree M(or less, in each coordinate) with respect to suitable partitions.

Page 200: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

11Least Squares Estimates II: Rate ofConvergence

In this chapter we study the rates of convergence of least squares estimates.We separately consider linear and nonlinear estimates. The key tools in thederivation of these results are extensions of the exponential inequalities inChapter 9, which we will also use later to define adaptive versions of theestimates.

11.1 Linear Least Squares Estimates

In this section we will study a truncated version mn(·) = TLmn(·) of alinear least squares estimate

mn(·) = arg minf∈Fn

1n

n∑

i=1

|f(Xi) − Yi|2, (11.1)

where Fn is a linear vector space of functions f : Rd → R, which dependson n. Examples of such estimates are linear least squares series estimatesand piecewise polynomial partitioning estimates.

We are interested in the rate of convergence of

‖mn −m‖ =∫

|mn(x) −m(x)|2 µ(dx)1/2

Page 201: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

184 11. Least Squares Estimates II: Rate of Convergence

to zero. To bound mn −m in the L2(µ) norm, we will first bound it in theempirical norm ‖ · ‖n, given by

‖f‖2n =

∫|f(x)|2 µn(dx) =

1n

n∑

i=1

|f(Xi)|2,

and then show how one can bound the L2(µ) norm by the empirical norm.Our first result gives a bound on ‖mn −m‖2

n. Observe that if we assumethat m is bounded in absolute value by L (which we will do later), then|mn(x) − m(x)| ≤ |mn(x) − m(x)| for all x which implies ‖mn − m‖2

n ≤‖mn −m‖2

n.

Theorem 11.1. Assume

σ2 = supx∈Rd

VarY |X = x < ∞.

Let the estimate mn be defined by (11.1), where Fn is a linear vector spaceof functions f : Rd → R which may depend on X1, . . . , Xn. Let Kn =Kn(X1, . . . , Xn) be the vector space dimension of Fn. Then

E

‖mn −m‖2n

∣∣∣∣X1, . . . , Xn

≤ σ2Kn

n+ min

f∈Fn

‖f −m‖2n.

Proof. In order to simplify the notation we will use the abbreviation

E∗· = E·|X1, . . . , Xn.In the first step of the proof we show that

‖E∗mn −m‖2n = min

f∈Fn

‖f −m‖2n. (11.2)

By the results of Section 10.1, (11.1) is equivalent to

mn =Kn∑

j=1

ajfj,n,

where f1,n, . . . , fKn,n is a basis of Fn and a = (aj)j=1,...,Kn satisfies

1nBTBa =

1nBTY

with

B = (fj,n(Xi))1≤i≤n,1≤j≤Knand Y = (Y1, . . . , Yn)T .

If we take the conditional expectation given X1, . . . , Xn, then we get

E∗mn =Kn∑

j=1

E∗aj · fj,n,

Page 202: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

11.1. Linear Least Squares Estimates 185

where E∗a = (E∗aj)j=1,...,Knsatisfies

1nBTBE∗a =

1nBT (m(X1), . . . ,m(Xn))T .

There we have used

E∗

1nBTY

=1nBT (E∗Y1, . . . ,E∗Yn)T

=1nBT (m(X1), . . . ,m(Xn))T .

Hence, again by the results of Section 10.1, E∗mn is the least squaresestimate in Fn on the data (X1,m(X1)), . . . , (Xn,m(Xn)) and thereforesatisfies

‖E∗mn −m‖2n =

1n

n∑

i=1

|E∗mn(Xi) −m(Xi)|2

= minf∈Fn

1n

n∑

i=1

|f(Xi) −m(Xi)|2 = minf∈Fn

‖f −m‖2n.

Next, we observe

E∗‖mn −m‖2n

= E∗

1n

n∑

i=1

|mn(Xi) −m(Xi)|2

= E∗

1n

n∑

i=1

|mn(Xi) − E∗mn(Xi)|2

+ E∗

1n

n∑

i=1

|E∗mn(Xi) −m(Xi)|2

= E∗ ‖mn − E∗mn‖2n

+ ‖E∗mn −m‖2

n,

where the second equality follows from

E∗

1n

n∑

i=1

(mn(Xi) − E∗mn(Xi)) (E∗mn(Xi) −m(Xi))

=1n

n∑

i=1

(E∗mn(Xi) −m(Xi))E∗mn(Xi) − E∗mn(Xi)

=1n

n∑

i=1

(E∗mn(Xi) −m(Xi)) · 0

Page 203: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

186 11. Least Squares Estimates II: Rate of Convergence

= 0.

Thus it remains to show

E∗ ‖mn − E∗mn‖2n

≤ σ2Kn

n. (11.3)

Choose a complete orthonormal system f1, . . . , fK in Fn with respect tothe empirical scalar product < ·, · >n, given by

< f, g >n =1n

n∑

i=1

f(Xi)g(Xi).

Such a system will depend on X1, . . . , Xn, but it will always satisfy K ≤Kn. Then we have, on X1, . . . , Xn,

span f1, . . . , fK = Fn,

hence mn is also the least squares estimate of m in span f1, . . . , fK.Therefore, for x ∈ X1, . . . , Xn,

mn(x) = f(x)T 1nBTY,

where

f(x) = (f1(x), . . . , fK(x))T and B = (fj(Xi))1≤i≤n,1≤j≤K .

Here we have used1nBTB = (< fj , fk >n)1≤j,k≤K = (δj,k)1≤j,k≤K , (11.4)

where δj,k is the Kronecker symbol, i.e., δj,k = 1 for j = k and δj,k = 0otherwise. Now,

E∗ |mn(x) − E∗mn(x)|2

= E∗

∣∣∣∣f(x)T 1

nBTY − f(x)T 1

nBT (m(X1), . . . ,m(Xn))T

∣∣∣∣

2

= E∗

∣∣∣∣f(x)T 1

nBT (Y1 −m(X1), . . . , Yn −m(Xn))T

∣∣∣∣

2

= E∗

f(x)T 1nBT

((Yi −m(Xi))(Yj −m(Xj))

)

1≤i,j≤n

1nBf(x)

= f(x)T 1nBT

(E∗ (Yi −m(Xi))(Yj −m(Xj))

)

1≤i,j≤n

1nBf(x).

Since

E∗ (Yi −m(Xi))(Yj −m(Xj)) = δi,j VarYi|Xi

Page 204: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

11.1. Linear Least Squares Estimates 187

we get, for any vector c = (c1, . . . , cn)T ∈ Rn,

cT (E∗ (Yi −m(Xi))(Yj −m(Xj)))1≤i,j≤n c

=n∑

i=1

VarYi|Xic2i ≤ σ2cT c,

which, together with (11.4), implies

E∗|mn(x) − E∗mn(x)|2 ≤ σ2

nf(x)T f(x) =

σ2

n

K∑

j=1

|fj(x)|2.

It follows that

E∗‖mn − E∗mn‖2n =

1n

n∑

i=1

E∗ |mn(Xi) − E∗mn(Xi)|2

≤ 1n

n∑

i=1

σ2

n

K∑

j=1

|fj(Xi)|2

=σ2

n

K∑

j=1

‖fj‖2n

=σ2

nK ≤ σ2

nKn.

To bound the L2(µ) norm ‖ · ‖ by the empirical norm ‖ · ‖n we will usethe next theorem.

Theorem 11.2. Let F be a class of functions f : Rd → R bounded inabsolute value by B. Let ε > 0. Then

P ∃f ∈ F : ‖f‖ − 2‖f‖n > ε ≤ 3EN2

(√2

24ε,F , X2n

1

)

exp(

− nε2

288B2

).

Proof. The proof will be divided in several steps.

Step 1. Replace the L2(µ) norm by the empirical norm defined by a ghostsample.

Let X ′n1 = (Xn+1, . . . , X2n) be a ghost sample of i.i.d. random variables

distributed as X and independent of Xn1 . Define

‖f‖′2n =

1n

2n∑

i=n+1

|f(Xi)|2.

Let f∗ be a function f ∈ F such that

‖f‖ − 2‖f‖n > ε,

Page 205: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

188 11. Least Squares Estimates II: Rate of Convergence

if there exists any such function, and let f∗ be an other arbitrary functioncontained in F , if such a function doesn’t exist. Observe that f∗ dependson Xn

1 . Then

P

2‖f∗‖′n +

ε

2> ‖f∗‖

∣∣∣∣X

n1

≥ P

4‖f∗‖′2n +

ε2

4> ‖f∗‖2

∣∣∣∣X

n1

= 1 − P

4‖f∗‖′2n +

ε2

4≤ ‖f∗‖2

∣∣∣∣X

n1

= 1 − P

3‖f∗‖2 +ε2

4≤ 4(‖f∗‖2 − ‖f∗‖′2

n )∣∣∣∣X

n1

.

By the Chebyshev inequality

P

3‖f∗‖2 +ε2

4≤ 4(‖f∗‖2 − ‖f∗‖′2

n )∣∣∣∣X

n1

≤16Var

(1n

∑2ni=n+1 |f∗(Xi)|2|Xn

1

)

(3‖f∗‖2 + ε2

4

)2

≤ 16 1nB

2‖f∗‖2

(3‖f∗‖2 + ε2

4

)2

≤163

B2

n

(3‖f∗‖2 + ε2

4

)

(3‖f∗‖2 + ε2

4

)2

≤ 16B2

3n· 4ε2

=64B2

3ε2n.

Hence, for n ≥ 64B2

ε2 ,

P

2‖f∗‖′n +

ε

2> ‖f∗‖

∣∣∣∣X

n1

≥ 2

3. (11.5)

Next,

P

∃f ∈ F : ‖f‖′n − ‖f‖n >

ε

4

≥ P

2‖f∗‖′n − 2‖f∗‖n >

ε

2

≥ P

2‖f∗‖′n +

ε

2− 2‖f∗‖n > ε, 2‖f∗‖′

n +ε

2> ‖f∗‖

≥ P

‖f∗‖ − 2‖f∗‖n > ε, 2‖f∗‖′n +

ε

2> ‖f∗‖

= EI‖f∗‖−2‖f∗‖n>ε · P

2‖f∗‖′

n +ε

2> ‖f∗‖

∣∣∣∣X

n1

Page 206: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

11.1. Linear Least Squares Estimates 189

≥ 23P ‖f∗‖ − 2‖f∗‖n > ε

(for n ≥ 64B2/ε2 by (11.5))

=23P ∃f ∈ F : ‖f‖ − 2‖f‖n > ε .

This proves

P ∃f ∈ F : ‖f‖ − 2‖f‖n > ε ≤ 32P

∃f ∈ F : ‖f‖′n − ‖f‖n >

ε

4

for n ≥ 64B2/ε2. Observe that for n ≤ 64B2/ε2 the assertion is trivial,because in this case the right-hand side of the inequality in the theorem isgreater than 1.

Step 2. Introduction of additional randomness.Let U1, . . . , Un be independent and uniformly distributed on −1, 1 andindependent of X1, . . . , X2n. Set

Zi =Xi+n if Ui = 1,Xi if Ui = −1, and Zi+n =

Xi if Ui = 1,Xi+n if Ui = −1,

(i = 1, . . . , n). Because of the independence and identical distribution ofX1, . . . , X2n the joint distribution of X2n

1 is not affected if one randomlyinterchanges the corresponding components of Xn

1 and X2nn+1 and is, hence,

equal to the joint distribution of Z2n1 . Thus

P

∃f ∈ F : ‖f‖′n − ‖f‖n >

ε

4

= P

∃f ∈ F :

(1n

2n∑

i=n+1

|f(Xi)|2)1/2

−(

1n

n∑

i=1

|f(Xi)|2)1/2

4

= P

∃f ∈ F :

(1n

2n∑

i=n+1

|f(Zi)|2)1/2

−(

1n

n∑

i=1

|f(Zi)|2)1/2

4

.

Step 3. Conditioning and introduction of a covering.Next we condition in the last probability on X2n

1 . Let

G =

gj : j = 1, . . . , N2

(√2

24ε,F , X2n

1

)

be a√

224 ε-cover of F w.r.t. ‖ · ‖2n of minimal size, where

‖f‖22n =

12n

2n∑

i=1

|f(Xi)|2.

Page 207: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

190 11. Least Squares Estimates II: Rate of Convergence

W.l.o.g. we may assume −B ≤ gj(x) ≤ B for all x ∈ Rd. Fix f ∈ F . Thenthere exists a g ∈ G such that

‖f − g‖2n ≤√

224ε.

It follows that

1n

2n∑

i=n+1

|f(Zi)|2 1

2

1n

n∑

i=1

|f(Zi)|2 1

2

=

1n

2n∑

i=n+1

|f(Zi)|2 1

2

1n

2n∑

i=n+1

|g(Zi)|2 1

2

+

1n

2n∑

i=n+1

|g(Zi)|2 1

2

1n

n∑

i=1

|g(Zi)|2 1

2

+

1n

n∑

i=1

|g(Zi)|2 1

2

1n

n∑

i=1

|f(Zi)|2 1

2

1n

2n∑

i=n+1

|g(Zi) − f(Zi)|2 1

2

+

1n

2n∑

i=n+1

|g(Zi)|2 1

2

1n

n∑

i=1

|g(Zi)|2 1

2

+

1n

n∑

i=1

|g(Zi) − f(Zi)|2 1

2

(by triangle inequality)

≤√

2‖f − g‖2n +

1n

2n∑

i=n+1

|g(Zi)|2 1

2

1n

n∑

i=1

|g(Zi)|2 1

2

+√

2‖f − g‖2n

(by definition of Z2n1 )

≤ ε

6+

1n

2n∑

i=n+1

|g(Zi)|2 1

2

1n

n∑

i=1

|g(Zi)|2 1

2

.

Using this we get

P

∃f ∈ F :

(1n

2n∑

i=n+1

|f(Zi)|2)1/2

−(

1n

n∑

i=1

|f(Zi)|2)1/2

4

∣∣∣∣X

2n1

Page 208: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

11.1. Linear Least Squares Estimates 191

≤ P

∃g ∈ G :

(1n

2n∑

i=n+1

|g(Zi)|2)1/2

−(

1n

n∑

i=1

|g(Zi)|2)1/2

12

∣∣∣∣X

2n1

≤ |G| · maxg∈G

P

(1n

2n∑

i=n+1

|g(Zi)|2)1/2

−(

1n

n∑

i=1

|g(Zi)|2)1/2

12

∣∣∣∣X

2n1

,

where

|G| = N2

(√2

24ε,F , X2n

1

)

.

Step 4. Application of Hoeffding’s inequality.In this step we bound

P

(1n

2n∑

i=n+1

|g(Zi)|2)1/2

−(

1n

n∑

i=1

|g(Zi)|2)1/2

12

∣∣∣∣∣X2n

1

,

where g : Rd → R satisfies −B ≤ g(x) ≤ B for all x ∈ Rd.By definition of Z2n

1 , and√a+

√b ≥ √

a+ b for all a, b ≥ 0,(

1n

2n∑

i=n+1

|g(Zi)|2)1/2

−(

1n

n∑

i=1

|g(Zi)|2)1/2

∣∣∣∣∣∣∣

1n

∑2ni=n+1 |g(Zi)|2 − 1

n

∑ni=1 |g(Zi)|2

(1n

∑2ni=n+1 |g(Zi)|2

)1/2+( 1

n

∑ni=1 |g(Zi)|2

)1/2

∣∣∣∣∣∣∣

≤∣∣ 1n

∑ni=1 Ui

(|g(Xi)|2 − |g(Xi+n)|2)∣∣(

1n

∑2ni=n+1 |g(Zi)|2 + 1

n

∑ni=1 |g(Zi)|2

)1/2

=

∣∣ 1n

∑ni=1 Ui

(|g(Xi)|2 − |g(Xi+n)|2)∣∣(

1n

∑2ni=1 |g(Xi)|2

)1/2 ,

which implies that the above probability is bounded by

P

∣∣∣∣∣1n

n∑

i=1

Ui

(|g(Xi)|2 − |g(Xi+n)|2)∣∣∣∣∣>

ε

12

(1n

2n∑

i=1

|g(Xi)|2)1/2 ∣∣

∣∣∣X2n

1

.

By Hoeffding’s inequality (cf. Lemma A.3) this in turn is bounded by

2 exp

−2n2 ε2

144

(1n

∑2ni=1 |g(Xi)|2

)

∑ni=1 4 (|g(Xi)|2 − |g(Xi+n)|2)2

Page 209: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

192 11. Least Squares Estimates II: Rate of Convergence

≤ 2 exp

(

− nε2∑2n

i=1 |g(Xi)|2∑ni=1 288B2 (|g(Xi)|2 + |g(Xi+n)|2)

)

= 2 exp(

− nε2

288B2

),

where the last inequality follows from(|g(Xi)|2 − |g(Xi+n)|2)2 ≤ |g(Xi)|4 + |g(Xi+n)|4

≤ B2 (|g(Xi)|2 + |g(Xi+n)|2) .Steps 1 to 4 imply the assertion.

Combining Theorems 11.1 and 11.2 we get

Theorem 11.3. Assume

σ2 = supx∈Rd

VarY |X = x < ∞

and

‖m‖∞ = supx∈Rd

|m(x)| ≤ L

for some L ∈ R+. Let Fn be a linear vector space of functions f : Rd → R.Let Kn be the vector space dimension of Fn. Define the estimate mn by

mn(·) = TLmn(·) where mn(·) = arg minf∈Fn

1n

n∑

i=1

|f(Xi) − Yi|2.

Then

E∫

|mn(x) −m(x)|2µ(dx)

≤ c · maxσ2, L2 (log(n) + 1) ·Kn

n+ 8 inf

f∈Fn

∫|f(x) −m(x)|2µ(dx),

for some universal constant c.

The second term on the right-hand side of the above inequality is eighttimes the approximation error of the estimate while the first term is anupper bound on the estimation error. Observe that in this theorem we donot assume that Y is bounded, we only assume that m is bounded and thatwe know this bound.

Proof of Theorem 11.3. We start with the decomposition∫

|mn(x) −m(x)|2µ(dx)

= (‖mn −m‖ − 2‖mn −m‖n + 2‖mn −m‖n)2

Page 210: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

11.1. Linear Least Squares Estimates 193

≤ (max ‖mn −m‖ − 2‖mn −m‖n, 0 + 2‖mn −m‖n)2

≤ 2 (max ‖mn −m‖ − 2‖mn −m‖n, 0)2 + 8‖mn −m‖2n

= T1,n + T2,n.

Because of ‖m‖∞ ≤ L we have

‖mn −m‖2n ≤ ‖mn −m‖2

n,

which together with Theorem 11.1 implies

ET2,n ≤ 8EE‖mn −m‖2n|X1, . . . , Xn

≤ 8σ2Kn

n+ 8E

minf∈Fn

‖f −m‖2n

≤ 8σ2Kn

n+ 8 inf

f∈Fn

∫|f(x) −m(x)|2µ(dx).

Hence it suffices to show

ET1,n ≤ c · L2 (log(n) + 1) ·Kn

n. (11.6)

In order to show this, let u > 576L2/n be arbitrary. Then, by Theorem11.2,

P T1,n > u = P

2 (max ‖mn −m‖ − 2‖mn −m‖n, 0)2 > u

≤ P

∃f ∈ TLFn : ‖f −m‖ − 2‖f −m‖n >√u/2

≤ 3EN2(√u/24,Fn, X

2n1) · exp

(− n · u

576(2L)2

)

≤ 3EN2

(L√n,Fn, X

2n1

)· exp

(− n · u

576(2L)2

).

Using Lemma 9.2, Theorem 9.4, VTLF+n

≤ VF+n

, and Theorem 9.5 we get

N2(L/

√n,Fn, X

2n1) ≤ 3

(3e(2L)2

(L/√n)2

)2(Kn+1)

= 3(12en)2(Kn+1).

It follows, for any u > 576L2/n,

PT1,n > u ≤ 9 · (12en)2(Kn+1) · exp(− n · u

2304 · L2

).

We get, for any v > 576L2/n,

ET1,n ≤ v +∫ ∞

v

PT1,n > tdt

≤ v + 9 · (12en)2(Kn+1) ·∫ ∞

v

exp(

− n · t2304 · L2

)dt

Page 211: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

194 11. Least Squares Estimates II: Rate of Convergence

= v + 9 · (12en)2(Kn+1) · 2304L2

n· exp

(− n · v

2304 · L2

).

Setting

v =2304L2

n· log

(9(12en)2(Kn+1)

)

this implies (11.6), which in turn implies the assertion.

11.2 Piecewise Polynomial Partitioning Estimates

In this section we illustrate the previous results by applying them to piece-wise polynomial partitioning estimates. The next lemma will be needed tobound the approximation error.

Lemma 11.1. Let M ∈ N0, K ∈ N , C > 0, q ∈ 0, . . . ,M, r ∈ (0, 1],and set p = q + r. Let m : [0, 1] → R be some (p, C)-smooth function, i.e.,assume that the qth derivative m(q) of m exists and satisfies

|m(q)(x) −m(q)(z)| ≤ C · |x− z|r (x, z ∈ [0, 1]).

Then there exists a piecewise polynomial f of degree M (or less) with respectto an equidistant partition of [0, 1] consisting of K intervals of length 1/Ksuch that

supx∈[0,1]

|f(x) −m(x)| ≤ 12p · q! · C

Kp.

Proof. Fix z0 ∈ (0, 1) and let gk be the Taylor polynomial of m of degreek around z0 given by

gk(z) =k∑

j=0

m(j)(z0)j!

(z − z0)j ,

where m(j)(z0) is the jth derivative of m at the point z0 (k ∈ 0, 1, . . . , q).We will show

|gq(z) −m(z)| ≤ 1q!

· C · |z − z0|p (z ∈ [0, 1]). (11.7)

The assertion follows by choosing f on each interval as the Taylor polyno-mial of m around the midpoint of the interval. For q = 0 (11.7) followsdirectly from assumption m (p, C)-smooth. In order to show (11.7) inthe case q > 0, we use the well-known integral form of the Taylor seriesremainder, which can be proven by induction and integration by parts,

m(z) − gk(z) =1k!

∫ z

z0

(z − t)km(k+1)(t) dt.

Page 212: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

11.2. Piecewise Polynomial Partitioning Estimates 195

Hence, for f = gq,

m(z) − f(z)

= m(z) − gq−1(z) − m(q)(z0)q!

(z − z0)q

=1

(q − 1)!

∫ z

z0

(z − t)q−1m(q)(t)dt− m(q)(z0)(q − 1)!

∫ z

z0

(z − t)q−1dt

=1

(q − 1)!

∫ z

z0

(z − t)q−1 · (m(q)(t) −m(q)(z0))dt.

From this, and the assumption that m is (p, C)-smooth, one concludes

|m(z) − f(z)| ≤∣∣∣∣

1(q − 1)!

∫ z

z0

(z − t)q−1 · C · |t− z0|rdt∣∣∣∣

≤ C · |z − z0|r(q − 1)!

∣∣∣∣

∫ z

z0

(z − t)q−1dt

∣∣∣∣

=C · |z − z0|q+r

q!.

We are now in a position to derive results on the convergence of thepiecewise polynomial partitioning estimate to the regression function. Firstwe consider the error in the empirical norm.

Corollary 11.1. Let M ∈ N0 and Kn ∈ N . Let Fn be the set of allpiecewise polynomials of degree M (or less) w.r.t. an equidistant partitionof [0, 1] into Kn intervals. Let the estimate mn be defined by (11.1). Assumethat the distribution of (X,Y ) satisfies X ∈ [0, 1] a.s. and

σ2 = supx∈Rd

VarY |X = x < ∞.

Then

E

‖mn −m‖2n

∣∣∣∣X1, . . . , Xn

≤ σ2 (M + 1) ·Kn

n+ min

f∈Fn

‖f −m‖2n.

Furthermore, if m is (p, C)-smooth for some p = q+ r ≤ (M + 1), q ∈ N0,r ∈ (0, 1], then

E

‖mn −m‖2n

∣∣∣∣X1, . . . , Xn

≤ σ2 (M + 1) ·Kn

n+

122pq!2

· C2

K2pn

and for

Kn =

⌈(2p

22pq!2(M + 1)· C

2n

σ2

)1/(2p+1)⌉

Page 213: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

196 11. Least Squares Estimates II: Rate of Convergence

one gets for any C ≥ σ/n1/2

E

‖mn −m‖2n

∣∣∣∣X1, . . . , Xn

≤ cMC

22p+1 ·

(σ2

n

) 2p2p+1

for some constant cM depending only on M .

Proof. Fn is a linear vector space of dimension (M+1)·Kn, hence the firstinequality follows from Theorem 11.1. Furthermore, Lemma 11.1, togetherwith X ∈ [0, 1] a.s., implies

minf∈Fn

‖f −m‖2n ≤ min

f∈Fn

supx∈[0,1]

|f(x) −m(x)|2 ≤ 122pq!2

· C2

K2pn

.

From this one gets the second inequality. The definition of Kn implies thethird inequality.

By applying Theorem 11.3 to piecewise polynomial partitioning estimateswe can bound the error in the L2(µ) norm.

Corollary 11.2. Let M ∈ N0 and Kn ∈ N . Let Fn be the set of allpiecewise polynomials of degree M (or less) w.r.t. an equidistant partitionof [0, 1] into Kn intervals. Define the estimate mn by

mn(·) = TLmn(·) where mn(·) = arg minf∈Fn

1n

n∑

i=1

|f(Xi) − Yi|2.

Assume that the distribution of (X,Y ) satisfies X ∈ [0, 1] a.s.,

σ2 = supx∈Rd

VarY |X = x < ∞,

‖m‖∞ = supx∈Rd

|m(x)| ≤ L

and

m is (p, C) − smooth

for some C > 0, p = q + r, q ∈ 0, . . . ,M, r ∈ (0, 1]. Then

E∫

|mn(x) −m(x)|2µ(dx)

≤ c · maxσ2, L2 (log(n) + 1) ·Kn(M + 1)n

+ 81

22pq!2· C

2

K2pn

and for

Kn =

⌈(C2

maxσ2, L2n

log(n)

)1/(2p+1)⌉

Page 214: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

11.3. Nonlinear Least Squares Estimates 197

one gets for any C ≥ maxσ, L · (log(n)/n)1/2

E∫

|mn(x) −m(x)|2µ(dx)

≤ cMC2

2p+1 ·(

maxσ2, L2 · (log(n) + 1)n

) 2p2p+1

for some constant cM depending only on M .

Proof. Lemma 11.1, together with X ∈ [0, 1] a.s., implies

inff∈Fn

∫|f(x) −m(x)|2µ(dx) ≤ inf

f∈Fn

supx∈[0,1]

|f(x) −m(x)|2

≤ 122pq!2

· C2

K2pn

.

From this together with Theorem 11.3, one gets the first inequality. Thedefinition of Kn implies the second inequality.

It follows from Chapter 3 that the above rate of convergence result is op-timal up to the logarithmic factor log(n)2p/(2p+1). For M = 0 the estimatein Corollary 11.2 is the partitioning estimate of Theorem 4.3. From this weknow that the logarithmic factor is not necessary for p = 1. We will latersee (cf. Chapter 19) how to get rid of the logarithmic factor for p = 1.

11.3 Nonlinear Least Squares Estimates

In this section we generalize Theorem 11.3 from linear vector spaces to gen-eral sets of functions. This will require the introduction of complicated, butextremely useful, exponential inequalities, which will be used throughoutthis book.

In the rest of this section we will assume |Y | ≤ L ≤ βn a.s. and theestimate mn will be defined by

mn(·) = Tβnmn(·) (11.8)

and

mn(·) = arg minf∈Fn

1n

n∑

i=1

|f(Xi) − Yi|2. (11.9)

Let us first try to apply the results which we have derived in order to showthe consistency of the estimate: it follows from Lemma 10.2 that

E∫

|mn(x) −m(x)|2µ(dx)

Page 215: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

198 11. Least Squares Estimates II: Rate of Convergence

is bounded by

2E supf∈Tβn Fn

∣∣∣∣∣∣

1n

n∑

j=1

|f(Xj) − Yj |2 − E(f(X) − Y )2

∣∣∣∣∣∣

+ inff∈Fn, ‖f‖∞≤βn

∫|f(x) −m(x)|2µ(dx).

To bound the first term we can apply Lemma 9.1 or Theorem 9.1, where wehave used Hoeffding’s inequality on fixed sup-norm and random L1 normcovers, respectively. In both cases we have bounded the probability

P

sup

f∈Tβn Fn

∣∣∣∣∣∣

1n

n∑

j=1

|f(Xj) − Yj |2 − E|f(X) − Y |2

∣∣∣∣∣∣> εn

by some term tending to infinity as n → ∞ times

exp(

−c nε2n

β4n

). (11.10)

Thus, if we want these upper bounds to converge to zero as n → ∞, thenεn must converge to zero not faster than β2

n n− 1

2 . Unfortunately, as we haveseen in Chapter 3, this is far away from the optimal rate of convergence.

Therefore, to analyze the rate of convergence of the expected value ofthe L2 error, we will use a different decomposition than in Section 10.1:

∫|mn(x) −m(x)|2µ(dx)

=E|mn(X) − Y |2|Dn

− E|m(X) − Y |2

−21n

n∑

i=1

|mn(Xi) − Yi|2 − |m(Xi) − Yi|2

+21n

n∑

i=1

|mn(Xi) − Yi|2 − |m(Xi) − Yi|2. (11.11)

Let us first observe that we can obtain a nice upper bound for the expec-tation of the second term on the right-hand side of (11.11). For simplicity,we ignore the factor 2. By using the definition of mn (see (11.8) and (11.9))and |Y | ≤ βn a.s. one gets

E

1n

n∑

i=1

|mn(Xi) − Yi|2 − |m(Xi) − Yi|2

≤ E

1n

n∑

i=1

|mn(Xi) − Yi|2 − |m(Xi) − Yi|2

Page 216: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

11.3. Nonlinear Least Squares Estimates 199

= E

inff∈Fn

1n

n∑

i=1

|f(Xi) − Yi|2 − |m(Xi) − Yi|2

≤ inff∈Fn

E

1n

n∑

i=1

|f(Xi) − Yi|2 − |m(Xi) − Yi|2

= inff∈Fn

E|f(X) − Y |2 − E|m(X) − Y |2

= inff∈Fn

∫|f(x) −m(x)|2µ(dx). (11.12)

Next we derive an upper bound for the first term on the right-hand side of(11.11). We have

P

E|mn(X) − Y |2|Dn

− E|m(X) − Y |2

−21n

n∑

i=1

|mn(Xi) − Yi|2 − |m(Xi) − Yi|2 > ε

= P

E|mn(X) − Y |2|Dn

− E|m(X) − Y |2

− 1n

n∑

i=1

|mn(Xi) − Yi|2 − |m(Xi) − Yi|2

2+

12E|mn(X) − Y |2|Dn

− E|m(X) − Y |2

≤ P

∃f ∈ TβnFn : E|f(X) − Y |2 − E|m(X) − Y |2

− 1n

n∑

i=1

|f(Xi) − Yi|2 − |m(Xi) − Yi|2

2+

12E|f(X) − Y |2 − E|m(X) − Y |2

. (11.13)

We know from Chapter 9 how to extend a bound on the right-hand sideof (11.13) from a fixed function to a set of functions by the use of fixedsup-norm or random L1 norm covers.

For simplicity let us consider for a moment the right-hand side of (11.13)for only one fixed function f ∈ TβnFn. Set Z = (X,Y ), Zi = (Xi, Yi)(i = 1, . . . , n),

g(Z) = |f(X) − Y |2 − |m(X) − Y |2

Page 217: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

200 11. Least Squares Estimates II: Rate of Convergence

and

g(Zi) = |f(Xi) − Yi|2 − |m(Xi) − Yi|2 (i = 1, ..., n).

Then g(Z), g(Z1), ..., g(Zn) are i.i.d random variables such that |g(Z)| ≤4β2

n and we want to bound

P

Eg(Z) − 1n

n∑

i=1

g(Zi) >ε

2+

12Eg(Z)

. (11.14)

The main trick is that the variance of g(Z) is bounded by some constanttimes the expectation of g(Z) (compare also Problem 7.3): Indeed,

g(Z) = (f(X) − Y +m(X) − Y ) ((f(X) − Y ) − (m(X) − Y ))

= (f(X) +m(X) − 2Y )(f(X) −m(X))

and thus

σ2 = Var(g(Z)) ≤ Eg(Z)2 ≤ 16β2n E|f(X) −m(X)|2

= 16β2n (E|f(X) − Y |2 − E|m(X) − Y |2)

= 16β2n Eg(Z). (11.15)

This enables us to derive an excellent bound for (11.14) by applying theBernstein inequality (Lemma A.2):

P

Eg(Z) − 1n

n∑

i=1

g(Zi) >ε

2+

12Eg(Z)

≤ P

Eg(Z) − 1n

n∑

i=1

g(Zi) >ε

2+

12σ2

16β2n

≤ exp

−n ε

2 + σ2

32β2n2

2σ2 + 2 8β2n

3

ε2 + σ2

32β2n

≤ exp

−n ε

2 + σ2

32β2n2

(64β2

n + 16β2n

3

ε2 + σ2

32β2n

= exp

−n

ε2 + σ2

32β2n

64β2n + 16

3 β2n

≤ exp(

− 1128 + 32

3

· nεβ2

n

).

The main advantage of this upper bound is that ε appears only in linearand not squared form as in (11.10) and therefore (for constant βn) this

Page 218: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

11.3. Nonlinear Least Squares Estimates 201

upper bound converges to zero whenever

εn = εnn → ∞ as n → ∞.

What we now need is an extension of this upper bound from the case ofa fixed function to the general case of a set of functions as in (11.13). Aswe have mentioned before, this can be done by the use of random L1 normcovers. The result is summarized in the next theorem.

Theorem 11.4. Assume |Y | ≤ B a.s. and B ≥ 1. Let F be a set offunctions f : Rd → R and let |f(x)| ≤ B,B ≥ 1. Then, for each n ≥ 1,

P

∃f ∈ F : E|f(X) − Y |2 − E|m(X) − Y |2

− 1n

n∑

i=1

|f(Xi) − Yi|2 − |m(Xi) − Yi|2

≥ ε · (α+ β + E|f(X) − Y |2 − E|m(X) − Y |2)

≤ 14 supxn1

N1

(βε

20B,F , xn

1

)exp

(− ε2(1 − ε)αn

214(1 + ε)B4

)

where α, β > 0 and 0 < ε ≤ 1/2.

We will prove this theorem in the next two sections.Now we are ready to formulate and prove our main result.

Theorem 11.5. Let n ∈ N and 1 ≤ L < ∞. Assume |Y | ≤ L a.s. Let theestimate mn be defined by minimization of the empirical L2 risk over a setof functions Fn and truncation at ±L. Then one has

E∫

|mn(x) −m(x)|2µ(dx)

≤ c1n

+(c2 + c3 log(n))VF+

n

n+ 2 inf

f∈Fn

∫|f(x) −m(x)|2µ(dx),

where

c1 = 24 · 214L4(1 + log 42), c2 = 48 · 214L4 log(480eL2),

and

c3 = 48 · 214L4.

If Fn is a linear vector space of dimension Kn then VF+n

is boundedfrom above by Kn +1 (cf. Theorem 9.5), and we get again the bound fromTheorem 11.3 with slightly different conditions. The main advantage of theabove theorem compared to Theorem 11.3 is that in the above theorem

Page 219: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

202 11. Least Squares Estimates II: Rate of Convergence

Fn doesn’t have to be a linear vector space. But, unfortunately, we needstronger assumptions on Y : Y must be bounded while in Theorem 11.3we only needed assumptions on the variance of Y and boundedness of theregression function.

Proof. We use the error decomposition∫

|mn(x) −m(x)|2µ(dx)

=

E|mn(X) − Y |2|Dn − E|m(X) − Y |2

−2 ·(

1n

n∑

i=1

|mn(Xi) − Yi|2 − 1n

n∑

i=1

|m(Xi) − Yi|2)

+2 ·(

1n

n∑

i=1

|mn(Xi) − Yi|2 − 1n

n∑

i=1

|m(Xi) − Yi|2)

= T1,n + T2,n.

By (11.12),

ET2,n ≤ 2 inff∈Fn

Rd

|f(x) −m(x)|2µ(dx),

thus it suffices to show

ET1,n ≤ c1n

+(c2 + c3 log(n)) · VF+

n

n.

Let t ≥ 1n be arbitrary. Then

PT1,n > t

= P

E|mn(X) − Y |2|Dn − E|m(X) − Y |2

− 1n

n∑

i=1

|mn(Xi) − Yi|2 − |m(Xi) − Yi|2

>12(t+ E|mn(X) − Y |2|Dn − E|m(X) − Y |2)

≤ P

∃f ∈ TLFn : E|f(X) − Y |2 − E|m(X) − Y |2

− 1n

n∑

i=1

|f(Xi) − Yi|2 − |m(Xi) − Yi|2

Page 220: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

11.4. Preliminaries to the Proof of Theorem 11.4 203

≥ 12

· (t

2+t

2+ E|f(X) − Y |2 − E|m(X) − Y |2)

≤ 14 supxn1

N1

(t

80L, TLFn, x

n1

)· exp

(− n

24 · 214L4 t)

≤ 14 supxn1

N1

(1

80L · n, TLFn, xn1

)· exp

(− n

24 · 214L4 t),

where we have used Theorem 11.4 and t ≥ 1n . By Lemma 9.2 and Theorem

9.4 we get, for the covering number

N1

(1

80L · n, TLFn, xn1

)≤ 3

(2e(2L)

180L·n

log(

3e(2L)1

80L·n

))VTLF+

n

≤ 3(480eL2n

)2VTLF+

n .

Using this and

VTLF+n

≤ VF+n

(see proof of Theorem 10.3), one gets, for arbitrary ε ≥ 1n ,

ET1,n =∫ ∞

0PT1,n > t dt ≤ ε+

∫ ∞

ε

PT1,n > t dt

≤ ε+∫ ∞

ε

42(480eL2n

)2VF+n exp

(− n

24 · 214L4 t)dt

= ε+ 42(480eL2n

)2VF+n

24 · 214L4

nexp

(− n

24 · 214L4 ε).

The above expression is minimized for

ε =24 · 214L4

nlog

(42

(480eL2n

)2VF+n

),

which yields

ET1,n ≤ 24 · 214L4

n·(log(42) + 2VF+

nlog(480eL2n)

)+

24 · 214L4

n.

11.4 Preliminaries to the Proof of Theorem 11.4

In the proof of Theorem 11.4, which will be given in the next section, wewill need the following two auxiliary results:

Page 221: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

204 11. Least Squares Estimates II: Rate of Convergence

Lemma 11.2. Let V1, . . . , Vn be i.i.d. random variables, 0 ≤ Vi ≤ B, 0 <α < 1, and ν > 0. Then

P | 1

n

∑ni=1 Vi − EV1|

ν + 1n

∑ni=1 Vi + EV1

> α

≤ P

| 1n

∑ni=1 Vi − EV1|ν + EV1

> α

<B

4α2νn.

Proof. By the Chebyshev inequality we have

P |∑n

i=1(Vi − EVi)|nν + nEV1

> α

= P

|n∑

i=1

(Vi − EVi)| > αn(ν + EV1)

≤ E |∑ni=1(Vi − EVi)|2

(αn(ν + EV1))2 =

Var(V1)nα2(ν + EV1)2

. (11.16)

Now

Var(V1) = E (V1 − EV1)(V1 − EV1)= E V1(V1 − EV1) ≤ EV1(B − EV1).

Substituting the bound on the variance into the right-hand side of (11.16)we get

P |∑n

i=1(Vi − EVi)|nν + nEV1

> α

≤ EV1(B − EV1)

nα2(ν + EV1)2. (11.17)

In order to maximize the bound in (11.17) with respect to EV1 considerthe function f(x) = x(B−x)/nα2(ν+x)2 which attains its maximal value

B2

4nα2ν(B+ν) for x = Bν/(B + 2ν). Hence the right-hand side of (11.17) isbounded above by

B2

4α2νn(B + ν)<

B

4α2νn

yielding the desired result.

Theorem 11.6. Let B ≥ 1 and let G be a set of functions g : Rd → [0, B].Let Z,Z1, . . . , Zn be i.i.d. Rd-valued random variables. Assume α > 0,0 < ε < 1, and n ≥ 1. Then

P

supg∈G

1n

∑ni=1 g(Zi) − Eg(Z)

α+ 1n

∑ni=1 g(Zi) + Eg(Z)

> ε

≤ 4EN1

(αε5,G, Zn

1

)exp

(−3ε2αn

40B

). (11.18)

Page 222: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

11.4. Preliminaries to the Proof of Theorem 11.4 205

Proof. The proof will be divided into several steps.

Step 1. Replace the expectation inside the probability in (11.18) by anempirical mean based on a ghost sample.Draw a “ghost” sample Z ′n

1 = (Z ′1, . . . , Z

′n) of i.i.d. random variables dis-

tributed as Z and independent of Zn1 . Let g∗ be a function g ∈ G such

that

1n

n∑

i=1

g(Zi) − Eg(Z) > ε

(

α+1n

n∑

i=1

g(Zi) + Eg(Z)

)

,

if there exists any such function, and let g∗ be an other arbitrary functioncontained in G, if such a function doesn’t exist. Note that g∗ depends onZn

1 .Observe that

1n

n∑

i=1

g(Zi) − Eg(Z) > ε

(

α+1n

n∑

i=1

g(Zi) + Eg(Z)

)

and

1n

n∑

i=1

g(Z ′i) − Eg(Z) ≤ ε

4

(

α+1n

n∑

i=1

g(Z ′i) + Eg(Z)

)

imply

1n

n∑

i=1

g(Zi) − 1n

n∑

i=1

g(Z ′i)

>3εα4

+ ε1n

n∑

i=1

g(Zi) − ε

41n

n∑

i=1

g(Z ′i) +

3ε4

Eg(Z),

which is equivalent to

(1 − 58ε)

(1n

n∑

i=1

g(Zi) − 1n

n∑

i=1

g(Z ′i)

)

>3ε8

(

2α+1n

n∑

i=1

g(Zi) +1n

n∑

i=1

g(Z ′i)

)

+3ε4

Eg(Z).

Because of 0 < 1 − 58ε < 1 and Eg(Z) ≥ 0 the last inequality implies

1n

n∑

i=1

g(Zi) − 1n

n∑

i=1

g(Z ′i) >

3ε8

(

2α+1n

n∑

i=1

g(Zi) +1n

n∑

i=1

g(Z ′i)

)

.

Using this we conclude

Page 223: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

206 11. Least Squares Estimates II: Rate of Convergence

P

∃g ∈ G :1n

n∑

i=1

g(Zi) − 1n

n∑

i=1

g(Z ′i)

>3ε8

(

2α+1n

n∑

i=1

g(Zi) +1n

n∑

i=1

g(Z ′i)

)

≥ P

1n

n∑

i=1

g∗(Zi) − 1n

n∑

i=1

g∗(Z ′i)

>3ε8

(

2α+1n

n∑

i=1

g∗(Zi) +1n

n∑

i=1

g∗(Z ′i)

)

≥ P

1n

n∑

i=1

g∗(Zi) − Eg∗(Z)|Zn1

> ε

(

α+1n

n∑

i=1

g∗(Zi) + Eg∗(Z)|Zn1

)

,

1n

n∑

i=1

g∗(Z ′i) − Eg∗(Z)|Zn

1

≤ ε

4

(

α+1n

n∑

i=1

g∗(Z ′i) + Eg∗(Z)|Zn

1 )

= E

I 1n

∑n

i=1g∗(Zi)−Eg∗(Z)|Zn

1 >ε(α+ 1n

∑n

i=1g∗(Zi)+Eg∗(Z)|Zn

1 )

×P

1n

n∑

i=1

g∗(Z ′i) − Eg∗(Z)|Zn

1

≤ ε

4

(

α+1n

n∑

i=1

g∗(Z ′i) + Eg∗(Z)|Zn

1 )∣∣∣∣∣Zn

1

.

Using Lemma 11.2 we get

P

1n

n∑

i=1

g∗(Z ′i) − Eg∗(Z)|Zn

1

4

(

α+1n

n∑

i=1

g∗(Z ′i) + Eg∗(Z)|Zn

1 )∣∣∣∣∣Zn

1

<B

4(

ε4

)2αn

=4Bε2αn

.

Page 224: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

11.4. Preliminaries to the Proof of Theorem 11.4 207

Thus, for n > 8Bε2α , the probability inside the expectation is greater than or

equal to 12 and we can conclude

P

∃g ∈ G :1n

n∑

i=1

g(Zi) − 1n

n∑

i=1

g(Z ′i)

>3ε8

(

2α+1n

n∑

i=1

g(Zi) +1n

n∑

i=1

g(Z ′i)

)

≥ 12P

1n

n∑

i=1

g∗(Zi) − E(g∗(Z)|Zn1 )

> ε

(

α+1n

n∑

i=1

g∗(Zi) + Eg∗(Z)|Zn1

)

=12P

∃g ∈ G :1n

n∑

i=1

g(Zi) − Eg(Z)

> ε

(

α+1n

n∑

i=1

g(Zi) + Eg(Z)

)

.

This proves

P

∃g ∈ G :1n

∑ni=1 g(Zi) − Eg(Z)

α+ 1n

∑ni=1 g(Zi) + Eg(Z)

> ε

≤ 2P

∃g ∈ G :1n

n∑

i=1

g(Zi) − 1n

n∑

i=1

g(Z ′i)

>3ε8

(

2α+1n

n∑

i=1

g(Zi) +1n

n∑

i=1

g(Z ′i)

)

for n > 8Bε2α . Observe that for n ≤ 8B

ε2α the right-hand side of (11.18) isgreater than 1, and hence the assertion is trivial.

Step 2. Introduction of additional randomness by random signs.Let U1, . . . , Un be independent and uniformly distributed over −1, 1 andindependent of Z1, . . . , Zn, Z ′

1, . . . , Z′n. Because of the independence and

identical distribution of Z1, . . . , Zn, Z ′1, . . . , Z

′n the joint distribution of

Zn1 , Z

′n1 is not affected if one randomly interchanges the corresponding

components of Zn1 and Z ′n

1 . Clearly, this also doesn’t affect 1n

∑ni=1(g(Zi)+

g(Z ′i)). Hence

Page 225: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

208 11. Least Squares Estimates II: Rate of Convergence

P

∃g ∈ G :1n

n∑

i=1

(g(Zi) − g(Z ′i))

>3ε8

(

2α+1n

n∑

i=1

(g(Zi) + g(Z ′i))

)

= P

∃g ∈ G :1n

n∑

i=1

Ui (g(Zi) − g(Z ′i))

>3ε8

(

2α+1n

n∑

i=1

(g(Zi) + g(Z ′i))

)

≤ P

∃g ∈ G :1n

n∑

i=1

Uig(Zi) >3ε8

(

α+1n

n∑

i=1

g(Zi)

)

+P

∃g ∈ G :1n

n∑

i=1

Uig(Z ′i) < −3ε

8

(

α+1n

n∑

i=1

g(Z ′i)

)

= 2P

∃g ∈ G :1n

n∑

i=1

Uig(Zi) >3ε8

(

α+1n

n∑

i=1

g(Zi)

)

,

where we have used the fact that −Ui has the same distribution as Ui.

Step 3. Conditioning and introduction of a covering.Next we condition the last probability on Zn

1 , which is equivalent to fixingz1, . . . , zn ∈ Rd and to considering

P

∃g ∈ G :1n

n∑

i=1

Uig(zi) >3ε8

(

α+1n

n∑

i=1

g(zi)

)

. (11.19)

Let δ > 0 and let Gδ be an L1 δ-cover of G on zn1 . Fix g ∈ G. Then there

exists a g ∈ Gδ such that

1n

n∑

i=1

|g(zi) − g(zi)| < δ. (11.20)

Without loss of generality we may assume 0 ≤ g(z) ≤ B. Formula (11.20)implies

1n

n∑

i=1

Uig(zi) =1n

n∑

i=1

Uig(zi) +1n

n∑

i=1

Ui(g(zi) − g(zi))

≤ 1n

n∑

i=1

Uig(zi) +1n

n∑

i=1

|g(zi) − g(zi)| < 1n

n∑

i=1

Uig(zi) + δ

Page 226: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

11.4. Preliminaries to the Proof of Theorem 11.4 209

and

1n

n∑

i=1

g(zi) ≥ 1n

n∑

i=1

g(zi) − 1n

n∑

i=1

|g(zi) − g(zi)| ≥ 1n

n∑

i=1

g(zi) − δ.

Using this we can bound the probability in (11.19) by

P

∃g ∈ Gδ :1n

n∑

i=1

Uig(zi) + δ >3ε8

(

α+1n

n∑

i=1

g(zi) − δ

)

≤ |Gδ|maxg∈Gδ

P

1n

n∑

i=1

Uig(zi) >3εα8

− 3εδ8

− δ +3ε8

1n

n∑

i=1

g(zi)

.

Set δ = εα5 , which implies

3εα8

− 3εδ8

− δ ≥ 3εα8

− 3εα40

− εα

5=εα

10,

and choose Gδ as an L1εα5 -cover on zn

1 of minimal size. Then we have

P

∃g ∈ G :1n

n∑

i=1

Uig(zi) >3ε8

(

α+1n

n∑

i=1

g(zi)

)

≤ N1

(εα5,G, zn

1

)max

g∈G εα5

P

1n

n∑

i=1

Uig(zi) >εα

10+

3ε8

1n

n∑

i=1

g(zi)

.

Step 4. Application of Hoeffding’s inequality.In this step we bound

P

1n

n∑

i=1

Uig(zi) >εα

10+

3ε8

1n

n∑

i=1

g(zi)

,

where z1, . . . , zn ∈ Rd, g : Rd → R, and 0 ≤ g(z) ≤ B.U1g(z1), . . . , Ung(zn) are independent random variables with

−g(zi) ≤ Uig(zi) ≤ g(zi) (i = 1, . . . , n),

therefore, by Hoeffding’s inequality,

P

1n

n∑

i=1

Uig(zi) >εα

10+

3ε8

1n

n∑

i=1

g(zi)

≤ exp

(

−2n2(

εα10 + 3ε

81n

∑ni=1 g(zi)

)2

4∑n

i=1 g(zi)2

)

≤ exp

(

−n2(

εα10 + 3ε

81n

∑ni=1 g(zi)

)2

2B∑n

i=1 g(zi)

)

Page 227: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

210 11. Least Squares Estimates II: Rate of Convergence

= exp

(

− 9ε2

128B

(n 4

15α+∑n

i=1 g(zi))2

∑ni=1 g(zi)

)

.

An easy calculation shows that, for arbitrary a > 0, one has

(a+ y)2

y≥ (a+ a)2

a= 4a (y ∈ R+).

This implies(n 4

15α+∑n

i=1 g(zi))2

∑ni=1 g(zi)

≥ 4n415α =

1615αn

and, hence,

P

1n

n∑

i=1

Uig(zi) >εα

10+

3ε8

1n

n∑

i=1

g(zi)

≤ exp(

− 9ε2

128B1615αn

)

= exp(

−3αε2n40B

).

The assertion is now implied by the four steps.

11.5 Proof of Theorem 11.4

Proof. Let us introduce the following notation

Z = (X,Y ), Zi = (Xi, Yi), i = 1, . . . , n,

and

gf (x, y) = |f(x) − y|2 − |m(x) − y|2.Observe that |f(x)| ≤ B, |y| ≤ B, and |m(x)| ≤ B imply

−4B2 ≤ gf (x, y) ≤ 4B2.

We can rewrite the probability in the theorem as follows

P

∃f ∈ F : Egf (Z) − 1n

n∑

i=1

gf (Zi) ≥ ε(α+ β + Egf (Z))

. (11.21)

The proof will proceed in several steps.

Step 1. Symmetrization by a ghost sample.Replace the expectation on the left-hand side of the inequality in (11.21)by the empirical mean based on the ghost sample Z ′n

1 of i.i.d. random

Page 228: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

11.5. Proof of Theorem 11.4 211

variables distributed as Z and independent of Zn1 . Consider a function

fn ∈ F depending upon Zn1 such that

Egfn(Z)|Zn1 − 1

n

n∑

i=1

gfn(Zi) ≥ ε(α+ β) + εEgfn(Z)|Zn1 ,

if such a function exists in F , otherwise choose an arbitrary function in F .Chebyshev’s inequality, together with

Var gfn(Z)|Zn

1 ≤ 16B2E gfn(Z)|Zn1

(cf. (11.15)), imply

P

Egfn(Z)|Zn

1 − 1n

n∑

i=1

gfn(Z ′

i)

2(α+ β) +

ε

2Egfn(Z)|Zn

1 ∣∣∣∣Z

n1

≤ Var gfn(Z)|Zn1

n · ( ε2 (α+ β) + ε

2E gfn(Z)|Zn

1 )2

≤ 16B2E gfn(Z)|Zn1

n · ( ε2 (α+ β) + ε

2E gfn(Z)|Zn

1 )2

≤ 16B2

ε2(α+ β)n,

where the last inequality follows from

f(x) =x

(a+ x)2≤ f(a) =

14a

for all x ≥ 0 and all a > 0. Thus, for n > 128B2

ε2(α+β) ,

P

Egfn(Z)|Zn

1 − 1n

n∑

i=1

gfn(Z ′

i) ≤ ε

2(α+ β) +

ε

2Egfn

(Z)|Zn1

∣∣∣∣∣Zn

1

≥ 78. (11.22)

Hence

P

∃f ∈ F :1n

n∑

i=1

gf (Z ′i) − 1

n

n∑

i=1

gf (Zi) ≥ ε

2(α+ β) +

ε

2Egf (Z)

≥ P

1n

n∑

i=1

gfn(Z ′i) − 1

n

n∑

i=1

gfn(Zi) ≥ ε

2(α+ β) +

ε

2Egfn(Z)|Zn

1

Page 229: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

212 11. Least Squares Estimates II: Rate of Convergence

≥ P

Egfn(Z)|Zn1 − 1

n

n∑

i=1

gfn(Zi) ≥ ε(α+ β) + εEgfn(Z)|Zn1 ,

Egfn(Z)|Zn

1 − 1n

n∑

i=1

gf (Z ′i) ≤ ε

2(α+ β) +

ε

2Egfn(Z)|Zn

1

= EIEgfn (Z)|Zn

1 − 1n

∑n

i=1gfn (Zi)≥ε(α+β)+εEgfn (Z)|Zn

1 ×E

IEgfn (Z)|Zn

1 − 1n

∑n

i=1gf (Z′

i)≤ ε

2 (α+β)+ ε2Egfn (Z)|Zn

1 |Zn1

= E

I···P

Egfn(Z)|Zn1 − 1

n

n∑

i=1

gf (Z ′i)

≤ ε

2(α+ β) +

ε

2Egfn

(Z)|Zn1

∣∣∣∣Z

n1

≥ 78P

Egfn(Z)|Zn

1 − 1n

n∑

i=1

gfn(Zi) ≥ ε(α+ β) + εEgfn

(Z)|Zn1

=78P

∃f ∈ F : Egf (Z) − 1n

n∑

i=1

gf (Zi) ≥ ε(α+ β) + εEgf (Z)

,

where the last inequality follows from (11.22). Thus we have shown that,for n > 128B2

ε2(α+β) ,

P

∃f ∈ F : Egf (Z) − 1n

n∑

i=1

gf (Zi) ≥ ε(α+ β) + εEgf (Z)

≤ 87P

∃f ∈ F :1n

n∑

i=1

gf (Z ′i) − 1

n

n∑

i=1

gf (Zi)

≥ ε

2(α+ β) +

ε

2Egf (Z)

. (11.23)

Step 2. Replacement of the expectation in (11.23) by an empirical meanof the ghost sample.First we introduce additional conditions in the probability (11.23),

P

∃f ∈ F :1n

n∑

i=1

gf (Z ′i) − 1

n

n∑

i=1

gf (Zi) ≥ ε

2(α+ β) +

ε

2Egf (Z)

≤ P

∃f ∈ F :1n

n∑

i=1

gf (Z ′i) − 1

n

n∑

i=1

gf (Zi) ≥ ε

2(α+ β) +

ε

2Egf (Z),

1n

n∑

i=1

g2f (Zi) − Eg2

f (Z) ≤ ε

(

α+ β +1n

n∑

i=1

g2f (Zi) + Eg2

f (Z)

)

,

Page 230: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

11.5. Proof of Theorem 11.4 213

1n

n∑

i=1

g2f (Z ′

i) − Eg2f (Z) ≤ ε

(

α+ β +1n

n∑

i=1

g2f (Z ′

i) + Eg2f (Z)

)

+ 2P

∃f ∈ F :

1n

∑ni=1 g

2f (Zi) − Eg2

f (Z)(α+ β + 1

n

∑ni=1 g

2f (Zi) + Eg2

f (Z)) > ε

. (11.24)

Application of Theorem 11.6 to the second probability on the right–handside of (11.24) yields

P

∃f ∈ F :

1n

∑ni=1 g

2f (Zi) − Eg2

f (Z)(α+ β + 1

n

∑ni=1 g

2f (Zi) + Eg2

f (Z)) > ε

≤ 4EN1

((α+ β)ε

5, gf : f ∈ F, Zn

1

)exp

(−3ε2(α+ β)n

40(16B4)

).

Now we consider the first probability on the right-hand side of (11.24). Thesecond inequality inside the probability implies

(1 + ε)Eg2f (Z) ≥ (1 − ε)

1n

n∑

i=1

g2f (Zi) − ε(α+ β),

which is equivalent to

132B2 Eg2

f (Z) ≥ 1 − ε

32B2(1 + ε)1n

n∑

i=1

g2f (Zi) − ε

(α+ β)32B2(1 + ε)

.

We can deal similarly with the third inequality. Using this and the inequal-ity Egf (Z) ≥ 1

16B2 Eg2f (Z) = 2 1

32B2 Eg2f (Z) (see (11.15)) we can bound the

first probability on the right-hand side of (11.24) by

P

∃f ∈ F :1n

n∑

i=1

gf (Z ′i) − 1

n

n∑

i=1

gf (Zi)

≥ ε(α+ β)/2 +ε

2

(1 − ε

32B2(1 + ε)1n

n∑

i=1

g2f (Zi) − ε(α+ β)

32B2(1 + ε)

+1 − ε

32B2(1 + ε)1n

n∑

i=1

g2f (Z ′

i) − ε(α+ β)32B2(1 + ε)

)

.

This shows

P

∃f ∈ F :1n

n∑

i=1

gf (Z ′i) − 1

n

n∑

i=1

gf (Zi) ≥ ε(α+ β)/2 + εEgf (Z)/2

Page 231: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

214 11. Least Squares Estimates II: Rate of Convergence

≤ P

∃f ∈ F :1n

n∑

i=1

(gf (Z ′i) − gf (Zi))

≥ ε(α+ β)/2 − ε2(α+ β)32B2(1 + ε)

+ε(1 − ε)

64B2(1 + ε)1n

n∑

i=1

(g2f (Zi) + g2

f (Z ′i))

+8EN1

((α+ β)ε

5, gf : f ∈ F, Zn

1

)exp

(−3ε2(α+ β)n

640B4

). (11.25)

Step 3. Additional randomization by random signs.Let U1, . . . , Un be independent and uniformly distributed over the set−1, 1 and independent of Z1, . . . , Zn, Z

′1, . . . , Z

′n. Because of the inde-

pendence and identical distribution of Z1, . . . , Z′n the joint distribution of

Zn1 , Z

′n1 is not affected by the random interchange of the corresponding com-

ponents in Zn1 and Z ′n

1 . Therefore the first probability on the right-handside of inequality (11.25) is equal to

P

∃f ∈ F :1n

n∑

i=1

Ui(gf (Z ′i) − gf (Zi))

≥ ε

2(α+ β) − ε2(α+ β)

32B2(1 + ε)+

ε(1 − ε)64B2(1 + ε)

(1n

n∑

i=1

(g2f (Zi) + g2

f (Z ′i))

)

and this, in turn, by the union bound, is bounded by

P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

Uigf (Z ′i)

∣∣∣∣∣

≥ 12

(ε(α+ β)/2 − ε2(α+ β)

32B2(1 + ε)

)+

ε(1 − ε)64B2(1 + ε)

1n

n∑

i=1

g2f (Z ′

i)

+P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

Uigf (Zi)

∣∣∣∣∣

≥ 12

(ε(α+ β)/2 − ε2(α+ β)

32B2(1 + ε)

)+

ε(1 − ε)64B2(1 + ε)

1n

n∑

i=1

g2f (Zi)

= 2P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

Uigf (Zi)

∣∣∣∣∣

≥ ε(α+ β)/4 − ε2(α+ β)64B2(1 + ε)

+ε(1 − ε)

64B2(1 + ε)1n

n∑

i=1

g2f (Zi)

. (11.26)

Step 4. Conditioning and using covering.Next we condition the probability on the right-hand side of (11.26) on Zn

1 ,

Page 232: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

11.5. Proof of Theorem 11.4 215

which is equivalent to fixing z1, . . . , zn and considering

P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

Uigf (zi)

∣∣∣∣∣

≥ ε(α+ β)/4 − ε2(α+ β)64B2(1 + ε)

+ε(1 − ε)

64B2(1 + ε)1n

n∑

i=1

g2f (zi)

.

Let δ > 0 and let Gδ be an L1 δ-cover of GF = gf : f ∈ F on z1, . . . , zn.Fix f ∈ F . Then there exists g ∈ Gδ such that

1n

n∑

i=1

|g(zi) − gf (zi)| < δ.

Without loosing generality we can assume −4B2 ≤ g(z) ≤ 4B2. Thisimplies

∣∣∣∣∣1n

n∑

i=1

Uigf (zi)

∣∣∣∣∣

=

∣∣∣∣∣1n

n∑

i=1

Uig(zi) +1n

n∑

i=1

Ui(gf (zi) − g(zi))

∣∣∣∣∣

≤∣∣∣∣∣1n

n∑

i=1

Uig(zi)

∣∣∣∣∣+

1n

n∑

i=1

|gf (zi) − g(zi)|

<

∣∣∣∣∣1n

n∑

i=1

Uig(zi)

∣∣∣∣∣+ δ

and

1n

n∑

i=1

g2f (zi) =

1n

n∑

i=1

g2(zi) +1n

n∑

i=1

(g2f (zi) − g2(zi))

=1n

n∑

i=1

g2(zi) +1n

n∑

i=1

(gf (zi) + g(zi))(gf (zi) − g(zi))

≥ 1n

n∑

i=1

g2(zi) − 8B2 1n

n∑

i=1

|gf (zi) − g(zi)|

≥ 1n

n∑

i=1

g2(zi) − 8B2δ.

It follows

P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

Uigf (zi)

∣∣∣∣∣

≥ ε(α+ β)/4 − ε2(α+ β)64B2(1 + ε)

+ε(1 − ε)

64B2(1 + ε)1n

n∑

i=1

g2f (zi)

Page 233: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

216 11. Least Squares Estimates II: Rate of Convergence

≤ P

∃g ∈ Gδ :

∣∣∣∣∣1n

n∑

i=1

Uig(zi)

∣∣∣∣∣+ δ

≥ ε(α+ β)/4 − ε2(α+ β)64B2(1 + ε)

+ε(1 − ε)

64B2(1 + ε)

(1n

n∑

i=1

g2(zi) − 8B2δ

)

≤ |Gδ|maxg∈Gδ

P

∣∣∣∣∣1n

n∑

i=1

Uig(zi)

∣∣∣∣∣

≥ ε(α+ β)/4 − ε2(α+ β)64B2(1 + ε)

− δ − δε(1 − ε)8(1 + ε)

+ε(1 − ε)

64B2(1 + ε)1n

n∑

i=1

g2(zi)

.

Next we set δ = εβ/5. This, together with B ≥ 1 and 0 < ε ≤ 12 , implies

εβ

4− ε2β

64B2(1 + ε)− δ − δ

ε(1 − ε)8(1 + ε)

=εβ

20− ε2β

64B2(1 + ε)− ε2(1 − ε)β

40(1 + ε)≥ 0.

Thus

P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

Uigf (zi)

∣∣∣∣∣

≥ ε(α+ β)/4 − ε2(α+ β)64B2(1 + ε)

+ε(1 − ε)

64B2(1 + ε)1n

n∑

i=1

g2f (zi)

≤ |G εβ5

| maxg∈G εβ

5

P

∣∣∣∣∣1n

n∑

i=1

Uig(zi)

∣∣∣∣∣

≥ εα

4− ε2α

64B2(1 + ε)+

ε(1 − ε)64B2(1 + ε)

1n

n∑

i=1

g2(zi)

.

Step 5. Application of Bernstein’s inequality.In this step we use Bernstein’s inequality to bound

P

∣∣∣∣∣1n

n∑

i=1

Uig(zi)

∣∣∣∣∣≥ εα

4− ε2α

64B2(1 + ε)+

ε(1 − ε)64B2(1 + ε)

1n

n∑

i=1

g2(zi)

,

Page 234: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

11.5. Proof of Theorem 11.4 217

where z1, . . . , zn ∈ Rd × R are fixed and g : Rd × R → R satisfies −4B2 ≤g(z) ≤ 4B2. First we relate 1

n

∑ni=1 g

2(zi) to the variance of Uig(zi),

1n

n∑

i=1

Var(Uig(zi)) =1n

n∑

i=1

g2(zi)Var(Ui) =1n

n∑

i=1

g2(zi).

Thus the probability above is equal to

P

∣∣∣∣∣1n

n∑

i=1

Vi

∣∣∣∣∣≥ A1 +A2σ

2

,

where

Vi = Uig(zi), σ2 =1n

n∑

i=1

Var(Uig(zi)),

A1 =εα

4− ε2α

64B2(1 + ε), A2 =

ε(1 − ε)64B2(1 + ε)

.

Observe that V1, . . . , Vn are independent random variables satisfying |Vi| ≤|g(zi)| ≤ 4B2 (i = 1, . . . , n), and that A1, A2 ≥ 0. We have, by Bernstein’sinequality,

P

∣∣∣∣∣1n

n∑

i=1

Vi

∣∣∣∣∣≥ A1 +A2σ

2

≤ 2 exp

(

− n(A1 +A2σ2)2

2σ2 + 2(A1 +A2σ2) 8B2

3

)

= 2 exp

− nA2

2163 B

2A2·

(A1A2

+ σ2)2

A1A2

+(1 + 3

8B2A2

)σ2

= 2 exp

−3 · n ·A2

16B2 ·(

A1A2

+ σ2)2

A1A2

+(1 + 3

8B2A2

)σ2

. (11.27)

An easy calculation (cf. Problem 11.1) shows that, for arbitrary a, b, u > 0,one has

(a+ u)2

a+ b · u ≥(a+ b−2

b a)2

a+ b b−2b a

= 4ab− 1b2

.

Page 235: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

218 11. Least Squares Estimates II: Rate of Convergence

Thus setting a = A1/A2, b =(1 + 3

8B2A2

), u = σ2, and using the bound

above we get, for the exponent in (11.27),

3 · n ·A2

16B2 ·(

A1A2

+ σ2)2

A1A2

+(1 + 3

8B2A2

)σ2

≥ 3 · n ·A2

16B2 · 4 · A1

A2

38B2A2(

1 + 38B2A2

)2

= 18nA1A2

(8B2A2 + 3)2.

Substituting the formulas for A1 and A2 and noticing

A1 =εα

4− ε2α

64B2(1 + ε)≥ εα

4− εα

64=

15εα64

we obtain

18nA1A2

(8B2A2 + 3)2≥ 18n

15εα64

· ε(1 − ε)64B2(1 + ε)

· 1(

ε(1−ε)8(1+ε) + 3

)2

≥ 18n15 · ε2(1 − ε) · α642B2(1 + ε)

· 1( 1

32 + 3)2

=9 · 15

2 · 97 · 97· ε

2(1 − ε)1 + ε

· α · nB2

≥ ε2(1 − ε) · α · n140B2(1 + ε)

.

Plugging the lower bound above into (11.27) we finally obtain

P

∣∣∣∣∣1n

n∑

i=1

Uig(zi)

∣∣∣∣∣≥ εα

4− ε2α

64B2(1 + ε)+

ε(1 − ε)64B2(1 + ε)

1n

n∑

i=1

g2(zi)

≤ 2 exp(

− ε2(1 − ε)αn140B2(1 + ε)

).

Step 6. Bounding the covering number.In this step we construct an L1

εβ5 -cover of gf : f ∈ F on z1, . . . , zn. Let

f1, . . . , fl, l = N1( εβ20B ,F , xn

1 ) be an εβ20B -cover of F on xn

1 . Without loss ofgenerality we may assume |fj(x)| ≤ B for all j. Let f ∈ F be arbitrary.Then there exists an fj such that 1

n

∑ni=1 |f(xi) − fj(xi)| < εβ

20B . We have

1n

n∑

i=1

|gf (zi) − gfj (zi)|

=1n

n∑

i=1

∣∣|f(xi) − yi|2 − |m(xi) − yi|2 − |fj(xi) − yi|2 + |m(xi) − yi|2

∣∣

Page 236: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

11.6. Bibliographic Notes 219

=1n

n∑

i=1

|f(xi) − yi + fj(xi) − yi||f(xi) − yi − fj(xi) + yi|

≤ 4B1n

n∑

i=1

|f(xi) − fj(xi)| < εβ

5.

Thus gf1 , . . . , gflis an εβ

5 -cover of gf : f ∈ F on zn1 of size N1( εβ

20B ,F , xn1 ).

Steps 3 through 6 imply

P

∃f ∈ F :1n

n∑

i=1

(gf (Z ′i) − gf (Zi))

≥ ε

2(α+ β) − ε2(α+ β)

32B2(1 + ε)

+ε(1 − ε)

64B2(1 + ε)1n

n∑

i=1

(g2f (Zi) + g2

f (Z ′i))

≤ 4 supxn1 ∈(Rd)n

N1

(εβ

20B,F , xn

1

)exp

(− ε2(1 − ε)αn

140B2(1 + ε)

).

Step 7. Conclusion.Steps 1, 2, and 6 imply, for n > 128B2

ε2(α+β) ,

P

∃f ∈ F : Egf (Z) − 1n

n∑

i=1

gf (Zi) > ε(α+ β + Egf (Z))

≤ 327

supxn1

N1

(εβ

20B,F , xn

1

)exp

(− ε2(1 − ε)αn

140B2(1 + ε)

)

+647

supxn1

N1

(ε(α+ β)

20B,F , xn

1

)exp

(−3ε2(α+ β)n

640B4

)

≤ 14 supxn1

N1

(εβ

20B,F , xn

1

)exp

(− ε2(1 − ε)αn

214(1 + ε)B4

).

For n ≤ 128B2

ε2(α+β) one has

exp(

− ε2(1 − ε)αn214(1 + ε)B4

)≥ exp

(−128

214

)≥ 1

14,

and hence the assertion follows trivially.

11.6 Bibliographic Notes

Theorem 11.1 is a well-known result from fixed design regression. Thebound of Theorem 11.2 is due to Pollard (1984). Results concerning the

Page 237: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

220 11. Least Squares Estimates II: Rate of Convergence

equivalence of the L2(µ) norm and the empirical norm under regularityconditions on µ can be found in van de Geer (2000).

Theorem 11.4 has been proven by Lee, Bartlett, and Williamson (1996).The bound proven in Lemma 11.2 has been obtained by Haussler (1992)and the bound in Theorem 11.6 by Pollard (1986) and Haussler (1992).

The rate of convergence of regression estimates has been studied in manyarticles, see, e.g., Cox (1988), Rafajlowicz (1987), Shen and Wong (1994),Birge and Massart (1993; 1998), and the literature cited therein.

Problems and Exercises

Problem 11.1. Show that for arbitrary a, b, u > 0 one has

(a + u)2

a + b · u≥

(a + b−2

ba)2

a + b b−2b

a= 4a

b − 1b2 .

Hint: Show that the function

f(u) =(a + u)2

a + b · u

satisfies

f ′(u) < 0 if u <b − 2

b· a

and

f ′(u) > 0 if u >b − 2

b· a.

Problem 11.2. Show that under the assumptions of Theorem 11.5 one has forarbitrary δ > 0,

E

∫|mn(x) − m(x)|2µ(dx)

≤ cδ,L

(1 + log(n))VF+n

n+ (1 + δ) inf

f∈Fn

∫|f(x) − m(x)|2µ(dx)

for some constant cδ,L depending only on δ and L. How does cδ,L depend on δ?Hint: Use the error decomposition

∫|mn(x) − m(x)|2µ(dx)

=

E|mn(X) − Y |2|Dn − E|m(X) − Y |2

−(1 + δ) ·(

1n

n∑

i=1

|mn(Xi) − Yi|2 − 1n

n∑

i=1

|m(Xi) − Yi|2)

Page 238: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

Problems and Exercises 221

+(1 + δ) ·(

1n

n∑

i=1

|mn(Xi) − Yi|2 − 1n

n∑

i=1

|m(Xi) − Yi|2)

= T1,n + T2,n.

Problem 11.3. Prove Theorem 11.3 using Theorems 11.1 and 11.6.

Problem 11.4. Try to improve the constants in Theorem 11.4.

Problem 11.5. Formulate and prove a multivariate version of Corollary 11.2.

Page 239: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

12Least Squares Estimates III:Complexity Regularization

12.1 Motivation

In this chapter we describe the complexity regularization principle whichenables one to define least squares estimates which automatically adapt tothe smoothness of the regression function.

Let us start with a motivation of the complexity regularization principle.Assume that for given data one wants to find a function f : Rd → R fromsome class F , best describing the data by using the least squares criterion,i.e., one wants to find a function in F for which the empirical L2 risk isequal to

minf∈F

1n

n∑

i=1

|f(Xi) − Yi|2. (12.1)

Clearly (12.1) favors large (more complex) classes F yielding smaller valuesby the least squares criterion. However, complex classes F lead to overfit-ting of the data and result in estimates which poorly fit new data (poorgeneralization capability because of large estimation error). In the com-plexity regularization approach we introduce in (12.1) an additive penaltymonotonically increasing with the complexity of F .

The motivation for the penalty used in this chapter is given by thefollowing lemma:

Page 240: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

12.1. Motivation 223

Lemma 12.1. Let 1 ≤ β < ∞, δ ∈ (0, 1), and let F be a class of functionsf : Rd → R. Assume |Y | ≤ β a.s. Then the inequality

∫|f(x) −m(x)|2µ(dx) ≤ c1

log( 42δ )

n+ (c2 + c3 log(n))

VF+

n

+2n

n∑

i=1

|f(Xi) − Yi|2 − |m(Xi) − Yi|2

(12.2)

holds simultaneously for all f ∈ TβF with probability greater than or equalto 1 − δ, where

c1 = 5136β4, c2 = 10272β4 log(480eβ2),

and

c3 = 10272β4.

Proof. Because of∫

|f(x) −m(x)|2µ(dx) = E|f(X) − Y |2 − E|m(X) − Y |2

=

E|f(X) − Y |2 − E|m(X) − Y |2

− 2n

n∑

i=1

(|f(Xi) − Yi|2 − |m(Xi) − Yi|2)

+2n

n∑

i=1

(|f(Xi) − Yi|2 − |m(Xi) − Yi|2)

=: T1 + T2,

it suffices to show

P

∃f ∈ TβF : T1 > c1log( 42

δ )n

+ (c2 + c3 log(n))VF+

n

≤ δ. (12.3)

Let t ≥ 1n be arbitrary. Then one has

P ∃f ∈ TβF : T1 > t

Page 241: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

224 12. Least Squares Estimates III: Complexity Regularization

= P

∃f ∈ TβF : 2E|f(X) − Y |2 − E|m(X) − Y |2

− 1n

n∑

i=1

(|f(Xi) − Yi|2 − |m(Xi) − Yi|2)

> t+ E|f(X) − Y |2 − E|m(X) − Y |2

≤ P

∃f ∈ TβF : E|f(X) − Y |2 − E|m(X) − Y |2

− 1n

n∑

i=1

(|f(Xi) − Yi|2 − |m(Xi) − Yi|2)

>12

· (t

2+t

2+ E|f(X) − Y |2 − E|m(X) − Y |2)

≤ 14 supxn1

N1

( 12

t2

20β, TβF , xn

1 ))

exp(

− ( 12 )2(1 − 1

2 ) t2n

214(1 + 12 )β4

)

(by Theorem 11.4)

≤ 14 supxn1

N1

(1

80βn, TβF , xn

1

)exp

(− t · n

8 · 3 · 214β4

)

(because of t ≥ 1

n

)

≤ 14 · 3

(2e(2β)

180βn

log3e(2β)

180βn

)VTβF+

exp(

− t · n5136β4

)

(by Lemma 9.2 and Theorem 9.4)

≤ 42(480eβ2n)2VF+ exp(

− t · n5136β4

),

where we have used the relation VTβF+ ≤ VF+ .If one defines t for a given δ ∈ (0, 1) by

δ = 42(480eβ2n)2VF+ exp(

− tn

5136β4

),

then it follows that

tn

5136β4 = log42δ

+ 2VF+ log(480eβ2n)

Page 242: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

12.2. Definition of the Estimate 225

and thus

t = 5136β4 log(

42δ

)1n

+ 2 · 5136β4 log(480eβ2n)VF+

n

= 5136β4 log(

42δ

)1n

+(10272β4 log(480eβ2) + 10272β4 log n

) VF+

n

= c1log( 42

δ )n

+ (c2 + c3 log(n))VF+

n.

This implies (12.3) and therefore Lemma 12.1 is proved.

Now we apply (12.2) to a truncated version mn of the least squaresestimate mn, i.e., to mn(·) = mn(·, Dn) = Tβmn(·) where

mn(·) = mn(·, Dn) ∈ Fminimizes the empirical L2 error over F :

1n

n∑

i=1

|mn(Xi) − Yi|2 = minf∈F

1n

n∑

i=1

|f(Xi) − Yi|2.

It follows that one has, with probability greater than or equal to 1 − δ,∫

|mn(x) −m(x)|2µ(dx)

≤ c1log( 42

δ )n

+ (c2 + c3 log(n))VF+

n

+2n

n∑

i=1

|mn(Xi) − Yi|2 − |m(Xi) − Yi|2. (12.4)

The idea of the complexity regularization is to choose F in such a waythat the right-hand side of (12.4) is minimized, which is equivalent tominimizing

(c2 + c3 log(n))2

VF+

n+

1n

n∑

i=1

|mn(Xi) − Yi|2

(cf. Figure 12.1).This will be described in detail in the next section.

12.2 Definition of the Estimate

Let Pn be a finite set of parameters. For p ∈ Pn let Fn,p be a set of functionsf : Rd → R and let penn(p) ∈ R+ be a complexity penalty for Fn,p. Letmn,p be a truncated least squares estimate of m in Fn,p, i.e., choose

mn,p(·) = mn,p(·, Dn) ∈ Fn,p (12.5)

Page 243: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

226 12. Least Squares Estimates III: Complexity Regularization

p∗ p

penn(p)

empirical L2risk of mn,p

Figure 12.1. Constructing an estimate by minimizing the sum of L2 risk and thepenalty.

Fn,1penn(1)

Fn,2penn(2)

Fn,3penn(3)

Fn,4penn(4)

Figure 12.2. A sequence of nested classes Fn,1 ⊆ F,2n ⊆ Fn,3 ⊆ · · · of functionsfor which one will often choose an increasing sequence penn(1) < penn(2) <penn(3) < · · · of penalties.

which satisfies

1n

n∑

i=1

|mn,p(Xi) − Yi|2 = minf∈Fn,p

1n

n∑

i=1

|f(Xi) − Yi|2 (12.6)

and set

mn,p(·) = Tβnmn,p(·) (12.7)

for some βn ∈ R+.

Page 244: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

12.3. Asymptotic Results 227

Next choose an estimate mn,p∗ minimizing the sum of the empirical L2risk of mn,p∗ and penn(p∗), i.e., choose

p∗ = p∗(Dn) ∈ Pn (12.8)

such that

1n

n∑

i=1

|mn,p∗(Xi) − Yi|2 + penn(p∗)

= minp∈Pn

1n

n∑

i=1

|mn,p(Xi) − Yi|2 + penn(p)

(12.9)

and set

mn(·, Dn) = mn,p∗(Dn)(·, Dn). (12.10)

Here the penalty depends on the class of functions from which one choosesthe estimate. This is in contrast to the penalized least squares estimates (cf.Chapter 20) where the penalty depends on the smoothness of the estimate.

12.3 Asymptotic Results

Our main result concerning complexity regularization is the followingtheorem:

Theorem 12.1. Let 1 ≤ L < ∞, n ∈ N , and, L ≤ βn < ∞. Assume|Y | ≤ L a.s. Let the estimate mn be defined as above (cf. (12.5)–(12.10))with a penalty term penn(p) satisfying

penn(p) ≥ 2 · 2568β4

n

n

(log(120eβ4

nn)VF+n,p

+cp2

)(p ∈ Pn) (12.11)

for some cp ∈ R+ satisfying∑

p∈Pne−cp ≤ 1. Then one has

E∫

|mn(x) −m(x)|2µ(dx)

≤ 2 infp∈Pn

penn(p) + inf

f∈Fn,p

∫|f(x) −m(x)|2µ(dx)

+ 5 · 2568

β4n

n.

We know from Theorem 11.5 that for each fixed p ∈ Pn the estimatemn,p satisfies

E∫

|mn,p(x) −m(x)|2µ(dx)

≤ c1n

+c2 + c3 log(n)

n· VF+

n,p+ 2 inf

f∈Fn,p

∫|f(x) −m(x)|2µ(dx).

Page 245: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

228 12. Least Squares Estimates III: Complexity Regularization

According to (12.11) we can choose our penalty, in Theorem 12.1 above,up to the additional term β4

ncp/n of the form

c2 + c3 log(n)n

· VF+n,p.

The big advantage of the bound in Theorem 12.1 compared to Theorem11.5 is the additional infimum over p ∈ Pn, which implies that we can usethe data to choose the best value for p. The additional term β4

ncp/n is theprice which we have to pay for choosing p in this adaptive way.

Proof. We use a decomposition similar to that in the proof of Lemma12.1:

∫|mn(x) −m(x)|2µ(dx)

= E|mn(X) − Y |2|Dn − E|m(X) − Y |2

=E|mn(X) − Y |2|Dn − E|m(X) − Y |2

− 2n

n∑

i=1

(|mn(Xi) − Yi|2 − |m(Xi) − Yi|2) − 2penn(p∗)

+2n

n∑

i=1

(|mn(Xi) − Yi|2 − |m(Xi) − Yi|2) + 2penn(p∗)

=: T1,n + T2,n.

Because of (12.9), (12.10), |Yi| ≤ βn a.s., and (12.6), one has

T2,n

= 2 infp∈Pn

1n

n∑

i=1

|mn,p(Xi) − Yi|2 + penn(p)

− 2n

n∑

i=1

|m(Xi) − Yi|2

≤ 2 infp∈Pn

1n

n∑

i=1

|mn,p(Xi) − Yi|2 + penn(p)

− 2n

n∑

i=1

|m(Xi) − Yi|2

= 2 infp∈Pn

inff∈Fn,p

(1n

n∑

i=1

|f(Xi) − Yi|2)

+ penn(p)

− 2n

n∑

i=1

|m(Xi) − Yi|2

Page 246: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

12.3. Asymptotic Results 229

= 2 infp∈Pn

inff∈Fn,p

(1n

n∑

i=1

|f(Xi) − Yi|2 − 1n

n∑

i=1

|m(Xi) − Yi|2)

+ penn(p)

,

thus it follows

ET2,n

≤ 2 infp∈Pn

inff∈Fn,p

E

(1n

n∑

i=1

|f(Xi) − Yi|2 − 1n

n∑

i=1

|m(Xi) − Yi|2)

+ penn(p)

= 2 infp∈Pn

inf

f∈Fn,p

∫|f(x) −m(x)|2µ(dx) + penn(p)

.

Therefore the assertion follows from

ET1,n ≤ 5 · 2568β4

n

n, (12.12)

which we will show next.We mimic the proof of Lemma 12.1. Let t > 0 be arbitrary. Then

P T1,n > t

= P

2(E|mn(X) − Y |2|Dn − E|m(X) − Y |2

− 1n

n∑

i=1

(|mn(Xi) − Yi|2 − |m(Xi) − Yi|2))

> t+ 2penn(p∗) + E|mn(X) − Y |2|Dn − E|m(X) − Y |2

≤ P

∃p ∈ Pn∃f ∈ TβFn,p : E|f(X) − Y |2 − E|m(X) − Y |2

− 1n

n∑

i=1

(|f(Xi) − Yi|2 − |m(Xi) − Yi|2)

>12

· (t+ 2penn(p) + E|f(X) − Y |2 − E|m(X) − Y |2)

(because of (12.5) and (12.7))

Page 247: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

230 12. Least Squares Estimates III: Complexity Regularization

≤∑

p∈Pn

P

∃f ∈ TβFn,p : E|f(X) − Y |2 − E|m(X) − Y |2

− 1n

n∑

i=1

(|f(Xi) − Yi|2 − |m(Xi) − Yi|2)

>12

· (t+ 2penn(p) + E|f(X) − Y |2 − E|m(X) − Y |2)

≤∑

p∈Pn

14 supxn1

N1

( 12penn(p)

20βn, TβnFn,p, x

n1

)

· exp(

− ( 12 )2 1

2 (t+ penn(p))n214(1 + 1

2 )β4n

)

(by Theorem 11.4)

≤∑

p∈Pn

14 supxn1

N1

(1

20βnn, TβnFn,p, x

n1

)exp

(− (t+ penn(p))n

3 · 4 · 214β4n

)

(since penn(p) ≥ 2n

)

≤∑

p∈Pn

14 · 3

(2e(2βn)

120βnn

log3e(2βn)

120βnn

)VF+n,p

· exp(

−penn(p)n2568β4

n

)exp

(− t · n

2568β4n

)

(by Lemma 9.2 and Theorem 9.4)

p∈Pn

42(120eβ2nn)

2VF+n,p exp

(−penn(p)n

2568β4n

)

exp

(− t · n

2568β4n

).

Using (12.11) one gets

. . . ≤∑

p∈Pn

42(120eβ2nn)

2VF+n,p exp

(−2VF+

n,plog(120eβ2

nn) − cp

)

=∑

p∈Pn

42 exp(−cp) ≤ 42,

thus

PT1,n > t ≤ 42 exp(

− t · n2568β4

n

).

Page 248: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

12.3. Asymptotic Results 231

For arbitrary u > 0 it follows

ET1,n ≤∫ ∞

0PT1,n > tdt ≤ u+

∫ ∞

u

PT1,n > tdt

≤ u+∫ ∞

u

42 exp(

− t · n2568β4

n

)dt

= u+ 42 · 2568β4n · 1

n· exp

(− u · n

2568β4n

).

Setting u = 2568 · log(42) · β4n

n one gets

ET1,n ≤ 2568(1 + log(42))β4

n

n≤ 5 · 2568

β4n

n

and thus (12.12) (and also the assertion of Theorem 12.1) is proved.

Remark. In order to choose the penalties penn(p) such that (12.11) is sat-isfied, one needs an upper bound on the VC dimension of F+

n,p. Sometimesit is much easier to get bounds on covering numbers like

N1(ε, TβnFn,p, xn1 ) ≤ N1(ε, TβnFn,p) (12.13)

for all ε > 0, xn1 ∈ (Rd)n rather than on VC dimension. In this case, it is

possible to replace (12.11) by

penn(p) ≥ 2568β4

n

n·(

log(

N1

(1n, Tβn

Fn,p

))+ cp

)(p ∈ Pn). (12.14)

To show that Theorem 12.1 also holds if (12.11) is replaced by (12.13)and (12.14), one observes that

P T1,n > t

≤∑

p∈Pn

14 supxn1

N1

( 12penn(p)

20βn, Tβn

Fn,p, xn1

)exp

(− (t+ penn(p))n

2568β4n

)

(as in the proof of Theorem 12.1)

p∈Pn

14 · N1

(1n, Tβn

Fn,p

)exp

(−penn(p)n

2568β4n

)

exp

(− tn

2568β4n

)

(because of penn(p) ≥ 40βn

nand (12.13)

)

p∈Pn

14 exp(−cp)

exp

(− tn

2568β4n

)(because of (12.14))

Page 249: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

232 12. Least Squares Estimates III: Complexity Regularization

≤ 14 exp(

− tn

2568β4n

).

From this one obtains the assertion as in the proof of Theorem 12.1.

12.4 Piecewise Polynomial Partitioning Estimates

Let

Pn = (M,K) ∈ N0 × N : 0 ≤ M ≤ log n, 1 ≤ K ≤ n .For (M,K) ∈ Pn let Fn,(M,K) be the set of all piecewise polynomials ofdegree M (or less) w.r.t. an equidistant partition of [0, 1] into K intervals.For the penalty we use an upper bound on the right-hand side of (12.11)given by

penn((M,K)) = log(n)2 · K(M + 1)n

.

We get the following result which shows that, by applying complexityregularization to piecewise polynomial partitioning estimates, we get (upto a logarithmic factor) the optimal rate of convergence and can at thesame time adapt to the unknown smoothness.

Corollary 12.1. Let 1 ≤ L < ∞ and set βn = L (n ∈ N ). Let Pn,Fn,(M,K), and penn((M,K)) be given as above and define the estimate by(12.5)–(12.10). Then, for n sufficiently large,

E∫

|mn(x) −m(x)|2µ(dx)

≤ min(M,K)∈Pn

2 log(n)2K(M + 1)

n

+2 inff∈Fn,(M,K)

∫|f(x) −m(x)|2µ(dx)

+ 5 · 2568L4

n(12.15)

for every distribution of (X,Y ) such that |Y | ≤ L a.s. In particular, ifX ∈ [0, 1] a.s., |Y | ≤ L a.s., and m is (p, C)-smooth for some C > 0,p = q + r, q ∈ N0, r ∈ (0, 1], then for n sufficiently large

E∫

|mn(x) −m(x)|2µ(dx)

≤ c · C 22p+1 ·

(log(n)2

n

) 2p2p+1

(12.16)

for some constant c depending only on L and p.

Proof. Set

c(M,K) = log(|Pn|) = log (n(log(n) + 1)) ((M,K) ∈ Pn) .

Page 250: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

12.5. Bibliographic Notes 233

Since Fn,(M,K) is a linear vector space of dimension K(M + 1), we get, byTheorem 9.5,

VF+n,(M,K)

≤ K(M + 1) + 1.

It follows for n sufficiently large

2 · 2568β4

n

n

(log(120eβ4

nn) · VF+n,(M,K)

+c(M,K)

2

)

≤ 2 · 2568L4

n

(log(120eL4n) · (K(M + 1) + 1) +

c(M,K)

2

)

≤ log(n)2 · K(M + 1)n

= penn(M,K).

Hence (12.15) follows from Theorem 12.1. The proof of (12.16), which fol-lows from (12.15) and Lemma 11.1, is left to the reader (see Problem 12.1).

Let us compare Corollary 12.1 to Corollary 11.2. In Corollary 11.2 wehave weaker assumptions on Y (Y need not to be bounded, it is only as-sumed that the regression function is bounded) but the estimate used theredepends on the (usually unknown) smoothness of the regression function.This is in contrast to Corollary 12.1, where the estimate does not dependon the smoothness (measured by p and C) and where we derive, up to alogarithmic factor, the same rate of convergence. According to the resultsfrom Chapter 3 the derived bounds on the L2 error are, in both corollaries,optimal up to some logarithmic factor.

12.5 Bibliographic Notes

The complexity regularization principle for the learning problem was intro-duced by Vapnik and Chervonenkis (1974) and Vapnik (1982) in patternrecognition as structural risk minimization. It was applied to regression esti-mation in Barron (1991) and was further investigated in Barron, Birge, andMassart (1999), Krzyzak and Linder (1998), and Kohler (1998). Lugosi andNobel (1999) investigate complexity regularization with a data-dependentpenalty.

Our complexity regularization criterion is closely related to the classicalCp criterion of Mallows (1973). Results concerning this criterion and relatedpenalized criteria can be found, e.g., in Akaike (1974), Sjibata (1976; 1981),Li (1986; 1987), Polyak and Tsybakov (1990), and Braud, Comte, andViennet (2001).

Page 251: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

234 12. Least Squares Estimates III: Complexity Regularization

Problems and Exercises

Problem 12.1. Show that (12.15) implies (12.16).

Problem 12.2. In the following we describe a modification of the estimate inCorollary 12.1 which is weakly and strongly consistent (cf. Problem 12.4). Set

Pn =(M, K) ∈ N0 × N : 0 ≤ M ≤ log(n), log(n)2 ≤ K ≤ n1−δ

,

where 0 < δ < 1/2. For (M, K) ∈ Pn let Fn,(M,K) be the set of all piecewise poly-nomials of degree M (or less) w.r.t. an equidistant partition of [− log(n), log(n)]into K cells, and set βn =

√log(n) and penn((M, K)) = log(n)4 K(M−1)

n. Define

the estimate mn by (12.5)–(12.10). Show that for n sufficiently large

E∫

|mn(x) − m(x)|2µ(dx)

≤ min(M,K)∈Pn

2 log(n)4

K(M + 1)n

+ 2 inff∈Fn,(M,K)

∫|f(x) − m(x)|2µ(dx)

+5 · 2568log(n)2

n

for every distribution of (X, Y ) such that |Y | is bounded a.s.

Problem 12.3. Let mn be defined as in Problem 12.2. Show that for everydistribution of (X, Y ) which satisfies |X| bounded a.s., |Y | bounded a.s., andm (p, C)-smooth one has, for n sufficiently large,

E

∫|mn(x) − m(x)|2µ(dx)

≤ c · C

22p+1 ·

(log(n)5

n

) 2p2p+1

.

Problem 12.4. Show that the estimate defined in Problem 12.2 is weakly andstrongly universally consistent.

Page 252: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

13Consistency of Data-DependentPartitioning Estimates

In this chapter we describe the so-called data-dependent partitioning es-timates. We will first prove a general consistency theorem, then we willintroduce several data-dependent partitioning estimates and prove theiruniversal consistency by checking the conditions of the general consistencytheorem.

13.1 A General Consistency Theorem

Initially we consider partitioning regression estimates based on data-dependent partitioning. These estimates use the data twice: First, apartition Pn = Pn(Dn) of Rd is chosen according to the data, and thenthis partition and the data are used to define an estimate mn(x) of m(x)by averaging those Yi for which Xi and x belong to the same cell of thepartition, i.e., the estimate is defined by

mn(x) =∑n

i=1 YiIXi∈An(x)∑ni=1 IXi∈An(x)

. (13.1)

Here An(x) = An(x,Dn) denotes the cell A ∈ Pn(Dn) which contains x.As usual we have used the convention 0

0 = 0. In (13.1) and in the followingwe suppress, in notation, the dependence of An(x) on Dn.

Page 253: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

236 13. Consistency of Data-Dependent Partitioning Estimates

It turns out that the estimate mn is a least squares estimate. Indeed,define, for a given set G of functions g : Rd → R and a partition P of Rd,

G P =

f : Rd → R : f =∑

A∈PgA · IA for some gA ∈ G (A ∈ P)

.

Each function in G P is obtained by applying a different function of Gin each set of the partition P. Let Gc be the set of all constant functions.Then the estimate defined by (13.1) satisfies

mn(·, Dn) ∈ GcPn and1n

n∑

i=1

|mn(Xi)−Yi|2 = minf∈GcPn

1n

n∑

i=1

|f(Xi)−Yi|2

(13.2)(cf. Problem 2.3). Therefore we can appply the results of Chapter 10 toshow the consistency of the partitioning regression estimates based on data-dependent partitioning. These results require an additional truncation ofthe estimate: Let βn ∈ R+ with βn → ∞ (n → ∞) and define

mn(x) = Tβn(mn(x)). (13.3)

We will need the following definition:

Definition 13.1. Let Π be a family of partitions of Rd. For a set xn1 =

x1, . . . , xn ⊆ Rd let ∆(xn1 ,Π) be the number of distinct partitions of xn

1induced by elements of Π, i.e., ∆(xn

1 ,Π) is the number of different partitionsxn

1 ∩ A : A ∈ P of xn1 for P ∈ Π. The partitioning number ∆n(Π)

is defined by

∆n(Π) = max∆(xn1 ,Π) : x1, . . . , xn ∈ Rd.

The partitioning number is the maximum number of different partitionsof any n point set, that can be induced by members of Π.

Example 13.1. Let Πk be the family of all partitions of R into k non-empty intervals. A partition induced by an element of Πk on a set xn

1 withx1 < · · · < xn is determined by natural numbers 0 ≤ i1 ≤ · · · ≤ ik−1 ≤ n,where the k − 1-tuple (i1, . . . , ik−1) stands for the partition

x1, . . . , xi1, xi1+1, . . . , xi2, . . . , xik−1+1, . . . , xn.There are

(n+1+(k−1)−1

k−1

)=

(n+k−1

n

)such tuples of numbers, thus for any

xn1 one gets

∆(xn1 ,Πk) =

(n+ k − 1

n

)and ∆n(Πk) =

(n+ k − 1

n

). (13.4)

Let Π be a family of finite partitions of Rd. We will denote the maximalnumber of sets contained in a partition P ∈ Π by M(Π), i.e., we will define

M(Π) = max|P| : P ∈ Π.

Page 254: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

13.1. A General Consistency Theorem 237

Set

Πn =Pn ((x1, y1), . . . , (xn, yn)) : (x1, y1), . . . , (xn, yn) ∈ Rd × R

.(13.5)

Πn is a family of partitions which contains all data-dependent partitionsPn(Dn). The next theorems describes general conditions which imply theconsistency of a data-dependent partitioning estimate. There it is first re-quired that the set of partitions, from which the data-dependent partitionis chosen, is not too “complex,” i.e., that the maximal number of cellsin a partition, and the logarithm of the partitioning number, are smallcompared to the sample size (cf. (13.7) and (13.8)). Second, it is requiredthat the diameter of the cells of the data-dependent partition (denoted bydiam(A)) converge in some sense to zero (cf. (13.10)).

Theorem 13.1. Let mn be defined by (13.1) and (13.3) and let Πn bedefined by (13.5). Assume that

βn → ∞ (n → ∞), (13.6)

M(Πn) · β4n · log(βn)n

→ 0 (n → ∞), (13.7)

log (∆n(Πn)) · β4n

n→ 0 (n → ∞), (13.8)

β4n

n1−δ→ 0 (n → ∞) (13.9)

for some δ > 0 and

infS:S⊆Rd,µ(S)≥1−δ

µ (x : diam(An(x) ∩ S) > γ) → 0 (n → ∞) a.s.

(13.10)

for all γ > 0, δ ∈ (0, 1). Then∫

|mn(x) −m(x)|2µ(dx) → 0 (n → ∞) a.s.

We will use Theorem 10.2 to prove Theorem 13.1. We need to show

inff∈Tβn GcPn

∫|f(x) −m(x)|2µ(dx) → 0 (n → ∞) a.s. (13.11)

and

supf∈Tβn GcPn

∣∣∣∣∣1n

n∑

i=1

|f(Xi) − Yi,L|2 − E|f(X) − YL|2∣∣∣∣∣→ 0 (n → ∞) a.s.

(13.12)for all L > 0. (Here we have observed that since the functions in Gc Pn

are piecewise constant, TβnGc Pn consists of all those functions in Gc Pn

which are bounded in absolute value by βn.)

Page 255: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

238 13. Consistency of Data-Dependent Partitioning Estimates

We will use (13.10) to show (13.11). Then we will bound the left–handside of (13.12) by

supf∈Tβn GcΠn

∣∣∣∣∣1n

n∑

i=1

|f(Xi) − Yi,L|2 − E|f(X) − YL|2∣∣∣∣∣, (13.13)

where

Gc Πn =⋃

P∈Πn

Gc P

and apply Theorem 9.1 to (13.13). To bound the resulting covering numberswe will use the following lemma:

Lemma 13.1. Let 1 ≤ p < ∞. Let Π be a family of partitions of Rd

and let G be a class of functions g : Rd → R. Then one has, for eachx1, . . . , xn ∈ Rd and for each ε > 0,

Np (ε,G Π, xn1 ) ≤ ∆(xn

1 ,Π)

supz1,...,zm∈xn

1 ,m≤nNp(ε,G, zm

1 )

M(Π)

.

Proof. We will use the abbreviation

N = supz1,...,zm∈xn

1 ,m≤nNp(ε,G, zm

1 ).

Fix x1, . . . , xn ∈ Rd and ε > 0. Let P = Aj : j ∈ Π be arbitrary. ThenP induces a partition of xn

1 consisting of sets Bj = x1, . . . , xn ∩ Aj . Foreach j choose an ε-cover of size not greater than N of G on Bj , i.e., choosea set GBj

of functions g : Rd → R such that for each function g ∈ G thereexists a function g ∈ GBj which satisfies

1nj

x∈Bj

|g(x) − g(x)|p < εp, (13.14)

where nj = |Bj |.Now let f ∈ G Π be such that f =

∑A∈P fAIA for some partition

P ∈ Π which induces the same partition on xn1 as P. Then it follows from

(13.14) that for each A ∈ P there exists some gBj∈ GBj

(with j satisfyingA ∩ xn

1 = Aj ∩ xn1 = Bj) such that

1nj

x∈Bj

|fA(x) − gBj(x)|p < εp, (13.15)

and for f =∑

A∈P gA∩xn1IA∩xn

1we get

1n

n∑

i=1

|f(xi) − f(xi)|p =1n

j

x∈Bj

|f(x) − f(x)|p

Page 256: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

13.1. A General Consistency Theorem 239

(13.15)<

1n

j

nj · εp = εp.

Thus for each P ∈ Π there is an ε-cover of size not greater than NM(Π)

for the set of all f ∈ G Π defined for a partition which induces the samepartition on xn

1 as P. As there are at most ∆(xn1 ,Π) distinct partitions on

xn1 induced by members of Π, the assertion follows.

Proof of Theorem 13.1. Because of Theorem 10.2 it suffices to show(13.11) and (13.12).

Proof of (13.11). m can be approximated arbitrarily closely in L2(µ) byfunctions of C∞

0 (Rd) (cf. Corollary A.1). Hence it suffices to prove (13.11)for functions m ∈ C∞

0 (Rd). Because of (13.6) we may further assume‖m‖∞ ≤ βn.

Let ε > 0 and δ ∈ (0, 1). For S ⊆ Rd and given data Dn define fS ∈TβnGc Pn by

fS =∑

A∈P(Dn)

m(zA) · IA∩S

for some fixed zA ∈ A which satisfies zA ∈ A ∩ S if A ∩ S = ∅ (A ∈ Pn).Choose γ > 0 such that |m(x) − m(z)| < ε for all ‖x − z‖ < γ. Then itfollows that, for z ∈ S,

|fS(z) −m(z)|2 < ε2Idiam(An(z)∩S)<γ + 4‖m‖2∞Idiam(An(z)∩S)≥γ.

Using this one gets

inff∈Tβn GcPn

∫|f(x) −m(x)|2µ(dx)

≤ infS:µ(S)≥1−δ

∫|fS(x) −m(x)|2µ(dx)

≤ infS:µ(S)≥1−δ

S

|fS(x) −m(x)|2µ(dx) + 4‖m‖2∞µ(Rd \ S)

≤ infS:µ(S)≥1−δ

S

|fS(x) −m(x)|2µ(dx) + 4‖m‖2∞δ

≤ infS:µ(S)≥1−δ

S

(ε2Idiam(An(z)∩S)<γ

+4‖m‖2∞Idiam(An(z)∩S)≥γ

)µ(dx)

+4‖m‖2∞δ

Page 257: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

240 13. Consistency of Data-Dependent Partitioning Estimates

≤ ε2 + 4‖m‖2∞ inf

S:µ(S)≥1−δµ(x ∈ Rd : diam(An(x) ∩ S) ≥ γ

)

+4‖m‖2∞δ

(13.10)→ ε2 + 4‖m‖2∞δ (n → ∞) a.s.

Because ε > 0, δ ∈ (0, 1) are arbitrary, the assertion follows.

Proof of (13.12). Because of TβnGc Pn ⊆ Tβn

Gc Πn it suffices to prove(13.12) with Tβn

Gc Pn replaced by TβnGc Πn. It follows from Theorem

9.1 and Problem 10.4 that, for 0 ≤ L ≤ βn,

P

supf∈Tβn GcΠn

∣∣∣∣∣1n

n∑

i=1

|f(Xi) − Yi,L|2 − E|f(X) − YL|2∣∣∣∣∣> ε

≤ 8EN1

32βn, TβnGc Πn, X

n1

)exp

(− nε2

128 · (4β2n)2

).

Using Lemma 13.1 and Theorem 9.4 one gets

N1

32βn, Tβn

Gc Πn, Xn1

)

≤ ∆n(Πn)

supz1,...,zm∈X1,...Xn,m≤n

N1

32βn, Tβn

Gc, zm1

)M(Πn)

≤ ∆n(Πn)

3

(3e(2βn)

ε32βn

)2VTβn

G+c

M(Πn)

≤ ∆n(Πn)

333eβ2n

ε

2M(Πn)

because

VTβn G+c

≤ VG+c

≤ 1.

Thus

P

supf∈Tβn GcΠn

∣∣∣∣∣1n

n∑

i=1

|f(Xi) − Yi,L|2 − E|f(X) − YL|2∣∣∣∣∣> ε

≤ 8∆n(Πn)

333eβ2n

ε

2M(Πn)

exp(

− nε2

2048β4n

)

≤ 8 exp(

log (∆n(Πn)) + 2M(Πn) log333eβ2

n

ε− nε2

2048β4n

)

Page 258: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

13.2. Cubic Partitions with Data-Dependent Grid Size 241

= 8 exp

(

− n

β4n

(ε2

2048− log (∆n(Πn))β4

n

n− 2M(Πn)β4

n log 333eβ2n

ε

n

))

and the assertion follows by an easy application of the Borel–Cantellilemma.

Let Pn be a finite data-dependent partition and mn(x) the correspond-ing partitioning estimate without truncation. Assume that there exists apositive integer kn such that, for all A ∈ Pn,

µn(A) ≥ kn · log(n)n

.

Then

kn → ∞ (n → ∞)

and

limn→∞

diam(An(X)) = 0

in probability imply that mn is weakly universally consistent (Breiman etal. (1984)).

13.2 Cubic Partitions with Data-Dependent GridSize

As a first example for a data-dependent partitioning estimate we considera partitioning estimate based on a cubic partition with a data-dependentgrid size. The partition is determined by a data-independent rectan-gle [Ln, Rn)d, which we partition in equidistant cubic cells, i.e., we usepartitions

Pk =Rd \ [Ln, Rn)d

[Ln + i1hk, Ln + (i1 + 1)hk) × · · · × [Ln + idhk, Ln + (id + 1)hk) :

i1, . . . , id ∈ 0, . . . , k − 1,

where hk = Rn−Ln

k is the grid size of the partition. This grid size is cho-sen in a data-dependent manner by choosing a random K which satisfiesKmin(n) ≤ K ≤ Kmax(n), where Kmin(n),Kmax(n) ∈ N depend only onthe size of the sample. We will use Theorem 13.1 to show

Page 259: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

242 13. Consistency of Data-Dependent Partitioning Estimates

Theorem 13.2. Assume

Ln → −∞, Rn → ∞ (n → ∞), (13.16)

Rn − Ln

Kmin(n)→ 0 (n → ∞), (13.17)

(Kmax(n)d + log(n)) · β4n log(βn)

n→ 0 (n → ∞), (13.18)

βn → ∞ (n → ∞), (13.19)

andβ4

n

n1−δ→ 0 (n → ∞) (13.20)

for some δ > 0. Then any estimate mn defined by (13.1) and (13.3) withPn(Dn) = PK for some random K = K(Dn) satisfying

Kmin(n) ≤ K ≤ Kmax(n) (13.21)

is strongly universally consistent.

From this theorem we can conclude that a truncated partitioning es-timate which uses a deterministic cubic partition is strongly universallyconsistent. We will later see (cf. Chapter 23) that suitably defined de-terministic partitioning estimates are strongly universally consistent evenwithout truncation, but the proof there will be much more involved.

One great advantage of Theorem 13.2 is that it is valid for any data-dependent choice of K. So if one restricts the range of K as in (13.21),then one can apply, e.g., splitting of the sample (cf. Chapter 7) or cross-validation (cf. Chapter 8) to choose K and gets automatically a consistentestimate. Due to the fact that K may be chosen arbitrarily from the rangeof (13.21) (in particular, the worst value might be chosen), the conditionsthere can be improved at most by some logarithmic factor. But, on theother hand, for particular rules for choosing a data-dependent K they canbe relaxed, e.g., the restriction Kmin(n) ≤ K should not be necessary toprove consistency for a reasonable data-dependent choice of K.

Proof. It suffices to check the conditions (13.6)–(13.10) for

Πn = (Pk)Kmin(n)≤k≤Kmax(n).

Clearly,

M(Πn) = Kmax(n)d + 1. (13.22)

To determine ∆n(Πn) fix xn1 ∈ Rd·n. The partition, which is induced by

Pk on xn1 , is uniquely determined by ak = (a1,k, . . . , an,k) where

al,k ∈ 0 ∪ 0, . . . ,Kmax(n) − 1d

Page 260: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

13.3. Statistically Equivalent Blocks 243

is defined by

al,k =

0 if xl ∈ Rd \ [Ln, Rn)d,(i1, . . . , id) if xl ∈ [Ln + i1hk, Ln + (i1 + 1)hk) × · · ·

×[Ln + idhk, Ln + (id + 1)hk).

If k1 < k2 then hk1 > hk2 and thus if al,k1 = (i1, . . . , id) and al,k2 =(j1, . . . , jd) then one has i1 ≤ j1, . . . , id ≤ jd. Therefore if k runs from 1up to Kmax(n) then each component of ak can change at most Kmax(n)d

times, and it follows that ∆(xn1 ,Πn) ≤ nKmax(n)d. This implies

∆n(Πn) ≤ nKmax(n)d. (13.23)

Now (13.6)–(13.9) follow easily from (13.18)–(13.23).To show (13.10), let γ > 0 and δ ∈ (0, 1) be arbitrary. Because of (13.16)

and (13.17) one has, for sufficiently large n,

µ([Ln, Rn)d

) ≥ 1 − δ and d · Rn − Ln

Kmin(n)≤ γ.

Then for sufficiently large n,

infS:µ(S)≥1−δ

µ (x : diam(An(x) ∩ S) > γ)

≤ µ(x : diam(An(x) ∩ [Ln, Rn)d) > γ

)

≤ µ

(x : d

Rn − Ln

Kmin(n)> γ

)= µ(∅) = 0.

This implies (13.10), and the assertion follows from Theorem 13.1.

13.3 Statistically Equivalent Blocks

A partition is based on statistically equivalent blocks if each set of thepartition contains the same number of data points.

In the sequel we only consider the case of univariate X (see Problem13.1 for multivariate examples). For simplicity we assume that X has adensity with respect to the Lebesgue measure. This implies that with prob-ability one X1, . . . , Xn are all distinct (this is the only point where weneed the density assumption). Let X(1), . . . , X(n) be the order statistics ofX1, . . . , Xn, i.e., X(1) < · · · < X(n) and X(1), . . . , X(n) = X1, . . . , Xnwith probability one.

For univariate X, partitioning based on statistically equivalent blocksreduces to the so-called k-spacing: Let kn ∈ N and choose a partitionPn(Dn) = A1, . . . , AN of R consisting of N = n

kn intervals such that

each interval except the rightmost interval contains exactly kn of the Xi.So A1, . . . , Am are intervals such that

X((j−1)kn+1), . . . , X(jkn) ∈ Aj (j = 1, . . . , N − 1)

Page 261: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

244 13. Consistency of Data-Dependent Partitioning Estimates

and

X((N−1)kn+1), . . . , X(n) ∈ AN .

The exact position of the end points of the intervals is not important.We next show strong consistency of partitioning estimates using such

partitions.

Theorem 13.3. Assume

βn → ∞ (n → ∞), (13.24)

kn

n→ 0 (n → ∞), (13.25)

β4n log(n)kn

→ 0 (n → ∞), (13.26)

andβ4

n

n1−δ→ 0 (n → ∞) (13.27)

for some δ > 0. Then any estimate mn defined by (13.1) and (13.3) withPn(Dn) defined via kn-spacing is strongly consistent for every distributionof (X,Y ) where X is an univariate random variable having a density andEY 2 < ∞.

Proof. Let Πn be the family of all partitions consisting of nkn

intervals.We will check the conditions of Theorem 13.1 for Πn.

Clearly, M(Πn) = nkn

. Furthermore, by (13.4),

∆n(Πn) =(n+ n

kn − 1

n

)≤ (n+ n

kn) n

kn ≤ (2n) n

kn.

Thus (13.6)–(13.9) are implied by (13.24), (13.26), and (13.27).To prove (13.10), fix γ > 0 and δ ∈ (0, 1). Choose L so large that

µ([−L,L]) ≥ 1 − δ. Observe that no more than 2 + 2Lγ of the intervals

A1, . . . , AN can satisfy diam(Ai) > γ and Ai ∩ [−L,L] = ∅. Hence

infS:µ(S)≥1−δ

µ (x : diam(An(x) ∩ S) > γ)

≤ µ (x : diam(An(x) ∩ [−L,L]) > γ)

≤∑

i:diam(Ai)>γ,Ai∩[−L,L]=∅(µ(Ai) − µn(Ai) + µn(Ai))

≤(

supA∈P,P∈Πn

|µ(A) − µn(A)| +kn

n

)(2 +

⌈2Lγ

⌉)

→ 0 (n → ∞) a.s.,

Page 262: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

13.4. Nearest Neighbor Clustering 245

where the first term in the parentheses above converges to zero by an obvi-ous extension of the classical Glivenko–Cantelli theorem, while the secondterm converges to zero because of (13.25).

The concept of statistically equivalent blocks can be extended to Rd

as follows (the so-called Gessaman rule): For fixed sample size n setM = (n/kn)1/d. According to the first coordinate axis, partition thedata into M sets such that the first coordinates form statistically equiva-lent blocks. We obtain M cylindrical sets. In the same fashion, cut each ofthese cylindrical sets along the second axis into M statistically equivalentblocks. Continuing in the same way along the remaining coordinate axes,we obtain Md rectangular cells, each of which (with the exception of thoseon the boundary) contains kn points (see Figure 4.6). The proof of theconsistency of (truncated) partitioning estimates using such a partition isleft to the reader (cf. Problem 13.1).

13.4 Nearest Neighbor Clustering

A clustering scheme is a function C : Rd → C, where C = c1, . . . , ck ⊆Rd is a finite set of vectors called cluster centers. Each clustering schemeC is associated with a partition PC = A1, . . . , Ak of Rd having cells Aj =x : C(x) = cj. A clustering scheme C is called a nearest neighborclustering scheme (NN-clustering scheme) if, for each x ∈ Rd,

‖x− C(x)‖ = mincj∈C

‖x− cj‖. (13.28)

An example of an NN-clustering scheme in R2 is given in Figure 13.1.Given only a set C, one can use (13.28) to define a nearest neighbor

clustering scheme C : Rd → C uniquely if one has an appropriate tie-breaking strategy. In the following we will use the tie-breaking strategy,which defines C(x) such that the index j of cj is minimal in (13.28).

We will choose the cluster centers of a clustering scheme by minimizingan empirical risk. Let E‖X‖2 < ∞ and define the risk of a clustering schemeC by

R(C) = E‖X − C(X)‖2 =∫

|x− C(x)|2µ(dx), (13.29)

i.e., R(C) is the expected squared distance of X to the closest cluster centerof C. Similarly, the empirical risk Rn(C) of a cluster scheme is defined by

Rn(C) =1n

n∑

i=1

‖Xi − C(Xi)‖2 =1n

n∑

i=1

minj=1,...,k

‖Xi − cj‖2. (13.30)

Page 263: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

246 13. Consistency of Data-Dependent Partitioning Estimates

c1

c2

c3

c4

c5

c6

c7

Figure 13.1. An NN-clustering scheme in R2.

We will consider nearest neighbor clustering schemes which minimize theempirical risk, i.e., which satisfy

Rn(Cn) = minC NN−clustering scheme,|C(Rd)|≤kn

Rn(C). (13.31)

Next we show that such an empirical optimal clustering scheme alwaysexists. Fix X1, . . . , Xn, let Cn = c1, . . . , ckn and Cn = c1, . . . , ckn betwo sets of cluster centers and denote the corresponding clustering schemesby C and C. Then

|Rn(C) −Rn(C)|

=

∣∣∣∣∣1n

n∑

i=1

(min

j=1,...,kn

‖Xi − cj‖2 − minj=1,...,kn

‖Xi − cj‖2)∣∣∣∣∣

Page 264: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

13.4. Nearest Neighbor Clustering 247

≤ 1n

n∑

i=1

maxj=1,...,n

(‖Xi − cj‖ + ‖Xi − cj‖) · ‖cj − cj‖

≤ maxi,j

(‖Xi − cj‖ + ‖Xi − cj‖) · maxj

‖cj − cj‖

and therefore the empirical risk is (on any bounded set of cluster centers)a continuous function of the cluster centers. Because of

infC:‖cj‖>L for some j

1n

n∑

i=1

minj=1,...,kn

‖Xi − cj‖2 → ∞ (L → ∞)

the infimal empirical risk will be obtained in a compact set and, becauseof the continuity, the infimal risk will really be obtained for some set ofcluster centers contained in this compact set.

Using such an empirical optimal nearest neighbor clustering scheme toproduce the partition of a data-dependent partitioning estimate, one getsconsistent regression estimates.

Theorem 13.4. Assume

βn → ∞ (n → ∞), (13.32)

kn → ∞ (n → ∞), (13.33)

k2nβ

4n log(n)n

→ 0 (n → ∞), (13.34)

and

β4n

n1−δ→ 0 (n → ∞) (13.35)

for some δ > 0. Let mn be defined by (13.1) and (13.3) for some data-dependent partition Pn(Dn) = PC , where C is a kn-nearest neighborclustering scheme which minimizes the empirical risk. Then mn is stronglyconsistent for every distribution of (X,Y ) with E‖X‖2 < ∞ and EY 2 < ∞.

Remark. If one uses for the construction of the kn-NN clustering schemeonly those Xi which are contained in a data-independent rectangle[Ln, Rn]d (where Ln → −∞ and Rn → ∞ not too fast for n → ∞),then the resulting estimate is strongly universally consistent (i.e., then onecan avoid the condition E‖X‖2 < ∞ in the above theorem). The detailsare left to the reader (cf. Problem 13.2).

We will prove Theorem 13.4 by checking the conditions of Theorem 13.1.We will use the following lemma to show the shrinking of the cells:

Lemma 13.2. For each n ∈ N let Cn : Rd → Cn minimize the empiricalrisk Rn(Cn) over all nearest neighbor clustering schemes having kn cluster

Page 265: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

248 13. Consistency of Data-Dependent Partitioning Estimates

centers. If E‖X‖2 < ∞ and kn → ∞ (n → ∞), then one has, for eachL > 0,

maxu∈supp(µ)∩[−L,L]d

minc∈Cn

‖u− c‖ → 0 (n → ∞) a.s. (13.36)

Proof.

Step 1. We will first show

Rn(Cn) → 0 (n → ∞) a.s. (13.37)

To show this, let u1, u2, . . . be a countable dense subset of Rd with u1 = 0.Let L > 0 be arbitrary. Then

Rn(Cn) ≤ 1n

n∑

i=1

minj=1,...,kn

‖Xi − uj‖2

≤ 1n

i=1,...,n,

Xi∈[−L,L]d

minj=1,...,kn

‖Xi − uj‖2

+1n

i=1,...,n,

Xi∈Rd\[−L,L]d

minj=1,...,kn

‖Xi − uj‖2

≤ maxx∈[−L,L]d

minj=1,...,kn

‖x− uj‖2 +1n

i=1,...,n,

Xi∈Rd\[−L,L]d

‖Xi‖2

(because of u1 = 0)

→ 0 + E‖X‖2IX∈Rd\[−L,L]d

(n → ∞) a.s.

Because of E‖X‖2 < ∞ one gets, for L → ∞,

E‖X‖2IX∈Rd\[−L,L]d

→ 0 (L → ∞),

which implies (13.37).

Step 2. Next we will show

P

lim infn→∞

µn (Su,δ) > 0 for every u ∈ supp(µ), δ > 0

= 1. (13.38)

Let v1, v2, . . . be a countable dense subset of supp(µ). By definition ofsupp(µ) one has

µ(Su,δ) > 0 for every u ∈ supp(µ), δ > 0.

The strong law of large numbers implies, for every i, k ∈ N ,

limn→∞

µn

(Svi,1/k

)= µ

(Svi,1/k

)> 0 a.s.,

Page 266: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

13.4. Nearest Neighbor Clustering 249

from which it follows that, with probability one,

limn→∞

µn

(Svi,1/k

)> 0 for every i, k ∈ N . (13.39)

Fix u ∈ supp(µ) and δ > 0. Then there exist i, k with 1k <

δ2 and u ∈ Svi,1/k,

which imply Su,δ ⊇ Svi,1/k. Hence

lim infn→∞

µn (Su,δ) ≥ lim infn→∞

µn

(Svi,1/k

).

This together with (13.39) proves (13.38).

Step 3. Now suppose that (13.36) doesn’t hold. Then there exist L > 0and δ > 0 such that the event

lim sup

n→∞max

u∈supp(µ)∩[−L,L]dminc∈Cn

‖u− c‖ > δ

has probability greater than zero. On this event there exists a (random)sequence nkk∈N and (random) unk

∈ supp(µ) ∩ [−L,L]d such that

minc∈Cnk

‖unk− c‖ > δ

2(k ∈ N ). (13.40)

Because supp(µ)∩[−L,L]d is compact one can assume w.l.o.g. (by replacingnkk∈N by a properly defined subsequence) unk

→ u∗ (n → ∞) for some(random) u∗ ∈ supp(µ).

This implies that on this event (and therefore with probability greaterthan zero) one has

Rnk(Cnk

) ≥ 1nk

i=1,...,nk;Xi∈Su∗,δ/8

minc∈Cnk

‖Xi − c‖2

≥ 1nk

i=1,...,nk;Xi∈Su∗,δ/8

minc∈Cnk

(‖unk− c‖ − ‖unk

−Xi‖)2

(13.40)≥ 1

nk

i=1,...,nk;Xi∈Su∗,δ/8

Iunk∈Su∗,δ/8

2−(δ

8+δ

8

))2

=1nk

i=1,...,nk;Xi∈Su∗,δ/8

Iunk∈Su∗,δ/8

δ2

16

= Iunk∈Su∗,δ/8 · δ

2

16· µnk

(Su∗,δ/8

).

It follows that

lim infk→∞

Rnk(Cnk

) ≥ δ2

16lim infk→∞

µnk

(Su∗,δ/8

) (13.38)> 0 a.s.,

Page 267: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

250 13. Consistency of Data-Dependent Partitioning Estimates

which contradicts (13.37).

Proof of Theorem 13.4. Let Πn be the family consisting of all partitionsPC for some kn-nearest neighbor clustering scheme C. Clearly,

M(Πn) = kn.

Furthermore, each set in a partition PC is an intersection of at most k2n

hyperplanes perpendicular to one of the k2n pairs of cluster centers. It follows

from Theorems 9.3 and 9.5 that a hyperplane can split n points in at most(n+ 1)d+1 different ways (cf. Problem 13.3). Therefore

∆n(Πn) ≤ ((n+ 1)d+1)k2

n = (n+ 1)(d+1)·k2n .

Now (13.6)–(13.9) follow from (13.32)–(13.35).To show (13.10), fix γ > 0 and δ ∈ (0, 1). Choose L so large that

µ([−L,L]d) ≥ 1 − δ. Then

infS:µ(S)≥1−δ

µ (x : diam(An(x) ∩ S) > γ)

≤ µ(x : diam(An(x) ∩ [−L,L]d) > γ

)

≤ µ

(x : 2 max

u∈supp(µ)∩[−L,L]dminc∈Cn

‖u− c‖ > γ

)

= 1 · I2 maxu∈supp(µ)∩[−L,L]d minc∈Cn ‖u−c‖>γ → 0 (n → ∞) a.s.

by Lemma 13.2. Thus Theorem 13.1 implies the assertion.

13.5 Bibliographic Notes

Lemmas 13.1 and 13.2 are due to Nobel (1996). Theorem 13.1 and theconsistency results in Sections 13.2, 13.3, and 13.4 are extensions of resultsfrom Nobel (1996). Related results concerning classification can be found inLugosi and Nobel (1996). Gessaman’s rule, described at the end of Section13.3, is due to Gessaman (1970).

There are several results on data-dependent partitioning estimates fornested partitions, i.e., for partitions where Pn+1 is a refinement of Pn

(maybe Pn+1 = Pn). This refinement means that the cells of Pn+1 areeither cells of Pn or splits of some cells of Pn. This splitting can be repre-sented by a tree, therefore, it is often called regression trees (cf. Gordon andOlshen (1980; 1984), Breiman et al. (1984), Devroye, Gyorfi, and Lugosi(1996), and the references therein).

Page 268: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

Problems and Exercises 251

Problems and Exercises

Problem 13.1. Find conditions on βn and kn such that the truncated data-dependent partitioning estimate, which uses a partition defined by Gessaman’srule (cf. end of Section 13.3), is strongly consistent for all distributions of (X, Y )where each component of X has a density and EY 2 < ∞.

Problem 13.2. Show that if one uses, for the construction of the kn-NN-clustering scheme, only those Xi, which are contained in a data-independentrectangle [Ln, Rn]d (where Ln → −∞ and Rn → ∞ not too fast for n → ∞),then the resulting estimate is strongly universally consistent.

Problem 13.3. Show that a hyperplane can split n points in at most (n+1)d+1

different ways.Hint: Use Theorems 9.3 and 9.5 to bound the shatter coefficient of the set

A =

x ∈ Rd : a1x(1) + · · · + adx(d) + ad+1 ≥ 0

: a1, . . . , ad+1 ∈ R

.

Problem 13.4. Let M ∈ N and let GM be the set of all (multivariate) poly-nomials of degree M (or less, in each coordinate). Let Pn = Pn(Dn) be adata-dependent partition and set

GM Pn =

f : Rd → R : f =∑

A∈Pn

gAIA for some gA ∈ GM (A ∈ Pn)

.

Define the estimate mn by (13.2) and (13.3) with Gc replaced by GM . Show that(13.6)–(13.10) imply

∫|mn(x) − m(x)|2µ(dx) → 0 (n → ∞) a.s.

Problem 13.5. Use Problem 13.4 to define consistent least squares estimatesusing piecewise polynomials with respect to data-dependent partitions, wherethe partitions are:(a) cubic partitions with data-dependent grid size;(b) statistically equivalent blocks; and(c) defined via nearest neighbor clustering.

Page 269: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

14Univariate Least Squares SplineEstimates

In Chapter 10 we have introduced least squares estimates. These estimatesheavily depend on the classes Fn of functions f : Rd → R over which oneminimizes the empirical L2 risk.

In Section 11.2 we have defined such classes Fn by choosing a partitionof Rd and taking all piecewise polynomials with respect to that partition.The main drawback of this is that such functions are generally not smooth(e.g., not continuous) and, therefore also, the corresponding least squaresestimate is generally not smooth. For the interpretability of the estimate itis often important that it is a smooth function. Moreover, using spaces ofpiecewise polynomials without any smoothness condition results in a highvariance of the estimate in cells of the partition which contain only fewof the Xi’s (because on such a cell the estimate depends only on the few(Xi, Yi)’s with Xi contained in this cell).

A remedy against this drawback of piecewise polynomials is to use setsof piecewise polynomial functions which satisfy some global smoothnesscondition (e.g., which are continuous). These so-called polynomial splinespaces will be investigated in this chapter.

14.1 Introduction to Univariate Splines

In the sequel we will define spaces of piecewise polynomials on an interval[a, b) (a, b ∈ R, a < b) which satisfy some global smoothness conditions. Todo so we choose M ∈ N0 and a partition of [a, b) into intervals [ui, ui+1)

Page 270: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

14.1. Introduction to Univariate Splines 253

u0 u1 u2 u3 u4 u5x

f(x)

Figure 14.1. Examples for a function in Su,M ([u0, uK)) for M = 0.

(i = 0, . . . ,K − 1) (where a = u0 < u1 < · · · < uK = b). Then we definethe corresponding spline space as the set of all functions f : [a, b) → Rwhich are M −1 times continuously differentiable on [a, b) and are equal toa polynomial of degree M or less on each set [ui, ui+1) (i = 0, . . . ,K − 1).

Definition 14.1. Let M ∈ N0 and u0 < u1 < · · · < uK . Set u =ujj=0,...,K . We define the spline space Su,M ([u0, uK)) as

Su,M ([u0, uK))

=f : [u0, uK) → R : there exist polynomials p0, . . . , pK−1 of

degree M or less such that f(x) = pi(x)

for x ∈ [ui, ui+1) (i = 0, . . . ,K − 1)

and if M − 1 ≥ 0 then f is M − 1 times continuously

differentiable on [u0, uK).

u is called the knot vector and M is called the degree of the splinespace Su,M ([u0, uK)).

Note that 0 times continuously differentiable functions are simplycontinuous functions.

Example 14.1. The functions in Su,0([u0, uK)) are piecewise constant(see Figure 14.1).

Example 14.2. The functions in Su,1([u0, uK)) are piecewise linear andcontinuous on [u0, uK) (see Figure 14.2).

Page 271: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

254 14. Univariate Least Squares Spline Estimates

u0 u1 u2 u3 u4 u5x

f(x)

Figure 14.2. Example for a function in Su,M ([u0, uK)) for M = 1.

Clearly, Su,M ([u0, uK)) is a linear vector space. The next lemma presentsa basis of this linear vector space. There we will use the notation

(x− u)M+ =

(x− u)M if x ≥ u,0 if x < u,

and the convention 00 = 1.

Lemma 14.1. Let M ∈ N0 and u0 < u1 < · · · < uK . Then the set offunctions

1, x, . . . , xM

∪ (x− uj)M

+ : j = 1, . . . ,K − 1

(14.1)

is a basis of Su,M ([u0, uK)), i.e., for each f ∈ Su,M ([u0, uK)) there existunique a0, . . . , aM , b1, . . . , bK−1 ∈ R such that

f(x) =M∑

i=0

aixi +

K−1∑

j=1

bj(x− uj)M+ (x ∈ [u0, uK)). (14.2)

Observe that Lemma 14.1 implies that the vector space dimension ofSu,M ([u0, uK)) is equal to (M + 1) + (K − 1) = M +K.

Proof of Lemma 14.1. Let us first observe that the functions in (14.1)are contained in Su,M ([u0, uK)). Indeed, one has

∂k

∂xk

((x− uj)M

)∣∣x=uj

= M ·(M−1) · · · (M−k+1)·((x− uj)M−k)∣∣x=uj

= 0

for k = 0, . . . ,M − 1, which implies that (x − uj)M+ is M − 1 times

continuously differentiable (j = 1, . . . ,K − 1).

Page 272: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

14.1. Introduction to Univariate Splines 255

Next we will show that the functions in (14.1) are linearly independent.Let a0, . . . , aM , b1, . . . , bK−1 ∈ R be arbitrary and assume

M∑

i=0

aixi +

K−1∑

j=1

bj(x− uj)M+ = 0 (x ∈ [u0, uK)). (14.3)

For x ∈ [u0, u1) one has (x − uj)M+ = 0 (j = 1, . . . ,K − 1), thus (14.3)

impliesM∑

i=0

aixi = 0 (x ∈ [u0, u1)). (14.4)

Because 1, x, . . . , xM are linearly independent on each set which containsat least M + 1 different points it follows that a0 = a1 = · · · = aM = 0.This, together with (14.3), implies

K−1∑

j=1

bj(x− uj)M+ = 0 (x ∈ [u0, uK)). (14.5)

Setting successively x = uj+uj+12 (j = 1, . . . ,K−1) in (14.5) one gets bj = 0

for j = 1, . . . ,K − 1 because(uj + uj+1

2− uk

)M

+= 0 for k > j.

It remains to show that for each f ∈ Su,M ([u0, uK)) there existssome a0, . . . , aM , b1, . . . , bK−1 ∈ R such that (14.2) holds. Therefore weshow by induction that for each k ∈ 0, . . . ,K − 1 there exists somea0, . . . , aM , b1, . . . , bk ∈ R such that

f(x) =M∑

i=0

aixi +

k∑

j=1

bj(x− uj)M+ (x ∈ [u0, uk+1)). (14.6)

For k = 0 this clearly holds because f is a polynomial of degree M , or less,on [u0, u1). Assume that (14.6) holds for some k < K − 1. Then g, definedby

g(x) = f(x) −M∑

i=0

aixi −

k∑

j=1

bj(x− uj)M+ ,

satisfies

g(x) = 0 (x ∈ [u0, uk+1)) (14.7)

and is M − 1 times continuously differentiable at uk+1 (because of f ∈Su,M ([u0, uK))). Thus

∂ig(uk+1)∂xi

= 0 for i = 0, . . . ,M − 1. (14.8)

Page 273: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

256 14. Univariate Least Squares Spline Estimates

Furthermore, because f ∈ Su,M ([u0, uK)), g is equal to a polynomial ofdegree M or less on [uk+1, uk+2). Thus there exist c0, . . . , cM ∈ R suchthat

g(x) =M∑

i=0

ci(x− uk+1)i (x ∈ [uk+1, uk+2)).

Since

∂jg(uk+1)∂xj

=M∑

i=j

ci · i · (i− 1) · · · (i− j + 1) · (x− uk+1)i−j∣∣x=uk+1

= j!cj

it follows from (14.8) that c0 = · · · = cM−1 = 0, thus

g(x) − cM (x− uk+1)M = 0 (x ∈ [uk+1, uk+2)).

This together with (14.7) implies (14.6) for k + 1 and bk+1 = cM .

We can use the basis from Lemma 14.1 to implement splines on a com-puter. To do this we represent splines as linear combinations of the basisfunctions and store in the computer only the coefficients of these linearcombinations.

Unfortunately, the basis from Lemma 14.1 does not provide efficientspline representation on a computer. For example, because the supportsof the basis functions are unbounded, evaluation of

f(x) =M∑

i=0

aixi +

K−1∑

j=1

bj(x− uj)M+

at some x ∈ R for given a0, . . . , aM , b1, . . . , bK−1 requires evaluation ofnearly all the (K +M) basis functions and thus the amount of time to dothis increases with K. Therefore one prefers a basis where the support ofthe basis functions is as small as possible. Next we will introduce such abasis, called a B-spline basis.

The definition of B-splines depends on 2M additional knots u−M , . . . ,u−1, uK+1, . . . , uK+M which satisfy u−M ≤ u−M+1 ≤ · · · ≤ u−1 ≤ u0 anduK ≤ uK+1 ≤ · · · ≤ uK+M . We will use again the notation u for theextended knot vector u = ujj=−M,...,K+M .

Definition 14.2. Let M ∈ N0, u−M ≤ · · · ≤ uK+M , and set u =ujj=−M,...,K+M . Then the B-splines Bj,l,u of degree l and with knotvector u are recursively defined by

Bj,0,u(x) =

1 if uj ≤ x < uj+1,0 if x < uj or x ≥ uj+1,

(14.9)

for j = −M, . . . ,K +M − 1, x ∈ R, and

Bj,l+1,u(x) =x− uj

uj+l+1 − ujBj,l,u(x) +

uj+l+2 − x

uj+l+2 − uj+1Bj+1,l,u(x) (14.10)

Page 274: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

14.1. Introduction to Univariate Splines 257

u0 u1 u2 u3 u4 u5x

B1,0,u(x)

1

u0 u1 u2 u3 u4 u5x

B1,1,u(x)

1

Figure 14.3. Examples for B-splines of degree 0 and 1.

for j = −M, . . . ,K +M − l − 2, l = 0, . . . ,M − 1, x ∈ R.

In (14.10) uj+l+1 − uj = 0 (or uj+l+2 − uj+1 = 0) implies Bj,l,u(x) = 0(or Bj+1,l,u(x) = 0). We have used the convention 0

0 = 0.

Example 14.3. For M = 0, Bj,0,u is the indicator function of the set[uj , uj+1). Clearly, Bj,0,u : j = 0, . . . ,K− 1 is a basis of Su,0([u0, uK)).

Example 14.4. For M = 1 one gets

Bj,1,u(x) =

x−uj

uj+1−ujfor uj ≤ x < uj+1,

uj+2−xuj+2−uj+1

for uj+1 ≤ x < uj+2,

0 if x < uj or x ≥ uj+2.

The B-spline Bj,1,u is equal to one at uj+1, zero at all the other knots,and is linear between consecutive knots (the so-called hat function). For aspecial knot sequence this basis is illustrated in Figure 14.4. There u−1 =u0, K = 5, u6 = u5, and Bj,1,u is the function with support [uj , uj+2](j = −1, . . . , 4). Observe that the support of the B-splines in Figure 14.4 is

Page 275: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

258 14. Univariate Least Squares Spline Estimates

u0 u1 u2 u3 u4 u5x

1

Figure 14.4. B-spline basis for degree M = 1.

much smaller than the support of the basis described in Lemma 14.1. It iseasy to see that Bj,1,u : j = −1, . . . ,K − 1 is a basis for Su,1([u0, uK)).

Before we show for arbitrary M that Bj,M,u : j = −M, . . . ,K−1 is abasis for Su,M ([u0, uK)) we prove some useful properties of the B-splines.

Lemma 14.2. Let M ∈ N0 and u−M ≤ · · · ≤ u0 < · · · < uK ≤ · · · ≤uK+M .(a)

Bj,M,u(x) = 0 for x /∈ [uj , uj+M+1) (14.11)

for j ∈ −M, . . . ,K − 1.(b)

Bj,M,u(x) ≥ 0 (14.12)

for x ∈ R and j ∈ −M, . . . ,K − 1.(c)

K−1∑

j=−M

ajBj,M,u(x)

=K−1∑

j=−(M−1)

aj

x− uj

uj+M − uj+ aj−1

uj+M − x

uj+M − uj

Bj,M−1,u(x)

(14.13)

for x ∈ [u0, uK), a−M , . . . , aK−1 ∈ R, and M > 0.

Proof. Equation (14.11) follows easily from (14.9) and (14.10) by induc-tion onM . Then (14.12) follows from (14.11), (14.9), and (14.10). It remains

Page 276: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

14.1. Introduction to Univariate Splines 259

u0 u1 u2 u3 u4 u5x

B1,2,u(x)

1

u0 u1 u2 u3 u4 u5x

B1,3,u(x)

1

Figure 14.5. Examples for B-splines of degree 2 and 3.

to show (14.13): For x ∈ [u0, uK) one has

K−1∑

j=−M

ajBj,M,u(x)

(14.10)=

K−1∑

j=−M

aj

(x− uj

uj+M − ujBj,M−1,u(x)

+uj+M+1 − x

uj+M+1 − uj+1Bj+1,M−1,u(x)

)

=K−1∑

j=−(M−1)

(aj

x− uj

uj+M − uj+ aj−1

uj+M − x

uj+M − uj

)Bj,M−1,u(x)

+a−Mx− u−M

u0 − u−MB−M,M−1,u(x)

Page 277: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

260 14. Univariate Least Squares Spline Estimates

+aK−1uK+M − x

uK+M − uKBK,M−1,u(x)

=K−1∑

j=−(M−1)

(aj

x− uj

uj+M − uj+ aj−1

uj+M − x

uj+M − uj

)Bj,M−1,u(x)

where the last equality follows from (14.11).

Because of (14.11) the B-splines have bounded support. Nevertheless,the recursive definition of the B-splines seems very inconvenient for rep-resenting spline functions with the aid of these B-splines on a computer.We will show that this is not true by explaining an easy way to evaluatea linear combination of B-splines at a given point x (the so-called de Booralgorithm).

Assume that we are given the coefficients aj : j = −M, . . . ,K − 1 ofa linear combination

f =K−1∑

j=−M

ajBj,M,u

of B-splines and that we want to evaluate this function f at some givenpoint x ∈ [u0, uK). Then setting aj,M := aj it follows from (14.13) thatone has

f(x) =K−1∑

j=−M

aj,MBj,M,u(x) =K−1∑

j=−(M−1)

aj,M−1Bj,M−1,u(x)

= · · · =K−1∑

j=0

aj,0Bj,0,u(x),

where (depending on x) the aj,l’s are recursively defined by

aj,l−1 = aj,lx− uj

uj+l − uj+ aj−1,l

uj+l − x

uj+l − uj(14.14)

(j ∈ −(l − 1), . . . ,K − 1, l ∈ 1, . . . ,M).Now let j0 be such that uj0 ≤ x < uj0+1. Then, because of (14.9), one

gets

f(x) = aj0,0.

Thus all that one has to do is to use (14.14) to compute aj0,0. To do this itsuffices to start with aj0−M,M = aj0−M , . . . , aj0,M = aj0 and to successivelyuse (14.14) to compute aj0−l,l, . . . , aj0,l for l = M − 1, . . . , 0 (cf. Figure14.6). The number of operations needed depends only on M and not on K– this is the great advantage of the B-spline basis compared with the basisof Lemma 14.1.

We will show next that Bj,M,u : j = −M, ...,K − 1 is indeed a basisof Su,M ([u0, uK)). For this we will need the following lemma:

Page 278: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

14.1. Introduction to Univariate Splines 261

aj0−2,2 = aj0−2 aj0−1,2 = aj0−1 aj0,2 = aj0

aj0−1,1 aj0,1

aj0,0

uj0+1−x

uj0+1−uj0−1

x−uj0−1

uj0+1−uj0−1

uj0+2−x

uj0+2−uj0

x−uj0uj0+2−uj0

uj0+1−x

uj0+1−uj0

x−uj0uj0+1−uj0

Figure 14.6. Computation of∑K

j=0 ajBj,2,u(x) for x ∈ [uj0 , uj0+1).

Lemma 14.3. Let M ∈ N0 and u−M ≤ · · · ≤ u0 < · · · < uK ≤ · · · ≤uK+M . For j ∈ −M, ...,K − 1 and t ∈ R set

ψj,M (t) = (uj+1 − t) · . . . · (uj+M − t).

Then

(x− t)M =K−1∑

j=−M

ψj,M (t)Bj,M,u(x) (14.15)

for all x ∈ [u0, uK), t ∈ R.

Proof. For l ∈ N , t ∈ R, set

ψj,l(t) = (uj+1 − t) · . . . · (uj+l − t)

and set ψj,0(t) = 1 (t ∈ R). We will show

K−1∑

j=−l

ψj,l(t)Bj,l,u(x) = (x− t)K−1∑

j=−(l−1)

ψj,l−1(t)Bj,l−1,u(x) (14.16)

for x ∈ [u0, uK), t ∈ R, and l ∈ 1, . . . ,M. From this one obtains theassertion by

K−1∑

j=−M

ψj,M (t)Bj,M,u(x)(14.16)

= (x− t)K−1∑

j=−(M−1)

ψj,M−1(t)Bj,M−1,u(x)

(14.16)= . . .

Page 279: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

262 14. Univariate Least Squares Spline Estimates

(14.16)= (x− t)M

K−1∑

j=0

ψj,0(t)Bj,0,u(x)

(14.9)= (x− t)M .

So it suffices to prove (14.16).It follows from (14.13) that

K−1∑

j=−l

ψj,l(t)Bj,l,u(x)

=K−1∑

j=−(l−1)

(ψj,l(t)

x− uj

uj+l − uj+ ψj−1,l(t)

uj+l − x

uj+l − uj

)Bj,l−1,u(x).

Therefore (14.16) follows from

ψj,l(t)x− uj

uj+l − uj+ ψj−1,l(t)

uj+l − x

uj+l − uj= (x− t)ψj,l−1(t) (14.17)

(x, t ∈ R, l ∈ N , j = −(l − 1), . . . ,K − 1), which we will show next. Forfixed t ∈ R both sides in (14.17) are linear polynomials in x. Therefore itsuffices to show (14.17) for x = uj and x = uj+l. For x = uj one gets, forthe left-hand side of (14.17),

ψj,l(t) · 0 + ψj−1,l(t) · 1 = (uj − t) · · · (uj+l−1 − t) = (uj − t) · ψj,l−1(t),

and for x = uj+l one gets for the left-hand side of (14.17),

ψj,l(t) · 1 + ψj−1,l(t) · 0 = (uj+1 − t) · . . . · (uj+l − t) = (uj+l − t) · ψj,l−1(t).

This implies (14.17), thus Lemma 14.3 is proved.

Theorem 14.1. For M ∈ N0 and u−M ≤ · · · ≤ u0 < · · · < uK ≤ · · · ≤uK+M , Bj,M,u : j = −M, . . . ,K − 1 restricted to [u0, uK) is a basis ofSu,M ([u0, uK)).

Proof. We will show

Su,M ([u0, uK)) ⊆ span Bj,M,u : j = −M, . . . ,K − 1 . (14.18)

By Lemma 14.1, Su,M ([u0, uK)) is a linear vector space of dimension K+Mand obviously span Bj,M,u : j = −M, . . . ,K − 1 is a linear vector spaceof dimension less than or equal to K +M , therefore it follows from (14.18)that the two vector spaces are equal and that Bj,M,u : j = −M, . . . ,K−1 is a basis of them. Thus it suffices to prove (14.18). Because of Lemma14.1, (14.18) follows from

p ∈ span Bj,M,u : j = −M, . . . ,K − 1 (14.19)

for each polynomial p of degree M or less and from

(x− uk)M+ ∈ span Bj,M,u : j = −M, . . . ,K − 1 (14.20)

Page 280: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

14.1. Introduction to Univariate Splines 263

for each k = 1, . . . ,K − 1. If t0, . . . , tM ∈ R are pairwise distinct then eachpolynomial of degree M or less can be expressed as a linear combinationof the polynomials (x − tj)M (j = 0, . . . ,M). Thus (14.19) follows fromLemma 14.3. We now show

(x− uk)M+ =

K−1∑

j=k

ψj,M (uk)Bj,M,u(x) (x ∈ [u0, uK), k ∈ 1, . . . ,K − 1)

(14.21)which implies (14.20).

For x < uk one has

(x− uk)M+ = 0 =

K−1∑

j=k

ψj,M (uk) · 0(14.11)

=K−1∑

j=k

ψj,M (uk)Bj,M,u(x).

For x ≥ uk one has

(x− uk)M+ = (x− uk)M Lemma 14.3=

K−1∑

j=−M

ψj,M (uk)Bj,M,u(x)

(14.11)=

K−1∑

j=k−M

ψj,M (uk)Bj,M,u(x) =K−1∑

j=k

ψj,M (uk)Bj,M,u(x),

because for j ∈ k −M, . . . , k − 1 one has

ψj,M (uk) = (uj+1 − uk) · . . . · (uj+M − uk) = 0.

In Definition 14.2 the B-splines Bj,M,u are defined on whole R. Thereforealso spanBj,M,u : j = −M, . . . ,K − 1 is a space of functions definedon whole R. By Theorem 14.1 the restriction of this space of functions to[u0, uK) is equal to Su,M ([u0, uK)).

To define the least squares estimates we need spaces of functions definedon whole R, therefore we introduce

Definition 14.3. For M ∈ N0 and u−M ≤ · · · ≤ u0 < · · · < uK ≤ · · · ≤uK+M the spline space Su,M of functions f : R → R is defined by

Su,M = span Bj,M,u : j = −M, . . . ,K − 1 .Here M is called the degree and u = ujj=−M,...,K−1 is called the knotsequence of Su,M .

Remark. (a) If we restrict all the functions in Su,M on [u0, uK), then weget the spline space Su,M ([u0, uK)).

(b) While the spline space Su,M ([u0, uK)) is independent of the knotsu−M , . . . , u−1 and uK+1, . . . , uK+M , the B-spline basis and therefore alsoSu,M depends on the knots.

(c) The functions in Su,M are equal to a polynomial of degree less than orequal to M on each set (−∞, u−M ), [u−M , u−M+1), . . . , [uK+M−1, uK+M ),

Page 281: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

264 14. Univariate Least Squares Spline Estimates

[uK+M ,∞). They are zero outside [u−M , uK+M ), M − 1 times continu-ously differentiable on [u0, uK), and in the case u−M < · · · < u0 anduK < · · · < uK+M even M − 1 times continuously differentiable on[u−M , uK+M ).

We will often need the following property of the B-splines:

Lemma 14.4. Let M ∈ N0 and u−M ≤ · · · ≤ u0 < · · · < uK ≤ · · · ≤uK+M . Then one has

K−1∑

j=−M

Bj,M,u(x) = 1 for x ∈ [u0, uK). (14.22)

Proof. Differentiating (14.15) M times with respect to t one gets

M ! · (−1)M =∂M

∂tM(x− t)M

=K−1∑

j=−M

∂M

∂tM((uj+1 − t) · . . . · (uj+M − t))Bj,M,u(x)

=K−1∑

j=−M

M ! · (−1)MBj,M,u(x),

which implies the assertion.

Lemma 14.5. Let 0 ≤ i ≤ K − 1. Then

Bi−M,M,u, . . . , Bi,M,uis a basis of span1, x, . . . , xM on [ui, ui+1), i.e.,

span Bi−M,M,u, . . . , Bi,M,u = span1, x, . . . , xM on [ui, ui+1)(14.23)

and

Bi−M,M,u, . . . , Bi,M,u are linearly independent on [ui, ui+1). (14.24)

Proof. Differentiating (14.15) (M − l) times with respect to t one gets

(−1)M−lM · (M − 1) · . . . · (l+ 1)(x− t)l =K−1∑

j=−M

∂M−lψj,M

∂tM−l(t) ·Bj,M,u(x).

If x ∈ [ui, ui+1), then (14.11) implies

Bj,M,u(x) = 0 for j /∈ i−M, . . . , i.Hence,

(−1)M−lM · (M − 1) · . . . · (l + 1)xl =i∑

j=i−M

∂M−lψj,M

∂tM−l(0) ·Bj,M,u(x),

Page 282: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

14.1. Introduction to Univariate Splines 265

which proves

1, x, . . . , xM ∈ span Bi−M,M,u, . . . , Bi,M,u on [ui, ui+1).

On the other hand, the definition of the spline spaces implies

Bi−M,M,u, . . . , Bi,M,u ∈ span1, x, . . . , xM on [ui, ui+1),

from which one concludes (14.23). Furthermore, 1, x, . . . , xM are linearly in-dependent on [ui, ui+1), thus the dimension of spanBi−M,M,u, . . . , Bi,M,uis equal to the number of functions in Bi−M,M,u, . . . , Bi,M,u, whichimplies (14.24).

By definition, the derivative of a spline function f of degree M is a splinefunction of degree M − 1 with respect to the same knot vector. Hence thederivative f ′ of a linear combination f of B-splines of degree M can berepresented by a linear combination of B-splines of degree M −1. Our nextlemma shows that it is easy to compute the coefficients of f ′ given thecoefficients of f .

Lemma 14.6. (a) For all j ∈ −M, . . . ,K − 1 and x ∈ [u0, uK),

∂xBj,M,u(x) =

M

uj+M − ujBj,M−1,u(x) − M

uj+M+1 − uj+1Bj+1,M−1,u(x).

(b) For all x ∈ [u0, uK),

∂x

K−1∑

j=−M

aj ·Bj,M,u(x) =K−1∑

j=−(M−1)

M

uj+M − uj(aj − aj−1)Bj,M−1,u(x).

Proof. Because of ∂∂xBj,M,u ∈ Su,M−1 we get

∂xBj,M,u(x) =

K−1∑

i=−(M−1)

αi,jBi,M−1,u(x) (x ∈ [u0, uK))

for some α−(M−1),j , . . . , αK−1,j ∈ R. Let k ≤ j−1 or k ≥ j+M +1. Then

Bj,M,u(x) = 0 for all x ∈ [uk, uk+1),

which implies

0 =∂

∂xBj,M,u(x) =

K−1∑

i=−(M−1)

αi,jBi,M−1,u(x) =k∑

i=k−(M−1)

αi,jBi,M−1,u(x)

for all x ∈ [uk, uk+1). From this and Lemma 14.5 we conclude that

αk−(M−1),j = · · · = αk,j = 0

if k ≤ j − 1 or k ≥ j +M + 1, hence

α−(M−1),j = · · · = αj−1,j = αj+2,j = · · · = αK−1,j = 0

Page 283: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

266 14. Univariate Least Squares Spline Estimates

and therefore∂

∂xBj,M,u(x) = αjBj,M−1,u(x) + βjBj+1,M−1,u(x) (x ∈ [u0, uK))

for some αj , βj ∈ R. It remains to determine the explicit form of αj andβj .

Because of

0 =∂

∂x1 =

∂x

K−1∑

j=−M

Bj,M,u(x)

=K−1∑

j=−M

(αjBj,M−1,u(x) + βjBj+1,M−1,u(x))

=K−1∑

j=−(M−1)

(αj + βj−1)Bj,M−1,u(x)

(because B−M,M−1,u(x) = BK,M−1,u(x) = 0 for x ∈ [u0, uK))

for all x ∈ [u0, uK), we get

βj = −αj+1 for j = −M, . . . ,K − 2.

By Lemma 14.3,

M · xM−1 =∂

∂xxM =

∂x

K−1∑

j=−M

ψj,M (0)Bj,M,u(x)

=K−1∑

j=−M

ψj,M (0) (αjBj,M−1,u(x) − αj+1Bj+1,M−1,u(x))

=K−1∑

j=−(M−1)

αj (ψj,M (0) − ψj−1,M (0))Bj,M−1,u(x)

(because of B−M,M−1,u(x) = BK,M−1,u(x) = 0 for x ∈ [u0, uK))

for all x ∈ [u0, uK). On the other hand, by applying Lemma 14.3 withdegree M − 1 instead of M , we get

xM−1 =K−1∑

j=−(M−1)

ψj,M−1(0)Bj,M−1,u(x).

Hence,

M · ψj,M−1(0) = αj (ψj,M (0) − ψj−1,M (0)) ,

which implies

αj =M · ψj,M−1(0)

ψj,M (0) − ψj−1,M (0)

Page 284: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

14.2. Consistency 267

=M · uj+1 · . . . · uj+M−1

uj+1 · . . . · uj+M − uj · . . . · uj+M−1

=M

uj+M − uj

and

βj = −αj+1 = − M

uj+M+1 − uj+1.

This proves (a). The assertion of (b) follows directly from (a).

14.2 Consistency

In this section we will investigate the question: How should one choose thedegree and the knot sequence of a univariate spline space depending on thedata in order to get universally consistent least squares estimates?

In order to give an answer to this question, we generalize Theorem 13.1,which deals with data-dependent histogram estimates, to least squaresestimates using data-dependent spline spaces. One of the conditions in The-orem 13.1 was that the measure of those cells in the partition, for whichthe diameter does not shrink to zero, converges to zero (cf. (13.10)). Thecondition (14.30) below generalizes this to data-dependent spline spaces.

Theorem 14.2. For n ∈ N let Mmax(n) ∈ N , Kmax(n) ∈ N , and βn ∈R+. Depending on Dn choose M ∈ N0, K ∈ N , and u−M , . . . , uK+M ∈ Rsuch that M ≤ Mmax(n), K ≤ Kmax(n), and u−M ≤ · · · ≤ u0 < · · · <uK ≤ · · · ≤ uK+M . Define the estimate mn by

mn = arg minf∈Fn

1n

n∑

j=1

|f(Xj) − Yj |2 (14.25)

and

mn(x) = Tβnmn(x) (14.26)

with Fn = Su,M .(a) Assume that

βn → ∞ (n → ∞), (14.27)

(Kmax(n) ·Mmax(n) +Mmax(n)2)β4n log(n)

n→ 0 (n → ∞) (14.28)

and

β4n

n1−δ→ 0 (n → ∞) (14.29)

Page 285: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

268 14. Univariate Least Squares Spline Estimates

for some δ > 0. If, in addition, the distribution µ of X satisfies

µ

(−∞, u0) ∪

k=1,...,K,uk−uk−M−1>γ

[uk−1, uk) ∪ [uK ,∞)

∩ [−L,L]

→ 0

(14.30)(n → ∞) a.s. for each L, γ > 0, or the empirical distribution µn ofX1, . . . , Xn satisfies

µn

(−∞, u0) ∪

k=1,...,K,uk−uk−M−1>γ

[uk−1, uk) ∪ [uK ,∞)

∩ [−L,L]

→ 0

(14.31)(n → ∞) a.s. for each L, γ > 0, then, for EY 2 < ∞,

∫|mn(x) −m(x)|2µ(dx) → 0 (n → ∞) a.s.

(b) Assume that (14.27) and (14.28) hold. If, in addition, the distributionµ of X satisfies

(−∞, u0) ∪

k=1,...,K,uk−uk−M−1>γ

[uk−1, uk) ∪ [uK ,∞)

∩ [−L,L]

→ 0

(14.32)(n → ∞) for each L, γ > 0, or the empirical distribution µn of X1, . . . , Xn

satisfies

Eµn

(−∞, u0) ∪

k=1,...,K,uk−uk−M−1>γ

[uk−1, uk) ∪ [uK ,∞)

∩ [−L,L]

→ 0

(14.33)(n → ∞) for each L, γ > 0, then, for EY 2 < ∞,

E∫

|mn(x) −m(x)|2µ(dx) → 0 (n → ∞).

Remark. (a) The left-hand sides in (14.30) and (14.31) are randomvariables because the degree and the knot sequence depend on the data.

(b) In the case M = 0 the estimate mn is a truncated histogram estimateusing a data-dependent partition. In this case the assertion follows fromTheorem 13.1.

If the degree and the knot sequence are chosen such that (14.30) or(14.31) hold for every distribution µ of X, then Theorem 14.2 implies thatthe estimate is universally consistent. Before we prove the theorem we willgive examples for such choices of the degree and the knot sequence.

Page 286: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

14.2. Consistency 269

In the first example we consider data-independent knots.

Example 14.5. Let M ∈ N0. Let Ln, Rn ∈ R and Kn ∈ N (n ∈ N ) besuch that

Ln → −∞, Rn → ∞ (n → ∞) (14.34)

andRn − Ln

Kn→ 0 (n → ∞). (14.35)

Set K = Kn and

uk = Ln + k · Rn − Ln

Kn(k = −M, ...,Kn +M). (14.36)

Then (14.30) holds because, for fixed L, γ > 0,

(−∞, u0) ∪

k=1,...,K, uk−uk−M−1>γ

[uk−1, uk) ∪ [uK ,∞)

∩ [−L,L] = ∅

for n sufficiently large (i.e., for n so large that u0 = Ln < −L, uK = Rn >L, and uk − uk−M−1 = (M + 1) · (Rn − Ln)/Kn ≤ γ).

In the next example we consider data-dependent knots.

Example 14.6. Let M ∈ N0. Let Cn,Kn ∈ N , δn ≥ 0 (n ∈ N ) be suchthat

δn → 0 (n → ∞) (14.37)

andCn

n→ 0 (n → ∞). (14.38)

Set K = Kn and choose the knots such that there are less than Cn of theX1, ..., Xn in each of the intervals (−∞, u0) and [uKn ,∞) and such thatfor every k ∈ 1, ...,Kn with uk − uk−M−1 > δn there are less than Cn ofthe X1, ..., Xn in [uk−1, uk). Then (14.31) holds.

Indeed, let L, γ > 0. Because of (14.37) we can assume w.l.o.g. thatδn < γ. Then uk − uk−M−1 > γ implies µn([uk−1, uk)) ≤ Cn/n, thus

µn

(−∞, u0) ∪

k=1,...,Kn,uk−uk−M−1>γ

[uk−1, uk) ∪ [uKn,∞)

∩ [−L,L]

Page 287: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

270 14. Univariate Least Squares Spline Estimates

≤ µn ((−∞, u0)) +∑

uk−uk−M−1>γ,[uk−1,uk)∩[−L,L]=∅µn ([uk−1, uk))

+µn ([uKn ,∞))

≤ 2Cn

n+ (M + 1)

(2Lγ

+ 2)Cn

n→ 0 (n → 0)

because of (14.38).

Example 14.7. Assume X1, . . . , Xn are all distinct a.s. and chooseeach n

Knth-order statistic of X1, ..., Xn as a knot. Then each sequence

mnn∈N of estimators which satisfies (14.25) and (14.26) with Fn = Su,M

is weakly and strongly consistent for every distribution of (X,Y ) with Xnonatomic and EY 2 < ∞, provided that (14.27)–(14.29) hold and

Kn → ∞ (n → ∞). (14.39)

This follows immediately from Theorem 14.2 and the previous exampleby setting Cn = n

Kn + 1 and δn = 0.

Proof of Theorem 14.2. (a) Because of Theorem 10.2 it suffices to showthat, for each L > 0,

supf∈Tβn Su,M

∣∣∣∣∣∣

1n

n∑

j=1

|f(Xj) − Yj,L|2 − E|f(X) − YL|2∣∣∣∣∣∣→ 0 (n → ∞) a.s.

(14.40)and

inff∈Su,M , ‖f‖∞≤βn

∫|f(x) −m(x)|2µ(dx) → 0 (n → ∞) a.s. (14.41)

Proof of (14.40). Let Πn be the family of all partitions of R consistingof Kmax(n) + 2Mmax(n) + 2 or less intervals and let G be the set of allpolynomials of degree Mmax(n) or less. Then Su,M ⊂ GΠn and, therefore,it suffices to show (14.40) with the data-dependent set Su,M replaced by thedata-independent set G Πn. G is a linear space of functions of dimensionMmax(n) + 1, thus VG+ ≤ Mmax(n) + 2 (see Theorem 9.5). By Example13.1 the partitioning number of Πn satisfies

∆n(Πn) ≤(n+Kmax(n) + 2Mmax(n) + 1

n

)

≤ (n+Kmax(n) + 2Mmax(n) + 1)Kmax(n)+2Mmax(n)+1.

As in the proof of Theorem 13.1 this implies, for 0 ≤ L ≤ βn,

P

[

supf∈Tβn GΠn

∣∣∣∣∣1n

n∑

i=1

|f(Xi) − Yi,L|2 − E|f(X) − YL|2∣∣∣∣∣> t

]

Page 288: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

14.2. Consistency 271

≤ 8(n+Kmax(n) + 2Mmax(n) + 1)(Kmax(n)+2 Mmax(n)+1)

×(

333eβ2n

t

)2(Mmax(n)+2)(Kmax(n)+2 Mmax(n)+2)

exp(

− n t2

2048β4n

)

and from this, (14.28), and (14.29), one obtains the assertion by an easyapplication of the Borel–Cantelli lemma.

Proof of (14.41). C∞0 (R) is dense in L2(µ) (cf. Corollary A.1), hence it

suffices to prove (14.41) for some function m ∈ C∞0 (R). Because of (14.27)

we may further assume ‖m‖∞ ≤ βn.Define Qm ∈ Su,M by Qm =

∑K−1j=−M m(uj) · Bj,M,u. Then (14.12) and

(14.22) imply

|(Qm)(x)| ≤ maxj=−M,...,K−1

|m(uj)|K−1∑

j=−M

Bj,M,u(x) ≤ ‖m‖∞ ≤ βn,

thus Qm ∈ f ∈ Su,M : ‖f‖∞ ≤ βn. Let x ∈ [ui, ui+1) for some 0 ≤ i ≤K − 1. Then

|m(x) − (Qm)(x)| (14.22)=

∣∣∣∣∣∣

K−1∑

j=−M

(m(x) −m(uj))Bj,M,u(x)

∣∣∣∣∣∣

(14.11)=

∣∣∣∣∣∣

i∑

j=i−M

(m(x) −m(uj))Bj,M,u(x)

∣∣∣∣∣∣

(14.12)≤ max

j=i−M,...,i|m(x) −m(uj)|

i∑

j=i−M

Bj,M,u(x)

(14.22)≤ ‖m′‖∞ · |ui+1 − ui−M | ≤ ‖m′‖∞hu,M (x),

where

hu,M (x) =

(ui+1 − ui−M ) if x ∈ [ui, ui+1) for some 0 ≤ i ≤ K − 1,∞ if x ∈ (−∞, u0) or x ∈ [uk,∞).

Using this one gets, for arbitrary L, γ > 0,

inff∈Su,M , ‖f‖∞≤βn

∫|f(x) −m(x)|2µ(dx)

≤∫

|(Qm)(x) −m(x)|2µ(dx)

Page 289: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

272 14. Univariate Least Squares Spline Estimates

=∫

R\[−L,L]|(Qm)(x) −m(x)|2µ(dx)

+∫

x∈R : hu,M (x)>γ∩[−L,L]|(Qm)(x) −m(x)|2µ(dx)

+∫

x∈R : hu,M (x)≤γ∩[−L,L]|(Qm)(x) −m(x)|2µ(dx)

≤ 4‖m‖2∞ (µ(R \ [−L,L]) + µ(x ∈ R : hu,M (x) > γ ∩ [−L,L]))

+γ2‖m′‖2∞

= 4‖m‖2∞ ·

(

µ(R \ [−L,L])

+µ(

(−∞, u0) ∪⋃

k=1,...,K,uk−uk−M−1>γ

[uk−1, uk) ∪ [uK ,∞) ∩ [−L,L]

))

+γ2‖m′‖2∞.

If (14.30) holds, then one gets

lim supn→∞

inff∈Su,M ,‖f‖∞≤βn

∫|f(x) −m(x)|2µ(dx)

≤ 4‖m‖2∞µ (R \ [−L,L]) + γ2‖m′‖2

∞ a.s.

for each L, γ > 0, and the assertion follows with L → ∞ and γ → 0. If(14.31) holds, then

µ

(−∞, u0) ∪

k=1,...,K,uk−uk−M−1>γ

[uk−1, uk) ∪ [uK ,∞)

∩ [−L,L]

≤ µn

(−∞, u0) ∪

k=1,...,K,uk−uk−M−1>γ

[uk−1, uk) ∪ [uK ,∞)

∩ [−L,L]

+ supf∈G0Πn

∣∣∣∣∣1n

n∑

i=1

f(Xi) − Ef(X)

∣∣∣∣∣,

where G0 consists of two functions which are constant zero and constantone, respectively. For f ∈ G0 Πn one has f(x) ∈ 0, 1 (x ∈ R). This,together with (14.40), implies

supf∈G0Πn

∣∣∣∣∣1n

n∑

i=1

f(Xi) − Ef(X)

∣∣∣∣∣

Page 290: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

14.3. Spline Approximation 273

= supf∈G0Πn

∣∣∣∣∣1n

n∑

i=1

|f(Xi) − 0|2 − E|f(X) − 0|2∣∣∣∣∣→ 0 (n → ∞) a.s.,

thus (14.31) implies (14.30) which in turn implies the assertion.(b) The proof of (b) is similar to the first part of the proof and is thereforeomitted (cf. Problem 14.1).

14.3 Spline Approximation

In this section we will investigate how well smooth functions (e.g., contin-uously differentiable functions) can be approximated by spline functions.Our aim is to derive results similar to the following result for approxima-tion with piecewise polynomials satisfying no global smoothness condition:If f is (M + 1) times continuously differentiable and one approximates fon an interval [a, b) by partitioning the interval into subintervals [uj , uj+1)(j = 0, . . . ,K − 1) and defines q on each interval [uj , uj+1) as a Taylorpolynomial of f of degree M around a fixed point in [uj , uj+1), then onehas, for each j ∈ 0, . . . ,K − 1 and each x ∈ [uj , uj+1),

|f(x) − q(x)| ≤ ‖f (M+1)‖∞,[uj ,uj+1)

(M + 1)!(uj+1 − uj)M+1. (14.42)

Here f (M+1) denotes the (M + 1)th derivative of f , and

‖f (M+1)‖∞,[uj ,uj+1) = supx∈[uj ,uj+1)

|f (M+1)(x)|.

In the sequel we will use the notation C(R) for the set of continuousfunctions f : R → R. The following definitions will be useful:

Definition 14.4. (a) For j ∈ −M, . . . ,K − 1 let Qj : C(R) → R bea linear mapping (i.e., Qj(αf + βg) = αQj(f) + βQj(g) for α, β ∈ R,f, g ∈ C(R)) such that Qjf depends only on the values of f in [uj , uj+M+1).Then the functional Q : C(R) → Su,M defined by

Qf =K−1∑

j=−M

(Qjf) ·Bj,M,u

is called a quasi interpolant.(b) A quasi interpolant Q is called bounded if there exists a constantc ∈ R such that

|Qjf | ≤ c · ‖f‖∞,[uj ,uj+M+1) (j ∈ −M, . . . ,K − 1, f ∈ C(R)). (14.43)

The smallest constant c such that (14.43) holds is denoted by ‖Q‖.(c) A quasi interpolant Q has order l if

(Qp)(x) = p(x) (x ∈ [u0, uK))

Page 291: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

274 14. Univariate Least Squares Spline Estimates

for each polynomial p of degree l or less.

Example 14.8. Let tj ∈ [uj , uj+M+1) for j ∈ −M, . . . ,K − 1. Then

Qf =K−1∑

j=−M

f(tj)Bj,M,u

defines a bounded quasi interpolant, which has order zero (because of(14.22)).

Example 14.9. Let M > 0 and define Q by

Qf =K−1∑

j=−M

f

(uj+1 + · · · + uj+M

M

)Bj,M,u.

Clearly, Q is a bounded quasi interpolant. We show that Q has order 1 :Because of (14.22) it suffices to show that Q reproduces the linear polyno-mial p(x) = x. Differentiating (14.15) M − 1 times with respect to t andsetting t = 0 yields

M ! · (−1)M−1x =K−1∑

j=−M

(−1)M−1(M − 1)! · (uj+1 + · · · + uj+M )Bj,M,u(x),

where we have used

ψj,M (t) = (−1)M tM + (uj+1 + · · · + uj+M )(−1)M−1tM−1 + q(t)

for some polynomial q of degree less than M − 1. This implies Qp = p.

For bounded quasi interpolants of order l one can show approximationresults similar to (14.42).

Theorem 14.3. Let M ∈ N0 and u−M ≤ · · · ≤ u0 < · · · < uK ≤ · · · ≤uK+M . For x ∈ R set

h(x) = maxj:uj≤x<uj+M+1

(uj+M+1 − uj)

(i.e., h(x) is the maximal length of support of the B-splines which do notvanish at x). Let Q be a bounded quasi interpolant of order l. Then onehas, for each f ∈ Cl+1([u−M , uK+M )),

|(Qf)(x) − f(x)| ≤ ‖Q‖‖f (l+1)‖∞,[u−M ,uK+M )

(l + 1)!h(x)l+1 (x ∈ [u0, uK)).

Proof. Fix x ∈ [u0, uK) and let p be the Taylor polynomial of f of orderl about x. Then one has, for each z ∈ R,

|f(z) − p(z)| ≤ ‖f (l+1)‖∞,[minx,z,maxx,z]

(l + 1)!|z − x|l+1. (14.44)

Page 292: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

14.3. Spline Approximation 275

Further by definition of p one has f(x) = p(x) and because Q is of order l,(Qp)(x) = p(x). Using this one gets

|(Qf)(x) − f(x)| = |(Qf)(x) − p(x)| = |(Qf)(x) − (Qp)(x)|

=

∣∣∣∣∣∣

K−1∑

j=−M

Qjf ·Bj,M,u(x) −K−1∑

j=−M

Qjp ·Bj,M,u(x)

∣∣∣∣∣∣

≤K−1∑

j=−M

|Qj(f − p)|Bj,M,u(x) (because of (14.12))

≤∑

j=−M,...,K−1, uj≤x<uj+M+1

|Qj(f − p)|Bj,M,u(x)

(because of (14.11))

≤ maxj∈−M,...,K−1,uj≤x<uj+M+1

|Qj(f − p)|

(because of (14.12) and (14.22))

≤ ‖Q‖ maxj∈−M,...,K−1,uj≤x<uj+M+1

‖f − p‖∞,[uj ,uj+M+1).

For z ∈ [uj , uj+M+1), uj ≤ x < uj+M+1, it follows, from (14.44), that

|f(z) − p(z)| ≤ ‖f (l+1)‖∞,[minx,z,maxx,z]

(l + 1)!|z − x|l+1

≤ ‖f (l+1)‖∞,[u−M ,uK+M )

(l + 1)!(uj+M+1 − uj)l+1

≤ ‖f (l+1)‖∞,[u−M ,uK+M )

(l + 1)!h(x)l+1,

which implies the assertion.

This theorem yields good approximation results provided one has a quasiinterpolant of high order. Clearly, for Su,M there cannot exist a quasi in-terpolant of order greater than M (because Qp is a piecewise polynomial ofdegree M or less which cannot be equal to a polynomial of degree greaterthan M on [u0, uK)). Next we show that there always exist bounded quasiinterpolants Q : C(R) → Su,M of order M .

Theorem 14.4. Let M ∈ N0 and u−M ≤ · · · ≤ u0 < · · · < uK ≤ · · · ≤uK+M . Then there exists a bounded quasi interpolant Q : C(R) → Su,M oforder M such that ‖Q‖ is bounded above by a constant depending only onM (and not on the knot sequence!).

Page 293: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

276 14. Univariate Least Squares Spline Estimates

Proof. For j ∈ −M, . . . ,K − 1 let tj,0, . . . , tj,M ∈ [uj , uj+M+1) andQj,0, . . . , Qj,M ∈ R. Define Qj : C(R) → R by

Qjf =M∑

k=0

Qj,k · f(tj,k) (f ∈ C(R)).

Then, clearly, Q : C(R) → Su,M , defined by Qf =∑K−1

j=−M Qjf ·Bj,M,u, isa quasi interpolant. We will show that it is possible to choose the tj,k, Qj,k

such that Q is a bounded quasi interpolant of order M with ‖Q‖ boundedby a constant depending only on M .

First observe that Q is of order M if and only if Q reproduces for eacht ∈ R the polynomial (x− t)M , i.e., if

(x− t)M =K−1∑

j=−M

Qj

((· − t)M

)Bj,M,u(x) (x ∈ [u0, uK), t ∈ R).

By Lemma 14.3 one has

(x− t)M =K−1∑

j=−M

ψj,M (t)Bj,M,u(x) (x ∈ [u0, uK), t ∈ R),

and thus it follows from Theorem 14.1 that Q is of order M if and only ifM∑

k=0

Qj,k(tj,k − t)M = (uj+1 − t) · . . . · (uj+M − t) (j ∈ −M, . . . ,K − 1)

(14.45)for every t ∈ R. The right-hand side of (14.45) is a polynomial of degreeM or less (with respect to t). If the tj,0, . . . , tj,M are distinct, then

(tj,k − t)M : k = 0, . . . ,Mis a basis of the space of all polynomials of degree M or less and thereexists uniquely determined Qj,0, . . . , Qj,M ∈ R such that (14.45) holds forevery t ∈ R. It remains to show that the tj,k can be chosen such that theresulting quasi interpolant is bounded by some constant depending only onM .

Both sides of (14.45) are polynomials of degree M or less. Therefore(14.45) holds for every t ∈ R if and only if it holds for M + 1 distinctvalues of t and hence is equivalent to

M∑

k=0

Qj,k(tj,k − tj,l)M = (uj+1 − tj,l) · . . . · (uj+M − tj,l) (14.46)

(j ∈ −M, . . . ,K − 1, l ∈ 0, . . . ,M), provided tj,0, . . . , tj,M are pairwisedistinct (j ∈ −M, . . . ,K − 1).

Let us now choose the tj,k such that the resulting quasi interpolantsatisfies the assertion of the theorem. For j ∈ −M, . . . ,K − 1 choose

Page 294: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

14.4. Rate of Convergence 277

νj ∈ j, . . . , j+M−1 such that [uνj, uνj+1) is the largest interval [ui, ui+1)

contained in [uj , uj+M ) and set

tj,k = uνj +k

M(uνj+1 − uνj ) (k ∈ 0, . . . ,M).

Then

tj,k − tj,l = (k − l) · uνj+1 − uνj

M

anduj+i − tj,luνj+1−uνj

M

=uj+i − uνj

uνj+1−uνj

M

− l.

Thus, in this case, (14.46) is equivalent to

M∑

k=0

(k − l)MQj,k =

(uj+1 − uνj

uνj+1−uνj

M

− l

)

· . . . ·(uj+M − uνj

uνj+1−uνj

M

− l

)

(14.47)

(l ∈ 0, . . . ,M) for each j ∈ −M, . . . ,K − 1.Fix j ∈ −M, . . . ,K − 1. Then (14.47) is an (M + 1) × (M + 1) linear

equation system for the Qj,0, . . . , Qj,M . It has a unique solution because(14.45) has a unique solution. Furthermore, its left-hand side depends onlyon M and by the choice of [uνj , uνj+1) the distance between uj+k and uνj isless than or equal to M times uνj+1 − uνj

(k ∈ 1, . . . ,M), which impliesthat the components of its right-hand side are bounded in absolute valueby (M2 +M)M . Therefore there exists some constant c depending only onM such that the solution of (14.47) satisfies

|Qj,k| ≤ c (j ∈ −M, . . . ,K − 1, k ∈ 0, . . . ,M).

Thus

|Qjf | =

∣∣∣∣∣

M∑

k=0

Qj,k · f(tj,k)

∣∣∣∣∣≤ c · (M + 1) · ‖f‖∞,[uj ,uj+M+1)

which in turn implies ‖Q‖ ≤ c · (M + 1).

14.4 Rate of Convergence

In this section we derive rate of convergence results for least squaresestimates using spline spaces with equidistant knots. Throughout this sec-tion we will assume supp(X) bounded and, for simplicity, we also assumesupp(X) ⊆ [0, 1].

Let M ∈ N0 and let Knn∈N be a sequence of natural numbers. Set

uk =k

Kn(k = −M, . . . ,Kn +M).

Page 295: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

278 14. Univariate Least Squares Spline Estimates

−1 −0.5 0.5 1

0.5

Figure 14.7. Least squares spline estimate with Kn = 2, uk = −1 + 2k/Kn, andM = 1: L2 error = 0.013607.

Let SKn,M be the spline space with knot sequence ukKn+Mk=−M and degree

M . Define the estimate mn by

mn = arg minf∈SKn,M

1n

n∑

i=1

|f(Xi) − Yi|2

and set

mn(x) = TL(mn(x)) (x ∈ R)

where L ∈ R+ is some known bound on the supremum norm of theregression function.

Figures 14.7–14.9 show application of this estimate for various choices ofKn to our simulated data set of Chapter 1. A comparison of the estimate inFigure 14.9 with the piecewise polynomial partitioning estimate in Figure10.3 shows that the least squares estimate is indeed much less oscillatorythan the piecewise polynomial partitioning estimate.

Theorem 14.5. Let the estimate mn be defined as above. Let p ∈1, . . . ,M + 1 and assume that

σ2 = supx∈R

VarY |X = x < ∞,

‖m‖∞ = supx∈R

|m(x)| ≤ L,

supp(X) ⊆ [0, 1],

and that m is p times continuously differentiable. Then there exist constantsc2, c3 ∈ R+ which depend only on M such that

Page 296: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

14.4. Rate of Convergence 279

−1 −0.5 0.5 1

0.5

Figure 14.8. Least squares spline estimate with Kn = 8, uk = −1 + 2k/Kn, andM = 1: L2 error = 0.004024.

−1 −0.5 0.5 1

0.5

Figure 14.9. Least squares spline estimate with Kn = 25, uk = −1 + 2k/Kn, andM = 1: L2 error = 0.013431.

E∫

|mn(x) −m(x)|2µ(dx)

≤ c1 maxσ2, L2 · (log(n) + 1) · (Kn +M + 1)n

+ c2‖m(p)‖2∞

(M + 1Kn

)2p

.

Page 297: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

280 14. Univariate Least Squares Spline Estimates

In particular, for Kn =⌊(

nlog(n)

) 12p+1

⌋:

E∫

|mn(x) −m(x)|2µ(dx) ≤ c3

(log(n)n

) 2p2p+1

for some constant c3 ∈ R+ which depends on M,σ2, L2, and ‖m(p)‖∞.

Proof. By Theorem 11.3 we have

E∫

|mn(x) −m(x)|2µ(dx)

≤ c · maxσ2, L2 (log(n) + 1) · (Kn +M + 1)n

+8 inff∈SKn,M

∫|f(x) −m(x)|2µ(dx).

An application of Theorems 14.3 and 14.4 yields

8 inff∈SKn,M

∫|f(x) −m(x)|2µ(dx) ≤ 8 inf

f∈SKn,M

supx∈[0,1]

|f(x) −m(x)|2

≤ c · ‖m(p)‖2∞

(M + 1Kn

)2p

,

which implies the assertion.

The definition of the estimate in Theorem 14.5 depends on the smooth-ness p of the regression function. This can be avoided by using complexityregularization.

Set Pn = 1, . . . , n and for K ∈ Pn define the estimate mn,K by

mn,K = arg minf∈SK,M

1n

n∑

i=1

|f(Xi) − Yi|2

and

mn,K(x) = TL(mn,K(x)) (x ∈ R).

To penalize the complexity of SK,M we use the penalty

penn(K) =L4 log2(n) ·K

n.

Set

K∗ = arg minK∈Pn

penn(K) +1n

n∑

i=1

|mn,K(Xi) − Yi|2

and

mn(x) = mn,K∗(x) (x ∈ R).

Page 298: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

14.5. Bibliographic Notes 281

We have the following result:

Theorem 14.6. Let L > 0, M ∈ N0, and let mn be defined as above. Thenone has, for any p ∈ 1, . . . ,M + 1,

E∫

|mn(x) −m(x)|2µ(dx) = O

(

log2(n)n

) 2p2p+1

for every distribution of (X,Y ) with supp(X) ⊆ [0, 1], |Y | ≤ L a.s., and mp times continuously differentiable.

Proof. See Problem 14.2.

Observe that the estimate in Theorem 14.6 doesn’t depend on p anymore.

14.5 Bibliographic Notes

Theorem 14.2 is due to Kohler (1999). The proofs of the deterministic prop-erties of splines which we present in this and in the next chapter are basedon lectures given by K. Hollig at the University of Stuttgart, and partsof these lectures have been published in Hollig (1998). Different kinds ofintroductions to polynomial splines and references concerning the deter-ministic results proven in this chapter can be found in de Boor (1978) andSchumaker (1981).

Many references concerning the application of splines in statistics can befound in the survey articles by Wegman and Wright (1983) and Agarwal(1989). Consistency of least squares splines in supremum norm was studiedin Zhu (1992). In fixed design regression, Agarwal and Studden (1980)investigated the question of how to place the (nonequidistant) knots of aleast squares spline estimate in order to minimize the expected error of theestimate.

Problems and Exercises

Problem 14.1. Prove Theorem 14.2 (b).Hint: Proceed as in the proof of the first part of Theorem 14.2, but apply part

(b) of Theorem 10.2.

Problem 14.2. Prove Theorem 14.6.Hint: Apply Theorem 12.1 and use the approximation result derived in the

proof of Theorem 14.5.

Page 299: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

282 14. Univariate Least Squares Spline Estimates

Problem 14.3. Let M ∈ N0, K ∈ N , and u−M ≤ · · · ≤ u0 < · · · < uK ≤ · · · ≤uK+M . For x ∈ R, set

h(x) = maxj:uj≤x<uj+M+1

(uj+M+1 − uj).

Let p = q + r for some q ∈ 0, . . . , M, r ∈ (0, 1]. Let Q be a bounded quasiinterpolant of order q. Show that for every (p, C)-smooth function f : R → R,

|(Qf)(x) − f(x)| ≤ ‖Q‖ · C

q!· h(x)p (x ∈ [u0, uK)).

Hint: Proceed as in the proof of Theorem 14.3, but use (11.7) instead of(14.44).

Problem 14.4. Use Problem 14.3 to derive a version of Theorem 14.5, wherethe assumption that m is p times continuously differentiable is replaced by theassumption m (p, C)-smooth.

Problem 14.5. Show that the estimate in Theorem 14.6 satisfies

E∫

|mn(x) − m(x)|2µ(dx) ≤ c · C2/(2p+1)

(log2(n)

n

)2p/(2p+1)

for every p ≤ M + 1 and every distribution of (X, Y ) with supp(X) ⊆ [0, 1],|Y | ≤ L a.s., and m (p, C)-smooth.

Hint: Apply Problem 14.3.

Page 300: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

15Multivariate Least Squares SplineEstimates

15.1 Introduction to Tensor Product Splines

An easy way to construct multivariate spline spaces is to use tensor prod-ucts of univariate spline spaces. The tensor product of univariate functionsf1, . . . , fd : R → R is the function f : Rd → R defined by

f(x1, . . . , xd) = f1(x1) · . . . · fd(xd) (x1, . . . , xd ∈ R).

We will define tensor product spline spaces as a linear span of all tensorproducts of univariate spline functions belonging to d fixed univariate splinespaces.

In order to avoid complicated notation we will explain this in the sequelonly in the case d = 2. The general case can be handled in an analogousway.

Definition 15.1. Let Mx,Mz ∈ N0, u−Mx ≤ · · · ≤ u0 < · · · < uK ≤· · · ≤ uK+Mx

, and v−Mz≤ · · · ≤ v0 < · · · < vL ≤ · · · ≤ vL+Mz

. Then thetensor product spline space Su,Mx,v,Mz

of degree (Mx,Mz) and withknot sequences u = uii=−Mx,...,K+Mx and v = vjj=−Mz,...,L+Mz isdefined by

Su,Mx,v,Mz = span

f : R2 → R : f(x, z) = g(x)h(z) (x, z ∈ R)

for some g ∈ Su,Mx , h ∈ Sv,Mz

.

Page 301: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

284 15. Multivariate Least Squares Spline Estimates

z

y

x

Mx = Mz = 1

v0 v1 v2 v3

u0

u1

u2

u3

Figure 15.1. A function of Su,1,v,1.

Example 15.1. Let Mx = Mz = 1 and choose the knot sequences u andv as in Figure 15.1. The knot sequences induce a partition of R2. Thefunctions in Su,1,v,1 are continuous and are piecewise bilinear with respectto this partition.

It follows from the definitions of Su,Mxand Sv,Mz that on each set

[ui, ui+1) × [vj , vj+1) the functions in Su,Mx,v,Mz are equal to a finitesum of tensor products of univariate polynomials of degree less than orequal to Mx and univariate polynomials of degree less than or equal to Mz

(−Mx ≤ i ≤ K + Mx − 1,−Mz ≤ j ≤ L + Mz − 1). Furthermore, theyare zero outside [u−Mx

, uK+Mx) × [v−Mz

, vL+Mz) and satisfy some global

smoothness conditions depending on (Mx,Mz) and the knot sequences.For instance, if u−Mx < · · · < uK+Mx and v−Mz < · · · < vL+Mz then thepartial derivatives of the functions in Su,Mx,v,Mz with respect to x (z) arecontinuous up to order Mx − 1 (Mz − 1).

A basis of Su,Mx,v,Mz can be constructed by the use of tensor productsof univariate B-splines.

Page 302: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

15.1. Introduction to Tensor Product Splines 285

z

y

x

Mx = Mz = 1

v0 v1 v2 v3

u0

u1

u2

u3

1

Figure 15.2. Example of a tensor product B-spline of degree (1, 1).

Lemma 15.1. For −Mx ≤ i < K and −Mz ≤ j < L define the tensorproduct B-spline B(i,j),Mx,u,Mz,v by

B(i,j),Mx,u,Mz,v(x, z) = Bi,Mx,u(x) ·Bj,Mz,v(z) (x, z ∈ R). (15.1)

ThenB(i,j),Mx,u,Mz,v : −Mx ≤ i < K,−Mz ≤ j < L

is a basis of Su,Mx,v,Mz.

We leave the easy proof to the reader (cf. Problem 15.1).The next lemma summarizes some useful properties of tensor product

B-splines.

Lemma 15.2. Let Mx,Mz ∈ N0, u−Mx≤ · · · ≤ u0 < · · · < uK ≤ · · · ≤

uK+Mx , and v−Mz≤ · · · ≤ v0 < · · · < vL ≤ · · · ≤ vL+Mz

.(a)

B(i,j),Mx,u,Mz,v(x, z) = 0 for (x, z) ∈ [ui, ui+Mx+1) × [vj , vj+Mz+1)(15.2)

Page 303: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

286 15. Multivariate Least Squares Spline Estimates

for −Mx ≤ i ≤ K − 1 and −Mz ≤ j ≤ L− 1.(b)

B(i,j),Mx,u,Mz,v(x, z) ≥ 0 (x, z ∈ R) (15.3)

for −Mx ≤ i ≤ K − 1 and −Mz ≤ j ≤ L− 1.(c)

K−1∑

i=−Mx

L−1∑

j=−Mz

B(i,j),Mx,u,Mz,v(x, z) = 1 (15.4)

for (x, z) ∈ [u0, uK) × [v0, vL).

Proof. The assertion follows directly from (15.1), (14.11), (14.12), and(14.22).

If a bivariate spline function is represented as a linear combination of ten-sor product B-splines, then it can be evaluated as fast as in the univariatecase. Indeed, assume that we are given coefficients

ai,j : −Mx ≤ i ≤ K − 1,−Mz ≤ j ≤ L− 1of a linear combination

f =K−1∑

i=−Mx

L−1∑

j=−Mz

ai,jB(i,j),Mx,u,Mz,v

of tensor product B-splines and that we want to evaluate this function atsome given point (x, z) ∈ [u0, uK) × [v0, vL). Choose i0, j0 such that

(x, z) ∈ [ui0 , ui0+1) × [vj0 , vj0+1).

Then

f(x, z)(15.2)=

i0∑

i=i0−Mx

j0∑

j=j0−Mz

ai,jB(i,j),Mx,u,Mz,v

(15.1)=

i0∑

i=i0−Mx

j0∑

j=j0−Mz

ai,jBj,Mz,v(z)

·Bi,Mx,u(x).

Now use Mx + 1 times the evaluation algorithm for linear combinations ofunivariate B-splines described in Section 14.1 to compute

bi =j0∑

j=j0−Mz

ai,jBj,Mz,v(z) (i = i0 −Mx, . . . , i0).

Then use it once more to compute

f(x, z) =i0∑

i=i0−Mx

bi ·Bi,Mx,u(x).

Page 304: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

15.1. Introduction to Tensor Product Splines 287

As in the univariate case the number of operations needed to do thisdepends only on Mx and Mz (and not on K or on L).

To derive results concerning the approximation of smooth functions bytensor product splines we will again use so–called quasi interpolants. Thefollowing definition extends Definition 14.4 to the case of tensor productsplines.

Definition 15.2. (a) For i ∈ −Mx, . . . ,K−1 and j ∈ −Mz, . . . , L−1let Q(i,j) : C(R2) → R be a linear mapping such that Q(i,j)(f) depends onlyon the values of f in [ui, ui+Mx+1) × [vj , vj+Mz+1). Then the functionalQ : C(R2) → Su,Mx,v,Mz , defined by

Qf =K−1∑

i=−Mx

L−1∑

j=−Mz

Q(i,j)(f) ·B(i,j),Mx,u,Mz,v,

is called a quasi interpolant.(b) A quasi interpolant Q : C(R2) → Su,Mx,v,Mz is called bounded if thereexists some constant c ∈ R such that

|Q(i,j)(f)| ≤ c · ‖f‖∞,[ui,ui+Mx+1)×[vj ,vj+Mz+1) (15.5)

for i ∈ −Mx, . . . ,K − 1, j ∈ −Mz, . . . , L− 1, and f ∈ C(R2). Here

‖f‖∞,[ui,ui+Mx+1)×[vj ,vj+Mz+1) = sup(x,z)∈[ui,ui+Mx+1)×[vj ,vj+Mz+1)

|f(x, z)|.

The smallest constant c such that (15.5) holds is denoted by ‖Q‖.(c) A quasi interpolant Q has order (lx, lz) if

(Qp)(x, z) = p(x, z) ((x, z) ∈ [u0, uK) × [v0, vL))

for each polynomial p : R2 → R of degree less than or equal to (lx, lz).

Here and in the following a polynomial p : R2 → R of degree less than orequal to (lx, lz) is a finite sum of tensor products of polynomials of degreeless than or equal to lx and of polynomials of degree less than or equal tolz.

The next theorem extends Theorem 14.3 to the case of tensor productsplines.

Theorem 15.1. Let Mx,Mz ∈ N0, u−Mx ≤ · · · ≤ u0 < · · · < uK ≤ · · · ≤uK+Mx

, and v−Mz≤ · · · ≤ v0 < · · · < vL ≤ · · · ≤ vL+Mz

. For x, z ∈ R set

hu(x) = maxi:ui≤x<ui+Mx+1

(ui+Mx+1 − ui)

and

hv(z) = maxj:vj≤x<vj+Mz+1

(vj+Mz+1 − vj).

Let Q be a bounded quasi interpolant of order (l, l). Let f : R2 → R be afunction for which the partial derivatives ∂l1+l2f

∂xl1∂zl2of order l1 + l2 ≤ l + 1

Page 305: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

288 15. Multivariate Least Squares Spline Estimates

are continuous on [u−Mx, uK+Mx

) × [v−Mz, vL+Mz

). Then there exists aconstant c(f) ∈ R+ which depends only on l and on the partial derivativesof f of order less than or equal to l such that

|(Qf)(x, z) − f(x, z)| ≤ ‖Q‖ · c(f) · (hu(x)l+1 + hv(z)l+1)

for all (x, z) ∈ [u0, uK) × [v0, vL).

Proof. Fix (x, z) ∈ [ui0, ui0+1) × [vj0, vj0+1). Let p be the Taylor polynomial

p(u, v) = ∑_{0≤l1+l2≤l} (∂^{l1+l2}f(x, z) / (∂x^{l1} ∂z^{l2})) · (u − x)^{l1} (v − z)^{l2} / (l1! · l2!).

Then the following formula holds for the remainder of the Taylor polynomial (cf. Problem 15.2):

f(u, v) − p(u, v) = ∑_{l1+l2=l+1} ∫_0^1 (1 − t)^l (∂^{l1+l2}f(x + t·(u − x), z + t·(v − z)) / (∂x^{l1} ∂z^{l2})) × ((l + 1)/(l1! · l2!)) · (u − x)^{l1} (v − z)^{l2} dt,    (15.6)

which implies

|f(u, v) − p(u, v)| ≤ c · (|u − x|^{l+1} + |v − z|^{l+1})

for some constant c ∈ R+ which depends only on l and on the partial derivatives of f of order less than or equal to l. From this, p(x, z) = f(x, z), and Qp = p, we conclude, as in the proof of Theorem 14.3,

|(Qf)(x, z) − f(x, z)|
= |(Qf)(x, z) − p(x, z)|
= |(Qf)(x, z) − (Qp)(x, z)|
≤ ∑_{i=−Mx}^{K−1} ∑_{j=−Mz}^{L−1} |Q(i,j)(f − p)| · B(i,j),Mx,u,Mz,v(x, z)    (because of (15.3))
= ∑_{i=i0−Mx}^{i0} ∑_{j=j0−Mz}^{j0} |Q(i,j)(f − p)| · B(i,j),Mx,u,Mz,v(x, z)    (because of (15.2))
≤ max_{i∈{i0−Mx,...,i0}, j∈{j0−Mz,...,j0}} |Q(i,j)(f − p)|    (because of (15.3) and (15.4))
≤ max_{i∈{i0−Mx,...,i0}, j∈{j0−Mz,...,j0}} ‖Q‖ · ‖f − p‖∞,[ui,ui+Mx+1)×[vj,vj+Mz+1)
≤ ‖Q‖ · ‖f − p‖∞,[ui0−hu(x),ui0+hu(x))×[vj0−hv(z),vj0+hv(z))    (by definition of hu(x) and hv(z))
≤ ‖Q‖ · c · ((2hu(x))^{l+1} + (2hv(z))^{l+1}).

This implies the assertion. □

Next we show that there always exists a bounded quasi interpolant of order (Mx, Mz).

Theorem 15.2. Let Mx, Mz ∈ N0, u−Mx ≤ · · · ≤ u0 < · · · < uK ≤ · · · ≤ uK+Mx, and v−Mz ≤ · · · ≤ v0 < · · · < vL ≤ · · · ≤ vL+Mz. Then there exists a bounded quasi interpolant Q : C(R²) → Su,Mx,v,Mz of order (Mx, Mz) such that ‖Q‖ is bounded above by a constant depending only on (Mx, Mz) (and not on the knot sequences).

Proof. Let tj,k, Qj,k be defined as in the proof of Theorem 14.4 for the knot sequence u and let sj,k, Rj,k be defined in the same way for the knot sequence v. Define Q(i,j) : C(R²) → R by

Q(i,j)(f) = ∑_{k=0}^{Mx} ∑_{l=0}^{Mz} Qi,k · Rj,l · f(ti,k, sj,l).

Then

Qf := ∑_{i=−Mx}^{K−1} ∑_{j=−Mz}^{L−1} Q(i,j)(f) · B(i,j),Mx,u,Mz,v

defines a bounded quasi interpolant whose norm is bounded above by

max_{i,j} ∑_{k=0}^{Mx} ∑_{l=0}^{Mz} |Qi,k| · |Rj,l|.    (15.7)

It follows from the proof of Theorem 14.4 that the term in (15.7) is bounded above by some constant depending only on (Mx, Mz). Because of

Q(i,j)(p(·) · q(·)) = (∑_{k=0}^{Mx} Qi,k · p(ti,k)) · (∑_{l=0}^{Mz} Rj,l · q(sj,l))

the quasi interpolant Q has order (Mx, Mz). □

Figure 15.3. Local refinement of the partition corresponding to Su,Mx,v,Mz is not possible.

15.2 Consistency

In the next two sections we will consider estimates defined by applying the principle of least squares to tensor product spline spaces.

According to the previous section one can define a space Su,Mx,v,Mz of tensor product spline functions f : R² → R by choosing degrees Mx, Mz ∈ N0 and knot sequences u = (ui)−Mx≤i≤K+Mx and v = (vj)−Mz≤j≤L+Mz. The functions in Su,Mx,v,Mz are piecewise polynomials of degree less than or equal to (Mx, Mz) with respect to the partition of R² consisting of rectangles [ui, ui+1) × [vj, vj+1) (−Mx ≤ i < K + Mx, −Mz ≤ j < L + Mz).

It follows from Theorem 15.1 that one should choose the size of a rectangle [ui, ui+1) × [vj, vj+1) small in order to obtain good approximations of smooth functions by functions of Su,Mx,v,Mz on that rectangle. The sizes of the rectangles [ui, ui+1) × [vj, vj+1) are determined by the knot sequences u and v. Unfortunately, one cannot control them locally, i.e., one cannot choose them small in one subset of R² without influencing them in the rest of R². To see this, assume that you want to make the size of one fixed rectangle [ui0, ui0+1) × [vj0, vj0+1) smaller. You can do this by introducing a new knot between ui0 and ui0+1 (or by introducing a new knot between vj0 and vj0+1). But this also splits all the rectangles [ui0, ui0+1) × [vj, vj+1) (−Mz ≤ j < L + Mz) (or all rectangles [ui, ui+1) × [vj0, vj0+1) (−Mx ≤ i < K + Mx)), see Figure 15.3.

Of course it is possible to control the size of the rectangles [ui, ui+1) × [vj, vj+1) globally, i.e., to choose them all together too small or too large. For this it is sufficient to use equidistant knot sequences and, therefore, we will consider, in the next two sections, only tensor product spline spaces with equidistant knot sequences.

In the following we will define a space of tensor product spline functions f : R^d → R. This will depend on parameters Ln, Rn ∈ R, K ∈ N, and M ∈ N0.

Let B¹_{i,M,K} be the univariate B-spline with degree M, knot sequence {Ln + i · (Rn − Ln)/K}_{i∈Z}, and support

[Ln + i · (Rn − Ln)/K, Ln + (i + M + 1) · (Rn − Ln)/K].

For i1, . . . , id ∈ Z define the tensor product B-spline B^d_{(i1,...,id),M,K} : R^d → R by

B^d_{(i1,...,id),M,K}(x1, . . . , xd) = B¹_{i1,M,K}(x1) · . . . · B¹_{id,M,K}(xd)

(x1, . . . , xd ∈ R). Then the tensor product spline space SM,K([Ln, Rn]^d) is defined as

SM,K([Ln, Rn]^d) = span{ B^d_{(i1,...,id),M,K} : supp(B^d_{(i1,...,id),M,K}) ∩ [Ln, Rn]^d ≠ ∅ }.
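The tensor product construction above can be made concrete with a short numerical sketch. The following Python snippet is illustrative only (the helper names cox_de_boor and tensor_bspline are not from the book); it evaluates a single tensor product B-spline on the equidistant knot sequence Ln + i · (Rn − Ln)/K via the Cox–de Boor recursion.

```python
import numpy as np

def cox_de_boor(j, M, t, x):
    # value at x of the B-spline of degree M with knots t[j], ..., t[j+M+1]
    if M == 0:
        return 1.0 if t[j] <= x < t[j + 1] else 0.0
    left = right = 0.0
    if t[j + M] > t[j]:
        left = (x - t[j]) / (t[j + M] - t[j]) * cox_de_boor(j, M - 1, t, x)
    if t[j + M + 1] > t[j + 1]:
        right = (t[j + M + 1] - x) / (t[j + M + 1] - t[j + 1]) * cox_de_boor(j + 1, M - 1, t, x)
    return left + right

def tensor_bspline(idx, M, K, L_n, R_n, x):
    # B^d_{(i_1,...,i_d),M,K}(x) as the product of univariate B-splines
    h = (R_n - L_n) / K
    val = 1.0
    for i_j, x_j in zip(idx, x):
        t = L_n + h * np.arange(i_j, i_j + M + 2)   # local knots of B^1_{i_j,M,K}
        val *= cox_de_boor(0, M, t, x_j)
    return val

# example: a degree-1 tensor product B-spline on [0,1]^2 with K = 4 cells
print(tensor_bspline((1, 2), M=1, K=4, L_n=0.0, R_n=1.0, x=(0.4, 0.6)))  # 0.24
```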

The next theorem gives conditions on the parameters of SM,K([Ln, Rn]^d) which imply that the resulting least squares estimates are strongly universally consistent. There only M and K depend on the data.

Theorem 15.3. For n ∈ N let Ln, Rn ∈ R, Mmax(n) ∈ N0, Kmin(n), Kmax(n) ∈ N, and βn ∈ R+, such that Ln ≤ Rn, Kmin(n) ≤ Kmax(n),

βn → ∞ (n → ∞),    (15.8)

Ln → −∞ (n → ∞), Rn → ∞ (n → ∞),    (15.9)

(Mmax(n) + 1) · (Rn − Ln)/Kmin(n) → 0 (n → ∞),    (15.10)

(Mmax(n) + 1)^d (Kmax(n) + Mmax(n))^d β_n^4 log(n) / n → 0 (n → ∞),    (15.11)

and

β_n^4 / n^{1−δ} → 0 (n → ∞)    (15.12)

for some δ > 0.

Depending on Dn choose M ∈ N0 and K ∈ N such that M ≤ Mmax(n) and Kmin(n) ≤ K ≤ Kmax(n), and define the estimate m̃n by

m̃n(·) = arg min_{f∈SM,K([Ln,Rn]^d)} (1/n) ∑_{i=1}^{n} |f(Xi) − Yi|²    (15.13)

and

mn = Tβn m̃n.    (15.14)

Then mn is strongly universally consistent.

Proof. Because of Theorem 10.2 it suffices to show

inf_{f∈SM,K([Ln,Rn]^d), ‖f‖∞≤βn} ∫ |f(x) − m(x)|² µ(dx) → 0 (n → ∞) a.s.    (15.15)

and

sup_{f∈Tβn SM,K([Ln,Rn]^d)} | (1/n) ∑_{i=1}^{n} |f(Xi) − Yi,L|² − E|f(X) − YL|² | → 0    (15.16)

(n → ∞) a.s. for every L > 0.

Proof of (15.15). C0^∞(R^d) is dense in L2(µ) (cf. Corollary A.1), hence it suffices to prove (15.15) for functions m ∈ C0^∞(R^d). Because of (15.8) one may further assume that ‖m‖∞ ≤ βn.

Set I = {(i1, . . . , id) ∈ Z^d : supp(B^d_{(i1,...,id),M,K}) ∩ [Ln, Rn]^d ≠ ∅}. For i ∈ I choose ui ∈ supp(B^d_{i,M,K}) and set

Qm = ∑_{i∈I} m(ui) · B^d_{i,M,K}.

Then ‖Qm‖∞ ≤ ‖m‖∞, which implies

inf_{f∈SM,K([Ln,Rn]^d), ‖f‖∞≤βn} ∫ |f(x) − m(x)|² µ(dx)
≤ ∫ |(Qm)(x) − m(x)|² µ(dx)
≤ 4‖m‖²∞ µ(R^d \ [Ln, Rn]^d) + ∫_{[Ln,Rn]^d} |(Qm)(x) − m(x)|² µ(dx)
≤ 4‖m‖²∞ µ(R^d \ [Ln, Rn]^d) + sup_{x∈[Ln,Rn]^d} |(Qm)(x) − m(x)|².

Formula (15.9) implies

4‖m‖²∞ µ(R^d \ [Ln, Rn]^d) → 0 (n → ∞).

Furthermore, for x ∈ [Ln, Rn]^d, one gets

|(Qm)(x) − m(x)|
≤ ∑_{i∈I} |m(ui) − m(x)| · B^d_{i,M,K}(x)
= ∑_{i∈I, x∈supp(B^d_{i,M,K})} |m(ui) − m(x)| · B^d_{i,M,K}(x)
≤ sup_{u,v∈R^d, ‖u−v‖∞≤(M+1)·(Rn−Ln)/K} |m(u) − m(v)| · ∑_{i∈I, x∈supp(B^d_{i,M,K})} B^d_{i,M,K}(x)
≤ sup_{u,v∈R^d, ‖u−v‖∞≤(Mmax(n)+1)·(Rn−Ln)/Kmin(n)} |m(u) − m(v)|,

from which we conclude

sup_{x∈[Ln,Rn]^d} |(Qm)(x) − m(x)|² ≤ sup_{u,v∈R^d, ‖u−v‖∞≤(Mmax(n)+1)·(Rn−Ln)/Kmin(n)} |m(u) − m(v)|² → 0 (n → ∞)

because of (15.10) and m ∈ C0^∞(R^d).

Proof of (15.16). Let Gn be the set of all polynomials p : R^d → R of degree Mmax(n) or less in each coordinate. Each function in SM,K([Ln, Rn]^d) is, on each rectangle

[Ln + i1 (Rn − Ln)/K, Ln + (i1 + 1)(Rn − Ln)/K) × · · · × [Ln + id (Rn − Ln)/K, Ln + (id + 1)(Rn − Ln)/K)

(i1, . . . , id ∈ {−Mmax(n), . . . , K + Mmax(n) − 1}), equal to a polynomial contained in Gn. Denote the family of all partitions consisting of such rectangles (and one additional set) for some K ≤ Kmax(n) by Πn. Then SM,K([Ln, Rn]^d) ⊆ Gn ◦ Πn and it suffices to prove (15.16) with the data-dependent set SM,K([Ln, Rn]^d) replaced by the data-independent set Gn ◦ Πn.

Gn is a linear vector space of functions of dimension (Mmax(n) + 1)^d, thus V_{Gn^+} ≤ (Mmax(n) + 1)^d + 1. Furthermore, Πn is a family of cubic partitions consisting of (Kmax(n) + 2Mmax(n))^d + 1 or less sets. As in the proof of (13.23) one gets

Δn(Πn) ≤ n · (Kmax(n) + 2Mmax(n))^d.

From this, (15.11), and (15.12), one obtains the assertion as in the proof of Theorem 13.1. □

15.3 Rate of Convergence

In this section we derive rate of convergence results for least squares estimates using tensor product spline spaces with equidistant knots. Throughout this section we will assume supp(X) bounded, for notational simplicity even supp(X) ⊆ [0, 1]^d.

Let M ∈ N0 and let {Kn}_{n∈N} be a sequence of natural numbers. In the previous section we have introduced spline spaces SM,Kn([0, 1]^d). With this notation define the estimate m̃n by

m̃n = arg min_{f∈SM,Kn([0,1]^d)} (1/n) ∑_{i=1}^{n} |f(Xi) − Yi|²

and set

mn(x) = TL(m̃n(x))    (x ∈ R^d).
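As a rough illustration of this estimate (not the book's construction; the helper names hat_basis and ls_spline_estimate, the choice d = 1, M = 1, and the toy data are all assumptions made for the example), the following Python sketch fits a least squares linear spline with equidistant knots on [0, 1] and then truncates it at ±L:

```python
import numpy as np

def hat_basis(x, K):
    # design matrix of the K+1 linear B-splines (hat functions) with knots j/K on [0,1]
    x = np.asarray(x, dtype=float)
    B = np.zeros((x.size, K + 1))
    for j in range(K + 1):
        B[:, j] = np.clip(1.0 - np.abs(K * x - j), 0.0, None)
    return B

def ls_spline_estimate(X, Y, K, L):
    # truncated least squares spline estimate m_n = T_L(m~_n)
    coef, *_ = np.linalg.lstsq(hat_basis(X, K), Y, rcond=None)
    return lambda x: np.clip(hat_basis(x, K) @ coef, -L, L)

# toy data with regression function m(x) = sin(2*pi*x)
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, 200)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.normal(size=200)
m_n = ls_spline_estimate(X, Y, K=8, L=2.0)
print(m_n(np.array([0.1, 0.5, 0.9])))
```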

Using Theorems 11.3 and 15.1 one can show

Theorem 15.4. Let the estimate mn be defined as above. Let p ∈ {1, . . . , M + 1} and assume that

σ² = sup_{x∈[0,1]^d} Var{Y | X = x} < ∞,

‖m‖∞ = sup_{x∈[0,1]^d} |m(x)| ≤ L,

supp(X) ⊆ [0, 1]^d,

and that m is p times continuously differentiable. Then there exist constants c1, c2 ∈ R+ which depend only on σ², L, d, M, and the supremum norms of the partial derivatives of m of order less than or equal to p such that

E ∫ |mn(x) − m(x)|² µ(dx) ≤ c1 (log(n) + 1) · (Kn + M + 1)^d / n + c2 ((M + 1)/Kn)^{2p}.

The proof is left to the reader (cf. Problem 15.3).

Corollary 15.1. Let p ∈ N, set M = p − 1,

Kn = ⌊(n / log(n))^{1/(2p+d)}⌋,

and define the estimate mn as above. Then one has

E ∫ |mn(x) − m(x)|² µ(dx) = O((log(n)/n)^{2p/(2p+d)})

for every distribution of (X, Y) with

σ² = sup_{x∈[0,1]^d} Var{Y | X = x} < ∞,

‖m‖∞ = sup_{x∈[0,1]^d} |m(x)| ≤ L,

supp(X) ⊆ [0, 1]^d,

and m p times continuously differentiable.

The definition of the estimate in Corollary 15.1 depends on the smoothness p of the regression function. This can be avoided by using complexity regularization.

Set Pn = {1, . . . , n} and for K ∈ Pn define the estimate m̃n,K by

m̃n,K = arg min_{f∈SM,K([0,1]^d)} (1/n) ∑_{i=1}^{n} |f(Xi) − Yi|²

and

mn,K(x) = TL(m̃n,K(x))    (x ∈ R^d).

To penalize the complexity of SM,K we use the penalty term

penn(K) = log²(n) · K^d / n.

Set

K* = arg min_{K∈Pn} { penn(K) + (1/n) ∑_{i=1}^{n} |mn,K(Xi) − Yi|² }

and

mn(x) = mn,K*(x)    (x ∈ R^d).
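A minimal sketch of this selection rule in Python (illustrative only; the callback fit_spline and the function name penalized_knot_choice are hypothetical, and fit_spline(X, Y, K) is assumed to return the truncated least squares spline estimate for K knot cells per coordinate):

```python
import numpy as np

def penalized_knot_choice(X, Y, fit_spline, K_max):
    # choose K by penalized empirical L2 risk with pen_n(K) = log(n)^2 * K^d / n
    n, d = X.shape
    best = (None, np.inf, None)
    for K in range(1, K_max + 1):
        m_K = fit_spline(X, Y, K)
        emp_risk = np.mean((m_K(X) - Y) ** 2)
        score = emp_risk + (np.log(n) ** 2) * (K ** d) / n
        if score < best[1]:
            best = (K, score, m_K)
    K_star, _, m_n = best
    return K_star, m_n
```

The penalty grows like K^d/n, so it counteracts the decrease of the empirical risk as the partition gets finer.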

Theorem 12.1 implies the following result:

Theorem 15.5. Let L ∈ R+, M ∈ N0, and let mn be defined as above. Then one has, for any p ∈ {1, . . . , M + 1},

E ∫ |mn(x) − m(x)|² µ(dx) = O((log²(n)/n)^{2p/(2p+d)})

for every distribution of (X, Y) with supp(X) ⊆ [0, 1]^d, |Y| ≤ L a.s., and m p times continuously differentiable.


Observe that the estimate in Theorem 15.5 doesn’t depend on p anymore.

15.4 Bibliographic Notes

For references concerning the deterministic properties of splines presented in this chapter we refer to Section 14.5. Estimates which are based on an adaptive choice of multivariate spline spaces, constructed by stepwise insertion and deletion of knots, are investigated in Friedman (1991) and Stone et al. (1997).

Problems and Exercises

Problem 15.1. Prove Lemma 15.1.

Problem 15.2. Prove (15.6).
Hint: Apply the identity

g(1) = ∑_{j=0}^{l} (g^{(j)}(0)/j!) · (1 − 0)^j + (1/l!) ∫_0^1 (1 − t)^l g^{(l+1)}(t) dt

to

g(t) = f(x + t · (u − x), z + t · (v − z)).

Problem 15.3. Prove Theorem 15.4.
Hint: Proceed as in the proof of Theorem 14.5.

Problem 15.4. Prove Theorem 15.5.
Hint: Use Theorem 12.1.

16 Neural Networks Estimates

16.1 Neural Networks

Artificial neural networks (ANN) have been motivated by the desire to model the human brain by a computer. ANN consist of computational units called neurons. They receive a number of input signals and produce an output signal. Originally proposed neuron models produced binary output signals resembling the response of biological neurons which are firing or produce an output when their input activation exceeds some threshold. Neurons in ANN are also interconnected as are neurons in the brain. But, here, the similarities between ANN and biological neural networks end. The number of neurons in the brain and the degree of their connectivity is much higher than in ANN. Furthermore, the learning process in the brain is entirely different from the learning of parameters in ANN. The artificial neuron is a very simplified model of a real neuron which consists of dendrites (connections) and an axon (central cell). The electrophysiological processes taking place in a real neuron can be modeled to some degree by dynamical models which are not incorporated into the static artificial neuron. In spite of these shortcomings, ANN have become very useful in designing learning algorithms for artificial intelligence.

A simple model of a neuron was introduced in the seminal paper of McCulloch and Pitts (1943) who modeled the neuron by a binary thresholding device. The McCulloch–Pitts model shown in Figure 16.1 consists of a linear weighted combination of the inputs followed by the threshold function. A McCulloch–Pitts neuron has been applied in pattern recognition by Rosenblatt (1958) who also proposed a perceptron learning algorithm and proved its convergence, see Rosenblatt (1962).

Figure 16.1. A McCulloch–Pitts neuron.

Figure 16.2. An artificial neuron.

By replacing the threshold function in the McCulloch–Pitts neuron model by a sigmoid function we obtain an artificial neuron illustrated in Figure 16.2. It is formally defined as follows.

An artificial neuron is a real-valued function on R^d given by

g(x) = σ(a^T x + b),

where x ∈ R^d is an input vector, a^T = (a^(1), . . . , a^(d)) ∈ R^d, b ∈ R, are the weights, and σ : R → [0, 1] is referred to as a sigmoid function.

Perceptrons can solve linear pattern recognition problems but fail to solve trivial nonlinear pattern recognition problems such as discriminating two classes represented by the vertices of a square, where points from the same class are the end points of a diagonal (exclusive-or problem – see Minsky and Papert (1969)). The limitations of perceptrons were obviated by adding additional layers of neurons called hidden layers.

Figure 16.3. Neural network with one hidden layer.

A feedforward neural network with a hidden layer of k neurons is a real-valued function on R^d of the form

f(x) = ∑_{i=1}^{k} ci σ(ai^T x + bi) + c0,

where σ : R → [0, 1] is called a sigmoid function and a1, . . . , ak ∈ R^d, b1, . . . , bk, c0, . . . , ck ∈ R are the parameters that specify the network.
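A forward pass of such a network is a few lines of code. The sketch below is illustrative only (the function name one_hidden_layer_net and the default logistic squasher are choices made for the example, not part of the book):

```python
import numpy as np

def one_hidden_layer_net(x, A, b, c, c0, sigma=lambda u: 1.0 / (1.0 + np.exp(-u))):
    # f(x) = sum_i c_i * sigma(a_i^T x + b_i) + c_0, with A of shape (k, d)
    return c @ sigma(A @ x + b) + c0

# tiny example: k = 3 hidden neurons in dimension d = 2
rng = np.random.default_rng(0)
A, b, c = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=3)
print(one_hidden_layer_net(np.array([0.5, -1.0]), A, b, c, c0=0.1))
```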

Definition 16.1. A sigmoid function σ is called a squashing function if it is nondecreasing, lim_{x→−∞} σ(x) = 0 and lim_{x→∞} σ(x) = 1.

Since squashing functions are nondecreasing they have at most a countable number of discontinuities and are therefore measurable. Some examples of squashing functions, illustrated in Figures 16.4 and 16.5, are given below:

• threshold squasher

σ(x) = I_{{x∈[0,∞)}};

• ramp squasher

σ(x) = x · I_{{x∈[0,1]}} + I_{{x∈(1,∞)}};

• cosine squasher

σ(x) = (1 + cos(x + 3π/2)) · (1/2) · I_{{x∈[−π/2,π/2]}} + I_{{x∈(π/2,∞)}};

Figure 16.4. Threshold, ramp, and cosine squashers.

Figure 16.5. Logistic, arctan and Gaussian squashers.

• logistic squasher

σ(x) = 1 / (1 + exp(−x));

• arctan squasher

σ(x) = 1/2 + (1/π) arctan(x);

• Gaussian squasher

σ(x) = (1/√(2π)) ∫_{−∞}^{x} exp(−y²/2) dy.
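These six squashers are easy to state in code; the snippet below is a plain illustration (the dictionary name squashers is an arbitrary choice), with the Gaussian squasher expressed through the standard normal distribution function:

```python
import numpy as np
from scipy.stats import norm

squashers = {
    "threshold": lambda x: (np.asarray(x) >= 0).astype(float),
    "ramp": lambda x: np.clip(x, 0.0, 1.0),
    "cosine": lambda x: np.where(np.asarray(x) < -np.pi / 2, 0.0,
                 np.where(np.asarray(x) > np.pi / 2, 1.0,
                          0.5 * (1.0 + np.cos(np.asarray(x) + 1.5 * np.pi)))),
    "logistic": lambda x: 1.0 / (1.0 + np.exp(-np.asarray(x))),
    "arctan": lambda x: 0.5 + np.arctan(x) / np.pi,
    "gaussian": lambda x: norm.cdf(x),
}

for name, s in squashers.items():
    print(name, s(np.array([-10.0, 0.0, 10.0])))  # each tends to 0 and 1 at +-infinity
```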

16.2 Consistency

We now consider neural network regression function estimators. Given the training sequence Dn = {(X1, Y1), . . . , (Xn, Yn)} of n i.i.d. copies of (X, Y), the parameters of the network are chosen to minimize the empirical L2 risk

(1/n) ∑_{j=1}^{n} |f(Xj) − Yj|².    (16.1)

However, in order to obtain consistency, we again, as in the previous chapters, restrict the range of some of the parameters. With a constraint on the ci's, we minimize (16.1) for the class of neural networks

Fn = { ∑_{i=1}^{kn} ci σ(ai^T x + bi) + c0 : kn ∈ N, ai ∈ R^d, bi ∈ R, ∑_{i=0}^{kn} |ci| ≤ βn }    (16.2)

and obtain mn ∈ Fn satisfying

(1/n) ∑_{j=1}^{n} |mn(Xj) − Yj|² = min_{f∈Fn} (1/n) ∑_{j=1}^{n} |f(Xj) − Yj|².    (16.3)

We assume that the minimum in (16.3) exists although it may not be unique. If the minimum does not exist we can carry out our analysis with functions whose empirical L2 risk is arbitrarily close to the infimum.

Regarding the computational aspect there is no practical algorithm for optimizing (16.3). A practical steepest descent algorithm called backpropagation iteratively converges to a local minimum of the empirical L2 risk. However, there is no guarantee that the global minimum will be reached. The backpropagation algorithm has been introduced by Rumelhart and McClelland (1986).
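For illustration, here is a minimal gradient-descent sketch for the empirical L2 risk of a one-hidden-layer network with logistic squasher. It is only a stand-in for backpropagation (the function name train_shallow_net, the learning rate, and the toy data are assumptions, and the weight constraint ∑|ci| ≤ βn of (16.2) is not enforced):

```python
import numpy as np

def train_shallow_net(X, Y, k, steps=2000, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = rng.normal(size=(k, d)); b = rng.normal(size=k)
    c = 0.1 * rng.normal(size=k); c0 = 0.0
    sig = lambda u: 1.0 / (1.0 + np.exp(-u))
    for _ in range(steps):
        Z = sig(X @ A.T + b)                 # hidden activations, shape (n, k)
        r = 2.0 * (Z @ c + c0 - Y) / n       # derivative of the empirical L2 risk
        gZ = np.outer(r, c) * Z * (1.0 - Z)  # backpropagate through the squasher
        A -= lr * (gZ.T @ X); b -= lr * gZ.sum(axis=0)
        c -= lr * (Z.T @ r);  c0 -= lr * r.sum()
    return lambda x: sig(x @ A.T + b) @ c + c0

# toy usage: one-dimensional X, regression function sin(3x)
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)
m_n = train_shallow_net(X, Y, k=10)
print(m_n(np.array([[0.0], [0.5]])))
```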

The next consistency theorem states that with certain restrictions imposed on βn and kn, the empirical L2 risk minimization provides universally consistent neural network estimates.

Theorem 16.1. Let Fn be the class of neural networks with the squashing function σ defined in (16.2) and let mn be the network that minimizes the empirical L2 risk in Fn. If kn and βn satisfy

kn → ∞, βn → ∞ and kn β_n^4 log(kn β_n^2) / n → 0,

then E ∫ (mn(x) − m(x))² µ(dx) → 0 (n → ∞) for all distributions of (X, Y) with E|Y|² < ∞, that is, mn is weakly universally consistent. If, in addition, there exists δ > 0 such that β_n^4 / n^{1−δ} → 0, then ∫ (mn(x) − m(x))² µ(dx) → 0 almost surely, that is, mn is strongly universally consistent.

In order to handle the approximation error, we need a denseness theorem for feedforward neural networks. We will start with an approximation result in sup norm because of its importance in the neural network field. In the following approximation lemma we will take an approach similar to Hornik, Stinchcombe, and White (1989) and make use of the Stone–Weierstrass theorem.

Lemma 16.1. Let σ be a squashing function and let K be a compact subset of R^d. Then, for every continuous function f : R^d → R and every ε > 0, there exists a neural network

h(x) = ∑_{i=1}^{k} ci σ(ai^T x + bi) + c0    (ai ∈ R^d, bi, ci ∈ R),

such that

sup_{x∈K} |f(x) − h(x)| < ε.

Proof. The proof will be divided into several steps. First we use the Stone–Weierstrass theorem to show that networks of the form

∑_{i=1}^{k} ci cos(ai^T x + bi)

are uniformly dense on compacts in the space of continuous functions. Then we show that cosine functions can be approximated in sup norm by networks with the cosine squasher. In the final step of the proof we show that networks with the cosine squasher can be approximated in sup norm by networks with an arbitrary squashing function σ.

Step 1. Consider a class of cosine networks

F = { ∑_{i=1}^{k} ci cos(ai^T x + bi) : k ∈ N, ai ∈ R^d, bi, ci ∈ R (i = 1, . . . , k) }.

Observe that F is an algebra since it is closed under addition, multiplication, and scalar multiplication. Closedness under multiplication follows by repeated application of the trigonometric identity

cos a cos b = (1/2)(cos(a + b) + cos(a − b)).

We say that the family F separates points on K if for every x, y ∈ K, x ≠ y, there is a function f ∈ F such that f(x) ≠ f(y). The family F vanishes at no point of K if for every x ∈ K there exists f ∈ F such that f(x) ≠ 0. We will now apply the Stone–Weierstrass theorem, see Rudin (1964). It states that every algebra defined in the space CK(R) of real continuous functions on a compact set K, that separates points on K and vanishes at no point of K, is dense in CK(R) in sup norm. In other words, the uniform closure of F consists of all real continuous functions on K. Clearly F is an algebra on K. F separates points on K and vanishes at no point of K. To see this pick x, y ∈ K, x ≠ y. For a suitable a ∈ R^d and b = 0 we have cos(a^T x + 0) ≠ cos(a^T y + 0). Observe that a should be chosen such that a^T x ∈ (−π, π), a^T y ∈ (−π, π), and a^T x ≠ a^T y. Next pick b such that cos(b) ≠ 0 and set a to zero. Clearly for all x ∈ K, cos(a^T x + b) = cos(b) ≠ 0. This implies that F is uniformly dense in the space of real continuous functions on K.

Step 2. Let c(x) be the cosine squasher of Section 16.1. We now show that we can reconstruct cos(u) on any compact interval [−M, M] by

f(u) = ∑_{i=1}^{l} γi c(αi u + βi)    (l ∈ N, αi, βi, γi ∈ R).

Varying the parameters αi, βi, and γi we can shift f(u) along the x- and y-axes, thus it suffices to show that we can reconstruct 2(cos(u) + 1) on any compact interval by functions f(u). It is easy to see that

(cos(u) + 1) · I_{{u∈[−π,π]}} = 2(c(u + π/2) − c(u − π/2))    (16.4)

(see Figure 16.6). By adding a finite number of shifted functions (16.4) we can reconstruct the cosine on any compact interval [−M, M]. If the squashing function is the cosine squasher then the conclusion of the lemma follows immediately from Steps 1 and 2.

Step 3. Next we show that, for every ε > 0 and arbitrary squashing function σ, there is a neural network hε(u) = ∑_{i=1}^{k} ci σ(ai u + bi) + c0 such that

sup_{u∈R} |hε(u) − c(u)| < ε.    (16.5)

Pick an arbitrary ε (0 < ε < 1) and k such that 1/k < ε/4. Since σ is a squashing function there is an M > 0 such that

σ(−M) < ε/(2k)    (16.6)

and

σ(M) > 1 − ε/(2k).    (16.7)

Next by the continuity and monotonicity of c we can find constants r1, . . . , rk such that

c(ri) = i/k, i = 1, . . . , k − 1,

and

c(rk) = 1 − 1/(2k)    (16.8)

(see Figure 16.7). For 1 ≤ i ≤ k − 1 pick ai, bi such that ai ri + bi = −M and ai ri+1 + bi = M, i.e., ai = 2M/(ri+1 − ri) and bi = M(ri + ri+1)/(ri − ri+1). Thus ai u + bi is the line through the points (ri, −M) and (ri+1, M) and clearly ai > 0 (i = 1, . . . , k − 1). We will now verify that the network

hε(u) = (1/k) ∑_{j=1}^{k−1} σ(aj u + bj)

approximates c(u) in sup norm, that is,

|c(u) − hε(u)| < ε

on each of the subintervals (−∞, r1], (r1, r2], . . . , (rk, ∞).

Figure 16.6. Reconstruction of cosine by cosine squashers.

Figure 16.7. Construction of the sequence r1, . . . , rk.

Let i ∈ {1, . . . , k − 1} and u ∈ (ri, ri+1]. Then i/k ≤ c(u) ≤ (i + 1)/k. For j ∈ {1, . . . , i − 1} we have

σ(aj u + bj) ≥ σ(aj rj+1 + bj) = σ(M) ≥ 1 − ε/(2k)

and for j ∈ {i + 1, . . . , k − 1} we have

σ(aj u + bj) ≤ σ(aj rj + bj) = σ(−M) < ε/(2k).

1k

i−1∑

j=1

σ(aju+ bj),1kσ(aiu+ bi), and

1k

k−1∑

j=i+1

σ(aju+ bj).

Applying the inequalities above and bounding each component yields

|c(u) − hε(u)|

≤∣∣∣∣∣∣c(u) − 1

k

i−1∑

j=1

σ(aju+ bj)

∣∣∣∣∣∣+

1kσ(aiu+ bi) +

1k

k−1∑

j=i+1

σ(aju+ bj)

≤∣∣∣∣c(u) − i− 1

k

∣∣∣∣ +

∣∣∣∣∣∣

i− 1k

− 1k

i−1∑

j=1

σ(aju+ bj)

∣∣∣∣∣∣

+1kσ(aiu+ bi) +

1k

k−1∑

j=i+1

σ(aju+ bj)

≤(i+ 1k

− i− 1k

)+(i− 1k

− 1k

(i− 1)(1 − ε

2k))

+1k

+1k

(k − 1 − i)ε

2k

=2k

+1k

(i− 1)ε

2k+

1k

+1k

(k − 1 − i)ε

2k

≤ 3k

2k

Page 323: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

306 16. Neural Networks Estimates

≤ 34ε+ ε2/8

<78ε < ε.

For u ∈ (−∞, r1], we have, by c(r1) = 1/k and (16.6),

|c(u) − hε(u)|

≤ max

1k,k − 1k

ε

2k

< maxε/4, ε/8 < ε.

Similarly, for u ∈ (rk,∞), (16.8) and (16.7) yield

|c(u) − hε(u)| = |1 − c(u) − (1 − hε(u))|≤ max|1 − c(u)|, |1 − hε(u)|

= max

1 −(

1 − 12k

), 1 − k − 1

k

(1 − ε

2k

)

= max

12k,1k

2k

(1 − 1

k

)

≤ max

12k,1k

2k

< maxε

8,ε

4+ε2

8

< ε.

Step 4. Steps 2 and 3 imply that, for every ε > 0, every M > 0, and arbitrary squashing function σ, there exists a neural network CM,ε(u) = ∑_{i=1}^{k} ci σ(ai u + bi) + c0 such that

sup_{u∈[−M,M]} |CM,ε(u) − cos(u)| < ε.    (16.9)

Step 5. Let g(x) = ∑_{i=1}^{k} ci cos(ai^T x + bi) be any cosine network. We show that, for arbitrary squashing function σ, arbitrary compact set K ⊂ R^d, and for any ε > 0, there exists a network s(x) = ∑_{i=1}^{k} ci σ(ai^T x + bi) such that

sup_{x∈K} |s(x) − g(x)| < ε.

Since K is compact and the functions ai^T x + bi (i = 1, . . . , k) are continuous, there is a finite M > 0 such that sup_{x∈K} |ai^T x + bi| ≤ M (i = 1, . . . , k). Let c = ∑_{i=1}^{k} |ci|.

Using the results of Step 4 we have

sup_{x∈K} | ∑_{i=1}^{k} ci cos(ai^T x + bi) − ∑_{i=1}^{k} ci CM,ε/c(ai^T x + bi) |
≤ ∑_{i=1}^{k} |ci| sup_{u∈[−M,M]} |cos(u) − CM,ε/c(u)| < c · (ε/c) = ε.

Lemma 16.1 follows from the inequality above, Step 1, and the triangle inequality. □

The next lemma gives a denseness result in L2(µ) for any probability measure µ.

Lemma 16.2. Let σ be a squashing function. Then for every probability measure µ on R^d, every measurable function f : R^d → R with ∫ |f(x)|² µ(dx) < ∞, and every ε > 0, there exists a neural network

h(x) = ∑_{i=1}^{k} ci σ(ai^T x + bi) + c0    (k ∈ N, ai ∈ R^d, bi, ci ∈ R)

such that

∫ |f(x) − h(x)|² µ(dx) < ε.

Proof. Let

Fk = { ∑_{i=1}^{k} ci σ(ai^T x + bi) + c0 : k ∈ N, ai ∈ R^d, bi, ci ∈ R }.

We need to show that F = ⋃_{k∈N} Fk is dense in L2(µ). Suppose, to the contrary, that this is not the case. Then there exists g ∈ L2(µ), g ≠ 0, i.e., ∫_{R^d} |g(ν)|² µ(dν) > 0, such that g is orthogonal to any f ∈ F, i.e.,

∫_{R^d} f(x) g(x) µ(dx) = 0

for all f ∈ F. In particular,

∫_{R^d} σ(a^T x + b) g(x) µ(dx) = 0    (16.10)

for all a ∈ R^d, b ∈ R. In the remainder of the proof we will show that (16.10), together with the assumption that σ is a squashing function, implies g = 0, which contradicts g ≠ 0. Consider the Fourier transform ĝ of g defined by

ĝ(u) = ∫_{R^d} exp(iu^T v) g(v) µ(dv) = ∫_{R^d} cos(u^T v) g(v) µ(dv) + i ∫_{R^d} sin(u^T v) g(v) µ(dv).

By the uniqueness of the Fourier transform it suffices to show that

ĝ = 0

(see Hewitt and Ross (1970, Theorem 31.31)). We start by showing that

∫_{R^d} cos(u^T v) g(v) µ(dv) = 0    (u ∈ R^d),    (16.11)

using the uniform approximation of the cosine by the univariate neural networks derived in the proof of Lemma 16.1 and the fact that g is orthogonal to these univariate neural networks (see (16.10)).

Let 0 < ε < 1 be arbitrary. Choose N ∈ N odd such that

∫_{{v∈R^d : |u^T v|>Nπ}} |g(v)| µ(dv) < ε/6.    (16.12)

Let c be the cosine squasher. It follows from Step 2 in the proof of Lemma 16.1 that we can find K ∈ N, ai, bi, ci ∈ R, such that

cos(z) − ∑_{i=1}^{K} ci c(ai z + bi) = 0

for z ∈ [−Nπ, Nπ] and

∑_{i=1}^{K} ci c(ai z + bi) = −1

for z ∈ R \ [−Nπ, Nπ]. According to Step 3 in the proof of Lemma 16.1 there exists a neural network

C(z) = ∑_{j=1}^{L} γj σ(αj z + βj)

such that

sup_{z∈R} |c(z) − C(z)| = sup_{z∈R} |c(z) − ∑_{j=1}^{L} γj σ(αj z + βj)| < ε / (2 ∑_{i=1}^{K} |ci| ∫ |g(v)| µ(dv)).

Then, for z ∈ [−Nπ, Nπ],

|cos(z) − ∑_{i=1}^{K} ci C(ai z + bi)|
= |∑_{i=1}^{K} ci c(ai z + bi) − ∑_{i=1}^{K} ci C(ai z + bi)|
≤ ∑_{i=1}^{K} |ci| · ε / (2 ∑_{j=1}^{K} |cj| ∫ |g(v)| µ(dv))
≤ ε / (2 ∫ |g(v)| µ(dv))    (16.13)

and, for z ∉ [−Nπ, Nπ],

|cos(z) − ∑_{i=1}^{K} ci C(ai z + bi)|
≤ 2 + |∑_{i=1}^{K} ci c(ai z + bi) − ∑_{i=1}^{K} ci C(ai z + bi)|
≤ 2 + ε/2 ≤ 3.    (16.14)

Thus we have

|∫_{R^d} cos(u^T v) g(v) µ(dv)|
≤ |∫_{R^d} (cos(u^T v) − ∑_{i=1}^{K} ci C(ai u^T v + bi)) g(v) µ(dv)| + |∫_{R^d} (∑_{i=1}^{K} ci C(ai u^T v + bi)) g(v) µ(dv)|
= T1 + T2.

It follows from (16.10) that

T2 = |∫_{R^d} (∑_{i=1}^{K} ci C(ai u^T v + bi)) g(v) µ(dv)|
= |∑_{i=1}^{K} ci ∑_{j=1}^{L} γj ∫_{R^d} σ(αj(ai u^T v + bi) + βj) g(v) µ(dv)|
= 0.

Furthermore, by (16.13), (16.14), and (16.12),

T1 ≤ |∫_{{v: |u^T v|≤Nπ}} (cos(u^T v) − ∑_{i=1}^{K} ci C(ai u^T v + bi)) g(v) µ(dv)| + |∫_{{v: |u^T v|>Nπ}} (cos(u^T v) − ∑_{i=1}^{K} ci C(ai u^T v + bi)) g(v) µ(dv)|
≤ (ε / (2 ∫_{R^d} |g(v)| µ(dv))) ∫_{{v: |u^T v|≤Nπ}} |g(v)| µ(dv) + 3 ∫_{{v: |u^T v|>Nπ}} |g(v)| µ(dv)
≤ ε/2 + ε/2 = ε

and (16.11) follows from ε → 0. Similarly, we can show

∫_{R^d} sin(u^T v) g(v) µ(dv) = 0    (u ∈ R^d)

and the proof is complete. □

In the proof of Theorem 16.1 we need several results concerning properties of covering numbers. The first lemma describes the VC dimension of graphs of compositions of functions.

Lemma 16.3. Let F be a family of real functions on R^m, and let g : R → R be a fixed nondecreasing function. Define the class G = {g ∘ f : f ∈ F}. Then

V_{G+} ≤ V_{F+}.

Proof. Let (s1, t1), . . . , (sn, tn) be such that they are shattered by G+. Then there exist functions f1, . . . , f_{2^n} ∈ F such that the binary vector

(I_{{g(fj(s1))≥t1}}, . . . , I_{{g(fj(sn))≥tn}})

takes on all 2^n values for j = 1, . . . , 2^n. For all 1 ≤ i ≤ n define the numbers

ui = min_{1≤j≤2^n} {fj(si) : g(fj(si)) ≥ ti}

and

li = max_{1≤j≤2^n} {fj(si) : g(fj(si)) < ti}.

By the monotonicity of g, ui > li, which implies li < (ui + li)/2 < ui. Furthermore,

g(fj(si)) ≥ ti ⟹ fj(si) ≥ ui ⟹ fj(si) > (ui + li)/2

and, likewise,

g(fj(si)) < ti ⟹ fj(si) ≤ li ⟹ fj(si) < (ui + li)/2.

Thus the binary vector

(I_{{fj(s1)≥(u1+l1)/2}}, . . . , I_{{fj(sn)≥(un+ln)/2}})

takes on the same values as

(I_{{g(fj(s1))≥t1}}, . . . , I_{{g(fj(sn))≥tn}})

for every j ≤ 2^n. Therefore, the pairs

(s1, (u1 + l1)/2), . . . , (sn, (un + ln)/2)

are shattered by F+, which proves the lemma. □

The next two lemmas are about the covering numbers of classes of functions whose members are the sums or products of functions from other classes.

Lemma 16.4. Let F and G be two families of real functions on R^m. If F ⊕ G denotes the set of functions {f + g : f ∈ F, g ∈ G}, then for any z_1^n ∈ R^{n·m} and ε, δ > 0, we have

N1(ε + δ, F ⊕ G, z_1^n) ≤ N1(ε, F, z_1^n) N1(δ, G, z_1^n).

Proof. Let {f1, . . . , fK} and {g1, . . . , gL} be an ε-cover and a δ-cover of F and G, respectively, on z_1^n of minimal size. Then, for every f ∈ F and g ∈ G, there exist k ∈ {1, . . . , K} and l ∈ {1, . . . , L} such that

(1/n) ∑_{i=1}^{n} |f(zi) − fk(zi)| < ε

and

(1/n) ∑_{i=1}^{n} |g(zi) − gl(zi)| < δ.

By the triangle inequality

(1/n) ∑_{i=1}^{n} |f(zi) + g(zi) − (fk(zi) + gl(zi))| ≤ (1/n) ∑_{i=1}^{n} |f(zi) − fk(zi)| + (1/n) ∑_{i=1}^{n} |g(zi) − gl(zi)| ≤ ε + δ,

which proves that {fk + gl : 1 ≤ k ≤ K, 1 ≤ l ≤ L} is an (ε + δ)-cover of F ⊕ G on z_1^n. □

Lemma 16.5. Let F and G be two families of real functions on R^m such that |f(x)| ≤ M1 and |g(x)| ≤ M2 for all x ∈ R^m, f ∈ F, g ∈ G. If F · G denotes the set of functions {f · g : f ∈ F, g ∈ G}, then, for any z_1^n ∈ R^{n·m} and ε, δ > 0, we have

N1(ε + δ, F · G, z_1^n) ≤ N1(ε/M2, F, z_1^n) N1(δ/M1, G, z_1^n).

Proof. Let {f1, . . . , fK} and {g1, . . . , gL} be an ε/M2-cover and a δ/M1-cover of F and G, respectively, on z_1^n of minimal size. Then, for every f ∈ F and g ∈ G, there exist k ∈ {1, . . . , K} and l ∈ {1, . . . , L} such that |fk(z)| ≤ M1, |gl(z)| ≤ M2, and

(1/n) ∑_{i=1}^{n} |f(zi) − fk(zi)| < ε/M2

and

(1/n) ∑_{i=1}^{n} |g(zi) − gl(zi)| < δ/M1.

We have, by the triangle inequality,

(1/n) ∑_{i=1}^{n} |f(zi)g(zi) − fk(zi)gl(zi)|
= (1/n) ∑_{i=1}^{n} |f(zi)(gl(zi) + g(zi) − gl(zi)) − fk(zi)gl(zi)|
≤ (1/n) ∑_{i=1}^{n} |gl(zi)(f(zi) − fk(zi))| + (1/n) ∑_{i=1}^{n} |f(zi)(g(zi) − gl(zi))|
≤ M2 (1/n) ∑_{i=1}^{n} |f(zi) − fk(zi)| + M1 (1/n) ∑_{i=1}^{n} |g(zi) − gl(zi)| ≤ ε + δ,

which implies that {fk gl : 1 ≤ k ≤ K, 1 ≤ l ≤ L} is an (ε + δ)-cover of F · G on z_1^n. □

Proof of Theorem 16.1. We can mimic the proof of Theorem 10.3. It is only the bound on the covering numbers that requires additional work. The argument in Chapter 10 implies that the approximation error, inf_{f∈Fn} ∫ |f(x) − m(x)|² µ(dx), converges to zero as kn, βn → ∞, if the union of the Fn's is dense in L2(µ) for every µ (Lemma 16.2).

To handle the estimation error, we use Theorem 10.2, which implies that we can assume |Y| ≤ L almost surely, for some L, and then we have to show that

sup_{f∈Fn} | E|f(X) − Y|² − (1/n) ∑_{j=1}^{n} |f(Xj) − Yj|² | → 0 a.s.

We proceed as in the proof of Theorem 10.3. Define

Z = (X, Y), Z1 = (X1, Y1), . . . , Zn = (Xn, Yn)

and

Hn = { h : R^d × R → R : ∃f ∈ Fn such that h(x, y) = |f(x) − y|² }.

We may assume βn ≥ L so that functions in Hn satisfy

0 ≤ h(x, y) ≤ 2β_n² + 2L² ≤ 4β_n².

Using the bound of Theorem 9.1, we have, for arbitrary ε > 0,

P{ sup_{f∈Fn} | (1/n) ∑_{i=1}^{n} |f(Xi) − Yi|² − E|f(X) − Y|² | > ε }
= P{ sup_{h∈Hn} | (1/n) ∑_{i=1}^{n} h(Zi) − Eh(Z) | > ε }
≤ 8 E N1(ε/8, Hn, Z_1^n) e^{−nε² / (128(4β_n²)²)}.    (16.15)

Next we bound the covering number in (16.15). Let hi(x, y) = |fi(x) − y|² ((x, y) ∈ R^d × R) for some fi ∈ Fn. Mimicking the derivation in the proof of Theorem 10.3 we get

(1/n) ∑_{i=1}^{n} |h1(Zi) − h2(Zi)|
= (1/n) ∑_{i=1}^{n} | |f1(Xi) − Yi|² − |f2(Xi) − Yi|² |
= (1/n) ∑_{i=1}^{n} |f1(Xi) − f2(Xi)| · |f1(Xi) − Yi + f2(Xi) − Yi|
≤ 4βn (1/n) ∑_{i=1}^{n} |f1(Xi) − f2(Xi)|.

Thus

N1(ε/8, Hn, Z_1^n) ≤ N1(ε/(32βn), Fn, X_1^n).    (16.16)

Define the following classes of functions:

G1 = {a^T x + b : a ∈ R^d, b ∈ R},
G2 = {σ(a^T x + b) : a ∈ R^d, b ∈ R},
G3 = {cσ(a^T x + b) : a ∈ R^d, b ∈ R, c ∈ [−βn, βn]},

where G1 is a linear vector space of dimension d + 1, thus Theorem 9.5 implies

V_{G1+} ≤ d + 2

(also see the proof of Theorem 9.4). Since σ is a nondecreasing function, Lemma 16.3 implies that

V_{G2+} ≤ d + 2.

Thus, by Theorem 9.4,

N1(ε, G2, X_1^n) ≤ 3 ((2e/ε) log(3e/ε))^{d+2} ≤ 3 (3e/ε)^{2d+4}.    (16.17)

By Lemma 16.5,

N1(ε, G3, X_1^n) ≤ N1(ε/2, {c : |c| ≤ βn}, X_1^n) N1(ε/(2βn), G2, X_1^n)
≤ (4βn/ε) · 3 (6eβn/ε)^{2d+4}
≤ (12eβn/ε)^{2d+5}.    (16.18)

Upon applying Lemma 16.4 we obtain the bound on the covering number of Fn,

N1(ε, Fn, X_1^n)
≤ N1(ε/(kn + 1), {c0 : |c0| ≤ βn}, X_1^n) · (N1(ε/(kn + 1), G3, X_1^n))^{kn}
≤ (2βn(kn + 1)/ε) · (12eβn(kn + 1)/ε)^{(2d+5)kn}
≤ (12eβn(kn + 1)/ε)^{(2d+5)kn+1}.    (16.19)

Using the bound (16.16) together with (16.19) on the right-hand side of inequality (16.15) we obtain

P{ sup_{f∈Tβn Fn} | (1/n) ∑_{i=1}^{n} |f(Xi) − Yi|² − E|f(X) − Y|² | > ε }
≤ 8 (384eβ_n²(kn + 1)/ε)^{(2d+5)kn+1} e^{−nε²/(2048 β_n^4)}.

As in the proof of Theorem 10.3 this implies the final result. □

16.3 Rate of Convergence

Consider the class of neural networks with k neurons and with bounded output weights

Fn,k = { ∑_{i=1}^{k} ci σ(ai^T x + bi) + c0 : ai ∈ R^d, bi, ci ∈ R, ∑_{i=0}^{k} |ci| ≤ βn }.    (16.20)

Let F = ⋃_{n,k} Fn,k and let F̄ be the closure of F in L2(µ).

In this section we will assume that |Y| ≤ L ≤ βn a.s. and we will examine how fast E ∫ |mn(x) − m(x)|² µ(dx) converges to zero. We use complexity regularization to control the size of the neural network estimate. Let

fn,k = arg min_{f∈Fn,k} (1/n) ∑_{i=1}^{n} |f(Xi) − Yi|²,

that is, fn,k minimizes the empirical risk for n training samples over Fn,k. We assume the existence of such a minimizing function for each k and n. The penalized empirical L2 risk is defined for each f ∈ Fn,k as

(1/n) ∑_{i=1}^{n} |f(Xi) − Yi|² + penn(k).

Following the ideas presented in Chapter 12 we use an upper bound N1(1/n, Fn,k) such that

N1(1/n, Fn,k, x_1^n) ≤ N1(1/n, Fn,k)

for all n ∈ N, x_1^n ∈ (R^d)^n, and we choose penn(k) as

penn(k) ≥ 2568 (β_n^4/n) · (log N1(1/n, Fn,k) + tk)    (16.21)

for some tk ∈ R+ with ∑_k e^{−tk} ≤ 1. For instance, tk can be chosen as tk = 2 log(k) + t0, t0 ≥ ∑_{k=1}^{∞} k^{−2}. Our estimate mn is then defined as the fn,k minimizing the penalized empirical risk over all classes, i.e.,

mn = fn,k*(n),    (16.22)

k*(n) = arg min_{k≥1} ( (1/n) ∑_{i=1}^{n} |fn,k(Xi) − Yi|² + penn(k) ).
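A rough sketch of this model-selection step in Python (illustrative only; fit_network is assumed to return the empirical-risk minimizer over Fn,k, e.g. trained as in the earlier gradient-descent sketch, and penalty implements a user-supplied bound of the form (16.21)):

```python
import numpy as np

def select_network_size(X, Y, fit_network, penalty, k_max):
    # minimize empirical L2 risk + pen_n(k) over the number of neurons k
    best_k, best_score, best_net = None, np.inf, None
    for k in range(1, k_max + 1):
        f_k = fit_network(X, Y, k)
        score = np.mean((f_k(X) - Y) ** 2) + penalty(k, X.shape[0])
        if score < best_score:
            best_k, best_score, best_net = k, score, f_k
    return best_k, best_net

# example penalty of the form c * k * log(n) / n (the constant is a placeholder)
penalty = lambda k, n: 10.0 * k * np.log(n) / n
```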

The next theorem is an immediate consequence of Theorem 12.1 and (12.14).

Theorem 16.2. Let 1 ≤ L < ∞, n ∈ N, and let L ≤ βn < ∞. Assume |Y| ≤ L a.s., and let the neural network estimate mn with squashing function σ be defined by minimizing the penalized empirical risk as in (16.22) with the penalty satisfying condition (16.21) for some tk ∈ R+ such that ∑_k e^{−tk} ≤ 1. Then we have

E ∫ |mn(x) − m(x)|² µ(dx) ≤ 2 min_k { penn(k) + inf_{f∈Fn,k} ∫ |f(x) − m(x)|² µ(dx) } + 5 · 2568 β_n^4 / n.    (16.23)

Before presenting further results we need to review the basic properties of the Fourier transform, see, e.g., Rudin (1966). The Fourier transform F of a function f ∈ L1(R^d) is given by

F(ω) = (1/(2π)^{d/2}) ∫_{R^d} e^{−iω^T x} f(x) dx    (ω ∈ R^d).

If F ∈ L1(R^d) then the inversion formula

f(x) = ∫_{R^d} e^{iω^T x} F(ω) dω

or, equivalently,

f(x) = f(0) + ∫_{R^d} (e^{iω^T x} − 1) F(ω) dω    (16.24)

holds almost everywhere with respect to the Lebesgue measure. The Fourier transform F of a function f is a complex function which can be written

F(ω) = |F(ω)| cos(θ(ω)) + i · |F(ω)| sin(θ(ω)) = |F(ω)| e^{iθ(ω)} = Re(F(ω)) + i · Im(F(ω)),

where

F(ω) := |F(ω)| = √(Re²(F(ω)) + Im²(F(ω)))

is the magnitude of F(ω) and

θ(ω) = arctan(Im(F(ω)) / Re(F(ω)))

is the phase angle. Observe that by the Taylor expansion

e^{iω^T x} = 1 + e^{iθ} · iω^T x,

where θ is between 0 and ω^T x. We then obtain

|e^{iω^T x} − 1| ≤ |ω^T x| ≤ ‖ω‖ · ‖x‖.

Consider the class of functions FC for which (16.24) holds on R^d and, in addition,

∫_{R^d} ‖ω‖ F(ω) dω ≤ C, 0 < C < ∞.    (16.25)

A class of functions satisfying (16.25) is a subclass of functions with Fourier transform having a finite first absolute moment, i.e., ∫_{R^d} ‖ω‖ F(ω) dω < ∞ (these functions are continuously differentiable on R^d). The next theorem provides the rate of convergence for the neural network estimate with a general squashing function.

Theorem 16.3. Let |Y| ≤ L a.s. with 1 ≤ L < ∞, and let m ∈ FC. The neural network estimate mn is defined by minimizing the penalized empirical risk in (16.22) with squashing function σ and the penalty satisfying condition (16.21) for some tk ∈ R+ such that ∑_k e^{−tk} ≤ 1. Then, if 3rC + L ≤ βn,

E ∫_{Sr} |mn(x) − m(x)|² µ(dx)
≤ 2 min_k { 2568 (β_n^4/n) ((2d + 6)k log(18eβn n) + tk) + inf_{f∈Fn,k} ∫_{Sr} |f(x) − m(x)|² µ(dx) } + 5 · 2568 β_n^4 / n,    (16.26)

where Sr is a ball with radius r centered at 0. In particular, upon choosing tk = 2 log k + t0, t0 ≥ ∑_{k=1}^{∞} k^{−2}, and βn → ∞, we get

E ∫_{Sr} |mn(x) − m(x)|² µ(dx) = O( β_n² (log(βn n)/n)^{1/2} ).    (16.27)

If βn < const < ∞ then

penn(k) = 2568 (β_n^4/n) · ((2d + 6)k log(18eβn n) + tk) = O(k log(n)/n)

and the rate in (16.27) becomes

E ∫_{Sr} |mn(x) − m(x)|² µ(dx) = O( √(log(n)/n) ).

In the proof of Theorem 16.3 we will need a refinement of Lemmas 16.4 and 16.5. It is provided in the lemma below.

Lemma 16.6. Let B > 0, let G1, . . . , Gk be classes of real functions f : R^d → [−B, B], and define F as

F = { ∑_{i=1}^{k} wi fi : (w1, . . . , wk) ∈ R^k, ∑_{i=1}^{k} |wi| ≤ b, fi ∈ Gi, i = 1, . . . , k }.

Then we have, for any z_1^n ∈ R^{d·n} and any η, δ > 0,

N1(η + δ, F, z_1^n) ≤ (Be(b + 2δ/B)/δ)^k ∏_{i=1}^{k} N1(η/(b + 2δ), Gi, z_1^n).

Proof. Let

Sb = { w ∈ R^k : ∑_{i=1}^{k} |wi| ≤ b }

and assume that Sb,δ is a finite subset of R^k with the covering property

max_{w∈Sb} min_{x∈Sb,δ} ‖w − x‖1 ≤ δ,

where ‖y‖1 = |y^(1)| + . . . + |y^(k)| denotes the l1 norm of any y = (y^(1), . . . , y^(k))^T ∈ R^k. Also, let Gi(η) be an L1 η-cover of Gi on z_1^n of minimal size, that is, each Gi(η) has cardinality N1(η, Gi, z_1^n) and

min_{g∈Gi(η)} ‖f − g‖n ≤ η

for all f ∈ Gi, where ‖f‖n = {(1/n) ∑_{i=1}^{n} |f(zi)|²}^{1/2}. Let f ∈ F be given by f = ∑_{i=1}^{k} wi fi, and choose x ∈ Sb,δ and f̃i ∈ Gi(η) with ‖w − x‖1 ≤ δ and ‖fi − f̃i‖n ≤ η, i = 1, . . . , k. Since ‖fi‖n ≤ B for all i, we have

‖f − ∑_{i=1}^{k} xi f̃i‖n ≤ ‖∑_{i=1}^{k} wi fi − ∑_{i=1}^{k} xi fi‖n + ‖∑_{i=1}^{k} xi fi − ∑_{i=1}^{k} xi f̃i‖n
≤ ∑_{i=1}^{k} |wi − xi| · ‖fi‖n + ∑_{i=1}^{k} |xi| · ‖fi − f̃i‖n
≤ δB + ηbδ,

where bδ = max_{x∈Sb,δ} ‖x‖1. It follows that a set of functions of cardinality

|Sb,δ| · ∏_{i=1}^{k} N1(η, Gi, z_1^n)

is a (δB + ηbδ)-cover of F. Thus we only need to bound the cardinality of Sb,δ.

The obvious choice for Sb,δ is a rectangular grid spaced at width 2δ/k. For a given collection of grid points we define a partition of the space into Voronoi regions such that each region consists of all points closer to a given grid point than to any other grid point. Define Sb,δ as the points on the grid whose l1 Voronoi regions intersect Sb. These Voronoi regions (and the associated grid points) are certainly contained in Sb+2δ. To get a bound on the number of points in Sb,δ we can divide the volume of the simplex Sb+2δ by the volume of a cube with side length 2δ/k. The volume of Sb+2δ is the volume of a ball with radius r = b + 2δ in the L1(R^k) norm. It can be directly calculated by

2^k ∫_0^r ∫_0^{r−x1} · · · ∫_0^{r−x1−···−x_{k−1}} dx1 dx2 . . . dxk = (2(b + 2δ))^k / k!.

Thus the cardinality of Sb,δ is bounded above by

((2(b + 2δ))^k / k!) · (2δ/k)^{−k} = (1/k!) (k(b + 2δ)/δ)^k ≤ (e(b + 2δ)/δ)^k,

where in the last inequality we used Stirling's formula

k! = √(2πk) k^k e^{−k} e^{θ(k)}, |θ(k)| ≤ 1/(12k).

Since bδ ≤ b + 2δ, we have

N1(δB + η(b + 2δ), F, z_1^n) ≤ (e(b + 2δ)/δ)^k ∏_{i=1}^{k} N1(η, Gi, z_1^n),

or

N1(δ + η, F, z_1^n) = N1((δ/B)B + (η/(b + 2δ))(b + 2δ), F, z_1^n)
≤ (e(b + 2δ/B)/(δ/B))^k ∏_{i=1}^{k} N1(η/(b + 2δ), Gi, z_1^n)
= (Be(b + 2δ/B)/δ)^k ∏_{i=1}^{k} N1(η/(b + 2δ), Gi, z_1^n). □

Proof of Theorem 16.3. Following the ideas given in the proof of Theorem 12.1 we only need to find a bound on the covering number of Fn,k. In order to get the postulated rate we need a stronger bound on the covering number than in (16.19). Using (16.17), Lemmas 16.4 and 16.6, and setting ε = δ, we get for βn ≥ 2ε

N1(3ε, Fn,k, X_1^n)
≤ N1(ε, {c0 : |c0| ≤ βn}, X_1^n) · (e(βn + 2ε)/ε)^k ∏_{i=1}^{k} N1(ε/(βn + 2ε), G2, X_1^n)
≤ (2βn/ε) · (e(βn + 2ε)/ε)^{k+1} ∏_{i=1}^{k} N1(ε/(βn + 2ε), G2, X_1^n)
≤ (2eβn/ε)^k ( 3 (3e(βn + 2ε)/ε)^{2d+4} )^k
≤ (6eβn/ε)^{(2d+6)k}.

Here, again, G2 = {σ(a^T x + b) : a ∈ R^d, b ∈ R}. Hence, for n sufficiently large,

2568 (β_n^4/n) · (log N1(1/n, Fn,k) + tk)
≤ 2568 (β_n^4/n) { log( (6eβn/(1/(3n)))^{(2d+6)k} ) + tk }
= 2568 (β_n^4/n) · ((2d + 6)k log(18eβn n) + tk)
= penn(k).

Now (16.26) follows from Theorem 16.2. Note that upon choosing tk = 2 log(k) + t0, t0 ≥ ∑_{k=1}^{∞} k^{−2}, we get

penn(k) = O( β_n^4 k log(βn n)/n )

and (16.27) follows from (16.26) and Lemma 16.8 below, which implies

inf_{f∈Fk} ∫_{Sr} |f(x) − m(x)|² µ(dx) = O(1/k).
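To see where the rate in (16.27) comes from, one can balance the two k-dependent terms in (16.26); the following display is a sketch of that step (not taken verbatim from the book), with a_n standing for the per-neuron penalty of order β_n^4 log(β_n n)/n and b for the approximation constant of order (2rC)² from Lemma 16.8:

```latex
% a_n: per-neuron penalty term, b: approximation constant (both k-independent)
\min_{k\ge 1}\Bigl(a_n k+\frac{b}{k}\Bigr)\le 2\sqrt{a_n b},
\qquad\text{attained near } k^{*}=\sqrt{b/a_n},
\qquad\text{hence the bound is of order }
\beta_n^{2}\Bigl(\frac{\log(\beta_n n)}{n}\Bigr)^{1/2}.
```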

The next lemma plays a key role in deriving the rate of approximation with neural networks. It describes the rate of approximation of convex combinations in L2(µ). Let ‖f‖²_S denote ∫_S f²(x) µ(dx). In our presentation we will follow Barron's (1993) approach.

Lemma 16.7. Let

Φ = { φa : a ∈ A, ‖φa‖²_{Sr} ≤ B² } ⊂ L2(µ),

where Sr is the ball with radius r centered at 0 and A ⊂ R^m is a set of parameters. Let h be any real-valued measurable function on A with ∫_A |h(a)| da ∈ (0, ∞) such that the representation

f(x) = ∫_A φa(x) h(a) da + c    (16.28)

(c ∈ R) is valid for all |x| < r. Then, for every k ∈ N, there exists

fk(x) = ∑_{j=1}^{k} cj φ_{aj}(x) + c

such that

‖f − fk‖²_{Sr} < (B²/k) ( ∫_A |h(a)| da )²    (16.29)

and the coefficients cj can be chosen such that ∑_{j=1}^{k} |cj| ≤ ∫_A |h(a)| da, where ∫ |h(a)| da is the total variation of the signed measure h(a) da.

Proof. Assuming |x| < r, we have, by (16.28),

f(x) = ∫_A φa(x) h(a) da + c = ∫_A sgn(h(a)) · φa(x) |h(a)| da + c,

where sgn(x) denotes the sign of x. Let Q be the probability measure on A defined by the density |h(·)| / ∫_A |h(a)| da with respect to the Lebesgue measure. Let D = ∫_A |h(a)| da. Thus

f(x) = D · E_Q{sgn(h(A)) φ_A(x)} + c,

where A is a random variable with distribution Q. Let A1, . . . , Ak be i.i.d. random variables with distribution Q independent of A, and let

fk(x) = (D/k) ∑_{j=1}^{k} sgn(h(Aj)) φ_{Aj}(x) + c.

By (16.28) and the Fubini theorem

E_Q ‖f − fk‖²_{Sr} = E_Q ∫_{Sr} (f(x) − fk(x))² µ(dx)
= ∫_{Sr} E_Q (f(x) − fk(x))² µ(dx)
= ∫_{Sr} E_Q { f(x) − ( (D/k) ∑_{j=1}^{k} sgn(h(Aj)) φ_{Aj}(x) + c ) }² µ(dx)
= ∫_{Sr} E_Q { D · E_Q{sgn(h(A)) φ_A(x)} + c − ( (D/k) ∑_{j=1}^{k} sgn(h(Aj)) φ_{Aj}(x) + c ) }² µ(dx)
= D² ∫_{Sr} Var_Q { (1/k) ∑_{j=1}^{k} sgn(h(Aj)) φ_{Aj}(x) } µ(dx)
= (D²/k) ∫_{Sr} Var_Q (sgn(h(A)) φ_A(x)) µ(dx).

Since

E_Q(sgn(h(A)) φ_A(x)) = (1/D)(f(x) − c),

without loss of generality we may alter the constant c in f and fk such that E_Q(sgn(h(A)) φ_A(x)) > 0 on a set of µ-measure greater than zero. On this set Var_Q(sgn(h(A)) φ_A(x)) < E_Q(φ²_A(x)) with strict inequality. Hence, by the Fubini theorem,

E_Q ‖f − fk‖²_{Sr} < (D²/k) ∫_{Sr} E_Q(φ²_A(x)) µ(dx) = (D²/k) E_Q ( ∫_{Sr} φ²_A(x) µ(dx) ) ≤ B²D²/k.

Since

E_Q ‖f − fk‖²_{Sr} < B²D²/k

we can find a1, . . . , ak such that

‖f − fk‖²_{Sr} < B²D²/k = (B²/k)( ∫_A |h(a)| da )².

The coefficients in fk are (D/k) sgn(h(aj)), hence their absolute values sum to at most D. □
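The sampling argument above is easy to check numerically. In the sketch below (illustrative only: the target f(x) = ∫_0^1 cos(ax) da = sin(x)/x, the grid standing in for µ, and all names are choices made for the example) the squared L2 error of the k-term average is multiplied by k, and the product stays roughly constant, i.e. the error is O(1/k):

```python
import numpy as np

rng = np.random.default_rng(0)
r = 3.0
x = np.linspace(-r, r, 2001)             # grid approximating mu = uniform on S_r
f = np.ones_like(x)
nz = x != 0
f[nz] = np.sin(x[nz]) / x[nz]            # f(x) = integral_0^1 cos(a x) da = sin(x)/x

for k in [10, 100, 1000]:
    errs = []
    for _ in range(50):                  # average over independent draws
        a = rng.uniform(0.0, 1.0, size=k)   # A_1,...,A_k i.i.d. ~ Q (here h = 1, D = 1)
        f_k = np.cos(np.outer(a, x)).mean(axis=0)
        errs.append(np.mean((f - f_k) ** 2))
    print(k, np.mean(errs) * k)          # roughly constant in k
```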

The next lemma gives the rate of approximation for neural nets with squashing functions.

Lemma 16.8. Let σ be a squashing function. Then, for every probability measure µ on R^d, every measurable f ∈ FC, and every k ≥ 1, there exists a neural network fk in

Fk = { ∑_{i=1}^{k} ci σ(ai^T x + bi) + c0 : ai ∈ R^d, bi, ci ∈ R }    (16.30)

such that

∫_{Sr} (f(x) − fk(x))² µ(dx) ≤ (2rC)²/k.    (16.31)

The coefficients of the linear combination in (16.30) may be chosen so that ∑_{i=0}^{k} |ci| ≤ 3rC + |f(0)|.

Proof. Consider the class of multiples of sigmoids and linear functions

Gσ = { cσ(a^T x + b) : |c| ≤ 2rC, a ∈ R^d, b ∈ R }.

For F ⊂ L2(µ, Sr), let conv(F) denote the closure of the convex hull of F in L2(µ, Sr), where L2(µ, Sr) denotes the space of functions such that ∫_{Sr} f²(x) µ(dx) < ∞.

The key idea of the proof is to show that, for every f ∈ FC, f(x) − f(0) ∈ conv(Gσ). The rate postulated in (16.31) will then follow from Lemma 16.7. Let Ω = {ω ∈ R^d : ω ≠ 0}. From the properties of the Fourier transform and the fact that f is real valued, it follows that

f(x) − f(0)    (16.32)
= Re ∫_Ω (e^{iω·x} − 1) F(ω) dω
= Re ∫_Ω (e^{iω·x} − 1) e^{iθ(ω)} F(ω) dω
= ∫_Ω (cos(ω·x + θ(ω)) − cos(θ(ω))) F(ω) dω
= ∫_Ω ((cos(ω·x + θ(ω)) − cos(θ(ω)))/‖ω‖) ‖ω‖ F(ω) dω
= ∫_Ω g(x, ω) ‖ω‖ F(ω) dω    (16.33)

for every x ∈ Sr, where

g(x, ω) = (cos(ω·x + θ(ω)) − cos(θ(ω))) / ‖ω‖.

Let Gstep be the class of step functions φ(t,a) defined as

Gstep = { φ_{(t,a)}(x) = I*(a·x − ‖a‖t) : a ∈ R^d, t ∈ R },

where

I*(x) = 0 for x < 0, I*(x) = σ(0) for x = 0, and I*(x) = 1 for x > 0.

By the identity ∫ sin(ax + b) dx = −(1/a) cos(ax + b) + const we get

g(x, ω) = (cos(‖ω‖ (ω·x/‖ω‖) + θ(ω)) − cos(θ(ω))) / ‖ω‖
= ∫_{ω·x/‖ω‖}^{0} sin(‖ω‖t + θ(ω)) dt
= I.    (16.34)

Consider two cases.

Case 1. ω·x/‖ω‖ ≥ 0. Then

I = −∫_0^{ω·x/‖ω‖} sin(‖ω‖t + θ(ω)) dt = −∫_0^r I*(ω·x/‖ω‖ − t) sin(‖ω‖t + θ(ω)) dt

since |ω·x/‖ω‖| ≤ |x| ≤ r. On the other hand,

∫_{−r}^0 I*(t − ω·x/‖ω‖) sin(‖ω‖t + θ(ω)) dt = ∫_{−r}^0 0 dt = 0.

Case 2. ω·x/‖ω‖ < 0. Then

I = ∫_{ω·x/‖ω‖}^0 sin(‖ω‖t + θ(ω)) dt = ∫_{−r}^0 I*(t − ω·x/‖ω‖) sin(‖ω‖t + θ(ω)) dt.

On the other hand,

∫_0^r I*(ω·x/‖ω‖ − t) sin(‖ω‖t + θ(ω)) dt = ∫_0^r 0 dt = 0.

Hence we obtain for the right-hand side of (16.34),

I = −∫_0^r I*(ω·x/‖ω‖ − t) sin(‖ω‖t + θ(ω)) dt + ∫_{−r}^0 I*(t − ω·x/‖ω‖) sin(‖ω‖t + θ(ω)) dt
= −∫_0^r I*(ω·x/‖ω‖ − t) sin(‖ω‖t + θ(ω)) dt + ∫_{−r}^0 (1 − I*(ω·x/‖ω‖ − t)) sin(‖ω‖t + θ(ω)) dt
= ∫_{−r}^0 sin(‖ω‖t + θ(ω)) dt − ∫_{−r}^r I*(ω·x/‖ω‖ − t) sin(‖ω‖t + θ(ω)) dt
= s(ω) − ∫_{−r}^r I*(ω·x/‖ω‖ − t) sin(‖ω‖t + θ(ω)) dt
= s(ω) − ∫_{−r}^r φ_{(t,ω)}(x) sin(‖ω‖t + θ(ω)) dt,

where s(ω) = ∫_{−r}^0 sin(‖ω‖t + θ(ω)) dt. Substituting the above into (16.33) we get

we get

f(x) − f(0)

=∫

Rd

s(ω)‖ω‖F (ω)dω

−∫

[−r,r]×Rd

φ(t,ω)(x) sin(‖ω‖t+ θ(ω))‖ω‖F (ω)d(t, ω)

= c−∫

[−r,r]×Rd

φ(t,ω)(x)ν(t, ω)d(t, ω),

where c =∫s(ω)‖ω‖F (ω)dω and

ν(t, ω)d(t, ω) = sin(‖ω‖t+ θ(ω))‖ω‖F (ω)d(t, ω)

is a signed measure with total variation bounded from above∫

[−r,r]×Rd

| sin(‖ω‖t+ θ(ω))‖ω‖F (ω)|d(t, ω)

≤ 2r∫

Rd

‖ω‖F (ω)dω = 2rC.

Thus by Lemma 16.7 there exists a linear combination of functions fromGstep plus a constant fk(x) =

∑kj=1 cjφ(tj ,aj)(x) + c0 which approximates

f(x) in L2(µ, Sr), i.e.,∫

Sr

||f(x) − fk(x)||2µ(dx) <4r2C2

k. (16.35)

In order to complete the proof we approximate a step function with thesigmoid. Note that

φ(t,ω)(x) = I∗(ω · x‖ω‖ − t

)

= limL→∞

σ

(L

(ω · x‖ω‖ − t

))(16.36)

Page 343: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

326 16. Neural Networks Estimates

for every ω ∈ Rd, x ∈ Rd, t ∈ R including the case ω·x‖ω‖ − t = 0 when the

limit in (16.36) is equal to σ(0). This means Gσ ⊂ Gstep. By the Lebesguedominated convergence theorem, (16.36) holds in L2(µ, Sr) sense, and using(16.35) together with the triangle inequality we obtain

lim sup_{L→∞} ‖ f(x) − ∑_{j=1}^{k} cj σ( L(aj·x/‖aj‖ − tj) ) − c0 ‖_{Sr}
≤ ‖ f(x) − ∑_{j=1}^{k} cj φ_{(tj,aj)}(x) − c0 ‖_{Sr} + ∑_{j=1}^{k} |cj| lim sup_{L→∞} ‖ φ_{(tj,aj)}(x) − σ( L(aj·x/‖aj‖ − tj) ) ‖_{Sr}
≤ 2rC/√k.

As to the size of the coefficients, Lemma 16.7 states

∑_{j=1}^{k} |cj| ≤ ∫_{[−r,r]×R^d} |ν(t, ω)| d(t, ω) ≤ 2rC,

and

|c0| = |c + f(0)| ≤ ∫ |s(ω)| ‖ω‖ F(ω) dω + |f(0)| ≤ rC + |f(0)|,

hence

∑_{j=0}^{k} |cj| ≤ 3rC + |f(0)|.

This completes the proof. □

16.4 Bibliographic Notes

An early account of feedforward neural networks is provided in Nilsson(1965). More recent monographs include Hertz, Krogh, and Palmer (1991),Ripley (1996), Devroye, Gyorfi, and Lugosi (1996) and Bartlett and An-thony (1999). A large number of papers have been devoted to the theoreticalanalysis of neural network regression estimates. For distributions, where

Page 344: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

16.4. Bibliographic Notes 327

both X and Y are of bounded support, White (1990) proved L2 consis-tency in probability for certain estimators. Unlike in Section 16.2 the rangeof the ai’s and bi’s in White (1990; 1991) had to be restricted.

Almost sure consistency for the same class of distributions can be ob-tained by using Haussler’s (1992) results. Mielniczuk and Tyrcha (1993)obtained L2 consistency for arbitrary sigmoids. Universal consistency ofnetwork classifiers with threshold sigmoid functions was shown by Faragoand Lugosi (1993). Barron (1991; 1994) applied the complexity regulariza-tion principle to regression estimation by neural networks. The consistencyof the neural network regression estimate presented in Theorem 16.1 hasbeen investigated by Lugosi and Zeger (1995).

Cybenko (1989), Hornik, Stinchcombe, and White (1989), and Funahashi(1989) proved independently, that, on compact sets, feedforward neuralnetworks with one hidden layer are dense with respect to the supremumnorm in the set of continuous functions. In other words, every continuousfunction on Rd can be approximated arbitrarily closely uniformly over anycompact set by functions realized by neural networks. For a survey of suchdenseness results we refer the reader to Barron (1989) and Hornik (1993).

Cybenko (1989) proved the approximation result in sup norm for net-works with continuous squashing functions through application of theHahn–Banach theorem and the Riesz representation theorem. In their uni-form approximation proof Hornik, Stinchcombe, and White (1989) madeuse of the Stone–Weierstrass theorem (see Rudin (1964)). Funahashi (1989)proved the same result using properties of the Fourier transform. TheL2 approximation result is due to Hornik (1991). Its simplified versionis presented in Lemma 16.2.

The rate of approximation of convex combinations in L2(µ), given inLemma 16.7, and the L2 approximation rate for neural networks withsquashing functions, given in Lemma 16.8, follow Barron (1993). The ratesof approximation have also been studied by Mhaskar (1996) and Maiorovand Meir (2000).

The rates of L2 convergence for sigmoidal neural networks have beenstudied by Barron (1994) and McCaffrey and Gallant (1994). Barron usedcomplexity regularization on the discrete set of parameters and imposeda Lipschitz condition on the sigmoid to obtain the rate in the class offunctions covered by the approximation Lemma 16.8. Barron’s results havebeen extended by McCaffrey and Gallant (1994) by not discretizing the pa-rameter space. They considered networks with cosine squasher and appliedthem to functions in Sobolev spaces, i.e., the spaces of functions for whichf (r−1) is absolutely continuous, r = 1, 2, . . ., and f (r) ∈ Lp(µ), p ≥ 1.

The bound on the VC dimension of the graphs of compositions of functions (Lemma 16.3) has been investigated by Nolan and Pollard (1987) and Dudley (1987). The properties of covering numbers described in Lemmas 16.4 and 16.5 can be found in Nolan and Pollard (1987), Pollard (1990), and in Devroye, Gyorfi, and Lugosi (1996).


Problems and Exercises

Problem 16.1. (Hornik (1989)). Prove that Lemma 16.1 is valid for neural networks with continuous nonconstant sigmoids.

Hint. Use the Stone–Weierstrass theorem.

Problem 16.2. (Hornik (1993)). Consider σ(a^T x + b), a ∈ A, b ∈ B. Extend Lemma 16.1 so that it holds for any Riemann integrable and non-polynomial sigmoid on some A containing a neighborhood of the origin, and on some nondegenerate compact interval B.

Problem 16.3. (Hornik (1991)). Prove that Lemma 16.2 remains true if σ is bounded and nonconstant.

Problem 16.4. Define C_f = ∫_{R^d} ‖ω‖ F(ω) dω (see (16.25)). Let f(x) = g(‖x‖), i.e., f is a radial function. Show that C_f = V_d ∫_0^∞ r^d |F(r)| dr, where V_d is the volume of the (d−1)-dimensional unit sphere in R^d. Prove that if f(x) = exp(−‖x‖²/2), i.e., f is the Gaussian function, then C_f ≤ d^{1/2}.


17
Radial Basis Function Networks

17.1 Radial Basis Function Networks

The definition of a radial basis function network (RBF network) with one hidden layer and at most k nodes for a fixed function K : R_+ → R, called a kernel, is given by the equation

f(x) = ∑_{i=1}^{k} w_i K(‖x − c_i‖_{A_i}) + w_0,   (17.1)

where

‖x − c_i‖²_{A_i} = [x − c_i]^T A_i [x − c_i],

w_0, w_1, . . . , w_k ∈ [−b, b], c_1, . . . , c_k ∈ R^d, and A_1, . . . , A_k are (d × d)-dimensional positive semidefinite matrices (i.e., a^T A_i a ≥ 0 for all a ∈ R^d) and b > 0 (we allow b = ∞). The weights w_i, c_i, A_i are parameters of the RBF network and K(‖x − c_i‖_{A_i}) is the radial basis function (Figure 17.1).
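To make (17.1) concrete, the following sketch (an illustration, not part of the book) evaluates a small RBF network with the Gaussian kernel K(x) = e^{−x²}; the weights, centers, and matrices A_i below are arbitrary example values.

```python
import numpy as np

def rbf_network(x, w0, w, centers, As, kernel=lambda u: np.exp(-u ** 2)):
    """Evaluate f(x) = sum_i w_i K(||x - c_i||_{A_i}) + w_0 as in (17.1)."""
    value = w0
    for wi, ci, Ai in zip(w, centers, As):
        diff = x - ci
        norm_Ai = np.sqrt(diff @ Ai @ diff)        # ||x - c_i||_{A_i}
        value += wi * kernel(norm_Ai)
    return value

# toy example with k = 2 nodes in R^2 (all numbers are made up)
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
As = [np.eye(2), np.diag([4.0, 1.0])]              # spherical and elliptical receptive fields
print(rbf_network(np.array([0.5, 0.5]), w0=0.1, w=[1.0, -0.5], centers=centers, As=As))
```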

There are two different types of kernels used in applications: increasing kernels, i.e., kernels such that K(x) → ∞ as x → ∞, and decreasing kernels, i.e., kernels such that K(x) → 0 as x → ∞. The increasing kernels play an important role in approximation theory (see Section 17.4 for details). All our theoretical results discussed in this chapter concern only decreasing kernels. It is an open problem whether similar results remain true for increasing kernels.

Common choices for decreasing kernels are:


Figure 17.1. Radial basis network with one hidden layer.

• K(x) = I_{x∈[0,1]} (window);
• K(x) = max(1 − x², 0) (truncated parabolic);
• K(x) = e^{−x²} (Gaussian); and
• K(x) = e^{−x} (exponential).

For a given fixed kernel K, there are three sets of parameters of RBF networks:

(i) the weight vectors of the output layer of an RBF network, w_i (i = 1, . . . , k),

(ii) the center vectors c_i (i = 1, . . . , k), and

(iii) the d × d positive semidefinite matrices A_i (i = 1, . . . , k) determining the size of the receptive field of the basis functions K(‖x − c_i‖_{A_i}).

The last two sets constitute the weights of the hidden layer of an RBF network. The most common choice for K(x) is the Gaussian kernel, K(x) = e^{−x²}, which leads to

K(‖x − c_i‖_{A_i}) = e^{−[x−c_i]^T A_i [x−c_i]}.

For a specific K(x), e.g., Gaussian, the size, shape, and orientation of the receptive field of a node are determined by the matrix A_i. When A_i = (1/σ_i²) I, the shape is a hyperspherical ball with radius σ_i. When A_i = diag[σ_{i,1}^{−2}, . . . , σ_{i,d}^{−2}], the shape of the receptive field is an elliptical ball with each axis coinciding with a coordinate axis; the lengths of the axes of the ellipsoid are determined by σ_{i,1}, . . . , σ_{i,d}. When A_i is nondiagonal but symmetric, we have A_i = R_i^T D_i R_i, where D_i is a diagonal matrix which determines the shape and size of the receptive field and R_i is a rotation matrix which determines the orientation of the receptive field.

Figure 17.2. Window, truncated parabolic, Gaussian, and exponential kernels.

There is a close relationship between RBF networks and smoothing splines (see Chapter 20). Consider the multivariate penalized least squares problem in which we minimize

(1/n) ∑_{i=1}^{n} |g(X_i) − Y_i|² + λ_n · J_k²(g)   (17.2)

over the Sobolev space W^k(R^d) consisting of functions g whose weak derivatives of order k are contained in L2(R^d). The complexity of g is penalized by λ_n · J_k²(g), where

J_k²(g) = ∫_{R^d} exp(‖s‖²/β) |ĝ(s)|² ds

and ĝ denotes the Fourier transform of g. The minimization of (17.2) leads to

K(‖x‖) = exp(−β‖x‖²).   (17.3)

Note that (17.3) is a Gaussian kernel which satisfies the conditions of Theorems 17.1 and 17.2.

Another interesting property of RBF networks which distinguishes them from neural networks with squashing functions is that the center vectors can be selected as cluster centers of the input data. In practical applications, the number of clusters is usually much smaller than the number of data points, resulting in RBF networks of smaller complexity than multilayer neural networks.

The problem of determining the specific values of parameters from the training sequence D_n = {(X_1, Y_1), . . . , (X_n, Y_n)} consisting of n i.i.d. copies of (X, Y) is called learning or training. The most common parameter learning strategies are:

• cluster the input vectors X_i (i = 1, . . . , n) and set the center vectors c_i (i = 1, . . . , k) to the cluster centers. Remaining parameters are determined by minimizing the empirical L2 risk on D_n. If the elements of the covariance matrices A_i (i = 1, . . . , k) are chosen arbitrarily, then finding the output weights w_i (i = 1, . . . , k) by the least squares method is an easy linear problem, see Section 10.1;

• choose from D_n a random k-element subset {(X′_1, Y′_1), . . . , (X′_k, Y′_k)} of samples and assign X′_i → c_i, Y′_i → w_i (i = 1, . . . , k). The elements of the covariance matrices A_i, i = 1, . . . , k, are chosen arbitrarily; and

• choose all the parameters of the network by minimizing the empirical L2 risk.

Computationally, the last strategy is the most costly. In practice, parameters of RBF networks are learned by the steepest descent backpropagation algorithm. This algorithm is not guaranteed to find a global minimum of the empirical L2 risk.
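As an illustration of the first strategy above (a sketch, not the book's procedure), the code below picks the centers by a few plain k-means iterations, fixes spherical receptive fields A_i = (1/σ²) I, and then solves the linear least squares problem for the output weights; the choices of k, σ, and the number of iterations are arbitrary.

```python
import numpy as np

def fit_rbf_least_squares(X, Y, k=10, sigma=0.5, n_iter=20, rng=np.random.default_rng(0)):
    """Strategy 1: centers from clustering, output weights by linear least squares."""
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)]       # initialize at random data points
    for _ in range(n_iter):                                  # plain k-means iterations
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # design matrix: Gaussian kernel with A_i = (1/sigma^2) I, plus a constant column for w_0
    dist2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1) / sigma ** 2
    B = np.hstack([np.ones((n, 1)), np.exp(-dist2)])
    coef, *_ = np.linalg.lstsq(B, Y, rcond=None)             # minimizes the empirical L2 risk
    return coef[0], coef[1:], centers                        # w_0, output weights, centers

# usage on synthetic data (purely illustrative)
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)
w0, w, centers = fit_rbf_least_squares(X, Y)
```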

In this chapter we will focus our attention on RBF networks trained by minimizing the empirical L2 error subject to size restrictions on the output weights. We will study the asymptotic properties of such networks. The empirical risk minimization is computationally very complex and there are no efficient algorithms which can find parameters of multidimensional RBF networks minimizing the empirical L2 risk.

17.2 Consistency

Given the training set D_n, our estimate of the regression function m(x) = E{Y | X = x} is an RBF network m_n which minimizes the empirical L2 risk

(1/n) ∑_{j=1}^{n} |f(X_j) − Y_j|².


More specifically, for each n we fix Θ_n as the set of parameters defined by

Θ_n = { θ = (w_0, . . . , w_{k_n}, c_1, . . . , c_{k_n}, A_1, . . . , A_{k_n}) : ∑_{i=0}^{k_n} |w_i| ≤ b_n },

and we choose our regression estimator m_n from the class

F_n = { f_θ : θ ∈ Θ_n } = { ∑_{i=1}^{k_n} w_i K(‖x − c_i‖_{A_i}) + w_0 : ∑_{i=0}^{k_n} |w_i| ≤ b_n }   (17.4)

with m_n satisfying

(1/n) ∑_{j=1}^{n} |m_n(X_j) − Y_j|² = min_{f∈F_n} (1/n) ∑_{j=1}^{n} |f(X_j) − Y_j|².   (17.5)

We assume the existence of a minimum in (17.5). If the minimum does not exist we can work with functions whose empirical L2 risk is close to the infimum.

Thus m_n, the optimal estimator, is sought among RBF networks consisting of at most k_n neurons satisfying the weight constraint ∑_{i=0}^{k_n} |w_i| ≤ b_n. The number of allowable nodes k_n will be a function of the training set size n, to be specified later. If we assume that |K(u)| ≤ K* for all u ≥ 0, and let k* = max{K*, 1}, then these constraints and the Cauchy–Schwarz inequality imply that, for any θ ∈ Θ_n and x ∈ R^d,

|f_θ(x)|² = | ∑_{i=1}^{k_n} w_i K(‖x − c_i‖_{A_i}) + w_0 |²
≤ ( ∑_{i=0}^{k_n} |w_i| ) ( ∑_{i=1}^{k_n} |w_i| K²(‖x − c_i‖_{A_i}) + |w_0| )
≤ b_n² k*².   (17.6)

Our analysis of RBF networks will be confined to networks with the regular radial kernels defined next.

Definition 17.1. A kernel K : [0, ∞) → R is a regular radial kernel if it is nonnegative, monotonically decreasing, left continuous, ∫_{R^d} K(‖x‖) dx ≠ 0, and ∫_{R^d} K(‖x‖) dx < ∞, where ‖ · ‖ is the Euclidean norm on R^d.

Note that a regular radial kernel is bounded, i.e., K(x) ≤ k* (x ∈ [0, ∞)) for some finite constant k*. All the kernels in Figure 17.2 are regular radial kernels. The next result describes the consistency properties of m_n.

Theorem 17.1. Let |Y| ≤ L < ∞ a.s. Consider a family F_n of RBF networks defined by (17.4), with k_n ≥ 1, and let K be a regular radial kernel. If

k_n, b_n → ∞   and   k_n b_n⁴ log(k_n b_n²)/n → 0

as n → ∞, then the RBF network m_n minimizing the empirical L2 risk over F_n = {f_θ : θ ∈ Θ_n} is weakly universally consistent. If, in addition,

b_n⁴ / n^{1−δ} → 0   (n → ∞)

for some δ > 0, then m_n is strongly universally consistent.

Proof. Following the proofs of Theorems 10.2 and 10.3, for weak consistency it suffices to show

inf_{f∈F_n} ∫ |f(x) − m(x)|² µ(dx) → 0   (n → ∞)   (17.7)

and

E{ sup_{f∈F_n} | (1/n) ∑_{i=1}^{n} |f(X_i) − Y_i|² − E|f(X) − Y|² | } → 0   (n → ∞)

for bounded Y, and for strong consistency it suffices to show (17.7) and

sup_{f∈F_n} | (1/n) ∑_{i=1}^{n} |f(X_i) − Y_i|² − E|f(X) − Y|² | → 0   (n → ∞)   a.s.

for bounded Y. Here we do not need to consider truncated functions because the functions in F_n are bounded. The last three limit relations follow from Lemmas 17.1 and 17.2 below.

We first consider the approximation error for a subset of the family of RBF networks in (17.1) by constraining A_i to be diagonal with equal elements. Letting A_i = h_i^{−2} I, (17.1) becomes

f_θ(x) = ∑_{i=1}^{k} w_i K( ‖(x − c_i)/h_i‖² ) + w_0,   (17.8)

where θ = (w_0, . . . , w_k, c_1, . . . , c_k, h_1, . . . , h_k) is the vector of parameters, w_0, . . . , w_k ∈ R, h_1, . . . , h_k ∈ R, and c_1, . . . , c_k ∈ R^d.

We will show in Lemma 17.1 that ⋃_{k=1}^{∞} F_k is dense in L2(µ) for any probability measure µ on R^d and for RBF networks with regular radial kernels. Consequently, the approximation error inf_{f∈F_n} ∫ |f(x) − m(x)|² µ(dx) converges to zero when k_n → ∞.

Lemma 17.1. Assume that K is a regular radial kernel. Let µ be an arbitrary probability measure on R^d. Then the RBF networks given by (17.8) are dense in L2(µ). In particular, if m ∈ L2(µ), then, for any ε > 0, there exist parameters θ = (w_0, . . . , w_k, c_1, . . . , c_k, h_1, . . . , h_k) such that

∫_{R^d} |f_θ(x) − m(x)|² µ(dx) < ε.   (17.9)

Proof. Let ‖f‖ denote the L2(µ) norm of any f ∈ L2(µ). By Theorem A.1 we have that, for any m ∈ L2(µ) and any ε > 0, there exists a continuous g supported on a compact set Q such that

‖m − g‖_{L2} < ε/2.   (17.10)

Set K̄(x) = K(‖x‖²) / ∫ K(‖x‖²) dx and K̄_h(x) = (1/h^d) K̄(x/h), and define

σ_h(x) = ∫_{R^d} g(y) K̄_h(x − y) dy.

First we show that lim_{h→0} σ_h(x) = g(x) for all x ∈ R^d. Since g is uniformly continuous, for each δ > 0 we can find a γ > 0 such that

|g(x − y) − g(x)| < δ

whenever |y| < γ. We have

|σ_h(x) − g(x)| = | ∫_{R^d} (g(y) − g(x)) K̄_h(x − y) dy |
= | ∫_{R^d} (g(x − y) − g(x)) K̄_h(y) dy |
≤ δ ∫_{|y|<γ} K̄_h(y) dy + ∫_{|y|≥γ} |g(x − y) − g(x)| K̄_h(y) dy
≤ δ ∫_{R^d} K̄(y) dy + 2 sup_{x∈R^d} |g(x)| ∫_{|y|≥γ/h} K̄(y) dy,

and the last term goes to zero as h → 0 since ∫ K(‖x‖) dx < ∞. We also have |σ_h(x)| ≤ B ∫ |K̄(x)| dx < ∞ for all x ∈ R^d, where B = sup_{x∈R^d} |g(x)|. By the dominated convergence theorem, and since δ > 0 is arbitrary, lim_{h→0} ‖g − σ_h‖_{L2(µ)} = 0. Consequently, we can choose h > 0 such that

‖g − σ_h‖_{L2} < ε/4.   (17.11)

In the rest of the proof, using a probabilistic argument similar to the one by Barron (1993) for Lemma 17.4, we will demonstrate that there exists an f_θ which approximates σ_h within ε/4 in L2(µ) norm.

First assume that g(x) ≥ 0 and is not identically zero for all x, and define the probability density function ϕ by

ϕ(x) = g(x) / ∫ g(y) dy.


Then σ_h(x) = E K̄(x, Z), where K̄(x, y) = K̄_h(x − y) ∫ g(y) dy and Z has density ϕ.

Let Z_1, Z_2, . . . be an i.i.d. sequence of random variables with each Z_i having density ϕ. By the strong law of large numbers, for all x,

lim_{k→∞} (1/k) ∑_{i=1}^{k} K̄(x, Z_i) = σ_h(x)   a.s.,

therefore, by the dominated convergence theorem,

∫ ( (1/k) ∑_{i=1}^{k} K̄(x, Z_i) − σ_h(x) )² µ(dx) → 0   a.s.

as k → ∞. Thus there exists a sequence (z_1, z_2, . . .) for which

∫ ( (1/k) ∑_{i=1}^{k} K̄(x, z_i) − σ_h(x) )² µ(dx) < ε/4

for k large enough. Since (1/k) ∑_{i=1}^{k} K̄(x, z_i) is an RBF network in the form of (17.8), this implies the existence of an f_θ such that

‖σ_h − f_θ‖_{L2} < ε/4.   (17.12)

To generalize (17.12) to arbitrary g we use the decomposition g(x) = g_+(x) − g_−(x), where g_+ and g_− denote the positive and negative parts of g, respectively. Then

σ_h(x) = σ_h^{(1)}(x) − σ_h^{(2)}(x) = ∫_{R^d} g_+(z) K̄_h(x − z) dz − ∫_{R^d} g_−(z) K̄_h(x − z) dz.

We can approximate σ_h^{(1)} and σ_h^{(2)} separately as in (17.12) above by f_{θ^{(1)}} and f_{θ^{(2)}}, respectively. Then for f_θ = f_{θ^{(1)}} − f_{θ^{(2)}} we get

‖σ_h − f_θ‖_{L2} < ε/2.   (17.13)

We conclude from (17.10), (17.11), and (17.13) that

‖m − f_θ‖_{L2(µ)} < ε,

which proves (17.9). Note that the above proof also establishes the first statement of the lemma, namely that {f_θ : θ ∈ Θ} is dense in L2(µ).

Next we consider

sup_{f∈F_n} | (1/n) ∑_{i=1}^{n} |f(X_i) − Y_i|² − E|f(X) − Y|² |.

We have the following result:


Lemma 17.2. Assume |Y| ≤ L < ∞ a.s. Consider a family of RBF networks defined by (17.4), with k = k_n ≥ 1. Assume that K is a regular radial kernel. If

k_n, b_n → ∞   and   k_n b_n⁴ log(k_n b_n²)/n → 0

as n → ∞, then

E{ sup_{f∈F_n} | (1/n) ∑_{i=1}^{n} |f(X_i) − Y_i|² − E|f(X) − Y|² | } → 0   (n → ∞)

for all distributions of (X, Y) with Y bounded. If, in addition,

b_n⁴ / n^{1−δ} → 0   (n → ∞)

for some δ > 0, then

sup_{f∈F_n} | (1/n) ∑_{i=1}^{n} |f(X_i) − Y_i|² − E|f(X) − Y|² | → 0   (n → ∞)   a.s.

Proof. Let K be bounded by k*. Without loss of generality we may assume L² ≤ b_n² k*² and b_n ≥ 1. If |y| ≤ L, then by (17.6) the functions h(x, y) = (f(x) − y)², f ∈ F_n, are bounded above by

h(x, y) ≤ 4 max{|f(x)|², |y|²} ≤ 4 max{b_n² k*², L²} ≤ 4 b_n² k*².   (17.14)

Define the family of functions

H_n = { h : R^{d+1} → R : h(x, y) = (f(x) − T_L y)²  ((x, y) ∈ R^{d+1}) for some f ∈ F_n },   (17.15)

where T_L is the usual truncation operator. Thus each member of H_n maps R^{d+1} into R. Hence

sup_{f∈F_n} | (1/n) ∑_{j=1}^{n} |f(X_j) − Y_j|² − E|f(X) − Y|² | = sup_{h∈H_n} | (1/n) ∑_{i=1}^{n} h(X_i, Y_i) − E h(X, Y) |,

and for all h ∈ H_n we have |h(x, y)| ≤ 4 b_n² k*² for all (x, y) ∈ R^d × R. Using Pollard's inequality, see Theorem 9.1, we obtain

P{ sup_{f∈F_n} | (1/n) ∑_{j=1}^{n} |f(X_j) − Y_j|² − E|f(X) − Y|² | > ε }
= P{ sup_{h∈H_n} | (1/n) ∑_{i=1}^{n} h(X_i, Y_i) − E h(X, Y) | > ε }
≤ 8 E{ N_1(ε/8, H_n, Z_1^n) } e^{−nε²/128(4k*²b_n²)²}.   (17.16)

In the remainder of the proof we obtain an upper bound on the L1 covering number N_1(ε/8, H_n, z_1^n), which will be independent of z_1^n. Let f_1 and f_2 be two real functions on R^d satisfying |f_i(x)|² ≤ b_n² k*² (i = 1, 2) for all x ∈ R^d. Then for h_1(x, y) = (f_1(x) − y)² and h_2(x, y) = (f_2(x) − y)², and any z_1^n = ((x_1, y_1), . . . , (x_n, y_n)) with |y_i| ≤ L (i = 1, . . . , n), we have

(1/n) ∑_{i=1}^{n} |h_1(x_i, y_i) − h_2(x_i, y_i)| = (1/n) ∑_{i=1}^{n} |(f_1(x_i) − y_i)² − (f_2(x_i) − y_i)²|
= (1/n) ∑_{i=1}^{n} |f_1(x_i) − f_2(x_i)| · |f_1(x_i) + f_2(x_i) − 2y_i|
≤ 4 k* b_n (1/n) ∑_{i=1}^{n} |f_1(x_i) − f_2(x_i)|.   (17.17)

Since |f(x)|² ≤ b_n² k*² (x ∈ R^d) for all f ∈ F_n, the functions in the cover {f_1, . . . , f_l}, l = N_1(ε, F_n, x_1^n) (Definition 9.3), can be chosen so that they also satisfy |f_i(x)|² ≤ b_n² k*² (x ∈ R^d) (i = 1, . . . , l). Combining this with (17.17) we conclude that, for n large enough,

N_1(ε/8, H_n, z_1^n) ≤ N_1( ε/(32 k* b_n), F_n, x_1^n ).

G = K (‖x− c‖A) : c ∈ Rd.We have

N1

32k∗bn,Fn, x

n1

)

≤kn∏

i=1

N1

32k∗bn(kn + 1), w · g : g ∈ G, |w| ≤ bn , xn

1

)

×N1

32k∗bn(kn + 1), w : |w| ≤ bn , xn

1

)

≤kn∏

i=1

(

N1

64k∗b2n(kn + 1),G, xn

1

)

Page 356: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

17.2. Consistency 339

×N1

64k∗2bn(kn + 1), w : |w| ≤ bn , xn

1

))

×N1

32k∗bn(kn + 1), w : |w| ≤ bn , xn

1

)

≤kn∏

i=1

(2bn

ε64k∗2bn(kn+1)

N1

64k∗b2n(kn + 1),G, xn

1

))

· 2bnε

32k∗bn(kn+1)

=(

128k∗2b2n(kn + 1)ε

)k+1 (N1

64k∗b2n(kn + 1),G, xn

1

))k

.(17.18)

To bound N1(ε/(64k∗b2n(kn +1)),G, xn1 ) we will use Theorem 9.4 relating

covering numbers of G to the VC dimension of graph sets of functions in G(see Figure 17.3).

Since K is left continuous and monotone decreasing we have

K

(√[x− c]TA[x− c]

)≥ t if and only if [x− c]TA[x− c] ≤ ϕ2(t),

where ϕ(t) = maxy : K(y) ≥ t. Equivalently, (x, t) must satisfy

xTAx− xT (Ac+AT c) + cTAc− ϕ2(t) ≤ 0.

Consider now the set of real functions on Rd+1 defined for any (x, s) ∈Rd × R by

gA,α,β,γ(x, s) = xTAx+ xTα+ γ + βs,

where A ranges over all (d × d)-matrices, and α ∈ Rd, β, γ ∈ R are arbi-trary. The collection gA,α,β,γ is a (d2 + d + 2)-dimensional vector spaceof functions. Thus the class of sets of the form (x, s) : gA,α,β,γ(x, s) ≤ 0has VC dimension at most d2 +d+2 by Theorem 9.5. Clearly, if for a givencollection of points (xi, ti) a set (x, t) : g(x) ≥ t, g ∈ G picks out thepoints (xi1 , ti1), . . . , (xil

, til), then there exist A,α, β, γ such that (x, s) :

gA,α,β,γ(x, s) ≥ 0 picks out only the points (xi1 , ϕ2(ti1)), . . . , (xil

, ϕ2(til)).

This shows that VG+ ≤ d2 + d+ 2. Theorem 9.4 implies

N1(ε/(64k∗b2n(kn + 1)),G, xn1 )

≤ 3(

6ek∗

ε/(64k∗b2n(kn + 1))

)2(d2+d+2)

≤ 3(

384ek∗2b2n(kn + 1)ε

)2(d2+d+2)

,

Page 357: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

340 17. Radial Basis Function Networks

t

x

(xil, ϕ(til

))(xi1 , ϕ(ti1))

Figure 17.3. Graph set of G.

from which, upon substitution into (17.18), we obtain for n large enough

N1

32k∗bn,Fn, x

n1

)

≤ 3k

(384ek∗2b2n(kn + 1)

ε

)(2d2+2d+3)k+1

≤(C1b

2nkn

ε

)C2kn

.

Collecting the results above we finally obtain, with appropriate constantsC1, C2, and C3, depending on k∗, d, and k∗, respectively, the followingbound implied by inequality (17.16):

P

sup

f∈Fn

∣∣∣∣∣∣

1n

n∑

j=1

|f(Xj) − Yj |2 − E|f(X) − Y |2∣∣∣∣∣∣> ε

≤ 8(C1b

2nkn

ε

)C2kn

e−nε2/C3(b2n)2

= 8 exp(

− n

(bn)4[ε2/C3 − C2knb

4n

nlog

C1b2nkn

ε]).

As in the proof of Theorem 10.3 this inequality implies Lemma 17.2.

17.3 Rate of Convergence

Consider the RBF networks given by (17.1) with weights satisfying the constraint ∑_{i=0}^{k} |w_i| ≤ β_n for a fixed b > 0. Thus the kth candidate class F_k for the function estimation task is defined as the class of networks with k nodes

F_{n,k} = { ∑_{i=1}^{k} w_i K(‖x − c_i‖_{A_i}) + w_0 : ∑_{i=0}^{k} |w_i| ≤ β_n }.   (17.19)

In the sequel we develop bounds on the expected L2 error of complexity regularized RBF networks.

Clearly, |f(x)| ≤ k*β_n for all x ∈ R^d and f ∈ F_{n,k}. It will be assumed that for each k we are given a finite, almost sure uniform upper bound on the random covering numbers N_1(ε, F_{n,k}, X_1^n), where X_1^n = (X_1, . . . , X_n). Denoting this upper bound by N̄_1(ε, F_{n,k}), we have

N_1(ε, F_{n,k}, X_1^n) ≤ N̄_1(ε, F_{n,k})   a.s.

Note that we have suppressed the possible dependence of this bound on the distribution of X. Finally assume that |Y| ≤ L < ∞ a.s.

We define the complexity penalty of the kth class for n training samples as any nonnegative number pen_n(k) satisfying

pen_n(k) ≥ 2568 (β_n⁴/n) · (log N̄_1(1/n, F_{n,k}) + t_k),   (17.20)

where the nonnegative constants t_k ∈ R_+ satisfy Kraft's inequality ∑_{k=1}^{∞} e^{−t_k} ≤ 1. As in Chapter 16 the coefficients t_k may be chosen as t_k = 2 log k + t_0 with t_0 ≥ log(∑_{k≥1} k^{−2}). The penalty is defined in a similar manner as in inequality (12.14).

We can now define our estimate. Let

m_{n,k} = arg min_{f∈F_{n,k}} (1/n) ∑_{i=1}^{n} |f(X_i) − Y_i|²,

that is, m_{n,k} minimizes the empirical L2 risk for n training samples over F_{n,k}. (We assume the existence of such a minimizing function for each k and n.) The penalized empirical L2 risk is defined for each f ∈ F_{n,k} as

(1/n) ∑_{i=1}^{n} |f(X_i) − Y_i|² + pen_n(k).

Our estimate m_n is then defined as the m_{n,k} minimizing the penalized empirical risk over all classes

m_n = m_{n,k*},   (17.21)

where

k* = arg min_{k≥1} ( (1/n) ∑_{i=1}^{n} |m_{n,k}(X_i) − Y_i|² + pen_n(k) ).
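To illustrate the selection rule (17.21), the following sketch (not from the book) assumes a hypothetical routine fit_fn(X, Y, k) that returns an empirical-risk-minimizing network from F_{n,k} as a vectorized predictor, and uses a simplified penalty proportional to k log n / n as a stand-in for (17.20); both fit_fn and the constant c are assumptions for the sake of illustration.

```python
import numpy as np

def select_k_by_penalized_risk(X, Y, fit_fn, ks, c=1.0):
    """Pick k* minimizing empirical L2 risk plus a complexity penalty pen_n(k)."""
    n = len(Y)
    best = None
    for k in ks:
        m_nk = fit_fn(X, Y, k)                        # empirical risk minimizer over F_{n,k}
        risk = np.mean((m_nk(X) - Y) ** 2)            # empirical L2 risk of m_{n,k}
        penalized = risk + c * k * np.log(n) / n      # simplified penalized empirical risk
        if best is None or penalized < best[0]:
            best = (penalized, k, m_nk)
    return best[1], best[2]                           # k* and m_n = m_{n,k*}
```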


It is easy to see that the class F = ⋃_k F_k of RBF networks given by (17.19) is convex. Let F̄ be the closure of F in L2(µ). We have the following theorem for the estimate (17.21).

Theorem 17.2. Let 1 ≤ L < ∞, n ∈ N, and let L ≤ β_n < ∞. Suppose, furthermore, that |Y| ≤ L < ∞ a.s. Let K be a regular radial kernel. Assume that the penalty satisfies (17.20) for some t_k such that ∑_{k=1}^{∞} e^{−t_k} ≤ 1. Then the RBF regression estimate chosen by the complexity regularization satisfies, for n sufficiently large,

E ∫ |m_n(x) − m(x)|² µ(dx)
≤ 2 min_{k≥1} ( 2568 (β_n⁴/n) ( ((2d² + 2d + 6)k + 5) log(12ek*β_n n) + t_k ) + inf_{f∈F_k} ∫ |f(x) − m(x)|² µ(dx) ) + 5 · 2568 (β_n⁴/n).   (17.22)

If m ∈ F̄, then

E ∫ |m_n(x) − m(x)|² µ(dx) = O( β_n² (log(β_n n)/n)^{1/2} ).   (17.23)

Note that if β_n < const < ∞, then

pen_n(k) = O( k log n / n )

and the rate in (17.23) becomes

E ∫ |m_n(x) − m(x)|² µ(dx) = O( √(log n / n) ).

In the proof of Theorem 17.2 we will use Lemma 17.3, which provides a stronger bound on the covering numbers of RBF networks than (17.18). Lemma 17.4 will provide the approximation error rate.

Lemma 17.3. Assume that K is a regular radial kernel. Then, for 0 < ε < k*/4,

N_1(ε, F_k, X_1^n) ≤ 3^k (4β_n(β_n + ε)/ε) ( 2ek*(β_n + ε/k*)/ε )^{k+1} ( 6ek*(β_n + ε)/ε )^{2k(d²+d+2)}.

If, in addition, β_n ≥ ε, then

N_1(ε, F_k, X_1^n) ≤ 3^k (8β_n²/ε) ( 12ek*β_n/ε )^{(2d²+2d+5)k+1}.


Proof. Let G be the collection of functions [x − c]^T A [x − c] parametrized by nonnegative definite matrices A and c ∈ R^d. Also let

K̄ = { K(√(g(·))) : g ∈ G } = { K̄(g(·)) : g ∈ G },

where K̄(x) = K(√x), x ∈ R, is a monotone decreasing function since it is a composition of the monotone increasing function √x and the monotone decreasing function K(x). Since G spans a (d² + d + 1)-dimensional vector space, by Theorem 9.5 the collection of sets

G+ = { {(x, t) : g(x) − t ≥ 0} : g ∈ G }

has VC dimension V_{G+} ≤ d² + d + 2. Since K̄ is monotone decreasing, it follows from Lemma 16.3 that V_{K̄+} ≤ d² + d + 2, where the families of sets K̄+ are defined just as G+ with K̄ in place of G. Since 0 ≤ K̄(x) ≤ k*, x ∈ R, Theorem 9.4 implies, for all f ∈ K̄ and ε < k*/4,

N1(ε,K, Xn1 ) ≤ 3

(2ek∗

εlog

(3ek∗

ε

))VK+

≤ 3(

3ek∗

ε

)2VK+

.

It follows that

N1(ε,K, Xn1 ) ≤ 3

(3ek∗

ε

)2(d2+d+2)

for ε < k∗/4. Since Fk is defined as

Fk = k∑

i=1

wifi + w0 :k∑

i=0

|wi| ≤ βn, fi ∈ K,

we obtain, from Lemma 16.6 with η = δ = ε/2 and B = k∗,

N1(ε,Fk, Xn1 )

≤(

2ek∗(βn + ε/k∗)ε

)k+1

(N1(ε/2(βn + ε),K, Xn1 ))k

×N1 (ε/2(βn + ε), w0 : |w0| ≤ βn, Xn1 )

≤(

2ek∗(βn + ε/k∗)ε

)k+1(

3(

6ek∗(βn + ε)ε

)2(d2+d+2))k

· 4βn(βn + ε)ε

≤ 3k 4βn(βn + ε)ε

(2ek∗(βn + ε/k∗)

ε

)k+1 (6ek∗(βn + ε)ε

)2k(d2+d+2)

≤ 3k8β2n

ε

(12ek∗βn

ε

)(2d2+2d+5)k+1

.

The next lemma, by Barron (1993), needed in the proof of Theorem 17.2, describes the rate of approximation of convex combinations in the Hilbert space.


Lemma 17.4. Denote by F̄ the closure of the convex hull of the set F = ⋃_k F_k in L2(µ) with norm ‖ · ‖ = ‖ · ‖_{L2(µ)}. Assume ‖f‖ ≤ b for each f ∈ F, and let f* ∈ F̄. Then for every k ≥ 1 and every c > 2(b² − ‖f*‖²), there is an f_k in the convex hull of k points in F such that

‖f_k − f*‖²_{L2(µ)} ≤ c/k.

Proof. Pick k ≥ 1 and δ > 0. Choose a function f in the convex hull of F such that

‖f − f*‖ ≤ δ/k.

Thus f = ∑_{i=1}^{m} α_i f_i, with α_i ≥ 0, ∑_{i=1}^{m} α_i = 1, f_i ∈ F, for m sufficiently large. Let X be randomly drawn from the set {1, . . . , m} with

P{g_X = f_i} = P{X = i} = α_i   (i = 1, . . . , m),

and let X_1, . . . , X_k be independently drawn from the same distribution as X. Set f_k = (1/k) ∑_{i=1}^{k} g_{X_i}. Then

Efk =1k

k∑

i=1

EgXi =1k

k∑

i=1

Em∑

j=1

f jI[Xi=j] =1k

k∑

i=1

m∑

j=1

αjf j =1k

k∑

i=1

f = f.

Next

E||fk − f ||2 = E < fk − f, fk − f >

= E <1k

k∑

i=1

gXi− f,

1k

k∑

i=1

gXi− f >

= E <1k

k∑

i=1

gXi− E

1k

k∑

i=1

gXi,1k

k∑

i=1

gXi− E

1k

k∑

i=1

gXi>

=1k2 E

k∑

i,j=1

< gXi − EgXi , gXj − EgXj >

=1k2

k∑

i,j=1

m∑

l,p=1

E < f l(I[Xi=l] − αl), fp(I[Xj=p] − αp) >

=1k2

k∑

i=1

m∑

l,p=1

E < f l(I[Xi=l] − αl), fp(I[Xi=p] − αp) >

=1k2

k∑

i=1

E||gXi− EgXi

||2

=1kE||gX1 − f ||2.

Page 362: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

17.3. Rate of Convergence 345

Since the boundedness of g implies boundedness of g, we obtain

1kE||gX1 − f ||2

=1kE < gX1 − f, gX1 − f >

=1k

(E||gX1 ||2 + ||f ||2 − 2E < gX1 , f >

)

=1k

(

E||gX1 ||2 + ||f ||2 − 2m∑

i=1

E(< f i, f > ·IX1=i

))

=1k

(

E||gX1 ||2 + ||f ||2 − 2m∑

i=1

αi < f i, f >

)

=1k

(E||gX1 ||2 − ||f ||2)

≤ 1k

(b2 − ||f ||2) .

We have thus bounded E||fk − f ||2 by 1k

(b2 − ||f ||2) which implies that

there exist g1, . . . , gk ∈ F and fk in the convex hull of g1, . . . , gk such that

||fk − f ||2 ≤ 1k

(b2 − ||f ||2) .

Thus by the triangle inequality

||fk − f∗||2 ≤ 2||fk − f ||2 + 2||f − f∗||2

≤ 2(b2 − ||f ||2)k

+2δk.

The conclusion of the lemma follows by choosing δ sufficiently small.
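The probabilistic argument above can also be checked numerically. The sketch below (an illustration, not part of the book) takes a finite set F of Gaussian bumps, a convex combination f̄ of them, and estimates E‖f_k − f̄‖²_{L2(µ)} for µ uniform on [0, 1]; the functions, weights, and constants are arbitrary choices, and the printed values should decay roughly like 1/k.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 2001)                      # mu = uniform on [0,1]; L2 norms by averaging
F = [lambda x, c=c: np.exp(-(x - c) ** 2 / 0.02) for c in np.linspace(0.1, 0.9, 9)]
alpha = rng.dirichlet(np.ones(len(F)))                  # convex weights alpha_i
f_bar = sum(a * f(grid) for a, f in zip(alpha, F))      # the target convex combination

def mean_sq_error(k, reps=200):
    errs = []
    for _ in range(reps):
        idx = rng.choice(len(F), size=k, p=alpha)       # draw k functions with P{g_X = f_i} = alpha_i
        f_k = sum(F[i](grid) for i in idx) / k          # f_k = average of the sampled functions
        errs.append(np.mean((f_k - f_bar) ** 2))        # ||f_k - f_bar||^2 in L2(mu)
    return np.mean(errs)

for k in (1, 2, 4, 8, 16):
    print(k, mean_sq_error(k))                          # roughly proportional to 1/k
```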

Proof of Theorem 17.2. Using the bound for N1(ε,Fk, Xn1 ) from

Lemma 17.3 we obtain from (17.20)

2568β4

n

n(log N1(1/n,Fn,k) + tk)

≤ 2568β4

n

n

(

log3k8β2

n

ε

(12ek∗βn

ε

)(2d2+2d+5)k+1

+ tk

)

≤ 2568β4

n

n(((2d2 + 2d+ 6)k + 5) log 12ek∗βnn+ tk)

= penn(k). (17.24)

The penalty bound in (12.14) implies

Page 363: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

346 17. Radial Basis Function Networks

E∫

|mn(x) −m(x)|2µ(dx)

≤ 2 mink≥1

(penn(k) + inf

f∈Fk

E∫

|f(x) −m(x)|2µ(dx))

+ 5 · 2568β4

n

n,

from which and from (17.24) for n sufficiently large we obtain (17.22). Notethat upon choosing tk = 2 log(k) + t0, t0 ≥ ∑∞

k=1 k−2 we get

pen(k) = O

(β4

nk log(βnn)n

).

In order to obtain the second statement of Theorem 17.2 we note that theclass ∪kFk is convex if the Fk are the collections of RBF networks definedin (17.1). We can thus apply, to the right-hand side of (17.22), Lemma 17.4which states that there exists an RBF network fk ∈ Fk such that

‖fk −m‖2L2

≤ c1k

for some c1 > 2((k∗b)2 − ||m||2). Substituting this bound into (17.22) weobtain

E∫

|mn(x) −m(x)|2µ(dx)

≤ 2 mink≥1

(c1k

+ cβ4

nk log(βnn)n

)+ 5 · 2568

β4n

n

= O

(

β2n

(log(βnn)

n

)1/2)

,

and (17.23) follows.

The above convergence rate results hold in the case when the regression function is a member of the L2(µ) closure of F = ⋃_k F_k, where

F_k = { ∑_{i=1}^{k} w_i K(‖x − c_i‖_{A_i}) + w_0 : ∑_{i=0}^{k} |w_i| ≤ b }.   (17.25)

In other words, m should be such that for all ε > 0 there exist a k and a member f of F_k with ‖f − m‖_{L2} < ε. The precise characterization of F̄ remains largely unsolved. However, based on the work of Girosi and Anzellotti (1992) we can describe a large class of functions that is contained in F̄.

Let H(x, t) be a bounded, real-valued, and measurable function of two variables x ∈ R^d and t ∈ R^n. Suppose that ν is a signed measure on R^n with finite total variation ‖ν‖, where ‖ν‖ = ν_+(R^n) + ν_−(R^n) and ν_+ and ν_− are the positive and negative parts of ν, respectively (see, e.g., Rudin (1966)). If g(x) is defined as

g(x) = ∫_{R^n} H(x, t) ν(dt),


then g ∈ L2(µ) for any probability measure µ on R^d. One can reasonably expect that g can be approximated well by functions f(x) of the form

f(x) = ∑_{i=1}^{k} w_i H(x, t_i),

where t_1, . . . , t_k ∈ R^n and ∑_{i=1}^{k} |w_i| ≤ ‖ν‖. The case n = d and H(x, t) = G(x − t) has been investigated by Girosi and Anzellotti (1993), where a detailed description of function spaces arising from the different choices of the basis function G is given. Girosi (1994) extends this approach to approximation by convex combinations of translates and dilates of a Gaussian function. In general, one can prove the following:

Lemma 17.5. Let

g(x) = ∫_{R^n} H(x, t) ν(dt),   (17.26)

where H(x, t) and ν are as above. Define, for each k ≥ 1, the class of functions

G_k = { f(x) = ∑_{i=1}^{k} w_i H(x, t_i) : ∑_{i=0}^{k} |w_i| ≤ ‖ν‖ }.

Then, for any probability measure µ on R^d and for any 1 ≤ p < ∞, the function g can be approximated in L2(µ) arbitrarily closely by members of G = ⋃_k G_k, i.e.,

inf_{f∈G_k} ‖f − g‖_{L2(µ)} → 0   as k → ∞.

In other words, g ∈ Ḡ.

The proof of the lemma is similar to the proof of Lemma 17.1 and is left as an exercise (see Problem 17.1).

It is worth mentioning that, in general, the closure of ⋃_k G_k is richer than the class of functions having a representation as in (17.26).

To apply the lemma to RBF networks (17.1), let n = d² + d, t = (A, c), and H(x, t) = K(‖x − c‖_A). Note that F̄ contains all the functions g with the integral representation

g(x) = ∫_{R^{d²+d}} K(‖x − c‖_A) ν(dc, dA),   (17.27)

for which ‖ν‖ ≤ b, where b is the constraint on the weights as in (17.25). The approximation result of Lemma 17.4 for functions of the form (17.27) in the special case of X having bounded support and absolutely continuous ν is a direct consequence of Lemma 17.4. In this case the rates of approximation in both lemmas are the same. One important example of a class of functions g obtainable in this manner has been given by Girosi (1994). He used the


Gaussian basis function

H(x, t) = H(x, c, σ) = exp( −‖x − c‖²/σ ),

where c ∈ R^d, σ > 0, and t = (c, σ). The results by Stein (1970) imply that members of the Bessel potential space of order 2m > d have an integral representation in the form of (17.26) with this H(x, t), and that they can be approximated by functions of the form

f(x) = ∑_{i=1}^{k} w_i exp( −‖x − c_i‖²/σ_i )   (17.28)

in sup norm and, thus, in L2(µ). The space of functions thus obtained includes the Sobolev space H^{2m,1} of functions whose weak derivatives up to order 2m are in L1(R^d). Note that the RBF networks considered in Theorem 17.2 contain (17.28) as a special case.

17.4 Increasing Kernels and Approximation

Increasing kernels, i.e., kernels such that K(x) → ∞ as x → ∞, play an important role in approximation. Common choices for increasing kernels are:

• K(x) = x (linear);
• K(x) = x³ (cubic);
• K(x) = √(x² + c²) (multiquadric);
• K(x) = x^{2n+1} (thin-plate spline), n ≥ 1; and
• K(x) = x^{2n} log x (thin-plate spline), n ≥ 1.

There is a close relationship between RBF networks and smoothing splines (see Chapter 20). Consider the multivariate penalized least squares problem in which we minimize

(1/n) ∑_{i=1}^{n} |g(X_i) − Y_i|² + λ_n · J_k²(g)   (17.29)

over the Sobolev space W^k(R^d) consisting of functions g whose weak derivatives of order k are contained in L2(R^d). The complexity of g is penalized by λ_n · J_k²(g), where

J_k²(g) = ∑_{α_1,...,α_d ∈ N_0, α_1+···+α_d=k} (k! / (α_1! · . . . · α_d!)) ∫_{R^d} | ∂^k g(x) / (∂x_1^{α_1} · · · ∂x_d^{α_d}) |² dx.

Figure 17.4. Linear, cubic, multiquadric, and thin-plate spline kernels.

Using techniques from functional analysis one can show that a function which minimizes (17.29) over W^k(R^d) always exists. In addition, one can calculate such a function as follows.

Let l = (d+k−1 choose d) and let φ_1, . . . , φ_l be all monomials x_1^{α_1} · . . . · x_d^{α_d} of total degree α_1 + . . . + α_d less than k. Depending on k and d define

K(z) = ‖z‖^{2k−d} · log(‖z‖)   if d is even,
K(z) = ‖z‖^{2k−d}   if d is odd.

Let z_1, . . . , z_N be the distinct values of X_1, . . . , X_n. Then there exists a function of the form

g*(x) = ∑_{i=1}^{N} µ_i K(x − z_i) + ∑_{j=1}^{l} ν_j φ_j(x)   (17.30)

which minimizes (17.29) over W^k(R^d) (see Chapter 20 for further details). The kernel K in (17.30) is the thin-plate spline kernel, which is an increasing kernel.

Radial functions (primarily increasing ones) are also encountered in in-terpolation problems where one looks for radial function interpolants of theform

f(x) =n∑

i=1

ciK(‖x−Xi‖) + pm(x)

Page 367: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

350 17. Radial Basis Function Networks

with polynomial pm(x) on Rd of degree less than m interpolating data(Xi, Yi) (i = 1, . . . , n). Typical radial functions used in interpolation aremultiquadrics, shifted surface splines, and thin-plate splines.

17.5 Bibliographic Notes

Radial Basis Function (RBF) networks have been introduced by Broomhead and Lowe (1988) and Moody and Darken (1989). Powell (1987) and Dyn (1989) described applications of RBF networks in interpolation and approximation. The universal approximation ability of RBF networks was studied by Park and Sandberg (1991; 1993) and Krzyzak, Linder, and Lugosi (1996). Lemma 17.1, which is due to Krzyzak, Linder, and Lugosi (1996), generalizes the approximation results of Park and Sandberg (1991; 1993), who showed that if K(‖x‖) ∈ L1(λ) ∩ Lp(λ) and ∫ K(‖x‖) dx ≠ 0, then the class of RBF networks defined in (17.8) is dense in Lp(λ) for p ≥ 1. Approximation Lemma 17.4 is due to Barron (1993).

Poggio and Girosi (1990), Chen, Cowan, and Grant (1991), Krzyzak et al.(1996; 1998) and Xu, Krzyzak, and Yuille (1994) investigated the issues oflearning and estimation. RBF networks with centers learned by clusteringwere studied by Xu, Krzyzak, and Oja (1993), with randomly sampledcenters by Xu, Krzyzak, and Yuille (1994). RBF networks with parameterslearned by minimizing the empirical L2 risk were investigated by Krzyzaket al. (1996; 1998).

The steepest descent backpropagation weight training algorithm has beenproposed by Rumelhart, Hinton and Williams (1986) and applied to RBFtraining by Chen, Cowan, and Grant (1991). Convergence rates of RBFapproximation schemes have been shown to be comparable with those forneural networks by Girosi and Anzellotti (1992; 1993). Niyogi and Girosi(1996) studied the tradeoff between approximation and estimation errorsand provided an extensive review of the problem. Lp error rates with1 ≤ p < ∞ were established by Krzyzak and Linder (1998) who showedTheorem 17.2.

RBF networks with optimal MISE radial functions were investigatedby Krzyzak (2001) and normalized RBF networks in Krzyzak and Nie-mann (2001), and Krzyzak and Schafer (2002). Radial functions were usedin interpolation by Powell (1987; 1992), Dyn (1989), and Light (1992).

Problems and Exercises

Problem 17.1. Prove Lemma 17.5. Hint: Use the probabilistic argument from the proof of Lemma 17.1.

Problem 17.2. Prove the following generalization of Lemma 17.1.


Figure 17.5. Decomposition of the kernel of bounded variation into the difference of monotonically decreasing kernels.

Suppose K : R → R is bounded and

K(‖x‖) ∈ L1(λ) ∩ Lp(λ)

for some p ∈ [1, ∞), and assume that ∫ K(‖x‖) dx ≠ 0. Let µ be an arbitrary probability measure on R^d and let q ∈ (0, ∞). Then the RBF networks in the form (17.8) are dense in both Lq(µ) and Lp(λ). In particular, if m ∈ Lq(µ) ∩ Lp(λ), then for any ε there exists a θ = (w_0, . . . , w_k, b_1, . . . , b_k, c_1, . . . , c_k) such that

∫_{R^d} |f_θ(x) − m(x)|^q µ(dx) < ε   and   ∫_{R^d} |f_θ(x) − m(x)|^p dx < ε.

Problem 17.3. Show Theorem 17.2 for the bounded kernel K of boundedvariation.

Problem 17.4. Prove the following generalization of Lemma 17.3. Assume that |K(x)| ≤ 1 for all x ∈ R, and suppose that K has total variation V < ∞. Then

N_1(ε, F_k, X_1^n) ≤ 3^k (4b(b + ε)/ε) ( 2e(b + ε)/ε )^{k+1} ( 6eV(b + ε)/ε )^{4k(d²+d+2)}.

Hint: Since K is of bounded variation it can be written as the difference of two monotone decreasing functions: K = K_1 − K_2 (see Figure 17.5).

Let G be the collection of functions [x − c]^T A [x − c] parametrized by c ∈ R^d and the nonnegative definite matrix A. Also, let F_i = {K_i(g(·)) : g ∈ G} (i = 1, 2) and let F = {K(g(·)) : g ∈ G}. By Lemma 16.4 for the covering number of sums of families of functions, we have

N_1(ε, F, X_1^n) ≤ N_1(ε/2, F_1, X_1^n) N_1(ε/2, F_2, X_1^n),

because F ⊂ {f_1 − f_2 : f_1 ∈ F_1, f_2 ∈ F_2}. Since G spans a (d² + d + 1)-dimensional vector space, by Lemma 9.5 the collection of sets

G+ = { {(x, t) : g(x) − t ≥ 0} : g ∈ G }

has VC dimension V_{G+} ≤ d² + d + 2. Since K_i is monotone, it follows from Lemma 16.3 that V_{F_i^+} ≤ d² + d + 2, where the families of sets F_i^+ are defined just as G+ with F_i in place of G. Let V_1 and V_2 be the total variations of K_1 and K_2, respectively. Then V = V_1 + V_2 and 0 ≤ K_i(x) + α_i ≤ V_i, x ∈ R (i = 1, 2), for suitably chosen constants α_1 and α_2. Lemma 9.4 implies 0 ≤ f(x) ≤ B for all


f ∈ F and x, thus

N1(ε, F , Xn1 ) ≤ 3

(2eB

εlog

(3eB

ε

))VF+

≤ 3(3eB

ε

)2VF+

.

Show that this implies that

N1(ε, F , Xn1 ) ≤ 9

(3eV

ε

)4(d2+d+2)

.

Mimic the rest of the proof of Lemma 17.3 to obtain the conclusion.

Page 370: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

18
Orthogonal Series Estimates

Orthogonal series estimates use the estimates of coefficients of a series expansion to reconstruct the regression function. In this chapter we focus on nonlinear orthogonal series estimates, where one applies a nonlinear transformation (thresholding) to the estimated coefficients. The most popular orthogonal series estimates use wavelets. We start our discussion of orthogonal series estimates by describing the motivation for using these wavelet estimates.

18.1 Wavelet Estimates

We introduce orthogonal series estimates in the context of regression estimation with fixed, equidistant design, which is the field where they have been applied most successfully. Here one gets data (x_1, Y_1), . . . , (x_n, Y_n) according to the model

Y_i = m(x_i) + ε_i,   (18.1)

where x_1, . . . , x_n are fixed (nonrandom) equidistant points in [0, 1], ε_1, . . . , ε_n are i.i.d. random variables with Eε_1 = 0 and Eε_1² < ∞, and m is a function m : [0, 1] → R (cf. Section 1.9).

Recall that λ denotes the Lebesgue measure on [0, 1]. Assume m ∈ L2(λ) and let {f_j}_{j∈N} be an orthonormal basis in L2(λ), i.e.,

< f_j, f_k >_λ = ∫_0^1 f_j(x) f_k(x) dx = δ_{j,k}   (j, k ∈ N),

and each function in L2(λ) can be approximated arbitrarily well by linear combinations of the {f_j}_{j∈N}. Then m can be represented by its Fourier series with respect to {f_j}_{j∈N}:

m = ∑_{j=1}^{∞} c_j f_j   where   c_j = < m, f_j >_λ = ∫_0^1 m(x) f_j(x) dx.   (18.2)

In orthogonal series estimation we use estimates of the coefficients of the series expansion (18.2) in order to reconstruct the regression function.

In the model (18.1), where the x_1, . . . , x_n are equidistant in [0, 1], the coefficients c_j can be estimated by

ĉ_j = (1/n) ∑_{i=1}^{n} Y_i f_j(x_i)   (j ∈ N).   (18.3)

If (18.1) holds, then

ĉ_j = (1/n) ∑_{i=1}^{n} m(x_i) f_j(x_i) + (1/n) ∑_{i=1}^{n} ε_i f_j(x_i),

where, hopefully,

(1/n) ∑_{i=1}^{n} m(x_i) f_j(x_i) ≈ ∫_0^1 m(x) f_j(x) dx = c_j

and

(1/n) ∑_{i=1}^{n} ε_i f_j(x_i) ≈ E{ (1/n) ∑_{i=1}^{n} ε_i f_j(x_i) } = 0.

The traditional way of using these estimated coefficients to construct an estimate m_n of m is to truncate the series expansion at some index K and to plug in the estimated coefficients:

m_n = ∑_{j=1}^{K} ĉ_j f_j.   (18.4)

Here one tries to choose K such that the set of functions {f_1, . . . , f_K} is the "best" among all subsets {f_1}, {f_1, f_2}, {f_1, f_2, f_3}, . . . of {f_j}_{j∈N} in view of the error of the estimate (18.4). This implicitly assumes that the most important information about m is contained in the first K coefficients of the series expansion (18.2).

A way of overcoming this assumption was proposed by Donoho and Johnstone (1994). It consists of thresholding the estimated coefficients, e.g., to use all those coefficients whose absolute value is greater than some threshold δ_n (so-called hard thresholding). This leads to estimates of the form

m_n = ∑_{j=1}^{K} η_{δ_n}(ĉ_j) f_j,   (18.5)

where K is usually much larger than the K in (18.4), δ_n > 0 is a threshold, and

η_{δ_n}(c) = c if |c| > δ_n,   η_{δ_n}(c) = 0 if |c| ≤ δ_n.   (18.6)

As we will see in Section 18.3, this basically tries to find the "best" of all subsets of {f_1, . . . , f_K} in view of the estimate (18.5).
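As a small illustration of (18.3)–(18.6) (not from the book), the sketch below uses the cosine basis f_1 ≡ 1, f_{j+1}(x) = √2 cos(jπx) on [0, 1]; the values of K, δ_n, the noise level, and the regression function are arbitrary choices.

```python
import numpy as np

def cosine_basis(j, x):
    """Orthonormal cosine basis on [0,1]: f_1 = 1, f_{j+1}(x) = sqrt(2) cos(j pi x)."""
    return np.ones_like(x) if j == 1 else np.sqrt(2.0) * np.cos((j - 1) * np.pi * x)

def hard_threshold_estimate(x, Y, K, delta):
    """Estimate the c_j by (18.3) and keep only coefficients with |c_j| > delta, as in (18.5)-(18.6)."""
    c_hat = np.array([np.mean(Y * cosine_basis(j, x)) for j in range(1, K + 1)])
    c_hat[np.abs(c_hat) <= delta] = 0.0                      # hard thresholding (18.6)
    return lambda t: sum(c * cosine_basis(j, t) for j, c in enumerate(c_hat, start=1) if c != 0.0)

# fixed equidistant design with illustrative noise level and threshold
n = 256
x = (np.arange(n) + 0.5) / n
Y = np.abs(x - 0.5) + 0.1 * np.random.default_rng(0).standard_normal(n)
m_n = hard_threshold_estimate(x, Y, K=64, delta=0.05)
```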

The most popular choice for the orthogonal system {f_j}_{j∈N} are the so-called wavelet systems, where the f_j are constructed by translation of a so-called father wavelet and by translation and dilatation of a so-called mother wavelet. For these wavelet systems the series expansion (18.2) of many functions contains only a few nonzero coefficients. This, together with choosing a subset of the orthonormal system by hard thresholding, leads to estimates which achieve a nearly optimal minimax rate of convergence for a variety of function spaces (e.g., Holder, Besov, etc.) (for references see Section 18.7). In particular, these estimates are able to adapt to local irregularities (e.g., jump discontinuities) of the regression function, a property which classical linear smoothers like kernel estimators with fixed bandwidth do not have.

Motivated by the success of these estimates for fixed design regression, similar estimates were also applied for random design regression, where one has i.i.d. data (X_1, Y_1), . . . , (X_n, Y_n). One difficulty to overcome here is to find a reasonable way to estimate the coefficients c_j. If X is uniformly distributed on [0, 1], then one can use the same estimate as for fixed, equidistant x_1, . . . , x_n:

ĉ_j = (1/n) ∑_{i=1}^{n} Y_i f_j(X_i),

because, in this case,

E ĉ_j = E{ E{Y_1 f_j(X_1) | X_1} } = E{ m(X_1) f_j(X_1) } = c_j.

Clearly, this is not a reasonable estimate if X is not uniformly distributed on [0, 1]. In this case, it was suggested in the literature to use the data (X_1, Y_1), . . . , (X_n, Y_n) to construct new, equidistant data (x_1, Ȳ_1), . . . , (x_n, Ȳ_n), where x_1, . . . , x_n are equidistant in [0, 1] and Ȳ_i is an estimate of m(x_i), and then to apply (18.3) to these new data (for references, see Section 18.7). Results concerning the rate of convergence of these estimates have only been derived under the assumption that X has a density with respect to the Lebesgue–Borel measure, which is bounded away from infinity on [0, 1]. If this assumption is true, then the L2 error can be bounded by some constant times

∫_{[0,1]} |m_n(x) − m(x)|² dx,

∫|mn(x) −m(x)|2µ(dx),

because in the above term one integrates with respect to µ and not withrespect to λ.

18.2 Empirical Orthogonal Series Estimates

If µ is not “close” to the uniform distribution, then a natural approachis to estimate an orthonormal expansion of m in L2(µ). Clearly, this isnot possible, because µ (i.e., the distribution of X) is unknown in anapplication.

What we do in the sequel is to use an orthonormal series expansion of min L2(µn) rather than in L2(λ), where µn is the empirical measure of theX1, . . . , Xn, i.e.,

µn(A) =1n

n∑

i=1

IA(Xi) (A ⊆ R).

We will call the resulting estimates empirical orthogonal series estimates.For f, g : [0, 1] → R define

< f, g >n=1n

n∑

i=1

f(Xi)g(Xi) and ‖f‖2n =< f, f >n .

In Section 18.4 we will describe a way to construct an orthonormal systemfj=1,...,K in L2(µn), i.e., functions f1, . . . ,fK : [0, 1] → R which satisfy

< fj , fk >n= δj,k (j, k = 1, . . . ,K).

Given such an orthonormal system, the best approximation with respectto ‖ · ‖n of m by functions in spanf1, . . . , fK is given by

K∑

j=1

cjfj where cj =< m, fj >n=1n

n∑

i=1

m(Xi)fj(Xi). (18.7)

Page 374: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

18.2. Empirical Orthogonal Series Estimates 357

We will estimate the coefficients in (18.7) by

cj =1n

n∑

i=1

Yifj(Xi), (18.8)

and use hard thresholding to construct the estimate

mn =K∑

j=1

ηδn(cj)fj (18.9)

of m, where δn > 0 is the threshold and ηδn is defined by (18.6). Finally,we truncate the estimate at some data-independent height βn, i.e., we set

mn(x) = (Tβnmn)(x) =

βn if mn(x) > βn,mn(x) if −βn ≤ mn(x) ≤ βn,

−βn if mn(x) < −βn,(18.10)

where βn > 0 and βn → ∞ (n → ∞).Figures 18.1–18.3 show applications of this estimate to our standard data

example with different thresholds.

−1 −0.5 0.5 1

0.5

Figure 18.1. L2 error = 0.021336, δ = 0.02.

Page 375: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

358 18. Orthogonal Series Estimates

−1 −0.5 0.5 1

0.5

Figure 18.2. L2 error = 0.012590, δ = 0.04.

−1 −0.5 0.5 1

0.5

Figure 18.3. L2 error = 0.014011, δ = 0.06.

18.3 Connection with Least Squares Estimates

Let fjj=1,...,K be a family of functions fj : R → R. For J ⊆ 1, . . . ,Kdefine Fn,J as the linear span of those fj ’s with j ∈ J , i.e.,

Fn,J =

j∈J

ajfj : aj ∈ R (j ∈ J)

. (18.11)

Page 376: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

18.3. Connection with Least Squares Estimates 359

Recall that the least squares estimate mn,J of m in Fn,J is defined by

mn,J ∈ Fn,J and1n

n∑

i=1

|mn,J (Xi) − Yi|2 = minf∈Fn,J

1n

n∑

i=1

|f(Xi) − Yi|2.(18.12)

Using (18.11) this can be rewritten as

mn,J =∑

j∈J

a∗jfj

for some a∗ = a∗jj∈J ∈ R|J| which satisfies

1n

‖Ba∗ − Y‖22 = min

a∈R|J|

1n

‖Ba− Y‖22, (18.13)

where

B = (fj(Xi))1≤i≤n,j∈J and Y = (Y1, . . . , Yn)T .

As we have mentioned in Chapter 10, (18.13) is equivalent to

1nBT Ba∗ =

1nBT Y, (18.14)

which is the so-called normal equation of the least squares problem.Now consider the special case that fjj=1,...,K is orthonormal in L2(µn).

Then1nBT B = (< fj , fk >n)j,k∈J = (δj,k)j,k∈J ,

and therefore the solution of (18.14) is given by

a∗j =

1n

n∑

i=1

fj(Xi)Yi (j ∈ J). (18.15)

Define a∗j by (18.15) for all j ∈ 1, . . . ,K, and set

J =j ∈ 1, . . . ,K : |a∗

j | > δn. (18.16)

Then the orthogonal series estimate mn defined in Section 18.2 satisfies

mn = mn,J ,

and so does the least squares estimate of m in Fn,J as well. So hard thresh-olding can be considered as a way of choosing one of the 2K least squaresestimates mn,J (J ⊆ 1, . . . ,K).

In Chapter 12 we have introduced complexity regularization as anothermethod to use the data to select one estimate of a family of least squaresestimates. There one minimizes the sum of the empirical L2 risk and apenalty term, where the penalty term is approximately equal to the number

Page 377: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

360 18. Orthogonal Series Estimates

of parameters divided by n. To apply it in this context, we define the penaltyterm by

penn(J) = cn|J |n

(J ⊆ 1, . . . ,K),

where cn > 0 is defined below, and choose

J∗ ⊆ 1, . . . ,Ksuch that

1n

n∑

i=1

|mn,J∗(Xi) − Yi|2 + penn(J∗)

= minJ⊆1,...,K

1n

n∑

i=1

|mn,J (Xi) − Yi|2 + penn(J)

. (18.17)

For properly defined cn, J defined by (18.16) minimizes (18.17). To seethis, observe that, for f =

∑Kj=1 bjfj ,

1n

n∑

i=1

(f(Xi) − mn,1,...,K(Xi)

) · (mn,1,...,K(Xi) − Yi

)

=1n

(bT − a∗T

)BT (Ba∗ − Y)

=1n

(bT − a∗T

)· (BT Ba∗ − BTY

)

=1n

(bT − a∗T

)· 0 (because of (18.14))

= 0,

which, together with the orthonormality of fjj=1,...,K , implies

1n

n∑

i=1

|f(Xi) − Yi|2

=1n

n∑

i=1

|mn,1,...,K(Xi) − Yi|2 +1n

n∑

i=1

|f(Xi) − mn,1,...,K(Xi)|2

=1n

n∑

i=1

|mn,1,...,K(Xi) − Yi|2 +K∑

j=1

|bj − a∗j |2.

Hence

1n

n∑

i=1

|mn,J (Xi) − Yi|2 + penn(J)

Page 378: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

18.4. Empirical Orthogonalization of Piecewise Polynomials 361

=1n

n∑

i=1

|mn,1,...,K(Xi) − Yi|2 +∑

j ∈J

|a∗j |2 + cn

|J |n

=1n

n∑

i=1

|mn,1,...,K(Xi) − Yi|2 +K∑

j=1

|a∗

j |2Ij /∈J +cnnIj∈J

≥ 1n

n∑

i=1

|mn,1,...,K(Xi) − Yi|2 +K∑

j=1

min

|a∗j |2,

cnn

.

Setting cn = nδ2n one gets

min

|a∗j |2,

cnn

= |a∗

j |2I|a∗j|2≤δ2

n +cnnI|a∗

j|2>δ2

n

= |a∗j |2Ij /∈J +

cnnIj∈J (j ∈ 1, . . . ,K).

Collecting the above results we get

1n

n∑

i=1

|mn,J (Xi) − Yi|2 + penn(J)

≥ 1n

n∑

i=1

|mn,1,...,K(Xi) − Yi|2 +K∑

j=1

|a∗

j |2Ij /∈J +cnnIj∈J

=1n

n∑

i=1

|mn,J(Xi) − Yi|2 + penn(J).

This proves that (18.17) is minimized by J .It follows that (18.12) and (18.17) is an alternative way to define the

estimate of Section 18.2. Observe that it is difficult to compute the estimateusing (18.17), because there one has to minimize the penalized empirical L2error over 2K , i.e., exponential many function spaces. On the other hand,it is easy to compute the estimate if one uses the definition of Section 18.2.The estimate can be computed in O(n log n) time (cf. Problem 18.1).

However, the definition given in this section will be useful in proving theasymptotic properties of the estimate.

18.4 Empirical Orthogonalization of PiecewisePolynomials

Next we define an orthonormal system in L2(µn) by orthonormalizing piece-wise polynomials. Fix X1, . . . , Xn, and denote their different values by x1,. . . , xn′ , i.e.,

n′ ≤ n, x1, . . . , xn′ = X1, · · · , Xn, and x1 < · · · < xn′ .

Page 379: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

362 18. Orthogonal Series Estimates

0 1X6 X2 X1 = X4 X5 X3

A00

0 1X6 X2 X1 = X4 X5 X3

A10 A1

1

0 1X6 X2 X1 = X4 X5 X3

A20 A2

1 A22 A2

3

0 1X6 X2 X1 = X4 X5 X3

A30 A3

1 A32 A3

3 A34 A3

5 A36 A3

7

Figure 18.4. Example for construction of Pl (l ∈ 0, 1, 2, 3).

For nonatomic µ we will have n′ = n with probability one.We start by defining partitions P l of [0, 1] (l ∈ N0). Each P l con-

sists of 2l intervals Al0, . . . , Al

2l−1. Depending on x1, . . . , xn′ they arerecursively defined as follows: set A0

0 = [0, 1] and P0 = [0, 1]. GivenP l = Al

0, . . . , Al2l−1, define P l+1 = Al+1

0 , . . . , Al+12l+1−1 by subdividing

each interval Alj into two intervals Al+1

2j , Al+12j+1, such that each of these two

intervals contains nearly the same number of the x1, . . . , xn′ , i.e.,

Alj = Al+1

2j ∪Al+12j+1, Al+1

2j ∩Al+12j+1 = ∅,

and∣∣|i : xi ∈ Al+1

2j | − |i : xi ∈ Al+12j+1|∣∣ ≤ 1.

This is always possible because the x1, . . . , xn′ are pairwise distinct.Using these nested partitions P0, P1, . . . , we define the nested spaces of

piecewise polynomials V M0 , V M

1 , . . . , where M ∈ N0 denotes the degree ofthe polynomials. Let V M

l be the set of all piecewise polynomials of degreenot greater than M with respect to P l, i.e.,

V Ml =

f(x) =

2l−1∑

j=0

M∑

k=0

aj,kxk · IAl

j(x) : aj,k ∈ R

.

Clearly, V M0 ⊆ V M

1 ⊆ V M2 ⊆ · · ·. We will construct an orthonormal basis

of

V Mlog2(n)

in L2(µn).

Page 380: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

18.4. Empirical Orthogonalization of Piecewise Polynomials 363

0 1

0 1

0 1

0 1

Figure 18.5. Example of functions from V 0l , l ∈ 0, 1, 2, 3.

1

1f1

1

1√2

−√2

f2

1

√32

√3

−√

32

−√3

f3

Figure 18.6. Construction of piecewise constant orthonormal functions.

To do this, we first decompose V Ml+1 into an orthogonal sum of spaces

UMl+1,0, . . . , UM

l+1,2l−1, i.e, we construct orthogonal spaces UMl+1,0, . . . ,

Page 381: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

364 18. Orthogonal Series Estimates

UMl+1,2l−1 with the property that the set of all functions of the form

∑2l−1j=0 fj

with fj ∈ UMl+1,j is equal to V M

l+1. Observe that each f ∈ V Ml+1 can be written

as a sum

f =2l−1∑

j=0

fj where fj = f · IAlj

∈ V Ml+1.

Clearly, the supports of the f0, . . . , f2l−1 are all disjoint, which impliesthat the f0, . . . , f2l−1 are orthogonal with respect to <,>n. Hence,

V Ml+1 =

2l−1⊕

j=0

UMl+1,j with UM

l+1,j = f · IAlj

: f ∈ V Ml+1

is an orthogonal decomposition of V Ml+1.

Let BMl+1,j be an orthonormal basis of the orthogonal complement of

V Ml ∩ UM

l+1,j =f · IAl

j: f ∈ V M

l

=

M∑

k=0

akxkIAl

j: a0, . . . , aM ∈ R

in

UMl+1,j =

f · IAl

j: f ∈ V M

l+1

,

i.e., of the set of all functions in UMl+1,j which are orthogonal to V M

l ∩UMl+1,j .

Such an orthonormal basis can be computed easily: Assume g is an elementof the orthogonal complement of V M

l ∩ UMl+1,j in UM

l+1,j . Then g ∈ UMl+1,j ,

which implies

g(x) =M∑

k=0

akxk · IAl+1

2j(x) +

M∑

k=0

bkxk · IAl+1

2j+1(x) (x ∈ [0, 1]) (18.18)

for some a0, . . . , aM , b0, . . . , bM ∈ R. Furthermore, g is orthogonal toV M

l ∩ UMl+1,j , which is equivalent to assuming that g is orthogonal (with

respect to <,>n) to

1 · IAlj, x · IAl

j, . . . , xM · IAl

j.

This leads to a homogeneous linear equation system for the coefficientsa0, . . . , bM of g. Hence all the functions of the orthogonal complementof V M

l ∩ UMl+1,j in UM

l+1,j can be computed by solving a linear equationsystem, and an orthonormal basis of this orthogonal complement can becomputed by orthonormalizing the solutions of this linear equation systemwith respect to the scalar product induced by <,>n.

Set

BMl+1 = BM

l+1,0 ∪ · · · ∪ BMl+1,2l−1.

Page 382: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

18.4. Empirical Orthogonalization of Piecewise Polynomials 365

Then it is easy to see that BMl+1 is an orthonormal basis of the orthogo-

nal complement of V Ml in V M

l+1 (cf. Problem 18.2). Choose an arbitraryorthonormal basis BM

0 of V M0 . Then

B = BM0 ∪ · · · ∪ BM

log2(n)

is an orthonormal basis of V Mlog2(n) in L2(µn). This is the orthonormal

system we use for the estimate defined in Section 18.2, i.e., we set

fjj=1,...,K = B.Let P be an arbitrary partition of [0, 1] consisting of intervals. The main

property of the orthonormal system fjj=1,...,K defined above is that anypiecewise polynomial of degree M (or less) with respect to P can be rep-resented in L2(µn) by a linear combination of only slightly more than |P|of the fj ’s. More precisely,

Lemma 18.1. Let f1, . . . , fK be the family of functions constructedabove.(a) Each fj is a piecewise polynomial of degree M (or less) with respect toa partition consisting of four or less intervals.(b) Let P be a finite partition of [0, 1] consisting of intervals, and let f bea piecewise polynomial of degree M (or less) with respect to this partitionP. Then there exist coefficients a0, . . . , aK ∈ R such that

f(Xi) =K∑

j=1

ajfj(Xi) (i = 1, . . . , n)

and

|j : aj = 0| ≤ (M + 1)(log2(n) + 1) · |P|≤ 2(M + 1)(log(n) + 1) · |P|.

Proof. (a) For fi ∈ BM0 the assertion is trivial. If fi /∈ BM

0 , then fi ∈ BMl+1,j

for some 0 ≤ l < log2(n), j ≤ 2l, and the assertion follows from (18.18).(b) The second inequality follows from

log2(n) + 1 ≤ log(n)log(2)

+ 2 ≤ 2(log(n) + 1),

hence it suffices to show the first inequality.By construction each interval of the partition Plog2(n) contains at most

one of the X1, . . . , Xn, which implies that there exists f ∈ V Mlog2(n) such

that

f(Xi) = f(Xi) (i = 1, . . . , n).

Page 383: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

366 18. Orthogonal Series Estimates

Choose a1, . . . , aK ∈ R such that

f =K∑

j=1

ajfj .

Then

f(Xi) =K∑

j=1

ajfj(Xi) (i = 1, . . . , n).

Since f1, . . . , fK are orthonormal w.r.t. < ·, · >n we get

< f, fk >n=K∑

j=1

aj < fj , fk >n= ak.

Hence it suffices to show∣∣∣f ∈ BM

l+1 : < f, f >n = 0∣∣∣ ≤ (M + 1) · |P| (0 ≤ l < log2(n)).

There are at most |P|−1 indices j such that f is not equal to a polynomialof degree M (or less) on Al

j . Since each f ∈ BMl+1,j is orthogonal to every

function which is equal to a polynomial on Alj we get

∣∣∣f ∈ BM

l+1 : < f, f >n = 0∣∣∣ ≤

j : f is not equal to a polynomial

of degree M (or less) on Alj

|BMl+1,j |

≤ (|P| − 1) · (M + 1),

which implies the assertion.

18.5 Consistency

In this section we study the consistency of our orthogonal series estimate.For simplicity we only consider the case X ∈ [0, 1] a.s. It is straightforwardto modify the definition of the estimate such that the resulting estimate isweakly and strongly universally consistent for univariate X (cf. Problem18.5).

In order to be able to show strong consistency of the estimate, we needthe following slight modification of its definition. Let α ∈ (0, 1

2 ). Depend-ing on the data let the functions fj and the estimated coefficients cj bedefined as in Sections 18.2 and 18.4. Denote by (c(1), f(1)), . . . , (c(K), f(K))a permutation of (c1, f1), . . . , (cK , fK) with the property

|c(1)| ≥ |c(2)| ≥ · · · ≥ |c(K)|. (18.19)

Page 384: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

18.5. Consistency 367

Define the estimate mn by

mn =minK,n1−α∑

j=1

ηδn(c(j))f(j). (18.20)

This ensures that mn is a linear combination of no more than n1−α of thefj ’s. As in Section 18.3 one can show that (18.20) implies

mn = mn,J∗ for some J∗ ⊆ 1, . . . ,K, (18.21)

where J∗ satisfies |J∗| ≤ n1−α and

1n

n∑

i=1

|mn,J∗(Xi) − Yi|2 + penn(J∗)

= minJ⊆1,...,K,

|J|≤n1−α

1n

n∑

i=1

|mn,J (Xi) − Yi|2 + penn(J)

. (18.22)

With this modification of the estimate we are able to show:

Theorem 18.1. Let M ∈ N0 be fixed. Let the estimate mn be defined by(18.8), (18.19), (18.20), and (18.10) with βn = log(n) and δn ≤ 1

(log(n)+1)2 .Then

∫|mn(x) −m(x)|2µ(dx) → 0 (n → ∞) a.s.

and

E∫

|mn(x) −m(x)|2µ(dx) → 0 (n → ∞)

for every distribution of (X,Y ) with X ∈ [0, 1] a.s. and EY 2 < ∞.

Proof. We will only show the almost sure convergence of the L2 error tozero, the weak consistency can be derived in a similar way using argumentsfrom the proof of Theorem 10.2 (cf. Problem 18.4).

Let L > 0. Set YL = TLY , Y1,L = TLY1, . . . , Yn,L = TLYn. Let Fn bethe set of all piecewise polynomials of degree M (or less) with respect to apartition of [0, 1] consisting of 4n1−α or less intervals, let GM be the set ofall polynomials of degree M (or less), let Pn be an equidistant partition of[0, 1] into log(n) intervals, and denote by GM Pn the set of all piecewisepolynomials of degree M (or less) w.r.t. Pn.

In the first step of the proof we show that the assertion follows from

inff∈GM Pn,‖f‖∞≤log(n)

∫|f(x) −m(x)|2µ(dx) → 0 (n → ∞) (18.23)

Page 385: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

368 18. Orthogonal Series Estimates

and

supf∈Tlog(n)Fn

∣∣∣∣∣1n

n∑

i=1

|f(Xi) − Yi,L|2 − E|f(X) − YL|2∣∣∣∣∣→ 0 (n → ∞)

(18.24)a.s. for every L > 0. In the second and third steps, we will show (18.23)and (18.24), respectively.

So, assume temporarily that (18.23) and (18.24) hold. Because of∫

|mn(x) −m(x)|2µ(dx) = E |mn(X) − Y |2∣∣Dn

− E|m(X) − Y |2

it suffices to showE |mn(X) − Y |2∣∣Dn

12 −

E|m(X) − Y |212 → 0 a.s. (18.25)

We use the decomposition

0 ≤ E |mn(X) − Y |2∣∣Dn

12 −

E|m(X) − Y |212

=

(E |mn(X) − Y |2∣∣Dn

12

− inff∈GM Pn,

‖f‖∞≤log(n)

E|f(X) − Y |2

12

)

+

(

inff∈GM Pn,

‖f‖∞≤log(n)

E|f(X) − Y |2

12 −

E|m(X) − Y |212

)

.

(18.26)

It follows from (18.23) by the triangle inequality that the second term of(18.26) converges to zero. Therefore for (18.25) it suffices to show

lim supn→∞

(E |mn(X) − Y |2∣∣Dn

12 − inf

f∈GM Pn,

‖f‖∞≤log(n)

E|f(X) − Y |2

12

)

≤ 0

(18.27)a.s. To this end, let L > 0 be arbitrary. We can assume w.l.o.g. thatlog(n) > L. Then

E |mn(X) − Y |2∣∣Dn

12 − inf

f∈GM Pn,‖f‖∞≤log(n)

E|f(X) − Y |2

12

= supf∈GM Pn,‖f‖∞≤log(n)

E |mn(X) − Y |2∣∣Dn

12 −

E|f(X) − Y |212

≤ supf∈GM Pn,‖f‖∞≤log(n)

E |mn(X) − Y |2∣∣Dn

12

Page 386: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

18.5. Consistency 369

−E |mn(X) − YL|2∣∣Dn

12

+E |mn(X) − YL|2∣∣Dn

12 −

1n

n∑

i=1

|mn(Xi) − Yi,L|2 1

2

+

1n

n∑

i=1

|mn(Xi) − Yi,L|2 1

2

1n

n∑

i=1

|mn(Xi) − Yi,L|2 1

2

+

1n

n∑

i=1

|mn(Xi) − Yi,L|2 1

2

1n

n∑

i=1

|mn(Xi) − Yi|2 1

2

+

1n

n∑

i=1

|mn(Xi) − Yi|2 1

2

1n

n∑

i=1

|f(Xi) − Yi|2 1

2

+

1n

n∑

i=1

|f(Xi) − Yi|2 1

2

1n

n∑

i=1

|f(Xi) − Yi,L|2 1

2

+

1n

n∑

i=1

|f(Xi) − Yi,L|2 1

2

− E|f(X) − YL|2

12

+E|f(X) − YL|2

12 −

E|f(X) − Y |212.

Now we give upper bounds for the terms in each row of the right-hand sideof the last inequality.

The second and seventh terms are bounded above by

supf∈Tlog(n)Fn

∣∣∣∣∣∣

1n

n∑

i=1

|f(Xi) − Yi,L|2 1

2

− E|f(X) − YL|2

12

∣∣∣∣∣∣

(observe mn = Tlog(n)mn and mn ∈ Fn,J∗ ⊆ Fn). For the third termobserve that if x, y ∈ R with |y| ≤ log(n) and z = Tlog(n)x, then |z − y| ≤|x− y|. Therefore the third term is not greater than zero.

Next we bound the fifth term. Fix f ∈ GM Pn. By definition of Pn andLemma 18.1 there exists J ⊆ 1, . . . , n and f ∈ Fn,J such that

f(Xi) = f(Xi) (i = 1, . . . , n) and |J | ≤ 2(M + 1)(log(n) + 1)2.

This, together with (18.22), implies

1n

n∑

i=1

|mn(Xi) − Yi|2 − 1n

n∑

i=1

|f(Xi) − Yi|2

=1n

n∑

i=1

|mn(Xi) − Yi|2 − 1n

n∑

i=1

|f(Xi) − Yi|2

Page 387: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

370 18. Orthogonal Series Estimates

≤ penn(J)

≤ nδ2n2(M + 1)(log(n) + 1)2

n.

Using these upper bounds and the triangle inequality for the remainingterms one gets

E |mn(X) − Y |2∣∣Dn

12 − inf

f∈GM Pn,‖f‖∞≤log(n)

E|f(X) − Y |2

12

≤ 2 · E|Y − YL|212 + 2 ·

1n

n∑

i=1

|Yi − Yi,L|2 1

2

+2(M + 1)δ2n(log(n) + 1)2

+2 · supf∈Tlog(n)Fn

∣∣∣∣∣∣

1n

n∑

i=1

|f(Xi) − Yi,L|2 1

2

− E|f(X) − YL|2

12

∣∣∣∣∣∣.

Because of (18.24), δn ≤ 1(log(n)+1)2 , and the strong law of large numbers

lim supn→∞

(E |mn(X) − Y |2∣∣Dn

12 − inf

f∈GM Pn,

‖f‖∞≤log(n)

E|f(X) − Y |2

12

)

≤ 4 · E|Y − YL|212 a.s.

With L → ∞ one gets the assertion.In the second step we prove (18.23). Since m can be approximated ar-

bitrarily closely in L2(µ) by continuously differentiable functions, we mayassume w.l.o.g. that m is continuously differentiable. For each A ∈ Pn

choose some xA ∈ A and set f∗ =∑

A∈Pnm(xA)IA. Then f∗ ∈ GM Pn,

and for n sufficiently large (i.e., for n such that ‖m‖∞ ≤ log(n)) we get

inff∈GM Pn,‖f‖∞≤log(n)

∫|f(x) −m(x)|2µ(dx)

≤ supx∈[0,1]

|f∗(x) −m(x)|2

≤ c

log2(n)→ 0 (n → ∞),

where c is some constant depending on the first derivative of m.In the third step we prove (18.24). Let L > 0. W.l.o.g. we may assume

L ≤ log(n). Set

Hn :=h : R × R → R : h(x, y) = |f(x) − TLy|2 ((x, y) ∈ Rd × R)

for some f ∈ Tlog(n)Fn

.

Page 388: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

18.5. Consistency 371

For h ∈ Hn one has 0 ≤ h(x, y) ≤ 4 log(n)2 ((x, y) ∈ R × R). Using thenotion of covering numbers and Theorem 9.1, one concludes

P

supf∈Tlog(n)Fn

∣∣∣∣∣1n

n∑

i=1

|f(Xi) − Yi,L|2 − E|f(X) − YL|2∣∣∣∣∣> t

= P

suph∈Hn

∣∣∣∣∣1n

n∑

i=1

h(Xi, Yi) − Eh(X,Y )

∣∣∣∣∣> t

≤ 8E

N1

(t

8,Hn, (X,Y )n

1

)exp

(− nt2

2048 log(n)4

). (18.28)

Next we bound the covering number in (18.28). Observe first that if

hj(x, y) = |fj(x) − Tlog(n)y|2 ((x, y) ∈ R × R)

for some functions fj bounded in absolute value by log(n) (j = 1, 2), then

1n

n∑

i=1

|h1(Xi, Yi) − h2(Xi, Yi)|

=1n

n∑

i=1

|f1(Xi) − Tlog(n)Yi + f2(Xi) − Tlog(n)Yi| · |f1(Xi) − f2(Xi)|

≤ 4 log(n)1n

n∑

i=1

|f1(Xi) − f2(Xi)|.

Thus

N1

(t

8,Hn, (X,Y )n

1

)≤ N1

(t

32 log(n), Tlog(n)Fn, X

n1

). (18.29)

Using the notion of VC dimension and partitioning numbers, Theorem 9.4,and Lemma 13.1, one gets

N1

(t

32 log(n), Tlog(n)Fn, X

n1

)

≤ ∆n(P)

supz1,...,zl∈Xn

1 ,l≤nN1

(t

32 log(n), Tlog(n)GM , zl

1

)4n1−α

≤ ∆n(P)

36e log(n)

t32 log(n)

2VTlog(n)G+

M

4n1−α

= ∆n(P)

576e log2(n)t

2VTlog(n)G+

M

4n1−α

, (18.30)

Page 389: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

372 18. Orthogonal Series Estimates

where P is the set of all partitions of [0, 1] consisting of 4n1−α or lessintervals. Example 13.1 in Chapter 13 implies

∆n(P) ≤ (n+ 4n1−α

)4n1−α

≤ (5n)4n1−α

. (18.31)

Furthermore, one easily concludes, from the definition of the VC dimension,that

VTlog(n)G+M

≤ VG+M,

which, together with Theorem 9.5, implies

VTlog(n)G+M

≤ M + 2. (18.32)

It follows from (18.28)–(18.32),

P

supf∈Tlog(n)Fn

∣∣∣∣∣1n

n∑

i=1

|f(Xi) − Yi,L|2 − E|f(X) − YL|2∣∣∣∣∣> t

≤ 8(5n)4n1−α

(576e log2(n)

t

)8(M+2)n1−α

exp(

− nt2

2048 log4(n)

),

from which one gets the assertion by an application of the Borel–Cantellilemma.

18.6 Rate of Convergence

In this section we study the rate of convergence of our orthogonal seriesestimate. For simplicity we assume that Y is bounded, i.e., |Y | ≤ L a.s. forsome L ∈ R+, and that we know this bound L. Instead of truncating theestimate at log(n) we truncate it at L, i.e., we set βn = L (n ∈ N ).

In order to illustrate the next theorem let us compare our estimate withan ideal (but not practical) estimate defined by fitting a piecewise polyno-mial to the data least squares, where the partition is chosen optimally forthe (unknown) underlying distribution.

Denote by Pk the family of all partitions of [0, 1] consisting of k intervals,and let GM be the set of all polynomials of degree M (or less). For apartition P ∈ Pk let

GM P =

f =∑

A∈PgAIA : gA ∈ GM (A ∈ P)

be the set of all piecewise polynomials of degree M (or less) with respectto P, and set

GM Pk =⋃

P∈Pk

GM P.

Page 390: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

18.6. Rate of Convergence 373

For k ∈ N and P ∈ Pk consider the estimation of m by a piecewise poly-nomial contained in GM P. Clearly, the estimate cannot approximate mbetter than the “best” piecewise polynomial in GM P, hence its L2 erroris not smaller than

inff∈GM P

∫|f(x) −m(x)|2µ(dx).

Furthermore, least squares fitting of the piecewise polynomial to the datarequires estimating (M + 1) · k parameters, which induces an error of atleast

(M + 1) · kn

.

Thus, if one chooses k ∈ N and P ∈ Pk optimally for the distribution of(X,Y ), then the L2 error of the estimate will be at least

mink∈N

(M + 1) · k

n+ inf

f∈GM Pk

∫|f(x) −m(x)|2µ(dx)

.

The next theorem states that for bounded Y our estimate achieves thiserror bound up to a logarithmic factor.

Theorem 18.2. Let L ∈ R+, n ∈ N , and let the estimate mn be definedby (18.8), (18.19), (18.20), and (18.10) with βn = L and

δn =

clog2(n)n

where c > 0 is some arbitrary constant. Then there exists a constant c,which depends only on L and c, such that

E∫

|mn(x) −m(x)|2µ(dx)

≤ min1≤k≤ n1−α

2(M+1)(log(n)+1)

4c(log(n) + 1)3(M + 1)k

n

+2 inff∈GM Pk

∫|f(x) −m(x)|2µ(dx)

+clog(n)n

for every distribution of (X,Y ) with |Y | ≤ L a.s.

Before proving the theorem we give two applications. First we considerthe case when m is a piecewise polynomial. Then the result above impliesthat the estimate achieves the parametric rate n−1 up to a logarithmicfactor.

Page 391: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

374 18. Orthogonal Series Estimates

Corollary 18.1. Let the estimate be defined as in Theorem 18.2. Then

E∫

|mn(x) −m(x)|2µ(dx) ∈ O

(log3(n)n

)

for every distribution of (X,Y ) with X ∈ [0, 1] a.s., |Y | ≤ L a.s., and mequal to a piecewise polynomial of degree M (or less) with respect to a finitepartition consisting of intervals.

As a second application we consider the estimation of a piecewise smoothfunction. Let C > 0 and p = q + r for some q ∈ N0 and 0 < r ≤ 1. Recallthat a function m : [0, 1] → R is called (p, C)-smooth if its qth derivativem(q) exists and satisfies

|m(q)(x) −m(q)(z)| ≤ C|x− z|r (18.33)

for all x, z ∈ [0, 1]. It is called piecewise (p, C)-smooth with respect to apartition P = Ijj of [0, 1] consisting of intervals, if its qth derivativeexists on each interval Ij and satisfies (18.33) for all x, z ∈ Ij and all j.

Corollary 18.2. Let n ∈ N . Let the estimate be defined as in Theorem18.2, and let 0 < p ≤ M + 1 be arbitrary. Then there exists a constant c,which depends only on L and c, such that for n sufficiently large

E∫

|mn(x) −m(x)|2µ(dx) ≤ c · C 22p+1

(log3(n)n

) 2p2p+1

for every distribution of (X,Y ) with X ∈ [0, 1] a.s., |Y | ≤ L a.s., and mpiecewise (p, C)-smooth with respect to a finite partition P, which consistsof intervals.

Proof. Let Pn be a partition which is a refinement of P and which consistsof

|Pn| = |P| +

⌈(C2n

log3(n)

) 12p+1

intervals of size not exceeding ( log3(n)C2n )

12p+1 . By approximating m on each

interval of Pn by a Taylor polynomial of degree M it follows from Lemma11.1 that

inff∈GM P|Pn|

∫|f(x) −m(x)|2µ(dx) ≤ inf

f∈GM Pn

supx∈[0,1]

|f(x) −m(x)|2

≤ C2(

log3(n)C2n

) 2p2p+1

.

This together with Theorem 18.2 implies the assertion.

Page 392: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

18.6. Rate of Convergence 375

Proof of Theorem 18.2. We will use the following error decomposition:

∫|mn(x) −m(x)|2µ(dx)

=E|mn(X) − Y |2|Dn

− E|m(X) − Y |2

−2

(1n

n∑

i=1

|mn(Xi) − Yi|2 − 1n

n∑

i=1

|m(Xi) − Yi|2 + penn(J∗)

)

+2

(1n

n∑

i=1

|mn(Xi) − Yi|2 − 1n

n∑

i=1

|m(Xi) − Yi|2 + penn(J∗)

)

=: T1,n + T2,n. (18.34)

In the first step of the proof we show

ET2,n ≤ 2 min1≤k≤ n1−α

2(M+1)(log(n)+1)

2c(log(n) + 1)3(M + 1)k

n

+ inff∈GM Pk

∫|f(x) −m(x)|2µ(dx)

.

(18.35)

To this end, we conclude from (18.10), |Y | ≤ L a.s. and (18.22)

T2,n

≤ 2

(1n

n∑

i=1

|mn(Xi) − Yi|2 + penn(J∗) − 1n

n∑

i=1

|m(Xi) − Yi|2)

= 2 minJ⊆1,...,K,

|J|≤n1−α

1n

n∑

i=1

|mn,J (Xi) − Yi|2 + penn(J)

− 1n

n∑

i=1

|m(Xi) − Yi|2

= 2 minJ⊆1,...,K,

|J|≤n1−α

minf∈Fn,J

1n

n∑

i=1

|f(Xi) − Yi|2 + penn(J)

− 1n

n∑

i=1

|m(Xi) − Yi|2

Page 393: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

376 18. Orthogonal Series Estimates

= 2 min1≤k≤n1−α

minJ⊆1,...,K,|J|=k,

f∈Fn,J

1n

n∑

i=1

|f(Xi) − Yi|2 + δ2nk

− 1n

n∑

i=1

|m(Xi) − Yi|2

.

By Lemma 18.1,

minf∈GM Pk

1n

n∑

i=1

|f(Xi)−Yi|2 ≥ minJ⊆1,...,K,

|J|≤2(M+1)(log(n)+1)k,f∈Fn,J

1n

n∑

i=1

|f(Xi)−Yi|2,

hence

ET2,n

≤ E

2 min1≤k≤ n1−α

2(M+1)(log(n)+1)

min

f∈GM Pk

1n

n∑

i=1

|f(Xi) − Yi|2

+2δ2n(M + 1)(log(n) + 1)k − 1n

n∑

i=1

|m(Xi) − Yi|2

≤ 2 min1≤k≤ n1−α

2(M+1)(log(n)+1)

2δ2n(M + 1)(log(n) + 1)k

+ inff∈GM Pk

∫|f(x) −m(x)|2µ(dx)

.

With δ2n = c log2(n)/n this implies the assertion of the first step.In the second step we show

ET2,n ≤ clog(n)n

. (18.36)

Let t > 0 be arbitrary. By definition of mn, mn ∈ TLFn,J∗ ⊆ TLGM P4|J∗|.Using this and penn(J) = δ2n|J | one gets

P T1,n > t

= PE|mn(X) − Y |2|Dn

− E|m(X) − Y |2

− 1n

n∑

i=1

|mn(Xi) − Yi|2 − |m(Xi) − Yi|2

>12(t+ 2 penn(J∗) + E

|mn(X) − Y |2|Dn

− E|m(X) − Y |2)

Page 394: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

18.6. Rate of Convergence 377

≤ P

∃1 ≤ k ≤ 4n1−α, ∃f ∈ TLGM Pk :

E|f(X) − Y |2 − E|m(X) − Y |2

− 1n

n∑

i=1

|f(Xi) − Yi|2 − |m(Xi) − Yi|2

>12(t+ 2δ2nk + E|f(X) − Y |2 − E|m(X) − Y |2)

≤∑

1≤k≤4n1−α

P ∃f ∈ TLGM Pk : ... .

To bound the above probability we use the notion of covering numbers andTheorem 11.4. This implies

PT1,n > t

≤∑

1≤k≤4n1−α

14 supxn1

N1

( 12δ

2nk

20L, TLGM Pk, x

n1

)

× exp(

−18 (t+ δ2nk)n

214(1 + 1/2)L4

)

=

1≤k≤4n1−α

14 supxn1

N1 (. . .) exp(

− δ2nkn

2568L4

)

exp

(− t n

2568L4

).

Proceeding as in the third step of the proof of Theorem 18.1 one gets, forthe covering number,

supxn1

N1

( 12δ

2nk

20L, TLGM Pk, x

n1

)≤ (5n)k

36eLδ2

nk40L

2(M+2)k

.

This, together with δn =√c log2(n)

n , implies

14 supxn1

N1

( 12δ

2nk

20L, TLGM Pk, x

n1

)exp

(− δ2nkn

2568L4

)≤ c

for some constant c > 0 depending only on L and c. This proves

PT1,n > t ≤ 4c · n · exp(− n

2568L4 t).

For arbitrary u > 0 it follows that

ET1,n ≤ u+∫ ∞

u

PT1,n > t dt = u+ 4c · 2568L4 exp(− n

2568L4u).

Page 395: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

378 18. Orthogonal Series Estimates

By setting

u =2568L4

nlog(c · n)

one gets (18.36) which, in turn (together with (18.34) and (18.35)), impliesthe assertion of Theorem 18.2.

18.7 Bibliographic Notes

For general introductions to wavelets see, e.g., Chui (1992), Daubechies(1992), or Meyer (1993). A description of various statistical applications ofwavelets can be found in Hardle et al. (1998).

In the context of fixed design regression it was shown by Donoho andJohnstone (1994), Donoho et al. (1995), and Donoho and Johnstone (1998)that orthogonal series estimates using thresholding and wavelets achieve anearly optimal minimax rate of convergence for a variety of function spaces(e.g., Holder, Besov, etc.).

Motivated by the success of these estimates, several different ways of ap-plying them to random design regression were proposed. In most of themone uses the given data to construct new, equidistant data and then appliesthe wavelet estimates for fixed design to these new data. Construction ofthe new, equidistant data can be done e.g., by binning (see Antoniadis,Gregoire, and Vial (1997)) or by using interpolation methods (see Halland Turlach (1997), Neumann and Spokoiny (1995), and Kovac and Silver-man (2000)). Under regularity conditions on the distribution of X it wasshown, in Hall and Turlach (1997) and Neumann and Spokoiny (1995), thatthese estimates are able to adapt to a variety of inhomogeneous smooth-ness assumptions and achieve, up to a logarithmic factor, the correspondingoptimal minimax rate of convergence.

To avoid regularity conditions on the design we used, in this chapter,the approach described in Kohler (2002a; 2000a). In particular, Theorems18.1 and 18.2 are due to Kohler (2002a; 2000a). There we analyzed orthog-onal series estimates by considering them as least squares estimates usingcomplexity regularization. In the context of fixed design regression, Engel(1994) and Donoho (1997) defined and analyzed data-dependent partition-ing estimates (i.e., special least squares estimates) by considering them asorthogonal series estimates.

Problems and Exercises

Problem 18.1. (a) Show that the orthogonal system defined in Section 18.4 canbe computed in O(n · log(n)) time.

Page 396: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

Problems and Exercises 379

(b) Show that the estimate defined in Sections 18.2 and 18.4 can be computed inO(n · log(n)) time.

Problem 18.2. Let V Ml , UM

l+1,j , and BMl+1,j be defined as in Section 18.4. Show

that

BMl+1 = BM

l+1,0 ∪ · · · ∪ BMl+1,2l−1

is the basis of the orthogonal complement of V Ml in V M

l+1.Hint: Show that:

(1) the functions in BMl+1 are contained in V M

l+1;

(2) the functions in BMl+1 are orthogonal to each function in V M

l ;

(3) the functions in BMl+1 are orthonormal; and

(4) if f ∈ V Ml+1 and f is orthogonal to each function in V M

l , then f · IAlj

can

be represented as a linear combination of the functions from BMl+1,j (and

hence f can be represented as a linear combination of the functions fromBM

l+1).

Problem 18.3. Show that (18.20) implies (18.21), (18.22), and |J∗| ≤ n1−α.

Problem 18.4. Prove the second part of Theorem 18.1, i.e., show that the ex-pected L2 error converges to zero for every distribution of (X, Y ) with X ∈ [0, 1]a.s. and EY 2 < ∞.

Hint: Show that the assertion follows from (18.23) and

E

supf∈Tlog(n)Fn

∣∣∣∣∣1n

n∑

i=1

|f(Xi) − Yi,L|2 − E|f(X) − YL|2∣∣∣∣∣

→ 0 (n → ∞).

Problem 18.5. Modify the definition of the orthogonal series estimate in such away that the resulting estimate is weakly and strongly universally consistent forunivariate X.

Hint: Proceed similarly as in Problem 10.6.

Page 397: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

19Advanced Techniques from EmpiricalProcess Theory

In Chapter 9 we used techniques from empirical process theory to bounddifferences between expectations and averages uniformly over some functionspace. We used the resulting inequalities (Theorems 9.1, 11.4, and 11.6)to analyze least squares estimates. Unfortunately, the rate of convergenceresults we have obtained thus far are optimal only up to a logarithmicfactor.

In the first three sections of this chapter we use some advanced tech-niques from empirical process theory to derive sharper inequalities. Theseresults are rather technical, but will be extremely useful. The main resultis Theorem 19.3 which we will use in Section 19.4 to derive the optimalrate of convergence for linear least squares estimates, for example, for suit-ably defined piecewise polynomial partitioning estimates. Furthermore, wewill use Theorem 19.3 in Chapter 21 to analyze the rate of convergence ofpenalized least squares estimates.

19.1 Chaining

In Chapter 9 we used finite covers of the underlying function spaces of somefixed size. In this section we apply instead the so-called chaining technique,which introduces a sequence of covers of increasing cardinality. There onehas to control a sum of covering numbers instead of one fixed coveringnumber, and this sum will be bounded above by an integral of coveringnumbers (cf. (19.2)).

Page 398: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

19.1. Chaining 381

Theorem 19.1. Let L ∈ R+ and let ε1, . . . , εn be independent randomvariables with expectation zero and values in [−L,L]. Let z1, . . . , zn ∈ Rd,let R > 0, and let F be a class of functions f : Rd → R with the property

‖f‖2n :=

1n

n∑

i=1

|f(zi)|2 ≤ R2 (f ∈ F). (19.1)

Then

√nδ ≥ 48

√2L

∫ R2

δ8L

(log N2(u,F , zn1 ))1/2

du (19.2)

and√nδ ≥ 36R · L

imply

P

supf∈F

∣∣∣∣∣1n

n∑

i=1

f(zi) · εi∣∣∣∣∣> δ

≤ 5 exp(

− nδ2

2304L2R2

).

For |F| = 1 the inequality above follows from Hoeffding’s inequality (seethe proof of Theorem 19.1 for details). For finite F , Hoeffding’s inequality(cf. Lemma A.3), (19.1), and the union bound imply

P

supf∈F

∣∣∣∣∣1n

n∑

i=1

f(zi) · εi∣∣∣∣∣> δ

≤ |F| · 2 exp(

− nδ2

4L2R2

),

from which one can conclude that for√nδ ≥ 2

√2L ·R · (log |F|)1/2 (19.3)

one has

P

supf∈F

∣∣∣∣∣1n

n∑

i=1

f(zi) · εi∣∣∣∣∣> δ

≤ 2 exp(

log |F| − nδ2

8L2R2 − nδ2

8L2R2

)

≤ 2 exp(

− nδ2

8L2R2

).

This is the way the above theorem is formulated. There (19.3) is replacedby (19.2), which will allow us later to derive the optimal rate of convergencefor linear least squares estimates.

In the proof we will introduce finite covers to replace the supremum overthe possible infinite set F by a maximum over some finite set. Comparedwith the inequalities described in Chapter 9 the main new idea is to intro-duce a finite sequence of finer and finer covers of F (the so-called chainingtechnique) instead of one fixed cover. This will allow us to represent anyf ∈ F by

(f − fS) + (fS − fS−1) + · · · + (f1 − f0) + f0,

Page 399: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

382 19. Advanced Techniques from Empirical Process Theory

where f0, f1, . . . , fS are elements of the covers which approximate f betterand better. fS will be such a good approximation that the first term canbe neglected. The other terms will lead to a sum of probabilities involvingmaxima over the finite covers which we will bound as above using the unionbound and Hoeffding’s inequality.Proof of Theorem 19.1. For R ≤ δ/(2L) we get, by the Cauchy–Schwarz inequality,

supf∈F

∣∣∣∣∣1n

n∑

i=1

f(zi) · εi∣∣∣∣∣≤ sup

f∈F‖f‖n ·

√√√√ 1n

n∑

i=1

ε2i ≤ R · 2L ≤ δ,

so we can assume w.l.o.g. R > δ/(2L).For s ∈ N0 let fs

1 , . . . , fsNs

be a ‖ · ‖n-cover of F of radius R2s of size

Ns = N2

(R

2s,F , zn

1

).

Because of (19.1) we can assume w.l.o.g. f01 = 0 and N0 = 1. For f ∈ F

choose

fs ∈ fs1 , . . . , f

sNs

such that

‖f − fs‖n ≤ R

2s.

Set

S = mins ≥ 1 :

R

2s≤ δ

2L

.

Because of

f = f − f0 = f − fS +S∑

s=1

(fs − fs−1)

we get, by definition of S and the Cauchy–Schwarz inequality,∣∣∣∣∣1n

n∑

i=1

f(zi) · εi∣∣∣∣∣

=

∣∣∣∣∣1n

n∑

i=1

(f(zi) − fS(zi)) · εi +S∑

s=1

1n

n∑

i=1

(fs(zi) − fs−1(zi)) · εi∣∣∣∣∣

≤∣∣∣∣∣1n

n∑

i=1

(f(zi) − fS(zi)) · εi∣∣∣∣∣+

S∑

s=1

∣∣∣∣∣1n

n∑

i=1

(fs(zi) − fs−1(zi)) · εi∣∣∣∣∣

≤ ‖f − fS‖n ·√√√√ 1n

n∑

i=1

ε2i +S∑

s=1

∣∣∣∣∣1n

n∑

i=1

(fs(zi) − fs−1(zi)) · εi∣∣∣∣∣

Page 400: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

19.1. Chaining 383

≤ R

2S· L+

S∑

s=1

∣∣∣∣∣1n

n∑

i=1

(fs(zi) − fs−1(zi)) · εi∣∣∣∣∣

≤ δ

2+

S∑

s=1

∣∣∣∣∣1n

n∑

i=1

(fs(zi) − fs−1(zi)) · εi∣∣∣∣∣.

Therefore, for any η1, . . . , ηS ≥ 0, η1 + · · · + ηS ≤ 1,

P

supf∈F

∣∣∣∣∣1n

n∑

i=1

f(zi) · εi∣∣∣∣∣> δ

≤ P

∃f ∈ F :δ

2+

S∑

s=1

∣∣∣∣∣1n

n∑

i=1

(fs(zi) − fs−1(zi)) · εi∣∣∣∣∣>δ

2+

S∑

s=1

ηs · δ2

≤S∑

s=1

P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

(fs(zi) − fs−1(zi)) · εi∣∣∣∣∣> ηs · δ

2

≤S∑

s=1

Ns ·Ns−1 · maxf∈F

P

∣∣∣∣∣1n

n∑

i=1

(fs(zi) − fs−1(zi)) · εi∣∣∣∣∣> ηs · δ

2

.

Fix s ∈ 1, . . . , S and f ∈ F . The random variables

(fs(z1) − fs−1(z1)) · ε1, . . . , (fs(zn) − fs−1(zn)) · εnare independent, have zero mean, and take values in

[−L · |fs(zi) − fs−1(zi)|, L · |fs(zi) − fs−1(zi)|]

(i = 1, . . . , n).

Therefore

1n

n∑

i=1

(2L · |fs(zi) − fs−1(zi)|

)2= 4L2‖fs − fs−1‖2

n

≤ 4L2 (‖fs − f‖n + ‖f − fs−1‖n

)2

≤ 4L2(R

2s+

R

2s−1

)2

= 36R2L2

22s,

together with Hoeffding’s inequality (cf. Lemma A.3), implies

P

[∣∣∣∣∣1n

n∑

i=1

(fs(zi) − fs−1(zi)) · εi∣∣∣∣∣> ηs · δ

2

]

≤ 2 exp

(

−2n(ηsδ2 )2

36R2L2

22s

)

.

It follows that

P

[

supf∈F

∣∣∣∣∣1n

n∑

i=1

f(zi) · εi∣∣∣∣∣> δ

]

≤S∑

s=1

2N2s exp

(−nδ2η2

s22s

72R2L2

)

=S∑

s=1

2 exp(

2 logNs − nδ2η2s22s

72R2L2

).

Page 401: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

384 19. Advanced Techniques from Empirical Process Theory

In order to get rid of the covering number Ns, we choose ηs such that

2 logNs ≤ 12

· nδ2η2

s22s

72R2L2 , (19.4)

which is equivalent to

ηs ≥ ηs :=12 · √

2 ·R · L2sδ

√n

· logNs1/2.

More precisely, we set

ηs := max

2−s√s

4, ηs

.

Because ofS∑

s=1

2−s√s

4≤ 1

8

∞∑

s=1

s · (12)s−1 =

18

1(1 − 1

2 )2=

12

andS∑

s=1

ηs =S∑

s=1

24√

2Lδ√n

· R

2s+1

log N2

(R

2s,F , zn

1

)1/2

≤S∑

s=1

24√

2Lδ√n

·∫ R/2s

R/2s+1log N2(u,F , zn

1 )1/2du

=24

√2L

δ√n

·∫ R/2

R/2S+1log N2(u,F , zn

1 )1/2du

≤ 12,

where the last inequality follows from (19.2) and

14

· R

2S−1 ≥ 14

· δ

2L=

δ

(8L),

we get

S∑

s=1

ηs ≤S∑

s=1

2−s√s

4+

S∑

s=1

ηs ≤ 1.

Furthermore, ηs ≥ ηs implies (19.4), from which we can conclude

P

[

supf∈F

∣∣∣∣∣1n

n∑

i=1

f(zi) · εi∣∣∣∣∣> δ

]

≤S∑

s=1

2 exp(

−nδ2η2s22s

144L2R2

)

Page 402: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

19.2. Extension of Theorem 11.6 385

≤S∑

s=1

2 exp(

− nδ2

16 · 144L2R2 · s)

≤ 21 − exp

(− nδ2

16·144L2R2

) · exp(

− nδ2

16 · 144L2R2

).

Now,

nδ2

16 · 144L2R2 ≥ 362L2R2

16 · 144L2R2 =916

yields2

1 − exp(− nδ2

16·144L2R2

) ≤ 21 − exp

(− 916

) ≤ 5

which in turn implies the assertion.

19.2 Extension of Theorem 11.6

Theorem 11.6 implies that for

nαε2 ≥ 80B3

log EN1

(αε5,F , Zn

1

)(19.5)

one has

P

supf∈F

1n

∑ni=1 f(Zi) − Ef(Z)

α+ Ef(Z) + 1n

∑ni=1 f(Zi)

> ε

≤ 4 exp(

−3nαε2

80B

).

This is the way in which the next result is formulated. There (19.5) isreplaced by a condition on the integral of the logarithm of the coveringnumber.

Theorem 19.2. Let Z, Z1, . . . , Zn be independent and identically dis-tributed random variables with values in Rd. Let K ≥ 1 and let F be aclass of functions f : Rd → [0,K]. Let 0 < ε < 1 and α > 0. Assume that

√nε

√α ≥ 576

√K (19.6)

and that, for all z1, . . . , zn ∈ Rd and all δ ≥ αK/2,√nεδ

192√

2K≥

∫ √δ

εδ32K

(

log N2

(

u,

f ∈ F :1n

n∑

i=1

f(zi) ≤ 4δK

, zn1

))1/2

du.

(19.7)

Then

P

supf∈F

∣∣Ef(Z) − 1

n

∑ni=1 f(Zi)

∣∣

α+ Ef(Z) + 1n

∑ni=1 f(Zi)

> ε

≤ 15 exp(

− nαε2

128 · 2304K

).

(19.8)

Page 403: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

386 19. Advanced Techniques from Empirical Process Theory

Proof. The proof will be divided into four steps.

Step 1. Replace the expectation inside the probability in (19.8) by anempirical mean.Draw a “ghost” sample Z ′n

1 = (Z ′1, . . . , Z

′n) of i.i.d. random variables dis-

tributed as Z1 and independent of Zn1 . Let f∗ = f∗(Zn

1 ) be a functionf ∈ F such that

∣∣∣∣∣1n

n∑

i=1

f(Zi) − Ef(Z)∣∣∣∣∣> ε

(

α+1n

n∑

i=1

f(Zi) + Ef(Z))

,

if there exists any such function, and let f∗ be an other arbitrary functioncontained in F , if such a function doesn’t exist.

Observe that∣∣∣∣∣1n

n∑

i=1

f(Zi) − Ef(Z)∣∣∣∣∣> ε

(

α+1n

n∑

i=1

f(Zi) + Ef(Z))

and∣∣∣∣∣1n

n∑

i=1

f(Z ′i) − Ef(Z)

∣∣∣∣∣<ε

2

(

α+1n

n∑

i=1

f(Z ′i) + Ef(Z)

)

imply∣∣∣∣∣1n

n∑

i=1

f(Zi) − 1n

n∑

i=1

f(Z ′i)

∣∣∣∣∣>εα

2+ε

1n

n∑

i=1

f(Zi)− ε

21n

n∑

i=1

f(Z ′i)+

ε

2Ef(Z),

which is equivalent to∣∣∣∣∣1n

n∑

i=1

f(Zi) − 1n

n∑

i=1

f(Z ′i)

∣∣∣∣∣− 3

(1n

n∑

i=1

f(Zi) − 1n

n∑

i=1

f(Z ′i)

)

>εα

2+ε

41n

n∑

i=1

f(Zi) +ε

41n

n∑

i=1

f(Z ′i) +

ε

2Ef(Z).

Because of 0 < 1 + 34ε < 2 and Ef(Z) ≥ 0 this in turn implies

∣∣∣∣∣1n

n∑

i=1

f(Zi) − 1n

n∑

i=1

f(Z ′i)

∣∣∣∣∣>ε

8

(

2α+1n

n∑

i=1

f(Zi) +1n

n∑

i=1

f(Z ′i)

)

.

Using this one gets

P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

f(Zi) − 1n

n∑

i=1

f(Z ′i)

∣∣∣∣∣

8

(

2α+1n

n∑

i=1

f(Zi) +1n

n∑

i=1

f(Z ′i)

)

Page 404: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

19.2. Extension of Theorem 11.6 387

≥ P

∣∣∣∣∣1n

n∑

i=1

f∗(Zi) − 1n

n∑

i=1

f∗(Z ′i)

∣∣∣∣∣

8

(

2α+1n

n∑

i=1

f∗(Zi) +1n

n∑

i=1

f∗(Z ′i)

)

≥ P

∣∣∣∣∣1n

n∑

i=1

f∗(Zi) − Ef∗(Z)|Zn1

∣∣∣∣∣

> ε

(

α+1n

n∑

i=1

f∗(Zi) + Ef∗(Z)|Zn1

)

,

∣∣∣∣∣1n

n∑

i=1

f∗(Z ′i) − Ef∗(Z)|Zn

1 ∣∣∣∣∣

2

(

α+1n

n∑

i=1

f∗(Z ′i) + Ef∗(Z)|Zn

1 )

≥ E

I| 1n

∑n

i=1f∗(Zi)−Ef∗(Z)|Zn

1 |>ε(α+ 1n

∑n

i=1f∗(Zi)+Ef∗(Z)|Zn

1 )

×P

∣∣∣∣∣1n

n∑

i=1

f∗(Z ′i) − Ef∗(Z)|Zn

1 ∣∣∣∣∣

2

(

α+1n

n∑

i=1

f∗(Z ′i) + Ef∗(Z)|Zn

1 )∣∣∣∣Z

n1

.

By Lemma 11.2 and n > 100Kε2α (which follows from (19.6)) we get that the

probability inside the expectation is bounded from below by

1 − K

4( ε2 )2αn

= 1 − K

ε2αn≥ 0.99,

and we can conclude

P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

f(Zi) − 1n

n∑

i=1

f(Z ′i)

∣∣∣∣∣

8

(

2α+1n

n∑

i=1

f(Zi) +1n

n∑

i=1

f(Z ′i)

)

Page 405: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

388 19. Advanced Techniques from Empirical Process Theory

≥ 0.99P

∣∣∣∣∣1n

n∑

i=1

f∗(Zi) − Ef∗(Z)|Zn1

∣∣∣∣∣

> ε

(

α+1n

n∑

i=1

f∗(Zi) + Ef∗(Z)|Zn1

)

= 0.99P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

f(Zi) − Ef(Z)∣∣∣∣∣

> ε

(

α+1n

n∑

i=1

f(Zi) + Ef(Z))

.

This proves

P

supf∈F

∣∣ 1n

∑ni=1 f(Zi) − Ef(Z)∣∣

α+ 1n

∑ni=1 f(Zi) + Ef(Z) > ε

≤ 10099

P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

f(Zi) − 1n

n∑

i=1

f(Z ′i)

∣∣∣∣∣

8

(

2α+1n

n∑

i=1

f(Zi) +1n

n∑

i=1

f(Z ′i)

)

.

Step 2. Introduction of additional randomness by random signs.Let U1, . . . , Un be independent and uniformly distributed over −1, 1and independent of Z1, . . . , Zn, Z ′

1, . . . , Z′n. Because of the independence

and identical distribution of Z1, . . . , Zn, Z ′1, . . . , Z

′n, the joint distribution

of Zn1 , Z

′n1 doesn’t change if one randomly interchanges the corresponding

components of Zn1 and Z ′n

1 . Hence

P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

f(Zi) − 1n

n∑

i=1

f(Z ′i)

∣∣∣∣∣

8

(

2α+1n

n∑

i=1

f(Zi) +1n

n∑

i=1

f(Z ′i)

)

= P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

Ui · (f(Zi) − f(Z ′i))

∣∣∣∣∣

8

(

2α+1n

n∑

i=1

f(Zi) +1n

n∑

i=1

f(Z ′i)

)

≤ 2P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

Uif(Zi)

∣∣∣∣∣>ε

8

(

α+1n

n∑

i=1

f(Zi)

)

.

Page 406: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

19.2. Extension of Theorem 11.6 389

Step 3. Peeling.In this step we use the so-called peeling technique motivated in van de Geer(2000), Chapter 5.

P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

Uif(Zi)

∣∣∣∣∣>ε

8

(

α+1n

n∑

i=1

f(Zi)

)

≤∞∑

k=1

P

∃f ∈ F : Ik =12k−1α ≤ 1n

n∑

i=1

f(Zi) < 2kα,

∣∣∣∣∣1n

n∑

i=1

Uif(Zi)

∣∣∣∣∣>ε

8

(

α+1n

n∑

i=1

f(Zi)

)

≤∞∑

k=1

P

∃f ∈ F :1n

n∑

i=1

f(Zi) ≤ 2kα,

∣∣∣∣∣1n

n∑

i=1

Uif(Zi)

∣∣∣∣∣>ε

8α2k−1

.

Step 4. Application of Theorem 19.1.Next we condition inside the above probabilities on Z1, . . . , Zn, which isequivalent to considering, for z1, . . . , zn ∈ Rd and k ∈ N ,

P

∃f ∈ F :1n

n∑

i=1

f(zi) ≤ 2kα,

∣∣∣∣∣1n

n∑

i=1

Uif(zi)

∣∣∣∣∣>ε

8α2k−1

. (19.9)

By the assumptions of Theorem 19.2 (use δ = α2k−2K in (19.7)) we have√n ε

8α2k−1

48√

2

≥∫ √

α2kK/2

εα2k−232

(

log N2

(

u,

f ∈ F :1n

n∑

i=1

f(zi) ≤ α2k

, zn1

))1/2

du

and√nε

8α2k−1 ≥ 36

√α2kK.

Furthermore, 1n

∑ni=1 f(zi) ≤ α2k implies

1n

n∑

i=1

f(zi)2 ≤ α2kK.

Hence Theorem 19.1, with R2 = α2kK,L = 1, δ = ε8α2k−1, implies that

(19.9) is bounded from above by

5 exp

(

− n(

ε16α2k

)2

2304(α2kK)

)

= 5 exp(

− nε2α

162 · 2304K2k

).

Page 407: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

390 19. Advanced Techniques from Empirical Process Theory

This proves

∞∑

k=1

P

∃f ∈ F :1n

n∑

i=1

f(Zi) ≤ 2kα,

∣∣∣∣∣1n

n∑

i=1

Uif(Zi)

∣∣∣∣∣>ε

8α2k−1

≤∞∑

k=1

5 exp(

− nε2α

162 · 2304K2k)

= 51

1 − exp(− nε2α

128·2304K

) exp(

− nε2α

128 · 2304K

)

≤ 51 − exp

(− 98

) exp(

− nε2α

128 · 2304K

).

Steps 1 to 4 imply the assertion.

19.3 Extension of Theorem 11.4

In this section we show the following modification of Theorem 11.4. Thistheorem will enable us to obtain optimal rates of convergence for thepiecewise polynomial partitioning estimates in Section 19.4.

Theorem 19.3. Let Z, Z1, . . . , Zn be independent and identically dis-tributed random variables with values in Rd. Let K1,K2 ≥ 1 and let F bea class of functions f : Rd → R with the properties

|f(z)| ≤ K1 (z ∈ Rd) and Ef(Z)2 ≤ K2Ef(Z).Let 0 < ε < 1 and α > 0. Assume that

√nε

√1 − ε

√α ≥ 288 max2K1,

√2K2 (19.10)

and that, for all z1, . . . , zn ∈ Rd and all δ ≥ α8 ,

√nε(1 − ε)δ

96√

2 maxK1, 2K2

≥∫ √

δ

ε·(1−ε)·δ16 maxK1,2K2

×(

log N2

(

u,

f ∈ F :1n

n∑

i=1

f(zi)2 ≤ 16δ

, zn1

))1/2

du.

(19.11)

Page 408: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

19.3. Extension of Theorem 11.4 391

Then

P

supf∈F

∣∣Ef(Z) − 1

n

∑ni=1 f(Zi)

∣∣

α+ Ef(Z) > ε

≤ 60 exp(

− nα ε2(1 − ε)128 · 2304 maxK2

1 ,K2). (19.12)

Proof. The proof is divided into six steps.

Step 1. Replace the expectation in the nominator of (19.12) by anempirical mean.Draw a “ghost” sample Z ′n

1 = (Z ′1, . . . , Z

′n) of i.i.d. random variables dis-

tributed as Z1 and independent of Zn1 . Let f∗ = f∗(Zn

1 ) be a functionf ∈ F such that

∣∣∣∣∣1n

n∑

i=1

f(Zi) − Ef(Z)∣∣∣∣∣> ε(α+ Ef(Z)),

if there exists any such function, and let f∗ be an other arbitrary functioncontained in F , if such a function doesn’t exist.

We have

P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

f(Zi) − 1n

n∑

i=1

f(Z ′i)

∣∣∣∣∣>ε

2α+

ε

2Ef(Z)

≥ P

∣∣∣∣∣1n

n∑

i=1

f∗(Zi) − 1n

n∑

i=1

f∗(Z ′i)

∣∣∣∣∣>ε

2α+

ε

2Ef∗(Z)|Zn

1

≥ P

∣∣∣∣∣1n

n∑

i=1

f∗(Zi) − Ef∗(Z)|Zn1

∣∣∣∣∣> εα+ εEf∗(Z)|Zn

1 ,∣∣∣∣∣1n

n∑

i=1

f∗(Z ′i) − Ef∗(Z)|Zn

1 ∣∣∣∣∣≤ ε

2α+

ε

2Ef∗(Z)|Zn

1

= E

I| 1n

∑n

i=1f∗(Zi)−Ef∗(Z)|Zn

1 |>εα+εEf∗(Z)|Zn1

×P

∣∣∣∣∣1n

n∑

i=1

f∗(Z ′i) − Ef∗(Z)|Zn

1 ∣∣∣∣∣

≤ ε

2α+

ε

2Ef∗(Z)|Zn

1 ∣∣∣∣Z

n1

.

Chebyshev’s inequality, together with

0 ≤ Ef∗(Z)2|Zn

1 ≤ K2E f∗(Z)|Zn

1

Page 409: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

392 19. Advanced Techniques from Empirical Process Theory

and n ≥ 20K2ε2α (which follows from (19.10)), implies

P

∣∣∣∣∣1n

n∑

i=1

f∗(Z ′i) − Ef∗(Z)|Zn

1 ∣∣∣∣∣>ε

2α+

ε

2Ef∗(Z)|Zn

1 ∣∣∣∣Z

n1

≤ K2E f∗(Z)|Zn1

n(

ε2α+ ε

2Ef∗(Z)|Zn1 )2

≤ K2E f∗(Z)|Zn1

n · 2 · ε2α · ε

2Ef∗(Z)|Zn1

≤ 2K2

nε2α≤ 1

10.

Thus

P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

f(Zi) − 1n

n∑

i=1

f(Z ′i)

∣∣∣∣∣>ε

2α+

ε

2Ef(Z)

≥ 910

P

∣∣∣∣∣1n

n∑

i=1

f∗(Zi) − Ef∗(Z)|Zn1

∣∣∣∣∣> εα+ εEf∗(Z)|Zn

1

=910

P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

f(Zi) − Ef(Z)∣∣∣∣∣> εα+ εEf(Z)

,

which implies

P

supf∈F

∣∣ 1n

∑ni=1 f(Zi) − Ef(Z)∣∣α+ Ef(Z) > ε

≤ 109

P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

f(Zi) − 1n

n∑

i=1

f(Z ′i)

∣∣∣∣∣>ε

2α+

ε

2Ef(Z)

.

Step 2. Replace the expectation in the above probability by an empiricalmean.Using Ef(Z) ≥ 1

K2Ef(Z)2 we get

P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

f(Zi) − 1n

n∑

i=1

f(Z ′i)

∣∣∣∣∣>ε

2α+

ε

2Ef(Z)

≤ P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

f(Zi) − 1n

n∑

i=1

f(Z ′i)

∣∣∣∣∣>ε

2α+

ε

21K2

Ef(Z)2

≤ P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

f(Zi) − 1n

n∑

i=1

f(Z ′i)

∣∣∣∣∣>ε

2α+

ε

21K2

Ef(Z)2,

1n

n∑

i=1

f(Zi)2 − Ef(Z)2 ≤ ε

(

α+1n

n∑

i=1

f(Zi)2 + Ef(Z)2)

,

Page 410: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

19.3. Extension of Theorem 11.4 393

1n

n∑

i=1

f(Z ′i)

2 − Ef(Z)2 ≤ ε

(

α+1n

n∑

i=1

f(Z ′i)

2 + Ef(Z)2)

+2P

supf∈F

∣∣ 1n

∑ni=1 f(Zi)2 − Ef(Z)2∣∣

α+ 1n

∑ni=1 f(Zi)2 + Ef(Z)2 > ε

.

The second inequality in the first probability on the right-hand side of theabove inequality is equivalent to

Ef(Z)2

≥ 11 + ε

(

−εα+ (1 − ε)1n

n∑

i=1

f(Zi)2)

,

which implies

12ε

21K2

Ef(Z)2 ≥ − ε2

4(1 + ε)K2α+

1 − ε

4(1 + ε)K2ε1n

n∑

i=1

f(Zi)2.

This together with a similar argument applied to the third inequality and

εα

2− ε2α

2(1 + ε)K2≥ εα

4

(which follows from ε/(K2(1 + ε)) ≤ 1/2 for 0 < ε < 1 and K2 ≥ 1) yields

P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

f(Zi) − 1n

n∑

i=1

f(Z ′i)

∣∣∣∣∣>ε

2α+

ε

2Ef(Z)

≤ P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

f(Zi) − 1n

n∑

i=1

f(Z ′i)

∣∣∣∣∣

>εα

4+

(1 − ε)ε4(1 + ε)K2

(1n

n∑

i=1

f(Zi)2 +1n

n∑

i=1

f(Z ′i)

2

)

+2P

supf∈F

∣∣ 1n

∑ni=1 f(Zi)2 − Ef(Z)2∣∣

α+ 1n

∑ni=1 f(Zi)2 + Ef(Z)2 > ε

. (19.13)

Step 3. Application of Theorem 19.2.Next we apply Theorem 19.2 to the second probability on the right-handside of the above inequality. f2 : f ∈ F is a class of functions with valuesin [0,K2

1 ]. In order to be able to apply Theorem 19.2 to it, we need√nε

√α ≥ 576K1 (19.14)

and that, for all δ ≥ αK21/2 and all z1, . . . , zn ∈ Rd,

Page 411: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

394 19. Advanced Techniques from Empirical Process Theory

√nεδ

192√

2K21

≥∫ √

δ

εδ

32K21

(

log N2

(

u,

f2 : f ∈ F , 1n

n∑

i=1

f(zi)2 ≤ 4δK2

1

, zn1

)) 12

du.

(19.15)

Inequality (19.14) is implied by (19.10). Furthermore, for arbitraryfunctions f1, f2 : Rd → [−K1,K1], one has

1n

n∑

i=1

∣∣f1(zi)2 − f2(zi)2

∣∣2 =

1n

n∑

i=1

|f1(zi) + f2(zi)|2 · |f1(zi) − f2(zi)|2

≤ (2K1)21n

n∑

i=1

|f1(zi) − f2(zi)|2 ,

which implies

N2

(

u,

f2 : f ∈ F , 1n

n∑

i=1

f(zi)2 ≤ 4δK2

1

, zn1

)

≤ N2

(u

2K1,

f ∈ F :1n

n∑

i=1

f(zi)2 ≤ 4δK2

1

, zn1

)

.

Hence (19.15) follows from (19.11) (set δ = δ/(4K21 ) in (19.11)).

Thus the assumptions of Theorem 19.2 are satisfied and we can conclude

P

supf∈F

∣∣ 1n

∑ni=1 f(Zi)2 − Ef(Z)2∣∣

α+ 1n

∑ni=1 f(Zi)2 + Ef(Z)2 > ε

≤ 15 exp(

− nε2α

128 · 2304K21

).

Step 4. Introduction of additional randomness by random signs.Let U1, . . . , Un be independent and uniformly distributed over −1, 1and independent of Z1, . . . , Zn, Z ′

1, . . . , Z′n. Because of the independence

and identical distribution of Z1, . . . , Zn, Z ′1, . . . , Z

′n, the joint distribution

of Zn1 , Z

′n1 doesn’t change if one randomly interchanges the corresponding

components of Zn1 and Z ′n

1 . Hence

P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

f(Zi) − 1n

n∑

i=1

f(Z ′i)

∣∣∣∣∣

>εα

4+

(1 − ε)ε4(1 + ε)K2

(1n

n∑

i=1

f(Zi)2 +1n

n∑

i=1

f(Z ′i)

2

)

Page 412: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

19.3. Extension of Theorem 11.4 395

= P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

Ui (f(Zi) − f(Z ′i))

∣∣∣∣∣

>εα

4+

(1 − ε)ε4(1 + ε)K2

(1n

n∑

i=1

f(Zi)2 +1n

n∑

i=1

f(Z ′i)

2

)

≤ 2P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

Uif(Zi)

∣∣∣∣∣>εα

8+

(1 − ε)ε4(1 + ε)K2

1n

n∑

i=1

f(Zi)2

.

Step 5. Peeling.We have

P

∃f ∈ F :

∣∣∣∣∣1n

n∑

i=1

Uif(Zi)

∣∣∣∣∣>εα

8+

(1 − ε)ε4(1 + ε)K2

1n

n∑

i=1

f(Zi)2

≤∞∑

k=1

P

∃f ∈ F :

Ik =12k−1K2(1 + ε)α2(1 − ε)

≤ 1n

n∑

i=1

f(Zi)2 ≤ 2kK2(1 + ε)α2(1 − ε)

,

∣∣∣∣∣1n

n∑

i=1

Uif(Zi)

∣∣∣∣∣>εα

8+

(1 − ε)ε4(1 + ε)K2

1n

n∑

i=1

f(Zi)2

≤∞∑

k=1

P

∃f ∈ F :

1n

n∑

i=1

f(Zi)2 ≤ 2kK2(1 + ε)α2(1 − ε)

,

∣∣∣∣∣1n

n∑

i=1

Uif(Zi)

∣∣∣∣∣>εα

82k−1

.

Step 6. Application of Theorem 19.1.Next we condition inside the above probabilities on Z1, . . . , Zn, which isequivalent to considering, for z1, . . . , zn ∈ Rd and k ∈ N ,

P

∃f ∈ F :1n

n∑

i=1

f(zi)2 ≤ 2kK2(1 + ε)α2(1 − ε)

,

∣∣∣∣∣1n

n∑

i=1

Uif(zi)

∣∣∣∣∣>εα

82k−1

.

(19.16)The assumptions of Theorem 19.3 imply (use δ = 2k K2(1+ε)α

8(1−ε) in (19.11))

Page 413: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

396 19. Advanced Techniques from Empirical Process Theory

√n εα

8 2k−1

48√

2

≥∫√

2kK2(1+ε)α

2(1−ε) /2

εα2k

8·16

×(

log N2

(

u,

f ∈ F :1n

n∑

i=1

f(zi)2 ≤ 2kK2(1 + ε)α2(1 − ε)

, zn1

)) 12

du

and

√nεα

82k−1 ≥ 36

√2kK2(1 + ε)α

2(1 − ε).

Hence we can conclude by Theorem 19.1 that (19.16) is bounded fromabove by

5 exp

− n(

εα8 2k−1

)2

2304 2kK2(1+ε)α2(1−ε)

= 5 exp(

− nε2(1 − ε)α64 · 2304(1 + ε)K2

2k−1).

It follows that

∞∑

k=1

P

∃f ∈ F :1n

n∑

i=1

f(Zi)2 ≤ 2kK2(1 + ε)α2(1 − ε)

,

∣∣∣∣∣1n

n∑

i=1

Uif(Zi)

∣∣∣∣∣>εα

82k−1

≤∞∑

k=1

5 exp(

− nε2(1 − ε)α64 · 2304(1 + ε)K2

2k−1)

≤∞∑

k=1

5 exp(

− nε2(1 − ε)α64 · 2304(1 + ε)K2

· k)

=5

1 − exp(− nε2(1−ε)α

64·2304(1+ε)K2

) exp(

− nε2(1 − ε)α64 · 2304(1 + ε)K2

)

≤ 51 − exp

(− 916

) exp(

− nε2(1 − ε)α64 · 2304(1 + ε)K2

).

Steps 1 to 6 imply the assertion.

Page 414: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

19.4. Piecewise Polynomial Partitioning Estimates 397

19.4 Piecewise Polynomial Partitioning Estimates

In this section we use Theorem 19.3 to show that suitably defined piecewisepolynomial partitioning estimates achieve for bounded Y the optimal mini-max rate of convergence for estimating (p, C)–smooth regression functions.In the case of Y bounded this improves the convergence rates of Section11.2 by a logarithmic factor.

For simplicity we assume X ∈ [0, 1] a.s. and |Y | ≤ L a.s. for someL ∈ R+. We will show in Problem 19.1 how to derive similar results formultivariate X.

Recall that the piecewise polynomial partitioning estimate is defined byminimizing the empirical L2 risk over the set FK,M of all piecewise poly-nomials of degree M (or less) with respect to an equidistant partition of[0, 1] into K intervals. In Section 11.2 we truncated the estimate in orderto ensure that it is bounded in absolute value. Here we impose instead abound on the supremum norm of the functions which we consider duringminimization of the empirical L2 risk. More precisely, set

mn,(K,M)(·) = arg minf∈FK,M (L+1)

1n

n∑

i=1

|f(Xi) − Yi|2,

where

FK,M (L+ 1) =

f ∈ FK,M : supx∈[0,1]

|f(x)| ≤ L+ 1

is the set of all piecewise polynomials of FK,M bounded in absolute value byL+1. Observe that by our assumptions the regression function is boundedin absolute value by L and that it is reasonable to fit a function to thedata for which a similar bound (which we have chosen to be L+1) is valid.In contrast to the truncation used in Section 11.2 this ensures that theestimate is contained in a linear vector space which will allow us to applythe bound on covering numbers of balls in linear vector spaces in Lemma9.3 in order to evaluate the conditions of Theorem 19.3.

It is not clear whether there exists any practical algorithm to computethe above estimate. The problem is the bound on the supremum normof the functions in FK,M (L + 1) which makes it difficult to represent thefunctions by a linear combination of some fixed basis functions. In Problem19.4 we will show how to change the definition of the estimate in order toget an estimate which can be computed much easier.

The next theorem gives a bound on the expected L2 error of mn,(K,M):

Page 415: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

398 19. Advanced Techniques from Empirical Process Theory

Theorem 19.4. Let M ∈ N0, K ∈ N , and L > 0 and let the estimatemn,(K,M) be defined as above. Then

E∫

|mn,(K,M)(x) −m(x)|2µ(dx)

≤ c1 · (M + 1)Kn

+ 2 inff∈FK,M (L+1)

∫|f(x) −m(x)|2µ(dx)

for every distribution of (X,Y ) with |Y | ≤ L a.s. Here c1 is a constantwhich depends only on L.

Before we prove this theorem we study its consequences.

Corollary 19.1. Let C,L > 0 and p = k + β with k ∈ N0 and β ∈ (0, 1].Set M = k and Kn = C2/(2p+1)n1/(2p+1). Then there exists a constantc2 which depends only on p and L such that, for all n ≥ max

C1/p, C−2

,

E∫

|mn,(Kn,M)(x) −m(x)|2µ(dx) ≤ c2C2

2p+1n− 2p2p+1 .

for every distribution of (X,Y ) with X ∈ [0, 1] a.s., |Y | ≤ L a.s., and m(p, C)-smooth.

Proof. By Lemma 11.1 there exists a piecewiese polynomial g ∈ FKn,M

such that

supx∈[0,1]

|g(x) −m(x)| ≤ 12pk!

· C

Kpn.

In particular,

supx∈[0,1]

|g(x)| ≤ supx∈[0,1]

|m(x)| +1

2pk!· C

Kpn

≤ L+ (C · n−p)1/(2p+1) ≤ L+ 1,

hence g ∈ FKn,M (L + 1). Here the third inequality follows from thecondition n ≥ C1/p. Finally,

inff∈FKn,M (L+1)

∫|f(x) −m(x)|2µ(dx) ≤ sup

x∈[0,1]|g(x) −m(x)|2

≤ 1(2pk!)2

· C2 1K2p

n

. (19.17)

This together with Theorem 19.4 implies the assertion.

In the proof of Theorem 19.4 we will apply the following lemma:

Lemma 19.1. Let (X,Y ), (X1, Y1), . . . , (Xn, Yn) be independent andidentically distributed (R × R)-valued random variables. Assume |Y | ≤ La.s. for some L ≥ 1. Let K ∈ N and M ∈ N0. Then one has, for any

Page 416: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

19.4. Piecewise Polynomial Partitioning Estimates 399

α ≥ c3(M+1)·K

n ,

P

∃f ∈ FK,M (L+ 1) : E|f(X) − Y |2 − |m(X) − Y |2

− 1n

n∑

i=1

|f(Xi) − Yi|2 − |m(Xi) − Yi|2

>12

(

α+ E|f(X) − Y |2 − |m(X) − Y |2)

≤ 60 exp(− nα

128 · 2304 · 800 · L4

). (19.18)

Here c3 is a constant which depends only on L.

Proof. Set Z = (X,Y ) and Zi = (Xi, Yi) (i = 1, . . . , n). For f : Rd → Rdefine gf : Rd × R → R by

gf (x, y) = (|f(x) − y|2 − |m(x) − y|2) · Iy∈[−L,L].

Set

G = gf : f ∈ FK,M (L+ 1) .Then the right-hand side of (19.18) can be written as

P

∃g ∈ G :Eg(Z) − 1

n

∑ni=1 g(Zi)

α+ Eg(Z) >12

.

By definition, the functions in FK,M (L+ 1) are bounded in absolute valueby L+ 1, hence

|g(z)| ≤ 2(L+ 1)2 + 2L2 ≤ 10L2 (z ∈ Rd × R)

for all g ∈ G. Furthermore,

Egf (Z)2 = E(|f(X) − Y |2 − |m(X) − Y |2)2

= E

|(f(X) +m(X) − 2Y ) · (f(X) −m(X))|2

≤ (L+ 1 + 3L)2E|f(X) −m(X)|2

= 25L2Egf (Z) (f ∈ FK,M (L+ 1)).

The functions in FK,M (L + 1) are piecewise polynomials of degree M (orless) with respect to a partition of [0, 1] consisting of K intervals. Hence,f ∈ FK,M (L + 1) implies that f2 is a piecewise polynomial of degree 2M(or less) with respect to the same partition. This together with

gf (x, y) =(f2(x) − 2f(x)y + y2 − |m(x) − y|2) · Iy∈[−L,L]

Page 417: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

400 19. Advanced Techniques from Empirical Process Theory

yields that G is a subset of a linear vector space of dimension

D = K · ((2M + 1) + (M + 1)) + 1,

thus, by Lemma 9.3,

log N2

(

u,

g ∈ G :1n

n∑

i=1

g(zi)2 ≤ 16δ

, zn1

)

≤ D · log16

√δ + u

u.

Because of∫ √

δ

0

(

D · log16

√δ + u

u

) 12

du =√D

√δ

∫ ∞

1

√log(1 + 16v)

v2 dv

≤√D

√δ

∫ ∞

1

√16vv2 dv

≤ 8√D

√δ

the assumptions of Theorem 19.3 are satisfied whenever√n

14δ ≥ 96 ·

√2 · 50L2 · 8

√D

√δ

for all δ ≥ α/8, which is implied by

α ≥ c3(M + 1) ·K

n.

Thus we can conclude from Theorem 19.3

P

∃g ∈ G :Eg(Z) − 1

n

∑ni=1 g(Zi)

α+ Eg(Z) >12

≤ 60 exp(

− nα 14 (1 − 1

2 )128 · 2304 · 100L4

).

Proof of Theorem 19.4. We use the error decomposition∫

|mn,(K,M)(x) −m(x)|2µ(dx)

= E|mn,(K,M)(X) − Y |2|Dn

− E|m(X) − Y |2= T1,n + T2,n,

where

T1,n = 21n

n∑

i=1

|mn,(K,M)(Xi) − Yi|2 − |m(Xi) − Yi|2

and

T2,n = E|mn,(K,M)(X) − Y |2|Dn

− E|m(X) − Y |2 − T1,n.

Page 418: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

19.4. Piecewise Polynomial Partitioning Estimates 401

By definition of the estimate,

T1,n = 2 minf∈FK,M (L+1)

1n

n∑

i=1

|f(Xi) − Yi|2 − |m(Xi) − Yi|2

which implies

ET1,n ≤ 2 inff∈FK,M (L+1)

∫|f(x) −m(x)|2µ(dx). (19.19)

Furthermore, Lemma 19.1 implies that, for t ≥ c3(M+1)·K

n ,

P T2,n > t

≤ P

∃f ∈ FK,M (L+ 1) : 2E|f(X) − Y |2 − |m(X) − Y |2

− 2n

n∑

i=1

|f(Xi) − Yi|2 − |m(Xi) − Yi|2

> t+ E|f(X) − Y |2 − |m(X) − Y |2

≤ 60 exp(

−n t

c4

).

Hence,

ET2,n

≤∫ ∞

0PT2,n > t dt ≤ c3

(M + 1) ·Kn

+∫ ∞

c3(M+1)K

n

PT2,n > t dt

≤ c3(M + 1) ·K

n+

60c4n

exp(

−c3c4

(M + 1) ·K).

This together with (19.19) implies the assertion.

The estimate in Corollary 19.1 depends on the smoothness (p, C) ofthe regression function, which is usually unknown in an application. Asin Chapter 12 this can be avoided by using the method of complexityregularization.

Let M0 ∈ N0 and set

Pn = (K,M) ∈ N × N0 : 1 ≤ K ≤ n and 0 ≤ M ≤ M0 .For (K,M) ∈ Pn define mn,(K,M) as above. Depending on the data(X1, Y1), . . . , (Xn, Yn) choose

(K∗,M∗) ∈ Pn

Page 419: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

402 19. Advanced Techniques from Empirical Process Theory

such that

1n

n∑

i=1

|mn,(K∗,M∗)(Xi) − Yi|2 + penn(K∗,M∗)

= min(K,M)∈Pn

1n

n∑

i=1

|mn,(K,M)(Xi) − Yi|2 + penn(K,M)

,

where

penn(K,M) = c3 · (M + 1) ·Kn

is a penalty term penalizing the complexity of FK,M (L + 1) and c3 is theconstant from Lemma 19.1 above. Set

mn(x, (X1, Y1), . . . , (Xn, Yn)) = mn,(K∗,M∗)(x, (X1, Y1), . . . , (Xn, Yn)).

The next theorem provides the bound on the expected L2 error of theestimate. The bound is obtained by means of complexity regularization.

Theorem 19.5. Let M0 ∈ N , L > 0, and let the estimate be defined asabove. Then

E∫

|mn(x) −m(x)|2µ(dx)

≤ min(K,M)∈Pn

2c3 · (M + 1) ·Kn

+2 inff∈FK,M (L+1)

∫|f(x) −m(x)|2µ(dx)

+c5n

for every distribution of (X,Y ) with |Y | ≤ L a.s. Here c5 is a constantwhich depends only on L and M0.

Before we prove Theorem 19.5 we study its consequences. Our firstcorollary shows that if the regression function is contained in the setFK,M (L+1) of bounded piecewise polynomials, then the expected L2 errorof the estimate converges to zero with the parametric rate 1/n.

Corollary 19.2. Let L > 0, M ≤ M0, and K ∈ N . Then

E∫

|mn(x) −m(x)|2µ(dx) = O

(1n

)

for every distribution of (X,Y ) with |Y | ≤ L a.s. and m ∈ FK,M (L+ 1).

Proof. The assertion follows directly from Theorem 19.5.

Next we study estimation of (p, C)-smooth regression functions.

Page 420: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

19.4. Piecewise Polynomial Partitioning Estimates 403

Corollary 19.3. Let L > 0 be arbitrary. Then, for any p = k + β withk ∈ N0, k ≤ M0, β ∈ (0, 1], and any C > 0,

E∫

|mn(x) −m(x)|2µ(dx) = O(C

22p+1n− 2p

2p+1

)

for every distribution of (X,Y ) with X ∈ [0, 1] a.s., |Y | ≤ L a.s., and m(p, C)-smooth.

Proof. The assertion follows directly from Theorem 19.5 and (19.17).

According to Chapter 3 the rate in Corollary 19.3 is the optimal minimaxrate of convergence for estimation of (p, C)-smooth regression functions.

Proof of Theorem 19.5. We start with the error decomposition∫

|mn(x) −m(x)|2µ(dx)

= E|mn,(K∗,M∗)(X) − Y |2|Dn

− E|m(X) − Y |2= T1,n + T2,n,

where

T1,n = 21n

n∑

i=1

|mn,(K∗,M∗)(Xi) − Yi|2 − |m(Xi) − Yi|2 + 2penn(K∗,M∗)

and

T2,n = E|mn,(K∗,M∗)(X) − Y |2|Dn

− E|m(X) − Y |2 − T1,n.

By definition of the estimate,

T1,n = 2 min(K,M)∈Pn

minf∈FK,M (L+1)

1n

n∑

i=1

|f(Xi) − Yi|2 + penn(K,M)

− 2n

n∑

i=1

|m(Xi) − Yi|2,

which implies

ET1,n

≤ 2 min(K,M)∈Pn

penn(K,M) + inf

f∈FK,M (L+1)

∫|f(x) −m(x)|2µ(dx)

.

(19.20)

Hence it suffices to show

ET2,n ≤ c5n. (19.21)

Page 421: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

404 19. Advanced Techniques from Empirical Process Theory

To this end, let t > 0 be arbitrary. Then Lemma 19.1 implies

P T2,n > t

≤n∑

K=1

M0∑

M=0

P

∃f ∈ FK,M (L+ 1) : 2E|f(X) − Y |2 − |m(X) − Y |2

− 2n

n∑

i=1

|f(Xi) − Yi|2 − |m(Xi) − Yi|2

> t+ 2 · penn(M,K) + E|f(X) − Y |2 − |m(X) − Y |2

≤n∑

K=1

M0∑

M=0

60 exp(

−n (t+ 2 · penn(K,M))c6

)

= exp(

−n t

c6

n∑

K=1

M0∑

M=0

60 exp(

−2c3c6

· (M + 1)K)

≤ (M0 + 1) · c7 · exp(

−n t

c6

).

Hence

ET2,n ≤∫ ∞

0P T2,n > t dt ≤ c6c7 (M0 + 1)

n.

19.5 Bibliographic Notes

Theorem 19.1 is due to van de Geer (1990). Applications of empirical pro-cess theory in statistics are described in van der Vaart and Wellner (1996)and van de Geer (2000), which are also excellent sources for learning manyuseful techniques from empirical process theory.

Theorems 19.2, 19.3, 19.4, and 19.5 are due to Kohler (2000a; 2000b).The rate of convergence results in this chapter require the boundedness

of (X,Y ). For the empirical L2 error

‖mn −m‖22 =

1n

n∑

i=1

|mn(Xi) −m(Xi)|2

one can derive the rate of convergence results for bounded X and un-bounded Y if one assumes the existence of an exponential moment of Y(such results can be found, e.g., in van de Geer (2000)). One can use The-orem 19.2 to bound the L2 error by some constant times the empiricalL2 error, which leads to rate of convergence results for bounded X and

Page 422: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

Problems and Exercises 405

unbounded Y . It is an open problem whether one can also show similarresults if X is unbounded, e.g., if X is normally distributed (cf. Question1 in Stone (1982)).

Problems and Exercises

Problem 19.1. Let L > 0 be arbitrary. Let F (d)K,M be the set of all multivariate

piecewise polynomials of degree M (or less, in each coordinate) with respect toan equidistant partition of [0, 1]d into Kd cubes, set

F (d)K,M (L + 1) =

f ∈ F (d)K,M : sup

x∈[0,1]d|f(x)| ≤ L + 1

and define the estimate mn,(K,M) by

mn,(K,M) = arg minf∈F(d)

K,M(L+1)

1n

n∑

i=1

|f(Xi) − Yi|2.

Show that there exists a constant c depending only on L and d such that

E∫

|mn,(K,M)(x) − m(x)|2µ(dx)

≤ c · (M + 1)dKd

n+ 2 inf

f∈F(d)K,M

(L+1)

∫|f(x) − m(x)|2µ(dx)

for all distributions of (X, Y ) with X ∈ [0, 1]d a.s. and |Y | ≤ L a.s.Hint: Proceed as in the proof of Theorem 19.4.

Problem 19.2. Let p = k + β for some k ∈ N0 and β ∈ (0, 1]. Show that for asuitable choice of the parameters K = Kn and M the estimate in Problem 19.1satisfies

E∫

|mn,(Kn,M)(x) − m(x)|2µ(dx) ≤ c′C2d

2p+d n− 2p

2p+d

for every distribution of (X, Y ) with X ∈ [0, 1]d a.s., |Y | ≤ L a.s., and m (p, C)-smooth.

Problem 19.3. Use complexity regularization to choose the parameters of theestimate in Problem 19.2.

Problem 19.4. Let Bj,Kn,M : j = −M, . . . , Kn − 1 be the B-spline basis ofthe univariate spline space SKn,M introduced in Section 14.4. Let c > 0 be aconstant specified below and set

SKn,M :=

Kn−1∑

j=−M

ajBj,Kn,M :Kn−1∑

j=−M

|aj | ≤ c · (L + 1)

.

Page 423: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

406 19. Advanced Techniques from Empirical Process Theory

Show that if one chooses c in a suitable way then the least squares estimate

mn,(Kn,M)(·) = arg minf∈SKn,M

1n

n∑

i=1

|f(Xi) − Yi|2

satisfies the bounds in Theorem 19.4 (with FK,M (L+1) replaced by SKn,M ) andin Corollary 19.1.

Hint: According to de Boor (1978) there exists a constant c > 0 such that, forall aj ∈ R,

Kn−1∑

j=−M

|aj | ≤ c · supx∈[0,1]

∣∣∣∣∣

Kn−1∑

j=−M

ajBj,Kn,M (x)

∣∣∣∣∣.

Problem 19.5. Use complexity regularization to choose the parameters of theestimate in Problem 19.4.

Problem 19.6. Assume (X, Y ) ∈ R × [−L, L] a.s. Let F be a set of functionsf : R → R and assume that F is a subset of a linear vector space of dimensionK. Let mn be the least squares estimate

mn(·) = arg minf∈F

1n

n∑

i=1

|f(Xi) − Yi|2

and set

m∗n(·) = arg min

f∈F1n

n∑

i=1

|f(Xi) − m(Xi)|2.

Show that, for all δ > 0,

P

1n

n∑

i=1

|mn(Xi) − m(Xi)|2 > 2δ + 18 minf∈F

1n

n∑

i=1

|f(Xi) − m(Xi)|2∣∣∣∣X

n1

≤ P

δ <1n

n∑

i=1

|mn(Xi) − m(Xi)|2

≤ 1n

n∑

i=1

(mn(Xi) − m∗n(Xi)) · (Yi − m(Xi))

∣∣∣∣X

n1

.

Use the peeling technique and Theorem 19.1 to show that, for δ ≥ c · Kn

, the lastprobability is bounded by

c′ exp(−nδ/c′)

and use this result to derive a rate of convergence result for

1n

n∑

i=1

|mn(Xi) − m(Xi)|2.

Problem 19.7. Apply the chaining and the peeling technique in the proof ofTheorem 11.2 to derive a version of Theorem 11.2 where one uses integrals ofcovering numbers of balls in the function space as in Theorem 19.3. Use thisresult together with Problem 19.6 to give a second proof for Theorem 19.5.

Page 424: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

20Penalized Least Squares Estimates I:Consistency

In the definition of least squares estimates one introduces sets of functionsdepending on the sample size over which one minimizes the empirical L2risk. Restricting the minimization of the empirical L2 risk to these sets offunctions prevents the estimates from adapting too well to the given data.If an estimate adapts too well to the given data, then it is not suitable forpredicting new, independent data. Penalized least squares estimates use adifferent strategy to avoid this problem: instead of restricting the class offunctions they add a penalty term to the empirical L2 risk which penalizesthe roughness of a function and minimize the sum of the empirical L2risk and this penalty term basically over all functions. The most popularexamples are smoothing spline estimates, where the penalty term is chosenproportional to an integral over a squared derivative of the function. Incontrast to the complexity regularization method introduced in Chapter12, the penalty here depends directly on the function considered and notonly on a set of functions in which this function is defined.

In this chapter we study penalized least squares estimates. In Section20.1 we explain how the univariate estimate is defined and how it canbe computed efficiently. The proofs there are based on some optimalityproperties of spline interpolants, which are the topic of Section 20.2. Resultsabout the consistency of univariate penalized least squares estimates arecontained in Section 20.3. In Sections 20.4 and 20.5 we show how one canextend the previous results to the multivariate case.

Page 425: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

408 20. Penalized Least Squares Estimates I: Consistency

20.1 Univariate Penalized Least Squares Estimates

In this section we will discuss the univariate penalized least squares esti-mates. In order to simplify the notation we will assume X ∈ (0, 1) a.s. InProblem 20.6 we will describe a modification of the estimate which leadsto universally consistent estimates.

Fix the dataDn and let k ∈ N and λn > 0. The univariate penalized leastsquares estimate (or smoothing spline estimate) is defined as a function gwhich minimizes

1n

n∑

i=1

|g(Xi) − Yi|2 + λn

∫ 1

0|g(k)(x)|2 dx, (20.1)

where

1n

n∑

i=1

|g(Xi) − Yi|2

is the empirical L2 risk of g, which measures how well the function g isadapted to the training data, while

λn

∫ 1

0|g(k)(x)|2 dx (20.2)

penalizes the roughness of g. As mentioned in Problem 20.1, the functionwhich minimizes (20.1) satisfies

g(k)(x) = 0 for x < minX1, . . . , Xn or x > maxX1, . . . , Xn,therefore it doesn’t matter if one replaces the penalty term (20.2) byλn

∫ ∞−∞ |g(k)(x)|2 dx.

If one minimizes (20.1) with respect to all (measurable) function, wherethe kth derivative is square integrable, it turns out that, for k > 1, theminimum is achieved by some function which is k times continuously dif-ferentiable. This is not true for k = 1. But in order to simplify the notationwe will always denote the set of functions which contains the minima of(20.1) by Ck(R). For k > 1, this is the set of all k times continuouslydifferentiable functions f : R → R, but, for k = 1, the exact definition ofthis set is more complicated.

It is not obvious that an estimate defined by minimizing (20.1) withrespect to all k times differentiable functions is of any practical use. But aswe will see in this section, it is indeed possible to compute such a functionefficiently. More precisely, we will show in the sequel that there exists aspline function f of degree 2k − 1 which satisfies

1n

n∑

i=1

|f(Xi) − Yi|2 + λn

∫ 1

0|f (k)(x)|2 dx

Page 426: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

20.1. Univariate Penalized Least Squares Estimates 409

= ming∈Ck(R)

1n

n∑

i=1

|g(Xi) − Yi|2 + λn

∫ 1

0|g(k)(x)|2 dx

,

and which, furthermore, can be computed efficiently. Here Ck(R) denotesthe set of all k times differentiable functions g : R → R.

Set M = 2k − 1 and choose a knot vector ujK+Mj=−M such that

u1, . . . , uK−1 = X1, . . . , Xnand

u−M < · · · < u0 < 0 < u1 < · · · < 1 < uK < · · · < uK+M .

If X1, . . . , Xn are all distinct, then K − 1 = n and u1, . . . , uK−1 is apermutation of X1, . . . , Xn.

The following optimality property of spline interpolants will be provenin Section 20.2:

Lemma 20.1. Let N ∈ N , 0 < z1 < · · · < zN < 1, and k ≤ N . Letg : R → R be an arbitrary k times differentiable function. Set M = 2k − 1and K = N + 1. Define the knot vector u = ujK+M

j=−M of the spline spaceSu,M by setting

uj = zj (j = 1, . . . , N)

and by choosing

u−M < u−M+1 < · · · < u0 < 0, 1 < uK < uK+1 < · · · < uK+M

arbitrary.Then there exists a spline function f ∈ Su,M such that

f(zi) = g(zi) (i = 1, . . . , N) (20.3)

and∫ 1

0|f (k)(z)|2dz ≤

∫ 1

0|g(k)(z)|2dz. (20.4)

According to Lemma 20.1 for each g ∈ Ck(R) there exists g ∈ Su,M suchthat

g(Xi) = g(Xi) (i = 1, . . . , n)

and∫ 1

0|g(k)(x)|2 dx ≤

∫ 1

0|g(k)(x)|2 dx.

Because of the first condition,

1n

n∑

i=1

|g(Xi) − Yi|2 =1n

n∑

i=1

|g(Xi) − Yi|2,

Page 427: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

410 20. Penalized Least Squares Estimates I: Consistency

and thus

1n

n∑

i=1

|g(Xi) − Yi|2 + λn

∫ 1

0|g(k)(x)|2 dx

≤ 1n

n∑

i=1

|g(Xi) − Yi|2 + λn

∫ 1

0|g(k)(x)|2 dx.

This proves

ming∈Ck(R)

1n

n∑

i=1

|g(Xi) − Yi|2 + λn

∫ 1

0|g(k)(x)|2 dx

= ming∈Su,M

1n

n∑

i=1

|g(Xi) − Yi|2 + λn

∫ 1

0|g(k)(x)|2 dx

,

therefore, it suffices to minimize the penalized empirical L2 risk only overthe finite-dimensional spline space Su,M .

Let g ∈ Su,M be arbitrary. Then g can be written as

g =K−1∑

j=−M

aj ·Bj,M,u,

and by Lemma 14.6, g(k) can be written as

g(k) =K−1∑

j=−(k−1)

bj ·Bj,k−1,u,

where one can compute the b′js from the a′js by repeatedly taking differences

and multiplying with some constants depending on the knot sequence u.Hence b = bj is a linear transformation of a = aj and we have, forsome (K + k − 1) × (K + 2k − 1) matrix D,

b = Da.

It follows that∫ 1

0|g(k)(x)|2 dx =

K−1∑

i,j=−(k−1)

bibj

∫ 1

0Bi,k−1,u(x) ·Bj,k−1,u(x) dx

= bTCb

= aTDTCDa,

where

C =(∫ 1

0Bi,k−1,u(x) ·Bj,k−1,u(x) dx

)

i,j=−(k−1),...,K−1.

Page 428: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

20.1. Univariate Penalized Least Squares Estimates 411

In addition,

1n

n∑

i=1

|g(Xi) − Yi|2 =1n

‖Ba− Y ‖22 =

1n

(aTBT − Y T

)(Ba− Y ) ,

where

B = (Bj,2k−1,u(Xi))i=1,...,n;j=−(2k−1),...,K−1 and Y = (Y1, . . . , Yn)T .

Hence,

1n

n∑

i=1

|g(Xi) − Yi|2 + λn

∫ 1

0|g(k)(x)|2 dx

=1n

(aTBT − Y T

)(Ba− Y ) + λna

TDTCDa (20.5)

and it suffices to show that there always exists an a∗ ∈ RK+2k−1 whichminimizes the right-hand side of the last equation.

The right-hand side of (20.5) is equal to

1naTBTBa− 2

1nY TBa+

1nY TY + λna

TDTCDa

= aT

(1nBTB + λnD

TCD

)a− 2

1nY TBa+

1nY TY.

Next we show that the matrix

A =1nBTB + λnD

TCD

is positive definite. Indeed,

aTAa =1n

‖Ba‖22 + λn(Da)TCDa ≥ 0

because

bTCb =∫ 1

0

∣∣∣∣∣∣

K−1∑

j=−(k−1)

bjBj,k−1,u(x)

∣∣∣∣∣∣

2

dx ≥ 0.

Furthermore, aTAa = 0 implies Ba = 0 and aTDTCDa = 0 from whichone concludes that g =

∑K−1j=−(2k−1) ajBj,2k−1,u satisfies

g(Xi) = 0 (i = 1, . . . , n)

and

g(k)(x) = 0 for all x ∈ [0, 1].

The last condition implies that g is a polynomial of degree k − 1 (or less).If we assume |X1, . . . , Xn| ≥ k, then we get g = 0 and hence a = 0.

Page 429: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

412 20. Penalized Least Squares Estimates I: Consistency

In particular, A is regular, i.e., A−1 exists. Using this we get

aTAa− 21nY TBa+

1nY TY

=(a−A−1 1

nBTY

)T

A

(a−A−1 1

nBTY

)+

1nY TY

− 1n2Y

TBA−1BTY.

The last two terms do not depend on a, and because A is positive definitethe first term is minimal for

a = A−1 1nBTY.

This proves that the right-hand side of (20.5) is minimized by the uniquesolution of the linear system of equations

(1nBTB + λnD

TCD

)a =

1nBTY. (20.6)

We summarize our result in the next theorem.

Theorem 20.1. For any fixed data (X1, Y1), . . . , (Xn, Yn) such thatX1, . . . , Xn ∈ (0, 1), |X1, . . . , Xn| ≥ k, and Y1, . . . , Yn ∈ R, the splinefunction

g =K−1∑

j=−(2k−1)

ajBj,2k−1,u

with knot vector u = uj, which satisfies

u−M < · · · < u0 < 0, u1, . . . , uK−1 = X1, . . . , Xnand

1 < uK < · · · < uK+M ,

and coefficient vector a which is the solution of (20.6), minimizes

1n

n∑

i=1

|f(Xi) − Yi|2 + λn

∫ 1

0|f (k)(x)|2 dx

with respect to all f ∈ Ck(R). The solution of (20.6) is unique.

In order to compute the estimate one has to solve the system of linearequations (20.6). In particular, one has to compute the matrices B, C, andD. The calculation of

B = (Bj,2k−1,u(Xi))i=1,...,n;j=−(2k−1),...,K−1

requires the evaluation of the B-splines Bj,2k−1,u at the points Xi, whichcan be done via the recursive evaluation algorithm described in Chapter 14.

Page 430: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

20.1. Univariate Penalized Least Squares Estimates 413

The matrix D = D(M,K)k transforms the coefficients of a linear combination

of B-splines into the coefficients of its kth derivative. By Lemma 14.6 onegets that, for k = 1, the matrix D(M,K)

1 is given by

−Mu1−u−M+1

Mu1−u−M+1

0 . . . 00 −M

u2−u−M+2

Mu2−u−M+2

. . . 00 0 M

u3−u−M+3. . . 0. . .

0 0 0 . . . 00 0 0 . . . M

uK−1+M −uK−1

.

For general k one has

D(M,K)k = D

(M−(k−1),K)1 · . . . ·D(M−1,K)

1 ·D(M,K)1 .

The matrix

C =(∫ 1

0Bi,k−1,u(x)Bj,k−1,u(x) dx

)

i,j=−(k−1),...,K−1

consists of inner products of B-splines. Now∫ 1

0Bi,k−1,u(x)Bj,k−1,u(x) dx =

∫ u1

0Bi,k−1,u(x) ·Bj,k−1,u(x) dx

+K−2∑

l=1

∫ ul+1

ul

Bi,k−1,u(x)Bj,k−1,u(x) dx

+∫ 1

uK−1

Bi,k−1,u(x) ·Bj,k−1,u(x) dx

is a sum of K integrals of polynomials of degree 2(k − 1). Using a lineartransformation it suffices to compute K integrals of polynomials of degree2(k − 1) over [−1, 1], which can be done by Gauss quadrature. Recall thatby the Gaussian quadrature there exist points

−1 < z1 < z2 < · · · < zk < 1

(symmetric about zero) and positive weights w1, . . . , wk such that∫ 1

−1p(x) dx =

k∑

j=1

wjp(zj)

for every polynomial of degree 2k − 1 (or less). Table 20.1 lists pointsz1, . . . , zk and weights w1, . . . , wk for various values of k.

In Theorem 20.1 we ignored the case |X1, . . . , Xn| < k. This case israther trivial, because if |X1, . . . , Xn| ≤ k, then one can find a polynomialp of degree k−1 (or less) which interpolates at each Xi the average of those

Page 431: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

414 20. Penalized Least Squares Estimates I: Consistency

Table 20.1. Parameters for Gauss quadrature.

k wi zi

1 w1 = 2 z1 = 02 w1 = w2 = 1 z2 = −z1 = 0.57773502...3 w1 = w3 = 5/9 z3 = −z1 = 0.7745966692...

w2 = 8/9 z3 = −z2 = 0

Yj for which Xj = Xi. It is easy to see that such a polynomial minimizes

1n

n∑

i=1

|g(Xi) − Yi|2

with respect to all functions g : R → R (cf. Problem 2.1). In addition,p(k)(x) = 0 for all x ∈ R, which implies that this polynomial also minimizes

1n

n∑

i=1

|g(Xi) − Yi|2 + λn

∫ 1

0|g(k)(x)|2 dx.

The polynomial is not uniquely determined if |X1, . . . , Xn| < k, henceTheorem 20.3 does not hold in this case.

20.2 Proof of Lemma 20.1

Let t−M ≤ t−M+1 ≤ · · · ≤ tK−1. Denote the number of occurrences of tjin the sequence tj+1, . . . , tK−1 by #tj , i.e., set

#tj = |i > j : ti = tj| .In order to prove Lemma 20.1 we need results concerning the followinginterpolation problem: Given tjK−1

j=−M , f−M , . . . , fK−1 ∈ R, and a splinespace Su,M , find an unique spline function f ∈ Su,M such that

∂#tjf(tj)∂x#tj

= fj (j = −M, . . . ,K − 1). (20.7)

If t−M < · · · < tK−1, then #tj = 0 for all j and (20.7) is equivalent to

f(tj) = fj (j = −M, . . . ,K − 1),

i.e., we are looking for functions which interpolate the points

(tj , fj) (j = −M, . . . ,K − 1).

By using tj with #tj > 0 one can also specify values of derivatives of f attj .

If we represent f by a linear combination of B-splines, then (20.7) isequivalent to a system of linear equations for the coefficients of this linear

Page 432: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

20.2. Proof of Lemma 20.1 415

combination. The vector space dimension of Su,M is equal to K+M , hencethere are as many equations as free variables. Our next theorem describesunder which condition this system of linear equations has a unique solution.

Theorem 20.2. (Schoenberg–Whitney Theorem). Let M ∈ N0,K ∈ N , u−M < · · · < uK+M , and t−M ≤ · · · ≤ tK−1. Assume #tj ≤ Mand, if tj = ui for some i, #tj ≤ M − 1 (j = −M, . . . ,K − 1).

Then the interpolation problem (20.7) has an unique solution for anyf−M , . . . , fK−1 ∈ R if and only if

Bj,M,u(tj) > 0 for all j = −M, . . . ,K − 1.

For M > 0 the condition Bj,M,u(tj) > 0 is equivalent to uj < tj <uj+M+1 (cf. Lemma 14.2).

Proof. The proof is left to the reader (cf. Problems 20.2 and 20.3).

Next we show the existence of a spline interpolant. In Lemma 20.3 wewill use condition (20.9) to prove that this spline interpolant minimizes∫ b

a|f (k)(x)|2dx.

Lemma 20.2. Let n ∈ N , a < x1 < · · · < xn < b, and k ≤ n. SetM = 2k − 1 and K = n + 1. Define the knot vector u = ujK+M

j=−M of thespline space Su,M by setting

uj = xj (j = 1, . . . , n)

and by choosing arbitrary

u−M < u−M+1 < · · · < u0 < a, b < uK < uK+1 < · · · < uK+M .

Let fj ∈ R (j = 1, . . . , n).Then there exists an unique spline function f ∈ Su,M such that

f(xi) = fi (i = 1, . . . , n) (20.8)

and

f (l)(a) = f (l)(b) = 0 (l = k, k + 1, . . . , 2k − 1 = M). (20.9)

Proof. Represent any function f ∈ Su,M by its B-spline coefficients. ByLemma 14.6, (20.8) and (20.9) are equivalent to a nonhomogeneous linearequation system for the B-spline coefficients. The dimension of the splinespace Su,M is equal to K + M = n + 1 + (2k − 1) = n + 2k, hence thenumber of rows of this equation system is equal to the number of columns. Aunique solution of such a nonhomogeneous linear equation system exists ifand only if the corresponding homogeneous equation system doesn’t havea nontrivial solution. Therefore it suffices to show: If f ∈ Su,M satisfies(20.8) and (20.9) with fi = 0 (i = 1, . . . , n), then f = 0.

Let f ∈ Su,M be such that (20.8) and (20.9) hold with fi = 0 (i =1, . . . , n). By the theorem of Rolle,

f(xi) = 0 = f(xi+1)

Page 433: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

416 20. Penalized Least Squares Estimates I: Consistency

implies that there exists ti ∈ (xi, xi+1) such that

f ′(ti) = 0.

We can show by induction that for any l ∈ 0, . . . , k there exists t(l)j ∈(xj , xj+l) such that f (l) ∈ Su,M−l satisfies

f (l)(t(l)j ) = 0 (j = 1, . . . , n− l). (20.10)

Hence, f (k) ∈ Su,M−k = Su,k−1 satisfies

f (k)(t(k)j ) = 0 (j = 1, . . . , n− k)

and

f (l)(a) = f (l)(b) = 0 (l = k, . . . , 2k − 1).

Here t(k)j ∈ (xj , xj+k) = (uj , uj+k) and thus the assumptions of

the Schoenberg–Whitney theorem are fulfilled. The Schoenberg–Whitneytheorem implies

f (k) = 0.

From this and (20.10) one concludes successively

f (l) = 0 (l = k − 1, . . . , 0).

Our next lemma implies that the spline interpolant of Lemma 20.2 min-imizes

∫ b

a|g(k)(x)|2dx with respect to all k times differentiable functions.

This proves Lemma 20.1.

Lemma 20.3. Let g be an arbitrary k times differentiable function suchthat

g(xi) = fi (i = 1, . . . , n).

Let f be the spline function of Lemma 20.2 satisfying

f(xi) = fi (i = 1, . . . , n)

and

f (l)(a) = f (l)(b) = 0 (l = k, k + 1, . . . , 2k − 1 = M).

Then∫ b

a

|f (k)(x)|2 dx ≤∫ b

a

|g(k)(x)|2 dx.

Proof.

∫ b

a

|g(k)(x)|2 dx =∫ b

a

|f (k)(x)|2 dx+∫ b

a

|g(k)(x) − f (k)(x)|2 dx

Page 434: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

20.2. Proof of Lemma 20.1 417

+2∫ b

a

f (k)(x)(g(k)(x) − f (k)(x)) dx.

Integration by parts yields∫ b

a

f (k)(x)(g(k)(x) − f (k)(x)) dx

=[f (k)(x)(g(k−1)(x) − f (k−1)(x))

]b

x=a

−∫ b

a

f (k+1)(x)(g(k−1)(x) − f (k−1)(x)) dx

= −∫ b

a

f (k+1)(x)(g(k−1)(x) − f (k−1)(x)) dx

(because f (k)(a) = f (k)(b) = 0)

= . . .

= (−1)k−1∫ b

a

f (2k−1)(x)(g′(x) − f ′(x)) dx

= (−1)k−1(f (2k−1)(a)

∫ x1

a

(g′(x) − f ′(x)) dx

+n−1∑

k=1

f (2k−1)(xk)∫ xk+1

xk

(g′(x) − f ′(x)) dx

+f (2k−1)(b)∫ b

xn

(g′(x) − f ′(x)) dx

)

(because of f (2k−1) is piecewise constant)

= (−1)k−1

(

0 ·∫ x1

a

(g′(x) − f ′(x)) dx+n−1∑

k=1

f (2k−1)(xk) · 0

+0 ·∫ b

xn

(g′(x) − f ′(x)) dx

)

(observe f(xk) = g(xk) (k = 1, . . . , n))

= 0.

Hence,∫ b

a

|g(k)(x)|2 dx =∫ b

a

|f (k)(x)|2 dx+∫ b

a

|g(k)(x) − f (k)(x)|2 dx

≥∫ b

a

|f (k)(x)|2 dx.

Page 435: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

418 20. Penalized Least Squares Estimates I: Consistency

20.3 Consistency

In order to simplify the notation we will assume in the sequel thatX ∈ (0, 1)a.s. Problem 20.6 shows that after a minor modification of the estimate thisassumption is no longer necessary and the resulting estimate is universallyconsistent.

Let k ∈ N and let the estimate mn be defined via

mn(·) = arg minf∈Ck(R)

1n

n∑

i=1

|f(Xi) − Yi|2 + λnJ2k (f)

, (20.11)

where λn > 0,

J2k (f) =

∫ 1

0|f (k)(x)|2 dx

and Ck(R) is the set of all k times differentiable functions f : R → R.In order to show consistency of the estimate mn, we will use the followingerror decomposition:

∫|mn(x) −m(x)|2µ(dx)

= E|mn(X) − Y |2|Dn − E|m(X) − Y |2

= E|mn(X) − Y |2|Dn − 1n

n∑

i=1

|mn(Xi) − Yi|2

+1n

n∑

i=1

|mn(Xi) − Yi|2 − 1n

n∑

i=1

|m(Xi) − Yi|2

+1n

n∑

i=1

|m(Xi) − Yi|2 − E|m(X) − Y |2

=: T1,n + T2,n + T3,n.

By the strong law of large numbers,

T3,n =1n

n∑

i=1

|m(Xi) − Yi|2 − E|m(X) − Y |2 → 0 (n → ∞) a.s.

Furthermore, if m ∈ Ck(R), then the definition of mn implies

T2,n ≤ 1n

n∑

i=1

|mn(Xi) − Yi|2 + λnJ2k (mn) − 1

n

n∑

i=1

|m(Xi) − Yi|2

≤ 1n

n∑

i=1

|m(Xi) − Yi|2 + λnJ2k (m) − 1

n

n∑

i=1

|m(Xi) − Yi|2

Page 436: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

20.3. Consistency 419

= λnJ2k (m).

Hence, if m ∈ Ck(R) and if we choose λn such that

λn → 0 (n → ∞),

then

lim supn→∞

T2,n ≤ 0.

Here the assumption m ∈ Ck(R) can be avoided by approximating m by asmooth function (which we will do in the proof of Theorem 20.3 below).

Thus, in order to obtain strong consistency of the estimate (20.11), webasically have to show

T1,n = E|mn(X) − Y |2|Dn − 1n

n∑

i=1

|mn(Xi) − Yi|2 → 0 (n → ∞) a.s.

(20.12)By definition of the estimate,

1n

n∑

i=1

|mn(Xi) − Yi|2 + λnJ2k (mn)

≤ 1n

n∑

i=1

|0 − Yi|2 + λn · 0

=1n

n∑

i=1

|Yi|2 → EY 2 (n → ∞) a.s.,

which implies that, with probability one,

J2k (mn) ≤ 2

λnEY 2 (20.13)

for n sufficiently large. Set

Fn :=f ∈ Ck(R) : J2

k (f) ≤ 2λn

EY 2.

Then (with probability one) mn ∈ Fn for n sufficiently large and therefore(20.12) follows from

supf∈Fn

∣∣∣∣∣E|f(X) − Y |2 − 1

n

n∑

i=1

|f(Xi) − Yi|2∣∣∣∣∣→ 0 (n → ∞) a.s.

(20.14)To show (20.14) we will use the results of Chapter 9. Recall that theseresults require that the random variables |f(X)−Y |2 (f ∈ Fn) be boundeduniformly by some constant, which may depend on n. To ensure this, we

Page 437: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

420 20. Penalized Least Squares Estimates I: Consistency

will truncate our estimate, i.e., we will set

mn(x) = Tβnmn(x) =

βn if mn(x) > βn,mn(x) if − βn ≤ mn(x) ≤ βn,−βn if mn(x) < −βn,

(20.15)

where βn > 0, βn → ∞ (n → ∞). All that we need then is an upper boundon the covering number of

Tβn

f : f ∈ Ck(R) and J2k (f) ≤ 2

λnEY 2

.

This bound will be given in the next lemma.

Lemma 20.4. Let L, c > 0 and set

F =TLf : f ∈ Ck(R) and J2

k (f) ≤ c.

Then, for any 1 ≤ p < ∞, 0 < δ < 4L, and x1, . . . , xn ∈ [0, 1],

Np (δ,F , xn1 ) ≤

(9 · e · 4pLpn

δp

)8(k+2)·(

(√

c/δ)1k +1

)

.

Proof. In the first step of the proof we approximate the functions of F inthe supremum norm by piecewise polynomials. In the second step we boundthe covering number of these piecewise polynomials. Fix g = TLf ∈ Fwhere f ∈ Ck(R), J2

k (f) ≤ c. Choose K and

0 = u0 < u1 < · · · < uK = 1

such that∫ ui+1

ui

|f (k)(x)|2 dx = c ·(

δ

2√c

) 1k

(i = 0, 1, . . . , K − 2),

and∫ uK

uK−1

|f (k)(x)|2 dx ≤ c ·(

δ

2√c

) 1k

.

Then

c ≥∫ 1

0|f (k)(x)|2 dx =

K−1∑

i=0

∫ ui+1

ui

|f (k)(x)|2 dx ≥ (K − 1) · c ·(

δ

2√c

) 1k

,

which implies

K ≤(

2√c

δ

) 1k

+ 1 ≤ 2(√

c

δ

) 1k

+ 1.

By a refinement of the partition [u0, u1), . . . , [uK−1, uK ] one canconstruct points

0 = v0 < v1 < · · · < vK = 1

Page 438: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

20.3. Consistency 421

such that∫ vi+1

vi

|f (k)(x)|2 dx ≤ c ·(

δ

2√c

) 1k

(i = 0, 1, . . . ,K − 1),

|vi+1 − vi| ≤(

δ

2√c

) 1k

(i = 0, 1, . . . ,K − 1)

and

K ≤ 4(√

c

δ

) 1k

+ 2.

Let pi be the Taylor polynomial of degree k − 1 of f about vi. Forvi ≤ x ≤ vi+1 one gets, by the Cauchy–Schwarz inequality,

|f(x) − pi(x)|2 =∣∣∣∣

1(k − 1)!

∫ x

vi

(t− vi)k−1f (k)(t) dt∣∣∣∣

2

≤ 1(k − 1)!2

∫ x

vi

(t− vi)2k−2 dt ·∫ x

vi

|f (k)(t)|2 dt

≤ 1(k − 1)!2

(vi+1 − vi)2k−1

2k − 1·∫ vi+1

vi

|f (k)(t)|2 dt

≤ 1(k − 1)!2(2k − 1)

2√c

) 2k−1k

· c ·(

δ

2√c

) 1k

≤ δ2

4.

Let Gk−1 be the set of all polynomials of degree less than or equal to k− 1and let Π be the family of all partitions of [0, 1] into K ≤ 4 (

√c/δ)1/k + 2

intervals. For π ∈ Π let Gk−1 π be the set of all piecewise polynomials ofdegree less than or equal to k − 1 with respect to π and let Gk−1 Π bethe union of all sets Gk−1 π (π ∈ Π). We have shown that for each g ∈ Fthere exists p ∈ Gk−1 Π such that

supx∈[0,1]

|g(x) − TLp(x)| ≤ δ

2.

This implies

Np (δ,F , xn1 ) ≤ Np

2, TLGk−1 Π, xn

1

). (20.16)

To bound the last covering number, we use Lemma 13.1, (13.4) (togetherwith the observation that the partition number no longer increases as soonas the number of intervals is greater than the number of points), and theusual bounds for covering numbers of linear vector spaces (Lemma 9.2 and

Page 439: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

422 20. Penalized Least Squares Estimates I: Consistency

Theorems 9.4 and 9.5). This yields

Np

2, TLGk−1 Π, xn

1

)

≤ (2n)4(√

c/δ)1k +2

supz1,...,zl∈xn

1 ,l≤nNp

2, TLGk−1, z

l1

)4(√

c/δ)1k +2

≤ (2n)4(√

c/δ)1k +2 ·

(

3 ·(

3e(2L)p

(δ/2)p

)2(k+1))4(

√c/δ)

1k +2

.

The assertion follows from this and (20.16).

Using this lemma and the results of Chapter 9 we next show an auxiliaryresult needed to prove the consistency of mn.

Lemma 20.5. Let k ∈ N and for n ∈ N choose βn, λn > 0, such that

βn → ∞ (n → ∞), (20.17)

β4n

n1−δ→ 0 (n → ∞) (20.18)

for some 0 < δ < 1,

λn → 0 (n → ∞) (20.19)

and

λn1β2

n

(n

β4n log(n)

)2k

→ ∞ (n → ∞). (20.20)

Let the estimate mn be defined by (20.11) and (20.15). Then

E|mn(X) − Y |2|Dn − 1n

n∑

i=1

|mn(Xi) − Yi|2 → 0 (n → ∞) a.s.

for every distribution of (X,Y ) with X ∈ (0, 1) a.s. and |Y | bounded a.s.

Proof. Without loss of generality we assume |Y | ≤ L ≤ βn a.s. Bydefinition of the estimate and the strong law of large numbers

1n

n∑

i=1

|mn(Xi)−Yi|2+λnJ2k (mn) ≤ 1

n

n∑

i=1

|0−Yi|2+λn·0 → EY 2 (n → ∞)

a.s., which implies that with probability one we have, for n sufficientlylarge,

mn ∈ Fn =Tβnf : f ∈ Ck(R) and J2

k (f) ≤ 2EY 2

λn

.

Page 440: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

20.3. Consistency 423

Hence it suffices to show

supg∈Gn

∣∣∣∣∣Eg(X,Y ) − 1

n

n∑

i=1

g(Xi, Yi)

∣∣∣∣∣→ 0 (n → ∞) a.s., (20.21)

where Gn =g : Rd × R → R : g(x, y) = |f(x) − TLy|2 for some f ∈ Fn

.

If gj(x, y) = |fj(x) − TLy|2 ((x, y) ∈ Rd × R) for some function fj

bounded in absolute value by βn (j = 1, 2), then

1n

n∑

i=1

|g1(Xi, Yi) − g2(Xi, Yi)| ≤ 4βn1n

n∑

i=1

|f1(Xi) − f2(Xi)|,

which implies

N1

( ε8,Gn, (X,Y )n

1

)≤ N1

32βn,Fn, X

n1

).

Using this, Theorem 9.1, and Lemma 20.4, one gets, for t > 0 arbitrary,

P

supg∈Gn

∣∣∣∣∣Eg(X,Y ) − 1

n

n∑

i=1

g(Xi, Yi)

∣∣∣∣∣> t

≤ 8

(c1βnn

ε32βn

)c2

(√2·EY 232βn√

λnε

) 1k

+c3

exp(

− nε2

128(4β2n)2

)

= 8 exp

(

c2

(√2 · EY 232

ε

βn√λn

) 1k

+ c3

· log(

32c1β2nn

ε

)

− nε2

2048β4n

)

≤ 8 exp(

−12

· nε2

2048β4n

)

for n sufficiently large, where we have used that (20.18) and (20.20) imply

(βn/√λn)

1k log(β2

nn)n/β4

n

=β4

n log(β2nn)

n·(βn√λn

) 1k

=

((β2

n

λn

)·(β4

n log(β2nn)

n

)2k) 1

2k

→ 0 (n → ∞).

From this and (20.18) one gets the assertion by an application of the Borel–Cantelli lemma.

We are now ready to formulate and prove our main result, which showsthat mn is consistent for all distributions of (X,Y ) satisfying X ∈ (0, 1)a.s. and EY 2 < ∞.

Page 441: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

424 20. Penalized Least Squares Estimates I: Consistency

Theorem 20.3. Let k ∈ N and for n ∈ N choose βn, λn > 0, suchthat (20.17)–(20.20) hold. Let the estimate mn be defined by (20.11) and(20.15). Then

∫|mn(x) −m(x)|2µ(dx) → 0 (n → ∞) a.s.

for every distribution of (X,Y ) with X ∈ (0, 1) a.s. and EY 2 < ∞.

Proof. Let L, ε > 0 be arbitrary, set YL = TLY and Yi,L = TLYi

(i = 1, . . . , n). Because of Corollary A.1 we can choose gε ∈ Ck(R) suchthat

∫|m(x) − gε(x)|2µ(dx) < ε and J2

k (gε) < ∞.

We use the following error decomposition:∫

|mn(x) −m(x)|2µ(dx) = E|mn(X) − Y |2|Dn − E|m(X) − Y |2

= E|mn(X) − Y |2|Dn − (1 + ε)E|mn(X) − YL|2|Dn

+(1 + ε)

(

E|mn(X) − YL|2|Dn − 1n

n∑

i=1

|mn(Xi) − Yi,L|2)

+(1 + ε)

(1n

n∑

i=1

|mn(Xi) − Yi,L|2 − 1n

n∑

i=1

|mn(Xi) − Yi,L|2)

+(1 + ε)1n

n∑

i=1

|mn(Xi) − Yi,L|2 − (1 + ε)21n

n∑

i=1

|mn(Xi) − Yi|2

+(1 + ε)2(

1n

n∑

i=1

|mn(Xi) − Yi|2 − 1n

n∑

i=1

|gε(Xi) − Yi|2)

+(1 + ε)2(

1n

n∑

i=1

|gε(Xi) − Yi|2 − E|gε(X) − Y |2)

+(1 + ε)2(E|gε(X) − Y |2 − E|m(X) − Y |2)

+((1 + ε)2 − 1)E|m(X) − Y |2

=8∑

j=1

Tj,n.

Because of (a+ b)2 ≤ (1 + ε)a2 + (1 + 1ε )b2 (a, b > 0) and the strong law of

large numbers we get

T1,n = E|(mn(X) − YL) + (YL − Y )|2|Dn

Page 442: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

20.4. Multivariate Penalized Least Squares Estimates 425

−(1 + ε)E|mn(X) − YL|2|Dn

≤ (1 +1ε)E|Y − YL|2

and

T4,n ≤ (1 + ε)(1 +1ε)1n

n∑

i=1

|Yi − Yi,L|2

→ (1 + ε)(1 +1ε)E|Y − YL|2 (n → ∞) a.s.

By Lemma 20.5,

T2,n → 0 (n → ∞) a.s.

Furthermore, if x, y ∈ R with |y| ≤ βn and z = Tβnx, then |z−y| ≤ |x−y|,

which implies

T3,n ≤ 0 for n sufficiently large.

It follows from the definition of the estimate and (20.19) that

T5,n ≤ (1 + ε)2(λnJ2k (gε) − λnJ

2k (mn)) ≤ (1 + ε)2λnJ

2k (gε) → 0

(n → ∞). By the strong law of large numbers,

T6,n → 0 (n → ∞) a.s.

Finally,

T7,n = (1 + ε)2∫

|gε(x) −m(x)|2µ(dx) ≤ (1 + ε)2ε.

Using this, one concludes

lim supn→∞

∫|mn(x) −m(x)|2µ(dx)

≤ (2 + ε)(1 +1ε)E|Y − YL|2 + (1 + ε)2ε+ (2ε+ ε2)E|m(X) − Y |2

a.s. With L → ∞ and ε → 0, the result follows.

20.4 Multivariate Penalized Least SquaresEstimates

In this section we briefly comment on how to extend the results of Sec-tion 20.1 to multivariate estimates. The exact definition of the estimate,the proof of its existence, and the derivation of a computation algorithmrequires several techniques from functional analysis which are beyond thescope of this book. Therefore we will just summarize the results withoutproofs.

Page 443: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

426 20. Penalized Least Squares Estimates I: Consistency

To define multivariate penalized least squares estimates we use theempirical L2 risk

1n

n∑

i=1

|g(Xi) − Yi|2

which measures how well the function g : Rd → R is adapted to the trainingdata. The roughness of g is penalized by λn · J2

k (g), where

J2k (g) =

i1,...,ik∈1,...,d

Rd

∣∣∣∣

∂kg(x)∂xi1 . . . ∂xik

∣∣∣∣

2

dx

=∑

α1,...,αd∈N0, α1+···+αd=k

k!α1! · . . . · αd!

Rd

∣∣∣∣

∂kg(x)∂xα1

1 . . . ∂xαd

d

∣∣∣∣

2

dx.

Hence the multivariate penalized least squares estimate minimizes

1n

n∑

i=1

|g(Xi) − Yi|2 + λn · J2k (g). (20.22)

It is not obvious over which function space one should minimize (20.22).In order to ensure that the penalty term J2

k (g) exists one needs to assumethat the partial derivatives

∂kg

∂xi1 . . . ∂xik

exist. The proof of the existence of a function which minimizes (20.22) isbased on the fact that the set of functions considered here is a Hilbertspace. Therefore one doesn’t require that the derivatives exist in the clas-sical sense and uses so-called weak derivatives instead. Without going intodetail we mention that one mimizes (20.22) over the Sobolev space W k(Rd)consisting of all functions whose weak derivatives of order k are containedin L2(Rd).

In general, the point evaluation (and thus the empirical L2 risk) of func-tions from W k(Rd) is not well-defined because the values of these functionsare determined only outside a set of Lebesgue measure zero. In the sequel wewill always assume that 2k > d. Under this condition one can show that thefunctions in W k(Rd) are continuous and point evaluation is well-defined.

Using techniques from functional analysis one can show that a functionwhich minimizes (20.22) over W k(Rd) always exists. In addition, one cancalculate such a function by the following algorithm:

Let l =(

d+k−1d

)and let φ1, . . . , φl be all monomials xα1

1 · . . . · xαd

d oftotal degree α1 + · · · + αd less than k. Depending on k and d define

K(z) =

Θk,d‖z‖2k−d · log(‖z‖) if d is even,Θk,d‖z‖2k−d if d is odd,

Page 444: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

20.5. Consistency 427

where

Θk,d =

(−1)k+d/2+1

22k−1πd/2(k−1)!·(k−d/2)! if d is even,Γ(d/2−k)

22kπd/2(k−1)! if d is odd.

Let z1, . . . , zN be the distinct values ofX1, . . . , Xn, and let ni be the numberof occurrences of zi in X1, . . . , Xn. Then there exists a function of the form

g∗(x) =N∑

i=1

µiK(x− zi) +l∑

j=1

νjφj(x) (20.23)

which minimizes (20.22) over W k(Rd). Here µ1, . . . , µN , ν1, . . . , νl ∈ R aresolutions of the linear system of equations

λnµi +ni

n

N∑

j=1

K(zi − zj) +ni

n

l∑

j=1

νjφj(zi) =1n

j:Xj=zi

Yj

(i = 1, . . . , N)N∑

j=1

µjφm(zj) = 0 (m = 1, . . . , l).

The solution of the above system of linear equations is unique if there isno polynomial p = 0 of total degree less than k that vanishes at all pointsX1, . . . , Xn.

If there exists a polynomial p = 0 of total degree less than k whichvanishes at all points X1, . . . , Xn, then one can add this polynomial to anyfunction g without changing the value of (20.22). Therefore in this casethe minimization of (20.22) does not lead to a unique function. However,one can show in this case that any solution of the above linear system ofequations defined via (20.23) yields a function which minimizes (20.22) overW k(Rd).

20.5 Consistency

In the sequel we show the consistency of the multivariate penalized leastsquares estimates defined by

mn(·) = arg minf∈W k(Rd)

(1n

n∑

i=1

|f(Xi) − Yi|2 + λnJ2k (f)

)

(20.24)

and

mn(x) = Tlog(n)mn(x). (20.25)

Page 445: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

428 20. Penalized Least Squares Estimates I: Consistency

Here

J2k (g) =

i1,...,ik∈1,...,d

Rd

∣∣∣∣

∂kg(x)∂xi1 . . . ∂xik

∣∣∣∣

2

dx

is the penalty term for the roughness of the function f : Rd → R, λn > 0 isthe smoothing parameter of the estimate, and W k(Rd) is the Sobolev spaceconsisting of all functions whose weak derivatives of order k are containedin L2(Rd) (cf. Section 20.4).

The following covering result, which is an extension of Lemma 20.4 toRd, plays a key role in proving the consistency of the multivariate penalizedleast squares estimate.

Lemma 20.6. Let L, c > 0 and set

F =TLf : f ∈ W k(Rd) and J2

k (f) ≤ c.

Then for any 0 < δ < L, 1 ≤ p < ∞, and x1, . . . , xn ∈ [0, 1]d,

Np(δ,F , xn1 ) ≤

(c1Lp n

δp

)c2(√

cδ )

dk +c3

,

where c1, c2, c3 ∈ R+ are constants which only depend on k and d.

The proof, which is similar to the proof of Lemma 20.4, is left to thereader (cf. Problem 20.9).

From this covering result we easily get the consistency of multivariatepenalized least squares estimates:

Theorem 20.4. Let k ∈ N with 2k > d. For n ∈ N choose λn > 0 suchthat

λn → 0 (n → ∞) (20.26)

and

n · λ d2kn

log(n)7→ ∞ (n → ∞). (20.27)

Let the estimate mn be defined by (20.24) and (20.25). Then∫

|mn(x) −m(x)|2µ(dx) → 0 (n → ∞) a.s.

for every distribution of (X,Y ) with ‖X‖2 bounded a.s. and EY 2 < ∞.

The proof is left to the reader (cf. Problem 20.10)

Page 446: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

20.6. Bibliographic Notes 429

20.6 Bibliographic Notes

Various applications of penalized modeling in statistics can be found,e.g., in Wahba (1990), Green and Silverman (1994), Eubank (1999), andEggermont and LaRiccia (2001).

The principle of penalized modeling, in particular smoothing splines,goes back to Whittaker (1923), Schoenberg (1964), and Reinsch (1967); seeWahba (1990) or Eubank (1999) for additional references.

The results in Sections 20.1 and 20.2 are standard results in the the-ory of deterministic splines, although the existence and computation ofsmoothing splines is often shown for k = 2 only. References concerningthe determinist theory of splines can be found in Section 14.5. For resultsconcerning the existence and computation of multivariate penalized leastsquares estimates, see Duchon (1976), Cox (1984), or Wahba (1990).

The proof of the existence of penalized least squares estimates (i.e., theproof of the existence of a function which minimizes the penalized empiricalL2 risk), is often based on tools from functional analysis, in particular, onthe theory of so-called reproducing kernel Hilbert spaces. This also leads toa different way of analysis of penalized least squares estimates, for details,see Wahba (1990).

Lemma 20.4 and Theorem 20.3 are based on van de Geer (1987) (inparticular on Lemma 3.3.1 there). The generalization of these results tothe multivariate case (i.e., Lemma 20.6 and Theorem 20.4) is due to Kohlerand Krzyzak (2001).

Mammen and van de Geer (1997) considered penalized least squaresestimates defined by using a penalty on the total variation of the function.

Problems and Exercises

Problem 20.1. Show that if one replaces∫ 1

0

|f (k)(x)|2 dx

by∫ R

L

|f (k)(x)|2 dx

in the definition of the penalized least squares estimate for some

−∞ ≤ L ≤ minX1, . . . , Xn ≤ maxX1, . . . , Xn ≤ R ≤ ∞,

then the values of the estimate on [minX1, . . . , Xn, maxX1, . . . , Xn] do notchange.

Hint: Apply Lemma 20.2 with

a = minX1, . . . , Xn − ε and b = maxX1, . . . , Xn + ε

Page 447: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

430 20. Penalized Least Squares Estimates I: Consistency

and show that a function f ∈ Ck(R), which minimizes

1n

n∑

i=1

|f(Xi) − Yi|2 + λ

∫ R

L

|f (k)(x)|2 dx,

satisfies f (k)(x) = 0 for L < x < minX1, . . . , Xn−ε and for maxX1, . . . , Xn+ε < x < R.

Problem 20.2. Consider the interpolation problem of Theorem 20.2. Show that

Bj,M,u(tj) = 0 for some j ∈ −M, . . . , K − 1implies that the solution of the interpolation problem is not unique.

Hint: First prove the assertion for M = 0. Then assume M > 0 and considerthe cases tj ≤ uj and tj ≥ uj+M+1. Show that tj ≤ uj implies Bk,M,u(ti) = 0for i ≤ j and k ≥ j. Conclude that in this case the matrix (Bk,M,u(ti))i,k is notregular. Argue similarly in the case tj ≥ uj+M+1.

Problem 20.3. Consider the interpolation problem of Theorem 20.2. Show that

Bj,M,u(tj) > 0 for all j ∈ −M, . . . , K − 1implies that the solution of the interpolation problem is unique.

Hint: Show that the matrix (Bk,M,u(ti))i,k is regular. Prove this first for M =0. Then proceed by induction.

Problem 20.4. Show that the univariate penalized least squares estimates is,without truncation, in general not consistent, even if X and Y are bounded.

Hint: Consider random variables X with PX = 0 = 12 and PX ≤ x = 1+x

2for x ∈ [0, 1] and Y independent of X with PY = −1 = PY = +1 = 1

2 .Hence m(x) = 0 for all x. Now draw an i.i.d. sample (X1, Y1), ..., (Xn, Yn) fromthe distribution of (X, Y ). Show that if the event

A := X1 = · · · = Xn−1 = 0; Y1, ..., Yn−1 = −1; Xn = 0; Yn = 1occurs, then the smoothing spline mn obtained with penalty J2

k for k ≥ 2 is thestraight line through (0, −1) and (Xn, 1), mn(x) = −1+ 2x

Xn. Use this to conclude

that the L2 error satisfies

E∫

|mn(x) − m(x)|2PX(dx) ≥ E

(IA · 1

2

∫ 1

0

∣∣∣−1 +

2x

Xn

∣∣∣2

dx

)

=p

2· E

(IXn =0 · Xn

6

((−1 +

2Xn

)3

+ 1

))

=p

4

∫ 1

0

u

6

((−1 +

2u

)3

+ 1

)du

= ∞,

where p = PX1 = · · · = Xn−1 = 0; Y1 = · · · = Yn−1 = −1; Yn = 1 > 0.

Problem 20.5. Let A ≥ 1 and x1, . . . , xn ∈ [−A, A]. Show that, for any c > 0,L > 0, and 0 < δ < L,

N1

(δ,

TLf : f ∈ Ck(R) and

∫ A

−A

|f (k)(x)|2dx ≤ c

, xn

1

)

Page 448: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

Problems and Exercises 431

≤(c1

L

δ

)(2A+2)(

c2(√

)1/k+c3

)

for some constants c1, c2, c3 ∈ R which only depend on k.Hint: Apply Lemma 20.4 on intervals [i, i + 1] with [i, i + 1] ∩ [−A, A] = ∅.

Problem 20.6. Define the estimate mn by

mn(·) = arg minf∈Ck(R)

(1n

n∑

i=1

|f(Xi) − Yi|2IXi∈[− log(n),log(n)]

+λn

∫ ∞

−∞|f (k)(x)|2 dx

)

and

mn(x) = Tlog(n)mn(x) · Ix∈[− log(n),log(n)].

Show that mn is strongly universally consistent provided

λn → 0 (n → ∞) and nλn → ∞ (n → ∞).

Hint: Use the error decomposition∫

|mn(x) − m(x)|2µ(dx) =∫

R\[− log(n),log(n)]

|mn(x) − m(x)|2µ(dx)

+E|mn(X) − Y |2 · IX∈[− log(n),log(n)] | Dn−E|m(X) − Y |2 · IX∈[− log(n),log(n)].

Problem 20.7. Show that Theorem 20.3 can also be proved via the truncationargument which we have used in Theorem 10.2.

Problem 20.8. Prove that the result of Theorem 20.3 still holds if the smoothingparameter λn of the penalized least squares estimate depends on the data and(20.19) and (20.20) hold with probability one.

Problem 20.9. Prove Lemma 20.6.Hint:

Step 1: Partition [0, 1]d into d-dimensional rectangles A1, . . . , AK with thefollowing properties:

(i)∫

Ai

∑α1+···+αd=k

k!α1!·...·αd!

∣∣∣ ∂kf

∂xα11 ...∂x

αdd

(x)∣∣∣2dx ≤ c(δ/

√c)d/k,

(i = 1, . . . , K);

(ii) supx,z∈Ai||x − z||∞ ≤ (δ/

√c)1/k (i = 1, . . . , K); and

(iii) K ≤ ((√

c/δ)1/k + 1)d + (√

c/δ)d/k.

To do this, start by dividing [0, 1]d into K ≤ ((√

c/δ)1/k + 1)d equi-volumecubes B1, . . . , BK of side length (δ/

√c)1/k. Then partition each cube Bi into

d-dimensional rectangles Bi,1, . . . , Bi,li such that, for j = 1, . . . , li − 1,∫

Bi,j

α1+···+αd=k

k!α1! · . . . · αd!

∣∣∣∣

∂kf

∂xα11 . . . ∂x

αdd

(x)

∣∣∣∣

2

dx = c(δ/√

c)d/k,

Page 449: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

432 20. Penalized Least Squares Estimates I: Consistency

and, for j = li,∫

Bi,j

α1+···+αd=k

k!α1! · . . . · αd!

∣∣∣∣

∂kf

∂xα11 . . . ∂x

αdd

(x)

∣∣∣∣

2

dx ≤ c(δ/√

c)d/k.

Step 2: Approximate f on each rectangle Ai by a polynomial of total degreek − 1. Fix 1 ≤ i ≤ K. Use the Sobolev integral identity, see Oden and Reddy(1976), Theorem 3.6, which implies that there exists a polynomial pi of total de-gree not exceeding k−1 and an infinitely differentiable bounded function Qα(x, y)such that, for all x ∈ Ai,

|f(x) − pi(x)|

=∫

Ai

1||x − z||d−k

2

α1+···+αd=k

Qα(x, z)

(∂kf

∂xα11 . . . ∂x

αdd

)(z) dz.

Use this to conclude

Np((√

c0d(k−d) + 1)δ, F , xn

1 ) ≤ Np(δ, TLG, xn1 ),

where TLG = TLg : g ∈ G and G is the set of all piecewise polynomials oftotal degree less than or equal to k − 1 with respect to a rectangular partition of[0, 1]d consisting of at most K ≤ (2d + 1)(

√c/δ)d/k + 2d rectangles.

Step 3: Use the results of Chapters 9 and 13 to bound Np(δ, TLG, xn1 ).

Problem 20.10. Prove Theorem 20.4.Hint: Proceed as in the proof of Theorem 20.3, but use Lemma 20.6 instead

of Lemma 20.4.

Problem 20.11. The assumption ‖X‖2 bounded a.s. in Theorem 20.4 may bedropped if we slightly modify the estimate. Define mn by

mn(·) = arg minf∈W k(Rd)

(1n

n∑

i=1

|f(Xi) − Yi|2 · I[− log(n),log(n)]d(Xi) + λnJ2k(f)

)

and set mn(x) = Tlog(n)mn(x) · I[− log(n),log(n)]d(x). Show that mn is stronglyconsistent for all distributions of (X, Y ) with EY 2 < ∞, provided 2k > d andsuitable modifications of (20.26)–(20.27) hold.

Hint: Use∫

|mn(x) − m(x)|2µ(dx) =∫

Rd\[− log(n),log(n)]d|mn(x) − m(x)|2µ(dx)

+E|mn(X) − Y |2 · I[− log(n),log(n)]d(X)|Dn−E|m(X) − Y |2 · I[− log(n),log(n)]d(X).

Page 450: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

21Penalized Least Squares Estimates II:Rate of Convergence

In this chapter we study the rate of convergence of penalized least squaresestimates. In Section 21.1 the smoothing parameter is chosen depending onthe smoothness of the regression function. In Section 21.2 we use the com-plexity regularization principle to define penalized least squares estimateswhich automatically adapt to the smoothness of the regression function.

21.1 Rate of Convergence

Our main result in this section is

Theorem 21.1. Let 1 ≤ L < ∞, n ∈ N , λn > 0, and k ∈ N . Define theestimate mn by

mn(·) = arg minf∈Ck(R)

1n

n∑

i=1

|f(Xi) − Yi|2 + λnJ2k (f)

(21.1)

and

mn(x) = TLmn(x) (x ∈ R), (21.2)

where

J2k (f) =

∫ 1

0|f (k)(x)|2dx.

Page 451: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

434 21. Penalized Least Squares Estimates II: Rate of Convergence

Then there exist constants c1, c2 ∈ R which depend only on k and L suchthat

E∫

|mn(x) −m(x)|2µ(dx)

≤ 2λnJ2k (m) + c1 · log(n)

n · λ1/(2k)n

+ c2 · log(n)n

(21.3)

for every distribution of (X,Y ) with X ∈ [0, 1] a.s., |Y | ≤ L a.s., andm ∈ Ck(R). In particular, for any constant c3 > 0 and for

λn = c3

(log(n)

n · J2k (m)

)2k/(2k+1)

there exists a constant c4 such that J2k (m) ≥ log(n)/n implies

E∫

|mn(x) −m(x)|2µ(dx)

≤ c4J2k (m)1/(2k+1) ·

(log(n)n

)2k/(2k+1)

. (21.4)

If m is (p, C)-smooth with p = k, then the (k − 1)th derivative of mis Lipschitz continuous with Lipschitz constant C. If, as in the proof ofTheorem 3.2, m is in addition k times differentiable, then the kth derivativeof m is bounded by C which implies

J2k (m)1/(2k+1) ·

(log(n)n

)2k/(2k+1)

≤ C2/(2k+1) ·(

log(n)n

)2k/(2k+1)

.

From this we see that the rate of convergence in Theorem 21.1 is optimalup to a logarithmic factor (cf. Theorem 3.2).

The advantage of the bound (21.4), compared with our previous boundsfor (p, C)-smooth regression functions, is that in some sense J2

k (m) ≤ C2

is a much weaker condition than m (k,C)-smooth, because in the latterthe Lipschitz function C is independent of x, so the function satisfies thesame smoothness condition on whole [0, 1]. But if J2

k (m) ≤ C2 then thekth derivative is allowed to vary in such a way that the squared average isbounded by C.

Proof. Inequality (21.4) is an easy consequence of (21.3), therefore weprove only (21.3). In order to simplify the notation in the proof we willabbreviate various constants which depend only on k and L by c5, c6, . . ..

We use the error decomposition∫

|mn(x) −m(x)|2µ(dx) = E|mn(X) − Y |2|Dn

− E|m(X) − Y |2

= T1,n + T2,n,

Page 452: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

21.1. Rate of Convergence 435

where

T1,n = 2

(1n

n∑

i=1

|mn(Xi) − Yi|2 − |m(Xi) − Yi|2

+ λnJ2k (mn)

)

and

T2,n = E|mn(X) − Y |2|Dn

− E|m(X) − Y |2 − T1,n.

Let us first observe that it is easy to bound T1,n: Assuming |Yi| ≤ L(i = 1, . . . , n), (21.2), and m ∈ Ck(R) we get

T1,n ≤ 2

(1n

n∑

i=1

|mn(Xi) − Yi|2 + λnJ2k (mn) − 1

n

n∑

i=1

|m(Xi) − Yi|2)

≤ 2

(1n

n∑

i=1

|m(Xi) − Yi|2 + λnJ2k (m) − 1

n

n∑

i=1

|m(Xi) − Yi|2)

= 2λnJ2k (m).

Hence it suffices to show

ET2,n ≤ c1 · log(n)

n · λ1/(2k)n

+ c2 · log(n)n

. (21.5)

In order to do this, we fix t > 0 and analyze

\[ \mathbf{P}\{T_{2,n}>t\} = \mathbf{P}\Bigl\{ 2\mathbf{E}\{|m_n(X)-Y|^2-|m(X)-Y|^2\,|\,D_n\} - \frac{2}{n}\sum_{i=1}^n\bigl\{|m_n(X_i)-Y_i|^2-|m(X_i)-Y_i|^2\bigr\} \]
\[ \qquad > t + 2\lambda_n J_k^2(m_n) + \mathbf{E}\{|m_n(X)-Y|^2-|m(X)-Y|^2\,|\,D_n\} \Bigr\}. \]

The above probability depends on the random function m_n, which makes the analysis difficult. But we know from Chapter 10 how to get rid of this kind of randomness: we can bound the probability above by a probability where the deterministic functions are taken from some deterministic set in which m_n is contained. A simple way to do this would be to assume

\[ m_n \in \Bigl\{ T_L g : g \in C^k(R),\ J_k^2(g) \le \frac{2L^2}{\lambda_n} \Bigr\} \quad \text{a.s.} \]

for n sufficiently large (cf. (20.13)), which implies that P{T_{2,n} > t} is bounded from above by

\[ \mathbf{P}\Bigl\{ \exists f = T_L g:\ g\in C^k(R),\ J_k^2(g)\le\frac{2L^2}{\lambda_n}: \]
\[ \qquad \mathbf{E}\{|f(X)-Y|^2-|m(X)-Y|^2\} - \frac{1}{n}\sum_{i=1}^n\bigl\{|f(X_i)-Y_i|^2-|m(X_i)-Y_i|^2\bigr\} \]
\[ \qquad > \frac12\cdot\bigl(t + 2\lambda_n J_k^2(g) + \mathbf{E}\{|f(X)-Y|^2-|m(X)-Y|^2\}\bigr)\Bigr\}. \]

Unfortunately, this bound is not sharp enough to get the right rate of convergence.

To get a better rate we use the peeling technique (cf. Chapter 19): m_n is contained in C^k(R), which implies that, for some l ∈ N_0, we have

\[ 2^l t\cdot I_{\{l\neq 0\}} \le 2\lambda_n J_k^2(m_n) < 2^{l+1}t. \]

From this, together with the union bound, we conclude

\[ \mathbf{P}\{T_{2,n}>t\} \le \sum_{l=0}^{\infty}\mathbf{P}\Bigl\{ \exists f = T_L g:\ g\in C^k(R),\ 2^l t\cdot I_{\{l\neq 0\}}\le 2\lambda_n J_k^2(g)<2^{l+1}t: \]
\[ \qquad \mathbf{E}\{|f(X)-Y|^2-|m(X)-Y|^2\} - \frac{1}{n}\sum_{i=1}^n\bigl\{|f(X_i)-Y_i|^2-|m(X_i)-Y_i|^2\bigr\} \]
\[ \qquad > \frac12\cdot\bigl(t + 2\lambda_n J_k^2(g) + \mathbf{E}\{|f(X)-Y|^2-|m(X)-Y|^2\}\bigr)\Bigr\} \]
\[ \le \sum_{l=0}^{\infty}\mathbf{P}\Bigl\{ \exists f = T_L g:\ g\in C^k(R),\ J_k^2(g)<\frac{2^l t}{\lambda_n}: \]
\[ \qquad \mathbf{E}\{|f(X)-Y|^2-|m(X)-Y|^2\} - \frac{1}{n}\sum_{i=1}^n\bigl\{|f(X_i)-Y_i|^2-|m(X_i)-Y_i|^2\bigr\} \]
\[ \qquad > \frac12\cdot\bigl(2^l t + \mathbf{E}\{|f(X)-Y|^2-|m(X)-Y|^2\}\bigr)\Bigr\}. \]

Fix l ∈ N_0. We will show momentarily that we can find constants c_5, c_6, c_7 such that, for

\[ t \ge c_5\frac{\log(n)}{n\lambda_n^{1/(2k)}} + c_6\frac{\log(n)}{n}, \]

\[ \mathbf{P}\Bigl\{ \exists f = T_L g:\ g\in C^k(R),\ J_k^2(g)<\frac{2^l t}{\lambda_n}: \]
\[ \qquad \mathbf{E}\{|f(X)-Y|^2-|m(X)-Y|^2\} - \frac{1}{n}\sum_{i=1}^n\bigl\{|f(X_i)-Y_i|^2-|m(X_i)-Y_i|^2\bigr\} \]
\[ \qquad > \frac12\cdot\bigl(2^l t + \mathbf{E}\{|f(X)-Y|^2-|m(X)-Y|^2\}\bigr)\Bigr\} \le 60\cdot\exp\bigl(-c_7 n\cdot t\cdot 2^l\bigr). \qquad (21.6) \]

This inequality implies the assertion, because we can conclude from (21.6) that, for t ≥ c_5 log(n)/(nλ_n^{1/(2k)}) + c_6 log(n)/n,

\[ \mathbf{P}\{T_{2,n}>t\} \le \sum_{l=0}^{\infty}60\cdot\exp\bigl(-c_7 n\cdot t\cdot 2^l\bigr) \le c_8\exp(-c_7 n\cdot t), \]

which implies

\[ \mathbf{E}\,T_{2,n} \le \int_0^{c_5\frac{\log(n)}{n\lambda_n^{1/(2k)}}+c_6\frac{\log(n)}{n}} 1\,dt + \int_{c_5\frac{\log(n)}{n\lambda_n^{1/(2k)}}+c_6\frac{\log(n)}{n}}^{\infty} c_8\exp(-c_7 n\cdot t)\,dt \le c_1\frac{\log(n)}{n\lambda_n^{1/(2k)}} + c_2\frac{\log(n)}{n}. \]

So it remains to prove (21.6). Inequality (21.6) follows directly from Theorem 19.3 provided we can show that the assumptions in Theorem 19.3 are satisfied. Set

\[ \mathcal{F} = \Bigl\{ f: R\times R\to R:\ f(x,y)=|T_L g(x)-T_L y|^2-|m(x)-T_L y|^2\ \ ((x,y)\in[0,1]\times R) \]
\[ \qquad\qquad \text{for some } g\in C^k(R),\ J_k^2(g)\le\frac{2^l t}{\lambda_n}\Bigr\} \]

and

\[ Z=(X,Y),\qquad Z_i=(X_i,Y_i)\quad (i=1,\ldots,n). \]

Then the left-hand side of (21.6) can be rewritten as

\[ \mathbf{P}\Bigl\{ \sup_{f\in\mathcal{F}}\frac{\mathbf{E}f(Z)-\frac1n\sum_{i=1}^n f(Z_i)}{2^l t+\mathbf{E}f(Z)} > \frac12 \Bigr\}. \]

Hence it suffices to show that for the set \(\mathcal{F}\) of functions, α = 2^l t, ε = 1/2, and suitable values of K_1 and K_2 the assumptions of Theorem 19.3 are satisfied.

We first determine K_1 and K_2. For f ∈ \(\mathcal{F}\) we have

\[ |f(z)| \le 4L^2 \quad (z\in[0,1]\times R) \]

and

\[ \mathbf{E}|f(Z)|^2 = \mathbf{E}\bigl||(T_Lg)(X)-Y|^2-|m(X)-Y|^2\bigr|^2 = \mathbf{E}\bigl\{|((T_Lg)(X)-Y)-(m(X)-Y)|^2\cdot|((T_Lg)(X)-Y)+(m(X)-Y)|^2\bigr\} \]
\[ \le 16L^2\,\mathbf{E}|(T_Lg)(X)-m(X)|^2 = 16L^2\,\mathbf{E}f(Z). \]

So we can choose K_1 = 4L^2 and K_2 = 16L^2.

Condition (19.10) follows from t ≥ c_5 log(n)/(nλ_n^{1/(2k)}) + c_6 log(n)/n, so it remains to show that (19.11) holds. In order to bound the covering number we observe

\[ \frac{1}{n}\sum_{i=1}^n\Bigl|\bigl(|T_Lg_1(x_i)-T_Ly_i|^2-|m(x_i)-T_Ly_i|^2\bigr)-\bigl(|T_Lg_2(x_i)-T_Ly_i|^2-|m(x_i)-T_Ly_i|^2\bigr)\Bigr|^2 \]
\[ = \frac{1}{n}\sum_{i=1}^n\bigl||T_Lg_1(x_i)-T_Ly_i|^2-|T_Lg_2(x_i)-T_Ly_i|^2\bigr|^2 = \frac{1}{n}\sum_{i=1}^n|T_Lg_1(x_i)-T_Lg_2(x_i)|^2\cdot|T_Lg_1(x_i)+T_Lg_2(x_i)-2T_Ly_i|^2 \]
\[ \le 16L^2\,\frac{1}{n}\sum_{i=1}^n|T_Lg_1(x_i)-T_Lg_2(x_i)|^2, \]

which implies

\[ \mathcal{N}_2(u,\mathcal{F},z_1^n) \le \mathcal{N}_2\Bigl(\frac{u}{4L},\bigl\{T_Lg: g\in C^k(R),\ J_k^2(g)\le\tfrac{2^l t}{\lambda_n}\bigr\},x_1^n\Bigr). \]

This together with Lemma 20.4 implies, for any u ≥ 1/n,

\[ \log\mathcal{N}_2\Bigl(u,\bigl\{f\in\mathcal{F}:\tfrac1n\sum_{i=1}^n f(z_i)^2\le 16\delta\bigr\},z_1^n\Bigr) \le \log\mathcal{N}_2(u,\mathcal{F},z_1^n) \le \log\mathcal{N}_2\Bigl(\frac{u}{4L},\bigl\{T_Lg: g\in C^k(R),\ J_k^2(g)\le\tfrac{2^l t}{\lambda_n}\bigr\},x_1^n\Bigr) \]
\[ \le 8(k+2)\Bigl(\Bigl(\frac{\sqrt{2^l t/\lambda_n}}{u/(4L)}\Bigr)^{1/k}+1\Bigr)\log\Bigl(\frac{144eL^2 n}{u^2/(16L^2)}\Bigr) \le c_9\log(n)\cdot\Bigl(\Bigl(\frac{2^l t}{\lambda_n}\Bigr)^{1/(2k)}u^{-1/k}+1\Bigr), \]

hence, for δ ≥ α/8 ≥ 2048L^2/n,

\[ \int_{\delta/(2048L^2)}^{\sqrt{\delta}}\Bigl(\log\mathcal{N}_2\Bigl(u,\bigl\{f\in\mathcal{F}:\tfrac1n\sum_{i=1}^n f(z_i)^2\le 16\delta\bigr\},z_1^n\Bigr)\Bigr)^{1/2}du \]
\[ \le \sqrt{c_9\log(n)}\cdot\int_0^{\sqrt{\delta}}\Bigl(\Bigl(\frac{2^l t}{\lambda_n}\Bigr)^{1/(4k)}u^{-1/(2k)}+1\Bigr)du = c_{10}\sqrt{\log(n)}\Bigl(\Bigl(\frac{2^l t}{\lambda_n}\Bigr)^{1/(4k)}\delta^{1/2-1/(4k)}+\delta^{1/2}\Bigr). \]

Hence (19.11) is implied by

\[ \sqrt{n}\,\delta \ge c_{11}\sqrt{\log(n)}\Bigl(\Bigl(\frac{2^l t}{\lambda_n}\Bigr)^{1/(4k)}\delta^{1/2-1/(4k)}+\delta^{1/2}\Bigr) \]

for all δ ≥ α/8 = 2^{l-3}t. Since

\[ \frac{\sqrt{n}\,2^{l-3}t}{c_{11}\sqrt{\log(n)}\,(2^l t/\lambda_n)^{1/(4k)}(2^{l-3}t)^{1/2-1/(4k)}} = \frac{1}{c_{11}8^{1/(4k)}}\cdot\frac{\sqrt{n}\,\lambda_n^{1/(4k)}}{\sqrt{\log(n)}}\cdot(2^{l-3}t)^{1/2} \ge \frac12 \]

provided

\[ t \ge c_{12}\frac{\log(n)}{n\cdot\lambda_n^{1/(2k)}}, \]

and since

\[ \frac{\sqrt{n}\,2^{l-3}t}{c_{11}\sqrt{\log(n)}\,(2^{l-3}t)^{1/2}} = \frac{1}{c_{11}}\cdot\frac{\sqrt{n}}{\sqrt{\log(n)}}\cdot(2^{l-3}t)^{1/2} \ge \frac12 \]

provided

\[ t \ge c_{13}\frac{\log(n)}{n}, \]

this in turn is implied by the assumption t ≥ c_5 log(n)/(nλ_n^{1/(2k)}) + c_6 log(n)/n. □

21.2 Application of Complexity Regularization

In this section we will use complexity regularization to adapt automatically to the smoothness of the estimated regression function. In the sequel we will assume that (X, Y) takes with probability one only values in some bounded subset of R × R. Without loss of generality this bounded subset is [0, 1] × [−L, L], i.e.,

\[ (X,Y)\in[0,1]\times[-L,L]\quad\text{a.s.} \]

for some L ∈ R_+.

Let k ∈ N and λ ∈ R_+. First define the smoothing spline estimate m_{n,(k,λ)} by

\[ m_{n,(k,\lambda)}(\cdot) = \arg\min_{f\in C^k([0,1])}\Bigl(\frac1n\sum_{i=1}^n|f(X_i)-Y_i|^2+\lambda J_k^2(f)\Bigr). \qquad (21.7) \]

The estimate m_{n,(k,λ)} depends on the parameters k ∈ N and λ ∈ R_+. We next describe how one can use the data D_n to choose these parameters by complexity regularization (see Chapter 12): in Lemma 21.1 below we derive an upper bound on the L2 error of (a truncated version of) the estimate m_{n,(k,λ)}. We then choose (k*, λ*) by minimizing this upper bound.

Lemma 21.1. Let 1 ≤ L < ∞, λ ∈ R_+, and η ∈ [0, 1]. Then for n sufficiently large one has, with probability greater than or equal to 1 − η,

\[ \int|T_L m_{n,(k,\lambda)}(x)-m(x)|^2\,\mu(dx) \]
\[ \le L^4\frac{\log(n)}{n} + 2\frac{L^5(\log(n))^2}{n\cdot\lambda^{1/(2k)}} + 2\Bigl\{\frac1n\sum_{i=1}^n|T_L m_{n,(k,\lambda)}(X_i)-Y_i|^2 + \lambda J_k^2(m_{n,(k,\lambda)}) - \frac1n\sum_{i=1}^n|m(X_i)-Y_i|^2\Bigr\} \]

for every distribution of (X, Y) with (X, Y) ∈ [0, 1] × [−L, L] almost surely.

The proof is similar to the proof of Theorem 21.2 below and is therefore omitted (see Problem 21.1).

The basic idea in this chapter is to choose the parameters of the estimate by minimizing the upper bound given in the above lemma. This will now be described in detail.

Set

\[ \mathcal{K} := \bigl\{1,\ldots,(\log(n))^{1/2}\bigr\} \]

and

\[ \Lambda_n := \Bigl\{\frac{\log(n)}{2^n},\frac{\log(n)}{2^{n-1}},\ldots,\frac{\log(n)}{1}\Bigr\}. \]

For (k, λ) ∈ K × Λ_n define m_{n,(k,λ)} by

\[ m_{n,(k,\lambda)}(x) = T_L m_{n,(k,\lambda)}(x)\quad(x\in R). \]

Depending on the data D_n we choose from the family of estimates {m_{n,(k,λ)} : (k, λ) ∈ K × Λ_n} the estimate that minimizes the upper bound in Lemma 21.1. More precisely, we choose

\[ (k^*,\lambda^*) = (k^*(D_n),\lambda^*(D_n)) \in \mathcal{K}\times\Lambda_n \]

such that

\[ \frac1n\sum_{i=1}^n|m_{n,(k^*,\lambda^*)}(X_i)-Y_i|^2 + \lambda^*J_{k^*}^2(m_{n,(k^*,\lambda^*)}) + \mathrm{pen}_n(k^*,\lambda^*) \]
\[ = \min_{(k,\lambda)\in\mathcal{K}\times\Lambda_n}\Bigl\{\frac1n\sum_{i=1}^n|m_{n,(k,\lambda)}(X_i)-Y_i|^2 + \lambda J_k^2(m_{n,(k,\lambda)}) + \mathrm{pen}_n(k,\lambda)\Bigr\}, \]

where

\[ \mathrm{pen}_n(k,\lambda) = \frac{L^5(\log(n))^2}{n\cdot\lambda^{1/(2k)}}\quad((k,\lambda)\in\mathcal{K}\times\Lambda_n), \]

and define our adaptive smoothing spline estimate by

\[ m_n(x) = m_n(x,D_n) = m_{n,(k^*(D_n),\lambda^*(D_n))}(x,D_n). \qquad (21.8) \]
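To make the selection rule (21.8) concrete, the following sketch scans the grid K × Λ_n and picks the pair (k*, λ*) minimizing the penalized empirical risk. It is only an illustration: the routine fit_smoothing_spline, which stands for a solver of (21.7) returning the fitted function together with its roughness J_k^2, is a hypothetical placeholder, and the λ-grid is truncated for floating-point reasons.

import numpy as np

def choose_parameters(X, Y, L, fit_smoothing_spline):
    # fit_smoothing_spline(X, Y, k, lam) is assumed (hypothetically) to return
    # a function g minimizing (1/n) sum |g(X_i)-Y_i|^2 + lam * J_k^2(g),
    # together with the value J_k^2(g), i.e., a solver of (21.7).
    n = len(X)
    K_grid = range(1, int(np.ceil(np.log(n) ** 0.5)) + 1)     # roughly K = {1,...,sqrt(log n)}
    # the text's grid goes down to log(n)/2^n; truncated at 2^-60 for float arithmetic
    Lam_grid = [np.log(n) / 2.0 ** j for j in range(min(n, 60), -1, -1)]
    best, best_score = None, np.inf
    for k in K_grid:
        for lam in Lam_grid:
            g, J2 = fit_smoothing_spline(X, Y, k, lam)
            g_trunc = lambda x, g=g: np.clip(g(x), -L, L)      # truncation T_L g
            emp_risk = np.mean((g_trunc(X) - Y) ** 2)
            pen = L ** 5 * np.log(n) ** 2 / (n * lam ** (1.0 / (2 * k)))
            score = emp_risk + lam * J2 + pen                  # penalized empirical risk
            if score < best_score:
                best, best_score = (k, lam, g_trunc), score
    return best   # (k*, lambda*, truncated estimate)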

An upper bound on the L2 error of the estimate is given in the next theorem.

Theorem 21.2. Let m_n be the estimate defined by (21.8).

(a)
\[ \mathbf{E}\int|m_n(x)-m(x)|^2\,\mu(dx) = O\Bigl(\frac{(\log(n))^2}{n}\Bigr) \]
for any p ∈ N and any distribution of (X, Y) with (X, Y) ∈ [0, 1] × [−L, L] almost surely, m ∈ C^p([0, 1]) and J_p^2(m) = 0.

(b)
\[ \mathbf{E}\int|m_n(x)-m(x)|^2\,\mu(dx) = O\Bigl((J_p^2(m))^{\frac{1}{2p+1}}(\log n)^2 n^{-\frac{2p}{2p+1}}\Bigr) \]
for any p ∈ N and any distribution of (X, Y) with (X, Y) ∈ [0, 1] × [−L, L] almost surely, m ∈ C^p([0, 1]) and 0 < J_p^2(m) < ∞.

If we compare Theorem 21.2 (b) with Theorem 21.1 we see that the estimate above achieves, up to a logarithmic factor, the same rate of convergence as the estimate in Theorem 21.1, although it does not use parameters which depend on the smoothness of the regression function (measured by k and J_k^2(m)). In this sense it is able to adapt automatically to the smoothness of m.

Proof. Without loss of generality we assume p ∈ K. We start with the error decomposition

\[ \int|m_n(x)-m(x)|^2\,\mu(dx) = T_{1,n}+T_{2,n}, \]

where

\[ T_{1,n} = \mathbf{E}[|m_n(X)-Y|^2\,|\,D_n] - \mathbf{E}(|m(X)-Y|^2) - 2\Bigl\{\frac1n\sum_{i=1}^n|m_n(X_i)-Y_i|^2 + \lambda^*J_{k^*}^2(m_{n,(k^*,\lambda^*)}) - \frac1n\sum_{i=1}^n|m(X_i)-Y_i|^2 + \mathrm{pen}_n(k^*,\lambda^*)\Bigr\} \]

and

\[ T_{2,n} = 2\Bigl\{\frac1n\sum_{i=1}^n|m_n(X_i)-Y_i|^2 + \lambda^*J_{k^*}^2(m_{n,(k^*,\lambda^*)}) - \frac1n\sum_{i=1}^n|m(X_i)-Y_i|^2 + \mathrm{pen}_n(k^*,\lambda^*)\Bigr\}. \]

Step 1. We show

\[ T_{2,n} \le 2\inf_{\lambda\in\Lambda_n}\bigl\{\lambda J_p^2(m) + \mathrm{pen}_n(p,\lambda)\bigr\}. \qquad (21.9) \]

By definition of m_n, the Lipschitz property of T_L, |Y_i| ≤ L almost surely, which implies Y_i = T_L Y_i (i = 1, . . . , n) almost surely, and m ∈ C^p([0, 1]) we have

\[ T_{2,n} \le 2\inf_{\lambda\in\Lambda_n}\Bigl\{\frac1n\sum_{i=1}^n|T_L m_{n,(p,\lambda)}(X_i)-Y_i|^2 + \lambda J_p^2(m_{n,(p,\lambda)}) - \frac1n\sum_{i=1}^n|m(X_i)-Y_i|^2 + \mathrm{pen}_n(p,\lambda)\Bigr\} \]
\[ \le 2\inf_{\lambda\in\Lambda_n}\Bigl\{\frac1n\sum_{i=1}^n|m_{n,(p,\lambda)}(X_i)-Y_i|^2 + \lambda J_p^2(m_{n,(p,\lambda)}) - \frac1n\sum_{i=1}^n|m(X_i)-Y_i|^2 + \mathrm{pen}_n(p,\lambda)\Bigr\} \]
\[ \le 2\inf_{\lambda\in\Lambda_n}\Bigl\{\frac1n\sum_{i=1}^n|m(X_i)-Y_i|^2 + \lambda J_p^2(m) - \frac1n\sum_{i=1}^n|m(X_i)-Y_i|^2 + \mathrm{pen}_n(p,\lambda)\Bigr\} = 2\inf_{\lambda\in\Lambda_n}\bigl\{\lambda J_p^2(m) + \mathrm{pen}_n(p,\lambda)\bigr\}. \]

Step 2. Let t > 0 be arbitrary. We will now show

\[ \mathbf{P}\{T_{1,n}>t\} \le \sum_{(k,\lambda)\in\mathcal{K}\times\Lambda_n}\sum_{l=1}^{\infty}\mathbf{P}\Bigl\{ \exists f = T_L g,\ g\in C^k([0,1]),\ J_k^2(g)\le 2^l\frac{\mathrm{pen}_n(k,\lambda)}{\lambda}: \]
\[ \qquad \mathbf{E}|f(X)-Y|^2 - \mathbf{E}|m(X)-Y|^2 - \frac1n\sum_{i=1}^n\bigl\{|f(X_i)-Y_i|^2-|m(X_i)-Y_i|^2\bigr\} \]
\[ \qquad > \frac12\cdot\bigl(t + 2^l\mathrm{pen}_n(k,\lambda) + \mathbf{E}|f(X)-Y|^2 - \mathbf{E}|m(X)-Y|^2\bigr)\Bigr\}. \]

This follows from

\[ \mathbf{P}\{T_{1,n}>t\} \le \mathbf{P}\Bigl\{ \mathbf{E}[|m_{n,(k^*,\lambda^*)}(X)-Y|^2\,|\,D_n] - \mathbf{E}[|m(X)-Y|^2] - \frac1n\sum_{i=1}^n\bigl\{|m_{n,(k^*,\lambda^*)}(X_i)-Y_i|^2-|m(X_i)-Y_i|^2\bigr\} \]
\[ \qquad > \frac12\bigl(t + 2\lambda^*J_{k^*}^2(m_{n,(k^*,\lambda^*)}) + 2\mathrm{pen}_n(k^*,\lambda^*) + \mathbf{E}[|m_{n,(k^*,\lambda^*)}(X)-Y|^2\,|\,D_n] - \mathbf{E}[|m(X)-Y|^2]\bigr)\Bigr\} \]
\[ \le \sum_{(k,\lambda)\in\mathcal{K}\times\Lambda_n}\mathbf{P}\Bigl\{ \exists f = T_L g,\ g\in C^k([0,1]): \]
\[ \qquad \mathbf{E}|f(X)-Y|^2 - \mathbf{E}|m(X)-Y|^2 - \frac1n\sum_{i=1}^n\bigl\{|f(X_i)-Y_i|^2-|m(X_i)-Y_i|^2\bigr\} \]
\[ \qquad > \frac12\cdot\bigl(t + 2\lambda J_k^2(g) + 2\mathrm{pen}_n(k,\lambda) + \mathbf{E}|f(X)-Y|^2 - \mathbf{E}|m(X)-Y|^2\bigr)\Bigr\} \]
\[ \le \sum_{(k,\lambda)\in\mathcal{K}\times\Lambda_n}\sum_{l=1}^{\infty}\mathbf{P}\Bigl\{ \exists f = T_L g,\ g\in C^k([0,1]),\ 2^l\mathrm{pen}_n(k,\lambda)\le 2\lambda J_k^2(g)+2\mathrm{pen}_n(k,\lambda)<2^{l+1}\mathrm{pen}_n(k,\lambda): \]
\[ \qquad \mathbf{E}|f(X)-Y|^2 - \mathbf{E}|m(X)-Y|^2 - \frac1n\sum_{i=1}^n\bigl\{|f(X_i)-Y_i|^2-|m(X_i)-Y_i|^2\bigr\} \]
\[ \qquad > \frac12\cdot\bigl(t + 2\lambda J_k^2(g) + 2\mathrm{pen}_n(k,\lambda) + \mathbf{E}|f(X)-Y|^2 - \mathbf{E}|m(X)-Y|^2\bigr)\Bigr\} \]
\[ \le \sum_{(k,\lambda)\in\mathcal{K}\times\Lambda_n}\sum_{l=1}^{\infty}\mathbf{P}\Bigl\{ \exists f = T_L g,\ g\in C^k([0,1]),\ J_k^2(g)\le 2^l\frac{\mathrm{pen}_n(k,\lambda)}{\lambda}: \]
\[ \qquad \mathbf{E}|f(X)-Y|^2 - \mathbf{E}|m(X)-Y|^2 - \frac1n\sum_{i=1}^n\bigl\{|f(X_i)-Y_i|^2-|m(X_i)-Y_i|^2\bigr\} \]
\[ \qquad > \frac12\cdot\bigl(t + 2^l\mathrm{pen}_n(k,\lambda) + \mathbf{E}|f(X)-Y|^2 - \mathbf{E}|m(X)-Y|^2\bigr)\Bigr\}. \]

Step 3. Fix (k, λ) ∈ K × Λ_n and l ∈ N. As in the proof of Theorem 21.1 (cf. (21.6)) one can show that, for n sufficiently large,

\[ \mathbf{P}\Bigl\{ \exists f = T_L g,\ g\in C^k([0,1]),\ J_k^2(g)\le 2^l\frac{\mathrm{pen}_n(k,\lambda)}{\lambda}: \]
\[ \qquad \mathbf{E}|f(X)-Y|^2 - \mathbf{E}|m(X)-Y|^2 - \frac1n\sum_{i=1}^n\bigl\{|f(X_i)-Y_i|^2-|m(X_i)-Y_i|^2\bigr\} \]
\[ \qquad > \frac12\cdot\bigl(t + 2^l\mathrm{pen}_n(k,\lambda) + \mathbf{E}|f(X)-Y|^2 - \mathbf{E}|m(X)-Y|^2\bigr)\Bigr\} \le c_3\exp\Bigl(-c_4\frac{n\cdot(t+2^l\mathrm{pen}_n(k,\lambda))}{L^4}\Bigr) \qquad (21.10) \]

(cf. Problem 21.2).

Step 4. Next we demonstrate, for n sufficiently large,

\[ \mathbf{E}\,T_{1,n} \le \frac{c_5}{n}. \]

Using the results of Steps 2 and 3 we get, for n sufficiently large,

\[ \mathbf{E}\,T_{1,n} \le \int_0^{\infty}\mathbf{P}\{T_{1,n}>t\}\,dt \le \sum_{(k,\lambda)\in\mathcal{K}\times\Lambda_n}\sum_{l=1}^{\infty}\int_0^{\infty}c_3\exp\Bigl(\frac{-c_4 n\cdot(t+2^l\mathrm{pen}_n(k,\lambda))}{L^4}\Bigr)dt \]
\[ = \sum_{(k,\lambda)\in\mathcal{K}\times\Lambda_n}\sum_{l=1}^{\infty}\exp\Bigl(-\frac{c_4 n 2^l\mathrm{pen}_n(k,\lambda)}{L^4}\Bigr)\cdot\frac{c_3 L^4}{c_4 n} \le \sum_{(k,\lambda)\in\mathcal{K}\times\Lambda_n}\sum_{l=1}^{\infty}\exp\bigl(-c_4 2^l L\cdot(\log(n))^{2-1/(2k)}\bigr)\cdot\frac{c_3 L^4}{c_4 n} \]
\[ \le c_6 n\cdot(\log(n))^{1/2}\exp(-2\log(n))\cdot\frac{c_3 L^4}{c_4 n} \le \frac{c_5}{n}. \]

Step 5. We now conclude the proof. By the results of Steps 1 and 4 we get, for n sufficiently large,

\[ \mathbf{E}\int|m_n(x)-m(x)|^2\,\mu(dx) \le 2\inf_{\lambda\in\Lambda_n}\Bigl\{\lambda\cdot J_p^2(m) + \frac{L^5(\log(n))^2}{n\cdot\lambda^{1/(2p)}}\Bigr\} + \frac{c_5}{n}. \]

Clearly, this implies the assertion of part (a). Concerning (b), assume 0 < J_p^2(m) < ∞ and set

\[ \lambda^* = \Bigl(\frac{L^5(\log(n))^2}{n\cdot J_p^2(m)}\Bigr)^{2p/(2p+1)}. \]

Then for n sufficiently large there exists λ ∈ Λ_n such that

\[ \lambda^* \le \lambda \le 2\lambda^*. \]

It follows that

\[ \mathbf{E}\int|m_n(x)-m(x)|^2\,\mu(dx) \le 2\Bigl\{\lambda\cdot J_p^2(m) + \frac{L^5(\log(n))^2}{n\cdot\lambda^{1/(2p)}}\Bigr\} + \frac{c_5}{n} \le 2\Bigl\{2\lambda^*\cdot J_p^2(m) + \frac{L^5(\log(n))^2}{n\cdot(\lambda^*)^{1/(2p)}}\Bigr\} + \frac{c_5}{n} \]
\[ \le 6\cdot(J_p^2(m))^{1/(2p+1)}\Bigl(\frac{L^5(\log(n))^2}{n}\Bigr)^{2p/(2p+1)} + \frac{c_5}{n} = O\bigl((J_p^2(m))^{1/(2p+1)}(\log(n))^2 n^{-2p/(2p+1)}\bigr). \qquad \Box \]

21.3 Bibliographic Notes

Theorem 21.2 is due to Kohler, Krzyzak, and Schafer (2002). In the context of fixed design regression the rate of convergence of (univariate) smoothing spline estimates was investigated in Rice and Rosenblatt (1983), Shen (1998), Speckman (1985), and Wahba (1975). Cox (1984) studied the rate of convergence of multivariate penalized least squares estimates. Application of complexity regularization to smoothing spline estimates for fixed design regression was considered in van de Geer (2001).


Problems and Exercises

Problem 21.1. Prove Lemma 21.1.

Hint: Start with the error decomposition

\[ \int|m_{n,(k,\lambda)}(x)-m(x)|^2\,\mu(dx) = T_{1,n}+T_{2,n}, \]

where

\[ T_{1,n} = \mathbf{E}[|m_{n,(k,\lambda)}(X)-Y|^2\,|\,D_n] - \mathbf{E}(|m(X)-Y|^2) - 2\Bigl\{\frac1n\sum_{i=1}^n|m_{n,(k,\lambda)}(X_i)-Y_i|^2 + \lambda J_k^2(m_{n,(k,\lambda)}) - \frac1n\sum_{i=1}^n|m(X_i)-Y_i|^2 + \mathrm{pen}_n(k,\lambda)\Bigr\} \]

and

\[ T_{2,n} = 2\Bigl\{\frac1n\sum_{i=1}^n|m_{n,(k,\lambda)}(X_i)-Y_i|^2 + \lambda J_k^2(m_{n,(k,\lambda)}) - \frac1n\sum_{i=1}^n|m(X_i)-Y_i|^2 + \mathrm{pen}_n(k,\lambda)\Bigr\}. \]

As in the proof of Theorem 21.2 show, for n sufficiently large and any t > 0,

\[ \mathbf{P}\{T_{1,n}>t\} \le \sum_{l=1}^{\infty}c_3\exp\Bigl(\frac{-c_4 n(t+2^l\mathrm{pen}_n(k,\lambda))}{L^4}\Bigr) \le c_3\exp\Bigl(\frac{-c_4 n\cdot t}{L^4}\Bigr). \]

Conclude

\[ \mathbf{P}\Bigl\{T_{1,n} > L^4\frac{\log(n)}{n}\Bigr\} \le c_3\cdot\exp(-c_4\log(n)) \le \eta \]

for n sufficiently large, which implies the assertion.

Problem 21.2. Prove (21.10).

Hint: Show that, for (k, λ) ∈ K × Λ_n,

\[ \mathrm{pen}_n(k,\lambda) \ge c_1\frac{\log n}{n\cdot\lambda^{1/(2k)}} + c_2\frac{\log n}{n} \]

and apply (21.6).

Problem 21.3. Formulate and prove a multivariate version of Theorem 21.1.

Problem 21.4. Formulate and prove a multivariate version of Lemma 21.1. Use it to define adaptive penalized least squares estimates for multivariate data and formulate and prove a multivariate version of Theorem 21.2.

22 Dimension Reduction Techniques

We know from Chapter 2 that the estimation of a regression function is especially difficult if the dimension of X is large. One consequence of this is that the optimal minimax rate of convergence n^{−2k/(2k+d)} for the estimation of a k times differentiable regression function converges to zero rather slowly if the dimension d of X is large compared to k. The only possibility of circumventing this so-called curse of dimensionality is to impose additional assumptions on the regression function. Such assumptions will be discussed in this chapter.

In the classical linear model one assumes

\[ Y = (X,\beta) + \epsilon, \]

where β ∈ R^d, Eε = 0, and X, ε are independent. Here

\[ m(x) = (x,\beta) = \sum_{j=1}^d \beta_j x^{(j)} \]

is a linear function of the components of x. This rather restrictive parametric assumption can be generalized in various ways.

For additive models, one assumes that m(x) is a sum of univariate functions m_j : R → R applied to the components of x, i.e.,

\[ m(x) = \sum_{j=1}^d m_j(x^{(j)}). \]

In projection pursuit one generalizes this further by assuming that m(x) is a sum of univariate functions m_j applied to projections of x onto various directions β_j ∈ R^d:

\[ m(x) = \sum_{j=1}^K m_j((x,\beta_j)). \]

For single index models one assumes K = 1, i.e., one assumes that the regression function is given by

\[ m(x) = F((x,\beta)), \]

where F : R → R and β ∈ R^d. In the literature, additive and single index models are called semiparametric models.

In the next three sections we discuss estimates which use the above assumptions to simplify the regression estimation problem.

22.1 Additive Models

In this section we assume that the regression function is an additive function of its components, i.e.,

\[ m(x) = \sum_{j=1}^d m_j(x^{(j)}). \]

This assumption can be used to simplify the problem of regression estimation by fitting only functions to the data which have the same additive structure. We have seen already two principles which can be used to fit a function to the data: least squares and penalized least squares. In the sequel we will use the least squares principle to construct an estimate of an additive regression function.

Assume that X ∈ [0, 1]^d a.s. Let M ∈ N_0 and K_n ∈ N. Let F_n^{(1)} be the set of all piecewise polynomials of degree M (or less) w.r.t. an equidistant partition of [0, 1] into K_n intervals, and put

\[ \mathcal{F}_n = \Bigl\{ f: R^d\to R:\ f(x)=\sum_{j=1}^d f_j(x^{(j)})\ \text{for some}\ f_j\in\mathcal{F}_n^{(1)}\Bigr\}. \]

Let

\[ m_n = \arg\min_{f\in\mathcal{F}_n}\frac1n\sum_{i=1}^n(Y_i-f(X_i))^2. \]

Then define the estimate by

\[ m_n(x) = T_L m_n(x). \]
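Since F_n is spanned by finitely many local monomials, the additive least squares estimate can be computed by one ordinary least squares fit over this basis. The following sketch illustrates this; the parameters M, K_n, L and the particular basis construction are illustrative choices, not prescribed by the text.

import numpy as np

def additive_ls_estimate(X, Y, M, Kn, L):
    # X: n x d array with entries in [0,1]; Y: length-n array.
    # Basis of F_n: for each coordinate j, each of the Kn cells and each
    # degree 0..M, the local monomial I{x^(j) in cell} * (x^(j))^m.
    n, d = X.shape
    def design(Z):
        cols = []
        for j in range(d):
            cell = np.minimum((Z[:, j] * Kn).astype(int), Kn - 1)
            for c in range(Kn):
                ind = (cell == c).astype(float)
                for m in range(M + 1):
                    cols.append(ind * Z[:, j] ** m)
        return np.column_stack(cols)
    B = design(X)
    coef, *_ = np.linalg.lstsq(B, Y, rcond=None)       # least squares fit over F_n
    def m_n(x_new):
        return np.clip(design(np.atleast_2d(x_new)) @ coef, -L, L)   # truncation T_L
    return m_n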


With a slight modification of the proof of Corollary 11.2 we can get

Theorem 22.1. Let C > 0, p = q + r, q ∈ {0, . . . , M}, r ∈ (0, 1]. Assume that the distribution of (X, Y) satisfies X ∈ [0, 1]^d a.s.,

\[ \sigma^2 = \sup_{x\in[0,1]^d}\mathrm{Var}\{Y|X=x\} < \infty, \qquad \|m\|_\infty = \sup_{x\in[0,1]^d}|m(x)| \le L, \]

and

\[ m(x) = m_1(x^{(1)}) + \cdots + m_d(x^{(d)}) \]

for some (p, C)-smooth functions m_j : [0, 1] → R. Then

\[ \mathbf{E}\int|m_n(x)-m(x)|^2\,\mu(dx) \le c\cdot\max\{\sigma^2,L^2\}\frac{(\log(n)+1)\cdot d\cdot K_n\cdot(M+1)}{n} + \frac{8d^2}{2^{2p}q!^2}\cdot\frac{C^2}{K_n^{2p}} \]

and for

\[ K_n = \Bigl\lceil\Bigl(\frac{C^2}{\max\{\sigma^2,L^2\}}\frac{n}{\log(n)}\Bigr)^{1/(2p+1)}\Bigr\rceil \]

one gets, for n sufficiently large,

\[ \mathbf{E}\int|m_n(x)-m(x)|^2\,\mu(dx) \le c_{M,d}\,C^{\frac{2}{2p+1}}\cdot\Bigl(\frac{\max\{\sigma^2,L^2\}\cdot(\log(n)+1)}{n}\Bigr)^{\frac{2p}{2p+1}} \]

for some constant c_{M,d} depending only on M and d.

Notice that the above rate of convergence does not depend on the dimension d.

Proof. Lemma 11.1, together with X ∈ [0, 1]^d a.s., implies

\[ \inf_{f\in\mathcal{F}_n}\int|f(x)-m(x)|^2\,\mu(dx) = \inf_{f_j\in\mathcal{F}_n^{(1)}}\int\Bigl|\sum_{j=1}^d f_j(x^{(j)})-\sum_{j=1}^d m_j(x^{(j)})\Bigr|^2\mu(dx) \]
\[ \le d\cdot\inf_{f_j\in\mathcal{F}_n^{(1)}}\int\sum_{j=1}^d|f_j(x^{(j)})-m_j(x^{(j)})|^2\,\mu(dx) \le d\cdot\inf_{f_j\in\mathcal{F}_n^{(1)}}\sum_{j=1}^d\sup_{x^{(j)}\in[0,1]}|f_j(x^{(j)})-m_j(x^{(j)})|^2 \le \frac{d^2}{2^{2p}q!^2}\cdot\frac{C^2}{K_n^{2p}}. \]

From this together with Theorem 11.3 one gets the first inequality. The definition of K_n implies the second inequality. □

If we compare Theorem 22.1 with Corollary 11.2, we see that the assumption that the regression function is additive enables us to derive in the multivariate regression problem the same rate of convergence as in the univariate regression problem.

A straightforward generalization of the above result is to fit, instead of a sum of univariate functions applied to one of the components of X, a sum of functions of d* < d of the components of X to the data. In this case one can show that if the regression function itself is a sum of functions of d* < d of the components of X, and if these functions are all (p, C)-smooth, then the L2 error converges to zero with the rate n^{−2p/(2p+d*)} (instead of n^{−2p/(2p+d)}) (cf. Problem 22.1). Here the rate of convergence is again independent of the dimension d of X.

22.2 Projection Pursuit

In projection pursuit one assumes that m(x) is a sum of univariate functions, where each of these univariate functions is applied to the projection of x onto some vector β_j ∈ R^d:

\[ m(x) = \sum_{j=1}^K m_j((x,\beta_j)). \qquad (22.1) \]

This is a generalization of additive models in such a way that the components of X are replaced by the projections (X, β_j). As we know from Lemma 16.2, any regression function can be approximated arbitrarily closely by functions of the form (22.1), hence the assumption (22.1) is much less restrictive than the additive model. But, on the other hand, the fitting of a function of the form (22.1) to the data is much more complicated than the fitting of an additive function.

We again use the principle of least squares to construct an estimate of the form (22.1). Assume X ∈ [0, 1]^d a.s. Let M ∈ N_0, K_n ∈ N, and let F_n be the set of all piecewise polynomials of degree M (or less) w.r.t. an equidistant partition of [−1, 1] into K_n intervals. In order to simplify the computation of the covering number we use, in the definition of the estimate, only those functions from F_n which are bounded in absolute value by some constant B > 0 and which are Lipschitz continuous with some constant A_n > 0 (i.e., which satisfy |f(x) − f(z)| ≤ A_n|x − z| for all x, z ∈ [−1, 1]). Let F_n(B, A_n) be the subset of F_n consisting of all these functions.

We define our estimate by

\[ m_n(x) = \sum_{j=1}^K g_j^*\bigl((x,b_j^*)\bigr), \]

where

\[ (g_1^*,b_1^*,\ldots,g_K^*,b_K^*) = \arg\min_{g_1,b_1,\ldots,g_K,b_K:\ g_j\in\mathcal{F}_n(B,A_n),\ \|b_j\|\le 1/\sqrt{d}}\ \frac1n\sum_{i=1}^n\Bigl|Y_i-\sum_{j=1}^K g_j((X_i,b_j))\Bigr|^2. \]

The assumption X ∈ [0, 1]^d a.s., together with ‖b_j‖ ≤ 1/√d, implies (X_i, b_j) ∈ [−1, 1], so g_j((X_i, b_j)) is defined.

Theorem 22.2. Let L > 0, C > 0, p = q + r, q ∈ {0, . . . , M}, r ∈ (0, 1]. Assume X ∈ [0, 1]^d a.s., |Y| ≤ L a.s., and

\[ m(x) = \sum_{j=1}^K m_j((x,\beta_j)) \qquad (22.2) \]

for some (p, C)-smooth functions m_j : [−1, 1] → R and some β_j ∈ R^d, ‖β_j‖ ≤ 1/√d. Choose A_n such that

\[ A_n\to\infty\ (n\to\infty)\quad\text{and}\quad\frac{A_n}{\log(n)}\to 0\ (n\to\infty), \]

set B = L + 1, M = q, K_n = (C^2 n/log(n))^{1/(2p+1)}, and define the estimate as above. Then there exists a constant c > 0 depending only on L, M, d, and K such that, for n sufficiently large,

\[ \mathbf{E}\int|m_n(x)-m(x)|^2\,\mu(dx) \le c\cdot C^{\frac{2}{2p+1}}\cdot\Bigl(\frac{\log(n)}{n}\Bigr)^{\frac{2p}{2p+1}}. \]

In the proof we will apply the techniques introduced in Chapter 11 for the analysis of nonlinear least squares estimates. These results require that Y is bounded rather than that the conditional variance of Y given X is bounded as in Theorem 22.1. As in Theorem 22.1 we get in Theorem 22.2, for the multivariate regression estimation problem, up to a logarithmic factor the same rate of convergence as for the univariate regression estimation problem. But this time the assumption on the structure of the regression function is much less restrictive than in the additive model (cf. Lemma 16.2).


Proof. In order to prove Theorem 22.2 we apply our standard techniques for the analysis of least squares estimates introduced in Chapter 11. The only two new things we need are bounds on the covering numbers and on the approximation error.

We use the error decomposition

\[ \int|m_n(x)-m(x)|^2\,\mu(dx) = T_{1,n}+T_{2,n}, \]

where

\[ T_{2,n} = 2\cdot\frac1n\sum_{i=1}^n\bigl(|m_n(X_i)-Y_i|^2-|m(X_i)-Y_i|^2\bigr) \]

and

\[ T_{1,n} = \mathbf{E}\bigl\{|m_n(X)-Y|^2-|m(X)-Y|^2\,\big|\,D_n\bigr\} - T_{2,n}. \]

By (11.12),

\[ \mathbf{E}\,T_{2,n} \le 2\inf_{g_j,b_j:\ g_j\in\mathcal{F}_n(L+1,A_n),\ \|b_j\|\le 1/\sqrt{d}}\int\Bigl|\sum_{j=1}^K g_j((x,b_j))-m(x)\Bigr|^2\mu(dx) \]

and by the assumptions on m we get

\[ \Bigl|\sum_{j=1}^K g_j((x,b_j))-m(x)\Bigr|^2 = \Bigl|\sum_{j=1}^K\bigl(g_j((x,b_j))-m_j((x,\beta_j))\bigr)\Bigr|^2 \le K\cdot\sum_{j=1}^K|g_j((x,b_j))-m_j((x,\beta_j))|^2. \]

Set b_j = β_j and choose g_j according to Lemma 11.1 (observe that for n sufficiently large we have g_j ∈ F_n(L + 1, A_n)). Then

\[ |g_j((x,b_j))-m_j((x,\beta_j))| \le \sup_{u\in[-1,1]}|g_j(u)-m_j(u)| \le \frac{1}{2^p q!}\cdot\frac{C}{(K_n/2)^p}, \]

which implies that, for n sufficiently large, we have

\[ \mathbf{E}\,T_{2,n} \le 2\cdot K^2\cdot\frac{1}{q!^2}\cdot\frac{C^2}{K_n^{2p}}. \qquad (22.3) \]

Next we bound ET_{1,n}. Let G_n be the set of all functions

\[ g(x) = \sum_{j=1}^K g_j((x,b_j))\qquad(g_j\in\mathcal{F}_n(L+1,A_n),\ \|b_j\|\le 1/\sqrt{d}). \]

As in the proof of Theorem 11.5 one gets, for arbitrary t ≥ 1/n,

\[ \mathbf{P}\{T_{1,n}>t\} \le 14\sup_{x_1^n}\mathcal{N}_1\Bigl(\frac{1}{80K\cdot(L+1)\cdot n},\mathcal{G}_n,x_1^n\Bigr)\exp\Bigl(-\frac{n}{24\cdot 214\cdot((L+1)K)^4}\cdot t\Bigr). \]

Next we bound the covering number. Let H_n be the set of all functions

\[ h(x) = f((x,b))\qquad(f\in\mathcal{F}_n(L+1,A_n),\ \|b\|\le 1/\sqrt{d}). \]

By Lemma 16.4 we get

\[ \mathcal{N}_1\Bigl(\frac{1}{80K\cdot(L+1)\cdot n},\mathcal{G}_n,x_1^n\Bigr) \le \Bigl(\mathcal{N}_1\Bigl(\frac{1}{80K^2\cdot(L+1)\cdot n},\mathcal{H}_n,x_1^n\Bigr)\Bigr)^K. \]

Set δ = 1/(80K^2(L + 1)n). Choose b_1, . . . , b_N ∈ R^d such that for each b ∈ R^d, ‖b‖ ≤ 1/√d, there exists j ∈ {1, . . . , N} with

\[ \|b-b_j\| \le \frac{\delta}{2A_n\sqrt{d}} \]

and such that

\[ N \le \Bigl(\frac{2A_n d}{\delta}\Bigr)^d. \]

Then the Lipschitz continuity of f ∈ F_n(L + 1, A_n) implies, for x ∈ [0, 1]^d,

\[ |f((x,b))-f((x,b_j))| \le A_n|(x,b-b_j)| \le A_n\sqrt{d}\cdot\|b-b_j\| \le \frac{\delta}{2}. \]

This proves

\[ \mathcal{N}_1(\delta,\mathcal{H}_n,x_1^n) \le \sum_{j=1}^N\mathcal{N}_1\Bigl(\frac{\delta}{2},\{f((\cdot,b_j)):f\in\mathcal{F}_n(L+1,A_n)\},x_1^n\Bigr). \]

Since

\[ \{f((\cdot,b_j)):f\in\mathcal{F}_n(L+1,A_n)\} \]

is a subset of a linear vector space of dimension K_n(M + 1) we get, by Theorems 9.4 and 9.5,

\[ \mathcal{N}_1(\delta,\mathcal{H}_n,x_1^n) \le \sum_{j=1}^N 3\Bigl(\frac{3e(2(L+1))}{\delta/2}\Bigr)^{2(K_n(M+1)+1)} \le \Bigl(\frac{2A_n d}{\delta}\Bigr)^d 3\Bigl(\frac{3e(2(L+1))}{\delta/2}\Bigr)^{2K_n(M+1)+2} \le 3\Bigl(\frac{12e(A_n+L+1)d}{\delta}\Bigr)^{2K_n(M+1)+d+2}. \]

Summarizing the above results we have

\[ \mathcal{N}_1\Bigl(\frac{1}{80K\cdot(L+1)\cdot n},\mathcal{G}_n,x_1^n\Bigr) \le 3^K\Bigl(\frac{12e(A_n+L+1)d}{1/(80K^2(L+1)n)}\Bigr)^{(2K_n(M+1)+d+2)\cdot K} \le 3^K\bigl(960eK^2(L+1)(A_n+L+1)d\cdot n\bigr)^{(2K_n(M+1)+d+2)\cdot K}. \]

As in the proof of Theorem 11.5 one concludes from this

\[ \mathbf{E}\,T_{1,n} \le c\cdot\frac{K_n\log(n)}{n}. \qquad (22.4) \]

The assertion follows from (22.3), (22.4), and the definition of K_n. □

As long as the m_j are not linear functions, the functions of the form (22.1) are not linear in β_j. Therefore computation of the estimate of Theorem 22.2 above requires solving a nonlinear least squares problem, which is not possible in practice. What one can do instead is to use a stepwise approach to construct a similar estimate.

Assume that

\[ Y = m(X) + \epsilon, \]

where m : R^d → R, Eε = 0, and X, ε are independent. Assume, furthermore, that m has the form (22.1). Then

\[ Y - \sum_{j=2}^K m_j((X,\beta_j)) = m_1((X,\beta_1)) + \epsilon \]

and m_1 is the regression function of the random vector (X̃, Ỹ), where

\[ \tilde X = (X,\beta_1),\qquad \tilde Y = Y - \sum_{j=2}^K m_j((X,\beta_j)). \]

Hence if we know all β_j and m_j except m_1, then we can compute an estimate of m_1 by applying an arbitrary univariate regression estimate to the data (X̃_1, Ỹ_1), . . . , (X̃_n, Ỹ_n). In addition, we can do this for various values of β_1, use, e.g., cross-validation to estimate the L2 risk of the corresponding estimates, and choose β_1 such that the estimated L2 risk of the corresponding estimate is as small as possible. In this way we can compute one of the (m_j, β_j) as soon as we know all of the other (m_k, β_k) (k ≠ j). By doing this in a stepwise manner one gets an algorithm for fitting a function of the form (22.1) to the data. For details, see, e.g., Hastie et al. (2001).
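A minimal sketch of this stepwise fitting idea follows. It replaces cross-validation by a plain comparison of residual sums of squares over randomly drawn candidate directions, and it assumes a hypothetical univariate regression routine fit_univariate; neither choice is prescribed by the text.

import numpy as np

def projection_pursuit(X, Y, K, fit_univariate, n_candidates=50, n_sweeps=5, rng=None):
    # fit_univariate(t, r) is assumed (hypothetically) to return a vectorized
    # function estimating the regression of r on the scalar projection t.
    rng = np.random.default_rng(rng)
    n, d = X.shape
    betas = [rng.standard_normal(d) / np.sqrt(d) for _ in range(K)]
    funcs = [lambda t: np.zeros_like(t)] * K
    for _ in range(n_sweeps):
        for j in range(K):
            # residual after removing the current fits of all other components
            resid = Y - sum(funcs[k](X @ betas[k]) for k in range(K) if k != j)
            best_err = np.inf
            for _ in range(n_candidates):
                b = rng.standard_normal(d)
                b /= np.linalg.norm(b) * np.sqrt(d)      # candidate with ||b|| = 1/sqrt(d)
                g = fit_univariate(X @ b, resid)
                err = np.mean((resid - g(X @ b)) ** 2)   # cross-validation could be used instead
                if err < best_err:
                    best_err, betas[j], funcs[j] = err, b, g
    return lambda x: sum(funcs[j](np.atleast_2d(x) @ betas[j]) for j in range(K))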


22.3 Single Index Models

In this section we study the so-called single index model

\[ m(x) = F((x,\beta)), \qquad (22.5) \]

where β ∈ R^d and the function F : R → R can be arbitrary. We consider it as a special case of projection pursuit with K = 1.

The obvious disadvantage of considering only K = 1 is that not every function can be approximated arbitrarily closely by functions of the form (22.5). But, on the other hand, setting K = 1 simplifies the iterative algorithm described above. Furthermore, an estimate of the form (22.5) can easily be interpreted: the estimate changes only in the direction β, and the way it changes in this direction is described by the univariate function F. In contrast, for projection pursuit, the estimate is a sum of functions which change in various directions β_j, and although each of these functions can be plotted to visualize the way they look, it is hard to imagine how the sum of all these functions behaves.

In the sequel we consider again least squares estimates as in Section 22.2. Assume X ∈ [0, 1]^d a.s., let F_n(L + 1, A_n) be defined as in Section 22.2, and set

\[ m_n(x) = g^*((x,b^*)), \]

where

\[ (g^*,b^*) = \arg\min_{g\in\mathcal{F}_n(L+1,A_n),\ \|b\|\le 1/\sqrt{d}}\ \frac1n\sum_{i=1}^n|Y_i-g((X_i,b))|^2. \]

Setting K = 1 in Theorem 22.2 we get

Corollary 22.1. Let L > 0, C > 0, p = q + r, q ∈ {0, . . . , M}, r ∈ (0, 1]. Assume X ∈ [0, 1]^d a.s., |Y| ≤ L a.s., and

\[ m(x) = F((x,\beta)) \qquad (22.6) \]

for some (p, C)-smooth function F : [−1, 1] → R and some β ∈ R^d, ‖β‖ ≤ 1/√d. Choose A_n such that

\[ A_n\to\infty\ (n\to\infty)\quad\text{and}\quad\frac{A_n}{\log(n)}\to 0\ (n\to\infty), \]

set M = q, K_n = (C^2 n/log(n))^{1/(2p+1)}, and define the estimate as above. Then there exists a constant c > 0 depending only on L, M, and d such that, for n sufficiently large,

\[ \mathbf{E}\int|m_n(x)-m(x)|^2\,\mu(dx) \le c\cdot C^{\frac{2}{2p+1}}\cdot\Bigl(\frac{\log(n)}{n}\Bigr)^{\frac{2p}{2p+1}}. \]

22.4 Bibliographic Notes

The additive model and its generalizations have been investigated by Andrews and Whang (1990), Breiman (1993), Breiman and Friedman (1985), Burman (1990), Chen (1991), Hastie and Tibshirani (1990), Bickel et al. (1993), Huang (1998), Kohler (1998), Linton (1997), Linton and Hardle (1996), Linton and Nielsen (1995), Newey (1994), Stone (1985; 1994), and Wahba et al. (1995).

Projection pursuit was proposed by Friedman and Tukey (1974) and specialized to regression estimation by Friedman and Stuetzle (1981).

In the literature there are conditions for the unique identification of β in the single index model, see, e.g., Horowitz (1998), Ichimura (1993), Manski (1988), Powel (1994), Powel, Stock, and Stoker (1989), and Stoker (1991). Consistency and the rate of convergence were proved under conditions on the underlying distributions, like: m has some derivatives, X has a smooth density, etc. (Amemiya (1985), Davidson and MacKinnon (1993), Gallant (1987), Ichimura (1993), and Robinson (1987; 1988)).

For the single index model a simple estimate is based on the observation that, for differentiable F,

\[ \mathrm{grad}\,m(x) = \mathrm{grad}\,F((x,\beta)) = \beta F'((x,\beta)), \]

therefore,

\[ \mathbf{E}\{\mathrm{grad}\,m(X)\} = \beta\,\mathbf{E}\{F'((X,\beta))\}, \]

so if the expected gradient can be estimated with a good rate then we get a multiple of β. This is the principle of the average derivative estimate (Hardle and Stoker (1989)).

Concerning some general results on dimension reduction we refer to Hall (1988), Hristache et al. (2001), Nicoleris and Yatracos (1997), Samarov (1993) and Zhang (1991).

Problems and Exercises

Problem 22.1. Assume that the regression function is a sum of (p, C)-smooth functions of d* < d of the components of X. Use the principle of least squares to construct an estimate that fits a sum of multivariate piecewise polynomials of d* < d of the components of X to the data. Show that the L2 error of this estimate converges to zero with the rate (log(n)/n)^{2p/(2p+d*)} if the parameters of this estimate are chosen in a suitable way (cf. Kohler (1998)).

Problem 22.2. Use the complexity regularization principle to define an adaptive version of the estimate in Theorem 22.1.

Problem 22.3. Modify the definition of the estimate in Theorem 22.2 in such a way that the resulting estimate is weakly and strongly universally consistent.

Problem 22.4. Use the complexity regularization principle to define an adaptive version of the estimate in Theorem 22.2.

23 Strong Consistency of Local Averaging Estimates

23.1 Partitioning Estimates

For a statistician the individual development of an estimation sequence is also of interest, therefore, in this chapter we discuss the strong consistency of local averaging estimates. Consider first the partitioning estimate in the case of bounded Y, then the strong universal consistency of a modified partitioning estimate, and finally the strong universal consistency of the original partitioning estimate. The notations of Chapter 4 will be used.

Theorem 23.1. Under the conditions (4.1) and (4.2) the partitioning estimate is strongly consistent if |Y| ≤ L with probability one for some L < ∞.

For the proof we shall use the following special version of the Banach-Steinhaus theorem for integral operators in L1(µ).

Theorem 23.2. Let K_n(x, z) be functions on R^d × R^d satisfying the following conditions:

(i) There is a constant c > 0 such that, for all n,

\[ \int|K_n(x,z)|\,\mu(dx) \le c \]

for µ-almost all z.

(ii) There is a constant D ≥ 1 such that

\[ \int|K_n(x,z)|\,\mu(dz) \le D \]

for all x and n.

(iii) For all a > 0,

\[ \lim_{n\to\infty}\int\!\!\int|K_n(x,z)|I_{\{\|x-z\|>a\}}\,\mu(dz)\,\mu(dx) = 0. \]

(iv)

\[ \lim_{n\to\infty}\operatorname*{ess\,sup}_x\Bigl|\int K_n(x,z)\,\mu(dz)-1\Bigr| = 0. \]

Then, for all m ∈ L1(µ),

\[ \lim_{n\to\infty}\int\Bigl|m(x)-\int K_n(x,z)m(z)\,\mu(dz)\Bigr|\mu(dx) = 0. \]

Proof. The set of the continuous functions of compact support is dense in L1(µ) by Theorem A.1, so choose a continuous function of compact support (thus uniformly continuous and bounded) m̃ such that

\[ \int|m(x)-\tilde m(x)|\,\mu(dx) < \epsilon. \]

Then

\[ \int\Bigl|m(x)-\int K_n(x,z)m(z)\,\mu(dz)\Bigr|\mu(dx) \le \int|m(x)-\tilde m(x)|\,\mu(dx) + \int\Bigl|\tilde m(x)\Bigl(1-\int K_n(x,z)\,\mu(dz)\Bigr)\Bigr|\mu(dx) \]
\[ \quad + \int\Bigl|\int K_n(x,z)(\tilde m(x)-\tilde m(z))\,\mu(dz)\Bigr|\mu(dx) + \int\Bigl|\int K_n(x,z)(\tilde m(z)-m(z))\,\mu(dz)\Bigr|\mu(dx) = I_1+I_2+I_3+I_4. \]

By the choice of m̃,

\[ I_1 < \epsilon. \]

By condition (iv),

\[ I_2 \le \sup_u\Bigl|1-\int K_n(u,z)\,\mu(dz)\Bigr|\int|\tilde m(x)|\,\mu(dx) \to 0. \]

Since m̃ is uniformly continuous, it is possible to choose δ > 0 such that ‖x − z‖ < δ implies |m̃(x) − m̃(z)| < ε. Let S_{x,δ} be the sphere centered at x with radius δ and denote its complement by S_{x,δ}^c. Then

\[ I_3 \le \int\!\!\int_{S_{x,\delta}}|K_n(x,z)||\tilde m(x)-\tilde m(z)|\,\mu(dz)\,\mu(dx) + \int\!\!\int_{S_{x,\delta}^c}|K_n(x,z)||\tilde m(x)-\tilde m(z)|\,\mu(dz)\,\mu(dx) \]
\[ \le \epsilon\int\!\!\int_{S_{x,\delta}}|K_n(x,z)|\,\mu(dz)\,\mu(dx) + 2\sup_x|\tilde m(x)|\int\!\!\int_{S_{x,\delta}^c}|K_n(x,z)|\,\mu(dz)\,\mu(dx), \]

therefore, by (ii) and (iii),

\[ \limsup_{n\to\infty}I_3 \le \epsilon D. \]

For the last term apply (i):

\[ I_4 \le \int\!\!\int|K_n(x,z)|\,\mu(dx)\,|\tilde m(z)-m(z)|\,\mu(dz) \le \sup_{u,n}\int|K_n(x,u)|\,\mu(dx)\int|\tilde m(z)-m(z)|\,\mu(dz) \le c\epsilon. \qquad \Box \]

We need two additional lemmas. Set

\[ m_n^*(x) = \frac{\sum_{i=1}^n Y_i I_{\{X_i\in A_n(x)\}}}{n\mu(A_n(x))}. \]

Lemma 23.1. Under the conditions (4.1) and (4.2),

\[ \mathbf{E}\int|m(x)-m_n^*(x)|\,\mu(dx) \to 0. \]

Proof. By the triangle inequality

\[ \mathbf{E}\int|m(x)-m_n^*(x)|\,\mu(dx) \le \int|m(x)-\mathbf{E}m_n^*(x)|\,\mu(dx) + \mathbf{E}\int|m_n^*(x)-\mathbf{E}m_n^*(x)|\,\mu(dx). \]

The first term on the right-hand side is called the bias, and the second term is called the variation of m_n^*. Introduce the notation

\[ K_n(x,z) = \sum_{j=1}^{\infty}I_{\{x\in A_{n,j},\,z\in A_{n,j}\}} = I_{\{z\in A_n(x)\}} = I_{\{x\in A_n(z)\}} \]


and

\[ K_n^*(x,z) = \frac{K_n(x,z)}{\int K_n(x,u)\,\mu(du)} = \frac{I_{\{z\in A_n(x)\}}}{\mu(A_n(x))}. \]

Then

\[ \mathbf{E}m_n^*(x) = \int K_n^*(x,z)m(z)\,\mu(dz). \]

It is easy to see that conditions (ii) and (iv) of Theorem 23.2 are fulfilled for the bias. A simple argument shows that

\[ \int K_n^*(x,z)\,\mu(dx) = \int\frac{I_{\{z\in A_n(x)\}}}{\mu(A_n(x))}\,\mu(dx) = \int\frac{I_{\{x\in A_n(z)\}}}{\mu(A_n(z))}\,\mu(dx) = 1 \qquad (23.1) \]

for µ-almost all z, therefore (i) holds. To verify (iii) let a > 0 and let S be a sphere centered at the origin. Then

\[ \int\!\!\int_{\{z:\|x-z\|>a\}}K_n^*(x,z)\,\mu(dz)\,\mu(dx) = \sum_j\int_{A_{n,j}}\int_{\{z:\|x-z\|>a\}}K_n^*(x,z)\,\mu(dz)\,\mu(dx) = \sum_j\int_{A_{n,j}}\int_{\{z:\|x-z\|>a\}}\frac{I_{\{z\in A_{n,j}\}}}{\mu(A_{n,j})}\,\mu(dz)\,\mu(dx) \]
\[ = \sum_{j:A_{n,j}\cap S\neq\emptyset}\int_{A_{n,j}}\frac{\mu(\{z:\|x-z\|>a\}\cap A_{n,j})}{\mu(A_{n,j})}\,\mu(dx) + \sum_{j:A_{n,j}\cap S=\emptyset}\int_{A_{n,j}}\frac{\mu(\{z:\|x-z\|>a\}\cap A_{n,j})}{\mu(A_{n,j})}\,\mu(dx). \]

By (4.1) the first term on the right-hand side is zero for sufficiently large n, since the maximal diameter of the A_{n,j} becomes smaller than a. The second term is less than \(\sum_{j:A_{n,j}\cap S=\emptyset}\mu(A_{n,j}) \le \mu(S^c)\). So the bias tends to 0.

Next consider the variation term. Let l_n be the number of cells of the partition P_n that intersect S. Then

\[ \mathbf{E}\int|m_n^*(x)-\mathbf{E}m_n^*(x)|\,\mu(dx) = \sum_j\mathbf{E}\int_{A_{n,j}}\Bigl|\frac{\sum_{i=1}^n Y_i I_{\{X_i\in A_{n,j}\}}}{n\mu(A_{n,j})} - \mathbf{E}\frac{\sum_{i=1}^n Y_i I_{\{X_i\in A_{n,j}\}}}{n\mu(A_{n,j})}\Bigr|\mu(dx) \]
\[ = \frac1n\sum_{j:A_{n,j}\cap S\neq\emptyset}\mathbf{E}\Bigl|\sum_{i=1}^n Y_i I_{\{X_i\in A_{n,j}\}} - \mathbf{E}\sum_{i=1}^n Y_i I_{\{X_i\in A_{n,j}\}}\Bigr| + \frac1n\sum_{j:A_{n,j}\cap S=\emptyset}\mathbf{E}\Bigl|\sum_{i=1}^n Y_i I_{\{X_i\in A_{n,j}\}} - \mathbf{E}\sum_{i=1}^n Y_i I_{\{X_i\in A_{n,j}\}}\Bigr| \]

and, therefore,

\[ \mathbf{E}\int|m_n^*(x)-\mathbf{E}m_n^*(x)|\,\mu(dx) \le \frac1n\sum_{j:A_{n,j}\cap S\neq\emptyset}\sqrt{\mathbf{E}\Bigl(\sum_{i=1}^n Y_i I_{\{X_i\in A_{n,j}\}} - \mathbf{E}\sum_{i=1}^n Y_i I_{\{X_i\in A_{n,j}\}}\Bigr)^2} + \frac1n\sum_{j:A_{n,j}\cap S=\emptyset}2Ln\mu(A_{n,j}) \]
\[ \le \frac1n\sum_{j:A_{n,j}\cap S\neq\emptyset}\sqrt{nL^2\mu(A_{n,j})} + 2L\mu(S^c) \le \sum_{j:A_{n,j}\cap S\neq\emptyset}L\sqrt{\frac{\mu(A_{n,j})}{n}} + 2L\mu(S^c) \]
\[ \le L\,l_n\sqrt{\frac{\frac{1}{l_n}\sum_{j:A_{n,j}\cap S\neq\emptyset}\mu(A_{n,j})}{n}} + 2L\mu(S^c) \quad\text{(by Jensen's inequality)} \]
\[ \le L\sqrt{\frac{l_n}{n}} + 2L\mu(S^c) \to 2L\mu(S^c), \]

by the use of (4.2). □

Lemma 23.2. Let (4.1) and (4.2) hold. Then, for each ε > 0,

\[ \mathbf{P}\Bigl\{\int|m(x)-m_n^*(x)|\,\mu(dx) > \epsilon\Bigr\} \le e^{-n\epsilon^2/(32L^2)} \]

if n is large enough.

Proof. We begin with the decomposition

\[ |m(x)-m_n^*(x)| = \mathbf{E}|m(x)-m_n^*(x)| + \bigl(|m(x)-m_n^*(x)| - \mathbf{E}|m(x)-m_n^*(x)|\bigr). \qquad (23.2) \]

The first term on the right-hand side of (23.2) converges in L1 by Lemma 23.1. We use Theorem A.2 to obtain an exponential bound for the second term on the right-hand side of (23.2). Fix the training data (x_1, y_1), . . . , (x_n, y_n) ∈ R^d × [−L, L], and replace (x_i, y_i) by (x̄_i, ȳ_i), thus changing the value of m_n^*(x) to m_{n,i}^*(x). Then m_n^*(x) − m_{n,i}^*(x) differs from zero only on A_n(x_i) and A_n(x̄_i), and thus

\[ \Bigl|\int|m(x)-m_n^*(x)|\,\mu(dx) - \int|m(x)-m_{n,i}^*(x)|\,\mu(dx)\Bigr| \le \int|m_n^*(x)-m_{n,i}^*(x)|\,\mu(dx) \]
\[ \le \Bigl(\frac{2L}{n\mu(A_n(x_i))}\mu(A_n(x_i)) + \frac{2L}{n\mu(A_n(\bar x_i))}\mu(A_n(\bar x_i))\Bigr) \le \frac{4L}{n}. \]

By Theorem A.2, we have that, for sufficiently large n,

\[ \mathbf{P}\Bigl\{\int|m(x)-m_n^*(x)|\,\mu(dx) > \epsilon\Bigr\} \le \mathbf{P}\Bigl\{\int|m(x)-m_n^*(x)|\,\mu(dx) - \mathbf{E}\int|m(x)-m_n^*(x)|\,\mu(dx) > \frac{\epsilon}{2}\Bigr\} \le e^{-n\epsilon^2/(32L^2)}. \qquad \Box \]

Proof of Theorem 23.1. Because of

\[ |m_n(x)-m(x)|^2 \le 2L|m_n(x)-m(x)| \]

it suffices to show that

\[ \lim_{n\to\infty}\int|m_n(x)-m(x)|\,\mu(dx) = 0 \]

with probability one. Introduce the notation

\[ B_n = \Bigl\{x:\sum_{i=1}^n K_n(x,X_i) > 0\Bigr\}, \]

where K_n(x, z) = I_{\{z∈A_n(x)\}}, i.e., B_n is the set of x's whose cell is nonempty. Write

\[ \int|m_n(x)-m(x)|\,\mu(dx) \le \int|m_n(x)-m_n^*(x)|\,\mu(dx) + \int|m_n^*(x)-m(x)|\,\mu(dx). \]

By Lemma 23.2 and the Borel-Cantelli lemma

\[ \int|m_n^*(x)-m(x)|\,\mu(dx) \to 0 \]

with probability one. On the other hand, if x ∈ B_n, then

\[ |m_n^*(x)-m_n(x)| = \Bigl|\frac{\sum_{i=1}^n K_n(x,X_i)Y_i}{n\int K_n(x,z)\,\mu(dz)} - \frac{\sum_{i=1}^n K_n(x,X_i)Y_i}{\sum_{i=1}^n K_n(x,X_i)}\Bigr| \le L\sum_{i=1}^n K_n(x,X_i)\Bigl|\frac{1}{n\int K_n(x,z)\,\mu(dz)} - \frac{1}{\sum_{i=1}^n K_n(x,X_i)}\Bigr| \]
\[ = L\Bigl|\frac{\sum_{i=1}^n K_n(x,X_i)}{n\int K_n(x,z)\,\mu(dz)} - 1\Bigr| = L|M_n^*(x)-1|, \]

where M_n^*(x) is the special form of m_n^*(x) for Y ≡ 1. If x ∈ B_n^c, then

\[ |m_n^*(x)-m_n(x)| = 0 \le L|M_n^*(x)-1|. \]

Therefore, by Lemma 23.2,

\[ \int|m_n(x)-m_n^*(x)|\,\mu(dx) \le L\int|M_n^*(x)-1|\,\mu(dx) \to 0 \]

with probability one, and the proof is complete. □

It is not known if under (4.1) and (4.2) the standard partitioning estimate is strongly universally consistent. We will prove the universal consistency under some additional mild conditions. In Theorem 23.3 the partitioning estimate is modified such that the estimate is 0 if there are few points in the actual cell, while in Theorem 23.4 we don't change the partition too frequently.

Consider the following modification of the standard partitioning estimate:

\[ m_n'(x) = \begin{cases} \dfrac{\sum_{i=1}^n Y_i I_{\{X_i\in A_n(x)\}}}{\sum_{i=1}^n I_{\{X_i\in A_n(x)\}}} & \text{if } \sum_{i=1}^n I_{\{X_i\in A_n(x)\}} > \log n, \\[2mm] 0 & \text{otherwise.} \end{cases} \]
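For a concrete partition, say a cubic partition of R^d with side length h_n, the modified estimate m_n' can be computed by hashing the sample into cells, as in the following sketch; the cubic partition and the bandwidth parameter h are illustrative assumptions, not part of the theorem.

import numpy as np
from collections import defaultdict

def modified_partitioning_estimate(X, Y, h):
    # Cells of a cubic partition of R^d with side length h, indexed by the
    # integer vector floor(x/h); the estimate averages the Y_i in the cell
    # of x, but only if the cell contains more than log(n) sample points.
    n = len(Y)
    cells = defaultdict(list)
    for x, y in zip(X, Y):
        cells[tuple(np.floor(np.asarray(x, dtype=float) / h).astype(int))].append(y)
    threshold = np.log(n)
    def m_prime(x):
        key = tuple(np.floor(np.asarray(x, dtype=float) / h).astype(int))
        ys = cells.get(key, [])
        return float(np.mean(ys)) if len(ys) > threshold else 0.0
    return m_prime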


Theorem 23.3. Assume (4.1). If for each sphere S centered at the origin

\[ \lim_{n\to\infty}\frac{|\{j:A_{n,j}\cap S\neq\emptyset\}|\log n}{n} = 0, \qquad (23.3) \]

then m_n' is strongly universally consistent.

The following lemma is sometimes useful to extend consistencies from bounded Y's to unbounded Y's.

Lemma 23.3. Let m_n be a local averaging regression function estimate with subprobability weights W_{n,i}(x) that is strongly consistent for all distributions of (X, Y) such that Y is bounded with probability one. Assume that there is a constant c such that, for all Y with EY^2 < ∞,

\[ \limsup_{n\to\infty}\sum_{i=1}^n Y_i^2\int W_{n,i}(x)\,\mu(dx) \le c\,\mathbf{E}Y^2 \quad\text{with probability one.} \]

Then m_n is strongly universally consistent.

Proof. Fix ε > 0. Choose L > 0 such that

\[ \mathbf{E}|Y_L-Y|^2 < \epsilon, \]

where

\[ Y_L = \begin{cases} L & \text{if } Y > L, \\ Y & \text{if } -L\le Y\le L, \\ -L & \text{if } Y < -L. \end{cases} \]

For j ∈ {1, . . . , n} set

\[ Y_{j,L} = \begin{cases} L & \text{if } Y_j > L, \\ Y_j & \text{if } -L\le Y_j\le L, \\ -L & \text{if } Y_j < -L. \end{cases} \]

Let m_L and m_{n,L} be the functions m and m_n when Y and Y_j are replaced by Y_L and Y_{j,L}. Then

\[ \int(m_n(x)-m(x))^2\,\mu(dx) \le 3\Bigl(\int(m_n(x)-m_{n,L}(x))^2\,\mu(dx) + \int(m_{n,L}(x)-m_L(x))^2\,\mu(dx) + \int(m_L(x)-m(x))^2\,\mu(dx)\Bigr). \]

Because of the conditions, for all L,

\[ \int(m_{n,L}(x)-m_L(x))^2\,\mu(dx) \to 0 \quad\text{with probability one.} \]

By Jensen's inequality,

\[ \int(m_L(x)-m(x))^2\,\mu(dx) = \mathbf{E}(\mathbf{E}\{Y_L|X\}-\mathbf{E}\{Y|X\})^2 \le \mathbf{E}(Y_L-Y)^2 < \epsilon \]

by the choice of L. We may apply another version of Jensen's inequality, together with the fact that the weights are subprobability weights, to bound the first term:

\[ \limsup_{n\to\infty}\int(m_n(x)-m_{n,L}(x))^2\,\mu(dx) = \limsup_{n\to\infty}\int\Bigl(\sum_{i=1}^n W_{n,i}(x)(Y_i-Y_{i,L})\Bigr)^2\mu(dx) \]
\[ \le \limsup_{n\to\infty}\int\sum_{i=1}^n W_{n,i}(x)(Y_i-Y_{i,L})^2\,\mu(dx) = \limsup_{n\to\infty}\sum_{i=1}^n\int W_{n,i}(x)\,\mu(dx)(Y_i-Y_{i,L})^2 \]
\[ \le c\,\mathbf{E}(Y-Y_L)^2 \quad\text{(by the assumption of the lemma)} \]
\[ < c\epsilon \quad\text{with probability one,} \]

by the choice of L. □

Proof of Theorem 23.3. For fixed S and c = e^2, let

\[ I_n = \{j:A_{n,j}\cap S\neq\emptyset\},\quad J_n = \{j:\mu_n(A_{n,j})>\log n/n\},\quad L_n = \{j:\mu(A_{n,j})>c\log n/n\},\quad F_n = \bigcup_{j\in I_n\cap J_n^c}A_{n,j}. \]

First we show that, for |Y| ≤ L,

\[ \int|m_n'(x)-m(x)|\,\mu(dx) \to 0 \quad\text{a.s.} \qquad (23.4) \]

By Theorem 23.1,

\[ \int|m_n(x)-m(x)|\,\mu(dx) \to 0 \quad\text{a.s.,} \]

so we need that

\[ \int|m_n'(x)-m_n(x)|\,\mu(dx) \to 0 \quad\text{a.s.} \]

Because of

\[ \int|m_n'(x)-m_n(x)|\,\mu(dx) \le 2L(\mu(S^c)+\mu(F_n)) \]

it suffices to show that

\[ \mu(F_n)\to 0 \quad\text{a.s.} \]

One has

\[ \mu(F_n) \le \sum_{j\in I_n\cap L_n^c}\mu(A_{n,j}) + \sum_{j\in I_n\cap L_n\cap J_n^c}\mu(A_{n,j}) \le c\,\frac{\log n}{n}|I_n| + \sum_{j\in I_n\cap L_n}\mu(A_{n,j})\,I_{\{c\mu_n(A_{n,j})<\mu(A_{n,j})\}}. \]

The first term on the right-hand side tends to zero by condition (23.3). For the second term, Lemma A.1 with ε = µ(A)/c implies

\[ \mathbf{P}\{c\mu_n(A)<\mu(A)\} = \mathbf{P}\Bigl\{n\mu_n(A)<\frac{n\mu(A)}{c}\Bigr\} \le e^{-n(\mu(A)-\frac{\mu(A)}{c}+\frac{\mu(A)}{c}\log\frac1c)} = e^{-n\mu(A)(1-\frac1c-\frac{\log c}{c})}. \]

Thus, for any 0 < ε < 1,

\[ \mathbf{P}\Bigl\{\sum_{j\in I_n\cap L_n}\mu(A_{n,j})I_{\{c\mu_n(A_{n,j})<\mu(A_{n,j})\}} > \epsilon\Bigr\} \le \mathbf{P}\Bigl\{\sum_{j\in I_n\cap L_n}\mu(A_{n,j})I_{\{c\mu_n(A_{n,j})<\mu(A_{n,j})\}} > \epsilon\sum_{j\in I_n\cap L_n}\mu(A_{n,j})\Bigr\} \]
\[ \le \sum_{j\in I_n\cap L_n}\mathbf{P}\{I_{\{c\mu_n(A_{n,j})<\mu(A_{n,j})\}} > \epsilon\} = \sum_{j\in I_n\cap L_n}\mathbf{P}\{c\mu_n(A_{n,j})<\mu(A_{n,j})\} \]
\[ \le \sum_{j\in I_n\cap L_n}e^{-n\mu(A_{n,j})(1-\frac1c-\frac{\log c}{c})} \le \sum_{j\in I_n}e^{-(\log n)\cdot[c(1-\frac1c-\frac{\log c}{c})]} = |I_n|\,n^{-c(1-\frac1c-\frac{\log c}{c})} = |I_n|\,n^{-(e^2-3)}, \]

therefore by condition (23.3) we have, for sufficiently large n,

\[ \mathbf{P}\Bigl\{\sum_{j\in I_n\cap L_n}\mu(A_{n,j})I_{\{c\mu_n(A_{n,j})<\mu(A_{n,j})\}} > \epsilon\Bigr\} < n^{-e^2+4}, \]

which is summable. This proves (23.4).

Now apply Lemma 23.3, the second condition of which is satisfied if

\[ \limsup_{n\to\infty}\ \max_{i\le n}\Bigl\{n\int W_{n,i}(x)\,\mu(dx)\Bigr\} \le c \quad\text{a.s.} \]

Let c be as above,

\[ B_n = \bigcup_{j\in J_n\cap L_n}A_{n,j} \qquad\text{and}\qquad D_n = \bigcup_{j\in J_n}A_{n,j}. \]

If J_n ≠ ∅, then

\[ \max_{i\le n}\ n\int W_{n,i}(x)\,\mu(dx) = \max_{i:A_n(X_i)\subset D_n}\frac{\mu(A_n(X_i))}{\mu_n(A_n(X_i))}, \]

and if J_n = ∅, then m_n' = 0 and

\[ \max_{i\le n}\ n\int W_{n,i}(x)\,\mu(dx) = 0, \]

therefore

\[ \mathbf{P}\Bigl\{\max_{i\le n}\ n\int W_{n,i}(x)\,\mu(dx) > c\Bigr\} = \mathbf{P}\Bigl\{J_n\neq\emptyset,\ \max_{i:A_n(X_i)\subset D_n}\frac{\mu(A_n(X_i))}{\mu_n(A_n(X_i))} > c\Bigr\} \]
\[ \le n\cdot\mathbf{P}\Bigl\{A_n(X_1)\subset D_n,\ \frac{\mu(A_n(X_1))}{\mu_n(A_n(X_1))} > c\Bigr\} = n\,\mathbf{P}\Bigl\{A_n(X_1)\subset B_n,\ \frac{\mu(A_n(X_1))}{\mu_n(A_n(X_1))} > c\Bigr\} \]
\[ \le n\sum_{j\in L_n}\mathbf{P}\Bigl\{X_1\in A_{n,j},\ \frac{\mu(A_{n,j})}{\mu_n(A_{n,j})} > c\Bigr\} = n\sum_{j\in L_n}\mathbf{P}\Bigl\{X_1\in A_{n,j},\ \frac{n\mu(A_{n,j})}{1+\sum_{i=2}^n I_{\{X_i\in A_{n,j}\}}} > c\Bigr\} \]
\[ = n\sum_{j\in L_n}\mu(A_{n,j})\,\mathbf{P}\Bigl\{\frac{n\mu(A_{n,j})}{1+\sum_{i=2}^n I_{\{X_i\in A_{n,j}\}}} > c\Bigr\} \le n\sum_{j\in L_n}\mu(A_{n,j})\,\mathbf{P}\Bigl\{\frac{\mu(A_{n,j})}{\mu_n(A_{n,j})} > c\Bigr\} \]
\[ \le n\sum_{j\in L_n}\mu(A_{n,j})\,e^{-n\mu(A_{n,j})(1-\frac1c-\frac{\log c}{c})} \le n\sum_j\mu(A_{n,j})\,e^{-c(\log n)(1-\frac1c-\frac{\log c}{c})} \le n^{-e^2+4}, \]

which is again summable. □

A sequence {P_n} of partitions of R^d by Borel sets is called nested if A_{n+1}(x) ⊆ A_n(x) for all n ∈ N, x ∈ R^d.

Theorem 23.4. Let m_n be a sequence of partitioning estimates with partition sequence {P_n} satisfying (4.1) and (4.2). Let indices n_1, n_2, . . . satisfy n_{k+1} ≥ D · n_k for some fixed D > 1. Assume that either P_{n−1} ≠ P_n at most for the indices n = n_1, n_2, . . . or that the sequence {P_n} is nested satisfying A_n(x) ∈ {A_{n_k}(x), A_{n_{k+1}}(x)} for n ∈ {n_k, n_k + 1, . . . , n_{k+1}} (k = 1, 2, . . . , x ∈ R^d). Then m_n is strongly universally consistent.

Let P_1, . . . , P_7 be the partitions described in Figure 23.1. Then for n ≤ 7 the first kind of assumptions of Theorem 23.4 is satisfied with n_1 = 3 and n_2 = 7, and the second kind of assumptions is satisfied with n_1 = 1 and n_2 = 7.


[Figure 23.1. A sequence of nested partitions: P_1 = P_2 with cells A_{1,1}, A_{1,2}; P_3 = P_4 = P_5 = P_6 with cells A_{3,1}, A_{3,2}, A_{3,3}; P_7 with cells A_{7,1}, A_{7,2}, A_{7,3}, A_{7,4}.]

If Y is bounded then, according to Theorem 23.1, conditions (4.1) and (4.2) imply strong consistency, even for nonnested {P_n}.

In the following we give a simple example of a partition sequence which satisfies the second kind of assumptions of Theorem 23.4. Let d = 1. Via successive bisections we define partitions of (0, 1): let Q_{2^k} be

\[ \Bigl\{\Bigl(0,\frac{1}{2^k}\Bigr],\Bigl(\frac{1}{2^k},\frac{2}{2^k}\Bigr],\ldots,\Bigl(\frac{2^k-1}{2^k},1\Bigr)\Bigr\} \]

and let Q_{2^k+j} be

\[ \Bigl\{\Bigl(0,\frac{1}{2^{k+1}}\Bigr],\ldots,\Bigl(\frac{2j-1}{2^{k+1}},\frac{2j}{2^{k+1}}\Bigr],\Bigl(\frac{2j}{2^{k+1}},\frac{2j+2}{2^{k+1}}\Bigr],\ldots,\Bigl(\frac{2^{k+1}-2}{2^{k+1}},1\Bigr)\Bigr\} \]

for j ∈ {1, . . . , 2^k − 1}, k ∈ {0, 1, . . .}. By a fixed continuous bijective transformation of (0, 1) to R (e.g., by x → tan(π(x − 1/2))) let the partition Q_n of (0, 1) be transformed into the partition P̃_n of R. Then {P̃_n} fulfills the conditions of Theorem 23.4 with n_k = 2^k, besides (4.2), and with P_n = P̃_{√n} all conditions of Theorem 23.4 are fulfilled.
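The following sketch, given only for illustration, computes the index of the cell of Q_n containing a point of (0, 1) and the corresponding cell of the transformed partition of R; the 0-based indexing is an arbitrary convention.

import numpy as np

def q_cell(x, n):
    # 0-based index of the cell of Q_n containing x in (0,1), where
    # Q_{2^k + j} refines the first j dyadic cells of level k into level k+1.
    k = int(np.floor(np.log2(n)))
    j = n - 2 ** k                              # 0 <= j < 2^k
    if x <= j / 2.0 ** k:                       # refined part: cells of width 2^{-(k+1)}
        return int(np.ceil(x * 2 ** (k + 1))) - 1
    return j + int(np.ceil(x * 2 ** k)) - 1     # coarse part: cells of width 2^{-k}

def p_cell(t, n):
    # Cell of the transformed partition of R: map t back to (0,1) by the
    # inverse of x -> tan(pi*(x - 1/2)) and look up its Q_n cell.
    x = np.arctan(t) / np.pi + 0.5
    return q_cell(x, n)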

We need some lemmas for the proof of Theorem 23.4. The following lemma is well known from the classical Kolmogorov proof and from Etemadi's (1981) proof of the strong law of large numbers for independent and identically distributed integrable random variables.

Lemma 23.4. For identically distributed random variables Y_n ≥ 0 with EY_n < ∞ let Y_n^* be the truncation of Y_n at n, i.e., Y_n^* := Y_n I_{\{Y_n≤n\}} + n I_{\{Y_n>n\}}. Then

\[ \frac1n\mathbf{E}\{Y_n^{*2}\} \to 0 \quad(n\to\infty), \qquad (23.5) \]

\[ \sum_{n=1}^{\infty}\frac{1}{n^2}\mathbf{E}\{Y_n^{*2}\} < \infty, \qquad (23.6) \]

and, for n_{k+1} ≥ Dn_k with D > 1,

\[ \sum_{k=1}^{\infty}\frac{1}{n_k}\mathbf{E}\{Y_{n_k}^{*2}\} < \infty. \qquad (23.7) \]

Proof. Noticing that

\[ \mathbf{E}\{Y_n^{*2}\} = \sum_{i=1}^n\int_{(i-1,i]}t^2P_Y(dt), \]

one obtains (23.5) from

\[ \sum_{i=1}^{\infty}\frac1i\int_{(i-1,i]}t^2P_Y(dt) \le \sum_{i=1}^{\infty}\int_{(i-1,i]}tP_Y(dt) = \mathbf{E}Y < \infty \]

by the Kronecker lemma, and (23.6) follows from

\[ \sum_{n=1}^{\infty}\frac{1}{n^2}\mathbf{E}\{Y_n^{*2}\} = \sum_{n=1}^{\infty}\sum_{i=1}^n\frac{1}{n^2}\int_{(i-1,i]}t^2P_Y(dt) = \sum_{i=1}^{\infty}\int_{(i-1,i]}t^2P_Y(dt)\sum_{n=i}^{\infty}\frac{1}{n^2} \le \sum_{i=1}^{\infty}\frac{2}{i}\int_{(i-1,i]}t^2P_Y(dt) \le 2\mathbf{E}Y < \infty. \]

In view of (23.7), one notices

\[ \sum_{k=j}^{\infty}\frac{1}{n_k} \le \sum_{k=j}^{\infty}\frac{1}{D^{k-j}\cdot n_j} = \frac{1}{n_j}\sum_{l=0}^{\infty}\Bigl(\frac1D\Bigr)^l = \frac{D}{(D-1)n_j}. \]

Now let m_i denote the minimal index m with i ≤ n_m. Then

\[ \sum_{k=1}^{\infty}\frac{1}{n_k}\mathbf{E}\{Y_{n_k}^{*2}\} = \sum_{k=1}^{\infty}\frac{1}{n_k}\sum_{i=1}^{n_k}\int_{(i-1,i]}t^2P_Y(dt) = \sum_{i=1}^{\infty}\sum_{k=1}^{\infty}I_{\{i\le n_k\}}\cdot\frac{1}{n_k}\int_{(i-1,i]}t^2P_Y(dt) = \sum_{i=1}^{\infty}\Bigl(\sum_{k=m_i}^{\infty}\frac{1}{n_k}\Bigr)\int_{(i-1,i]}t^2P_Y(dt) \]
\[ \le \sum_{i=1}^{\infty}\frac{D}{(D-1)\cdot n_{m_i}}\int_{(i-1,i]}t^2P_Y(dt) \le \frac{D}{D-1}\sum_{i=1}^{\infty}\frac1i\int_{(i-1,i]}t^2P_Y(dt) \le \frac{D}{D-1}\mathbf{E}Y < \infty. \qquad \Box \]

Lemma 23.5. Let K_n : R^d × R^d → {0, 1} be a measurable function. Assume that a constant ρ > 0 exists with

\[ \int\frac{K_n(x,z)}{\int K_n(x,s)\,\mu(ds)}\,\mu(dx) \le \rho \qquad (23.8) \]

for all n, all z, and all distributions µ. If K_{n−1} ≠ K_n at most for the indices n = n_1, n_2, . . . , where n_{k+1} ≥ Dn_k for some fixed D > 1, then

\[ \limsup_{n\to\infty}\int\sum_{i=1}^n\frac{Y_iK_n(x,X_i)}{1+\sum_{j\in\{1,\ldots,n\}\setminus\{i\}}K_n(x,X_j)}\,\mu(dx) \le 2\rho\mathbf{E}Y \quad\text{a.s.,} \qquad (23.9) \]

for each integrable Y ≥ 0.

Proof. Without loss of generality we may assume n_{k+1}/n_k ≤ 2 (as long as n_{l+1}/n_l ≥ 2 for some l, insert ⌈√(n_l · n_{l+1})⌉ into the sequence {n_k}). Set

\[ Y^{[N]} := YI_{\{Y\le N\}}+N\cdot I_{\{Y>N\}},\qquad Y_i^{[N]} := Y_iI_{\{Y_i\le N\}}+N\cdot I_{\{Y_i>N\}}, \]

and, mimicking Lemma 23.4,

\[ Y_n^* := Y_nI_{\{Y_n\le n\}}+nI_{\{Y_n>n\}}. \]

Further, set

\[ U_{n,i} := \int\frac{Y_i^*K_n(x,X_i)}{1+\sum_{j\in\{1,\ldots,n\}\setminus\{i\}}K_n(x,X_j)}\,\mu(dx) \]

and

\[ V_{n,i} := U_{n,i}-\mathbf{E}U_{n,i}. \]

In the first step it will be shown that

\[ \mathbf{E}\Bigl(\sum_{i=1}^N V_{n,i}\Bigr)^2 \le \frac{c}{N}\mathbf{E}\{Y^{[N]2}\} \qquad (23.10) \]

for 4 ≤ n ≤ N ≤ 2n with some suitable constant c.


Notice

\[ \mathbf{E}\frac{1}{(1+B(n,p))^2} \le \frac{2}{(n+1)(n+2)p^2}, \qquad (23.11) \]

\[ \mathbf{E}\frac{1}{(1+B(n,p))^4} \le \frac{24}{(n+1)(n+2)(n+3)(n+4)p^4}, \qquad (23.12) \]

for random variables B(n, p) binomially distributed with parameters n and p (cf. Problem 23.1). Let

\[ (X_1,Y_1),\ldots,(X_N,Y_N) \quad\text{and}\quad (\bar X_1,\bar Y_1),\ldots,(\bar X_N,\bar Y_N) \]

be i.i.d. copies of (X, Y), and let U_{n,i,l} be obtained from U_{n,i} via replacing (X_l, Y_l) by (X̄_l, Ȳ_l) (l = 1, . . . , N). By Theorem A.3,

\[ \mathbf{E}\Bigl|\sum_{i=1}^N V_{n,i}\Bigr|^2 = \mathrm{Var}\Bigl\{\sum_{i=1}^N U_{n,i}\Bigr\} \le \frac12\sum_{l=1}^N\mathbf{E}\Bigl|\sum_{i=1}^N U_{n,i}-\sum_{i=1}^N U_{n,i,l}\Bigr|^2 \]
\[ = \frac12\sum_{l=1}^N\mathbf{E}\Bigl|\sum_{i=1}^N\int\frac{Y_i^*K_n(x,X_i)}{1+\sum_{j\in\{1,\ldots,n\}\setminus\{i\}}K_n(x,X_j)}\,\mu(dx) \]
\[ \qquad - \sum_{i\in\{1,\ldots,N\}\setminus\{l\}}\int\frac{Y_i^*K_n(x,X_i)}{1+\sum_{j\in\{1,\ldots,n\}\setminus\{i,l\}}K_n(x,X_j)+K_n(x,\bar X_l)I_{\{l\le n\}}}\,\mu(dx) - \int\frac{\bar Y_l^*K_n(x,\bar X_l)}{1+\sum_{j\in\{1,\ldots,n\}\setminus\{l\}}K_n(x,X_j)}\,\mu(dx)\Bigr|^2 \]
\[ =: \frac12\sum_{l=1}^N A_l. \]


Al

≤ 2E

∣∣∣∣∣∣∣

∫Y ∗

l Kn(x,Xl) + Y ∗l Kn(x, Xl)

1 +∑

j∈1,...,n\lKn(x,Xj)

µ(dx)

∣∣∣∣∣∣∣

2

+ 2E

∣∣∣∣∣∣∣

i∈1,...,N\l

∫Y ∗

i Kn(x,Xi)[Kn(x, Xl) +Kn(x,Xl)][1 +

j∈1,...,n\i,lKn(x,Xj)]2

µ(dx)

∣∣∣∣∣∣∣

2

≤ 8E

∣∣∣∣∣∣∣

∫Y ∗

l Kn(x,Xl)1 +

j∈1,...,n\lKn(x,Xj)

µ(dx)

∣∣∣∣∣∣∣

2

+ 8E

∣∣∣∣∣∣∣

i∈1,...,N\l

∫Y ∗

i Kn(x,Xi)Kn(x,Xl)[1 +

j∈1,...,n\i,lKn(x,Xj)]2

µ(dx)

∣∣∣∣∣∣∣

2

(by exchangeability)

≤ 8E

Y

[N ]2

1

Kn(x,X1)1 +

j∈2,...,nKn(x,Xj)

µ(dx)

2

+ 8NE

Y

[N ]2

1

Kn(x,X1)Kn(x,X2)[1 +

j∈3,...,nKn(x,Xj)]2

µ(dx)

2

+ 8N2E

Y[N ]1 Y

[N ]2

∫Kn(x,X1)Kn(x,X3)

[1 +∑

j∈4,...,nKn(x,Xj)]2

µ(dx)

×∫

Kn(x, X2)Kn(x, X3)[1 +

j∈4,...,nKn(x, Xj)]2

µ(dx)

(by exchangeability and 0 ≤ Y ∗i ≤ Y

[N ]i )

=: 8B + 8C + 8D.

We have

\[ B = \mathbf{E}\Bigl\{\bigl|Y_1^{[N]}\bigr|^2\int\!\!\int\frac{K_n(x,X_1)K_n(\tilde x,X_1)}{[1+\sum_{j\in\{2,\ldots,n\}}K_n(x,X_j)][1+\sum_{j\in\{2,\ldots,n\}}K_n(\tilde x,X_j)]}\,\mu(dx)\,\mu(d\tilde x)\Bigr\} \]
\[ \le \mathbf{E}\Bigl\{\bigl|Y_1^{[N]}\bigr|^2\int\!\!\int K_n(x,X_1)K_n(\tilde x,X_1)\Bigl(\mathbf{E}\Bigl\{\frac{1}{[1+\sum_{j=2}^n K_n(x,X_j)]^2}\Bigr\}\Bigr)^{1/2}\Bigl(\mathbf{E}\Bigl\{\frac{1}{[1+\sum_{j=2}^n K_n(\tilde x,X_j)]^2}\Bigr\}\Bigr)^{1/2}\mu(dx)\,\mu(d\tilde x)\Bigr\} \]
(by independence and by the Cauchy-Schwarz inequality)
\[ \le 2\mathbf{E}\Bigl\{\bigl|Y_1^{[N]}\bigr|^2\frac{1}{n^2}\int\frac{K_n(x,X_1)}{\int K_n(x,s)\,\mu(ds)}\,\mu(dx)\int\frac{K_n(\tilde x,X_1)}{\int K_n(\tilde x,s)\,\mu(ds)}\,\mu(d\tilde x)\Bigr\} \quad\text{(by (23.11))} \]
\[ \le \frac{2\rho^2}{n^2}\mathbf{E}\{Y^{[N]2}\} \quad\text{(by (23.8))}. \]

In a similar way we obtain

\[ C \le \frac{17\rho^2 N}{n^3}\mathbf{E}\{Y^{[N]2}\} \]

by

\[ \frac{K_n(x,X_2)}{1+\sum_{j=3}^n K_n(x,X_j)} \le \frac{1}{1+\sum_{j=3}^n K_n(x,X_j)} \le 1 \]

and by the use of (23.11), (23.12), and (23.8); further, via exchangeability,

\[ D \le \frac{52\rho^2 N^2}{n^4}\mathbf{E}\bigl|Y^{[N]}\bigr|^2 \]

by the use of (23.12) and (23.8) (cf. Problem 23.2). These bounds yield (23.10).


In the second step a monotonicity argument will be used to show

\[ \limsup_{n\to\infty}\sum_{i=1}^n\int\frac{Y_i^*K_n(x,X_i)}{1+\sum_{j\in\{1,\ldots,n\}\setminus\{i\}}K_n(x,X_j)}\,\mu(dx) \le 2\rho\mathbf{E}Y \quad\text{a.s.} \qquad (23.13) \]

For

\[ n_k \le n \le n_{k+1}-1 \le n_{k+1} \le 2n_k \]

one has K_n = K_{n_k}, which implies

\[ \sum_{i=1}^n\int\frac{Y_i^*K_n(x,X_i)}{1+\sum_{j\in\{1,\ldots,n\}\setminus\{i\}}K_n(x,X_j)}\,\mu(dx) \le \sum_{i=1}^{n_{k+1}-1}U_{n_k,i}. \qquad (23.14) \]

Further,

\[ \mathbf{E}U_{n,i} = \int\mathbf{E}\Bigl\{Y_i^*K_n(x,X_i)\,\mathbf{E}\Bigl\{\frac{1}{1+\sum_{j\in\{1,\ldots,n\}\setminus\{i\}}K_n(x,X_j)}\Bigr\}\Bigr\}\mu(dx) \le \int\mathbf{E}\Bigl\{Y_i^*K_n(x,X_i)\frac{1}{n\cdot\int K_n(x,s)\,\mu(ds)}\Bigr\}\mu(dx) \]
(by independence and by Lemma 4.1)
\[ = \frac1n\mathbf{E}\Bigl\{Y_i^*\int\frac{K_n(x,X_i)}{\int K_n(x,s)\,\mu(ds)}\,\mu(dx)\Bigr\} \le \frac{\rho}{n}\mathbf{E}\{Y_i^*\} \quad\text{(by (23.8))}, \]

which implies

\[ \sum_{i=1}^{n_{k+1}-1}\mathbf{E}U_{n_k,i} \le \rho\frac{n_{k+1}}{n_k}\mathbf{E}Y \le 2\rho\mathbf{E}Y. \qquad (23.15) \]

Using (23.10), Lemma 23.4, and n_{k+1} ≤ 2n_k one obtains

\[ \mathbf{E}\Bigl\{\sum_{k=1}^{\infty}\Bigl(\sum_{i=1}^{n_{k+1}-1}V_{n_k,i}\Bigr)^2\Bigr\} \le \sum_{k=1}^{\infty}\frac{c}{n_{k+1}-1}\mathbf{E}\bigl|Y^{[n_{k+1}-1]}\bigr|^2 < \infty, \]

which implies

\[ \sum_{i=1}^{n_{k+1}-1}V_{n_k,i} \to 0 \quad(k\to\infty)\quad\text{a.s.} \qquad (23.16) \]

Now (23.16), (23.14), and (23.15) yield (23.13).


In the third step we will prove the assertion (23.9). Because of

\[ \sum_{i=1}^{\infty}\mathbf{P}\{Y_i\neq Y_i^*\} = \sum_{i=1}^{\infty}\mathbf{P}\{Y>i\} \le \mathbf{E}Y < \infty \]

one has with probability one Y_i = Y_i^* from a random index on. Then, because of (23.13), it suffices to show that, for each fixed l,

\[ \int\frac{K_n(x,X_l)}{1+\sum_{j\in\{1,\ldots,n\}\setminus\{l\}}K_n(x,X_j)}\,\mu(dx) \to 0 \quad\text{a.s.} \]

But this follows from

\[ \sum_{n=1}^{\infty}\mathbf{E}\Bigl|\int\frac{K_n(x,X_l)}{1+\sum_{j\in\{1,\ldots,n\}\setminus\{l\}}K_n(x,X_j)}\,\mu(dx)\Bigr|^2 \le 2\rho^2\sum_{n=1}^{\infty}\frac{1}{n^2} \quad\text{(see the above upper bound for B)} < \infty. \qquad \Box \]

Proof of Theorem 23.4. The assertion holds when L > 0 exists with |Y| ≤ L (see Theorem 23.1). According to Lemma 23.3 it suffices to show that, for some constant c > 0,

\[ \limsup_{n\to\infty}\int\frac{\sum_{i=1}^n|Y_i|\cdot I_{A_n(x)}(X_i)}{\sum_{i=1}^n I_{A_n(x)}(X_i)}\,\mu(dx) \le c\cdot\mathbf{E}|Y| \qquad (23.17) \]

for every distribution of (X, Y) with E|Y| < ∞.

W.l.o.g. we can assume Y ≥ 0 with probability one. Notice that the covering assumption (23.8) is fulfilled with

\[ K_n(x,z) = I_{\{z\in A_n(x)\}} \]

and ρ = 1 (cf. (23.1)).

If P_{n−1} ≠ P_n at most for indices n = n_1, n_2, . . ., then (23.17) immediately follows from Lemma 23.5.

If {P_n} is nested with A_n(x) ∈ {A_{n_k}(x), A_{n_{k+1}}(x)} for n ∈ {n_k, n_k + 1, . . . , n_{k+1}} (k = 1, 2, . . . , x ∈ R^d), then let m_n^* and m_n^{**} be the sequences of estimates based on the sequences of partitions {P_n^*} and {P_n^{**}}, respectively, where

\[ \mathcal{P}_n^* = \mathcal{P}_{n_k} \quad\text{for } n_k\le n<n_{k+1} \]

and

\[ \mathcal{P}_n^{**} = \mathcal{P}_{n_{k+1}} \quad\text{for } n_k<n\le n_{k+1}. \]

Then m_n(x) ∈ {m_n^*(x), m_n^{**}(x)} for any x ∈ R^d, n > n_1, which implies

\[ m_n(x) \le m_n^*(x)+m_n^{**}(x). \]

By Lemma 23.5, (23.17) is fulfilled for (m_n^*)_{n≥n_1} and (m_n^{**})_{n≥n_1} and thus for m_n as well. □

23.2 Kernel Estimates

For the strong consistency of kernel estimates, we consider a rather general class of kernels.

Definition 23.1. The kernel K is called regular if it is nonnegative, and if there is a ball S_{0,r} of radius r > 0 centered at the origin, and a constant b > 0, such that

\[ 1 \ge K(x) \ge b\,I_{\{x\in S_{0,r}\}} \]

and

\[ \int\sup_{u\in x+S_{0,r}}K(u)\,dx < \infty. \qquad (23.18) \]

Theorem 23.5. Let m_n be the kernel estimate of the regression function m with a regular kernel K. Assume that there is an L < ∞ such that P{|Y| ≤ L} = 1. If h_n → 0 and nh_n^d → ∞, then the kernel estimate is strongly consistent.

For the proof we will need the following four lemmas. Put Kh(x) =K(x/h).

Lemma 23.6. (Covering Lemma). If the kernel is regular then thereexists a finite constant ρ = ρ(K) only depending upon K such that, for anyu ∈ Rd, h > 0, and probability measure µ,

∫Kh(x− u)∫

Kh(x− z)µ(dz)µ(dx) ≤ ρ.

supx−r<y<x+r

K(y)

K(x) = e−x2

Figure 23.2. Regular kernel.

Page 497: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

480 23. Strong Consistency of Local Averaging Estimates

Figure 23.3. An example for a bounded overlap cover of R2.

Moreover, for any δ > 0

limh→0

supu

∫Kh(x− u)I‖x−u‖>δ∫

Kh(x− z)µ(dz)µ(dx) = 0.

Proof. First take a bounded overlap cover of Rd with translates of S0,r/2,where r > 0 is the constant appearing in the definition of a regular kernel.

This cover has an infinite number of member balls, but every x getscovered at most k1 times, where k1 depends upon d only. The centers ofthe balls are called xi (i = 1, 2, . . .). The integral condition (23.18) on Kimplies that

∞∑

i=1

supz∈xi+S0,r/2

K(z)

=∞∑

i=1

1∫S0,r/2

dx

x∈Sxi,r/2

supz∈xi+S0,r/2

K(z) dx

≤ 1∫S0,r/2

dx

∫ ∞∑

i=1

Ix∈Sxi,r/2 supz∈x+S0,r

K(z) dx

(because x ∈ Sxi,r/2 implies xi + S0,r/2 ⊆ x+ S0,r cf. Figure 5.8)

≤ k1∫S0,r/2

dx

∫sup

z∈x+S0,r

K(z) dx ≤ k2

for another finite constant k2. Furthermore,

Kh(x− u) ≤∞∑

i=1

supx∈u+hxi+S0,rh/2

Kh(x− u),

Page 498: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

23.2. Kernel Estimates 481

and, for x ∈ u+ hxi + S0,rh/2,∫Kh(x− z)µ(dz) ≥ bµ(x+ S0,rh) ≥ bµ(u+ hxi + S0,rh/2),

from which we conclude∫

Kh(x− u)∫Kh(x− z)µ(dz)

µ(dx)

≤∞∑

i=1

x∈u+hxi+S0,rh/2

Kh(x− u)∫Kh(x− z)µ(dz)

µ(dx)

≤∞∑

i=1

x∈u+hxi+S0,rh/2

supz∈hxi+S0,rh/2Kh(z)

bµ(u+ hxi + S0,rh/2)µ(dx)

=∞∑

i=1

µ(u+ hxi + S0,rh/2) supz∈hxi+S0,rh/2Kh(z)

bµ(u+ hxi + S0,rh/2)

≤ 1b

∞∑

i=1

supz∈xi+S0,r/2

K(z) ≤ k2

b,

where k2 depends on K and d only. To obtain the second statement in thelemma, substitute Kh(z) above by Kh(z)I‖z‖>δ and notice that∫Kh(x− u)I‖x−u‖>δ∫

Kh(x− z)µ(dz)µ(dx) ≤ 1

b

∞∑

i=1

supz∈xi+S0,r/2

K(z)I‖z‖>δ/h → 0

as h → 0 by dominated convergence.

Lemma 23.7. Let 0 < h ≤ R < ∞, and let S ⊂ Rd be a ball of radius R.Then, for every probability measure µ,

S

1õ(Sx,h)

µ(dx) ≤(

1 +R

h

)d/2

cd,

where cd depends upon the dimension d only.

The proof of Lemma 23.7 is left to the reader (cf. Problem 23.3).Define

m∗n(x) =

∑ni=1 YiKhn

(x−Xi)nEKhn(x−X)

.

Lemma 23.8. Under the conditions of Theorem 23.5,

limn→∞

∫E|m(x) −m∗

n(x)|µ(dx) = 0.

Proof. By the triangle inequality∫

E|m(x) −m∗n(x)|µ(dx)

Page 499: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

482 23. Strong Consistency of Local Averaging Estimates

≤∫

|m(x) − Em∗n(x)|µ(dx) +

∫E|m∗

n(x) − Em∗n(x)|µ(dx).

Concerning the first term on the right-hand side verify the conditions ofTheorem 23.2 for

Kn(x, z) =K(x−z

hn)

∫K(x−u

hn)µ(du)

.

Part (i) follows from the covering lemma with c = ρ. Parts (ii) and (iv) areobvious. For (iii) note that again by the covering lemma,

∫ ∫Kn(x, z)I‖x−z‖>aµ(dz)µ(dx)

=∫ ∫

K(x−zhn

)I‖x−z‖>aµ(dz)∫K(x−u

hn)µ(du)

µ(dx) → 0.

For the second term we have, with h = hn,

E |m∗n(x) − Em∗

n(x)|≤

√E |m∗

n(x) − Em∗n(x)|2

=

√√√√√

E(∑n

j=1(YjKh(x−Xj) − EY Kh(x−X)))2

n2(EKh(x−X))2

=

√√√√E

(Y Kh(x−X) − EY Kh(x−X))2

n(EKh(x−X))2

√√√√E

(Y Kh(x−X))2

n(EKh(x−X))2

≤ L

√√√√E

(Kh(x−X))2

n(EKh(x−X))2

≤ L√Kmax

√EK((x−X)/h)

n(EK((x−X)/h))2

≤ L

√Kmax

b

√1

nµ(Sx,h),

where we used the Cauchy-Schwarz inequality. Next we use the inequalityabove to show that the integral converges to zero. Divide the integral overRd into two terms, namely, an integral over a large ball S centered at theorigin, of radius R > 0, and an integral over Sc. For the integral outside

Page 500: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

23.2. Kernel Estimates 483

the ball we have∫

Sc

E |Em∗n(x) −m∗

n(x)|µ(dx) ≤ 2∫

Sc

E |m∗n(x)|µ(dx) ≤ 2Lµ(Sc),

which can be small by the choice of the ball S. To bound the integral overS we employ Lemma 23.7:∫

S

E |Em∗n(x) −m∗

n(x)|µ(dx) ≤ L

√Kmax

b

1√n

S

1õ(Sx,h)

µ(dx)

(by the inequality obtained above)

≤ L

√Kmax

b

1√n

(1 +

R

h

)d/2

cd

→ 0 (by assumption nhd → ∞).

Therefore,

E∫

|m(x) −m∗n(x)|µ(dx)

→ 0.

Lemma 23.9. For n large enough

P∫

|m(x) −m∗n(x)|µ(dx) > ε

≤ e−nε2/(8Lρ)2).

Proof. We use a decomposition, as in the proof of strong consistency ofthe partitioning estimate,

∫|m(x) −m∗

n(x)|µ(dx)

=∫

E|m(x) −m∗n(x)|µ(dx)

+∫ (

|m(x) −m∗n(x)| − E|m(x) −m∗

n(x)|)µ(dx).

(23.19)

The first term on the right-hand side tends to 0 by Lemma 23.8. It remainsto show that the second term on the right-hand side of (23.19) is small withlarge probability. To do this, we use McDiarmid’s inequality (Theorem A.2)for

∫|m(x) −m∗

n(x)|µ(dx) − E∫

|m(x) −m∗n(x)|µ(dx)

.

Fix the training data at ((x1, y1), . . . , (xn, yn)) and replace the ith pair(xi, yi) by (xi, yi), changing the value of m∗

n(x) to m∗ni(x). Clearly, by the

Page 501: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

484 23. Strong Consistency of Local Averaging Estimates

covering lemma (Lemma 23.6),∫

|m(x) −m∗n(x)|µ(dx) −

∫|m(x) −m∗

ni(x)|µ(dx)

≤∫

|m∗n(x) −m∗

ni(x)|µ(dx)

≤ supy∈Rd

∫2LKh(x− y)nEKh(x−X)

µ(dx)

≤ 2Lρn.

So by Theorem A.2, for n large enough,

P∫

|m(x) −m∗n(x)|µ(dx) > ε

≤ P∫

|m(x) −m∗n(x)|µ(dx) − E

∫|m(x) −m∗

n(x)|µ(dx)>ε

2

≤ e−nε2/(8L2ρ2).

The proof is now completed.

Proof of Theorem 23.5. As in the proof of Theorem 23.1, it sufficesto show that

∫|mn(x) −m(x)|µ(dx) → 0 a.s.

Obviously,∫

|mn(x) −m(x)|µ(dx)

≤∫

|mn(x) −m∗n(x)|µ(dx) +

∫|m∗

n(x) −m(x)|µ(dx)

and, according to Lemma 23.9 ,∫

|m∗n(x) −m(x)|µ(dx) → 0

with probability one. On the other hand,

|m∗n(x) −mn(x)|

=∣∣∣∣

∑ni=1 YiKhn(x−Xi)nEKhn(x−X)

−∑n

i=1 YiKhn(x−Xi)∑ni=1Khn(x−Xi)

∣∣∣∣

=

∣∣∣∣∣

n∑

i=1

YiKhn(x−Xi)

∣∣∣∣∣

∣∣∣∣

1nEKhn(x−X)

− 1∑n

i=1Khn(x−Xi)

∣∣∣∣

Page 502: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

23.2. Kernel Estimates 485

≤ L

∣∣∣∣∣

n∑

i=1

Khn(x−Xi)

∣∣∣∣∣

∣∣∣∣

1nEKhn

(x−X)− 1

∑ni=1Khn(x−Xi)

∣∣∣∣

= L |M∗n(x) − 1| ,

where M∗n(x) is the special form of m∗

n(x) for Y ≡ 1. Therefore,∫

|mn(x) −m∗n(x)|µ(dx) ≤ L

∫|M∗

n(x) − 1|µ(dx) → 0 a.s.,

which completes the proof.

The following theorem concerns the strong universal consistency of thekernel estimate with naive kernel and special sequences of bandwidths. Asto more general kernels we refer to Walk (2002c).

Theorem 23.6. Let K = IS and let hn satisfy

hn−1 = hn at most for the indices n = n1, n2, . . . ,

where nk+1 ≥ Dnk for fixed D > 1 and

hn → 0, nhdn → ∞,

e.g., hn = ce−γq log n/q with c > 0, 0 < γd < 1, q > 0. Then mn isstrongly universally consistent.

Proof. We argue as in the proof of Theorem 23.4. The assertion holdswhen L > 0 exists with |Y | ≤ L (cf. Theorem 23.5). According to Lemma23.3 it suffices to show that, for some constant c > 0,

lim supn→∞

∫ ∑ni=1 |Yi|Khn(x−Xi)∑n

i=1Khn(x−Xi)µ(dx) ≤ cE|Y | a.s.

for every distribution of (X,Y ) with E|Y | < ∞. But this follows fromLemma 23.5.

In Theorem 23.6 the statistician does not change the bandwidth at eachchange of n, as is done in the usual choice h∗

n = cnγ , c > 0, 0 < γd < 1. Butif the example choice in Theorem 23.6 is written in the form hn = cnn

−γ ,one has |cn − c| ≤ c(eγ/q − 1) such that hn and h∗

n are of the same order,and even the factor c in h∗

n can be arbitrarily well-approximated by use ofa sufficiently large q in the definition of hn. This is important in view ofthe rate of convergence under regularity assumptions.

The modification

m′n(x) =

∑ni=1 YiKhn(x−Xi)

maxδ,∑ni=1Khn(x−Xi)

of the classical kernel estimate, fixed δ > 0 (see Spiegelman and Sacks(1980)), which for 1 ≥ δ > 0 coincides with it for naive kernel, yields conti-nuity of m′

n(·) if the kernel K is continuous. For K sufficiently smooth (e.g.,

Page 503: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

486 23. Strong Consistency of Local Averaging Estimates

Gaussian kernelK(x) = e−‖x‖2or quartic kernelK(x) = (1−‖x‖2)2I‖x‖≤1)

Walk (2002c) showed strong universal consistency in the case hn = n−γ

(0 < γd < 1). Here, for Gaussian kernel K, the proof of Theorem 23.6 canbe modified majorizing K in the denominator by K(·/p), 1 < p <

√2, and

noticing regularity of the kernel K(·/p)2/K(·). For a general smooth kernelone first shows strong consistency of (m′

1 + · · ·+m′n)/n by martingale the-

ory (cf. Walk (2002a)) and then strong consistency of m′n by a Tauberian

argument of summability theory.

23.3 k-NN Estimates

As in Chapter 6, we shall assume that ties occur with probability 0.

Theorem 23.7. Assume that P|Y | ≤ L = 1 for some L < ∞ andthat for each x the random variable ‖X − x‖ is absolutely continuous. Ifkn → ∞ and kn/n → 0, then the kn-NN regression function estimate isstrongly consistent.

Proof. We show that, for sufficiently large n,

P∫

|m(x) −mn(x)|µ(dx) > ε

≤ 4e−nε2/(18L2γ2

d),

where γd has been defined in Chapter 6. Define ρn(x) as the solution of theequation

kn

n= µ(Sx,ρn(x)).

Note that the condition that for each x the random variable ‖X − x‖ isabsolutely continuous implies that the solution always exists. (This is theonly point in the proof where we use this assumption.) Also define

m∗n(x) =

1kn

n∑

j=1

YjI‖Xj−x‖<ρn(x).

The basis of the proof is the following decomposition:

|m(x) −mn(x)| ≤ |mn(x) −m∗n(x)| + |m∗

n(x) −m(x)|.For the first term on the right-hand side observe that, denoting Rn(x) =‖X(kn,n)(x) − x‖,

|m∗n(x) −mn(x)| =

1kn

∣∣∣∣∣∣

n∑

j=1

YjIXj∈Sx,ρn(x) −n∑

j=1

YjIXj∈Sx,Rn(x)

∣∣∣∣∣∣

≤ L

kn

n∑

j=1

∣∣∣IXj∈Sx,ρn(x) − IXj∈Sx,Rn(x)

∣∣∣ .

Page 504: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

23.3. k-NN Estimates 487

By considering the cases ρn(x) ≤ Rn(x) and ρn(x) > Rn(x) one gets thatIXj∈Sx,ρn(x) −IXj∈Sx,Rn(x) have the same sign for each j. It follows that

|m∗n(x) −mn(x)| ≤ L

∣∣∣∣∣∣

1kn

n∑

j=1

IXj∈Sx,ρn(x) − 1

∣∣∣∣∣∣= L|M∗

n(x) −M(x)|,

whereM∗n is defined asm∗

n with Y replaced by the constant random variableY = 1, and M ≡ 1 is the corresponding regression function. Thus,

|m(x) −mn(x)| ≤ L|M∗n(x) −M(x)| + |m∗

n(x) −m(x)|. (23.20)

First we show that the expected values of the integrals of both terms onthe right-hand side converge to zero. Then we use McDiarmid’s inequalityto prove that both terms are very close to their expected values with largeprobability. For the expected value of the first term on the right-hand sideof (23.20), using the Cauchy-Schwarz inequality, we have

LE∫

|M∗n(x) −M(x)|µ(dx) ≤ L

∫ √E |M∗

n(x) −M(x)|2µ(dx)

= L

∫ √E |M∗

n(x) − EM∗n(x)|2µ(dx)

= L

∫ √1k2

n

nVarIX∈Sx,ρn(x)µ(dx)

≤ L

∫ √1k2

n

nµ(Sx,ρn(x))µ(dx)

= L

∫ √n

k2n

kn

nµ(dx)

=L√kn

,

which converges to zero. For the expected value of the second term on theright-hand side of (23.20), note that Theorem 6.1 implies that

limn→∞

E∫

|m(x) −mn(x)|µ(dx) = 0.

Therefore,

E∫

|m∗n(x) −m(x)|µ(dx)

≤ E∫

|m∗n(x) −mn(x)|µ(dx) + E

∫|m(x) −mn(x)|µ(dx)

≤ LE∫

|M∗n(x) −Mn(x)|µ(dx) + E

∫|m(x) −mn(x)|µ(dx) → 0.

Page 505: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

488 23. Strong Consistency of Local Averaging Estimates

Assume now that n is so large that

LE∫

|M∗n(x) −M(x)|µ(dx) + E

∫|m∗

n(x) −m(x)|µ(dx) <ε

3.

Then, by (23.20), we have

P∫

|m(x) −mn(x)|µ(dx) > ε

(23.21)

≤ P∫

|m∗n(x) −m(x)|µ(dx) − E

∫|m∗

n(x) −m(x)|µ(dx) >ε

3

+ PL

∫|M∗

n(x) −M(x)|µ(dx) − EL∫

|M∗n(x) −M(x)|µ(dx) >

ε

3

.

Next we get an exponential bound for the first probability on the right-handside of (23.21) by McDiarmid’s inequality (Theorem A.2). Fix an arbitraryrealization of the data Dn = (x1, y1), . . . , (xn, yn), and replace (xi, yi) by(xi, yi), changing the value of m∗

n(x) to m∗ni(x). Then

∣∣∣∣

∫|m∗

n(x) −m(x)|µ(dx) −∫

|m∗ni(x) −m(x)|µ(dx)

∣∣∣∣

≤∫

|m∗n(x) −m∗

ni(x)|µ(dx).

But |m∗n(x) −m∗

ni(x)| is bounded by 2L/kn and can differ from zero onlyif ‖x− xi‖ < ρn(x) or ‖x− xi‖ < ρn(x). Observe that ‖x− xi‖ < ρn(x) or‖x−xi‖ < ρn(x) if and only if µ(Sx,‖x−xi‖) < kn/n or µ(Sx,‖x−xi‖) < kn/n.But the measure of such x’s is bounded by 2 · γdkn/n by Lemma 6.2.Therefore,

supx1,y1,...,xn,yn,xi,yi

∫|m∗

n(x) −m∗ni(x)|µ(dx) ≤ 2L

kn

2 · γdkn

n=

4Lγd

n

and, by Theorem A.2,

P∣∣∣∣

∫|m(x) −m∗

n(x)|µ(dx) − E∫

|m(x) −m∗n(x)|µ(dx)

∣∣∣∣ >

ε

3

≤ 2e−nε2/(72L2γ2d).

Finally, we need a bound for the second term on the right-hand sideof (23.21). This probability may be bounded by McDiarmid’s inequalityexactly in the same way as for the first term, obtaining

P∣∣∣∣L

∫|M∗

n(x) −M(x)|µ(dx) − EL∫

|M∗n(x) −M(x)|µ(dx)

∣∣∣∣ >

ε

3

≤ 2e−nε2/(72L2γ2d),

and the proof is completed.

Page 506: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

23.3. k-NN Estimates 489

Theorem 23.8. Assume that for each x the random variable ‖X − x‖ isabsolutely continuous. If kn/ log n → ∞ and kn/n → 0 then the kn-nn

regression function estimate is strongly universally consistent.

Before we prove Theorem 23.8 we will formulate and prove two lemmas.Let Ai be the collection of all x that are such that Xi is one of its

kn nearest neighbors of x in X1, . . . , Xn. Here, we use some geometricarguments similar to those in the proof of Lemma 6.2. Similarly, let usdefine cones x+C1, . . . , x+Cγd

, where x defines the top of the cones andthe union of C1, . . . , Cγd

covers Rd. Thenγd⋃

j=1

x+ Cj = Rd

regardless of how x is picked. According to the cone property, if u, u′ ∈x+Cj , and ‖x− u‖ < ‖x− u′‖, then ‖u− u′‖ < ‖x− u′‖. Furthermore, if‖x− u‖ ≤ ‖x− u′‖, then ‖u− u′‖ ≤ ‖x− u′‖. In the space Rd, define thesets

Ci,j = Xi + Cj (1 ≤ i ≤ n, 1 ≤ j ≤ γd).

Let Bi,j be the subset of Ci,j consisting of all x ∈ Ci,j that are among thekn nearest neighbors of Xi in the set

X1, . . . , Xi−1, Xi+1, . . . , Xn, x ∩ Ci,j .

(If Ci,j contains fewer than kn − 1 of the Xl points i = l, then Bi,j = Ci,j .)Equivalently,Bi,j is the subset of Ci,j consisting of all x that are closer toXi

than the knth nearest neighbor ofXi in X1, . . . , Xi−1, Xi+1, . . . , Xn∩Ci,j .

Lemma 23.10. Assume that for each x the random variable ‖X − x‖ isabsolutely continuous. Let 1 ≤ i ≤ n. If x ∈ Ai, then x ∈ ⋃d

j=1Bi,j, andthus

µ(Ai) ≤γd∑

j=1

µ(Bi,j).

Proof. To prove this claim, take x ∈ Ai. Then locate a j for which x ∈ Ci,j .We have to show that x ∈ Bi,j to conclude the proof. Thus, we need toshow that x is one of the kn nearest neighbors of Xi in the set

X1, . . . , Xi−1, Xi+1, . . . , Xn, x ∩ Ci,j .

Take Xl ∈ Ci,j . If ‖Xl − Xi‖ < ‖x − Xi‖, we recall that by the propertyof our cones that ‖x −Xl‖ < ‖x −Xi‖, and thus Xl is one of the kn − 1nearest neighbors of x in X1, . . . , Xn because of x ∈ Ai. This shows thatin Ci,j there are at most kn − 1 points Xl closer to Xi than x. Thus x isone of the kn nearest neighbors of Xi in the set

X1 . . . , Xi−1, Xi+1, . . . , Xn, x ∩ Ci,j .

Page 507: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

490 23. Strong Consistency of Local Averaging Estimates

This concludes the proof of the claim.

Lemma 23.11. If kn/ log(n) → ∞ and kn/n → 0 then, for every j ∈1, . . . , γd,

lim supn→∞

n

knmax

1≤i≤nµ(Bi,j) ≤ 2 a.s.

Proof. We prove that, for every j,

∞∑

n=1

Pn

knmax

1≤i≤nµ(Bi,j) > 2

< ∞.

In order to do this we give a bound for

Pµ(Bi,j) > p|Xifor 0 < p < 1. If µ(Ci,j) ≤ p then since Bi,j ⊆ Ci,j , we have Pµ(Bi,j) >p|Xi = 0, therefore we assume that µ(Ci,j) > p. Fix Xi. Define

Gi,p = Ci,j ∩ SXi,Rn(Xi),

where Rn(Xi) > 0 is chosen such that µ(Gi,p) = p. Observe that eitherBi,j ⊇ Gi,p or Bi,j ⊆ Gi,p, therefore we have the following dual relationship:

Pµ(Bi,j) > p|Xi= Pµ(Bi,j) > µ(Gi,p)|Xi= PBi,j ⊃ Gi,p|Xi= PGi,p captures < kn of the points Xl ∈ Ci,j , l = i|Xi .

The number of points Xl (l = i) captured by Gi,p given Xi is binomiallydistributed with parameters (n−1, p), so by Lemma A.1, with p = 2kn/(n−1) and ε = p/2 = kn/(n− 1), we have that

P

max1≤i≤n

µ(Bi,j) > p

≤ nPµ(B1,j) > p= nEPµ(B1,j) > p|X1= nEPG1,p captures < kn of the points Xl ∈ C1,j , l = 1|X1≤ ne−(n−1)[p−ε+ε log(ε/p)]

= ne−2kn+kn+kn log 2

≤ ne−kn(1−log 2),

which is summable because of kn/ log n → ∞.

Page 508: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

23.4. Bibliographic Notes 491

Proof of Theorem 23.8. By Lemma 23.3 and Theorem 23.7 it isenough to prove that there is a constant c > 0,

lim supn→∞

n∑

i=1

∫Wni(x)µ(dx)Y 2

i ≤ cEY 2 a.s.

Observe thatn∑

i=1

∫Wni(x)µ(dx)Y 2

i =1kn

n∑

i=1

Y 2i µ(Ai) ≤

(n

knmax

iµ(Ai)

)1n

n∑

i=1

Y 2i .

If we can show that

lim supn→∞

n

knmax

iµ(Ai) ≤ c a.s. (23.22)

for some constant c > 0, then by the law of large numbers

lim supn→∞

(n

knmax

iµ(Ai)

)1n

n∑

i=1

Y 2i ≤ lim sup

n→∞c1n

n∑

i=1

Y 2i = cEY 2 a.s.,

so we have to prove (23.22). But by Lemma 23.10,

µ(Ai) ≤γd∑

j=1

µ(Bi,j),

therefore, Lemma 23.11 implies that (23.22) is satisfied with c = 2γd, sothe proof of the theorem is completed.

23.4 Bibliographic Notes

Devroye and Gyorfi (1983) proved Theorem 23.1. Theorem 23.3 is due toGyorfi (1991). Theorems 23.4 and 23.6 have been shown in Walk (2002a).Theorem 23.5 has been proved by Devroye and Krzyzak (1989). Lemma23.7 is from Devroye, Gyorfi, and Lugosi (1996). Theorems 23.7 and 23.8are due to Devroye et al. (1994). Strong consistency of the partitioningestimate with a cross-validated choice of partitions and also of the kernelestimate with a cross-validated bandwidth for bounded Y are in Kohler,Krzyzak, and Walk (2002).

Problems and Exercises

Problem 23.1. Prove (23.11) and (23.12).Hint: Proceed as in the proof of Lemma 4.1.

Problem 23.2. Show the bounds for B and C in the proof of Lemma 23.5.

Page 509: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

492 23. Strong Consistency of Local Averaging Estimates

Problem 23.3. Prove Lemma 23.7.Hint: Apply the covering in part (iv) of the proof of Theorem 5.1.

Page 510: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

24Semirecursive Estimates

24.1 A General Result

For a sequence of measurable real-valued functions Kn(x, u) (n = 1, 2, ...),x, u ∈ Rd, we consider estimates of the form

mn(x) =∑n

i=1 YiKi(x,Xi)∑ni=1Ki(x,Xi)

, (24.1)

(0/0 is 0 by definition). We call such regression function estimates semi-recursive since both the numerator and the denominator can be calculatedrecursively. Thus estimates can be updated sequentially when new ob-servations became available. At the nth stage one has to store onlythe numerator and the denominator, not the whole set of observations(X1, Y1), . . . , (Xn, Yn). Another simple interpretation of the estimate is thatthe denominator and the estimate are stored: put

m1(x) = Y1,

f1(x) = K1(x,X1),

fn+1(x) = fn(x) +Kn+1(x,Xn+1),

mn+1(x) =(

1 − Kn+1(x,Xn+1)fn+1(x)

)mn(x) +

Yn+1Kn+1(x,Xn+1)fn+1(x)

,

if f1(x) = 0.

Page 511: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

494 24. Semirecursive Estimates

Consider estimates of the form (24.1) with general Kn(x, z) satisfying

0 ≤ Kn(x, z) ≤ Kmax for all x, z ∈ Rd (24.2)

for some Kmax ∈ R, Kmax ≥ 1.As a byproduct we also consider pointwise consistency such that first

we give sufficient conditions of strong pointwise consistency (Theorem24.1), and then prove weak universal consistency (Lemma 24.1), and stronguniversal consistency (Lemma 24.2).

To simplify the notation we use the abbreviation mod µ to indicate thata relation holds for µ-almost all x ∈ Rd. The Euclidean norm of a vectorx ∈ Rd is denoted by ‖x‖ and the Lebesgue measure by λ.

Theorem 24.1. Assume that, for every distribution µ of X,∫Kn(x, z)f(z)µ(dz)∫Kn(x, z)µ(dz)

→ f(x) mod µ (24.3)

for all square µ-integrable functions f on Rd and∞∑

n=1

EKn(x,Xn) = ∞ mod µ. (24.4)

Then

mn(x) → m(x) a.s.

for µ-almost all x for all distributions of (X,Y ) with E|Y |2 < ∞.

Proof. We have to show that∑n

i=1Ki(x,Xi)Yi∑ni=1Ki(x,Xi)

→ m(x) a.s. mod µ. (24.5)

This follows from∑n

i=1 EKi(x,Xi)Yi∑ni=1 EKi(x,Xi)

=∑n

i=1

∫Ki(x, z)m(z)µ(dz)

∑ni=1

∫Ki(x, z)µ(dz)

→ m(x) mod µ (24.6)

and∑n

i=1 (Ki(x,Xi)Yi − EKi(x,Xi)Yi)∑ni=1 EKi(x,Xi)

→ 0 a.s. mod µ (24.7)

and∑n

i=1Ki(x,Xi)∑ni=1 EKi(x,Xi)

→ 1 a.s. mod µ. (24.8)

Convergence (24.6) is a consequence of the Toeplitz lemma (see ProblemA.5), (24.3), and (24.4). In order to prove (24.7) we set

Vi = Ki(x,Xi)Yi − EKi(x,Xi)Yi

Page 512: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

24.1. A General Result 495

and

bn =n∑

i=1

EKi(x,Xi).

Put M(x) = EY 2|X = x. Then

EV 2n ≤ EKn(x,Xn)2Y 2

n ≤ Kmax · EKn(x,Xn)M(Xn)EKn(x,Xn)

· EKn(x,Xn),

and (24.3) implies

n

EV 2n

b2n≤ d(x) ·

n

EKn(x,Xn)∑n

i=1 EKi(x,Xi)2

for some d(x) < ∞ mod µ. By the Abel-Dini theorem (see Problem 24.1)one gets

n

EKn(x,Xn)∑n

i=1 EKi(x,Xi)2 < ∞.

Therefore∑

n

EV 2n

b2n< ∞ mod µ,

and Theorem A.6 implies (24.7). Further (24.7) for Yi = 1 implies (24.8).

Put

mn(x) =EKn(x,X)YEKn(x,X)

=∫Kn(x, z)m(z)µ(dz)∫Kn(x, z)µ(dz)

.

Lemma 24.1. Assume the conditions of Theorem 24.1. If, in addition, aconstant c > 0 exists such that

supn

E∫ (∑n

i=1(Yi −mi(x))Ki(x,Xi)∑ni=1Ki(x,Xi)

)2

µ(dx) ≤ cEY 2 (24.9)

and∫

supn

|mn(x)|2µ(dx) ≤ cEY 2, (24.10)

for every distribution of (X,Y ) with square integrability of Y , then thesequence mn is weakly universally consistent.

Proof. Theorem 24.1 implies that

∑ni=1 YiKi(x,Xi)∑ni=1Ki(x,Xi)

→ m(x) a.s. mod µ (24.11)

Page 513: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

496 24. Semirecursive Estimates

especially for bounded Y . From (24.9) and (24.10) one obtains

E∫ (∑n

i=1 YiKi(x,Xi)∑ni=1Ki(x,Xi)

)2

µ(dx) ≤ cEY 2

for each square integrable Y and n ∈ N . From this result and relation(24.11) for bounded Y , which by Lebesgue’s dominated convergence the-orem yields weak (and strong) consistency in the boundedness case, oneobtains the assertion in the general case by a truncation argument for theYi’s (see the proof of Lemma 23.3).

Lemma 24.2. (a) Assume the conditions of Lemma 24.1. If, in addition,a constant c > 0 exists such that

∫ ∞∑

n=1

EY 2

nKn(x,Xn)2

(∑n

i=1Ki(x,Xi))2µ(dx) ≤ cEY 2, (24.12)

and∫ ∞∑

n=1

Emn(x)2Kn(x,Xn)2

(∑n

i=1Ki(x,Xi))2 µ(dx) ≤ cEY 2 (24.13)

for every distribution of (X,Y ) with square integrability of Y , and

K1(x, z) = 1 for all x, z ∈ Rd (24.14)

or

Kn(x, z) ∈ 0 ∪ [α, β] for all x, z ∈ Rd (24.15)

for some 0 < α < β < ∞, then the sequence mn is strongly universallyconsistent.(b) Conditions (24.13) and (24.12) imply (24.9).

Proof. See Problem 24.2.

24.2 Semirecursive Kernel Estimate

The semirecursive kernel estimate is defined according to (24.1) by a kernelK : Rd → R+ and a sequence of bandwidths hn > 0 via

Kn(x, u) = K

(x− u

hn

). (24.16)

Theorem 24.2. Either let (24.16) hold for n ≥ 2 and let K1(x, u) =K(0), x, u ∈ Rd, with symmetric Lebesgue-integrable kernel K : Rd → R+satisfying

αH(‖x‖) ≤ K(x) ≤ βH(‖x‖), x ∈ Rd, (24.17)

Page 514: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

24.2. Semirecursive Kernel Estimate 497

for some 0 < α < β < ∞ and nonincreasing H : R+ → R+ with H(+0) >0, or let (24.16) hold for n ≥ 1 with K : Rd → R+ satisfying

αIS0,R≤ K ≤ βIS0,R

(24.18)

for some 0 < α < β < ∞, 0 < R < ∞. Assume further

hn ↓ 0 (n → ∞),∑

hdn = ∞ . (24.19)

Then the semirecursive kernel estimate is weakly and strongly universallyconsistent.

In both cases (trivially in the second case with H = I[0,R]) theassumptions in Theorem 24.2 imply that

K ≥ bIS0,Rfor some b > 0, 0 < R < ∞ ,

and

rdH(r) → 0 (r → ∞) .

The next covering lemma using balls plays an important role in the proofof Lemma 24.4.

Lemma 24.3. Let A be a bounded set in Rd, k ∈ N , 0 < r1 < · · · <rk < ∞. For each x ∈ A let S(x) be a closed ball centered at x with radiusr(x) ∈ r1, . . . , rk. Then there exists an m ∈ N depending only on d, butnot on A or k with the following property: there exists a finite number ofpoints x1, . . . , xl ∈ A such that A ⊆ S(x1) ∪ · · · ∪ S(xl) and each v ∈ Rd

belongs to at most m of the sets S(x1), . . . , S(xl).

Proof. Choose x1 such that r(x1) is the largest possible. Then choosex2 ∈ A−S(x1) such that r(x2) is the largest possible, and choose x3 ∈ A−(S(x1) ∪ S(x2)) such that r(x3) is the largest possible, etc. The procedureterminates with the choice of xl for some l, because ||xi − xj || > r1 (i = j)and, due to the boundedness of A, where || · || is the Euclidean norm. ThusS(x1), . . . , S(xl) cover A.

For each v ∈ Rd there exists a finite number m (depending only on d,but not on v) of congruent cones C1, . . . , Cm with vertex v covering Rd,such that two arbitrary rays in a cone starting at v form an angle less thanor equal to π/4.

Then for all x, y in a cone with ||x − v|| ≤ r, ||y − x|| > r for somer ∈ (0,∞) one has ||y − v|| > r.

Consequently, there do not exist two points xi, xj in the same coneCn (n ∈ 1, . . . ,m) with v ∈ S(xi), v ∈ S(xj), because ||xi − xj || >maxr(xi), r(xj) = r(xi) (the latter w.l.o.g.), ||xi − v|| ≤ r(xi) imply||xj − v|| > r(xi) in contrast to ||xj − v|| ≤ r(xj). Therefore the number ofxi’s with v ∈ S(xi) is at most m. This completes the proof.

Page 515: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

498 24. Semirecursive Estimates

In the following let µ, ν be measures on Bd assigning finite values tobounded sets. It is assumed that in the expressions

suph>0

ν(Sx,h)µ(Sx,h)

,

lim suph→0

ν(Sx,h)µ(Sx,h)

,

(x ∈ Rd) h assumes countably many positive values thus making theexpressions measurable functions of x. Here 0/0 = 0.

Lemma 24.4. There is a constant c depending on d such that

µ

x ∈ Rd : sup

h>0

ν(Sx,h)µ(Sx,h)

> α

≤ c

αν(Rd)

for any α > 0.

Proof. Let H = h1, h2, . . . be the countable set of positive h’s. Letα > 0 be fixed. Set

M =x ∈ Rd : sup

h>0

ν(Sx,h)µ(Sx,h)

> α

and let G be an arbitrary bounded Borel set. Define further

DN =x ∈ G ∩M : ∃h∈h1,...,hN

ν(Sx,h)µ(Sx,h)

> α

.

Then DN ↑ G∩M . Let N be arbitrary. Choose x1, . . . , xl ∈ DN accordingto Lemma 24.3 with corresponding h(x1), . . . , h(xl) ∈ h1, . . . , hN andm = c (depending only on d) such that, with notation Sj = Sxj ,h(xj),

ν(Sj)µ(Sj)

> α (j = 1, . . . , l),

DN ⊆l⋃

j=1

Sj ,

l∑

j=1

ISj ≤ c.

Then

µ(DN ) ≤l∑

j=1

µ(Sj) <1α

l∑

j=1

ν(Sj) =1α

l∑

j=1

Rd

ISj dν ≤ c

αν(Rd),

and, by N → ∞,

µ(G ∩M) ≤ c

αν(Rd).

Page 516: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

24.2. Semirecursive Kernel Estimate 499

Letting G ↑ Rd, one obtains the assertion µ(M) ≤ cαν(Rd).

Now we state the generalized pointwise Lebesgue density theorem:

Lemma 24.5. Let f be a Borel measurable function integrable on Rd.Then

limh→0

∫Sx,h

|f(t) − f(x)|µ(dt)

µ(Sx,h)= 0 a.s. mod µ.

Proof. For any ε > 0, according to Theorem A.1 choose a continuousfunction g of compact support such that

Rd

|f − g| dµ < ε2

2(c+ 1)

with constant c from Lemma 24.4. We have1

µ(Sx,h)

Sx,h

|f(t) − f(x)|µ(dt)

≤ 1µ(Sx,h)

Sx,h

|f − g| dµ+ |f(x) − g(x)|

+1

µ(Sx,h)

Sx,h

|g(t) − g(x)|µ(dt).

Since g is continuous, the last term on the right-hand side converges to 0.Define the set

Tε =

x : suph>0

1µ(Sx,h)

Sx,h

|f − g| dµ+ |f(x) − g(x)| > ε

.

By Lemma 24.4 and the Markov inequality

µ(Tε) ≤ µ

(

x : suph>0

1µ(Sx,h)

Sx,h

|f − g| dµ > ε/2

)

+µ(x : |f(x) − g(x)| > ε/2)

≤ c(2/ε)∫

Rd

|f − g| dµ+ (2/ε)∫

Rd

|f − g| dµ

=2(c+ 1)

ε

Rd

|f − g| dµ ≤ ε,

where ε → 0 yields the assertion. In the proof we tacitly assumed thatµ(Sx,h) > 0 for all x ∈ Rd, h > 0. We can show that µ(T ) = 0, where

T = x : ∃hx > 0, µ(Sx,hx) = 0.

Let Q denote a countable dense set in Rd. Then for each x ∈ T , there isqx ∈ Q with ‖x−qx‖ ≤ hx/3. This implies that Sqx,hx/2 ⊂ Sx,hx . Therefore

Page 517: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

500 24. Semirecursive Estimates

µ(Sqx,hx/2) = 0, x ∈ T , and

S ⊆⋃

x∈T

Sqx,hx/2.

The right-hand side is a union of countably many sets of zero measure, andtherefore µ(T ) = 0.

Lemma 24.6. It holds that

lim suph→0

hd

µ(Sx,h)= g(x) < ∞ mod µ.

Proof. Let SR be the open sphere centered at 0 with radius R and definethe finite measure λ′ by λ′(B) = λ(B ∩ SR), B ∈ Bd. We obtain

µ

(x ∈ SR; lim sup

h→0

λ(Sx,h)µ(Sx,h)

= ∞)

= µ

(x ∈ SR; lim sup

h→0

λ′(Sx,h)µ(Sx,h)

= ∞)

≤ µ

(x ∈ SR; sup

h>0

λ′(Sx,h)µ(Sx,h)

= ∞)

= lims→∞

µ

(x ∈ SR; sup

h>0

λ′(Sx,h)µ(Sx,h)

> s

)

= 0

by Lemma 24.4 with ν = λ′, then, by R → ∞, the relation

lim suph→0

λ(Sx,h)µ(Sx,h)

< ∞ mod µ

and thus the assertion because λ(Sx,h) = Vdhd, where Vd is the volume of

the unit ball in Rd.

Lemma 24.7. Let m ∈ L2(µ) and let m∗ be the generalized Hardy-Littlewood maximal function of m defined by

m∗(x) = suph>0

1µ(Sx,h)

Sx,h

|m| dµ, x ∈ Rd.

Thus m∗ ∈ L2(µ) and∫m∗(x)2µ(dx) ≤ c∗

∫m(x)2µ(dx),

where c∗ < ∞ depends only on d.

Proof. For arbitrary α > 0 define gα = mI[|m|≥α/2] and let g∗α be its

generalized Hardy-Littlewood maximal function. One has

|m| ≤ |gα| + α/2,

Page 518: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

24.2. Semirecursive Kernel Estimate 501

m∗ ≤ g∗α + α/2.

Thus, with the image measure µm∗ ,

µm∗((α,∞)) = µ(x ∈ Rd : m∗(x) > α)

≤ µ(x ∈ Rd : g∗α(x) > α/2)

≤ 2cα

∫|gα| dµ

=2cα

x∈Rd:|m(x)|≥α/2|m| dµ

with c depending only on d and the last inequality following fromLemma 24.4. Furthermore,

∫m∗2 dµ =

R+

s2 dµm∗(s) = 2∫

R+

αµm∗((α,∞)) dα

by using transformation of integrals and integration by parts. Thus∫m∗2 dµ ≤ 4c

R+

(∫

x∈Rd:|m(x)|≥α/2|m| dµ

)

= 4c∫

Rd

|m(x)|(∫ 2|m(x)|

0dα

)

µ(dx)

(by Fubini’s theorem)

= 8c∫

Rd

|m(x)|2µ(dx).

The proof is complete.

Lemma 24.8. Assume

c1H(||x||) ≤ K(x) ≤ c2H(||x||), c1, c2 > 0,

H(+0) > 0, (24.20)

tdH(t) → 0 as t → ∞, (24.21)

where H is a nonincreasing Borel function on [0,∞). Then, for all µ-integrable functions f ,

limh→0

∫K((x− z)/h)f(z)µ(dz)∫K((x− z)/h)µ(dz)

= f(x) mod µ.

Proof. Clearly∣∣∣∣∣

∫K((x− z)/h)f(z)µ(dz)∫K((x− z)/h)µ(dz)

− f(x)

∣∣∣∣∣

≤ c2c1

∫H

( ||x− y||h

)|f(x) − f(y)|µ(dy)

/∫H

( ||x− y||h

)µ(dy).

Page 519: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

502 24. Semirecursive Estimates

Observe

H(t) =∫ ∞

0IH(t)>s(s) ds.

Thus∫H

( ||x− y||h

)µ(dy) =

∫ (∫ ∞

0IH

( ||x−y||h

)>sds

)µ(dy)

=∫ ∞

y : H

( ||x− y||h

)> t

dt

=∫ ∞

0µ(At,h) dt,

likewise,∫H

( ||x− y||h

)|f(x) − f(y)|µ(dy) =

∫ ∞

0

(∫

At,h

|f(x) − f(y)|µ(dy)

)

dt,

where At,h = y : H(||x− y||/h) > t.Let δ = εhd, ε > 0. Obviously,

∫ ∞

δ

(∫

At,h

|f(x) − f(y)|µ(dy)

)

dt/∫ ∞

0µ(At,h)dt

≤ supt≥δ

(∫

At,h

|f(x) − f(y)|µ(dy)/µ(At,h)

)

. (24.22)

It is clear that the radii of sets At,h, t ≥ δ, do not exceed the radius ofAδ,h, which in turn is h times that of the set Aδ,1. The radius of Aδ,1 doesnot exceed the length H+(δ) of the interval t : H(t) > δ. Thus the radiusof Aδ,h is dominated by hH+(δ). Now by (24.21) and the definition of δ,hH+(δ) = hH+(εhd) converges to zero as h → 0. Since At,h is a ball,Lemma 24.5 implies that the right-hand side of (24.22) tends to zero atµ-almost all x ∈ Rd. But

∫ δ

0

(∫

At,h

|f(x) − f(y)|µ(dy)

)

dt ≤ (c3 + |f(x)|)δ

where c3 =∫ |f(x)|µ(dx). Using (24.20), and thus cI‖x‖≤r ≤ H(r) for

suitable c > 0 and r > 0, we get∫H

( ||x− y||h

)µ(dy) ≥ cµ(Srh) =

c(rh)d

arh(x)

where ah(x) = hd/µ(Sh). Using the above and the definition of δ we obtain∫ δ

0

(∫

At,h

|f(x) − f(y)|µ(dy)

)

dt/

∫ ∞

0µ(At,h) dt

Page 520: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

24.2. Semirecursive Kernel Estimate 503

≤ ε

(c3 + |f(x)|

crd

)arh(x).

By Lemma 24.6 for µ-almost all x the right-hand side of the inequalityabove may be made arbitrarily small for ε small enough. The proof iscompleted.

Note that for K(x) = IS0,1(x) Lemma 24.8 reduces to Lemma 24.5.

Proof of Theorem 24.2. By Lemmas 24.1 and 24.2, this can be doneby verifying (24.3), (24.4), (24.10), (24.12), and (24.13).

Proof of (24.3) and (24.4). Under the assumptions of the theorem,(24.3) and (24.4) hold according to Lemmas 24.8 and 24.6.

Proof of (24.10). The function x → H(‖x‖) which is Riemann-integrableon compact spheres, can be approximated from below by a sequence ofpositive linear combinations

N∑

k=1

cN,kIS0,RN,k

of indicator functions for spheres. Let m∗ be the generalized Hardy-Littlewood maximal function for m, which is defined by

m∗(x) := suph>0

Sx,h

|m(t)|µ(dt)

µ(Sx,h), x ∈ Rd.

Because of Lemma 24.7, m∗ ∈ L2(µ) with∫m∗(x)2µ(dx) ≤ c∗

∫m(x)2µ(dx) ≤ c∗EY 2.

For each h > 0,∫

|m(t)|IS0,RN,k

(x− t

h

)µ (dt) ≤ m∗(x)

∫IS0,RN,k

(x− t

h

)µ (dt),

therefore, by the dominated convergence theorem,∫

|m(t)|H(∥∥∥∥x− t

h

∥∥∥∥

)µ(dt)

= limN→∞

∫|m(t)|

N∑

k=1

cN,kIS0,RN,k

(x− t

h

)µ(dt)

≤ m∗(x) limN→∞

∫ N∑

k=1

cN,kIS0,RN,k

(x− t

h

)µ(dt)

= m∗(x)∫H

(∥∥∥∥x− t

h

∥∥∥∥

)µ(dt),

Page 521: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

504 24. Semirecursive Estimates

thus,∫

|m(t)|K(x− t

h

)µ(dt) ≤ β

∫|m(t)|H

(∥∥∥∥x− t

h

∥∥∥∥

)µ(dt)

≤ βm∗(x)∫H

(∥∥∥∥x− t

h

∥∥∥∥

)µ(dt)

≤ β

αm∗(x)

∫K

(x− t

h

)µ(t) .

Therefore,

suph>0

∫ |m(t)|K(x−th )µ(dt)

∫K(x−t

h )µ(dt)≤ β

αm∗(x),

and (24.10) is proved.

Proof of (24.12). We use the assumptions of the theorem with (24.16)for n ≥ 2 and K1(x, u) = K(0) = 1 (w.l.o.g.), x, u ∈ Rk. For n ≥ 2 oneobtains

EY 2

nKn(x,Xn)2(

n∑

i=1Ki(x,Xi)

)2 ≤ E1

(1 +

n−1∑

i=2K(x−Xi

hi))2 EY 2K

(x−X

hn

)2

,

thus, by Fubini’s theorem,∫ ∞∑

n=1

EY 2

nKn(x,Xn)2(

n∑

i=1Ki(x,Xi)

)2µ(dx)

≤ EY 2 +∫

E(Y 2|X = z)∫ ∞∑

n=2

K

(x− z

hn

)2

×E1

(1 +

n−1∑

i=2K(x−Xi

hi))2µ(dx)µ(dz).

It suffices to show the existence of a constant c1 > 0 such that

supz

∞∑

n=2

E∫K

(x− z

hn

)2 1(

1 +n−1∑

i=2K(x−Xi

hi))2µ(dx) ≤ c1

for any distribution µ. A covering argument of Devroye and Krzyzak (1989)with a refined lower bound is used. Choose R > 0 such that H(R) > 0. LetRd be covered by spheres Ak = xk + S0,R/2 such that every x ∈ Rd getscovered at most k1 = k1(d) times. For each n ≥ 2, z ∈ Rd, we show that

Page 522: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

24.2. Semirecursive Kernel Estimate 505

x ∈ z + hnAk implies

K

( · − x

hi

)≥ cK

( · − z

hi

)IAk

( · − z

hi

)

for all i ∈ 2, . . . , n with c2 = αH(R)/βH(0) ∈ (0, 1] . Without loss ofgenerality let z = 0 . With x/hn = x it suffices to show that

K

(t− x

hn

hi

)≥ c2K(t)

for all x, t ∈ Ak and all n ≥ 2, i ∈ 2, . . . , n. Because of∥∥∥∥t− x

hn

hi

∥∥∥∥ ≤ max

0≤r≤1‖t− rx‖

= max‖t‖, ‖t− x‖ ≤ max‖t‖, R(since hn ≤ hi) one has

H

(∥∥∥∥t− x

hn

hi

∥∥∥∥

)≥ H(R)H(‖t‖)/H(0)

in both cases ‖t − xhn/hi‖ ≤ ‖t‖, ‖t − xhn/hi‖ ≤ R by monotonicity ofH, and thus the desired inequality.Now for each z ∈ Rk one obtains

∞∑

n=2

E∫

Rd

K(x−zhn

)2(

1 +n−1∑

i=2K(x−Xi

hi))2µ(dx)

=∞∑

k=1

∞∑

n=2

E∫

z+hnAk

K(x−zhn

)2(

1 +n−1∑

i=2K(x−Xi

hi))2µ(dx)

≤∞∑

k=1

∞∑

n=2

E∫

z+hnAk

K(x−zhn

)2(

1 + c2n−1∑

i=2K(Xi−z

hi)IAk

(Xi−zhi

))2µ(dx)

≤ 1c22

∞∑

k=1

∞∑

n=2

E∫

Rd

K(x−zhn

)2IAk(x−z

hn)

(1 +n−1∑

i=2K(Xi−z

hi)IAk

(Xi−zhi

))2µ(dx)

≤ 1c22

E∫ ∞∑

k=1

sups∈Ak

K(s)∞∑

n=2

K(x−zhn

)IAk(x−z

hn)

(1 +

n−1∑

i=2K(Xi−z

hi)IAk

(Xi−zhi

))2µ(dx)

≤ 1c22

∞∑

k=1

sups∈Ak

K(s) · E∞∑

n=2

K(Xn−zhn

)IAk(Xn−z

hn)

(1 +

n−1∑

i=2K(Xi−z

hi)IAk

(Xi−zhi

))2

Page 523: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

506 24. Semirecursive Estimates

≤ 4K2max

c22

∞∑

k=1

sups∈Ak

K(s) · E∞∑

n=2

K(Xn−zhn

)IAk(Xn−z

hn)

(1 +

n∑

i=2K(Xi−z

hi)IAk

(Xi−zhi

))2

≤ 4K2max

c22

∞∑

k=1

sups∈Ak

K(s)

< ∞ .

Here the formulaN∑

n=2

an

(1 +∑n

i=2 ai)2 ≤ 1 − 1

1 +∑N

i=2 ai

,

valid for all sequences an with an ≥ 0, n ≥ 2 and obtainable by induction,is used, and also the properties of K.

The case Kn(x, z) = K

(x− z

hn

)with αIS0,R

≤ K ≤ βIS0,R(0 < α <

β < ∞, 0 < R < ∞), hence w.l.o.g. K = IS0,Ris treated analogously, but

in a slightly simpler way, using

EY 2

nK(x−Xn

hn)2

(∑ni=1K(x−Xi

hi))2 = E

Y 2nK(x−Xn

hn)

(1 +

∑n−1i=1 K(x−Xi

hi))2

= E1

(1 +

∑n−1i=1 K(x−Xi

hi))2 EY 2

nK

(x−Xn

hn

)

and∞∑

n=1

IS0,R∩Ak(Xn−z

hn)

(∑ni=1 IS0,R∩Ak

(Xi−zhi

))2 ≤

∞∑

n=1

1n2 ≤ 2

for all k and z.

Proof of (24.13). One notices∫ ∞∑

n=1

Emn(x)2Kn(x,Xn)2

(∑n

i=1Ki(x,Xi))2 µ(dx)

≤∫

supnmn(x)2E

∞∑

n=1

Kn(x,Xn)2

(∑n

i=1Ki(x,Xi))2µ(dx).

Under the assumptions of the theorem (w.l.o.g. K(0) = 1, in the caseαIS0,R

≤ K ≤ βIS0,Reven K = IS0.R

) one has∞∑

n=1

Kn(x,Xn)2

(∑n

i=1Ki(x,Xi))2 ≤ Kmax ·

∞∑

n=1

Kn(x,Xn)(

n∑

i=1Ki(x,Xi)

)2 ≤ 2 ·Kmax

Page 524: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

24.3. Semirecursive Partitioning Estimate 507

for all x and all sequences Xn, according to the final argument in the proofof (24.12). This together with (24.10) yields the assertion.

24.3 Semirecursive Partitioning Estimate

For the semirecursive partitioning estimate we are given a sequence of (fi-nite or countably infinite) partitions Pn = An,1, An,2, . . . of Rd, whereAn,1, An,2, . . . are Borel sets. Then put

Kn(x, u) =∞∑

j=1

I[x∈An,j ,u∈An,j ]. (24.23)

For z ∈ Rd set An(z) = An,j if z ∈ An,j . Then (24.23) can be written inthe form

Kn(x, u) = IAn(x)(u) = IAn(u)(x).

We call the sequence of partitions Pn nested if the sequence of generatedσ-algebras F(Pn) is increasing.

Theorem 24.3. If

the sequence of partitions Pn is nested, (24.24)

diamAn(z) := supu,v∈An(z)

‖u− v‖ → 0 (n → ∞) (24.25)

for each z ∈ Rd and∞∑

i=n

λ(Ai(z)) = ∞ (24.26)

for each z and n, then the semirecursive partitioning estimate is weakly andstrongly universally consistent.

If the sequence of partitions Pn is nested, then the semirecursive par-titioning estimator has the additional advantage that it is constant overeach cell of Pn. Such an estimator can be represented computationally bystoring the constant numerator and denominator for each (nonempty) cell.

The proof of Theorem 24.3 applies the pointwise generalized Lebesguedensity theorem for nested partitions (first part of Lemma 24.10).

In all lemmas of this section, µ and ν are assumed to be measuresassigning finite values to bounded sets.

Lemma 24.9. Assume (24.24). Then

µ

x ∈ Rd : sup

n

ν(An(x))µ(An(x))

> α

≤ ν(Rd)

α

for any α > 0.

Page 525: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

508 24. Semirecursive Estimates

Proof. Set

S :=x ∈ Rd : sup

n>0

ν(An(x))µ(An(x))

> α

.

For x ∈ S choose nx, ix ∈ N with x ∈ Anx,ixand

ν(Anx,ix)

µ(Anx,ix)> α. (24.27)

Clearly

S ⊆⋃

x∈S

Anx,ix .

By the nestedness of Pn

Anj ,ij : j

:=Anx,ix

: x ∈ S and for every y ∈ S : x ∈ Any,iyor ny > nx

is a disjoint finite- or infinite-countable cover of S. Using this and (24.27)we obtain

µ(S) ≤ µ

j

Anj ,ij

=∑

j

µ(Anj ,ij )

≤∑

j

1αν(Anj ,ij

) =1αν

j

Anj ,ij

≤ 1αν(Rd)

and the assertion is proved.

Lemma 24.10. Assume (24.24) and (24.25). Then

limn→∞

∫An(x) f(z)µ(dz)

µ(An(x))= f(x) mod µ,

and

lim infn→∞

µ(An(x))λ(An(x))

> 0 mod µ.

Proof. See Problem 24.3.

Lemma 24.11. Let the conditions of Lemma 24.9 be fulfilled, let m ∈L2(µ), and let m∗ denote the generalized Hardy-Littlewood maximalfunction of m defined by

m∗(x) = supn

1µ(An(x))

An(x)|m| dµ, x ∈ Rd.

Page 526: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

24.3. Semirecursive Partitioning Estimate 509

Thus m∗ ∈ L2(µ) and∫m∗(x)2µ(dx) ≤ c∗

∫m(x)2µ(dx),

where c∗ depends only on d.

Proof. See Problem 24.4.

Proof of Theorem 24.3. By Lemmas 24.1 and 24.2, this can be doneby verifying (24.3), (24.4), (24.10), (24.12), and (24.13).

Proof of (24.3). See Lemma 24.10.

Proof of (24.4). By Lemma 24.10 we have that

lim infn→∞

µ(Ai(x))λ(Ai(x))

> 0 mod µ,

which together with (24.26) impliesn∑

i=1

EKi(x,Xi) =n∑

i=1

µ(Ai(x))λ(Ai(x))

· λ(Ai(x)) → ∞ mod µ.

Proof of (24.10). See Lemma 24.11.

Proof of (24.12). It suffices to show the existence of a constant c1 > 0such that

supz

E∞∑

n=1

∫Kn(x, z)

1(1 +

∑n−1i=1 Ki(x,Xi)

)2µ(dx) ≤ c1

for any distribution µ. The sequence of partitions is nested, thus x ∈ An(z)and i ≤ n imply z ∈ An(x) ⊆ Ai(x) which in turn implies Ai(x) = Ai(z).Therefore,

E∞∑

n=1

∫Kn(x, z)

1(1 +

∑n−1i=1 Ki(x,Xi)

)2 µ(dx)

= E∞∑

n=1

∫Ix∈An(z)

1(1 +

∑n−1i=1 IXi∈Ai(x)

)2 µ(dx)

= E∞∑

n=1

∫Ix∈An(z)

1(1 +

∑n−1i=1 IXi∈Ai(z)

)2 µ(dx)

= E∞∑

n=1

µ(An(z))(1 +

∑n−1i=1 IXi∈Ai(z)

)2

Page 527: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

510 24. Semirecursive Estimates

= E∞∑

n=1

IXn∈An(z)(∑n

i=1 IXi∈Ai(z))2 . (24.28)

If for fixed X1, X2, ..., n is the kth index with IXn∈An(z) = 1, then∑ni=1 IXi∈Ai(z) = k. Therefore,

∞∑

n=1

IXn∈An(z)(∑n

i=1 IXi∈Ai(z))2 ≤

∞∑

k=1

1k2 ≤ 2

which together with (24.28) yields the assertion.

Proof of (24.13). One notices∫ ∞∑

n=1

Emn(x)2Kn(x,Xn)2(

n∑

i=1Ki(x,Xi)

)2 µ (dx)

≤∫

supnmn(x)2E

∞∑

n=1

Kn(x,Xn)(

n∑

i=1Ki(x,Xi)

)2 µ (dx) .

One has∞∑

n=1

Kn(x,Xn)(

n∑

i=1Ki(x,Xi)

)2 ≤ 2

for all x and all sequences Xn, according to the final argument in the proofof (24.12). This together with (24.10) yields the assertion.

24.4 Bibliographic Notes

Theorems 24.1, 24.2, and 24.3 are due to Gyorfi, Kohler, and Walk (1998).In the literature, the semirecursive estimates have been considered with thegeneral form

Kn(x, u) = αnK

(x− u

hn

)

with a sequence of weights αn > 0. Motivated by a recursive kernel den-sity estimate due to Wolverton and Wagner (1969b) and Yamato (1971),Greblicki (1974) and Ahmad and Lin (1976) proposed and studied semire-cursive kernel estimates with αn = 1/hd

n, see also Krzyzak and Pawlak(1983). The choice αn = 1 has been proposed and investigated by De-vroye and Wagner (1980b). Consistency properties of this estimate werestudied by Krzyzak and Pawlak (1984a), Krzyzak (1992), Greblicki andPawlak (1987b), and Gyorfi, Kohler, and Walk (1998). Lemma 24.4 is

Page 528: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

Problems and Exercises 511

related to Lemma 10.47 of Wheeden and Zygmund (1977). Lemma 24.6is due to Devroye (1981). Lemma 24.7 deals with the Hardy-Littlewoodmaximal function (see Stein and Weiss (1971) and Wheeden and Zyg-mund (1977)). Concerning Lemma 24.8 see Greblicki, Krzyzak, and Pawlak(1984), Greblicki and Pawlak (1987a), and Krzyzak (1991). Semirecursivekernel estimates was applied to estimation of nonlinear, dynamic systemsin Krzyzak (1993).

Problems and Exercises

Problem 24.1. Prove the Abel-Dini theorem: for any sequence an ≥ 0 witha1 > 0,

∞∑

n=1

an(∑n

i=1 ai

)2 < ∞.

Problem 24.2. Prove Lemma 24.2.Hint: First show that

∫ (∑n

i=1 mi(x)Ki(x, Xi)∑n

i=1 Ki(x, Xi)− m(x)

)2

µ(dx) → 0 a.s.

by the use of (24.3), (24.10), (24.4), (24.8), the Toeplitz lemma, and Lebesgue’sdominated convergence theorem. Then formulate a recursion for Un with

Un(x) =

∑n

i=1(Yi − mi(x))Ki(x, Xi)∑n

i=1 Ki(x, Xi)

and show a.s. convergence of∫

Un(x)2µ(dx) by the use of (24.12), (24.13), andTheorem A.6 distinguishing cases (24.14) and (24.15). These results yield a.s.convergence of

∫(mn(x) − m(x))2µ(dx) and thus, by Lemma 24.1, the assertion

of part (a). Prove part (b) analogously by taking expectations.

Problem 24.3. Prove Lemma 24.10.

Problem 24.4. Prove Lemma 24.11.

Problem 24.5. Formulate and prove the variant of Lemma 23.3 by which theproof of Lemma 24.1 can be finished.

Problem 24.6. For d = 1 and nonnested partitions where each partition consistsof nonaccumulating intervals, prove the assertions of Lemma 24.9 with factor 2on the right-hand side of the inequalities, and then prove Theorem 24.3 for d = 1without condition (24.24).

Hint: Use arguments in the proof of Lemma 24.4.

Problem 24.7. Prove both parts of Lemma 24.10 using a martingale conver-gence theorem (Theorem A.4).

Hint: Let Fn be the σ-algebra generated by the partition Pn. Put

fn(x) = Ef(X)|X ∈ An(x).

Then (fn, Fn) forms a convergent martingale on the probability space (Rd, Bd, µ).

Page 529: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

25Recursive Estimates

25.1 A General Result

Introduce a sequence of bounded, measurable, symmetric and nonnegativevalued functions Kn(x, z) on Rd × Rd. Let an be a sequence of positivenumbers, then the estimator is defined by the following recursion:

m1(x) = Y1,

mn+1(x) = mn(x)(1 − an+1Kn+1(x,Xn+1)) + an+1Yn+1Kn+1(x,Xn+1).(25.1)

By each new observation the estimator will be updated. At the nth stageone has only to store mn(x). The estimator is of a stochastic approximationtype, especially of Robbins-Monro type (cf. Ljung, Pflug, and Walk (1992)with further references). mn+1(x) is obtained as a linear combination ofthe estimates mn(x) and Yn+1 with weights 1 − an+1Kn+1(x,Xn+1) andan+1Kn+1(x,Xn+1), respectively.

Theorem 25.1. Assume that there exists a sequence hn of positive num-bers tending to 0 and a nonnegative nonincreasing function H on [0,∞)with rdH(r) → 0 (r → ∞) such that

hdnKn(x, z) ≤ H(‖x− z‖/hn), (25.2)

supx,z,n

anKn(x, z) < 1, (25.3)

Page 530: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

25.1. A General Result 513

lim infn

∫Kn(x, t)µ(dt) > 0 for µ-almost all x, (25.4)

n

an = ∞, (25.5)

and∑

n

a2n

h2dn

< ∞. (25.6)

Then mn is weakly and strongly universally consistent.

Proof. The proof of the first statement can be done by the verification ofthe conditions of Stone’s theorem (Theorem 4.1). If, by definition, a voidproduct is 1, then by (25.1),

Wn,i(z) =n∏

l=i+1

(1 − alKl(z,Xl))aiKi(z,Xi)

for n ≥ 2, i = 1, . . . , n, and

W1,1(z) = 1.

It is easy to check by induction that these are probability weights. Tocheck condition (i) let bn be defined by (25.1) if Yi is replaced by f(Xi)and b1(x) = f(X1), where f is an arbitrary nonnegative Borel function.Then we prove that

Ebn(X) = Ef(X), (25.7)

which implies (i) with c = 1. Introduce the notation Fn for the σ-algebragenerated by (Xi, Yi) (i = 1, 2, . . . , n),

Ebn+1(X)|Fn= Ebn(X)|Fn + an+1E(f(Xn+1) − bn(X))Kn+1(X,Xn+1)|Fn

= Ebn(X)|Fn + an+1

∫ ∫(f(x) − bn(z))Kn+1(z, x)µ(dx)µ(dz)

= Ebn(X)|Fn + an+1

∫ ∫(f(x) − bn(x))Kn+1(x, z)µ(dx)µ(dz)

= Ebn(X)|Fn + an+1E(f(X) − bn(X))Kn+1(X,Xn+1)|Fn,where the symmetry of Kn(x, z) was applied. Thus

Ebn+1(X) = Ebn(X) + an+1E(f(X) − bn(X))Kn+1(X,Xn+1).Define another sequence

b∗1(X) = f(X),

b∗n+1(X) = b∗n(X) + an+1(f(X) − b∗n(X))Kn+1(X,Xn+1).

Page 531: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

514 25. Recursive Estimates

Then

b∗n(X) = f(X)

and

Eb∗n+1(X) = Eb∗n(X) + an+1E(f(X) − b∗n(X))Kn+1(X,Xn+1).For Eb∗n(X) and for Ebn(X) we have the same iterations and, thus,

Eb∗n(X) = Ebn(X),so (25.7) and therefore (i) is proved. Set

pn(x) =∫Kn(x, t)µ(dt).

Then by condition (25.4) for each fixed i and for µ-almost all z,

EWn,i(z) ≤ H(0)ai

hdi

e−∑n

l=i+1alEKl(z,X) =

H(0)ai

hdi

e−∑n

l=i+1alpl(z) → 0.

(25.8)Obviously,

En∑

i=1

W 2n,i(z) → 0 (25.9)

for µ-almost all z implies (v), so we prove (25.9) showing

En∑

i=1

W 2n,i(z) ≤

n∑

i=1

EWn,i(z)H(0)ai

hdi

→ 0

by the Toeplitz lemma and by (25.6) and (25.8). Concerning (iii) it isenough to show that for all a > 0 and for µ-almost all z,

E

n∑

i=1

Wn,i(z)I[‖Xi−z‖>a]

→ 0.

By (25.4),

lim infn

pn(z) = 2p(z) > 0 for µ-almost all z,

so for such z there is an n0(z) such that, for n > n0(z),

pn(z) ≥ p(z).

Because of (25.8) it suffices to show, for these z,

En∑

i=n0(z)

Wn,i(z)I[‖Xi−z‖>a] → 0.

Page 532: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

25.1. A General Result 515

Because of the conditions

En∑

i=n0(z)

Wn,i(z)I[‖Xi−z‖>a]

=n∑

i=n0(z)

EWn,i(z)EKi(z,Xi)I[‖Xi−z‖>a]

EKi(z,Xi)

≤n∑

i=n0(z)

EWn,i(z)h−di H(a/hi)pi(z)

≤n∑

i=n0(z)

EWn,i(z)h−di H(a/hi)p(z)

→ 0

by the Toeplitz lemma. Thus the first statement is proved:

E‖mn −m‖2 → 0. (25.10)

In order to prove the second statement note

mn+1(x) − Emn+1(x)

= mn(x) − Emn(x)

− an+1(mn(x)Kn+1(x,Xn+1) − Emn(x)Kn+1(x,Xn+1))

+ an+1(Yn+1Kn+1(x,Xn+1) − EYn+1Kn+1(x,Xn+1)),

therefore,

E(mn+1(x) − Emn+1(x))2|Fn = I1 + I2 + I3 + I4 + I5 + I6,

where

I1 = (mn(x) − Emn(x))2

and because of the independence of mn(x) and Kn+1(x,Xn+1),

I2 = Ea2n+1(mn(x)Kn+1(x,Xn+1) − Emn(x)Kn+1(x,Xn+1))2|Fn

= a2n+1(mn(x)2(EKn+1(x,Xn+1)2 − [EKn+1(x,Xn+1)]2)

+(mn(x) − Emn(x))2[EKn+1(x,Xn+1)]2)

≤ H(0)2a2n+1

h2dn+1

(mn(x)2 + (mn(x) − Emn(x))2)

and

I3 = Ea2n+1(Yn+1Kn+1(x,Xn+1) − EYn+1Kn+1(x,Xn+1))2|Fn

≤ a2n+1EY

2n+1Kn+1(x,Xn+1)2

≤ H(0)2a2n+1

h2dn+1

EY 2

Page 533: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

516 25. Recursive Estimates

and

I4 = −2an+1(mn(x) − Emn(x))Emn(x)Kn+1(x,Xn+1)

−Emn(x)Kn+1(x,Xn+1)|Fn= −2an+1(mn(x) − Emn(x))2EKn+1(x,Xn+1)

≤ 0

and

I5 = 2an+1(mn(x) − Emn(x))

×EYn+1Kn+1(x,Xn+1) − EYn+1Kn+1(x,Xn+1)|Fn= 0

and

I6

= −2a2n+1E(mn(x)Kn+1(x,Xn+1) − Emn(x)Kn+1(x,Xn+1))

×(Yn+1Kn+1(x,Xn+1) − EYn+1Kn+1(x,Xn+1))|Fn= −2a2

n+1mn(x)

×Em(Xn+1)Kn+1(x,Xn+1)2 −Kn+1(x,Xn+1)EYn+1Kn+1(x,Xn+1)

≤ 2H(0)2a2

n+1

h2dn+1

|mn(x)|(E|m(X)| + E|Y |)

≤ 4H(0)2a2

n+1

h2dn+1

|mn(x)|E|Y |

≤ 2H(0)2a2

n+1

h2dn+1

(mn(x)2 + EY 2).

Thus summarizing

E(mn+1(x) − Emn+1(x))2|Fn

≤(

1 +H(0)2a2

n+1

h2dn+1

)(mn(x) − Emn(x))2 + 3

H(0)2a2n+1

h2dn+1

(mn(x)2 + EY 2).

(25.11)

Analogously, by taking the integral with respect to µ, one obtains

E‖mn+1 − Emn+1‖2|Fn

≤(

1 +H(0)2a2

n+1

h2dn+1

)‖mn − Emn‖2 + 3

H(0)2a2n+1

h2dn+1

(‖mn‖2 + EY 2).

(25.12)

Page 534: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

25.2. Recursive Kernel Estimate 517

Relation (25.10) implies

‖Emn −m‖2 → 0 (25.13)

and

E‖mn − Emn‖2 → 0 (25.14)

and

E‖mn‖2 = O(1). (25.15)

Now according to Theorem A.5, because of (25.6) and (25.15), from (25.12)one obtains a.s. convergence of ‖mn − Emn‖2. Because of (25.14),

‖mn − Emn‖2 → 0

in probability, which together with the a.s. convergence of ‖mn − Emn‖2

yields ‖mn − Emn‖2 → 0 a.s. This together with (25.13) yields the secondstatement.

25.2 Recursive Kernel Estimate

For a kernel K : Rd → R+ consider the recursive estimator (25.1) with

Kn(x, z) =1hd

n

K

(x− z

hn

), x, z ∈ Rd. (25.16)

Theorem 25.2. Assume for the kernel K that there is a ball S0,r of radiusr > 0 centered at the origin, and a constant b > 0 such that K(x) ≥ bIS0,r

and that there is a nonnegative nonincreasing Borel function H on [0,∞)with rdH(r) → 0 (r → ∞) such that

K(x) ≤ H(‖x‖), (25.17)

hn > 0, limnhn = 0, (25.18)

an > 0,∑

n

an = ∞, (25.19)

supxK(x) sup

n

an

hdn

< 1, (25.20)

and∑

n

a2n

h2dn

< ∞. (25.21)

Then mn is weakly and strongly universally consistent.

Proof. See Problem 25.1.

Page 535: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

518 25. Recursive Estimates

25.3 Recursive Partitioning Estimate

For the recursive partitioning estimate we are given a sequence of (fi-nite or countably infinite) partitions Pn = An,1, An,2, . . . of Rd, whereAn,1, An,2, . . . are Borel sets. For z ∈ Rd set An(z) = An,j if z ∈ An,j , thenconsider the recursive estimator (25.1) with

Kn(x, z) =1hd

n

Iz∈An(x), x, z ∈ Rd. (25.22)

Theorem 25.3. Assume that for a nested sequence of partitions with

diamAn(z) ≤ hn → 0, lim infn→∞

λ(An(z))/hdn > 0, (25.23)

for all z ∈ Rd,

an > 0,∑

n

an = ∞, (25.24)

supn

an

hdn

< 1, (25.25)

and∑

n

a2n

h2dn

< ∞. (25.26)

Then mn is weakly and strongly universally consistent.

Proof. See Problem 25.2.

25.4 Recursive NN Estimate

Introduce the recursive NN estimate such that it splits the data sequenceDn = (X1, Y1), . . . , (Xn, Yn) into disjoint blocks of length l1, . . . , lN ,where l1, . . . , lN are positive integers. In each block find the nearest neigh-bor of x, and denote the nearest neighbor of x from the ith block by X∗

i (x).Let Y ∗

i (x) be the corresponding label. Ties are broken by comparing in-dices as in Chapter 6. The recursive regression estimate is as follows: if∑N

i=1 li ≤ n <∑N+1

i=1 li, then

mn(x) =1N

N∑

i=1

Y ∗i (x).

Theorem 25.4. If limN→∞ lN = ∞, then mn is weakly and stronglyuniversally consistent.

Page 536: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

25.4. Recursive NN Estimate 519

Proof. Put

mN (x) = m∑N

i=1li(x) =

1N

N∑

i=1

Y ∗i (x),

then it suffices to prove the weak and strong universal consistency of mN .mN is an average of independent random variables taking values in L2.First we show that concerning the bias

$$
\|\mathbf{E}m_N-m\|\to 0,
$$
which follows from the Toeplitz lemma if
$$
\|\mathbf{E}Y_i^*-m\|\to 0.
$$
Obviously,
$$
\mathbf{E}Y_i^*(x)
=\mathbf{E}\sum_{j=\sum_{k=1}^{i-1}l_k+1}^{\sum_{k=1}^{i}l_k}Y_jI_{\{X_i^*(x)=X_j\}}
=\mathbf{E}\sum_{j=1}^{l_i}Y_jI_{\{X^*(x)=X_j\}}
=\mathbf{E}\sum_{j=1}^{l_i}m(X_j)I_{\{X^*(x)=X_j\}}
=\mathbf{E}m(X^*(x)),
$$
where $X^*(x)$ is the nearest neighbor of $x$ from $X_1,\dots,X_{l_i}$. Because of Problem 6.3,
$$
\mathbf{E}\big(m(X^*(X))-m(X)\big)^2\to 0,
$$
therefore,
$$
\mathbf{E}\big(\mathbf{E}\{m(X^*(X))\mid X\}-m(X)\big)^2\to 0.
$$

Turning next to the variation term, we show that
$$
C=\sup_i\mathbf{E}\|Y_i^*-\mathbf{E}Y_i^*\|^2<\infty.
$$
Put $\sigma^2(x)=\mathbf{E}\{Y^2\mid X=x\}$ and $C^*=\sup_{z,i}\int K_i(x,z)\,\mu(dx)\le\gamma_d$; then
$$
\mathbf{E}\|Y_i^*-\mathbf{E}Y_i^*\|^2
\le\mathbf{E}\|Y_i^*\|^2
=\mathbf{E}Y_i^*(X)^2
=\int\!\!\int\sigma^2(z)K_i(x,z)\,\mu(dx)\,\mu(dz)
\le C^*\int\sigma^2(z)\,\mu(dz)
=C^*\mathbf{E}Y^2
=C.
$$
Thus
$$
\mathbf{E}\|m_N-\mathbf{E}m_N\|^2
=\frac{1}{N^2}\sum_{i=1}^N\mathbf{E}\|Y_i^*-\mathbf{E}Y_i^*\|^2
\le C\,\frac{1}{N}\to 0,
$$

so the weak universal consistency is proved. Let $\mathcal{F}_n$ denote the $\sigma$-algebra generated by $X_1,Y_1,\dots,X_n,Y_n$. Concerning the strong universal consistency we can apply the almost supermartingale convergence theorem (Theorem A.5), since
$$
\mathbf{E}\Big\{\|m_N-\mathbf{E}m_N\|^2\,\Big|\,\mathcal{F}_{\sum_{j=1}^{N-1}l_j}\Big\}
=(1-1/N)^2\|m_{N-1}-\mathbf{E}m_{N-1}\|^2+\frac{1}{N^2}\,\mathbf{E}\|Y_N^*-\mathbf{E}Y_N^*\|^2
\le\|m_{N-1}-\mathbf{E}m_{N-1}\|^2+C/N^2.
$$

25.5 Recursive Series Estimate

Introduce a sequence of functions on $\mathbb{R}^d$: $\phi_1,\phi_2,\dots$. Let $k_n$ be a nondecreasing sequence of positive integers and let
$$
K_n(x,u)=\sum_{j=1}^{k_n}\phi_j(x)\phi_j(u),\qquad x,u\in\mathbb{R}^d.
$$
Then the recursive series estimate is defined by the following recursion:
$$
m_1(x)=Y_1,
$$
$$
m_{n+1}(x)=m_n(x)-a_{n+1}\big(m_n(X_{n+1})-Y_{n+1}\big)K_{n+1}(x,X_{n+1}),
$$
where $a_n>0$.
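For $d=1$ this recursion can be run explicitly once a concrete orthonormal system is fixed. The sketch below uses the trigonometric system on $(-\pi,\pi]$ mentioned after Theorem 25.5, keeps track of $m_n$ both on a query grid and at the sample points (the latter is needed for the residual $m_n(X_{n+1})-Y_{n+1}$), and leaves the choice of $k_n$ and $a_n$ to the caller. The helper names and the suggested sequences in the closing comment are our own illustration, not part of the book.

```python
import numpy as np

def trig_basis(x, k):
    """First k functions of the trigonometric system on (-pi, pi]: constant,
    cos(jx), sin(jx), normalized in L2(lambda) and bounded by 1."""
    x = np.asarray(x, dtype=float)
    phis, j = [np.full_like(x, 1.0 / np.sqrt(2.0 * np.pi))], 1
    while len(phis) < k:
        phis.append(np.cos(j * x) / np.sqrt(np.pi))
        if len(phis) < k:
            phis.append(np.sin(j * x) / np.sqrt(np.pi))
        j += 1
    return np.array(phis)                      # shape (k, len(x))

def recursive_series_estimate(X, Y, x_grid, k_of_n, a_of_n):
    """Recursive series estimate (d = 1): m_1 = Y_1 and
    m_{n+1}(x) = m_n(x) - a_{n+1}(m_n(X_{n+1}) - Y_{n+1}) K_{n+1}(x, X_{n+1}),
    with K_n(x, u) = sum_{j <= k_n} phi_j(x) phi_j(u)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    m_grid = np.full(len(x_grid), Y[0])        # m_1 = Y_1 on the query grid
    m_X = np.full(len(X), Y[0])                # m_n at the sample points
    for n in range(1, len(X)):                 # update with (X_{n+1}, Y_{n+1})
        k, a = k_of_n(n + 1), a_of_n(n + 1)
        resid = m_X[n] - Y[n]                  # m_n(X_{n+1}) - Y_{n+1}
        phi_new = trig_basis([X[n]], k)[:, 0]  # phi_j(X_{n+1}), j <= k_{n+1}
        m_grid -= a * resid * (trig_basis(x_grid, k).T @ phi_new)
        m_X -= a * resid * (trig_basis(X, k).T @ phi_new)
    return m_grid

# Illustrative sequences with k_n -> inf, sum a_n = inf, sum a_n^2 k_n^2 < inf:
# k_of_n = lambda n: int(n ** 0.2) + 1;  a_of_n = lambda n: 1.0 / n
```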

Theorem 25.5. Let the functions $\phi_1,\phi_2,\dots$ be uniformly bounded by 1 and square integrable with respect to the Lebesgue measure $\lambda$. Assume, moreover, that they form an orthonormal system in $L_2(\lambda)$ and span $L_2(\mu)$, i.e., the set of finite linear combinations of the $\phi_i$'s is dense in $L_2(\mu)$. If
$$
k_n\to\infty,\qquad \sum_n a_n=\infty,\qquad \sum_n a_n^2k_n^2<\infty,
$$
then the recursive series estimate is weakly and strongly consistent.

Observe that Theorem 25.5 implies a strong universal consistency result if one has an example of $\phi_i$'s satisfying the conditions universally for all $\mu$. As a possible example consider $d=1$ (it can be extended to $d>1$). By Theorem A.1, for an arbitrary $f\in L_2(\mu)$, choose a continuous $f^*$ of compact support with $\|f-f^*\|_\mu<\varepsilon$. For instance, let $(\phi_i')$ be the Walsh system of support $(0,1]$ and
$$
\phi_{ij}(x)=\phi_i'(x-j),\qquad i=0,1,2,\dots,\ j=0,\pm 1,\pm 2,\dots.
$$
Then $(\phi_{i,j})$ is an orthonormal system in $L_2(\lambda)$ and $f^*$ can be approximated with respect to the supremum norm by linear combinations of the $\phi_{ij}$'s (see, e.g., Alexits (1961), Sections 1.6, 1.7, 4.2), which then approximate $f$, too, in $L_2(\mu)$. Instead of the Walsh system one can use the standard trigonometric system of support $(-\pi,\pi]$ or the system of Legendre polynomials of support $(-1,1]$.

When $k_n$ is fixed, say $k$, then the corresponding results hold with $m$ replaced by the projection $P_km$ of $m$ onto $\mathrm{span}\{\phi_1,\dots,\phi_k\}$ with respect to $L_2(\mu)$. $P_km$ is unique as an element of $L_2(\mu)$, but its representation as a linear combination of $\phi_1,\dots,\phi_k$ is generally not unique. The a.s. convergence result for fixed $k_n$ also follows from the general stochastic approximation results in Hilbert space (Fritz (1974), Gyorfi (1980; 1984), Walk (1985), Walk and Zsido (1989)), where in the case of weights $a_n=1/n$ only the stationarity and ergodicity of $\{(X_n,Y_n)\}$, together with the condition $\mathbf{E}Y^2<\infty$, are assumed.

The proper choice of $a_n$ and $k_n$ can be made knowing the rate of convergence, which is possible assuming regularity conditions on the distribution of $(X,Y)$.

For the proof of Theorem 25.5 we use some lemmas and the following notations. Let $(\cdot,\cdot)$, $(\cdot,\cdot)_\lambda$, $\|\cdot\|$, $\|\cdot\|_\lambda$ denote the inner products and the norms in the spaces $L_2(\mu)$, $L_2(\lambda)$, respectively. Moreover, put
$$
\mathcal{H}_n=\mathrm{span}\{\phi_1,\dots,\phi_{k_n}\}.
$$
Introduce the notation $\mathcal{F}_n$ for the $\sigma$-algebra generated by $(X_i,Y_i)$ $(i=1,2,\dots,n)$. Let, for $z\in L_2(\mu)$,
$$
A_nz=z(X_n)\sum_{i=1}^{k_n}\phi_i(X_n)\phi_i,\qquad \bar A_nz:=\mathbf{E}A_nz,
$$
and
$$
b_n=Y_n\sum_{i=1}^{k_n}\phi_i(X_n)\phi_i,\qquad \bar b_n:=\mathbf{E}b_n.
$$
Obviously,
$$
\bar A_nz=\sum_{i=1}^{k_n}(z,\phi_i)\phi_i
$$


and
$$
\bar b_n=\sum_{i=1}^{k_n}(m,\phi_i)\phi_i
$$
and
$$
\bar A_nm=\bar b_n.
$$
The above recursion can now be written as
$$
m_{n+1}=m_n-a_{n+1}A_{n+1}m_n+a_{n+1}b_{n+1}. \qquad (25.27)
$$

Lemma 25.1. For each $z\in L_2(\mu)$ one has
$$
\|\bar A_nz\|\le k_n\|z\|,\qquad
\mathbf{E}\|A_nz\|^2\le k_n^2\|z\|^2,\qquad
\mathbf{E}\|b_n\|^2\le k_n^2\,\mathbf{E}Y_1^2,
$$
$$
\|\bar A_nz\|_\lambda^2\le k_n\|z\|^2,\qquad
\mathbf{E}\|A_nz\|_\lambda^2\le k_n\|z\|^2,\qquad
\mathbf{E}\|(A_{n+1}-\bar A_{n+1})m_n\|_\lambda^2\le k_{n+1}\mathbf{E}\|m_n\|^2,
$$
and
$$
\mathbf{E}\|b_n-\bar b_n\|_\lambda^2\le k_n\,\mathbf{E}Y_1^2.
$$

Proof. Using the boundedness assumption on the φi’s, the proof isstraightforward. In particular, one obtains

‖Anz‖2λ =

kn∑

i=1

(z, φi)2 ≤kn∑

i=1

‖z‖2‖φi‖2 ≤ kn‖z‖2

and

E‖Anz‖2λ = E

z(Xn)2kn∑

i=1

φi(Xn)2

≤ knEz(Xn)2 = kn‖z‖2.

Moreover, by use of the independence assumption,

E‖(An+1 − An+1)mn‖2λ = EE‖(An+1 − An+1)mn‖2

λ|Fn

=∫

E‖(An+1 − An+1)v‖2λPmn(dv)

≤∫

E‖An+1v‖2λPmn(dv)

≤ kn+1

∫Ev(Xn+1)2Pmn(dv)

= kn+1

∫ [∫v(x)2µ(dx)

]Pmn(dv)


= kn+1

∫‖v‖2Pmn(dv)

= kn+1E‖mn‖2

and

E‖bn − bn‖2λ ≤ E‖bn‖2

λ ≤ knEY 21 .

A connection between the inner product in $L_2(\lambda)$ and the norm in $L_2(\mu)$ is given by the following lemma:

Lemma 25.2. For each $z\in\mathcal{H}_n$ one has
$$
(\bar A_nz,z)_\lambda=\|z\|^2.
$$

Proof.

(Anz, z)λ =kn∑

i=1

(z, φi)µ(z, φi)λ

=∫z(x)

kn∑

i=1

(z, φi)λφi(x)µ(dx)

=∫z(x)2µ(dx)

= ‖z‖2.

Lemma 25.3. Let $z_1\in L_2(\mu)$ and
$$
z_{n+1}=z_n-a_{n+1}\bar A_{n+1}z_n. \qquad (25.28)
$$
Then
$$
\|z_n\|\to 0. \qquad (25.29)
$$

Proof. In the first step we show that, for each starting point z1 ∈ L2(µ)the sequence ‖zn‖ is convergent. For this we write

‖zn+1‖2 = ‖zn‖2 − 2an+1(An+1zn, zn) + a2n+1‖An+1zn‖2.

On the one hand, (An+1zn, zn) =∑kn+1

i=1 (zn, φi)2 ≥ 0 and, on the otherhand, ‖An+1zn‖ ≤ kn+1‖zn‖ because of Lemma 25.1. Therefore, by∑

n a2nk

2n < ∞, the assertion follows. In the second step we show that,

for the sequence of bounded linear operators,

Bn = (I − an+1An+1) . . . (I − a2A2)

from L2(µ) into L2(µ), where I denotes the identity operator, and thesequence of norms ‖Bn‖ is bounded. We notice that for each z1 ∈ L2(µ)


the sequence Bnz1 equals zn given by recursion (25.28) and thus is boundedin L2(µ) according to the first step. Now the uniform boundedness principleyields the assertion.

Our aim is to show (25.29), i.e., ‖Bnz1‖ → 0, for each starting pointz1 ∈ L2(µ). Because

⋃j Hj is dense in L2(µ) and the sequence ‖Bn‖ is

bounded (according to the second step), so it suffices to show (25.29) foreach starting point z1 ∈ ⋃

j Hj . This will be done in the third step. Wenotice

zn+1 = (I − an+1An+1) . . . (I − a2A2)z1.

Choose j such that z1 ∈ Hj . Then (I − ajAj) . . . (I − a2A2)z1 ∈ Hj forj = 2, 3, . . .. Therefore, it suffices to prove that, for z∗ ∈ Hj ,

‖(I − an+1An+1) . . . (I − aj+1Aj+1)z∗‖ → 0.

Since∑∞

n=j a2nk

2n < ∞ (used below) it suffices to consider the case j = 1

with z∗ ∈ H1, i.e., one has to show that for z1 ∈ H1, ‖zn‖ → 0. In therecursion formula (25.28) take the norm square with respect to L2(λ) toobtain

‖zn+1‖2λ = ‖zn‖2

λ − 2an+1(An+1zn, zn)λ + a2n+1‖An+1zn‖2

λ.

From Lemma 25.1,

‖An+1zn‖2λ ≤ kn+1‖zn‖2.

Further, noticing zn ∈ Hn, we get, by Lemma 25.2,

(An+1zn, zn)λ = ‖zn‖2.

Thus

‖zn+1‖2λ ≤ ‖zn‖2

λ − 2an+1‖zn‖2 + a2n+1kn+1‖zn‖2.

According to the first step, the sequence ‖zn‖2 is bounded; thus, since∑n a

2nkn < ∞, we have

n

a2n+1kn+1‖zn‖2 < ∞.

Therefore∑

n

an+1‖zn‖2 < ∞.

This, together with∑

n an = ∞ and the convergence of ‖zn‖2 (by the firststep), yields ‖zn‖ → 0.

Proof of Theorem 25.5. We use the notation at the beginning of thissection. In the first step from the recursion formula (25.27) we obtain

mn+1 −m

= mn −m− an+1An+1(mn −m) − an+1(An+1 − An+1)(mn −m)

− an+1An+1m+ an+1bn+1.


Now take the norm squared with respect to L2(µ) and then conditionalexpectations using the independence assumption and the relation

−an+1An+1m+ an+1bn+1 = −an+1(An+1 − An+1)m+ an+1(bn+1 − bn+1).

Thus

E‖mn+1 −m‖2|Fn= ‖mn −m‖2 − 2an+1(An+1(mn −m),mn −m)

+ a2n+1‖An+1(mn −m)‖2

+ a2n+1E‖(An+1 − An+1)(mn −m) +An+1m− bn+1‖2|Fn.

Notice that, by Lemma 25.1,

(An+1(mn −m),mn −m) =kn+1∑

i=1

(mn −m,φi)2 ≥ 0.

Application of Lemma 25.1 yields, by the independence assumption,

E‖mn+1 −m‖2|Fn≤ ‖mn −m‖2 + a2

n+1k2n+1‖mn −m‖2

+ 3a2n+1E‖An+1(mn −m)‖2|Fn

+ 3a2n+1E‖An+1m‖2 + 3a2

n+1E‖bn+1‖2

≤ (1 + 4a2n+1k

2n+1)‖mn −m‖2 + 3a2

n+1k2n+1(‖m‖2 + EY 2

1 ).

Since∑

n a2nk

2n < ∞, by Theorem A.5 we obtain a.s. convergence of ‖mn −

m‖2 and convergence of E‖mn −m‖2.In the second step subtract the expectations in the recursion formula

(25.27), where the independence assumption is used. We have

mn+1 − Emn+1 = mn − Emn − an+1An+1(mn − Emn)

− an+1(An+1 − An+1)mn + an+1(bn+1 − bn+1).

Take the norm squared with respect to L2(λ) and then the expectationusing the independence assumption once more. This yields

E‖mn+1 − Emn+1‖2λ

= E‖mn − Emn‖2λ − 2an+1E(An+1(mn − Emn),mn − Emn)λ

+a2n+1E‖An+1(mn − Emn)‖2

λ

+a2n+1E‖(An+1 − An+1)mn − bn+1 + bn+1)‖2

λ.

Notice that E(An+1(mn−Emn),mn−Emn)λ = E‖mn−Emn‖2, by Lemma25.2. Lemma 25.1 shows that

E‖mn+1 − Emn+1‖2λ

≤ E‖mn − Emn‖2λ − 2an+1E‖mn − Emn‖2


+ a2n+1kn+1(E‖mn − Emn‖2 + 2E‖mn‖2 + 2EY 2

1 ).

Since E‖mn‖2 = O(1) and∑

n a2nkn < ∞, we have

n

a2n+1kn+1(E‖mn − Emn‖2 + 2E‖mn‖2 + 2EY 2

1 ) < ∞.

Thus∑

n

an+1E‖mn − Emn‖2 < ∞.

Since∑

n an = ∞, there exists an index subsequence n′ with

E‖mn′ − Emn′‖2 → 0.

In the third step take the expectations in the recursion sequence to obtain

E(mn+1 −m) = E(mn −m) − an+1An+1E(mn −m)

by use of the independence assumption and of the relation An+1m = bn+1.By Lemma 25.3,

‖E(mn −m)‖ → 0.

Finally, for the index subsequence n′ above, we obtain√

E‖mn′ −m‖2 ≤√

E‖mn′ − Emn′‖2 + ‖Emn′ −m‖ → 0

by the second and third steps. Because, according to the first step, E‖mn −m‖2 is convergent, we have

E‖mn −m‖2 → 0.

This, together with a.s. convergence of ‖mn −mn‖2, according to the firststep, yields

‖mn −m‖2 → 0 a.s.

25.6 Pointwise Universal Consistency

Let (X,Y ) be as before and again our goal is to infer the regression function.Let mn(x) be an estimate of the regression function m(x) based on thetraining sequence

(X1, Y1), . . . , (Xn, Yn).

The main focus of this book is the L2 error. As a slight detour, in thissection we consider another error criterion.


Definition 25.1. The sequence $m_n(x)$ is called strongly pointwise consistent if
$$
m_n(x)\to m(x)\quad\text{a.s.}
$$
and for almost all $x$ mod $\mu$. The sequence $m_n(x)$ is called strongly universally pointwise consistent (s.u.p.c.) if it is strongly pointwise consistent for all distributions of $(X,Y)$ with $\mathbf{E}|Y|<\infty$.

Thus for s.u.p.c. it is required that $m_n(x)$ converges to $m(x)$ with probability one and for $\mu$-almost every $x$. This is equivalent to the statement that, for all distributions of $(X,Y)$ with $\mathbf{E}|Y|<\infty$,
$$
m_n(X)\to m(X)\quad\text{a.s.}
$$
This notion of consistency is very strong, and it is not at all obvious how to construct such estimates and how to prove their consistency; for example, it is still an open question whether the standard regression estimates are s.u.p.c. This problem occurs in function estimation for additive noise, where the noise distribution has a large tail. The difficulty is caused by the fact that in the neighborhood of $x$ there are few observations in order to have strong consistency.

Consider some partitioning estimates.

Theorem 25.6. In addition to the conditions of Theorem 24.3, assume that, for all $x$,
$$
n\lambda(A_n(x))/\log n\to\infty.
$$
Then the partitioning estimate is strongly pointwise consistent for $|Y|<L$.

Proof. See Problem 25.8.

The question whether standard partitioning estimates are s.u.p.c. is still open. For semirecursive and recursive partitioning estimates we have such consistency.

Theorem 25.7. Under the conditions of Theorem 24.3 the semirecursive partitioning estimate is s.u.p.c.

For the proof of Theorem 25.7 we establish two lemmas. The first is avariant of Lemma 23.3, and the second one is purely deterministic.

Lemma 25.4. Let $K_n$ be a sequence of measurable nonnegative functions with
$$
K_n(x,z)\le K_{\max}.
$$
Assume that, for every distribution $\mu$ of $X$,
$$
\frac{\int K_n(x,z)f(z)\,\mu(dz)}{\int K_n(x,z)\,\mu(dz)}\longrightarrow f(x)\quad\text{mod }\mu \qquad (25.30)
$$


for all $\mu$-integrable functions $f$, and
$$
\sum_n\int K_n(x,z)\,\mu(dz)=\infty\quad\text{mod }\mu. \qquad (25.31)
$$
Assume further that a finite constant $c^*$ exists such that
$$
\limsup_n\ \frac{\sum_{i=1}^n Y_iK_i(x,X_i)}{1+\sum_{i=1}^n\int K_i(x,z)\,\mu(dz)}\le c^*m(x)\quad\text{a.s. mod }\mu \qquad (25.32)
$$
for all distributions $\mathbf{P}_{(X,Y)}$ with $Y\ge 0$, $\mathbf{E}Y<\infty$. Let $m_n$ be an estimate of the form
$$
m_n(x)=\frac{\sum_{i=1}^n Y_iK_i(x,X_i)}{\sum_{i=1}^n K_i(x,X_i)},
$$
where $\mathbf{E}|Y|<\infty$. Then
$$
m_n(x)\longrightarrow m(x)\quad\text{a.s. mod }\mu.
$$

Proof. See Problem 25.4.

Lemma 25.5. Let $0\le r_n\le 1$, $R_n:=r_1+\cdots+r_n$, $R_0:=0$. There is a sequence $\{p_i\}$ of integers with $p_i\uparrow\infty$ and
$$
R_{p_i}\le i+1, \qquad (25.33)
$$
$$
\sum_{j=p_i}^\infty\frac{r_j}{(1+R_j)^2}<\frac{1}{i}. \qquad (25.34)
$$

Proof. Set $R_\infty:=\lim R_n$ and $1/(1+R_\infty):=0$ if $R_\infty=\infty$. For $p\in\{2,3,\dots\}$ we have
$$
\sum_{j=p}^\infty\frac{r_j}{(1+R_j)^2}
\le\sum_{j=p}^\infty\Big(\frac{1}{1+R_{j-1}}-\frac{1}{1+R_j}\Big)
=\frac{1}{1+R_{p-1}}-\frac{1}{1+R_\infty}.
$$
Choose $p_i\in\{2,3,\dots\}$ as the first index with
$$
\frac{1}{1+R_{p_i-1}}-\frac{1}{1+R_\infty}<\frac{1}{i}.
$$
Then (25.34) holds, and by definition of $p_i$,
$$
R_{p_i-2}\le i-1\quad\text{if }p_i\ge 3,
$$
thus (25.33).


Proof of Theorem 25.7. We use Lemma 25.4 with
$$
K_n(x,t)=I_{A_n(x)}(t).
$$
Relations (25.30) and (25.31) follow from the assumptions by Lemma 24.10. It remains to verify (25.32) for $\mathbf{P}_{(X,Y)}$ with $Y\ge 0$, $\mathbf{E}Y<\infty$. We shall use a suitable truncation. According to Lemma 25.5 with $r_n=\mu(A_n(t))$ we choose indices $p_i=p(t,i)\uparrow\infty$ $(i\to\infty)$ such that (25.33), (25.34) hold for all $i$. We define for $p(t,\cdot)$ an inverse function $q(t,\cdot)$ by
$$
q(t,n):=\max\{i;\ p(t,i)\le n\},
$$
further the truncated random variables
$$
Z_i:=Y_iI_{[Y_i\le q(X_i,i)]}.
$$
It will be shown that
$$
\frac{\sum_{i=1}^n\big[Z_iI_{A_i(x)}(X_i)-\mathbf{E}Z_iI_{A_i(x)}(X_i)\big]}{1+\sum_{i=1}^n\mu(A_i(x))}\longrightarrow 0\quad\text{a.s. mod }\mu.
$$

Because of (25.31) and Theorem A.6 it suffices to show

n

EZ2nIAn(x)(Xn)

(1 +

n∑

i=1µ(Ai(x))

)2 < ∞ mod µ.

But this follows from∫ ∞∑

n=1

EZ2nIAn(x)(Xn)

(

1 +n∑

j=1µ(Aj(x))

)2µ(dx)

=∞∑

n=1

∫EZ2

n | Xn = tIAn(x)(t)(

1 +n∑

j=1µ(Aj(x))

)2 µ(dx)

µ(dt)

=∞∑

n=1

∫ ∫ q(t,n)∑

i=1

(i−1,i]

v2PY |X=t(dv)IAn(t)(x)

(

1 +n∑

j=1µ(Aj(x))

)2µ(dx)µ(dt)

=∫ ∞∑

i=1

(i−1,i]

v2PY |X=t(dv)∞∑

n=p(t,i)

µ(An(t))(

1 +n∑

j=1µ(Aj(t))

)2µ(dt)


≤∫

EY |X = tµ(dt)

= EY < ∞.

Here we obtain the third equality by noticing that for the nested sequenceof partitions the relation x ∈ An(t) and j ≤ n imply Aj(x) = Aj(t) (asin the proof of Theorem 24.3), and the inequality is obtained by use of(25.34). Further

lim supn

n∑

i=1EZiIAi(x)(Xi)

1 +n∑

i=1µ(Ai(x))

≤ limn

n∑

i=1

∫m(t)IAi(x)(t)µ(dt)

1 +n∑

i=1µ(Ai(x))

= m(x), mod µ

because of (25.30), (25.31), and the Toeplitz lemma. Thus

lim supn

n∑

i=1ZiIAi(x)(Xi)

1 +n∑

i=1µ(Ai(x))

≤ m(x) a.s. mod µ. (25.35)

In the next step we show∑

n

PZnIAn(x)(Xn) = YnIAn(x)(Xn) < ∞ mod µ. (25.36)

This follows from∫ ∞∑

n=1

PYn > q(Xn, n), Xn ∈ An(x)µ(dx)

=∫ ∞∑

n=1

∫PY > q(t, n)|X = tIAn(x)(t)µ(dt)µ(dx)

=∞∑

n=1

∫PY > q(t, n)|X = tµ(An(t))µ(dt)

≤∞∑

i=1

∫PY ∈ (i, i+ 1]|X = t

p(t,i+1)∑

n=1

µ(An(t))µ(dt)

≤ 3∫

EY |X = tµ(dt)

= 3EY < ∞by the use of (25.33). Because of (25.31),

1 +n∑

i=1

µ(Ai(x)) → ∞ mod µ. (25.37)


Relations (25.35), (25.36), and (25.37) yield (25.32). Now the assertionfollows by Lemma 25.4.

Theorem 25.8. Under the conditions of Theorem 25.3, together with $a_n/h_n^d=O(1/n)$, the recursive partitioning estimate is s.u.p.c.

Proof. Without loss of generality $Y_n\ge 0$ may be assumed. We use the notations
$$
T_n(x):=a_nK_n(x,X_n):=a_n\frac{1}{h_n^d}I_{A_n(x)}(X_n),
$$
$$
B_n(x):=\big[(1-T_2(x))\cdots(1-T_n(x))\big]^{-1},\qquad G_n(x):=T_n(x)B_n(x)\quad(n=2,3,\dots),
$$
$$
B_1(x):=1,\qquad G_1(x):=1.
$$
Representations of the following kind are well-known (compare Ljung, Pflug, and Walk (1992), Part I, Lemma 1.1):
$$
B_n(x)=\sum_{i=1}^nG_i(x), \qquad (25.38)
$$
$$
m_n(x)=B_n(x)^{-1}\sum_{i=1}^nG_i(x)Y_i. \qquad (25.39)
$$

The assumption $a_n/h_n^d=O(1/n)$ yields $\sum_n\mathbf{E}T_n(x)^2<\infty$, thus by Theorem A.6 almost sure convergence of $\sum_n(T_n(x)-\mathbf{E}T_n(x))$. Relations (25.23) and (25.24) yield $\sum_n\mathbf{E}T_n(x)=\infty$ mod $\mu$ by the second part of Lemma 24.10. Therefore
$$
\sum_nT_n(x)=\infty\quad\text{a.s. mod }\mu
$$
and thus
$$
B_n(x)\uparrow\infty\quad\text{a.s. mod }\mu. \qquad (25.40)
$$

Let $\tilde Y_n:=Y_nI_{[Y_n\le n]}$. Then, by Lemma 23.4,
$$
\sum_n\frac{1}{n^2}\mathbf{E}\tilde Y_n^2<\infty, \qquad (25.41)
$$
further,
$$
\sum_n\mathbf{P}\{\tilde Y_n\ne Y_n\}=\sum_n\mathbf{P}\{Y>n\}\le\mathbf{E}Y<\infty. \qquad (25.42)
$$
By (25.39) we can use the representation
$$
m_n(x)=m_n^{(1)}(x)+m_n^{(2)}(x)+m_n^{(3)}(x)
$$


with
$$
m_n^{(1)}(x)=\frac{1}{B_n(x)}\sum_{i=1}^nG_i(x)\Big(\tilde Y_i-\frac{\mathbf{E}\tilde Y_iK_i(x,X_i)}{\mathbf{E}K_i(x,X_i)}\Big),
$$
$$
m_n^{(2)}(x)=\frac{1}{B_n(x)}\sum_{i=1}^nG_i(x)\frac{\mathbf{E}\tilde Y_iK_i(x,X_i)}{\mathbf{E}K_i(x,X_i)},
$$
$$
m_n^{(3)}(x)=\frac{1}{B_n(x)}\sum_{i=1}^nG_i(x)(Y_i-\tilde Y_i).
$$

In the first step we show

m(1)n (x) → 0 a.s. mod µ. (25.43)

By (25.40) and the Kronecker lemma it suffices to show a.s. convergence of

∑Tn(x)

(

Yn − EYnKn(x,Xn)EKn(x,Xn)

)

.

But this follows from∑

E(Tn(x)Yn

)2< ∞,

which holds because of an/hdn = O(1/n) and (25.41).

In the second step we show

m(2)n (x) → m(x) a.s. mod µ. (25.44)

Because of (25.38), (25.40), and the Toeplitz lemma it suffices to show

EYnKn(x,Xn)EKn(x,Xn)

→ m(x) mod µ. (25.45)

Because of (24.23), via the first part of Lemma 24.10, we have

lim supn

EYnKn(x,Xn)EKn(x,Xn)

≤ limn

EY Kn(x,X)EKn(x,X)

= m(x) mod µ,

on the other side, for each c > 0,

lim infn

EYnKn(x,Xn)EKn(x,Xn)

≥ limn

EY I[y≤c]Kn(x,X)EKn(x,X)

= E(Y I[Y ≤c]|X = x) mod µ.

These relations with c → ∞ yield (25.45).In the third step we obtain

m(3)n (x) → 0 a.s. mod µ (25.46)

by (25.40) and (25.42). Now (25.43), (25.44), (25.46) yield the assertion.


For $k\ge 1$ let
$$
\tau_{k,0}(x)=0
$$
and inductively define $\tau_{k,j}(x)$ as the $j$th recurrence time of $A_k(x)$:
$$
\tau_{k,j}(x)=\inf\{t>\tau_{k,j-1}(x):X_t\in A_k(x)\}.
$$
We have that $\mu(A_k(x))>0$ mod $\mu$, so that $\tau_{k,1}(x),\tau_{k,2}(x),\dots$ are finite with probability one and for $\mu$-almost every $x$. Clearly, $\{\tau_{k,j}(X)\}_{j\ge 0}$ is an increasing sequence of stopping times adapted to the filtration $\{\mathcal{F}_t\}_{t\ge 0}$, where $\mathcal{F}_t=\sigma(X,X_1,\dots,X_t)$.

Given an integer sequence $\{J_k\}_{k\ge 1}$ such that $J_k\uparrow\infty$, we define the modified estimates
$$
m_k(x)=\frac{1}{J_k}\sum_{j=1}^{J_k}Y_{\tau_{k,j}(x)}.
$$
This estimate is well-defined since $\tau_{k,j}(x)$ is finite with probability one and for $\mu$-almost every $x$. Unfortunately, the sample size required for evaluating $m_k(x)$ is random, namely $\tau_{k,J_k}(x)$.

To get a modified estimate of fixed sample size, we set
$$
k_n(x)=\max\{k:\tau_{k,J_k}(x)\le n\},\qquad m_n(x)=m_{k_n(x)}(x).
$$
The strong universal pointwise consistency of $m_k(x)$ implies that of $m_n(x)$, since $m_n(x)$ is a subsequence of $m_k(x)$. The estimate $m_n(x)$ can be interpreted as a standard partitioning estimate with data-dependent partitioning.
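The fixed-sample-size version amounts to a small search over resolutions: for each $k$, collect the first $J_k$ visits of the sample to the cell $A_k(x)$ and average their labels, and keep the finest $k$ whose $J_k$-th visit occurs by time $n$. The sketch below uses cubic cells of side $h_k$ and caller-supplied $J_k$ purely as an illustration; dyadic cells with, say, $J_k=k$ would be one way to meet the nestedness and $J_k\uparrow\infty$ requirements.

```python
import numpy as np

def modified_partition_estimate(X, Y, x, n, J_of_k, h_of_k, max_k=64):
    """Modified partitioning estimate of Section 25.6 (fixed sample size n):
    average the labels at the first J_k recurrence times of A_k(x) and return
    the value for the largest k with tau_{k, J_k}(x) <= n."""
    X, Y = np.atleast_2d(X)[:n], np.asarray(Y, float)[:n]
    estimate = 0.0                       # returned if even k = 1 has too few visits
    for k in range(1, max_k + 1):
        h, J = h_of_k(k), J_of_k(k)
        visits = np.flatnonzero(
            np.all(np.floor(X / h) == np.floor(np.asarray(x) / h), axis=1))
        if len(visits) < J:              # tau_{k, J_k}(x) > n: stop refining
            break
        estimate = Y[visits[:J]].mean()  # mean of Y at the first J_k recurrences
    return estimate

# Illustrative sequences: J_of_k = lambda k: k, h_of_k = lambda k: 2.0 ** (-k)
```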

The s.u.p.c. of modified partitioning estimates is open. Only their truncated versions are guaranteed to be s.u.p.c.

For any real number $y$ and any truncation level $G\ge 0$ we define the truncated value
$$
y^{(G)}=y\,1_{\{|y|\le G\}}.
$$
Given integers $J_k$ such that $J_k\uparrow\infty$ and truncation levels $G_j$ such that $G_j\uparrow\infty$, we define the modified truncated partitioning estimates
$$
m_k^{\mathcal P}(x)=\frac{1}{J_k}\sum_{j=1}^{J_k}Y^{(G_j)}_{\tau_{k,j}(x)}.
$$
Let $j\to K_j$ denote the inverse of the map $k\to J_k$ (so that $j\le J_k$ iff $K_j\le k$), and let
$$
R_j=\sum_{k\ge K_j}\frac{1}{J_k^2}.
$$


Assume that $J_k\uparrow\infty$ and $G_j\uparrow\infty$ in such a way that
$$
M=\sup_i\ G_{i+1}\sum_{j\ge i+1}R_j<\infty. \qquad (25.47)
$$
Theorem 25.9. Let $\{\mathcal P_k\}_{k\ge 1}$ be a nested sequence of partitions that asymptotically generate the Borel $\sigma$-field. If $J_k\uparrow\infty$ and $G_j\uparrow\infty$ in such a way that (25.47) holds, then the modified truncated partitioning estimate $m_k^{\mathcal P}(x)$ is s.u.p.c.

Let
$$
\tau_0^{\mathcal P}(x)=0
$$
and inductively define
$$
\tau_j^{\mathcal P}(x)=\inf\{t>\tau_{j-1}^{\mathcal P}(x):X_t\in A_j(x)\}.
$$
We may introduce the modified recursive partitioning estimate
$$
m_k^{*\mathcal P}(x)=\frac{1}{k}\sum_{j=1}^kY_{\tau_j^{\mathcal P}(x)}.
$$
Theorem 25.10. Suppose the partitions $\mathcal P_k$ are nested and asymptotically generate the Borel $\sigma$-field. Then the modified recursive partitioning estimate $m_k^{*\mathcal P}(x)$ is s.u.p.c.

Let us turn to the kernel estimates.

Theorem 25.11. For the naive kernel assume that
$$
h_n\to 0\quad\text{and}\quad nh_n^d/\log n\to\infty.
$$
Then the kernel estimate is strongly pointwise consistent for $|Y|<L$.

Proof. See Problem 25.9.

It is unknown whether the standard kernel estimate is s.u.p.c. However, it is known that truncation of the response variables yields estimates that are s.u.p.c.

Theorem 25.12. Put
$$
h_n=Cn^{-\delta},\qquad 0<\delta<1/d,
$$
and
$$
G_n=n^{1-\delta d}.
$$
Then the truncated kernel estimate
$$
m_n(x)=\frac{\sum_{i=1}^nY_i^{(G_n)}K\big(\frac{x-X_i}{h_n}\big)}{\sum_{i=1}^nK\big(\frac{x-X_i}{h_n}\big)}
$$


with naive kernel is s.u.p.c.
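A direct implementation of this truncated estimate with the naive kernel is short. In the sketch below the constants $C$ and $\delta$ are arbitrary admissible choices ($\delta$ must satisfy $0<\delta<1/d$), and the empty-window value $0$ is our own convention.

```python
import numpy as np

def truncated_kernel_estimate(X, Y, x, C=1.0, delta=0.3):
    """Truncated kernel estimate of Theorem 25.12 with the naive kernel:
    h_n = C n^{-delta}, truncation level G_n = n^{1 - delta d},
    m_n(x) = sum_i Y_i^{(G_n)} K((x-X_i)/h_n) / sum_i K((x-X_i)/h_n)."""
    X, Y = np.atleast_2d(X), np.asarray(Y, float)
    n, d = X.shape
    h = C * n ** (-delta)                       # requires 0 < delta < 1/d
    G = n ** (1.0 - delta * d)
    Y_trunc = np.where(np.abs(Y) <= G, Y, 0.0)  # Y_i^{(G_n)}
    in_window = np.linalg.norm(X - np.asarray(x), axis=1) <= h
    return Y_trunc[in_window].mean() if in_window.any() else 0.0  # 0/0 := 0
```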

Again the semirecursive and recursive estimates are easier.

Theorem 25.13. Under the conditions concerning (24.18) of Theorem 24.2 the semirecursive kernel estimate is s.u.p.c.

Theorem 25.14. Under the conditions of Theorem 25.2, together with $K(x)\ge cH(\|x\|)$ for some $c>0$ and $a_n/h_n^d=O(1/n)$, the recursive kernel estimate is s.u.p.c.

Given $B=S_1(0)$ and a bandwidth sequence $\{h_k\}_{k\ge 1}$ such that $h_k\downarrow 0$, let
$$
B_k(x)=x+h_kB.
$$
Let
$$
\tau_{k,0}^B(x)=0
$$
and inductively define
$$
\tau_{k,j}^B(x)=\inf\{t>\tau_{k,j-1}^B(x):X_t\in B_k(x)\}.
$$
If $h_k>0$ then $\mu(B_k(x))>0$ mod $\mu$, so all $\tau_{k,j}^B(x)$ are finite and $Y_{\tau_{k,1}^B(x)}$ is well-defined. We define the modified truncated kernel estimate as follows:
$$
m_k^B(x)=\frac{1}{J_k}\sum_{j=1}^{J_k}Y^{(G_j)}_{\tau_{k,j}^B(x)}.
$$
Yakowitz (1993) defined a modified kernel estimate of fixed sample size, which can be viewed as the modified kernel estimate without truncation and with $J_k=k$. Yakowitz called it the r-nearest-neighbor estimate, since it is closer in spirit to nearest neighbor estimates than to kernel estimates.

Theorem 25.15. Let $\{h_k\}_{k\ge 1}$ be a bandwidth sequence such that $h_k\to 0$. If $J_k\uparrow\infty$ and $G_j\uparrow\infty$ in such a way that (25.47) holds, then the modified truncated kernel estimate $m_k^B(x)$ is s.u.p.c.

Let
$$
\tau_0^B(x)=0
$$
and inductively define
$$
\tau_j^B(x)=\inf\{t>\tau_{j-1}^B(x):X_t\in B_j(x)\}.
$$
Then $\mu(B_k(x))>0$ mod $\mu$ and all $\tau_j^B(x)$ are finite with probability one by Poincare's recurrence theorem. Thus one may define the modified recursive kernel estimates
$$
m_k^{*B}(x)=\frac{1}{k}\sum_{j=1}^kY_{\tau_j^B(x)}.
$$


Theorem 25.16. Suppose $\{h_k\}_{k\ge 1}$ is a bandwidth sequence such that $h_k\to 0$ as $k\to\infty$. Then the modified recursive kernel estimate $m_k^{*B}(x)$ is s.u.p.c.

Consider some nearest neighbor estimates.

Theorem 25.17. If
$$
k_n/\log n\to\infty\quad\text{and}\quad k_n/n\to 0,
$$
then the nearest neighbor estimate is strongly pointwise consistent for $|Y|<L$.

Proof. See Problem 25.10.

It is unknown whether the standard nearest neighbor estimate is s.u.p.c. Let $\{\ell_k\}_{k\ge 1}$ be a sequence of positive integers such that
$$
\ell_k\uparrow\infty\quad\text{as }k\to\infty.
$$
For each $k\ge 1$ one may subdivide the data into successive segments of length $\ell_k$. Let $X_{k,j}(x)$ denote the nearest neighbor of $x$ among the observations from the $j$th segment $X_{(j-1)\ell_k+1},\dots,X_{j\ell_k}$, and let $Y_{k,j}(x)$ denote the corresponding label of $X_{k,j}(x)$.

The modified truncated nearest neighbor estimate is defined by
$$
m_k^{NN}(x)=\frac{1}{J_k}\sum_{j=1}^{J_k}Y_{k,j}^{(G_j)}(x).
$$
Theorem 25.18. Let $\{\ell_k\}_{k\ge 1}$ be a deterministic sequence of integers such that $\ell_k\to\infty$. If $J_k\uparrow\infty$ and $G_j\uparrow\infty$ in such a way that (25.47) holds, then the modified truncated nearest neighbor estimate $m_k^{NN}(x)$ is s.u.p.c.

Given a sequence $\ell_1,\ell_2,\dots$ of positive integers, we split the data sequence $(X_1,Y_1),(X_2,Y_2),\dots$ into disjoint blocks with lengths $\ell_1,\dots,\ell_k$ and find the nearest neighbor of $x$ in each block. Let $X_j^*(x)$ denote the nearest neighbor of $x$ from the $j$th block (ties are broken by selecting the nearest neighbor with the lowest index), and let $Y_j^*(x)$ denote the corresponding label. The recursive nearest neighbor estimate is defined as
$$
m_n^{NN}(x)=\frac{1}{k}\sum_{j=1}^kY_j^*(x)\qquad\text{if }\sum_{j=1}^k\ell_j\le n<\sum_{j=1}^{k+1}\ell_j.
$$
Theorem 25.19. If $\ell_k\to\infty$, then the recursive nearest neighbor estimate is s.u.p.c.


25.7 Bibliographic Notes

Theorems 25.1, 25.2, and 25.5 are due to Gyorfi and Walk (1997; 1996). The estimator with (25.16) was introduced and investigated by Revesz (1973). Under regularity conditions on m and under the condition that the real X has a density, Revesz (1973) proved a large deviation theorem. Gyorfi (1981) proved the weak universal consistency in the multidimensional case for a kernel of compact support. The weak universal consistency of the recursive NN estimate has been proved by Devroye and Wise (1980). Theorems 25.7, 25.8, 25.13, and 25.14 have been proved by Walk (2001). Stronger versions of Theorems 25.11 and 25.17 can be found in Devroye (1982b; 1981). Theorems 25.9, 25.15, 25.16, 25.18, and 25.19 are in Algoet and Gyorfi (1999). Theorem 25.10 is due to Algoet (1999). Kozek, Leslie, and Schuster (1998) proved Theorem 25.12.

Problems and Exercises

Problem 25.1. Prove Theorem 25.2.Hint: Apply Theorem 25.1 and Lemma 24.6.

Problem 25.2. Prove Theorem 25.3.Hint: Apply Theorem 25.1 and Lemma 24.10.

Problem 25.3. Prove Theorem 25.3 for d = 1, where each partition consists ofnonaccumulating intervals, without the condition that the sequence of partitionsis nested.

Hint: Compare Problem 24.6.

Problem 25.4. Prove Lemma 25.4.Hint: Use the truncation YL = Y IY ≤L + LIY >L for Y ≥ 0 and notice for

mL(x) = EYL|X = x that for ε > 0 an integer L0(x) exists with

|mL(x) − m(x)| < ε

for all L > L0(x) and for µ-almost all x.

Problem 25.5. Prove Theorem 25.13 for a naive kernel.Hint: Argue according to the proof of Theorem 25.7, but use the covering

argument of Theorem 24.2 and use Lemmas 24.5 and 24.6.

Problem 25.6. Prove Theorem 25.14 for a naive kernel K.Hint: Argue according to the proof of Theorem 25.8, but with Kn(x, z) =

1hd

nK(

x−zhn

), and use Lemmas 24.5 and 24.6.

Problem 25.7. Prove Theorem 25.19.

Problem 25.8. Prove Theorem 25.6.Hint: Put

mn(x) =

∫An(x)

m(z)µ(dz)

µ(An(x)),


then, by Lemma 24.10,

mn(x) → m(x) mod µ.

Moreover,

mn(x) =

∑n

i=1YiIXi∈An(x)

nµ(An(x))∑n

i=1IXi∈An(x)

nµ(An(x))

,

therefore, it is enough to show that∑n

i=1 YiIXi∈An(x)nµ(An(x))

− mn(x) → 0 a.s. mod µ

By Bernstein’s inequality

P

∣∣∣∣

∑n

i=1 YiIXi∈An(x)nµ(An(x))

− mn(x)

∣∣∣∣ > ε

≤ 2e−nε2

µ(An(x))2

2Var(Y1IX1∈An(x))+4Lεµ(An(x))/3

≤ 2e−nε2

µ(An(x))2

2L2µ(An(x))+4Lεµ(An(x))/3

≤ 2e−nε2

µ(An(x))λ(An(x))

λ(An(x))

2L2+4Lε/3 ,

which is summable because of the condition and because of

lim infn→∞

µ(An(x))λ(An(x))

> 0 mod µ

(cf. Lemma 24.10).

Problem 25.9. Prove Theorem 25.11.Hint: Proceed as for Problem 25.8 such that we refer to Lemmas 24.5 and

24.6.

Problem 25.10. Prove Theorem 25.17.Hint: Put

mR(x) =

∫Sx,R

m(z)µ(dz)

µ(Sx,R),

then, by Lemma 24.6,

limR→0

mR(x) = m(x) mod µ.

Given ‖x − X(kn,n)(x)‖ = R, the distribution of

(X(1,n)(x), Y(1,n)(x)), . . . , (X(kn,n)(x), Y(kn,n)(x))

is the same as the distribution of the nearest neighbor permutation of the i.i.d.

(X∗1 (x), Y ∗

1 (x)), . . . , (X∗kn

(x), Y ∗kn

(x)),

where

PX∗i (x) ∈ A, Y ∗

i (x) ∈ B = PXi ∈ A, Yi ∈ B|Xi ∈ Sx,R,


therefore,

EY ∗i (x) = EYi|Xi ∈ Sx,R = mR(x).

Thus, by Hoeffding’s inequality,

P

∣∣∣∣∣

1kn

kn∑

i=1

Y(i,n)(x) − mR(x)

∣∣∣∣∣> ε

∣∣∣‖x − X(kn,n)(x)‖ = R

= P

∣∣∣∣∣

1kn

kn∑

i=1

(Y ∗i (x) − EY ∗

i (x))

∣∣∣∣∣> ε

≤ 2e− knε2

2L2 ,

which is summable because of kn/ log n → ∞.


26
Censored Observations

26.1 Right Censoring Regression Models

This chapter deals with nonparametric regression analysis in the presence of randomly right censored data.

Let $Y$ be a nonnegative random variable representing the survival time of an individual or subject taking part in a medical or other experimental study, and let $X=(X^{(1)},\dots,X^{(d)})$ be a random vector of covariates, e.g., the medical file of a patient, jointly distributed with $Y$. In the model of right censoring the survival time $Y_i$ is subject to right censoring so that the observable random variables are given by $X_i$, $Z_i=\min(Y_i,C_i)$, and $\delta_i=I_{\{Y_i\le C_i\}}$. Here $C_i$ is a nonnegative random variable, representing the censoring time, which could be the subject's time to withdrawal or the time until the end of the study.

It is well-known that in medical studies the observation on the survivaltime of a patient is often incomplete due to right censoring. Classical ex-amples of the causes of this type of censoring are that the patient was aliveat the termination of the study, that the patient withdrew alive during thestudy, or that the patient died from other causes than those under study.Another example of the same model we would like to offer here is in theworld of medical insurance. Suppose an insurance company sets its variousinsurance rates on a basis of the lifetimes Yi of its clients having a spe-cific disorder (heart disease, diabetes, etc.). In addition, for each client avector Xi of additional measurements is available from medical examina-tions. However, for several reasons, a patient may stop the contract with


the insurance company at time Ci (due to lack of money or because ofmore favorable conditions elsewhere), so that the actual file of the patientwith the insurance company is incomplete in the above sense: it does notcontain Yi, it contains only Ci with the indication that it is not the reallife time.

The distribution of the observable random vector (X,Z, δ) does notidentify the conditional distribution of Y given X. The problem of identi-fiability can be resolved by imposing additional conditions on the possibledistributions of (X,Y,C).

Model A: Assume that C and (X,Y ) are independent.Model B: Assume that Y and C are conditionally independent given X.

Under these conditions we show consistent regression function estimateswhich imply that the distribution of (X,Z, δ) identifies the distribution of(X,Y ).

Model A is plausible in all situations where the censoring is caused by ex-traneous circumstances, not related to any characteristics of the individual.This can clearly be argued to be the case in the medical insurance exampledescribed above, as well as in many classical survival studies. One couldthink, e.g., of volunteers participating in a medical study, but for reasonsnot related to the data contained in (X,Y ) they withdraw prematurelyfrom the study, such as because of lack of enthusiasm. Or the censoringmay be caused by the (random) termination of the study, which can beassumed to occur independently of the persons participating in the study.

Obviously Model A is a special case of Model B, therefore the estimationproblem for Model A is easier such that for Model A the estimates aresimpler and the rate of convergence might be better. However, in mostpractical problems only the conditions of Model B are satisfied.

26.2 Survival Analysis, the Kaplan-Meier Estimate

In this section we present some results on distribution estimation for cen-sored observation when there is no observation vector X. Thus let us firstconsider the case that there are no covariates.

Assume that $Y$ and $C$ are independent. We observe
$$
(Z_1,\delta_1),\dots,(Z_n,\delta_n),
$$
where $\delta_i=I_{\{Y_i\le C_i\}}$ and $Z_i=\min(Y_i,C_i)$. Introduce
$$
F(t)=\mathbf{P}\{Y>t\},\qquad G(t)=\mathbf{P}\{C>t\},\qquad K(t)=\mathbf{P}\{Z>t\}=F(t)G(t).
$$


In survival analysis such tail distribution functions are called survivalfunctions.

Define
$$
T_F=\sup\{y:F(y)>0\},\qquad T_G=\sup\{y:G(y)>0\},\qquad T_K=\sup\{y:K(y)>0\}=\min\{T_F,T_G\}.
$$
The main subject of survival analysis is to estimate $F$ and $G$. Kaplan and Meier (1958) invented such estimates, which are often called product-limit estimates. Let $F_n$ and $G_n$ be the Kaplan-Meier estimates of $F$ and $G$, respectively, which are defined as
$$
F_n(t)=\begin{cases}\prod_{i:Z_{(i)}\le t}\big(\tfrac{n-i}{n-i+1}\big)^{\delta_{(i)}}&\text{if }t\le Z_{(n)},\\[2pt] 0&\text{otherwise},\end{cases} \qquad (26.1)
$$
and
$$
G_n(t)=\begin{cases}\prod_{i:Z_{(i)}\le t}\big(\tfrac{n-i}{n-i+1}\big)^{1-\delta_{(i)}}&\text{if }t\le Z_{(n)},\\[2pt] 0&\text{otherwise},\end{cases} \qquad (26.2)
$$
where $(Z_{(i)},\delta_{(i)})$ $(i=1,\dots,n)$ are the $n$ pairs of observed $(Z_i,\delta_i)$ ordered on the $Z_{(i)}$, i.e.,
$$
Z_{(1)}\le Z_{(2)}\le\cdots\le Z_{(n)}:=T_{K_n}. \qquad (26.3)
$$

Note that since $F$ is arbitrary, some of the $Z_i$ may be identical. In this case the ordering of the $Z_i$'s into $Z_{(i)}$'s is not unique. However, it is easy to see that the Kaplan-Meier estimator is unique. We can observe that $F_n(t)$ has jumps at uncensored sample points. Similarly, $G_n(t)$ has jumps at censored sample points. It is not at all obvious why they should be consistent. The first interpretation in this respect is due to Kaplan and Meier (1958) showing that the estimates are of maximum likelihood type.

Efron (1967) introduced another interpretation. He gave a computationally simple algorithm for calculating $F_n(t)$. It is a recursive rule working from left to right over the $Z_{(i)}$. Place probability mass $1/n$ at each of the points $Z_{(i)}$, that is, construct the conventional empirical distribution. If $Z_{(1)}$ is not censored ($\delta_{(1)}=1$) then keep its mass, otherwise remove its mass and redistribute it equally among the other points. Continue this procedure in a recursive way. Suppose that we have already considered the first $i-1$ points. If $Z_{(i)}$ is not censored ($\delta_{(i)}=1$) then keep its current mass, otherwise remove its mass and redistribute it equally among the $n-i$ points to the right of it, $Z_{(i+1)},\dots,Z_{(n)}$. Make these steps $n-1$ times, so $F_n(t)=0$ for $t>Z_{(n)}$. It is easy to check that Efron's redistribution algorithm results in the same Kaplan-Meier estimate (cf. Problem 26.1).


Table 26.1. Efron's redistribution algorithm for F6(t).

            Z(1)   Z(2)   Z(3)   Z(4)   Z(5)   Z(6)
  start     1/6    1/6    1/6    1/6    1/6    1/6
  step 1    1/6    1/6    1/6    1/6    1/6    1/6
  step 2    1/6    0      5/24   5/24   5/24   5/24
  step 3    1/6    0      0      5/18   5/18   5/18
  step 4    1/6    0      0      5/18   5/18   5/18
  step 5    1/6    0      0      5/18   5/18   5/18

[Figure 26.1. Illustration for F6(t); axes: t (horizontal, marked at Z(1), ..., Z(6)) and Fn(t) (vertical, up to 1).]

Example. For the sake of illustration consider the following example:
$$
\{(Z_i,\delta_i)\}_{i=1}^6=\{(5,1),(2,0),(6,1),(1,1),(3,0),(7,1)\}.
$$
Then
$$
\{(Z_{(i)},\delta_{(i)})\}_{i=1}^6=\{(1,1),(2,0),(3,0),(5,1),(6,1),(7,1)\}.
$$
In Table 26.1 we describe the steps for calculating $F_6(t)$. Initially all sample points have mass $1/6$. $Z_{(1)}$ is not censored so there is no change in Step 1. $Z_{(2)}$ is censored, so its mass is redistributed among $Z_{(3)},Z_{(4)},Z_{(5)},Z_{(6)}$. $Z_{(3)}$ is censored, too; this results in the next row. Since the last three sample points are not censored, there are no changes in the final three steps. From this final row we can generate $F_6(t)$ (Fig. 26.1).

Table 26.2 and Fig. 26.2 show the same for $G_6(t)$.
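The product-limit formulas (26.1) and (26.2) can be checked numerically on this six-point example. The following sketch (the function name and vectorized layout are ours) evaluates $F_n$ and $G_n$ at the ordered observations and reproduces the values that can be read off Tables 26.1 and 26.2.

```python
import numpy as np

def kaplan_meier(Z, delta, censoring=False):
    """Product-limit estimate (26.1) of F, or (26.2) of G when censoring=True,
    evaluated at the ordered observations Z_(1) <= ... <= Z_(n)."""
    order = np.argsort(Z, kind="stable")
    Z, delta = np.asarray(Z, float)[order], np.asarray(delta)[order]
    n = len(Z)
    i = np.arange(1, n + 1)
    expo = (1 - delta) if censoring else delta
    factors = ((n - i) / (n - i + 1.0)) ** expo
    return Z, np.cumprod(factors)              # survival value at each Z_(i)

Z, delta = [5, 2, 6, 1, 3, 7], [1, 0, 1, 1, 0, 1]
print(kaplan_meier(Z, delta))                  # F_6: 5/6, 5/6, 5/6, 5/9, 5/18, 0
print(kaplan_meier(Z, delta, censoring=True))  # G_6: 1, 4/5, 3/5, 3/5, 3/5, 3/5
```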

Peterson (1977) gave a third interpretation utilizing the fact that the survival functions $F(t)$ and $G(t)$ are functions of the subsurvival functions
$$
F^*(t)=\mathbf{P}\{Y>t,\ \delta=1\}
$$
and
$$
G^*(t)=\mathbf{P}\{C>t,\ \delta=0\},
$$


Table 26.2. Efron's redistribution algorithm for G6(t).

            Z(1)   Z(2)   Z(3)   Z(4)   Z(5)   Z(6)
  start     1/6    1/6    1/6    1/6    1/6    1/6
  step 1    0      1/5    1/5    1/5    1/5    1/5
  step 2    0      1/5    1/5    1/5    1/5    1/5
  step 3    0      1/5    1/5    1/5    1/5    1/5
  step 4    0      1/5    1/5    0      3/10   3/10
  step 5    0      1/5    1/5    0      0      3/5

[Figure 26.2. Illustration for G6(t); axes: t (marked at Z(1), ..., Z(6)) and Gn(t) (up to 1).]

and the subsurvival functions can be estimated by averages. This observation led to the following strong consistency theorem:

Theorem 26.1. Assume that $F$ and $G$ have no common jumps. Then, for all $t<T_K$,
$$
F_n(t)\longrightarrow F(t)\quad\text{a.s.}
$$
as $n\to\infty$ and
$$
G_n(t)\to G(t)\quad\text{a.s.}
$$
as $n\to\infty$.

Proof. We prove pointwise convergence of $G_n(t)$ to $G(t)$ for all $t<T_K$. The case of $F_n(t)$ is similar. For a function $S$ let $D(S)$ denote the set of the jump points of $S$. Then any tail distribution (survival) function can be expressed as
$$
S(t)=\exp\left(\int_{[0,t]\setminus D(S)}\frac{dS(s)}{S(s)}\right)\cdot\prod_{\substack{s\in D(S)\\ s\le t}}\frac{S(s)}{S(s-)} \qquad (26.4)
$$


(see Cox (1972)), where S(s−) = limt↑s S(t). Thus

G(t) = exp

(∫

[0,t]\D(G)

dG(s)G(s)

)

·∏

s∈D(G)s≤t

G(s)G(s−)

(26.5)

and

Gn(t) =∏

s∈D(Gn)s≤t

Gn(s)Gn(s−)

since Gn(·) is a discrete survival function. The subsurvival function G∗(·)can be expressed in terms of the survival functions G(·) and F (·)

G∗(t) = PC > t, δ = 0= PC > t, Y > C= EPC > t, Y > C|C= E

IC>tF (C)

(because of independence of Y and C)

= −∫ ∞

t

F (s) dG(s). (26.6)

Note that

G∗(t) + F ∗(t) = PZ > t = K(t)

= Pmin(Y,C) > t = G(t) · F (t). (26.7)

For real-valued functions H1 and H2 define a function φ by

φ(H1, H2, t)

= exp

(∫

[0,t]\D(H1)

dH1(s)H1(s) +H2(s)

)

·∏

s∈D(H1)s≤t

H1(s) +H2(s)H1(s−) +H2(s−)

.

We will show that

G(t) = φ(G∗(·), F ∗(·), t). (26.8)

Because of (26.6) G∗ has the same jump points as G. Thus (26.8) followsfrom (26.5), since

dG∗(s)G∗(s) + F ∗(s)

=dG∗(s)

G(s) · F (s)(from (26.7))

=F (s) · dG(s)F (s) ·G(s)

(from (26.6))


=dG(s)G(s)

andG∗(s) + F ∗(s)

G∗(s−) + F ∗(s−)=

G(s) · F (s)G(s−) · F (s−)

(from (26.7))

=G(s)G(s−)

if s is a jump point ofG(·), because by assumptionG(·) and F (·) do not havecommon jumps, thus s cannot be a jump point of F (·), so F (s) = F (s−).Now we express the Kaplan-Meier estimator in terms of the empiricalsubsurvival functions

G∗n(t) =

1n

n∑

i=1

IZi>t,δi=0

and

F ∗n(t) =

1n

n∑

i=1

IZi>t,δi=1.

First note that (26.7) is also true for the estimates, since

G∗n(t) + F ∗

n(t) =1n

n∑

i=1

IZi>t,δi=0 +1n

n∑

i=1

IZi>t,δi=1

=1n

n∑

i=1

IZi>t

=: Kn(t)

and

Gn(t) · Fn(t) =∏

Z(i)≤t

(n− i

n− i+ 1

)1−δ(i)

·∏

Z(i)≤t

(n− i

n− i+ 1

)δ(i)

=∏

Z(i)≤t

(n− i

n− i+ 1

)

=n− 1n

· n− 2n− 1

· · · n− ∑ni=1 IZi≤t

n− ∑ni=1 IZi≤t + 1

=1n

n∑

i=1

IZi>t


= Kn(t).

Therefore,

G∗n(s) + F ∗

n(s)G∗

n(s−) + F ∗n(s−)

=Gn(s) · Fn(s)

Gn(s−) · Fn(s−)=

Gn(s)Gn(s−)

if s is a jump point of Gn(·), because by assumption Gn(·) and Fn(·) donot have common jumps a.s., thus Fn(s) = Fn(s−). Hence

Gn(t) = φ(G∗n(·), F ∗

n(·), t)for t ≤ Z(n). The empirical subsurvival functions G∗

n(t) and F ∗n(t) converge

almost surely uniformly in t (Glivenko-Cantelli theorem). It is easy to showthat

Z(n) = TKn → TK a.s. as n → ∞. (26.9)

The almost sure convergence of Gn(·) to G(·) then follows from the factthat φ(H1(·), H2(·), t) is a continuous function of the argument H1(·) andH2(·), where the metric on the space of bounded functions is the supremumnorm.

The regression function is a conditional expectation, therefore we now consider the problem of estimating an expectation from censored observations. If the $Y_i$ are i.i.d. and available, then the arithmetic mean
$$
\frac{1}{n}\sum_{i=1}^nY_i
$$
is the usual estimate of $\mathbf{E}Y$. If instead of the $Y_i$ only the $\{Z_i,\delta_i\}$ are available and, in addition, the censoring distribution $G$ is known, then an unbiased estimate of $\mathbf{E}Y$ is
$$
\frac{1}{n}\sum_{i=1}^n\frac{\delta_iZ_i}{G(Z_i)},
$$
as can easily be verified by using properties of conditional expectation and the assumed independence of $Y$ and $C$:

$$
\mathbf{E}\Big\{\frac{\delta Z}{G(Z)}\,\Big|\,X\Big\}
=\mathbf{E}\Big\{\frac{I_{\{Y\le C\}}\min(Y,C)}{G(\min(Y,C))}\,\Big|\,X\Big\}
=\mathbf{E}\Big\{I_{\{Y\le C\}}\frac{Y}{G(Y)}\,\Big|\,X\Big\}
=\mathbf{E}\Big\{\mathbf{E}\Big\{I_{\{Y\le C\}}\frac{Y}{G(Y)}\,\Big|\,X,Y\Big\}\,\Big|\,X\Big\}
=\mathbf{E}\Big\{G(Y)\frac{Y}{G(Y)}\,\Big|\,X\Big\}
=\mathbf{E}\{Y\mid X\}
$$


= m(X). (26.10)

If $G$ is unknown then $\mathbf{E}Y$ can be estimated by the so-called Kaplan-Meier mean
$$
M_n:=\frac{1}{n}\sum_{j=1}^n\frac{\delta_jZ_j}{G_n(Z_j)}
=\frac{1}{n}\sum_{j=1}^n\frac{\delta_jY_j}{G_n(Y_j)}
=\int_0^{+\infty}F_n(y)\,dy \qquad (26.11)
$$
$$
=\int_0^{T_F}F_n(y)\,dy,
$$
where (26.11) follows from an identity in Susarla, Tsai, and Van Ryzin (1984) (cf. Problem 26.2). Theorem 26.1 already implies that, for fair censoring ($T_F\le T_G$),
$$
M_n\to\mathbf{E}Y\quad\text{a.s.}
$$
(cf. Problem 26.3).
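As a quick numerical illustration of the identity (26.11), the Kaplan-Meier mean of the six-point example of Section 26.2 can be computed directly. The helper below is self-contained and assumes fair censoring, so that $G_n(Z_j)>0$ at every uncensored $Z_j$.

```python
import numpy as np

def kaplan_meier_mean(Z, delta):
    """Kaplan-Meier mean M_n of (26.11): (1/n) sum_j delta_j Z_j / G_n(Z_j),
    with G_n the product-limit estimate (26.2) of the censoring survival
    function; assumes G_n(Z_j) > 0 whenever delta_j = 1."""
    order = np.argsort(Z, kind="stable")
    Z, delta = np.asarray(Z, float)[order], np.asarray(delta)[order]
    n = len(Z)
    i = np.arange(1, n + 1)
    G = np.cumprod(((n - i) / (n - i + 1.0)) ** (1 - delta))   # G_n(Z_(j))
    terms = np.zeros(n)
    terms[delta == 1] = Z[delta == 1] / G[delta == 1]
    return terms.mean()

# For the example of Section 26.2 this returns 31/6, the integral of F_6:
print(kaplan_meier_mean([5, 2, 6, 1, 3, 7], [1, 0, 1, 1, 0, 1]))   # 5.1666...
```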

26.3 Regression Estimation for Model A

The regression function $m(x)$ must be estimated from the censored data, which is an i.i.d. sequence of random variables:
$$
(X_1,Z_1,\delta_1),\dots,(X_n,Z_n,\delta_n),
$$
where $\delta_i=I_{\{Y_i\le C_i\}}$ and $Z_i=\min(Y_i,C_i)$.

Throughout this section we rely on the following assumptions:
(i) $(X,Y)$ and $C$ are independent;
(ii) $F$ and $G$ have no common jumps;
(iii) $T_F<\infty$; and
(iv) $G$ is continuous in $T_K$ and $G(T_F)>0$.
The last condition implies that $T_F<T_G$ and hence $T_K=T_F$ by definition of $T_K$.

We define local averaging estimates for censored data by
$$
m_{n,1}(x)=\sum_{i=1}^nW_{ni}(x)\cdot\frac{\delta_iZ_i}{G_n(Z_i)}. \qquad (26.12)
$$
According to such an estimate we calculate first $G_n(t)$ from the sample $(Z_1,\delta_1),\dots,(Z_n,\delta_n)$ and then proceed as in (26.12). The fact $0\le Y\le T_F<\infty$ a.s. implies that $0\le m(x)\le T_F$ $(x\in\mathbb{R}^d)$. In order to ensure that the estimates are bounded in the same way, set
$$
m_n(x)=\min\big(T_{K_n},\,m_{n,1}(x)\big) \qquad (26.13)
$$


with $T_{K_n}=Z_{(n)}$. Observe $0\le T_{K_n}\le T_K=T_F<\infty$.

Our first result concerns the strong consistency of local averaging estimates for censored data. In the censored case, the partitioning estimate is
$$
m_n^{(part)}(x)=\min\left(T_{K_n},\ \frac{\sum_{i=1}^n\frac{\delta_iZ_i}{G_n(Z_i)}I_{\{X_i\in A_n(x)\}}}{\sum_{i=1}^nI_{\{X_i\in A_n(x)\}}}\right). \qquad (26.14)
$$
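A compact sketch of (26.14) with cubic cells of side $h$ follows; the factor $G_n$ is recomputed from (26.2), while the empty-cell value $0$ and the guard against a vanishing $G_n$ are our own conventions for keeping the sketch well-defined.

```python
import numpy as np

def censored_partition_estimate(X, Z, delta, x, h):
    """Censored-data partitioning estimate (26.14): the unobservable Y_i are
    replaced by delta_i Z_i / G_n(Z_i) and the local average over the cell
    A_n(x) is truncated at T_{K_n} = Z_(n).  Cubic cells of side h and the
    convention 0/0 = 0 are simplifications for this sketch."""
    X, Z, delta = np.atleast_2d(X), np.asarray(Z, float), np.asarray(delta)
    order = np.argsort(Z, kind="stable")
    n = len(Z)
    i = np.arange(1, n + 1)
    G = np.empty(n)
    G[order] = np.cumprod(((n - i) / (n - i + 1.0)) ** (1 - delta[order]))
    pseudo = np.zeros(n)                              # delta_i Z_i / G_n(Z_i)
    keep = (delta == 1) & (G > 0)
    pseudo[keep] = Z[keep] / G[keep]
    in_cell = np.all(np.floor(X / h) == np.floor(np.asarray(x) / h), axis=1)
    if not in_cell.any():
        return 0.0                                    # empty cell: 0/0 := 0
    return min(Z.max(), pseudo[in_cell].mean())
```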

Theorem 26.2. Assume (i)-(iv). Then under the conditions (4.1) and (4.2), for the regression function estimate $m_n^{(part)}$ one has
$$
\int\big(m_n^{(part)}(x)-m(x)\big)^2\mu(dx)\to 0\quad\text{a.s. as }n\to\infty.
$$

The proof of this theorem uses the following lemma:

Lemma 26.1. Assume that $T_F<\infty$, $G$ is continuous, and $G(T_F)>0$. Then
$$
\frac{1}{n}\sum_j\delta_jZ_j\left|\frac{1}{G(Z_j)}-\frac{1}{G_n(Z_j)}\right|\to 0\quad\text{a.s.}
$$

Proof. Applying the identity

|a− b| = (a− b)+ + (b− a)+

= 2(a− b)+ + b− a

we obtain

1n

j

δjZj

∣∣∣∣

1G(Zj)

− 1Gn(Zj)

∣∣∣∣ ≤ 2

1n

n∑

j=1

δjZj

(1

G(Zj)− 1Gn(Zj)

)+

+1n

n∑

j=1

δjZj

Gn(Zj)− 1n

n∑

j=1

δjZj

G(Zj)

:= I + II − III. (26.15)

Next introduce

Sn(y) = supn≤m

(1 − G(y)

Gm(y)

)+

.

Since G is continuous, F and G have no common jumps and, therefore,because of Theorem 26.1,

Gn(y) → G(y) a.s.

as n → ∞ for all y < TK = TF . Hence Sn(y) has the following properties:(a) 0 ≤ Sn(y) ≤ 1; and(b) Sn(y) → 0 a.s. for all y < TK = TF as n → ∞.


By the dominated convergence theorem

ESn(Y ) → 0 as n → ∞,

and hence one can choose, for any ε > 0, n0 sufficiently large such that

0 ≤ ESn0(Y ) < ε.

Now fix n0 and take n ≥ n0. Then, since δj Zj = δj Yj and from thedefinition of Sn(y), expression I in (26.15) can be written as

I = 21n

n∑

j=1

δjYj

G(Yj)

(1 − G(Yj)

Gn(Yj)

)+

≤ 21n

n∑

j=1

δjYj

G(Yj)Sn0(Yj). (26.16)

Taking lim sup on both sides of (26.16) we obtain, by applying thestrong law of large numbers to the sequence of random variables Vj :=δjYj

G(Yj)Sn0(Yj) and observing that Y ≤ TF < ∞,

0 ≤ lim supn→∞

21n

n∑

j=1

δjYj

(1

G(Yj)− 1Gn(Yj)

)+

≤ lim supn→∞

21n

n∑

j=1

δjYj

G(Yj)Sn0(Yj)

= 2E

δY

G(Y )Sn0(Y )

a.s.

≤ 2TF E

δ

G(Y )Sn0(Y )

= 2TF ESn0(Y ) < 2TF ε

which can be made arbitrarily small by an appropriate choice of ε, hencelimn→∞ I = 0 a.s. Using again δj Zj = δj Yj and the strong law of largenumbers, we obtain, for expression III, that

III =1n

n∑

j=1

δjYj

G(Yj)→ E

δY

G(Y )

= EY a.s. as n → ∞.

For the expression II we get

II =1n

n∑

j=1

δjZj

Gn(Zj)=

∫ TF

0Fn(y) dy →

∫ TF

0F (y) dy = EY a.s.

This, combined with the result that limn→∞

I = 0 a.s., completes the proof ofthe lemma.


Proof of Theorem 26.2. Because of∫

(m(part)n (x) −m(x))2µ(dx) ≤ TF

∫|m(part)

n (x) −m(x)|µ(dx)

it is enough to show that∫

|m(part)n (x) −m(x)|µ(dx) → 0 a.s. as n → ∞.

Put

Kn(x, z) = Iz∈An(x),

then

mn(x) = m(part)n (x) = min

(

TKn,

∑ni=1

δiZi

Gn(Zi)Kn(x,Xi)

∑ni=1Kn(x,Xi)

)

.

Introduce the following notations:

mn(x) = min

(

TKn,

∑ni=1

δiZi

Gn(Zi)Kn(x,Xi)

nEKn(x,X)

)

and

mn(x) = min

(

TKn,

∑ni=1

δiZi

G(Zi)Kn(x,Xi)

nEKn(x,X)

)

.

Then

|mn(x) −m(x)|≤ |mn(x) − mn(x)| + |mn(x) − mn(x)| + |mn(x) −m(x)|= In,1(x) + In,2(x) + In,3(x).

Because of Lemma 23.2 and (26.9),∫In,3(x)µ(dx) → 0 a.s.

We have that∫

Kn(x,Xi)EKn(x,X)

µ(dx) = 1,

therefore,∫In,2(x)µ(dx) ≤ 1

n

n∑

i=1

∣∣∣∣δiZi

G(Zi)− δiZi

Gn(Zi)

∣∣∣∣

∫Kn(x,Xi)EKn(x,X)

µ(dx)

=1n

n∑

i=1

∣∣∣∣δiZi

G(Zi)− δiZi

Gn(Zi)

∣∣∣∣

→ 0 a.s.


by Lemma 26.1. Moreover,

In,1(x) ≤ mn(x)∣∣∣∣1 − mn(x)

mn(x)

∣∣∣∣

≤ TKn

∣∣∣∣∣∣∣∣

1 −min

(TKn ,

∑n

i=1

δiZiGn(Zi)

Kn(x,Xi)

nEKn(x,X)

)

min(TKn ,

∑n

i=1

δiZiGn(Zi)

Kn(x,Xi)∑n

i=1Kn(x,Xi)

)

∣∣∣∣∣∣∣∣

.

If mn(x) = TKn , mn(x) = TKn , then∣∣∣∣1 − mn(x)

mn(x)

∣∣∣∣ = 0.

If mn(x) < TKn, mn(x) < TKn , then∣∣∣∣1 − mn(x)

mn(x)

∣∣∣∣ =

∣∣∣∣1 −

∑ni=1Kn(x,Xi)

nEKn(x,X)∣∣∣∣ .

If mn(x) < TKn , mn(x) = TKn , then∣∣∣∣1 − mn(x)

mn(x)

∣∣∣∣ = 1 − mn(x)

mn(x)

≤ 1 −

∑n

i=1

δiZiGn(Zi)

Kn(x,Xi)

nEKn(x,X)∑n

i=1

δiZiGn(Zi)

Kn(x,Xi)∑n

i=1Kn(x,Xi)

= 1 −∑n

i=1Kn(x,Xi)nEKn(x,X)

≤∣∣∣∣1 −

∑ni=1Kn(x,Xi)

nEKn(x,X)∣∣∣∣ .

If mn(x) = TKn , mn(x) < TKn , then∣∣∣∣1 − mn(x)

mn(x)

∣∣∣∣ =

mn(x)mn(x)

− 1

∑n

i=1

δiZiGn(Zi)

Kn(x,Xi)

nEKn(x,X)∑n

i=1

δiZiGn(Zi)

Kn(x,Xi)∑n

i=1Kn(x,Xi)

− 1

=∑n

i=1Kn(x,Xi)nEKn(x,X) − 1

≤∣∣∣∣1 −

∑ni=1Kn(x,Xi)

nEKn(x,X)∣∣∣∣ .


Thus, in general,∫In,1(x)µ(dx) ≤ TF

∫ ∣∣∣∣1 −

∑ni=1Kn(x,Xi)

nEKn(x,X)∣∣∣∣µ(dx)

→ 0 a.s.

by Lemma 23.2.

In the censored case, the kernel estimate is
$$
m_n^{(kern)}(x)=\min\left(T_{K_n},\ \frac{\sum_{i=1}^n\frac{\delta_iZ_i}{G_n(Z_i)}K_{h_n}(x-X_i)}{\sum_{i=1}^nK_{h_n}(x-X_i)}\right). \qquad (26.17)
$$
Theorem 26.3. Assume (i)-(iv). Then under the conditions of Theorem 23.5, for the regression function estimate $m_n^{(kern)}$ one has
$$
\int\big(m_n^{(kern)}(x)-m(x)\big)^2\mu(dx)\to 0\quad\text{a.s. as }n\to\infty.
$$
Proof. See Problem 26.4.

In the censored case, the k-NN estimate is
$$
m_n^{(k\text{-}NN)}(x)=\min\left(T_{K_n},\ \sum_{i=1}^n\frac{1}{k}I_{\{X_i\text{ is among the }k\text{-NNs of }x\}}\frac{\delta_iZ_i}{G_n(Z_i)}\right). \qquad (26.18)
$$
Theorem 26.4. Assume (i)-(iv). Then under the conditions of Theorem 23.7, for the regression function estimate $m_n^{(k\text{-}NN)}$ one has
$$
\int\big(m_n^{(k\text{-}NN)}(x)-m(x)\big)^2\mu(dx)\to 0\quad\text{a.s. as }n\to\infty.
$$
The proof is a consequence of the following lemma:

Lemma 26.2. Assume that $T_F<\infty$, $G$ is continuous, and $G(T_F)>0$. Moreover, assume that the $W_{n,i}$ are subprobability weights such that, for all bounded $Y$,
$$
\lim_{n\to\infty}\int\Big(\sum_{i=1}^nW_{n,i}(x)Y_i-m(x)\Big)^2\mu(dx)=0\quad\text{a.s.} \qquad (26.19)
$$
and
$$
\lim_{n\to\infty}\sum_{i=1}^n\left|\frac{\delta_iZ_i}{G(Z_i)}-\frac{\delta_iZ_i}{G_n(Z_i)}\right|\int W_{n,i}(x)\,\mu(dx)=0\quad\text{a.s.} \qquad (26.20)
$$


Then for the regression function estimate mn(x), defined by (26.12) and(26.13), one has

∫(mn(x) −m(x))2µ(dx) → 0 a.s. as n → ∞.

Proof. Because of the conditionsδiZi

G(Zi)=

δiYi

G(Yi)≤ TF

G(TF )< ∞,

therefore (26.19) and (26.10) imply that

limn→∞

∫ (n∑

i=1

Wn,i(x)δiZi

G(Zi)−m(x)

)2

µ(dx) = 0 a.s.

Because G(TF ) > 0, we have TK = TF , and thus

TKn → TF a.s. as n → ∞,

thus

limn→∞

∫ (

min

(

TKn,

n∑

i=1

Wn,i(x)δiZi

G(Zi)

)

−m(x)

)2

µ(dx) = 0,

and, therefore,∫

(mn(x) −m(x))2µ(dx)

≤ 2∫ (

min

(

TKn ,

n∑

i=1

Wn,i(x)δiZi

Gn(Zi)

)

−min

(

TKn ,

n∑

i=1

Wn,i(x)δiZi

G(Zi)

))2

µ(dx)

+ 2∫ (

min

(

TKn ,

n∑

i=1

Wn,i(x)δiZi

G(Zi)

)

−m(x)

)2

µ(dx)

≤ 2TK

n∑

i=1

∣∣∣∣δiZi

Gn(Zi)− δiZi

G(Zi)

∣∣∣∣

∫Wn,i(x)µ(dx)

+ 2∫ (

min

(

TKn ,

n∑

i=1

Wn,i(x)δiZi

G(Zi)

)

−m(x)

)2

µ(dx)

→ 0 a.s.,

where we used (26.20).

Proof of Theorem 26.4. We will verify the conditions of Lemma 26.2.Equation (26.19) follows from the fact that the estimate is consistent in the


uncensored case. Concerning (26.20) we will need the following coveringresults (Lemmas 23.10 and 23.11):

lim supn→∞

n

knµ (x : Xi is among the kn-NNs of x) ≤ const a.s.

Nown∑

i=1

∣∣∣∣δiZi

G(Zi)− δiZi

Gn(Zi)

∣∣∣∣

∫Wn,i(x)µ(dx)

=n∑

i=1

∣∣∣∣δiZi

G(Zi)− δiZi

Gn(Zi)

∣∣∣∣

∫1knIXi is among the kn-NNs of xµ(dx)

=n∑

i=1

∣∣∣∣δiZi

G(Zi)− δiZi

Gn(Zi)

∣∣∣∣

1knµ (x : Xi is among the kn-NNs of x)

≤ n

knmax

iµ (x : Xi is among the kn-NNs of x)

× 1n

n∑

i=1

∣∣∣∣δiZi

G(Zi)− δiZi

Gn(Zi)

∣∣∣∣

→ 0

a.s. because of Lemma 26.1.

26.4 Regression Estimation for Model B

Assume that $Y$ and $C$ are conditionally independent given $X$ (Model B). Let
$$
F(t|x)=\mathbf{P}\{Y>t\mid X=x\},\qquad G(t|x)=\mathbf{P}\{C>t\mid X=x\},
$$
$$
K(t|x)=\mathbf{P}\{Z>t\mid X=x\}=F(t|x)G(t|x),
$$
and
$$
T_F(x)=\sup\{t:F(t|x)>0\},\qquad T_G(x)=\sup\{t:G(t|x)>0\},\qquad T_K(x)=\min\{T_F(x),T_G(x)\}.
$$
We observe
$$
(X_1,Z_1,\delta_1),\dots,(X_n,Z_n,\delta_n),
$$


where $\delta_i=I_{\{Y_i\le C_i\}}$ and $Z_i=\min(Y_i,C_i)$. Two related problems are studied:
(1) estimating the conditional survival functions $F(t|x)$ and $G(t|x)$; and
(2) estimating the regression function $m(x)=\mathbf{E}\{Y\mid X=x\}$ from the censored data.

Define the local averaging estimates of $F(t|x)$ and $G(t|x)$ in the following way:
$$
F_n(t|x)=\prod_{\substack{Z_{(i)}\le t\\ \sum_{j=i}^nW_{n(j)}(x)>0}}
\left(\frac{\sum_{j=i+1}^nW_{n(j)}(x)}{\sum_{j=i}^nW_{n(j)}(x)}\right)^{\delta_{(i)}}I_{\{t\le Z_{(n)}\}} \qquad (26.21)
$$
and
$$
G_n(t|x)=\prod_{\substack{Z_{(i)}\le t\\ \sum_{j=i}^nW_{n(j)}(x)>0}}
\left(\frac{\sum_{j=i+1}^nW_{n(j)}(x)}{\sum_{j=i}^nW_{n(j)}(x)}\right)^{1-\delta_{(i)}}I_{\{t\le Z_{(n)}\}} \qquad (26.22)
$$

where (Z(i), δ(i)) (i = 1, . . . , n) are the n pairs of observed (Zi, δi) orderedon the Z(i), i.e.,

Z(1) ≤ Z(2) ≤ · · · ≤ Z(n),

and Wni are subprobability weights depending on X1, . . . , Xn, i.e.,Wni(x) ≥ 0 and

∑nj=1Wni(x) ≤ 1, and Wn(i) are these weights ordered

according to the ordering of Z(i).Examples of local averaging-type estimates are the nearest neighbor

estimates, the kernel estimates, and the partitioning estimates.The estimates defined in (26.21) and (26.22) extend the Kaplan-Meier

method to conditional survival functions. If Wni(x) = 1n for all i then

Fn(t|x) =∏

i:Z(i)≤t

(∑nj=i+1Wn(j)(x)∑n

j=iWn(j)(x)

)δ(i)

=∏

i:Z(i)≤t

(n− i

n− i+ 1

)δ(i)

for all x, which is the Kaplan-Meier estimator Fn(t).In this section we assume that:(I) Y and C are conditionally independent given X;(II) F and G have no common jumps; and(III) TF (x) < TG(x) for all x.Now we will show that these estimates are consistent if the weights are

the consistent k-NN, kernel, or partitioning weights.
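For concreteness, here is how (26.21) looks with naive kernel weights $W_{ni}(x)\propto I_{\{\|X_i-x\|\le h\}}$; the bandwidth, the kernel choice, and the handling of an empty neighborhood are illustrative only. Taking $h$ so large that all points receive equal weight recovers the unconditional Kaplan-Meier estimate (26.1), matching the remark above.

```python
import numpy as np

def conditional_km(X, Z, delta, x, t, h):
    """Local averaging (conditional Kaplan-Meier) estimate (26.21) of F(t|x)
    with naive kernel weights; factors with sum_{j>=i} W_(j)(x) = 0 are
    skipped, as in the definition."""
    X, Z, delta = np.atleast_2d(X), np.asarray(Z, float), np.asarray(delta)
    W = (np.linalg.norm(X - np.asarray(x), axis=1) <= h).astype(float)
    if W.sum() == 0.0:
        return 1.0                               # empty neighborhood (our convention)
    W /= W.sum()                                 # probability weights W_ni(x)
    order = np.argsort(Z, kind="stable")
    Zo, do, Wo = Z[order], delta[order], W[order]
    tail = np.cumsum(Wo[::-1])[::-1]             # sum_{j >= i} W_(j)(x)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = (tail - Wo) / tail               # sum_{j >= i+1} / sum_{j >= i}
        factors = np.where((Zo <= t) & (tail > 0), ratio ** do, 1.0)
    return float(np.prod(factors)) if t <= Zo[-1] else 0.0
```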

Theorem 26.5. Assume (I)-(III). If the weights $W_{ni}(x)$ are such that the local averaging regression estimate using probability weights is strongly pointwise consistent for bounded $Y$, then, for $t<T_K(x)$,


(a) Fn(t|x) → F (t|x); and

(b) Gn(t|x) → G(t|x);a.s. as n → ∞ for µ-almost all x.

Proof. The argument follows the reasoning of the proof of Theorem 26.1,here we prove the statement for F . For G the proof is similar. F (·|x) canbe expressed as

F (t|x) = exp

(∫

[0,t]\D(F (·|x))

dF (s|x)F (s|x)

)

·∏

s∈D(F (·|x))s≤t

F (s|x)F (s−|x) (26.23)

and since Fn(·|x) is a discrete survival function

Fn(t|x) =∏

s∈D(Fn(·|x))s≤t

Fn(s|x)Fn(s−|x) ,

where D(S) is, as earlier, the set of jump points of function S. Define theconditional subsurvival functions

F ∗(t|x) = PZ > t, δ = 1|X = x,

G∗(t|x) = PZ > t, δ = 0|X = x.F ∗(·|x) can be expressed in terms of the survival functions F (·|x) andG(·|x),

F ∗(t|x) =∫ ∞

t

G(s|x)[−dF (s|x)], (26.24)

because the random variables Y and C are conditionally independent givenX (cf. (26.11)). Note that

F ∗(t|x) +G∗(t|x) = PZ > t|X = x= K(t|x)= Pmin(Y,C) > t|X = x= F (t|x) ·G(t|x). (26.25)

For real-valued functions H1 and H2 define a function φ by

φ(H1, H2, t)

= exp

(∫

[0,t]\D(H1)

dH1(s)H1(s) +H2(s)

)

·∏

s∈D(H1)s≤t

H1(s) +H2(s)H1(s−) +H2(s−)

.

We will show that

F (t|x) = φ(F ∗(·|x), G∗(·|x), t). (26.26)

Page 575: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

558 26. Censored Observations

Because of (26.24), F ∗ has the same jump points as F . Thus (26.26) followsfrom (26.23), since

dF ∗(s|x)F ∗(s|x) +G∗(s|x) =

dF ∗(s|x)F (s|x) ·G(s|x) (from (26.25))

=−G(s|x) · dF (s|x)G(s|x) · F (s|x) (from (26.24))

=−dF (s|x)F (s|x)

and

F ∗(s|x) +G∗(s|x)F ∗(s−|x) +G∗(s−|x) =

F (s|x) ·G(s|x)F (s−|x) ·G(s−|x) (from (26.25))

=F (s|x)F (s−|x)

if s is a jump point of F (·|x) because, by assumption, F (·|x) and G(·|x)do not have common jumps, thus s cannot be a jump point of G(·|x), soG(s|x) = G(s−|x).

Now we express Fn(·|x) in terms of the local averaging estimates

F ∗n(t|x) =

n∑

i=1

Wni(x)IZi>t,δi=1

and

G∗n(t|x) =

n∑

i=1

Wni(x)IZi>t,δi=0

of the conditional subsurvival functions.First note that (26.25) is also true for the estimates, since

F ∗n(t|x) +G∗

n(t|x) =n∑

i=1

Wni(x)IZi>t,δi=1 +n∑

i=1

Wni(x)IZi>t,δi=0

=n∑

i=1

Wni(x)IZi>t

and

Fn(t|x) ·Gn(t|x) =∏

Z(i)≤t∑n

j=iWn(j)(x)>0

(∑nj=i+1Wn(j)(x)∑n

j=iWn(j)(x)

)δ(i)

Page 576: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

26.4. Regression Estimation for Model B 559

×∏

Z(i)≤t∑n

j=iWn(j)(x)>0

(∑nj=i+1Wn(j)(x)∑n

j=iWn(j)(x)

)1−δ(i)

=∏

Z(i)≤t∑n

j=iWn(j)(x)>0

(∑nj=i+1Wn(j)(x)∑n

j=iWn(j)(x)

)

=∑n

i=1Wni(x)IZi>t∑nj=1Wnj(x)

=n∑

i=1

Wni(x)IZi>t

because of∑n

j=1Wnj(x) = 1. Therefore

F ∗n(s|x) +G∗

n(s|x)F ∗

n(s−|x) +G∗n(s−|x) =

Fn(s|x) ·Gn(s|x)Fn(s−|x) ·Gn(s−|x) =

Fn(s|x)Fn(s−|x)

if s is a jump point of Fn(·|x) because, by assumption, Fn(·|x) and Gn(·|x)do not have common jumps a.s., thus Gn(s|x) = Gn(s−|x). Hence

Fn(t|x) = φ(F ∗n(·|x), G∗

n(·|x), t)for t ≤ Z(n).

Function φ(H1(·), H2(·), t) is a continuous function of the argumentH1(·)and H2(·) according to the sup norm topology. F ∗(·|x) and G∗(·|x) areregression functions of indicator random variables, and Fn(·|x) and Gn(·|x)are regression estimates. They are pointwise consistent on [0, TK ] becauseof the condition. The limits are nonincreasing, therefore the convergence isuniform, consequently, Fn(·|x) converges to F (·|x) almost surely.

From a consistent estimator Fn(·|x) of the conditional survival functionF (·|x) we can obtain a consistent nonparametric estimator mn(x) of theregression function m(x).

Let

mn(x) =∫ ∞

0Fn(t|x) dt =

∫ TKn

0Fn(t|x) dt. (26.27)

Theorem 26.6. Under the assumptions of Theorem 26.5∫

Rd

|mn(x) −m(x)|2µ(dx) → 0 a.s. as n → ∞.

Page 577: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

560 26. Censored Observations

Proof. Since Y is bounded by TF , we have m(x) ≤ TF and hence also

|m(x) −mn(x)| ≤ TF .

TF (x) < TG(x) implies that TF (x) = TK(x) therefore∫

|m(x) −mn(x)|2µ(dx) ≤ TF

∫|m(x) −mn(x)|µ(dx)

= TF

∫ ∣∣∣∣

∫ ∞

0(F (t|x) − Fn(t|x))dt

∣∣∣∣µ(dx)

≤ TF

∫ ∫ TF (x)

0|F (t|x) − Fn(t|x)| dtµ(dx),

which tends to zero a.s. by Theorem 26.5 and by the dominated convergencetheorem.

To see that the integration in (26.27) can be carried out easily, wecalculate the jump of the estimator Fn(·|x) defined by (26.21) in Zi.

Lemma 26.3. For probability weights

mn(x) =n∑

i=1

Wni(x)Ziδi

Gn(Zi|x) . (26.28)

Proof. We show that the jump of the estimator Fn(·|x) in an observationZi is given by

dFn(Zi|x) =Wni(x)δiGn(Zi|x) ,

where Gn(·|x) is the estimator of G(·|x) defined by (26.22). Fn(·|x) has nojump at a censored observation, therefore we consider only the jump in anuncensored observation, i.e., let δ(i) = 1,

dFn(Z(i)|x) = Fn(Z(i−1)|x) − Fn(Z(i)|x)

=∏

k≤i−1∑n

j=kWn(j)(x)>0

(∑nj=k+1Wn(j)(x)∑n

j=k Wn(j)(x)

)δ(k)

−∏

l≤i∑n

j=lWn(j)(x)>0

(∑nj=l+1Wn(j)(x)∑n

j=l Wn(j)(x)

)δ(l)

Page 578: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

26.4. Regression Estimation for Model B 561

=∏

k≤i−1∑n

j=kWn(j)(x)>0

(∑nj=k+1Wn(j)(x)∑n

j=k Wn(j)(x)

)δ(k)

×

1 −(∑n

j=i+1Wn(j)(x)∑nj=iWn(j)(x)

)δ(i)

=∏

k≤i−1∑n

j=kWn(j)(x)>0

(∑nj=k+1Wn(j)(x)∑n

j=k Wn(j)(x)

)δ(k)

× Wn(i)(x)∑nj=iWn(j)(x)

.

If Wn(i)(x) = 0, then dFn(Z(i)|x) = 0, otherwise∑n

j=k Wn(j)(x) > 0 for allk ≤ i, thus

dFn(Z(i)|x)

=∏

k≤i−1

(∑nj=k+1Wn(j)(x)∑n

j=k Wn(j)(x)

)δ(k)Wn(i)(x)∑n

j=iWn(j)(x)

=∏

k≤i

(∑nj=k+1Wn(j)(x)∑n

j=k Wn(j)(x)

)δ(k)(∑n

j=i+1Wn(j)(x)∑nj=iWn(j)(x)

)−1

× Wn(i)(x)∑nj=iWn(j)(x)

(since δ(i) = 1)

=∏

k≤i

(∑nj=k+1Wn(j)(x)∑n

j=k Wn(j)(x)

)δ(k)

Wn(i)(x)1

∑nj=i+1Wn(j)(x)

.

Since Wn(j)(x) are probability weights,

n∑

j=i+1

Wn(j)(x) =

∑nj=i+1Wn(j)(x)∑nj=1Wn(j)(x)

=∏

k≤i

(∑nj=k+1Wn(j)(x)∑n

j=k Wn(j)(x)

)

.

Thus, by the definition of Gn(·|x), and since δ(i) = 1,

dFn(Z(i)|x) =∏

k≤i

(∑nj=k+1Wn(j)(x)∑n

j=k Wn(j)(x)

)δ(k)−1

Wn(i)(x)

Page 579: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

562 26. Censored Observations

=1

∏k≤i

(∑n

j=k+1Wn(j)(x)∑n

j=kWn(j)(x)

)1−δ(k)Wn(i)(x)

=Wn(i)(x)δ(i)Gn(Z(i)|x) .

The integral of Fn(t|x) can be expressed as the sum of Zi times the jumpof Fn in Zi that is apparent once we calculate the area in question bydecomposing it into horizontal stripes, rather than into vertical stripes, asis customary in the calculation of integrals. Thus

∫ ∞

0Fn(t|x) dt =

n∑

i=1

ZiWni(x)δiGn(Zi|x) .

Hence the regression estimate defined by (26.27) can be written in thefollowing form:

mn(x) =n∑

i=1

Wni(x)Ziδi

Gn(Zi|x) .

We have consequences for local averaging estimates. Formally Theorem26.6 assumed probability weights, which is satisfied for partitioning andkernel estimates if x is in the support set and n is large.

Corollary 26.1. Under the conditions of Theorems 26.6 and 25.6, for thepartitioning estimate defined by (26.28),

Rd

|mn(x) −m(x)|2µ(dx) → 0 (n → ∞) a.s.

Corollary 26.2. Under the conditions of Theorems 26.6 and 25.11, forthe kernel estimate defined by (26.28),

Rd

|mn(x) −m(x)|2µ(dx) → 0 (n → ∞) a.s.

Corollary 26.3. Under the conditions of Theorems 26.6 and 25.17, forthe nearest neighbor estimate defined by (26.28),

Rd

|mn(x) −m(x)|2µ(dx) → 0 (n → ∞) a.s.

Page 580: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

26.5. Bibliographic Notes 563

26.5 Bibliographic Notes

Theorem 26.1 is due to Peterson (1977). A Glivenko-Cantelli type sharp-ening is due to Stute and Wang (1993), see also the survey article of Gill(1994). Carbonez, Gyorfi, and van der Meulen (1995) proved Theorem 26.2.Theorems 26.3 and 26.4 are due to Kohler, Mathe, and Pinter (2002). Un-der the assumption that Y and C are conditionally independent given X(Model B), regression estimation was considered by many authors, the firstfully nonparametric approach was given by Beran (1981) who introduced aclass of nonparametric regression estimates to estimate conditional survivalfunctions in the presence of right censoring. Beran (1981) proved Theorem26.5. Dabrowska (1987; 1989) proved some consistency results for this esti-mate. An alternative approach was taken by Horvath (1981) who proposedto estimate the conditional survival function by integrating an estimate ofthe conditional density. Lemmas 26.2, 26.3, and Theorem 26.6 are due toPinter (2001).

Problems and Exercises

Problem 26.1. Prove that the Kaplan-Meier and Efron estimates are the same.Hint: Both estimates have jumps at sample points. By induction verify that

the jumps are the same.

Problem 26.2. Prove (26.11).Hint:

δj

nGn(Zj) is the jump of Fn in Zj .

Problem 26.3. Prove that the Kaplan-Meier mean is a consistent estimate ofthe expectation if TF < TG.

Hint: Apply Theorem 26.1 and the dominated convergence theorem.

Problem 26.4. Prove Theorem 26.3.Hint: With the notation

Kn(x, z) = Khn(x − z)

we copy the proof of Theorem 26.2 such that the only difference is the coveringlemma (Lemma 23.6)

∫Khn(x − Xi)EKhn(x − X)

µ(dx) ≤ ρ,

moreover, instead of Lemma 23.2, we refer to Lemma 23.9.

Page 581: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

27Dependent Observations

If the data are not i.i.d. then the regression problem has two versions.For the regression function estimation from dependent dataDn =

(Xi, Yi)ni=1, the random vectors (Xi, Yi) (i = 1, 2, . . .) are not i.i.d., but

Dn and (X,Y ) are independent. Because of this independence

EY |X,Dn = EY |X,

therefore

minf

E(f(X,Dn) − Y )2 = minf

E(f(X) − Y )2,

so the best L2 error for the regression problem is the same as the best L2error for i.i.d. data.

For the time series problem (X,Y ) = (Xn+1, Yn+1). Here

EYn+1|Xn+1, Dn and EYn+1|Xn+1

are not identical in general, thus we may have that

minf

E(f(Xn+1, Dn) − Yn+1)2 < minf

E(f(Xn+1) − Yn+1)2,

thus in the time series problem the best possible L2 error can be improvedwith respect to the i.i.d. case.

In this chapter we are interested only in the extreme dependence, whenthe data are stationary and ergodic.

Page 582: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

27.1. Stationary and Ergodic Observations 565

27.1 Stationary and Ergodic Observations

Since our main interest is the ergodicity, we summarize first the basics ofstationary and ergodic sequences.

Definition 27.1. A sequence of random variables Z = Zi∞i=−∞ defined

on a probability space (Ω,F ,P) is stationary if for any integers k andn the random vectors (Z1, . . . , Zn) and (Zk+1, . . . , Zk+n) have the samedistribution.

A measurable-transformation T from Ω to Ω is called measure preservingif, for all A ∈ F ,

PA = PT−1A.It is easy to show that for any stationary sequence Z there are a randomvariable Z and a measure-preserving transformation T such that

Zn(ω) = Z(Tnω). (27.1)

Definition 27.2. A stationary sequence of random variables

Z = Zi∞i=−∞

represented by a measure-preserving transformation T is called ergodic if

T−1A = A

implies that PA is either 0 or 1.

The next lemma is the strong law of large numbers for ergodic sequences.

Lemma 27.1. (Birkhoff’s Ergodic Theorem). Let Zi∞−∞ be a

stationary and ergodic process with E|Z1| < ∞. Then

limn→∞

1n

n∑

i=1

Zi = EZ1 a.s.

Proof. First we prove the so-called maximal ergodic theorem: Let Zbe an integrable real random variable and let T be a measure-preservingtransformation, further

Mn = maxk=1,2,...,n

k−1∑

i=0

Z T i.

Then

EZIMn>0 ≥ 0.

In order to show this, notice that, for k = 0, 1, . . . , n,

M+n ≥

k−1∑

i=0

Z T i.

Page 583: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

566 27. Dependent Observations

(The void sum is 0 by definition.) Thus

Z +M+n T ≥ Z +

k−1∑

i=0

Z T i+1 =k∑

i=0

Z T i

for k = 0, 1, . . . , n, and

Z ≥ maxk=1,2,...,n

k∑

i=0

Z T i −M+n T ≥ Mn −M+

n T.

Because Mn = M+n on [Mn > 0] = [M+

n > 0], one obtains

EZIMn>0 =∫

M+n >0

Z dP

≥∫

M+n >0

(M+n −M+

n T ) dP

=∫

ΩM+

n dP −∫

M+n >0

M+n T dP

≥∫

ΩM+

n dP −∫

ΩM+

n T dP

=∫

ΩM+

n dP −∫

T −1(Ω)M+

n dPT

=∫

ΩM+

n dP −∫

ΩM+

n dP

= 0,

since PT−1(Ω) = PΩ = 1 and PT = P for measure-preserving T , andthe proof of the maximal ergodic theorem is complete.

In order to prove Lemma 27.1 we use the representation of Zi∞i=−∞

given after Definition 27.1. Without loss of generality assume EZ = 0. Itsuffices to show

Z := lim supn→∞

1n

n−1∑

i=0

Zi ≤ 0 a.s.,

because then by use of −Zi we get

lim infn→∞

1n

n−1∑

i=0

Zi ≥ 0 a.s.

and thus

1n

n−1∑

i=0

Zi → 0 a.s.

Page 584: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

27.1. Stationary and Ergodic Observations 567

Notice that Z is T -invariant, i.e., Z T = Z. Because of

T−1Z > s = T−1Z−1((s,∞))

= (ZT )−1((s,∞))

= Z−1((s,∞))

= Z > sfor each real s, by ergodicity of Zi the probability PZ > s equals 0 or1 for each s. Thus a constant δ ∈ R ∪ −∞,∞ exists such that

Z = δ a.s.

We show that the assumption δ > 0 leads to a contradiction. Set δ∗ = δ if0 < δ < ∞ and δ∗ = 1 if δ = ∞, further

Z∗ = Z − δ∗

and

Mn = maxk=1,2,...,n

1k

k−1∑

i=0

Z∗ T i.

On the one hand, by the maximal ergodic theorem

EZ∗IMn>0 ≥ 0.

On the other hand,

Mn > 0 ↑

supk

1k

k−1∑

i=0

Z∗ T i > 0

=

supk

1k

k−1∑

i=0

Z T i ≥ δ∗

and

P

supk

1k

k−1∑

i=0

Z T i ≥ δ∗

= 1,

since PZ = δ = 1. Because of

|Z∗IMn>0| ≤ |Z| + δ∗

Lebesgue’s dominated convergence theorem yields

EZ∗IMn>0 → EZ∗ = −δ∗ < 0

for n → ∞, which is a contradiction.

The main ingredient of the proofs in the sequel is known as Breiman’sgeneralized ergodic theorem.

Lemma 27.2. Let Z = Zi∞−∞ be a stationary and ergodic process. Let

T denote the left shift operator. Let fi be a sequence of real-valued func-tions such that for some function f , fi(Z) → f(Z) a.s. Assume that

Page 585: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

568 27. Dependent Observations

E supi |fi(Z)| < ∞. Then

limn→∞

1n

n∑

i=1

fi(T iZ) = Ef(Z) a.s.

Proof. Put

gk = infk≤i

fi

and

Gk = supk≤i

fi,

then for any integer k > 0 Lemma 27.1 implies that

lim supn→∞

1n

n∑

i=1

fi(T iZ) ≤ lim supn→∞

1n

n∑

i=1

Gk(T iZ) = EGk(Z) a.s.

since by our condition EGk(Z) is finite. The monotone convergence theoremyields

EGk(Z) ↓ E

lim supi→∞

fi(Z)

= Ef(Z).

Similarly,

lim infn→∞

1n

n∑

i=1

fi(T iZ) ≥ lim infn→∞

1n

n∑

i=1

gk(T iZ) = Egk(Z) a.s.

and

Egk(Z) ↑ E

lim infi→∞

fi(Z)

= Ef(Z).

27.2 Dynamic Forecasting: Autoregression

The autoregression is a special time series problem where there is no Xi

only Yi. Here Cover (1975) formulated the following two problems:

Dynamic Forecasting. Find an estimator E(Y n−10 ) of the value EYn|Y n−1

0 such that for all stationary and ergodic sequences Yi

limn→∞

|E(Y n−10 ) − EYn|Y n−1

0 | = 0 a.s.

Page 586: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

27.2. Dynamic Forecasting: Autoregression 569

Static Forecasting. Find an estimator E(Y −1−n ) of the value EY0|Y −1

−∞such that for all stationary and ergodic sequences Yi

limn→∞

E(Y −1−n ) = EY0|Y −1

−∞ a.s.

First we show that the dynamic forecasting is impossible.

Theorem 27.1. For any estimator E(Y n−10 ) there is a stationary

ergodic binary valued process Yi such that

P

lim supn→∞

|E(Y n−10 ) − EYn|Y n−1

0 | ≥ 1/4

≥ 18.

Proof. We define a Markov process which serves as the technical tool forthe construction of our counterexample. Let the state space S be the non-negative integers. From state 0 the process certainly passes to state 1 andthen to state 2, at the following epoch. From each state s ≥ 2, the Markovchain passes either to state 0 or to state s+ 1 with equal probabilities 0.5.This construction yields a stationary and ergodic Markov process Miwith stationary distribution

PM = 0 = PM = 1 =14

and

PM = i =12i

for i ≥ 2.

Let τk denote the first positive time of the occurrence of state 2k:

τk = mini ≥ 0 : Mi = 2k.Note that if M0 = 0 then Mi ≤ 2k for 0 ≤ i ≤ τk. Now we define thehidden Markov chain process Yi, which we denote as Yi = f(Mi). It willserve as the stationary unpredictable time series. We will use the notationMn

0 to denote the sequence of states M0, . . . ,Mn. Let f(0) = 0, f(1) = 0,and f(s) = 1 for all even states s. A feature of this definition of f(·) is thatwhenever Yn = 0, Yn+1 = 0, Yn+2 = 1 we know that Mn = 0 and vice versa.Next we will define f(s) for odd states s maliciously. We define f(2k + 1)inductively for k ≥ 1. Assume f(2l+1) is defined for l < k. If M0 = 0 (i.e.,f(M0) = 0, f(M1) = 0, f(M2) = 1), then Mi ≤ 2k for 0 ≤ i ≤ τk and themapping

Mτk0 → (f(M0), . . . , f(Mτk

))

is invertible. (It can be seen as follows: Given Y n0 find 1 ≤ l ≤ n and

positive integers 0 = r0 < r1 < · · · < rl = n + 1 such that Y n0 =

(Y r1−1r0

, Y r2−1r1

, . . . , Y rl−1rl−1

), where 2 ≤ ri+1 − 1 − ri < 2k for 0 ≤ i < l − 1,rl − 1 − rl−1 = 2k and for 0 ≤ i < l, Y ri+1−1

ri = (f(0), f(1), . . . , f(ri+1 −1 − ri)). Now τk = n and Mri+1−1

ri = (0, 1, . . . , ri+1 − 1 − ri) for 0 ≤ i < l.

Page 587: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

570 27. Dependent Observations

This construction is always possible under our postulates that M0 = 0 andτk = n.) Let

B+k =

M0 = 0, E(f(M0), . . . , f(Mτk

)) ≥ 14

and

B−k =

M0 = 0, E(f(M0), . . . , f(Mτk

)) <14

.

Now notice that the sets B+k and B−

k do not depend on the future values off(2r+1) for r ≥ k. One of the two sets B+

k , B−k has at least probability 1/8.

Now we specify f(2k+1). Let f(2k+1) = 1, Ik = B−k if PB−

k ≥ PB+k

and let f(2k + 1) = 0, Ik = B+k if PB−

k < PB+k . Because of the

construction of Mi, on event Ik,

EYτk+1|Y τk0 = f(2k + 1)PYτk+1 = f(2k + 1)|Y τk

0 = f(2k + 1)PMτk+1 = 2k + 1|Mτk

0 = 0.5f(2k + 1).

The difference of the estimate and the conditional expectation is at least 14

on set Ik and this event occurs with probability not less than 1/8. Finally,by Fatou’s lemma,

P

lim supn→∞

|E(Y n−10 ) − EYn|Y n−1

0 | ≥ 1/4

≥ P

lim supn→∞

|E(Y n−10 ) − EYn|Y n−1

0 | ≥ 1/4, Y0 = Y1 = 0, Y2 = 1

≥ P

lim supk→∞

|E(f(M0), . . . , f(Mτk))

− Ef(Mτk+1)|f(M0), . . . , f(Mτk)| ≥ 1/4,M0 = 0

≥ P

lim supk→∞

Ik

= E

lim supk→∞

IIk

≥ lim supk→∞

EIIk

= lim supk→∞

PIk ≥ 18.

One-Step Dynamic Forecasting. Find an estimator E(Y n−10 ) of the

value EYn|Yn−1 such that, for all stationary and ergodic sequences Yi,

limn→∞

|E(Y n−10 ) − EYn|Yn−1| = 0 a.s.

Page 588: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

27.2. Dynamic Forecasting: Autoregression 571

We show next that the one-step dynamic forecasting is impossible.

Theorem 27.2. For any estimator E(Y n−10 ) there is a stationary

ergodic real-valued process Yi such that

P

lim supn→∞

|E(Y n−10 ) − EYn|Yn−1| ≥ 1/8

≥ 1

8.

Proof. We will use the Markov process Mi defined in the proof of The-orem 27.1. Note that one must pass through state s to get to any states′ > s from 0. We construct a process Yi which is in fact just a relabeledversion of Mi. This construct uses a different (invertible) function f(·),for Yi = f(Mi). Define f(0)=0, f(s) = Ls + 2−s if s > 0 where Ls is either0 or 1 as specified later. In this way, knowing Yi is equivalent to knowingMi and vice versa. Thus Yi = f(Mi) where f is one-to-one. For s ≥ 2 theconditional expectation is

EYt|Yt−1 = Ls + 2−s =Ls+1 + 2−(s+1)

2.

We complete the description of the function f(·) and thus the conditionalexpectation by defining Ls+1 so as to confound any proposed predictorE(Y n−1

0 ). Let τs denote the time of the first occurrence of state s:

τs = mini ≥ 0 : Mi = s.Let L1 = L2 = 0. Suppose s ≥ 2. Assume we specified Li for i ≤ s. Define

B+s = Y0 = 0, E(Y τs

0 ) ≥ 1/4and

B−s = Y0 = 0, E(Y τs

0 ) < 1/4.One of the two sets has at least probability 1/8. Take Ls+1 = 1 and Is = B−

s

if PB−s ≥ PB+

s . Let Ls+1 = 0 and Is = B+s if PB−

s < PB+s . The

difference of the estimation and the conditional expectation is at least 1/8on set Is and this event occurs with probability not less than 1/8. ByFatou’s lemma,

P

lim supn→∞

|E(Y n−10 ) − EYn|Yn−1| ≥ 1/8

≥ P

lim sups→∞

|E(Y τs0 ) − EYτs+1|Yτs

| ≥ 1/8, Y0 = 0

≥ P

lim sups→∞

Is

≥ lim sups→∞

PIs ≥ 1/8.

Page 589: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

572 27. Dependent Observations

27.3 Static Forecasting: General Case

Consider the static forecasting problem. Assume, in general, that (Xi, Yi)is a stationary and ergodic time series. One wishes to infer the conditionalexpectation

EY0|X0−∞, Y

−1−∞.

Let Pn = An,j and Qn = Bn,j be sequences of nested partitions ofRd and R, respectively, and an and bn the corresponding quantizers withreproduction points an,j ∈ An,j and bn,j ∈ Bn,j , respectively,

an(x) = an,j if x ∈ An,j

and

bn(y) = bn,j if y ∈ Bn,j .

The estimate is as follows: Set λ0 = 1 and for 0 < k < ∞ define recursively

τk = mint > 0 : ak(X−t−λk−1−t) = ak(X0

−λk−1), bk(Y −1−t

−λk−1−t) = bk(Y −1−λk−1

),and

λk = τk + λk−1.

The number τk is random, but is finite almost surely.The kth estimate of EY0|X0

−∞, Y−1−∞ is provided by

mk =1k

1≤j≤k

Y−τj . (27.2)

To obtain a fixed sample-size version we apply the same method as inAlgoet (1992). For 0 < t < ∞ let κt denote the maximum of the integers ksuch that λk ≤ t. Formally, let

κt = maxk ≥ 0 : λk ≤ t.Now put

m−t = mκt. (27.3)

Theorem 27.3. Assume that Pk asymptotically generates the Borel σ-algebra, i.e., F(Pk) ↑ Bd and

εk = supjdiam(Bk,j) → 0.

Then

limk→∞

mk = EY0|X0−∞, Y

−1−∞ a.s.

and

limt→∞

m−t = EY0|X0−∞, Y

−1−∞ a.s.

Page 590: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

27.3. Static Forecasting: General Case 573

Proof. Following the technique by Morvai (1995) we first show that for

Fj = F(aj(X0−λj

), bj(Y −1−λj

))

and Borel set C,

PY−τj ∈ C|Fj−1 = PY0 ∈ C|Fj−1. (27.4)

For 0 < m < ∞, 0 < l < ∞, any sequences x0−m and y−1

−m, and Borel set Cwe prove that

T−l(aj−1(X0−m) = aj−1(x0

−m),

bj−1(Y −1−m) = bj−1(y−1

−m), λj−1 = m, τj = l, Y−l ∈ C)

= aj−1(X0−m) = aj−1(x0

−m),

bj−1(Y −1−m) = bj−1(y−1

−m), λj−1 = m, τj = l, Y0 ∈ Cwhere T denotes the left shift operator and

τk = mint > 0 : ak(Xt−λk−1+t) = ak(X0

−λk−1), bk(Y −1+t

−λk−1+t)

= bk(Y −1−λk−1

).

Since the event λj−1 = m is measurable with respect toF(aj−1(X0

−m), bj−1(Y −1−m)), either

aj−1(X0−m) = aj−1(x0

−m), bj−1(Y −1−m) = bj−1(y−1

−m), λj−1 = m = ∅and the statement is trivial, or

aj−1(X0−m) = aj−1(x0

−m), bj−1(Y −1−m) = bj−1(y−1

−m), λj−1 = m= aj−1(X0

−m) = aj−1(x0−m), bj−1(Y −1

−m) = bj−1(y−1−m).

Then

T−l(aj−1(X0−m) = aj−1(x0

−m),

bj−1(Y −1−m) = bj−1(y−1

−m), τj = l, Y−l ∈ C)

= T−l(aj−1(X0−m) = aj−1(x0

−m), aj(X−l−m−l) = aj(X0

−m),

bj−1(Y −1−m) = bj−1(y−1

−m), bj(Y −1−l−m−l) = bj(Y −1

−m),

aj(X−t−m−t) = aj(X0

−m) or bj(Y −1−t−m−t) = bj(Y −1

−m) for all 0 < t < l,

Y−l ∈ C)

= aj−1(X l−m+l) = aj−1(x0

−m), aj(X0−m) = aj(X l

−m+l),

bj−1(Y −1+l−m+l) = bj−1(y−1

−m), bj(Y −1−m) = bj(Y −1+l

−m+l),

aj(X−t+l−m−t+l) = aj(X l

−m+l)

or bj(Y −1−t+l−m−t+l) = bj(Y −1+l

−m+l) for all 0 < t < l,

Page 591: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

574 27. Dependent Observations

Y0 ∈ C= aj−1(X0

−m) = aj−1(x0−m), aj(X0

−m) = aj(X l−m+l),

bj−1(Y −1−m) = bj−1(y−1

−m), bj(Y −1−m) = bj(Y −1+l

−m+l),

aj(Xt−m+t) = aj(X0

−m) or bj(Y −1+t−m+t) = bj(Y −1

−m) for all 0 < t < l,

Y0 ∈ C= aj−1(X0

−m) = aj−1(x0−m),

bj−1(Y −1−m) = bj−1(y−1

−m), τj = l, Y0 ∈ C,and the equality of the events is proven. Now note that each generatingatom of Fj−1 (there are countably many of them) has the form

H = aj−1(X0−m) = aj−1(x0

−m), bj−1(Y −1−m) = bj−1(y−1

−m), λj−1 = m,so we have to show that for any atom H the following equality holds:

H

PY−τj∈ C|Fj−1 dP =

H

PY0 ∈ C|Fj−1 dP,

which can be done using the properties of conditional probability and theprevious identities:∫

H

PY−τj∈ C|Fj−1 dP = PH ∩ Y−τj

∈ C

=∑

1≤l<∞PH ∩ τj = l, Y−l ∈ C

=∑

1≤l<∞PT−l(H ∩ τj = l, Y−l ∈ C)

=∑

1≤l<∞PH ∩ τj = l, Y0 ∈ C

= PH ∩ Y0 ∈ C

=∫

H

PY0 ∈ C|Fj−1 dP.

Introduce another estimate:

mk =1k

1≤j≤k

bj(Y−τj ).

By condition,

|mk − mk| ≤ 1k

1≤j≤k

|Y−τj− bj(Y−τj

)| ≤ 1k

1≤j≤k

εj → 0,

thus it suffices to show that

limk→∞

mk = EY0|X0−∞, Y

−1−∞ a.s.

Page 592: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

27.3. Static Forecasting: General Case 575

We have that

mk − EY0|X0−∞, Y

−1−∞

=1k

1≤j≤k

(bj(Y−τj ) − Ebj(Y−τj )|Fj−1)

+1k

1≤j≤k

(Ebj(Y−τj )|Fj−1 − Ebj(Y0)|Fj−1)

+1k

1≤j≤k

(Ebj(Y0)|Fj−1 − EY0|Fj−1)

+1k

1≤j≤k

(EY0|Fj−1 − EY0|X0−∞, Y

−1−∞)

= I1 + I2 + I3 + I4.

By (27.4),

I2 = 0,

moreover,

|I3| ≤ 1k

1≤j≤k

εj → 0.

Since Fj ↑ F(X0−∞, Y

−1−∞) the martingale convergence theorem (Theorem

A.4) implies that

EY0|Fj−1 → EY0|X0−∞, Y

−1−∞ a.s.,

so

I4 → 0 a.s.

Observe that I1 is an average of martingale differences. Apply TheoremA.6 for ci = 1, then we have to show that

supj

E(bj(Y−τj ) − Ebj(Y−τj )|Fj−1)2 < ∞.

Use again (27.4) then

E(bj(Y−τj) − Ebj(Y−τj

)|Fj−1)2 ≤ Ebj(Y−τj)2

= EEbj(Y−τj )2|Fj−1

= EEbj(Y0)2|Fj−1= Ebj(Y0)2≤ E2(Y 2

0 + ε2j )≤ 2EY 2

0 + 2 maxk

ε2k < ∞.

Page 593: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

576 27. Dependent Observations

The second statement is an obvious consequence of the first one.

27.4 Time Series Problem: Cesaro Consistency

Introduce the notation

(X,Y) = (Xi, Yi)∞i=−∞,

then

m−n(X0−n, Y

−1−n ) = m−n(X,Y).

Define the estimate

mn(Xn0 , Y

n−10 ) = m−n(Tn(X,Y)).

We show that mn is Cesaro consistent a.s.

Theorem 27.4. Assume that |Y | ≤ L a.s. Then, under the conditions ofTheorem 27.3,

limn→∞

1n

n∑

i=1

(mi − EYi|Xi0, Y

i−10 )2 = 0 a.s.

Proof. Put

Z = (X,Y)

and

fn(Z) = (m−n(X,Y) − EY0|X0−n, Y

−1−n )2.

Then Theorem 27.3 and the martingale convergence theorem (TheoremA.4) imply that

fn(Z) → f(Z) = 0 a.s.

Because of |Y | ≤ L we have that |fn(Z)| ≤ L, so by Lemma 27.2,

limn→∞

1n

n∑

i=1

fi(T iZ) = limn→∞

1n

n∑

i=1

(mi−EYi|Xi0, Y

i−10 )2 = Ef(Z) = 0 a.s.

27.5 Time Series Problem: Universal Prediction

Unfortunately, the Morvai, Yakowitz, and Gyorfi (1996) estimator con-sumes data very rapidly. Morvai, Yakowitz, and Algoet (1997) proposed a

Page 594: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

27.5. Time Series Problem: Universal Prediction 577

very simple estimator which seems to be more efficient and is weakly con-sistent. In this section we introduce another Cesaro consistent estimator,which is derived from a sequential prediction of a real-valued sequence. Ateach time instant i = 1, 2, . . ., the predictor is asked to guess the value ofthe next outcome yi of a sequence of real numbers y1, y2, . . . with knowledgeof the past (xi

1, yi−11 ). Thus, the predictor’s estimate, at time i, is based on

the value of (xi1, y

i−11 ). Formally, the strategy of the predictor is a sequence

g = gi∞i=1 of decision functions

gi : (Rd)i × Ri−1 → Rand the prediction formed at time i is gi(xi

1, yi−11 ). After n rounds of play,

the normalized cumulative prediction error on the string xn1 , y

n1 is

Ln(g) =1n

n∑

i=1

(gi(xi1, y

i−11 ) − yi)2.

The fundamental limit for the predictability of the sequence can be de-termined based on a result of Algoet (1994) who showed that, for anyprediction strategy g and stationary ergodic process Xn, Yn∞

−∞,

lim infn→∞

Ln(g) ≥ L∗ a.s.,

where

L∗ = E(Y0 − EY0|X0

−∞, Y−1−∞)2

is the minimal mean square error of any prediction for the value of Y0 basedon the infinite past X0

−∞, Y−1−∞. We prove this if both |Y | and |gi(Xi

1, Yi−11 )|

are bounded by L. Then for Fi−1 = F(Xi1, Y

i−11 ) we get that

Ln(g) =1n

n∑

i=1

(gi(Xi1, Y

i−11 ) − Yi)2

=1n

n∑

i=1

((gi(Xi

1, Yi−11 ) − Yi)2 − E(gi(Xi

1, Yi−11 ) − Yi)2|Fi−1

)

+1n

n∑

i=1

E(gi(Xi1, Y

i−11 ) − Yi)2|Fi−1

≥ 1n

n∑

i=1

((gi(Xi

1, Yi−11 ) − Yi)2 − E(gi(Xi

1, Yi−11 ) − Yi)2|Fi−1

)

+1n

n∑

i=1

E(EYi|Fi−1 − Yi)2|Fi−1,

where the first term on the right-hand side is an average of bounded mar-tingale difference, so it tends to 0 a.s. (Theorem A.6), while the secondterm tends to L∗ a.s. (Theorem A.4).

Page 595: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

578 27. Dependent Observations

This lower bound gives sense to the following definition:

Definition 27.3. A prediction strategy g is called universal with respectto a class C of stationary and ergodic processes (Xn, Yn)∞

−∞, if for eachprocess in the class,

limn→∞

Ln(g) = L∗ a.s.

We introduce our prediction strategy for bounded ergodic processes. Weassume that |Y0| is bounded by a constant B > 0, with probability one,and the bound B is known.

The prediction strategy is defined, at each time instant, as a convex com-bination of elementary predictors, where the weighting coefficients dependon the past performance of each elementary predictor.

We define an infinite array of elementary predictors h(k,) (k, = 1, 2, . . .)as follows. Let P = A,j and Q = B,j be sequences of finite parti-tions of Rd and R, respectively, and let G and H be the correspondingquantizers:

G(x) = j if x ∈ A,j

and

H(y) = j if y ∈ B,j .

Fix positive integers k, , and for each k + 1, k-long strings s, z of positiveintegers, define the partitioning regression function estimate

E(k,)n (xn

1 , yn−11 , s, z) =

∑k<i<n:G(xi

i−k)=s,H(y

i−1i−k

)=z yi∣∣k < i < n : G(xi

i−k) = s,H(yi−1i−k) = z∣∣ ,

n > k + 1, where 0/0 is defined to be 0.Now we define the elementary predictor h(k,) by

h(k,)n (xn

1 , yn−11 ) = E(k,)

n (xn1 , y

n−11 , G(xn

n−k), H(yn−1n−k)), n = 1, 2, . . . .

That is, h(k,) quantizes the sequence xn1 , y

n1 according to the partitions

P and Q, and looks for all appearances of the last-seen quantized stringsG(xn

n−k), H(yn−1n−k) of length k+1, k in the past. Then it predicts according

to the average of the yi’s following the string.The proposed prediction algorithm proceeds as follows: let qk, be a

probability distribution on the set of all pairs (k, ) of positive integers suchthat for all k, , qk, > 0. Put c = 8B2, and define the weights

wt,k, = qk,e−(t−1)Lt−1(h(k,))/c

and their normalized values

vt,k, =wt,k,∑∞

i,j=1 wt,i,j.

Page 596: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

27.5. Time Series Problem: Universal Prediction 579

The prediction strategy g is defined by

gt(xt1, y

t−11 ) =

∞∑

k,=1

vt,k,h(k,)(xt

1, yt−11 ), t = 1, 2, . . . . (27.5)

Theorem 27.5. Assume that the sequences of partitions P and Q arenested and asymptotically generate the Borel σ-field. Then the predictionscheme g defined above is universal with respect to the class of all ergodicprocesses such that PYi ∈ [−B,B] = 1.

One of the main ingredients of the proof is the following lemma, whoseproof is a straightforward extension of standard arguments in the predictiontheory of individual sequences.

Lemma 27.3. Let h1, h2, . . . be a sequence of prediction strategies (ex-perts), and let qk be a probability distribution on the set of positiveintegers. Assume that hi(xn

1 , yn−11 ) ∈ [−B,B] and yn

1 ∈ [−B,B]n. Define

wt,k = qke−(t−1)Lt−1(hk)/c

with c ≥ 8B2, and

vt,k =wt,k∑∞i=1 wt,i

.

If the prediction strategy g is defined by

gt(xt1, y

t−11 ) =

∞∑

k=1

vt,khk(xt1, y

t−11 )

then, for every n ≥ 1,

Ln(g) ≤ infk

(Ln(hk) − c ln qk

n

).

Here − ln 0 is treated as ∞.

Proof. Introduce W1 = 1 and Wt =∑∞

k=1 wt,k for t > 1. First we showthat, for each t > 1,

[ ∞∑

k=1

vt,k

(yt − hk(xt

1, yt−11 )

)]2

≤ −c lnWt+1

Wt. (27.6)

Note that

Wt+1 =∞∑

k=1

wt,ke−(yt−hk(xt

1,yt−11 ))2

/c = Wt

∞∑

k=1

vt,ke−(yt−hk(xt

1,yt−11 ))2

/c,

so that

−c lnWt+1

Wt= −c ln

( ∞∑

k=1

vt,ke−(yt−hk(xt

1,yt−11 ))2

/c

)

.

Page 597: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

580 27. Dependent Observations

Therefore, (27.6) becomes

exp

−1c

[ ∞∑

k=1

vt,k

(yt − hk(xt

1, yt−11 )

)]2

≥∞∑

k=1

vt,ke−(yt−hk(xt

1,yt−11 ))2

/c,

which is implied by Jensen’s inequality and the concavity of the functionFt(z) = e−(yt−z)2/c for c ≥ 8B2. Thus, (27.6) implies that

nLn(g) =n∑

t=1

(yt − g(xt

1, yt−11 )

)2

=n∑

t=1

[ ∞∑

k=1

vt,k

(yt − hk(xt

1, yt−11 )

)]2

≤ −cn∑

t=1

lnWt+1

Wt

= −c lnWn+1

= −c ln

( ∞∑

k=1

wn+1,k

)

= −c ln

( ∞∑

k=1

qke−nLn(hk)/c

)

≤ −c ln(

supkqke

−nLn(hk)/c

)

= infk

(−c ln qk + nLn(hk)

),

which concludes the proof.

Proof of Theorem 27.5. By a double application of the ergodic theorem,as n → ∞,

E(k,)n (Xn

1 , Yn−11 , s, z)

=1n

∑k<i<n:G(Xi

i−k)=s,H(Y

i−1i−k

)=z Yi

1n

∣∣k < i < n : G(Xi

i−k) = s,H(Y i−1i−k ) = z∣∣

→EY0IG(X0

−k)=s,H(Y

−1−k

)=zPG(X0

−k) = s,H(Y −1−k ) = z

= EY0|G(X0−k) = s,H(Y −1

−k ) = z a.s.

s and z may have finitely many values, therefore,

limn→∞

sups,z

|E(k,)n (Xn

1 , Yn−11 , s, z) − EY0|G(X0

−k) = s,H(Y −1−k ) = z| = 0

Page 598: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

27.5. Time Series Problem: Universal Prediction 581

a.s. Thus, by Lemma 27.2, as n → ∞,

Ln(h(k,)) =1n

n∑

i=1

(h(k,)(Xi1, Y

i−11 ) − Yi)2

=1n

n∑

i=1

(E(k,)n (Xi

1, Yi−11 , Xi

i−k, Yi−1i−k ) − Yi)2

→ E(Y0 − EY0|G(X0

−k), H(Y −1−k ))2

def= εk, a.s.

Since the partitions P and Q are nested, EY0|G(X0

−k), H(Y −1−k )

is

a martingale indexed by the pair (k, ). Thus, by the martingale con-vergence theorem (Theorem A.4) εk, converges and, since the partitionsasymptotically generate the Borel σ-field, the limit is L∗:

limk,→∞

εk, = E(Y0 − EY0|X0

−∞, Y−1−∞)2

= L∗.

Now by Lemma 27.3,

Ln(g) ≤ infk,

(Ln(h(k,)) − c ln qk,

n

), (27.7)

and therefore,

lim supn→∞

Ln(g) ≤ lim supn→∞

infk,

(Ln(h(k,)) − c ln qk,

n

)

≤ infk,

lim supn→∞

(Ln(h(k,)) − c ln qk,

n

)

≤ infk,

lim supn→∞

Ln(h(k,))

= infk,εk,

= limk,→∞

εk,

= L∗ a.s.,

and the proof of the theorem is finished.

Theorem 27.5 shows that, asymptotically, the predictor gt defined by(27.5) predicts as well as the optimal predictor given by the regressionfunction EYt|Xt

−∞, Yt−1−∞ . In fact, gt gives a Cesaro consistent estimate

of the regression function in the following sense:

Corollary 27.1. Under the conditions of Theorem 27.5,

limn→∞

1n

n∑

i=1

(EYi|Xi

−∞, Yi−1−∞ − gi(Xi

1, Yi−11 )

)2= 0 a.s.

Page 599: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

582 27. Dependent Observations

Proof. By Theorem 27.5,

limn→∞

1n

n∑

i=1

(Yi − gi(Xi

1, Yi−11 )

)2= L∗ a.s.

Consider the following decomposition:(Yi − gi(Xi

1, Yi−11 )

)2

=(Yi − EYi|Xi

−∞, Yi−1−∞ )2

+ 2(Yi − EYi|Xi

−∞, Yi−1−∞ ) (EYi|Xi

−∞, Yi−1−∞ − gi(Xi

1, Yi−11 )

)

+(EYi|Xi

−∞, Yi−1−∞ − gi(Xi

1, Yi−11 )

)2.

Then the ergodic theorem implies that

limn→∞

1n

n∑

i=1

(Yi − EYi|Xi

−∞, Yi−1−∞ )2 = L∗ a.s.

It remains to show that

1n

n∑

i=1

(Yi − EYi|Xi

−∞, Yi−1−∞ ) (EYi|Xi

−∞, Yi−1−∞ − gi(Xi

1, Yi−11 )

)

converges to 0 a.s., which is a consequence of Theorem A.6 with ci = 1since the martingale differences

Zi =(Yi − EYi|Xi

−∞, Yi−1−∞ ) (EYi|Xi

−∞, Yi−1−∞ − gi(Xi

1, Yi−11 )

)

are bounded by 4B2.

27.6 Estimating Smooth Regression Functions

For regression function estimation from ergodic data there are similarnegative findings:

Theorem 27.6. (Nobel (1999)). There is no weakly consistent regres-sion estimate for all stationary and ergodic sequences (Xi, Yi) with0 ≤ Xi ≤ 1 and Yi ∈ 0, 1.Theorem 27.7. (Yakowitz and Heyde (1998)). There is no weaklyconsistent autoregression estimate for all stationary and ergodic sequences(Xi, Yi), where Yi = Xi+1.

Because of these theorems we assume some smoothness on the regressionfunction. Now we attack the problem of estimating the regression functionm(x) which combines partitioning estimation with a series expansion.

Page 600: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

27.6. Estimating Smooth Regression Functions 583

Let Pk = Ak,i be a nested cubic partition of Rd with volume (2−k−2)d.Define Ak(x) to be the partition cell of Pk in which x falls. Take

Mk(x) := EY |X ∈ Ak(x). (27.8)

By Lemma 24.10,

Mk(x) → m(x) (27.9)

for µ-almost all x ∈ Rd. Motivated by (27.9) our analysis depends on therepresentation,

m(x) = M1(x) +∞∑

k=2

∆k(x) = limkMk(x)

for µ-almost all x ∈ Rd, where

∆k(x) = Mk(x) −Mk−1(x).

Now, for integer k > 1 and real L > 0, define

∆k,L(x) = sign(Mk(x) −Mk−1(x)) min(|Mk(x) −Mk−1(x)|, L2−k).

Let

mL(x) := M1(x) +∞∑

i=1

∆i,L(x).

Notice that |∆i,L| ≤ L2−i, hence mL(x) is well-defined for all x ∈ Rd.The crux of the truncated partitioning estimate is inference of the terms

∆k(x) = Mk(x) −Mk−1(x). Define

Mk,n(x) :=

∑nj=1 YjIXj∈Ak(x)∑n

j=1 IXj∈Ak(x).

Now for k > 1, put

∆k,n,L(x) = sign(Mk,n(x) − Mk−1,n(x)) min(|Mk,n(x) − Mk−1,n(x)|, L2−k)

and

mn,L(x) = M1,n(x) +Nn∑

k=2

∆k,n,L(x).

Theorem 27.8. Let (Xi, Yi) be a stationary ergodic time series. AssumeNn → ∞. Then, for µ-almost all x ∈ Rd,

mn,L(x) → mL(x) a.s. (27.10)

If the support S of µ is a bounded subset of Rd, then

supx∈S

|mn,L(x) −mL(x)| → 0 a.s. (27.11)

If in addition, either:

Page 601: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

584 27. Dependent Observations

(i) |Y | ≤ D < ∞ (D need not be known); or(ii) µ is of bounded support,

then∫

(mn,L(x) −mL(x))2µ(dx) → 0 a.s. (27.12)

Proof. Define the support S of µ as

S := x ∈ Rd : µ(Ak(x)) > 0 for all k ≥ 1.Then µ(S) = 1. Assume µ(Ak,i) > 0. Then by the ergodic theorem, asn → ∞,

∑nj=1 IXj∈Ak,i

n→ PX ∈ Ak,i = µ(Ak,i) a.s.

Similarly,∑n

j=1 IXj∈Ak,iYj

n→ EY IX∈Ak,i =

Ak,i

m(z)µ(dz) a.s.,

which is finite since EY is finite. Now apply this for Ak(x) = Ak,i. One canwrite Mk,n(x) as the ratio of these two almost surely convergent sequences.Thus, for all x ∈ S,

Mk,n(x) → Mk(x) a.s.

and so

∆k,n,L(x) → ∆k,L(x) a.s.

Let integer R > 1 be arbitrary. Let n be so large that Nn > R. For allx ∈ S,

|mn,L(x) −mL(x)|

≤ |M1,n(x) −M1(x)| +Nn∑

k=2

|∆k,n,L(x) − ∆k,L(x)| +∞∑

k=Nn+1

|∆k,L(x)|

≤ |M1,n(x) −M1(x)| +R∑

k=2

|∆k,n,L(x) − ∆k,L(x)|

+∞∑

k=R+1

(|∆k,n,L(x)| + |∆k,L(x)|)

≤ |M1,n(x) −M1(x)| +R∑

k=2

|∆k,n,L(x) − ∆k,L(x)| + 2L∞∑

k=R+1

2−k

Page 602: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

27.6. Estimating Smooth Regression Functions 585

≤ |M1,n(x) −M1(x)| +R∑

k=2

|∆k,n,L(x) − ∆k,L(x)| + L2−(R−1).

Now, for all x ∈ S,

|M1,n(x) −M1(x)| +R∑

k=2

|∆k,n,L(x) − ∆k,L(x)| → 0 a.s.

Then

lim supn→∞

|mn,L(x) −mL(x)| ≤ L2−(R−1) a.s.

Since R was arbitrary, (27.10) is proved. Now we prove (27.11). Assumethe support S of µ is bounded. Let Ak denote the set of cells from partitionPk with nonempty intersection with S. That is, define

Ak = A ∈ Pk : A ∩ S = ∅.Since S is bounded, Ak is a finite set. For A ∈ Pk, let a(A) be the centerof A. Then

supx∈S

(

|M1,n(x) −M1(x)| +R∑

k=2

|∆k,n,L(x) − ∆k,L(x)|)

≤ maxA∈A1

|M1,n(a(A)) −M1(a(A))|

+R∑

k=2

maxA∈Ak

|∆k,n,L(a(A)) − ∆k,L(a(A))|

→ 0

keeping in mind that only finitely many terms are involved in the maxi-mization operation. The rest of the proof goes virtually as before. Now weprove (27.12),

|mn,L(x) −mL(x)|2

≤ 2

|M1,n(x) −M1(x)|2 +

(

M1(x) +Nn∑

k=2

∆k,n,L(x) −mL(x)

)2

.

If condition (i) holds, then for the first term we have dominated convergence

|M1,n(x) −M1(x)|2 ≤ (2D)2,

and for the second one, too,∣∣∣∣∣M1(x) +

Nn∑

k=2

∆k,n(x) −mL(x)

∣∣∣∣∣

≤∞∑

k=2

(|∆k,n| + |∆k,L|)

≤ 2L,

Page 603: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

586 27. Dependent Observations

and thus (27.12) follows by Lebesgue’s dominated convergence theorem,

0 =∫

limn→∞

|mn(x) −mL(x)|2µ(dx) = limn→∞

∫|mn(x) −mL(x)|2µ(dx)

a.s. If condition (ii) holds then (27.12) follows from (27.11).

Corollary 27.2. Assume m(x) is Lipschitz continuous with Lipschitz con-stant L/

√d. Then Theorem 27.7 holds with mL(x) = m(x) for x ∈ S.

Proof. Since m(x) is Lipschitz with constant L/√d, for x ∈ S,

|Mk(x) −m(x)| ≤∣∣∣∣∣

∫Ak(x)m(y)µ(dy)

µ(Ak(x))−m(x)

∣∣∣∣∣

≤ 1µ(Ak(x))

Ak(x)|m(y) −m(x)|µ(dy)

≤ 1µ(Ak(x))

Ak(x)(L/

√d)(2−k−2

√d)µ(dy)

= L2−k−2.

For x ∈ S we get

|Mk(x) −Mk−1(x)| ≤ |Mk(x) −m(x)| + |m(x) −Mk−1(x)|≤ L2−k−2 + L2−k−1

< L2−k.

Hence for µ-almost all x ∈ Rd,

mL(x) = M1(x) +∞∑

k=2

∆k(x) = m(x),

and Corollary 27.2 is proved.

If there is no truncation, that is, L = ∞, then mn = MNn,n; that is, mn

is the standard partitioning estimate.Our consistency is not universal, however, since m should be Lipschitz

continuous.Nn can be data dependent, provided Nn → ∞ a.s.

In order to introduce the truncated kernel estimate let K(x) be a non-negative kernel function with

bIx∈SO,r ≤ K(x) ≤ Ix∈SO,1,

where 0 < b ≤ 1 and 0 < r ≤ 1. (Sz,r denotes the ball around z with radiusr.) Choose

hk = 2−k−2

Page 604: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

27.7. Bibliographic Notes 587

and

M∗k (x) =

EY K(X−xhk

)EK(X−x

hk) =

∫m(z)K( z−x

hk)µ(dz)

∫K( z−x

hk)µ(dz)

.

Let

∆∗k(x) = M∗

k (x) −M∗k−1(x).

By Lemma 24.8 we have (27.9). Now for k > 1, define

∆∗k,L(x) = sign(M∗

k (x) −M∗k−1(x)) min(|M∗

k (x) −M∗k−1(x)|, L2−k).

Let

m∗L(x) := M∗

1 (x) +∞∑

i=1

∆∗i,L(x).

Put

M∗k,n(x) :=

∑nj=1 YjK(Xj−x

hk)

∑nj=1K(Xj−x

hk).

For k > 1, put

∆∗k,n,L(x) = sign(M∗

k,n(x) − M∗k−1,n(x)) min(|M∗

k,n(x) − M∗k−1,n(x)|, L2−k)

and

m∗n,L(x) = M∗

1,n(x) +Nn∑

k=2

∆∗k,n,L(x).

Theorem 27.9. Assume Nn → ∞. Then, for µ-almost all x ∈ Rd,

m∗n(x) → m∗

L(x) a.s.

If, in addition, |Y | ≤ D < ∞ (D need not be known), then∫

(m∗n(x) −m∗

L(x))2µ(dx) → 0 a.s.

Corollary 27.3. Assume m(x) is Lipschitz continuous with Lipschitzconstant L/2. Then Theorem 27.9 holds with m∗

L(x) = m(x).

27.7 Bibliographic Notes

Under various mixing conditions on the data Dn the consistency of the con-ventional regression estimates has been summarized in Gyorfi et al. (1989).Lemma 27.1 is due to Birkhoff (1931). Versions of Lemma 27.2 have beenproved by Algoet (1994), Barron (1985), Breiman (1957) and Maker (1940).

Page 605: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

588 27. Dependent Observations

Bailey (1976) showed that the problem of dynamic forecasting cannot besolved. Bailey’s counterexample uses the technique of cutting and stack-ing developed by Ornstein (1974) (see also Shields (1991)). Bailey hasn’tpublished his result, so Ryabko (1988) rediscovered this result with a muchsimpler proof. Theorem 27.1 is due to Bailey (1976) and Ryabko (1988),while the current proof together with Theorem 27.2 is from Gyorfi, Morvai,and Yakowitz (1998) based on a clever counterexample of Ryabko (1988).For the static forecasting binary sequences Ornstein (1978) gave an esti-mator, which has been extended to real-valued data by Algoet (1992). Amuch simpler estimate can be found in Morvai, Yakowitz, and Gyorfi (1996)(cf. Theorem 27.3). Theorem 27.5 has been proved in Gyorfi, and Lugosi(2002). Lemma 27.3 is due to Kivinen and Warmuth (1999) and Singer andFeder (1999). Concerning the prediction of individual sequences we referto Haussler, Kivinen, and Warmuth (1998). Theorems 27.8 and 27.9 andtheir corollaries are from Yakowitz et al. (1999).

Problems and Exercises

Problem 27.1. Prove Theorem 27.9.

Page 606: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

Appendix ATools

A.1 A Denseness Result

The universal consistency of some nonparametric regression estimates isshown such that first the consistency is proved for continuous regressionfunctions, and then it is extended to arbitrary regression functions. Thisextension is possible because of the following theorem:

Theorem A.1. For any p ≥ 1 and any probability measure µ, the set ofcontinuous functions of bounded support is dense in Lp(µ), i.e. for anyf ∈ Lp(µ) and ε > 0 there is a continuous function g with compact supportsuch that

∫|f(x) − g(x)|p µ(dx) ≤ ε.

Proof. Without loss of generality assume that f ≥ 0. The case of arbitraryf can be handled similarily as the case f ≥ 0 by using the decompositionf = f+ − f− where f+ = maxf, 0 and f− = −minf, 0. f ∈ Lp(µ)implies that there is an open sphere S centered at the origin with radius Rand an integer K > 0 such that

∫ ∣∣f(x) − minf(x),KIx∈S

∣∣p µ(dx) ≤ ε/2p.

Page 607: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

590 Appendix A. Tools

Because of |a+ b|p ≤ 2p−1(|a|p + |b|p), we have that∫

|f(x) − g(x)|pµ(dx) ≤ 2p−1∫ ∣

∣f(x) − minf(x),KIx∈S∣∣p µ(dx)

+ 2p−1∫ ∣

∣minf(x),KIx∈S − g(x)∣∣p µ(dx),

and it suffices to show that there is a continuous function g with supportin the closure S of S such that

S

∣∣minf(x),KIx∈S − g(x)

∣∣p µ(dx) ≤ ε/2p.

Without loss of generality, assume that K = 1. Put

f∗(x) = minf(x), 1Ix∈S.

Using the technique of Lusin’s theorem we show that there is a continuousfunction g with support in S such that 0 ≤ g ≤ 1 and

µ(x ∈ S : f∗(x) = g(x)) ≤ ε/2p. (A.1)

Obviously, this will imply the theorem since∫

S

|f∗(x) − g(x)|p µ(dx) ≤ µ(x ∈ S; f∗(x) = g(x)) ≤ ε/2p.

We have now that 0 ≤ f∗ ≤ 1. A function s is called a simple functionif its range consists of finitely many points in [0,∞). We can construct amonotonically increasing sequence of simple functions s1 ≤ s2 ≤ · · · ≤ f∗

such that sn(x) → f∗(x) as n → ∞ for every x ∈ S. Indeed, for n = 1, 2, . . .,define the sets

En,i =x ∈ S :

i− 12n

≤ f∗(x) <i

2n

(1 ≤ i < 2n)

and

En,2n =x ∈ S :

2n − 12n

≤ f∗(x) ≤ 1

and the simple function

sn =2n∑

i=1

i− 12n

IEn,i.

Sets En,i are inverse images of intervals through a measurable func-tion and thus are measurable. Clearly the sequence sn is monotonicallyincreasing, sn ≤ f∗, and |sn(x) − f∗(x)| ≤ 2−n for all x ∈ S.

Now define t1 = s1 and tn = sn − sn−1 (n = 2, 3, . . .). By definition ofsn, sn(x) = sn−1(x) implies sn(x) − sn−1(x) = 2−n, therefore 2ntn is the

Page 608: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

A.1. A Denseness Result 591

f∗

sn1

Figure A.1. Approximation by step function.

indicator function of a set Tn ⊂ S and

f∗(x) =∞∑

n=1

tn(x).

Then there exist compact sets Un and open sets Vn such that Un ⊂ Tn ⊂Vn ⊂ S ⊂ S and µ(Vn − Un) < 2−nε/2p. Next we need a special versionof Urysohn’s lemma which states that for any compact set U and boundedopen set V such that U ⊂ V there exists a continuous function h : Rd →[0, 1] such that h(x) = 1 for x ∈ U and h(x) = 0 for x ∈ Rd \ V . In orderto show this special case of Urysohn’s lemma for any set A introduce thefunction

d(x,A) = infz∈A

‖x− z‖.

Then d(x,A) (as a function of x) is continuous, and d(x,A) = 0 iff x liesin the closure of A. Such a function h can be defined as

h(x) =d(x,Rd \ V )

d(x, U) + d(x,Rd \ V ).

( [ ] ) ( [ ] )

1

Figure A.2. Approximation of a step function by a continuous function.

Page 609: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

592 Appendix A. Tools

Thus by Urysohn’s lemma there are continuous functions hn : Rd → [0, 1]that assume value 1 on Un and are zero outside Vn. Let

g(x) =∞∑

n=1

2−nhn(x).

Since the functions hn are bounded by 1 this series converges uniformly on Sand thus g is continuous. Also its support lies in S. Since 2−nhn(x) = tn(x)except in Vn − Un, we obtain g(x) = f∗(x) except in

⋃∞n=1(Vn − Un), but

µ

( ∞⋃

n=1

(Vn − Un)

)

≤∞∑

n=1

µ(Vn − Un) ≤ ε/(2p)∞∑

n=1

2−n = ε/2p,

which proves (A.1).

Corollary A.1. For any p ≥ 1 and any probability measure µ, the setof infinitely often differentiable functions of bounded support is dense inLp(µ).

Proof. We notice that the function g in the proof of Theorem A.1 can beapproximated arbitrarily well by an infinitely often differentiable functionhaving bounded support with respect to the supremum norm and thusalso with respect to ‖ · ‖Lp(µ). In fact, choose K ∈ C∞(Rd) with K ≥ 0,∫K(x) dx = 1, K(x) = 0 for ‖x‖ > 1, e.g.,

K(x) =c · exp (−1/(1 − ‖x‖)) if ‖x‖ < 1,0 else,

with suitable c > 0, and set

gh(x) =1hd

Rd

K

(x− z

h

)g(z) dz, h > 0.

Then gh is infinitely often differentiable, has bounded support, and satisfies

supx∈Rd

|g(x) − gh(x)| ≤ supx,z∈Rd,‖x−z‖≤h

|g(x) − g(z)| → 0 (h → 0).

A.2 Inequalities for Independent RandomVariables

Lemma A.1. (Chernoff (1952)). Let B be a binomial random variablewith parameters n and p. Then, for 1 > ε > p > 0,

PB > nε ≤ e−n[ε log εp +(1−ε) log 1−ε

1−p ] ≤ e−n[p−ε+ε log(ε/p)]

Page 610: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

A.2. Inequalities for Independent Random Variables 593

and, for 0 < ε < p < 1,

PB < nε ≤ e−n[ε log εp +(1−ε) log 1−ε

1−p ] ≤ e−n[p−ε+ε log(ε/p)].

Proof. We proceed by Chernoff’s exponential bounding method. Inparticular, for arbitrary s > 0,

PB > nε = PsB > snε= PesB > esnε≤ e−snεEesB

(by the Markov inequality)

= e−snεn∑

k=0

esk(nk

)pk(1 − p)n−k

= e−snε(esp+ 1 − p)n

= [e−sε(esp+ 1 − p)]n.

Next choose s such that

es =ε

1 − ε

1 − p

p.

With this value we get

e−sε(esp+ 1 − p) = e−ε·log( ε1−ε

1−pp ) ·

1 − ε

1 − p

p· p+ 1 − p

)

= e−ε·log( εp

1−p1−ε ) ·

(ε · 1 − p

1 − ε+ 1 − p

)

= e−ε·log εp −ε·log 1−p

1−ε +log 1−p1−ε

= e−ε·log εp +(1−ε)·log 1−p

1−ε ,

which implies the first inequality.The second inequality follows from

(1 − ε) log1 − ε

1 − p= −(1 − ε) log

1 − p

1 − ε

= −(1 − ε) log(

1 +ε− p

1 − ε

)

≥ −(1 − ε) · ε− p

1 − ε

(by log(1 + x) ≤ x)

= p− ε.

Page 611: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

594 Appendix A. Tools

To prove the second half of the lemma, observe that n − B is a binomialrandom variable with parameters n and 1 − p. Hence for ε < p the resultsof the first step imply that

P B < nε = P n−B > n(1 − ε)≤ e−n[(1−ε) log 1−ε

1−p +ε log εp ]

= e−n[ε log εp +(1−ε) log 1−ε

1−p ]

≤ e−n[p−ε+ε log(ε/p)].

Lemma A.2. (Bernstein (1946)). Let X1, . . . , Xn be independent real-valued random variables, let a, b ∈ R with a < b, and assume that Xi ∈ [a, b]with probability one (i = 1, . . . , n). Let

σ2 =1n

n∑

i=1

VarXi > 0.

Then, for all ε > 0,

P

∣∣∣∣∣1n

n∑

i=1

(Xi − EXi)

∣∣∣∣∣> ε

≤ 2e− nε2

2σ2+2ε(b−a)/3 .

Proof. Set Yi = Xi−EXi (i = 1, . . . , n). Then we have, with probabilityone,

|Yi| ≤ b− a and EY 2i = VarXi (i = 1, . . . , n).

By Chernoff’s exponential bounding method we get, for arbitrary s > 0,

P

1n

n∑

i=1

(Xi − EXi) > ε

= P

1n

n∑

i=1

Yi > ε

= P

s

n∑

i=1

Yi − snε > 0

≤ Ees∑n

i=1Yi−snε

= e−snεn∏

i=1

EesYi,

by the independence of Yi’s. Because of |Yi| ≤ b− a a.s.

esYi = 1 + sYi +∞∑

j=2

(sYi)j

j!

Page 612: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

A.2. Inequalities for Independent Random Variables 595

≤ 1 + sYi +∞∑

j=2

sjY 2i (b− a)j−2

2 · 3j−2

= 1 + sYi +s2Y 2

i

2

∞∑

j=2

(s (b− a)

3

)j−2

= 1 + sYi +s2Y 2

i

21

1 − s(b− a)/3

if |s(b − a)/3| < 1. This, together with EYi = 0 (i = 1, . . . , n) and1 + x ≤ ex (x ∈ R), implies

P

1n

n∑

i=1

(Xi − EXi) > ε

≤ e−snεn∏

i=1

(1 +

s2 VarXi2

11 − s(b− a)/3

)

≤ e−snεn∏

i=1

exp(s2 VarXi

21

1 − s(b− a)/3

)

= exp(

−snε+s2nσ2

2(1 − s(b− a)/3)

).

Set

s =ε

ε(b− a)/3 + σ2 .

Then∣∣∣∣s(b− a)

3

∣∣∣∣ < 1

and

−snε+s2nσ2

2(1 − s(b− a)/3)

=−nε2

ε(b− a)/3 + σ2 +ε2

(ε(b− a)/3 + σ2)2· nσ2

2(1 − ε(b−a)/3

ε(b−a)/3+σ2

)

=−nε2

ε(b− a)/3 + σ2 +ε2

ε(b− a)/3 + σ2 · nσ2

2 (ε(b− a)/3 + σ2 − ε(b− a)/3)

=−nε2

2ε(b− a)/3 + 2σ2 ,

hence

P

1n

n∑

i=1

(Xi − EXi) > ε

≤ exp( −nε2

2ε(b− a)/3 + 2σ2

).

Page 613: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

596 Appendix A. Tools

Similarly,

P

1n

n∑

i=1

(Xi − EXi) < −ε

= P

1n

n∑

i=1

(−Xi − E−Xi) > ε

≤ exp( −nε2

2ε(b− a)/3 + 2σ2

),

which implies the assertion.

Lemma A.3. (Hoeffding (1963)). Let X1, . . . , Xn be independent real-valued random variables, let a1, b1, . . . , an, bn ∈ R, and assume that Xi ∈[ai, bi] with probability one (i = 1, . . . , n). Then, for all ε > 0,

P

∣∣∣∣∣1n

n∑

i=1

(Xi − EXi)

∣∣∣∣∣> ε

≤ 2e− 2nε2

1n

∑n

i=1|bi−ai|2

.

Proof. Let s > 0 be arbitrary. Similarly to the proof of Lemma A.2 weget

P

1n

n∑

i=1

(Xi − EXi) > ε

≤ exp(−snε) ·n∏

i=1

E exp (s · (Xi − EXi)) .

We will show momentarily

E exp (s · (Xi − EXi)) ≤ exp(s2(bi − ai)2

8

)(i = 1, . . . , n), (A.2)

from which we can conclude

P

1n

n∑

i=1

(Xi − EXi) > ε

≤ exp

(

−snε+s2

8

n∑

i=1

(bi − ai)2)

.

The right-hand side is minimal for

s =4n ε

∑ni=1(bi − ai)2

.

With this value we get

P

1n

n∑

i=1

(Xi − EXi) > ε

≤ exp(

− 4nε21n

∑ni=1(bi − ai)2

+2nε2

1n

∑ni=1(bi − ai)2

)

= exp(

− 2nε21n

∑ni=1(bi − ai)2

).

Page 614: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

A.2. Inequalities for Independent Random Variables 597

This implies that

P

∣∣∣∣∣1n

n∑

i=1

(Xi − EXi)

∣∣∣∣∣> ε

= P

1n

n∑

i=1

(Xi − EXi) > ε

+ P

1n

n∑

i=1

(−Xi − E−Xi) > ε

≤ 2 exp(

− 2nε21n

∑ni=1(bi − ai)2

).

So it remains to show (A.2). Fix i ∈ 1, . . . , n and set

Y = Xi − EXi.

Then Y ∈ [ai −EXi, bi −EXi] =: [a, b] with probability one, a−b = ai −bi,and EY = 0. We have to show

E exp(sY ) ≤ exp(s2(b− a)2

8

). (A.3)

Because of esx convex we have

esx ≤ x− a

b− aesb +

b− x

b− aesa for all a ≤ x ≤ b,

thus

Eexp(sY ) ≤ EY − a

b− aesb +

b− EY b− a

esa

= esa

(1 +

a

b− a− a

b− aes(b−a)

)

(because of EY = 0).

Setting

p = − a

b− a

we get

Eexp(sY ) ≤ (1 − p+ p · es(b−a))e−s p (b−a) = eΦ(s(b−a)),

where

Φ(u) = ln((1 − p+ peu)e−pu

)= ln (1 − p+ peu) − pu.

Next we make a Taylor expansion of Φ. Because of

Φ(0) = 0,

Φ′(u) =peu

1 − p+ peu− p, hence Φ′(0) = 0

Page 615: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With

598 Appendix A. Tools

and

$$
\Phi''(u) = \frac{(1 - p + pe^u)pe^u - pe^u\,pe^u}{(1 - p + pe^u)^2}
 = \frac{(1 - p)pe^u}{(1 - p + pe^u)^2}
 \le \frac{(1 - p)pe^u}{4(1 - p)pe^u} = \frac{1}{4},
$$

we get, for any $u > 0$,

$$
\Phi(u) = \Phi(0) + \Phi'(0)u + \frac{1}{2}\Phi''(\eta)u^2 \le \frac{1}{8}u^2
$$

for some $\eta \in [0, u]$. We conclude

$$
E\exp(sY) \le e^{\Phi(s(b-a))} \le \exp\left(\frac{1}{8}s^2(b-a)^2\right),
$$

which proves (A.3).
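Hoeffding's bound is likewise easy to check numerically. The following minimal sketch is an illustrative aside only; it assumes NumPy and takes $X_i$ i.i.d. Bernoulli$(1/2)$, so $[a_i, b_i] = [0, 1]$ and the bound reduces to $2e^{-2n\varepsilon^2}$; the sample size, $\varepsilon$, and the number of repetitions are arbitrary choices.

```python
import numpy as np

# Minimal numerical check of Hoeffding's inequality (Lemma A.3).
# X_i i.i.d. Bernoulli(1/2): [a_i, b_i] = [0, 1], so (1/n) sum |b_i - a_i|^2 = 1.
rng = np.random.default_rng(1)
n, reps, eps = 100, 100_000, 0.1

X = rng.integers(0, 2, size=(reps, n))
dev = np.abs(X.mean(axis=1) - 0.5)
empirical = np.mean(dev > eps)

hoeffding = 2 * np.exp(-2 * n * eps**2)
print(f"empirical: {empirical:.5f}   Hoeffding bound: {hoeffding:.5f}")
```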

A.3 Inequalities for Martingales

Let us first recall the notion of martingales. Consider a probability space $(\Omega, \mathcal{F}, P)$.

Definition A.1. A sequence of integrable random variables $Z_1, Z_2, \dots$ is called a martingale if

$$
E\{Z_{i+1} \mid Z_1, \dots, Z_i\} = Z_i \quad \text{with probability one}
$$

for each $i > 0$.

Let $X_1, X_2, \dots$ be an arbitrary sequence of random variables. The sequence $Z_1, Z_2, \dots$ of integrable random variables is called a martingale with respect to the sequence $X_1, X_2, \dots$ if for every $i > 0$, $Z_i$ is a measurable function of $X_1, \dots, X_i$, and

$$
E\{Z_{i+1} \mid X_1, \dots, X_i\} = Z_i \quad \text{with probability one}.
$$

Obviously, if $Z_1, Z_2, \dots$ is a martingale with respect to $X_1, X_2, \dots$, then $Z_1, Z_2, \dots$ is a martingale, since

$$
E\{Z_{i+1} \mid Z_1, \dots, Z_i\}
 = E\bigl\{E\{Z_{i+1} \mid X_1, \dots, X_i\} \mid Z_1, \dots, Z_i\bigr\}
 = E\{Z_i \mid Z_1, \dots, Z_i\}
 = Z_i.
$$

The most important examples of martingales are sums of independent zero-mean random variables. Let $U_1, U_2, \dots$ be independent random variables with zero mean. Then the random variables

$$
S_i = \sum_{j=1}^{i} U_j, \quad i > 0,
$$

form a martingale. Martingales share many properties of sums of independent variables. The role of the independent random variables is played here by a so-called martingale difference sequence.

Definition A.2. A sequence of integrable random variables $V_1, V_2, \dots$ is a martingale difference sequence if

$$
E\{V_{i+1} \mid V_1, \dots, V_i\} = 0 \quad \text{with probability one}
$$

for every $i > 0$.

A sequence of integrable random variables $V_1, V_2, \dots$ is called a martingale difference sequence with respect to a sequence of random variables $X_1, X_2, \dots$ if for every $i > 0$, $V_i$ is a measurable function of $X_1, \dots, X_i$, and

$$
E\{V_{i+1} \mid X_1, \dots, X_i\} = 0 \quad \text{with probability one}.
$$

Again, it is easily seen that if $V_1, V_2, \dots$ is a martingale difference sequence with respect to a sequence $X_1, X_2, \dots$ of random variables, then it is a martingale difference sequence. Also, any martingale $Z_1, Z_2, \dots$ leads naturally to a martingale difference sequence by defining

$$
V_i = Z_i - Z_{i-1}
$$

for $i > 1$ and $V_1 = Z_1$.

In Problem A.1 we will show the following extension of the Hoeffding inequality:

Lemma A.4. (Hoeffding (1963), Azuma (1967)). Let $X_1, X_2, \dots$ be a sequence of random variables, and assume that $V_1, V_2, \dots$ is a martingale difference sequence with respect to $X_1, X_2, \dots$. Assume, furthermore, that there exist random variables $Z_1, Z_2, \dots$ and nonnegative constants $c_1, c_2, \dots$ such that for every $i > 0$, $Z_i$ is a function of $X_1, \dots, X_{i-1}$, and

$$
Z_i \le V_i \le Z_i + c_i \quad \text{a.s.}
$$

Then, for any $\varepsilon > 0$ and $n$,

$$
P\left\{\sum_{i=1}^{n} V_i \ge \varepsilon\right\} \le e^{-2\varepsilon^2/\sum_{i=1}^{n} c_i^2}
$$

and

$$
P\left\{\sum_{i=1}^{n} V_i \le -\varepsilon\right\} \le e^{-2\varepsilon^2/\sum_{i=1}^{n} c_i^2}.
$$
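A hedged illustration of how the lemma is used: for i.i.d. fair $\pm 1$ steps $V_i$ one may take $Z_i = -1$ and $c_i = 2$, and the lemma bounds the tail of the random walk $\sum_{i=1}^{n} V_i$ by $\exp(-\varepsilon^2/(2n))$. The sketch below is an aside; NumPy and the chosen parameters are assumptions.

```python
import numpy as np

# Minimal numerical check of the Azuma-Hoeffding bound (Lemma A.4).
# V_i i.i.d. uniform on {-1, +1}: a martingale difference sequence with
# Z_i = -1 and c_i = 2, so P{sum V_i >= eps} <= exp(-eps^2 / (2n)).
rng = np.random.default_rng(2)
n, reps, eps = 100, 100_000, 25.0

V = rng.choice([-1.0, 1.0], size=(reps, n))
S = V.sum(axis=1)
empirical = np.mean(S >= eps)

azuma = np.exp(-2 * eps**2 / (n * 2.0**2))
print(f"empirical: {empirical:.5f}   Azuma bound: {azuma:.5f}")
```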


Next we present a generalization of Hoeffding's inequality, due to McDiarmid (1989). The result will equip us with a powerful tool to handle complicated functions of independent random variables. The inequality has found many applications in combinatorics, as well as in nonparametric statistics (see McDiarmid (1989) and Devroye (1991) for surveys, see also Devroye, Gyorfi and Lugosi (1996)).

Theorem A.2. (McDiarmid (1989)). Let $Z_1, \dots, Z_n$ be independent random variables taking values in a set $A$ and assume that $f : A^n \to \mathbb{R}$ satisfies

$$
\sup_{z_1, \dots, z_n,\, z_i' \in A}
 |f(z_1, \dots, z_n) - f(z_1, \dots, z_{i-1}, z_i', z_{i+1}, \dots, z_n)| \le c_i, \quad 1 \le i \le n.
$$

Then, for all $\varepsilon > 0$,

$$
P\{f(Z_1, \dots, Z_n) - E f(Z_1, \dots, Z_n) \ge \varepsilon\} \le e^{-2\varepsilon^2/\sum_{i=1}^{n} c_i^2},
$$

and

$$
P\{E f(Z_1, \dots, Z_n) - f(Z_1, \dots, Z_n) \ge \varepsilon\} \le e^{-2\varepsilon^2/\sum_{i=1}^{n} c_i^2}.
$$

Proof. See Problem A.2.
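A minimal sketch of how the bounded differences condition is typically verified (an aside; NumPy, the particular function, and the parameters are assumptions): for $f(z_1, \dots, z_n) = \frac{1}{n}\#\{i : z_i \le 1/2\}$, replacing one coordinate changes $f$ by at most $c_i = 1/n$, so the bound becomes $e^{-2n\varepsilon^2}$.

```python
import numpy as np

# Minimal numerical check of the bounded differences inequality (Theorem A.2)
# for f(z_1,...,z_n) = (1/n) * #{i : z_i <= 1/2}, which has c_i = 1/n.
rng = np.random.default_rng(3)
n, reps, eps = 100, 100_000, 0.1

Z = rng.uniform(0.0, 1.0, size=(reps, n))
f = (Z <= 0.5).mean(axis=1)                       # E f(Z_1,...,Z_n) = 1/2
empirical = np.mean(f - 0.5 >= eps)

mcdiarmid = np.exp(-2 * eps**2 / (n * (1.0 / n) ** 2))   # = exp(-2 n eps^2)
print(f"empirical: {empirical:.5f}   McDiarmid bound: {mcdiarmid:.5f}")
```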

The following theorem concerns the variance of a function of independent random variables:

Theorem A.3. (Efron and Stein (1981), Steele (1986), Devroye (1991)). Let $Z_1, \dots, Z_n, Z_1', \dots, Z_n'$ be independent $m$-dimensional random vectors where the two random vectors $Z_k$ and $Z_k'$ have the same distribution $(k = 1, \dots, n)$. For measurable $f : \mathbb{R}^{m\cdot n} \to \mathbb{R}$ assume that $f(Z_1, \dots, Z_n)$ is square integrable. Then

$$
\mathrm{Var}\, f(Z_1, \dots, Z_n)
 \le \frac{1}{2}\sum_{k=1}^{n} E\bigl|f(Z_1, \dots, Z_k, \dots, Z_n) - f(Z_1, \dots, Z_k', \dots, Z_n)\bigr|^2.
$$

Proof. Versions of the following simple proof together with applications can be found in Devroye, Gyorfi, and Lugosi (1996), Lugosi (2002), and Walk (2002a). Set

$$
V = f(Z_1, \dots, Z_n) - E f(Z_1, \dots, Z_n),
$$
$$
V_1 = E\{V \mid Z_1\}, \qquad
V_k = E\{V \mid Z_1, \dots, Z_k\} - E\{V \mid Z_1, \dots, Z_{k-1}\}, \quad k \in \{2, \dots, n\}.
$$


$V_1, \dots, V_n$ form a martingale difference sequence with respect to $Z_1, \dots, Z_n$, hence, for $k < l$,

$$
E\{V_k V_l\} = E\bigl\{E\{V_k V_l \mid Z_1, \dots, Z_{l-1}\}\bigr\}
 = E\bigl\{V_k\, E\{V_l \mid Z_1, \dots, Z_{l-1}\}\bigr\}
 = E\{V_k \cdot 0\} = 0.
$$

We have that

$$
V = \sum_{k=1}^{n} V_k
$$

and

$$
E V^2 = \sum_{k=1}^{n}\sum_{l=1}^{n} E\{V_k V_l\} = \sum_{k=1}^{n} E V_k^2.
$$

Put $f = f(Z_1, \dots, Z_n)$. To bound $E V_k^2$, note that, by Jensen's inequality,

$$
V_k^2 = \bigl(E\{f \mid Z_1, \dots, Z_k\} - E\{f \mid Z_1, \dots, Z_{k-1}\}\bigr)^2
 = \Bigl(E\bigl\{E\{f \mid Z_1, \dots, Z_n\} - E\{f \mid Z_1, \dots, Z_{k-1}, Z_{k+1}, \dots, Z_n\} \,\big|\, Z_1, \dots, Z_k\bigr\}\Bigr)^2
 \le E\Bigl\{\bigl(f - E\{f \mid Z_1, \dots, Z_{k-1}, Z_{k+1}, \dots, Z_n\}\bigr)^2 \,\big|\, Z_1, \dots, Z_k\Bigr\},
$$

and, therefore,

$$
E V_k^2 \le E\Bigl\{\bigl(f - E\{f \mid Z_1, \dots, Z_{k-1}, Z_{k+1}, \dots, Z_n\}\bigr)^2\Bigr\}
 = \frac{1}{2}\, E\bigl\{\bigl(f(Z_1, \dots, Z_n) - f(Z_1, \dots, Z_k', \dots, Z_n)\bigr)^2\bigr\},
$$

where at the last step we used the elementary fact that on a conditional probability space given $Z_1, \dots, Z_{k-1}, Z_{k+1}, \dots, Z_n$, if $U$ and $V$ are independent and identically distributed random variables, then $\mathrm{Var}(U) = (1/2)E(U - V)^2$.
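The right-hand side of the inequality can be evaluated coordinate by coordinate, which makes it convenient to simulate. The following Monte Carlo sketch is an aside; NumPy and the choice $f = \max$ with uniform coordinates are assumptions.

```python
import numpy as np

# Monte Carlo illustration of the Efron-Stein inequality (Theorem A.3)
# for f(Z_1,...,Z_n) = max_i Z_i with Z_i i.i.d. Uniform[0,1].
rng = np.random.default_rng(4)
n, reps = 10, 100_000

Z = rng.uniform(size=(reps, n))
Zp = rng.uniform(size=(reps, n))            # independent copies Z'_k
f = Z.max(axis=1)

bound = 0.0
for k in range(n):
    Zk = Z.copy()
    Zk[:, k] = Zp[:, k]                     # replace only the k-th coordinate
    bound += 0.5 * np.mean((f - Zk.max(axis=1)) ** 2)

print(f"Var f(Z): {f.var():.5f}   Efron-Stein bound: {bound:.5f}")
```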

A.4 Martingale Convergences

The following results from martingale theory will be used in Chapters 24 and 25:

Definition A.3. A sequence of integrable random variables $V_1, V_2, \dots$ is called a supermartingale with respect to a sequence of random variables $X_1, X_2, \dots$ if for every $i > 0$ the random variable $V_i$ is a measurable function of $X_1, \dots, X_i$, and

$$
E\{V_{i+1} \mid X_1, \dots, X_i\} \le V_i \quad \text{with probability one}.
$$

In the next theorem we will use the abbreviation $V^- = \max\{0, -V\}$ for the negative part of a random variable $V$.

Theorem A.4. (Supermartingale Convergence Theorem; Doob (1940)). A supermartingale $V_1, V_2, \dots$ with $\sup_{i>0} E V_i^- < \infty$ converges with probability one to an integrable random variable.

For the standard proof see Doob (1953) or Bauer (1996). A simple proof can be found in Johansen and Karush (1966). In the following, another simple proof from Nevelson and Khasminskii (1973) is given, for which the most important tool is the Upcrossing lemma.

To formulate this lemma, we need the following definition: Let $a < b$ be real numbers and let $x_1, \dots, x_n \in \mathbb{R}$. The number of upcrossings of $[a, b]$ by $x_1, \dots, x_n$ is the number of times the sequence $x_1, \dots, x_n$ passes from $\le a$ to $\ge b$ (in one or several steps).

Lemma A.5. (Upcrossing Lemma). Let $V_1, V_2, \dots$ be a supermartingale. For an interval $[a, b]$, $a < b$, and $T \in \mathbb{N}$ let $H = H_{a,b}^T$ be the random number of upcrossings of $[a, b]$ by $V_1, V_2, \dots, V_T$. Then

$$
E H_{a,b}^T \le \frac{E(a - V_T)^+}{b - a}.
$$

Proof. We show first that there is a sequence of binary valued random variables $I_k$ such that $I_k$ is measurable with respect to $X_1, \dots, X_k$ and

$$
(b - a)H \le \sum_{t=1}^{T-1} I_t(V_{t+1} - V_t) + (a - V_T)^+.
$$

Let $t_1$ be the first (random) time moment for which $V_t \le a$ and let $t_2$ be the first (random) moment after $t_1$ for which $V_t \ge b$, and let $t_3$ be the first (random) moment after $t_2$ for which $V_t \le a$, etc.

Then during $[t_{2H}, T]$ the process does not have an upcrossing of the interval $[a, b]$ and

$$
(b - a)H \le \sum_{i=1}^{H}\bigl(V_{t_{2i}} - V_{t_{2i-1}}\bigr).
$$

Set $I_1 = 1$ if $t_1 = 1$ (i.e., if $V_1 \le a$) and $I_1 = 0$ otherwise. Choose $I_2, I_3, \dots, I_T \in \{0, 1\}$ such that $I_1, I_2, \dots, I_T$ changes its binary values at the moments (indices) $t_k$. We distinguish two cases.

Case 1: $t_{2H+1} \ge T$. This means

$$
V_j > a \quad \text{for } j \in \{t_{2H} + 1, \dots, T - 1\},
$$


[Figure A.3. Crossings of [a, b]: a sample path $V_t$ with levels $a$ and $b$ and the times $t_1, t_2, t_3, t_4$.]

which implies

$$
I_{t_{2H}} = I_{t_{2H}+1} = \dots = I_{T-1} = 0,
$$

therefore,

$$
\sum_{t=1}^{T-1} I_t(V_{t+1} - V_t) = \sum_{i=1}^{H}\bigl(V_{t_{2i}} - V_{t_{2i-1}}\bigr) \ge (b - a)H.
$$

Case 2: $t_{2H+1} < T$. This means

$$
V_j > a \quad \text{for } j \in \{t_{2H} + 1, \dots, t_{2H+1} - 1\}, \qquad V_{t_{2H+1}} \le a,
$$

and

$$
V_j < b \quad \text{for } j \in \{t_{2H+1}, \dots, T\},
$$

therefore,

$$
I_{t_{2H}} = I_{t_{2H}+1} = \dots = I_{t_{2H+1}-1} = 0
$$

and

$$
I_{t_{2H+1}} = I_{t_{2H+1}+1} = \dots = I_{T-1} = 1,
$$

which implies

$$
\sum_{t=1}^{T-1} I_t(V_{t+1} - V_t)
 = \sum_{i=1}^{H}\bigl(V_{t_{2i}} - V_{t_{2i-1}}\bigr) + V_T - V_{t_{2H+1}}
 \ge \sum_{i=1}^{H}\bigl(V_{t_{2i}} - V_{t_{2i-1}}\bigr) + (V_T - a)
 \ge (b - a)H - (a - V_T)^+.
$$


Hence, in both cases,

$$
(b - a)H \le \sum_{t=1}^{T-1} I_t(V_{t+1} - V_t) + (a - V_T)^+
$$

and $I_k$ is measurable with respect to $X_1, \dots, X_k$. Thus

$$
(b - a)\,E H_{a,b}^T
 \le \sum_{t=1}^{T-1} E\{I_t(V_{t+1} - V_t)\} + E(a - V_T)^+
 = \sum_{t=1}^{T-1} E\bigl\{E\{I_t(V_{t+1} - V_t) \mid X_1, \dots, X_t\}\bigr\} + E(a - V_T)^+
 = \sum_{t=1}^{T-1} E\bigl\{I_t\bigl(E\{V_{t+1} \mid X_1, \dots, X_t\} - V_t\bigr)\bigr\} + E(a - V_T)^+
 \le E(a - V_T)^+.
$$

Proof of Theorem A.4. Let $B$ be the event that $V_t$ has no limit, that is,

$$
B = \Bigl\{\liminf_{t\to\infty} V_t < \limsup_{t\to\infty} V_t\Bigr\}.
$$

Thus

$$
B = \bigcup_{r_1 < r_2}\Bigl\{\liminf_{t\to\infty} V_t < r_1 < r_2 < \limsup_{t\to\infty} V_t\Bigr\},
$$

where the union is taken over all rational numbers $r_1 < r_2$, therefore,

$$
B = \bigcup_{r_1 < r_2}\bigl\{H_{r_1, r_2}^{(\infty)} = \infty\bigr\}.
$$

By the monotone convergence theorem and the Upcrossing lemma

$$
E H_{r_1, r_2}^{(\infty)} = E\Bigl\{\lim_{T\to\infty} H_{r_1, r_2}^T\Bigr\}
 = \lim_{T\to\infty} E H_{r_1, r_2}^T
 \le \limsup_{T\to\infty} \frac{E(r_1 - V_T)^+}{r_2 - r_1}
 \le \limsup_{T\to\infty} \frac{|r_1| + E V_T^-}{r_2 - r_1} < \infty,
$$

therefore,

$$
P\bigl\{H_{r_1, r_2}^{(\infty)} = \infty\bigr\} = 0,
$$


so

$$
P\{B\} = 0.
$$

This proves that $V_t$ converges with probability one. What is left to show is that

$$
E\Bigl|\lim_{t\to\infty} V_t\Bigr| < \infty.
$$

By Fatou's lemma

$$
E\Bigl|\lim_{t\to\infty} V_t\Bigr|
 = E\Bigl\{\liminf_{t\to\infty}|V_t|\Bigr\}
 \le \liminf_{t\to\infty} E|V_t|
 = \liminf_{t\to\infty}\bigl(E V_t + 2E V_t^-\bigr)
 \le \liminf_{t\to\infty}\bigl(E V_1 + 2E V_t^-\bigr)
$$

(because $V_1, V_2, \dots$ is a supermartingale)

$$
\le E V_1 + 2\sup_t E V_t^- < \infty.
$$
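A quick illustration of the theorem (an aside; NumPy, the uniform factors, and the horizon are arbitrary choices): a nonnegative martingale such as a product of independent mean-one factors satisfies $\sup_t E V_t^- = 0$, so by Theorem A.4 each path converges; in this example it converges to $0$ because $E\log U_i < 0$.

```python
import numpy as np

# Simulation illustrating Theorem A.4: V_t = prod_{i<=t} U_i with
# U_i i.i.d. Uniform[0,2] is a nonnegative martingale (E U_i = 1),
# hence sup_t E V_t^- = 0 and every path converges a.s. (here to 0,
# since E log U_i = log 2 - 1 < 0).
rng = np.random.default_rng(5)
T, paths = 5_000, 5

U = rng.uniform(0.0, 2.0, size=(paths, T))
V = np.cumprod(U, axis=1)
print("V_T for a few sample paths:", V[:, -1])
```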

Theorem A.5. (A.S. Convergence of Almost Supermartingales; Robbins and Siegmund (1971)). Let $V_1, V_2, \dots$, also $\alpha_1, \alpha_2, \dots$ and $\beta_1, \beta_2, \dots$, be sequences of integrable nonnegative random variables with $\sum_{i=1}^{\infty} E\alpha_i < \infty$, $\sum_{i=1}^{\infty} E\beta_i < \infty$, and let $X_1, X_2, \dots$ be a sequence of random variables such that for every $i$, $V_i$, $\alpha_i$, $\beta_i$ are measurable functions of $X_1, \dots, X_i$ and

$$
E\{V_{i+1} \mid X_1, \dots, X_i\} \le (1 + \alpha_i)V_i + \beta_i.
$$

Then the sequence $V_1, V_2, \dots$, which is called a nonnegative almost supermartingale, converges with probability one.

For related results compare also Aizerman, Braverman, and Rozonoer (1964), Gladyshev (1965), MacQueen (1967), and Van Ryzin (1969).

Proof. The random variables

$$
Z_i = \Biggl(\prod_{j=1}^{i-1}(1 + \alpha_j)\Biggr)^{-1} V_i
 - \sum_{k=1}^{i-1}\Biggl(\prod_{j=1}^{k}(1 + \alpha_j)\Biggr)^{-1}\beta_k, \quad i > 0,
$$

form a supermartingale with respect to $X_1, X_2, \dots$. The relation $\sup_i E Z_i^- < \infty$ immediately follows from the nonnegativity of $\alpha_i$, $\beta_i$, $V_i$ and $\sum_{k=1}^{\infty} E\beta_k < \infty$. By Theorem A.4 the sequence $Z_1, Z_2, \dots$ is convergent with probability one. This, together with the almost sure convergence of the series $\sum_{k=1}^{\infty}\alpha_k$


and that of the sequence $\prod_{j=1}^{i}(1 + \alpha_j)$ (which follows from

$$
\prod_{j=1}^{i}(1 + \alpha_j) \le \prod_{j=1}^{i+1}(1 + \alpha_j) \le \prod_{j=1}^{i+1} e^{\alpha_j}
 \le e^{\sum_{j=1}^{\infty}\alpha_j} < \infty \quad \text{a.s.})
$$

and with almost sure convergence of the series $\sum_{k=1}^{\infty}\beta_k$, yields the assertion.
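Theorem A.5 is the standard tool for proving almost sure convergence of recursive, stochastic-approximation-type estimates. A minimal sketch of such an application (an aside; NumPy, the quadratic example, the step sizes $a_n = 1/n$, and the noise level are assumptions): for $\theta_{n+1} = \theta_n - a_n(\theta_n + \sigma W_n)$ with i.i.d. standard normal $W_n$, the sequence $V_n = \theta_n^2$ satisfies $E\{V_{n+1} \mid \theta_1, \dots, \theta_n\} = (1 - a_n)^2 V_n + a_n^2\sigma^2 \le (1 + a_n^2)V_n + a_n^2\sigma^2$ with $\sum_n a_n^2 < \infty$, so it is a nonnegative almost supermartingale and converges almost surely (in fact to $0$).

```python
import numpy as np

# Sketch of a Robbins-Siegmund application (Theorem A.5): stochastic
# approximation theta_{n+1} = theta_n - a_n * (theta_n + sigma * W_n), a_n = 1/n.
# V_n = theta_n^2 obeys E{V_{n+1} | past} <= (1 + a_n^2) V_n + a_n^2 sigma^2,
# with sum a_n^2 < infinity, so V_n converges a.s. (here to 0).
rng = np.random.default_rng(6)
N, sigma, theta = 100_000, 1.0, 5.0

for n in range(1, N + 1):
    a_n = 1.0 / n
    theta -= a_n * (theta + sigma * rng.standard_normal())

print(f"theta after {N} steps: {theta:.5f}")
```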

The following theorem contains, in the case of independent random variables, a classical theorem of Kolmogorov:

Theorem A.6. (Chow (1965); cf. Stout (1974)). Let $\{Z_i\}$ be a martingale difference sequence with $\mathrm{Var}(Z_i) = \sigma_i^2$ and $c_i > 0$. If

$$
\sum_{i=1}^{\infty} c_i = \infty \quad \text{and} \quad
\sum_{n=1}^{\infty}\frac{\sigma_n^2}{\bigl(\sum_{i=1}^{n} c_i\bigr)^2} < \infty,
$$

then

$$
\lim_{n\to\infty}\frac{\sum_{i=1}^{n} Z_i}{\sum_{i=1}^{n} c_i} = 0 \quad \text{a.s.}
$$

Proof. Put

$$
V_n = \frac{\sum_{i=1}^{n} Z_i}{\sum_{i=1}^{n} c_i}.
$$

Then

$$
E\{V_{n+1}^2 \mid Z_1, \dots, Z_n\}
 = E\Biggl\{\Biggl|\frac{\sum_{i=1}^{n} Z_i}{\sum_{i=1}^{n+1} c_i} + \frac{Z_{n+1}}{\sum_{i=1}^{n+1} c_i}\Biggr|^2 \,\Bigg|\, Z_1, \dots, Z_n\Biggr\}
 = \Biggl|\frac{\sum_{i=1}^{n} Z_i}{\sum_{i=1}^{n+1} c_i}\Biggr|^2
   + E\Biggl\{\Biggl|\frac{Z_{n+1}}{\sum_{i=1}^{n+1} c_i}\Biggr|^2 \,\Bigg|\, Z_1, \dots, Z_n\Biggr\}
 = \Biggl(\frac{\sum_{i=1}^{n} c_i}{\sum_{i=1}^{n+1} c_i}\Biggr)^2 V_n^2
   + \frac{E\{Z_{n+1}^2 \mid Z_1, \dots, Z_n\}}{\bigl(\sum_{i=1}^{n+1} c_i\bigr)^2}
 \le V_n^2 + \frac{E\{Z_{n+1}^2 \mid Z_1, \dots, Z_n\}}{\bigl(\sum_{i=1}^{n+1} c_i\bigr)^2},
$$

thus by Theorem A.5, $V_n^2$ is convergent a.s. By the Kronecker lemma (cf. Problem A.6)

$$
\lim_{n\to\infty}\frac{\sum_{i=1}^{n}\sigma_i^2}{\bigl(\sum_{i=1}^{n} c_i\bigr)^2}
 = \lim_{n\to\infty}\frac{1}{\bigl(\sum_{i=1}^{n} c_i\bigr)^2}
   \sum_{k=1}^{n}\Biggl(\sum_{i=1}^{k} c_i\Biggr)^2\cdot\frac{\sigma_k^2}{\bigl(\sum_{i=1}^{k} c_i\bigr)^2}
 = 0
$$


because of $\sum_{i=1}^{\infty} c_i = \infty$, and thus

$$
\lim_{n\to\infty} E\Biggl\{\Biggl(\frac{\sum_{i=1}^{n} Z_i}{\sum_{i=1}^{n} c_i}\Biggr)^2\Biggr\}
 = \lim_{n\to\infty}\frac{1}{\bigl(\sum_{i=1}^{n} c_i\bigr)^2}\, E\Biggl\{\Biggl(\sum_{i=1}^{n} Z_i\Biggr)^2\Biggr\}
 = \lim_{n\to\infty}\frac{1}{\bigl(\sum_{i=1}^{n} c_i\bigr)^2}\sum_{i=1}^{n} E\{Z_i^2\}
 = \lim_{n\to\infty}\frac{\sum_{i=1}^{n}\sigma_i^2}{\bigl(\sum_{i=1}^{n} c_i\bigr)^2}
 = 0,
$$

therefore the limit of $V_n$ is 0 a.s.
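A minimal numerical illustration of the theorem (an aside; NumPy, the Gaussian differences, and the weights are arbitrary choices): take independent $Z_i \sim N(0, \sigma_i^2)$ with $\sigma_i^2 = i$ and $c_i = i$. Then $\sum_i c_i = \infty$ and $\sum_n \sigma_n^2/(\sum_{i\le n} c_i)^2 < \infty$, so $\sum_{i\le n} Z_i / \sum_{i\le n} c_i \to 0$ a.s., even though the variances grow.

```python
import numpy as np

# Numerical illustration of Theorem A.6 (Chow): Z_i independent N(0, i),
# c_i = i.  The weighted averages (sum Z_i) / (sum c_i) tend to 0.
rng = np.random.default_rng(7)
n = 1_000_000
i = np.arange(1, n + 1, dtype=float)

Z = rng.standard_normal(n) * np.sqrt(i)     # sigma_i^2 = i
ratio = np.cumsum(Z) / np.cumsum(i)
print("ratio at n = 10^4, 10^5, 10^6:",
      ratio[10**4 - 1], ratio[10**5 - 1], ratio[-1])
```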

Problems and Exercises

Problem A.1. Prove Lemma A.4.

Hint: Show that

$$
E\{e^{sV_i} \mid X_1, \dots, X_{i-1}\} \le e^{s^2 c_i^2/8},
$$

and, therefore,

$$
P\Biggl\{\sum_{i=1}^{n} V_i > \varepsilon\Biggr\}
 \le \frac{E\bigl\{\prod_{i=1}^{n} e^{sV_i}\bigr\}}{e^{s\varepsilon}}
 = \frac{E\bigl\{E\bigl\{\prod_{i=1}^{n} e^{sV_i} \mid X_1, \dots, X_{n-1}\bigr\}\bigr\}}{e^{s\varepsilon}}
 = \frac{E\bigl\{\prod_{i=1}^{n-1} e^{sV_i}\, E\{e^{sV_n} \mid X_1, \dots, X_{n-1}\}\bigr\}}{e^{s\varepsilon}}
 \le \frac{E\bigl\{\prod_{i=1}^{n-1} e^{sV_i}\bigr\}\, e^{s^2 c_n^2/8}}{e^{s\varepsilon}}
 \le \frac{\prod_{i=1}^{n} e^{s^2 c_i^2/8}}{e^{s\varepsilon}}.
$$

Finish as in the proof of Lemma A.3.

Problem A.2. Prove Theorem A.2.

Hint: Use the decomposition

$$
f(Z_1, \dots, Z_n) - E f(Z_1, \dots, Z_n) = \sum_{i=1}^{n} V_i
$$

of the proof of Theorem A.3 and apply Lemma A.4 for the martingale difference sequence $\{V_n\}$.


Problem A.3. Prove Bernstein's inequality for martingale differences: let $X_1, \dots, X_n$ be martingale differences such that $X_i \in [a, b]$ with probability one $(i = 1, \dots, n)$. Assume that, for all $i$,

$$
E\{X_i^2 \mid X_1, \dots, X_{i-1}\} \le \sigma^2 \quad \text{a.s.}
$$

Then, for all $\varepsilon > 0$,

$$
P\Biggl\{\Biggl|\frac{1}{n}\sum_{i=1}^{n} X_i\Biggr| > \varepsilon\Biggr\}
 \le 2\, e^{-\frac{n\varepsilon^2}{2\sigma^2 + 2\varepsilon(b-a)/3}}.
$$

Problem A.4. Show Theorem A.6 by use of Theorem A.4, but without use of Theorem A.5.

Problem A.5. (Toeplitz Lemma). Assume that, for the double array $\{a_{ni}\}$,

$$
\sup_n \sum_{i=1}^{\infty} |a_{ni}| < \infty
$$

and

$$
\lim_{n\to\infty}\sum_{i=1}^{\infty} a_{ni} = a
$$

with $|a| < \infty$ and, for each fixed $i$,

$$
\lim_{n\to\infty} a_{ni} = 0,
$$

moreover, for the sequence $\{b_n\}$,

$$
\lim_{n\to\infty} b_n = b.
$$

Then

$$
\lim_{n\to\infty}\sum_{i=1}^{\infty} a_{ni} b_i = ab.
$$
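A quick numerical sanity check of the lemma (an aside; NumPy and the particular array are assumptions): with $a_{ni} = 1/n$ for $i \le n$ and $a_{ni} = 0$ for $i > n$, the hypotheses hold with $a = 1$, and $\sum_i a_{ni} b_i$ is the arithmetic mean of $b_1, \dots, b_n$, which tends to $b$.

```python
import numpy as np

# Numerical check of the Toeplitz lemma (Problem A.5) with a_{ni} = 1/n for
# i <= n (so a = 1) and b_i = 1 + 1/i -> b = 1; the weighted sums are the
# running arithmetic means of the b_i and should tend to 1.
n = 1_000_000
i = np.arange(1, n + 1, dtype=float)
b = 1.0 + 1.0 / i
means = np.cumsum(b) / i                    # sum_i a_{ni} b_i, for each n
print("weighted sums at n = 10, 10^3, 10^6:", means[9], means[999], means[-1])
```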

Problem A.6. (Kronecker Lemma). Let $x_1, x_2, \dots$ be a sequence of real numbers such that

$$
\sum_{k=1}^{\infty} x_k
$$

converges and let $\sigma_1, \sigma_2, \dots$ be a sequence of positive real numbers which tends monotonically to infinity. Then

$$
\frac{1}{\sigma_n}\sum_{k=1}^{n}\sigma_k x_k \to 0 \quad (n \to \infty).
$$

Hint: Set $s = \sum_{k=1}^{\infty} x_k$ and $v_n = \sum_{k=n+1}^{\infty} x_k$. Then $v_0 = s$, $x_n = v_{n-1} - v_n$, and

$$
\sum_{k=1}^{n}\sigma_k x_k = \sum_{k=1}^{n}\sigma_k(v_{k-1} - v_k)
 = \sum_{k=1}^{n-1}(\sigma_{k+1} - \sigma_k)v_k + \sigma_1 s - \sigma_n v_n.
$$

Apply the Toeplitz lemma.
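A corresponding numerical sketch of the Kronecker lemma (an aside; NumPy and the particular sequences are assumptions): $x_k = (-1)^{k+1}/k$ gives a convergent series, and with $\sigma_k = \sqrt{k}$ the normalized weighted sums $\frac{1}{\sigma_n}\sum_{k\le n}\sigma_k x_k$ decay to $0$.

```python
import numpy as np

# Numerical check of the Kronecker lemma (Problem A.6):
# x_k = (-1)^(k+1)/k (convergent series), sigma_k = sqrt(k) (monotone to infinity);
# (1/sigma_n) * sum_{k<=n} sigma_k * x_k should tend to 0.
n = 1_000_000
k = np.arange(1, n + 1, dtype=float)
signs = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)   # +1, -1, +1, ...
x = signs / k
sigma = np.sqrt(k)
vals = np.cumsum(sigma * x) / sigma
print("values at n = 10, 10^3, 10^6:", vals[9], vals[999], vals[-1])
```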


Notation

• s.u.p.c. strongly universally pointwise consistent.
• RBF radial basis function.
• ANN artificial neural network.
• NN-clustering scheme nearest neighbor-clustering scheme.
• ULLN uniform law of large numbers.
• k-NN k-nearest neighbor.
• a.s. almost surely.
• w.r.t. with respect to.
• w.l.o.g. without loss of generality.
• i.i.d. independent and identically distributed.
• mod µ assertion holds for µ-almost all arguments.
• Z set of integers 0, ±1, ±2, ±3, . . .
• N set of natural numbers 1, 2, 3, . . .
• N0 set of nonnegative integers.
• R set of real numbers.
• R+ set of nonnegative real numbers.
• Rd set of d-dimensional real vectors.
• $I_A$ indicator of an event A.
• $I_B(x) = I_{\{x \in B\}}$ indicator function of a set B.
• |A| cardinality of a finite set A.
• $A^c$ complement of a set A.
• $A \triangle B$ symmetric difference of sets A, B.
• $f \circ g$ composition of functions f, g.
• log natural logarithm (base e).
• $\lfloor x\rfloor$ integer part of the real number x.


• $\lceil x\rceil$ upper integer part of the real number x.
• $x^+ = \max\{x, 0\}$ positive part of the real number x.
• $x^- = \max\{-x, 0\}$ negative part of the real number x.
• $X \in \mathbb{R}^d$ observation vector, vector-valued random variable.
• $Y \in \mathbb{R}$ response, real random variable.
• $D_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$ training data, sequence of i.i.d. pairs that are independent of $(X, Y)$, and have the same distribution as that of $(X, Y)$.
• $m(x) = E\{Y \mid X = x\}$ regression function.
• $m_n : \mathbb{R}^d \times (\mathbb{R}^d \times \mathbb{R})^n \to \mathbb{R}$ regression estimate. The short notation $m_n(x) = m_n(x, D_n)$ is also used.
• $\mu(A) = P\{X \in A\}$ probability measure of X.
• $\mu_n(A) = \frac{1}{n}\sum_{i=1}^{n} I_{\{X_i \in A\}}$ empirical measure corresponding to $X_1, \dots, X_n$.
• $\lambda$ Lebesgue measure on $\mathbb{R}^d$.
• $x^{(1)}, \dots, x^{(d)}$ components of the d-dimensional column vector x.
• $(x_1, x_2) = \sum_{j=1}^{d} x_1^{(j)} x_2^{(j)}$ inner product of $x_1, x_2 \in \mathbb{R}^d$.
• $\|x\| = \sqrt{\sum_{i=1}^{d}\bigl(x^{(i)}\bigr)^2}$ Euclidean norm of $x \in \mathbb{R}^d$.
• $\|f\| = \bigl\{\int f(x)^2\mu(dx)\bigr\}^{1/2}$ $L_2(\mu)$ norm of $f : \mathbb{R}^d \to \mathbb{R}$.
• $\|f\|_\infty = \sup_{x\in\mathbb{R}^d}|f(x)|$ supremum norm of $f : \mathbb{R}^d \to \mathbb{R}$.
• $\|f\|_{\infty,A} = \sup_{x\in A}|f(x)|$ supremum norm of $f : \mathbb{R}^d \to \mathbb{R}$ restricted to $A \subseteq \mathbb{R}^d$.
• $\mathcal{P}$ partition of $\mathbb{R}^d$.
• $X_{(k,n)}(x), X_{(k,n)}, X_{(k)}(x)$ kth nearest neighbor of x among $X_1, \dots, X_n$.
• $K : \mathbb{R}^d \to \mathbb{R}$ kernel function.
• $h, h_n > 0$ smoothing factor for a kernel rule.
• $K_h(x) = K(x/h)$ scaled kernel function.
• $\mathcal{F}$ class of functions $f : \mathbb{R}^d \to \mathbb{R}$.
• $\mathcal{F}^+$ class of all subgraphs of functions $f \in \mathcal{F}$.
• $\mathcal{D}$ class of distributions of $(X, Y)$.
• $s(\mathcal{A}, n)$ n-th shatter coefficient of the class $\mathcal{A}$ of sets.
• $V_{\mathcal{A}}$ Vapnik–Chervonenkis dimension of the class $\mathcal{A}$ of sets.
• $\mathcal{N}\bigl(\varepsilon, \mathcal{G}, \|\cdot\|_{L_p(\nu)}\bigr)$ $\varepsilon$-covering number of $\mathcal{G}$ w.r.t. $\|\cdot\|_{L_p(\nu)}$.
• $\mathcal{N}_p(\varepsilon, \mathcal{G}, z_1^n)$ $L_p$ $\varepsilon$-covering number of $\mathcal{G}$ on $z_1^n$.
• $\mathcal{M}\bigl(\varepsilon, \mathcal{G}, \|\cdot\|_{L_p(\nu)}\bigr)$ $\varepsilon$-packing number of $\mathcal{G}$ w.r.t. $\|\cdot\|_{L_p(\nu)}$.
• $\mathcal{M}_p(\varepsilon, \mathcal{G}, z_1^n)$ $L_p$ $\varepsilon$-packing number of $\mathcal{G}$ on $z_1^n$.
• $S_{x,r} = \{y \in \mathbb{R}^d : \|y - x\| \le r\}$ closed Euclidean ball in $\mathbb{R}^d$ centered at $x \in \mathbb{R}^d$, with radius $r > 0$.
• $C(\mathbb{R}^d)$ set of all continuous functions $f : \mathbb{R}^d \to \mathbb{R}$.
• $C^l(A)$ set of all $l$ times continuously differentiable functions $f : A \to \mathbb{R}$, $A \subseteq \mathbb{R}^d$.
• $C_0^\infty(\mathbb{R}^d)$ set of all infinitely often continuously differentiable functions $f : \mathbb{R}^d \to \mathbb{R}$ with compact support.

Page 628: A Distribution-Free Theory of Nonparametric Regression · La´szlo´ Gyo¨rfi Michael Kohler Adam Krzyz˙ak Harro Walk A Distribution-Free Theory of Nonparametric Regression With


• $z = \arg\min_{x\in D} f(x)$ abbreviation for $z \in D$ and $f(z) = \min_{x\in D} f(x)$.
• $\delta_{j,k}$ Kronecker symbol, equals one if $j = k$ and zero otherwise.
• $\mathrm{diam}(A) = \sup_{x,z\in A}\|x - z\|$ diameter of a set $A \subset \mathbb{R}^d$.


Bibliography

Agarwal, G. G. (1989). Splines in statistics. Bulletin of the Allahabad Mathematical Society, 4:1–55.

Agarwal, G. G. and Studden, W. J. (1980). Asymptotic integrated mean squareerror using least squares and bias minimizing splines. Annals of Statistics,8:1307–1325.

Ahmad, I. A. and Lin, P. E. (1976). Nonparametric sequential estimation of amultiple regression function. Bulletin of Mathematical Statistics, 17:63–75.

Aizerman, M. A., Braverman, E. M., and Rozonoer, L. I. (1964). The probability problem of pattern recognition learning and the method of potential functions. Automation and Remote Control, 25:1175–1190.

Akaike, H. (1954). An approximation to the density function. Annals of theInstitute of Statistical Mathematics, 6:127–132.

Akaike, H. (1974). A new look at the statistical model identification. IEEETransactions on Automatic Control, 19:716–723.

Alesker, S. (1997). A remark on the Szarek-Talagrand theorem. Combinatorics,Probability, and Computing, 6:139–144.

Alexander, K. (1984). Probability inequalities for empirical processes and a lawof the iterated logarithm. Annals of Probability, 4:1041–1067.

Alexits, G. (1961). Convergence Problems of Orthogonal Series. Pergamon Press,Oxford, UK.


Algoet, P. (1992). Universal schemes for prediction, gambling and portfolioselection. Annals of Probability, 20:901–941.

Algoet, P. (1994). The strong law of large numbers for sequential decisions underuncertainty. IEEE Transactions on Information Theory, 40:609–633.

Algoet, P. (1999). Universal schemes for learning the best nonlinear predic-tor given the infinite past and side information. IEEE Transactions onInformation Theory, 45:1165–1185.

Algoet, P. and Gyorfi, L. (1999). Strong universal pointwise consistency of someregression function estimation. Journal of Multivariate Analysis, 71:125–144.

Allen, D. M. (1974). The relationship between variable selection and dataaugmentation and a method for prediction. Technometrics, 16:125–127.

Alon, N., Ben-David, S., and Haussler, D. (1997). Scale-sensitive dimensions,uniform convergence, and learnability. Journal of the ACM, 44:615–631.

Amemiya, T. (1985). Advanced Econometrics. Harvard University Press,Cambridge, MA.

Andrews, D. W. K. and Whang, Y. J. (1990). Additive interactive regressionmodels: Circumvention of the curse of dimensionality. Econometric Theory,6:466–479.

Antoniadis, A., Gregoire, G., and Vial, P. (1997). Random design wavelet curvesmoothing. Statistics and Probability Letters, 35:225–232.

Antos, A., Gyorfi, L., and Kohler, M. (2000). Lower bounds on the rate ofconvergence of nonparametric regression estimates. Journal of StatisticalPlanning and Inference, 83:91–100.

Antos, A. and Lugosi, G. (1998). Strong minimax lower bound for learning.Machine Learning, 30:31–56.

Asmus, V. V., Vadasz, V., Karasev, A. B., and Ketskemety, L. (1987). An adap-tive and automatic method for estimating the crops prospects from satelliteimages (in Russian). Researches on the Earth from Space (Issledovanie Zemliiz Kosmosa), 6:79–88.

Azuma, K. (1967). Weighted sums of certain dependent random variables. TohokuMathematical Journal, 68:357–367.

Bailey, D. (1976). Sequential schemes for classifying and predicting ergodicprocesses. PhD Thesis, Stanford University.

Baraud, Y., Comte, F., and Viennet, G. (2001). Adaptive estimation in autore-gression or β-mixing regression via model selection. Annals of Statistics,29:839–875.

Barron, A. R. (1985). The strong ergodic theorem for densities: generalizedShannon-McMillan-Breiman theorem. Annals of Probability, 13:1292–1303.


Barron, A. R. (1989). Statistical properties of artificial neural networks. InProceedings of the 28th Conference on Decision and Control, pages 280–285.Tampa, FL.

Barron, A. R. (1991). Complexity regularization with application to artificialneural networks. In Nonparametric Functional Estimation and Related Top-ics, Roussas, G., editor, pages 561–576. NATO ASI Series, Kluwer AcademicPublishers, Dordrecht.

Barron, A. R. (1993). Universal approximation bounds for superpositions of asigmoidal function. IEEE Transactions on Information Theory, 39:930–944.

Barron, A. R. (1994). Approximation and estimation bounds for artificial neuralnetworks. Machine Learning, 14:115–133.

Barron, A. R., Birge, L., and Massart, P. (1999). Risk bounds for model selectionvia penalization. Probability Theory and Related Fields, 113:301–413.

Bartlett, P. L. and Anthony, M. (1999). Neural Network Learning: TheoreticalFoundations. Cambridge University Press, Cambridge.

Bauer, H. (1996). Probability Theory. de Gruyter, New York.

Beck, J. (1979). The exponential rate of convergence of error for kn-nn non-parametric regression and decision. Problems of Control and InformationTheory, 8:303–311.

Beirlant, J. and Gyorfi, L. (1998). On the asymptotic L2-error in partitioningregression estimation. Journal of Statistical Planning and Inference, 71:93–107.

Bellman, R. E. (1961). Adaptive Control Processes. Princeton University Press,Princeton, NJ.

Beran, R. (1981). Nonparametric regression with randomly censored survivaldata. Technical Report, University of California, Berkeley.

Bernstein, S. N. (1946). The Theory of Probabilities. Gastehizdat PublishingHouse, Moscow.

Bhattacharya, P. K. and Mack, Y. P. (1987). Weak convergence of k–nn den-sity and regression estimators with varying k and applications. Annals ofStatistics, 15:976–994.

Bickel, P. J. and Breiman, L. (1983). Sums of functions of nearest neighbordistances, moment bounds, limit theorems and a goodness of fit test. Annalsof Probability, 11:185–214.

Bickel, P. J., Klaasen, C. A. J., Ritov, Y., and Wellner, J. A. (1993). Efficientand Adaptive Estimation for Semiparametric Models. The Johns HopkinsUniversity Press, Baltimore, MD.


Birge, L. (1983). Approximation dans les espaces metriques et theorie del’estimation. Zeitschrift fur Wahrscheinlichkeitstheorie und verwandteGebiete, 65:181–237.

Birge, L. (1986). On estimating a density using Hellinger distance and some otherstrange facts. Probability Theory and Related Fields, 71:271–291.

Birge, L. and Massart, P. (1993). Rates of convergence for minimum contrastestimators. Probability Theory and Related Fields, 97:113–150.

Birge, L. and Massart, P. (1998). Minimum contrast estimators on sieves:exponential bounds and rate of convergence. Bernoulli, 4:329–375.

Birkhoff, G. D. (1931). Proof of the ergodic theorem. Proceedings of the NationalAcademy of Sciences U.S.A., 17:656–660.

Bosq, D. (1996). Nonparametric Statistics for Stochastic Processes. Springer-Verlag, New York.

Bosq, D. and Lecoutre, J. P. (1987). Theorie de l’ Estimation Fonctionnelle.Economica, Paris.

Breiman, L. (1957). The individual ergodic theorem of information theory. Annalsof Mathematical Statistics, 28:809–811.

Breiman, L. (1993). Fitting additive models to regression data. ComputationalStatistics and Data Analysis, 15:13–46.

Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformationsfor multiple regression and correlation. Journal of the American StatisticalAssociation, 80:580–598.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984).Classification and Regression Trees. Wadsworth International, Belmont, CA.

Bretagnolle, J. and Huber, C. (1979). Estimation des densites: risque minimax.Zeitschrift fur Wahrscheinlichkeitstheorie und verwandte Gebiete, 47:119–137.

Broomhead, D. S. and Lowe, D. (1988). Multivariable functional interpolationand adaptive networks. Complex Systems, 2:321–323.

Burman, P. (1990). Estimation of generalized additive models. Journal ofMultivariate Analysis, 32:230–255.

Cacoullos, T. (1965). Estimation of a multivariate density. Annals of the Instituteof Statistical Mathematics, 18:179–190.

Carbonez, A., Gyorfi, L., and van der Meulen, E. C. (1995). Partitioning-estimates of a regression function under random censoring. Statistics andDecisions, 13:21–37.


Chen, S., Cowan, C. F. N., and Grant, P. M. (1991). Orthogonal least squareslearning algorithm for radial basis networks. IEEE Transactions on NeuralNetworks, 2:302–309.

Chen, Z. (1991). Interaction spline models and their convergence rates. Annalsof Statistics, 19:1855–1868.

Cheng, P. E. (1995). A note on strong convergence rates in nonparametricregression. Statistics and Probability Letters, 24:357–364.

Chernoff, H. (1952). A measure of asymptotic efficiency of tests of a hypothesisbased on the sum of observations. Annals of Mathematical Statistics, 23:493–507.

Chiu, S. T. (1991). Some stabilized bandwidth selectors for nonparametricregression. Annals of Statistics, 19:1528–1546.

Chow, Y. S. (1965). Local convergence of martingales and the law of large numbers. Annals of Mathematical Statistics, 36:552–558.

Chui, C. (1992). Wavelets: A Tutorial in Theory and Applications. AcademicPress, Boston, MA.

Clark, R. M. (1975). A calibration curve for radiocarbon dates. Antiquity, 49:251–266.

Cleveland, W. S. (1979). Robust locally weighted regression and smoothingscatterplots. Journal of the American Statistical Association, 74:829–836.

Collomb, G. (1977). Quelques proprietes de la methode du noyau pourl’estimation nonparametrique de la regression en un point fixe. ComptesRendus de l’Academie des Sciences de Paris, 285:28–292.

Collomb, G. (1979). Estimation de la regression par la methode des k pointsles plus proches: proprietes de convergence ponctuelle. Comptes Rendus del’Academie des Sciences de Paris, 289:245–247.

Collomb, G. (1980). Estimation de la regression par la methode des k points lesplus proches avec noyau. Lecture Notes in Mathematics #821, Springer-Verlag, Berlin. 159–175.

Collomb, G. (1981). Estimation non parametrique de la regression: revuebibliographique. International Statistical Review, 49:75–93.

Collomb, G. (1985). Nonparametric regression: an up-to-date bibliography.Statistics, 16:300–324.

Cover, T. M. (1968a). Estimation by the nearest neighbor rule. IEEETransactions on Information Theory, 14:50–55.

Cover, T. M. (1968b). Rates of convergence for nearest neighbor procedures.In Proceedings of the Hawaii International Conference on Systems Sciences,pages 413–415. Honolulu, HI.


Cover, T. M. (1975). Open problems in information theory. In 1975 IEEE-USSRJoint Workshop on Information Theory, Forney, G. D., editor, pages 35–36.IEEE Press, New York.

Cover, T. M. and Hart, P. E. (1967). Nearest neighbor pattern classification.IEEE Transactions on Information Theory, 13:21–27.

Cox, D. D. (1984). Multivariate smoothing spline functions. SIAM Journal onNumerical Analysis, 21:789–813.

Cox, D. D. (1988). Approximation of least squares regression on nested subspaces.Annals of Statistics, 16:713–732.

Cox, D. R. (1972). Regression models and life tables. Journal of the RoyalStatistical Society Series B, 34:187–202.

Csibi, S. (1971). Simple and compound processes in iterative machine learning.Technical Report, CISM Summer Course, Udine, Italy.

Cybenko, G. (1989). Approximations by superpositions of sigmoidal functions.Mathematics of Control, Signals, and Systems, 2:303–314.

Dabrowska, D. M. (1987). Nonparametric regression with censored data.Scandinavian J. Statistics, 14:181–197.

Dabrowska, D. M. (1989). Uniform consistency of the kernel conditional Kaplan-Meier estimate. Annals of Statistics, 17:1157–1167.

Daniel, C. and Wood, F. S. (1980). Fitting Equations to Data. Wiley, New York,2nd edition.

Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM, Philadelphia, PA.

Davidson, R. and MacKinnon, J. G. (1993). Estimation and Inference inEconometrics. Oxford University Press, New York.

de Boor, C. (1978). A Practical Guide to Splines. Springer-Verlag, New York.

Devroye, L. (1978a). The uniform convergence of nearest neighbor regres-sion function estimators and their application in optimization. IEEETransactions on Information Theory, 24:142–151.

Devroye, L. (1978b). The uniform convergence of the Nadaraya-Watson regression function estimate. The Canadian Journal of Statistics, 6:179–191.

Devroye, L. (1981). On the almost everywhere convergence of nonparametricregression function estimates. Annals of Statistics, 9:1310–1319.

Devroye, L. (1982a). Bounds on the uniform deviation of empirical measures.Journal of Multivariate Analysis, 12:72–79.

Devroye, L. (1982b). Necessary and sufficient conditions for the almosteverywhere convergence of nearest neighbor regression function esti-mates. Zeitschrift fur Wahrscheinlichkeitstheorie und verwandte Gebiete,61:467–481.


Devroye, L. (1988). Automatic pattern recognition: A study of the probabilityof error. IEEE Transactions on Pattern Analysis and Machine Intelligence,10:530–543.

Devroye, L. (1991). Exponential inequalities in nonparametric estimation. In Nonparametric Functional Estimation and Related Topics, Roussas, G., editor, pages 31–44. NATO ASI Series, Kluwer Academic Publishers, Dordrecht.

Devroye, L. and Gyorfi, L. (1983). Distribution-free exponential bound on theL1 error of partitioning estimates of a regression function. In Proceedingsof the Fourth Pannonian Symposium on Mathematical Statistics, Konecny,F., Mogyorodi, J., and Wertz, W., editors, pages 67–76. Akademiai Kiado,Budapest, Hungary.

Devroye, L. and Gyorfi, L. (1985). Nonparametric Density Estimation: The L1

View. Wiley, New York.

Devroye, L., Gyorfi, L., and Krzyzak, A. (1998). The Hilbert kernel regressionestimate. Journal of Multivariate Analysis, 65:209–227.

Devroye, L., Gyorfi, L., Krzyzak, A., and Lugosi, G. (1994). On the strong universal consistency of nearest neighbor regression function estimates. Annals of Statistics, 22:1371–1385.

Devroye, L., Gyorfi, L., and Lugosi, G. (1996). Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

Devroye, L. and Krzyzak, A. (1989). An equivalence theorem for L1 conver-gence of the kernel regression estimate. Journal of Statistical Planning andInference, 23:71–82.

Devroye, L. and Lugosi, G. (2001). Combinatorial Methods in Density Estimation.Springer-Verlag, New York.

Devroye, L. and Wagner, T. J. (1976). Nonparametric discrimination and densityestimation. Technical Report 183, Electronics Research Center, Universityof Texas.

Devroye, L. and Wagner, T. J. (1979). Distribution-free inequalities for thedeleted and holdout error estimates. IEEE Transactions on InformationTheory, 25:202–207.

Devroye, L. and Wagner, T. J. (1980a). Distribution-free consistency results innonparametric discrimination and regression function estimation. Annals ofStatistics, 8:231–239.

Devroye, L. and Wagner, T. J. (1980b). On the L1 convergence of kernel estima-tors of regression functions with applications in discrimination. Zeitschriftfur Wahrscheinlichkeitstheorie und verwandte Gebiete, 51:15–21.


Devroye, L. and Wise, G. L. (1980). Consistency of a recursive nearest neighborregression function estimate. Journal of Multivariate Analysis, 10:539–550.

Dippon, J., Fritz, P., and Kohler, M. (2002). A statistical approach to case basedreasoning, with application to breast cancer data. Computational Statisticsand Data Analysis, (to appear).

Donoho, D. (1997). Cart and best–ortho–basis: a connection. Annals of Statistics,25:1870–1911.

Donoho, D. and Johnstone, I. M. (1994). Ideal spatial adaptation by waveletshrinkage. Biometrika, 81:425–455.

Donoho, D. and Johnstone, I. M. (1998). Minimax estimation via waveletshrinkage. Annals of Statistics, 26:879–921.

Donoho, D., Johnstone, I. M., Kerkyacharian, G., and Picard, D. (1995). Waveletshrinkage: Asymptopia? Journal of the Royal Statistical Society, Series B,57:301–369.

Doob, J. L. (1940). Regularity properties of certain families of chance variables. Transactions of the American Mathematical Society, 47:455–486.

Doob, J. L. (1953). Stochastic Processes. Wiley, New York.

Draper, N. R. and Smith, H. (1981). Applied Regression Analysis, 2nd ed. Wiley,New York.

Duchon, J. (1976). Interpolation des fonctions de deux variables suivant leprincipe de la flexion des plaques minces. RAIRO Analyse Numerique,10:5–12.

Dudley, R. M. (1978). Central limit theorems for empirical measures. Annals ofProbability, 6:899–929.

Dudley, R. M. (1984). A course on empirical processes. In Ecole de Probabilite deSt. Flour 1982, pages 1–142. Lecture Notes in Mathematics #1097, Springer-Verlag, New York.

Dudley, R. M. (1987). Universal Donsker classes and metric entropy. Annals ofProbability, 15:1306–1326.

Dyn, N. (1989). Interpolation and approximation by radial basis functions. InApproximation Theory VI, Chui, C. K., Schumaker, L. L., and Ward, J. D.,editors, pages 211–234. Academic Press, New York.

Efromovich, S. (1999). Nonparametric Curve Estimation: Methods, Theory, andApplications. Springer-Verlag, New York.

Efron, B. (1967). The two-sample problem with censored data. In Proceedings ofthe Fifth Berkeley Symposium on Mathematical Statistics and Probability,pages 831–853. University of California Press, Berkeley.


Efron, B. and Stein, C. (1981). The jackknife estimate of variance. Annals of Statistics, 9:586–596.

Eggermont, P. P. B. and LaRiccia, V. N. (2001). Maximum Penalized LikelihoodEstimation. Vol. I: Density Estimation. Springer-Verlag, New York.

Emery, M., Nemirovsky, A. S., and Voiculescu, D. (2000). Lectures on ProbabilityTheory and Statistics. Springer-Verlag, New York.

Engel, J. (1994). A simple wavelet approach to nonparametric regression fromrecursive partitioning schemes. Journal of Multivariate Analysis, 49:242–254.

Etemadi, N. (1981). An elementary proof of the strong law of large numbers.Zeitschrift fur Wahrscheinlichkeitstheorie und Verwandte Gebiete, 55:119–122.

Eubank, R. L. (1999). Nonparametric Regression and Spline Smoothing. MarcelDekker, New York.

Fan, J. (1993). Local linear regression smoothers and their minimax efficiencies.Annals of Statistics, 21:196–216.

Fan, J. and Gijbels, I. (1992). Variable bandwidth and local linear regressionsmoothers. Annals of Statistics, 20:2008–2036.

Fan, J. and Gijbels, I. (1995). Local Polynomial Modeling and its Applications.Chapman and Hall, London.

Fan, J., Hu, T. C., and Truong, Y. K. (1994). Robust nonparametric functionestimation. Scandinavian Journal of Statistics, 21:433–446.

Farago, A. and Lugosi, G. (1993). Strong universal consistency of neural networkclassifiers. IEEE Transactions on Information Theory, 39:1146–1151.

Farebrother, R. W. (1988). Linear Least Squares Computations. Marcel Dekker,New York.

Farebrother, R. W. (1999). Fitting Linear Relationships: A History of theCalculus of Observations 1750–1900. Springer-Verlag, New York.

Fix, E. and Hodges, J. L. (1951). Discriminatory analysis. Nonparametric dis-crimination: Consistency properties. Technical Report 4, Project Number21-49-004, USAF School of Aviation Medicine, Randolph Field, TX.

Fix, E. and Hodges, J. L. (1952). Discriminatory analysis: small sample perfor-mance. Technical Report 21-49-004, USAF School of Aviation Medicine,Randolph Field, TX.

Friedman, J. H. (1977). A recursive partitioning decision rule for nonparametricclassification. IEEE Transactions on Computers, 26:404–408.

Friedman, J. H. (1991). Multivariate adaptive regression splines (with discussion).Annals of Statistics, 19:1–141.


Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. Journalof the American Statistical Association, 76:817–823.

Friedman, J. H. and Tukey, J. W. (1974). A projection pursuit algorithm forexploratory data analysis. IEEE Transactions on Computers, 23:881–889.

Fritz, J. (1974). Learning from ergodic training sequence. In Limit Theo-rems of Probability Theory, Revesz, P., editor, pages 79–91. North-Holland,Amsterdam.

Funahashi, K. (1989). On the approximate realization of continuous mappingsby neural networks. Neural Networks, 2:183–192.

Gaenssler, P. and Rost, D. (1999). Empirical and Partial-sum Processes; Revis-ited as Random Measure Processes. Centre for Mathematical Physics andStochastics, University of Aarhus, Denmark.

Gallant, A. R. (1987). Nonlinear Statistical Models. Wiley, New York.

Gasser, T. and Muller, H.-G. (1979). Kernel estimation of regression functions. InSmoothing Techniques for Curve Estimation, Gasser, T. and Rosenblatt, M.,editors, pages 23–68. Lecture Notes in Mathematics #757, Springer-Verlag,Heidelberg.

Geman, S. and Hwang, C.-R. (1982). Nonparametric maximum likelihoodestimation by the method of sieves. Annals of Statistics, 10:401–414.

Gessaman, M. P. (1970). A consistent nonparametric multivariate density es-timator based on statistically equivalent blocks. Annals of MathematicalStatistics, 41:1344–1346.

Gill, R. D. (1994). Lectures on survival analysis. In Lectures on ProbabilityTheory, Bernard, P., editor, pages 115–241. Springer-Verlag.

Gine, E. (1996). Empirical processes and applications: An overview. Bernoulli,2:1–28.

Girosi, F. (1994). Regularization theory, radial basis functions and networks. InFrom Statistics to Neural Networks. Theory and Pattern Recognition Appli-cations, Cherkassky, V., Friedman, J. H., and Wechsler, H., editors, pages166–187. Springer-Verlag, Berlin.

Girosi, F. and Anzellotti, G. (1992). Convergence rates of approximation bytranslates. M.I.T. AI Memo. No.1288, MIT, Cambridge, MA.

Girosi, F. and Anzellotti, G. (1993). Rates of convergence for radial basis func-tions and neural networks. In Artificial Neural Networks for Speech andVision, Mammone, R. J., editor, pages 97–113. Chapman and Hall, London.

Gladyshev, E. G. (1965). On stochastic approximation. Theory of Probability and its Applications, 10:275–278.


Glick, N. (1973). Sample-based multinomial classification. Biometrics, 29:241–256.

Gordon, L. and Olshen, R. A. (1980). Consistent nonparametric regression fromrecursive partitioning schemes. Journal of Multivariate Analysis, 10:611–627.

Gordon, L. and Olshen, R. A. (1984). Almost surely consistent nonparamet-ric regression from recursive partitioning schemes. Journal of MultivariateAnalysis, 15:147–163.

Greblicki, W. (1974). Asymptotically optimal probabilistic algorithms for patternrecognition and identification. Technical Report, Monografie No. 3, PraceNaukowe Instytutu Cybernetyki Technicznej Politechniki Wroclawskiej No.18, Wroclaw, Poland.

Greblicki, W. (1978a). Asymptotically optimal pattern recognition procedureswith density estimates. IEEE Transactions on Information Theory, 24:250–251.

Greblicki, W. (1978b). Pattern recognition procedures with nonparametric den-sity estimates. IEEE Transactions on Systems, Man and Cybernetics,8:809–812.

Greblicki, W., Krzyzak, A., and Pawlak, M. (1984). Distribution-free pointwiseconsistency of kernel regression estimate. Annals of Statistics, 12:1570–1575.

Greblicki, W. and Pawlak, M. (1987a). Necessary and sufficient conditionsfor Bayes risk consistency of a recursive kernel classification rule. IEEETransactions on Information Theory, 33:408–412.

Greblicki, W. and Pawlak, M. (1987b). Necessary and sufficient consistencyconditions for a recursive kernel regression estimate. Journal of MultivariateAnalysis, 23:67–76.

Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Gen-eralized Linear Models: a Roughness Penalty Approach. Chapman and Hall,London.

Grenander, U. (1981). Abstract Inference. Wiley, New York.

Guerre, E. (2000). Design adaptive nearest neighbor regression estimation.Journal of Multivariate Analysis, 75:219–255.

Gyorfi, L. (1975). An upper bound of error probabilities for multihypothesistesting and its application in adaptive pattern recognition. Problems ofControl and Information Theory, 5:449–457.

Gyorfi, L. (1978). On the rate of convergence of nearest neighbor rules. IEEETransactions on Information Theory, 29:509–512.


Gyorfi, L. (1980). Stochastic approximation from ergodic sample for linear re-gression. Zeitschrift fur Wahrscheinlichkeitstheorie und Verwandte Gebiete,54:47–55.

Gyorfi, L. (1981). Recent results on nonparametric regression estimate andmultiple classification. Problems of Control and Information Theory,10:43–52.

Gyorfi, L. (1984). Adaptive linear procedures under general conditions. IEEETransactions on Information Theory, 30:262–267.

Gyorfi, L. (1991). Universal consistencies of a regression estimate for unboundedregression functions. In Nonparametric Functional Estimation and Re-lated Topics, Roussas, G., editor, pages 329–338. NATO ASI Series, KluwerAcademic Publishers, Dordrecht.

Gyorfi, L. and Gyorfi, Z. (1975). On the nonparametric estimate of a posteri-ori probabilities of simple statistical hypotheses. In Colloquia MathematicaSocietatis Janos Bolyai: Topics in Information Theory, pages 299–308.Keszthely, Hungary.

Gyorfi, L., Hardle, W., Sarda, P., and Vieu, P. (1989). Nonparametric CurveEstimation from Time Series. Springer Verlag, Berlin.

Gyorfi, L., Kohler, M., and Walk, H. (1998). Weak and strong universal consis-tency of semi-recursive partitioning and kernel regression estimate. Statisticsand Decisions, 16:1–18.

Gyorfi, L. and Lugosi, G. (2002). Strategies for sequential prediction of stationarytime series. In Modeling Uncertainity: An Examination of its Theory, Meth-ods and Applications, Dror, M., L’Ecuyer, P., and Szidarovszky, F., editors,pages 225–248. Kluwer Academic Publishers, Dordrecht.

Gyorfi, L., Morvai, G., and Yakowitz, S. (1998). Limits to consistent on-line fore-casting for ergodic time series. IEEE Transactions on Information Theory,44:886–892.

Gyorfi, L., Schafer, D., and Walk, H. (2002). Relative stability of global errorsin nonparametric function estimations. IEEE Transactions on InformationTheory, (to appear).

Gyorfi, L. and Walk, H. (1996). On strong universal consistency of a series typeregression estimate. Mathematical Methods of Statistics, 5:332–342.

Gyorfi, L. and Walk, H. (1997). Consistency of a recursive regression estimateby Pal Revesz. Statistics and Probability Letters, 31:177–183.

Hald, A. (1998). A History of Mathematical Statistics from 1750 to 1930. Wiley,New York.


Hall, P. (1984). Asymptotic properties of integrated square error and cross-validation for kernel estimation of a regression function. Zeitschrift furWahrscheinlichkeitstheorie und verwandte Gebiete, 67:175–196.

Hall, P. (1988). Estimating the direction in which a data set is most interesting.Probability Theory and Related Fields, 80:51–77.

Hall, P. and Turlach, B. A. (1997). Interpolation methods for nonlinear waveletregression with irregular spaced design. Annals of Statistics, 25:1912–1925.

Hamers, M. and Kohler, M. (2001). A bound on the expected maximal devia-tions of sample averages from their means. Preprint 2001-9, MathematicalInstitute A, University of Stuttgart.

Hardle, W. (1990). Applied Nonparametric Regression. Cambridge UniversityPress, Cambridge, UK.

Hardle, W., Hall, P., and Marron, J. S. (1988). How far are automatically chosenregression smoothing parameter selectors from their optimum? Journal ofthe American Statistical Association, 83:86–101.

Hardle, W. and Kelly, G. (1987). Nonparametric kernel regression estimation:optimal choice of bandwidth. Statistics, 1:21–35.

Hardle, W., Kerkyacharian, G., Picard, D., and Tsybakov, A. B. (1998). Wavelets,Approximation, and Statistical Applications. Springer-Verlag, New York.

Hardle, W. and Marron, J. S. (1985). Optimal bandwidth selection innonparametric regression function estimation. Annals of Statistics,13:1465–1481.

Hardle, W. and Marron, J. S. (1986). Random approximations to an errorcriterion of nonparametric statistics. Journal of Multivariate Analysis,20:91–113.

Hardle, W. and Stoker, T. M. (1989). Investigating smooth multiple regressionby the method of average derivatives. Journal of the American StatisticalAssociation, 84:986–995.

Harrison, D. and Rubinfeld, D. L. (1978). Hedonic prices and the demand forclean air. Journal of Environmental Economics and Management, 5:81–102.

Hart, D. J. (1997). Nonparametric Smoothing and Lack-of-Fit Tests. Springer-Verlag, New York.

Hastie, T. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapmanand Hall, London, U. K.

Hastie, T., Tibshirani, R. J., and Friedman, J. H. (2001). The Elements ofStatistical Learning. Springer-Verlag, New York.

Haussler, D. (1992). Decision theoretic generalizations of the PAC model forneural net and other learning applications. Information and Computation,100:78–150.


Haussler, D., Kivinen, J., and Warmuth, M. K. (1998). Sequential prediction ofindividual sequences under general loss functions. IEEE Transactions onInformation Theory, 44:1906–1925.

Hertz, J., Krogh, A., and Palmer, R. G. (1991). Introduction to the Theory ofNeural Computation. Addison-Wesley, Redwood City, CA.

Hewitt, E. and Ross, K. A. (1970). Abstract Harmonic Analysis II. Springer-Verlag, New York.

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30.

Hollig, K. (1998). Grundlagen der Numerik. MathText, Zavelstein, Germany.

Hornik, K. (1991). Approximation capabilities of multilayer feedforwardnetworks. Neural Networks, 4:251–257.

Hornik, K. (1993). Some new results on neural network approximation. NeuralNetworks, 6:1069–1072.

Hornik, K., Stinchcombe, M., and White, H. (1989). Multi-layer feedforwardnetworks are universal approximators. Neural Networks, 2:359–366.

Horowitz, J. L. (1998). Semiparametric Methods in Econometrics. Springer-Verlag, New York.

Horvath, L. (1981). On nonparametric regression with randomly censored data.In Proc. Third Pannonian Symp. on Math. Statist. Visegrad, Hungary, Mo-gyorodi, J., Vincze, I., and Wertz, W., editors, pages 105–112. AkademiaiKiado, Budapest, Hungary.

Hristache, M., Juditsky, A., Polzehl, J., and Spokoiny, V. G. (2001). Structureadaptive approach for dimension reduction. Annals of Statistics, 29:1537–1566.

Huang, J. (1998). Projection estimation in multiple regression with applications to functional ANOVA models. Annals of Statistics, 26:242–272.

Ibragimov, I. A. and Khasminskii, R. Z. (1980). On nonparametric estimation ofregression. Doklady Akademii Nauk SSSR, 252:780–784.

Ibragimov, I. A. and Khasminskii, R. Z. (1981). Statistical Estimation:Asymptotic Theory. Springer-Verlag, New York.

Ibragimov, I. A. and Khasminskii, R. Z. (1982). On the bounds for quality ofnonparametric regression function estimation. Theory of Probability and itsApplications, 27:81–94.

Ichimura, H. (1993). Semiparametric least squares (sls) and weighted slsestimation of single-index models. Journal of Econometrics, 58:71–120.

Johansen, S. and Karush, J. (1966). On the semimartingale convergence theorem. Annals of Mathematical Statistics, 37:690–694.


Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incompleteobservations. J. Amer. Statist. Assoc., 53:457–481.

Katkovnik, V. Y. (1979). Linear and nonlinear methods for nonparametricregression analysis. Avtomatika, 5:35–46.

Katkovnik, V. Y. (1983). Convergence of linear and nonlinear nonparametricestimates of “kernel” type. Automation and Remote Control, 44:495–506.

Katkovnik, V. Y. (1985). Nonparametric Identification and Data Smoothing:Local Approximation Approach (in Russian). Nauka, Moscow.

Kivinen, J. and Warmuth, M. K. (1999). Averaging expert predictions. In Com-putational Learning Theory: Proceedings of the Fourth European Conference,Eurocolt’99, Simon, H. U. and Fischer, P., editors, pages 153–167. Springer.

Kohler, M. (1998). Nonparametric regression function estimation using in-teraction least squares splines and complexity regularization. Metrika,47:147–163.

Kohler, M. (1999). Universally consistent regression function estimation usinghierarchical B-splines. Journal of Multivariate Analysis, 68:138–164.

Kohler, M. (2000a). Analyse von nichtparametrischen Regressionsschatzern unterminimalen Voraussetzungen. Habilitationsschrift. Shaker Verlag, Aachen.

Kohler, M. (2000b). Inequalities for uniform deviations of averages from expecta-tions, with applications to nonparametric regression. Journal of StatisticalPlanning and Inference, 89:1–23.

Kohler, M. (2002a). Nonlinear orthogonal series estimates for random designregression. To appear in Journal of Statistical Planning and Inference.

Kohler, M. (2002b). Universal consistency of local polynomial kernel estimates.To appear in Annals of the Institute of Statistical Mathematics.

Kohler, M. and Krzyzak, A. (2001). Nonparametric regression estimation usingpenalized least squares. IEEE Transaction on Information Theory, 47:3054–3058.

Kohler, M., Krzyzak, A., and Schafer, D. (2002). Application of structuralrisk minimization to multivariate smoothing spline regression estimates.Bernoulli, 8:1–15.

Kohler, M., Krzyzak, A., and Walk, H. (2002). Strong universal consistency ofautomatic kernel regression estimates. Submitted.

Kohler, M., Mathe, K., and Pinter, M. (2002). Prediction from randomly rightcensored data. Journal of Multivariate Analysis, 80:73–100.

Korostelev, A. P. and Tsybakov, A. B. (1993). Minimax Theory of ImageReconstruction. Springer-Verlag, Berlin.


Kovac, A. and Silverman, B. W. (2000). Extending the scope of wavelet regression methods by coefficient-dependent thresholding. Journal of the American Statistical Association, 95:172–183.

Kozek, A. S., Leslie, J. R., and Schuster, E. F. (1998). On a universal strong law of large numbers for conditional expectations. Bernoulli, 4:143–165.

Krahl, D., Windheuser, U., and Zick, F.-K. (1998). Data Mining. Einsatz in der Praxis. Addison-Wesley, Bonn.

Krzyzak, A. (1986). The rates of convergence of kernel regression estimates and classification rules. IEEE Transactions on Information Theory, 32:668–679.

Krzyzak, A. (1990). On estimation of a class of nonlinear systems by the kernel regression estimate. IEEE Transactions on Information Theory, 36:141–152.

Krzyzak, A. (1991). On exponential bounds on the Bayes risk of the kernel classification rule. IEEE Transactions on Information Theory, 37:490–499.

Krzyzak, A. (1992). Global convergence of the recursive kernel regression estimates with applications in classification and nonlinear system estimation. IEEE Transactions on Information Theory, IT-38:1323–1338.

Krzyzak, A. (1993). Identification of nonlinear block-oriented systems by the recursive kernel regression estimate. Journal of the Franklin Institute, 330(3):605–627.

Krzyzak, A. (2001). Nonlinear function learning using optimal radial basis function networks. Journal on Nonlinear Analysis, 47:293–302.

Krzyzak, A. and Linder, T. (1998). Radial basis function networks and complexity regularization in function learning. IEEE Transactions on Neural Networks, 9(2):247–256.

Krzyzak, A., Linder, T., and Lugosi, G. (1996). Nonparametric estimation and classification using radial basis function nets and empirical risk minimization. IEEE Transactions on Neural Networks, 7(2):475–487.

Krzyzak, A. and Niemann, H. (2001). Convergence and rates of convergence of radial basis functions networks in function learning. Journal on Nonlinear Analysis, 47:281–292.

Krzyzak, A. and Pawlak, M. (1983). Universal consistency results for Wolverton-Wagner regression function estimate with application in discrimination. Problems of Control and Information Theory, 12:33–42.

Krzyzak, A. and Pawlak, M. (1984a). Almost everywhere convergence of recursive kernel regression function estimates. IEEE Transactions on Information Theory, 30:91–93.

Krzyzak, A. and Pawlak, M. (1984b). Distribution-free consistency of a nonparametric kernel regression estimate and classification. IEEE Transactions on Information Theory, 30:78–81.

Krzyzak, A. and Schafer, D. (2002). Nonparametric regression estimation by normalized radial basis function networks. Preprint, Mathematisches Institut A, Universitat Stuttgart.

Kulkarni, S. R. and Posner, S. E. (1995). Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Transactions on Information Theory, 41:1028–1039.

Lecoutre, J. P. (1980). Estimation d'une fonction de regression par la methode du regressogramme a blocks equilibres. Comptes Rendus de l'Academie des Sciences de Paris, 291:355–358.

Ledoux, M. (1996). On Talagrand's deviation inequalities for product measures. ESAIM: Probability and Statistics, 1:63–87.

Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces. Springer-Verlag, New York.

Lee, W. S., Bartlett, P. L., and Williamson, R. C. (1996). Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory, 42(6):2118–2132.

Li, K. C. (1984). Consistency for cross-validated nearest neighbor estimates in nonparametric regression. Annals of Statistics, 12:230–240.

Li, K. C. (1986). Asymptotic optimality of C_L and generalized cross-validation in ridge regression with application to spline smoothing. Annals of Statistics, 14:1101–1112.

Li, K. C. (1987). Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: discrete index set. Annals of Statistics, 15:958–975.

Light, W. A. (1992). Some aspects of radial basis function approximation. In Approximation Theory, Spline Functions and Applications, Singh, S. P., editor, pages 163–190. NATO ASI Series, Kluwer Academic Publishers, Dordrecht.

Linton, O. B. (1997). Efficient estimation of additive nonparametric regression models. Biometrika, 84:469–474.

Linton, O. B. and Hardle, W. (1996). Estimating additive regression models with known links. Biometrika, 83:529–540.

Linton, O. B. and Nielsen, J. B. (1995). A kernel method of estimating structured nonparametric regression based on marginal integration. Biometrika, 82:93–100.

Ljung, L., Pflug, G., and Walk, H. (1992). Stochastic Approximation and Optimization of Random Systems. Birkhauser, Basel, Boston, Berlin.

Loftsgaarden, D. O. and Quesenberry, C. P. (1965). A nonparametric estimate of a multivariate density function. Annals of Mathematical Statistics, 36:1049–1051.

Lugosi, G. (2002). Pattern classification and learning theory. In Principles of Nonparametric Learning, Gyorfi, L., editor, pages 5–62. Springer-Verlag, Wien New York.

Lugosi, G. and Nobel, A. (1996). Consistency of data-driven histogram methods for density estimation and classification. Annals of Statistics, 24:687–706.

Lugosi, G. and Nobel, A. (1999). Adaptive model selection using empirical complexities. Annals of Statistics, 27:1830–1864.

Lugosi, G. and Zeger, K. (1995). Nonparametric estimation via empirical risk minimization. IEEE Transactions on Information Theory, 41:677–678.

Lunts, A. L. and Brailovsky, V. L. (1967). Evaluation of attributes obtained in statistical decision rules. Engineering Cybernetics, 3:98–109.

Mack, Y. P. (1981). Local properties of k-nearest neighbor regression estimates. SIAM Journal on Algebraic and Discrete Methods, 2:311–323.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Neyman, J., editor, pages 281–297. University of California Press, Berkeley and Los Angeles, California.

Maindonald, J. H. (1984). Statistical Computation. Wiley, New York.

Maiorov, V. E. and Meir, R. (2000). On the near optimality of the stochastic approximation of smooth functions by neural networks. Advances in Computational Mathematics, 13:79–103.

Maker, P. T. (1940). The ergodic theorem for a sequence of functions. Duke Math. J., 6:27–30.

Mallows, C. L. (1973). Some comments on C_p. Technometrics, 15:661–675.

Mammen, E. and van de Geer, S. (1997). Locally adaptive regression splines. Annals of Statistics, 25:387–413.

Manski, C. F. (1988). Identification of binary response models. Journal of the American Statistical Association, 83:729–738.

Massart, P. (1990). The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Annals of Probability, 18:1269–1283.

McCaffrey, D. F. and Gallant, A. R. (1994). Convergence rates for single hidden layer feedforward networks. Neural Networks, 7(1):147–158.

McCulloch, W. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133.

McDiarmid, C. (1989). On the method of bounded differences. In Surveys in Combinatorics 1989, pages 148–188. Cambridge University Press, Cambridge, UK.

Meyer, Y. (1993). Wavelets: Algorithms and Applications. SIAM, Philadelphia, PA.

Mhaskar, H. N. (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8:164–177.

Mielniczuk, J. and Tyrcha, J. (1993). Consistency of multilayer perceptron regression estimators. Neural Networks, 6:1019–1022.

Minsky, M. L. and Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA.

Moody, J. and Darken, J. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281–294.

Morvai, G. (1995). Estimation of conditional distributions for stationary time series. PhD Thesis, Technical University of Budapest.

Morvai, G., Yakowitz, S., and Algoet, P. (1997). Weakly convergent nonparametric forecasting of stationary time series. IEEE Transactions on Information Theory, 43:483–498.

Morvai, G., Yakowitz, S., and Gyorfi, L. (1996). Nonparametric inference for ergodic, stationary time series. Annals of Statistics, 24:370–379.

Nadaraya, E. A. (1964). On estimating regression. Theory of Probability and its Applications, 9:141–142.

Nadaraya, E. A. (1970). Remarks on nonparametric estimates for density functions and regression curves. Theory of Probability and its Applications, 15:134–137.

Nadaraya, E. A. (1989). Nonparametric Estimation of Probability Densities and Regression Curves. Kluwer Academic Publishers, Dordrecht.

Nemirovsky, A. S., Polyak, B. T., and Tsybakov, A. B. (1983). Estimators of maximum likelihood type for nonparametric regression. Soviet Mathematics Doklady, 28:788–792.

Nemirovsky, A. S., Polyak, B. T., and Tsybakov, A. B. (1984). Signal processing by the nonparametric maximum likelihood method. Problems of Information Transmission, 20:177–192.

Nemirovsky, A. S., Polyak, B. T., and Tsybakov, A. B. (1985). Rate of convergence of nonparametric estimators of maximum-likelihood type. Problems of Information Transmission, 21:258–272.

Neumann, M. H. and Spokoiny, V. G. (1995). On the efficiency of wavelet estimators under arbitrary error distributions. Mathematical Methods of Statistics, 4:137–166.

Nevelson, M. B. and Khasminskii, R. Z. (1973). Stochastic Approximation and Recursive Estimation. American Mathematical Society, Providence, R.I.

Newey, W. K. (1994). Kernel estimation of partial means and a general variance estimator. Econometric Theory, 10:233–253.

Nicoleris, T. and Yatracos, Y. G. (1997). Rates of convergence of estimates, Kolmogorov's entropy and the dimensionality reduction principle in regression. Annals of Statistics, 25:2493–2511.

Nilsson, N. J. (1965). Learning Machines: Foundations of Trainable Pattern Classifying Systems. McGraw-Hill, New York.

Niyogi, P. and Girosi, F. (1996). On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation, 8:819–842.

Nobel, A. (1996). Histogram regression estimation using data dependent partitions. Annals of Statistics, 24:1084–1105.

Nobel, A. (1999). Limits to classification and regression estimation from ergodic processes. Annals of Statistics, 27:262–273.

Nolan, D. and Pollard, D. (1987). U-processes: Rates of convergence. Annals of Statistics, 15:780–799.

Ornstein, D. S. (1974). Ergodic Theory, Randomness and Dynamical Systems. Yale University Press, New Haven.

Ornstein, D. S. (1978). Guessing the next output of a stationary process. Israel Journal of Mathematics, 30:292–296.

Park, J. and Sandberg, I. W. (1991). Universal approximation using radial-basis-function networks. Neural Computation, 3:246–257.

Park, J. and Sandberg, I. W. (1993). Approximation and radial-basis-function networks. Neural Computation, 5:305–316.

Parzen, E. (1962). On the estimation of a probability density function and the mode. Annals of Mathematical Statistics, 33:1065–1076.

Pawlak, M. (1991). On the almost everywhere properties of the kernel regression estimate. Annals of the Institute of Statistical Mathematics, 43:311–326.

Peterson, A. V. (1977). Expressing the Kaplan-Meier estimator as a function of empirical subsurvival functions. Journal of the American Statistical Association, 72:854–858.

Pinter, M. (2001). Consistency results in nonparametric regression estimation and classification. PhD Thesis, Technical University of Budapest.

Poggio, T. and Girosi, F. (1990). A theory of networks for approximation and learning. Proceedings of the IEEE, 78:1481–1497.

Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York.

Pollard, D. (1986). Rates of uniform almost sure convergence for empirical processes indexed by unbounded classes of functions. Manuscript.

Pollard, D. (1989). Asymptotics via empirical processes. Statistical Science, 4:341–366.

Pollard, D. (1990). Empirical Processes: Theory and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, Institute of Mathematical Statistics, Hayward, CA.

Polyak, B. T. and Tsybakov, A. B. (1990). Asymptotic optimality of the C_p-test for the orthogonal series estimation of regression. Theory of Probability and its Applications, 35:293–306.

Powel, J. L. (1994). Estimation of semiparametric models. In Handbook of Econometrics, Engle, R. F. and McFadden, D. F., editors, pages 79–91. Elsevier, Amsterdam.

Powel, J. L., Stock, J. H., and Stoker, T. M. (1989). Semiparametric estimation of index coefficients. Econometrica, 51:1403–1430.

Powell, M. J. D. (1987). Radial basis functions for multivariable interpolation: a review. In Algorithms for Approximation, Mason, J. C. and Cox, M. G., editors, pages 143–167. Oxford University Press, Oxford, UK.

Powell, M. J. D. (1992). The theory of radial basis functions approximation. In Advances in Numerical Analysis III, Wavelets, Subdivision Algorithms and Radial Basis Functions, Light, W. A., editor, pages 105–210. Clarendon Press, Oxford, UK.

Prakasa Rao, B. L. S. (1983). Nonparametric Functional Estimation. Academic Press, New York.

Rafajlowicz, E. (1987). Nonparametric orthogonal series estimators of regression: a class attaining the optimal convergence rate in L2. Statistics and Probability Letters, 5:213–224.

Rao, R. C. (1973). Linear Statistical Inference and Its Applications. Wiley, New York, 2nd edition.

Reinsch, C. (1967). Smoothing by spline functions. Numerische Mathematik, 10:177–183.

Rejto, L. and Revesz, P. (1973). Density estimation and pattern classification. Problems of Control and Information Theory, 2:67–80.

Revesz, P. (1973). Robbins-Monro procedure in a Hilbert space and its application in the theory of learning processes. Studia Scientiarum Mathematicarum Hungarica, 8:391–398.

Rice, J. and Rosenblatt, M. (1983). Smoothing splines: regression, derivatives and deconvolution. Annals of Statistics, 11:141–156.

Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, UK.

Robbins, H. and Siegmund, D. (1971). A convergence theorem for nonnegative almost supermartingales and some applications. In Optimizing Methods in Statistics, Rustagi, J., editor, pages 233–257. Academic Press, New York.

Robinson, P. M. (1987). Asymptotically efficient estimation in the presence of heteroscedasticity of unknown form. Econometrica, 55:167–182.

Robinson, P. M. (1988). Root-n-consistent semiparametric regression. Econometrica, 56:931–954.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408.

Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, DC.

Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. Annals of Mathematical Statistics, 27:832–837.

Royall, R. M. (1966). A class of nonparametric estimators of a smooth regression function. PhD Thesis, Stanford University, Stanford, CA.

Rudin, W. (1964). Principles of Mathematical Analysis. McGraw-Hill, New York.

Rudin, W. (1966). Real and Complex Analysis. McGraw-Hill, New York.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error propagation. In Parallel Distributed Processing, Vol. I, Rumelhart, D. E., McClelland, J. L., and the PDP Research Group, editors. MIT Press, Cambridge, MA. Reprinted in: J. A. Anderson and E. Rosenfeld, Neurocomputing: Foundations of Research, MIT Press, Cambridge, MA, pp. 673–695, 1988.

Rumelhart, D. E. and McClelland, J. L. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1. Foundations. MIT Press, Cambridge, MA.

Ryabko, B. Y. (1988). Prediction of random sequences and universal coding. Problems of Information Transmission, 24:87–96.

Ryzin, J. V. (1966). Bayes risk consistency of classification procedures using density estimation. Sankhya Series A, 28:161–170.

Ryzin, J. V. (1969). On strong consistency of density estimates. Annals of Mathematical Statistics, 40:1765–1772.

Samarov, A. M. (1993). Exploring regression structure using nonparametric functional estimation. Journal of the American Statistical Association, 88:836–847.

Sauer, N. (1972). On the density of families of sets. Journal of Combinatorial Theory, Series A, 13:145–147.

Schoenberg, I. J. (1964). Spline functions and the problem of graduation. Proceedings of the National Academy of Sciences U.S.A., 52:947–950.

Schumaker, L. (1981). Spline Functions: Basic Theory. Wiley, New York.

Seber, G. A. F. (1977). Linear Regression Analysis. Wiley, New York.

Shelah, S. (1972). A combinatorial problem: Stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 41:247–261.

Shen, X. (1998). On the method of penalization. Statistica Sinica, 8:337–357.

Shen, X. and Wong, W. H. (1994). Convergence rate of sieve estimates. Annals of Statistics, 22:580–615.

Shibata, R. (1976). Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika, 63:117–126.

Shibata, R. (1981). An optimal selection of regression variables. Biometrika, 68:45–54.

Shields, P. C. (1991). Cutting and stacking: a method for constructing stationary processes. IEEE Transactions on Information Theory, 37:1605–1614.

Shorack, G. R. and Wellner, J. A. (1986). Empirical Processes with Applications. Wiley, New York.

Simonoff, J. S. (1996). Smoothing Methods in Statistics. Springer-Verlag, New York.

Singer, A. and Feder, M. (1999). Universal linear prediction by model order weighting. IEEE Transactions on Signal Processing, 47:2685–2699.

Speckman, P. (1985). Spline smoothing and optimal rates of convergence in nonparametric regression models. Annals of Statistics, 13:970–983.

Spiegelman, C. and Sacks, J. (1980). Consistent window estimation in nonparametric regression. Annals of Statistics, 8:240–246.

Steele, J. M. (1975). Combinatorial entropy and uniform limit laws. PhD Thesis, Stanford University, Stanford, CA.

Steele, J. M. (1986). An Efron-Stein inequality for nonsymmetric statistics. Annals of Statistics, 14:753–758.

Stein, E. M. (1970). Singular Integrals and Differentiability Properties of Functions. Princeton University Press, Princeton, NJ.

Stein, E. M. and Weiss, G. (1971). Introduction to Fourier Analysis on Euclidean Spaces. Princeton University Press, Princeton, NJ.

Stigler, S. M. (1999). Statistics on the Table: The History of Statistical Concepts and Methods. Harvard University Press, Cambridge, MA.

Stoer, J. (1993). Numerische Mathematik, Vol. 1. Springer-Verlag, Berlin.

Stoker, T. M. (1991). Equivalence of direct, indirect and slope estimators of average derivatives. In Economics and Statistics, Barnett, W. A., Powel, J., and Tauchen, G., editors, pages 79–91. Cambridge University Press, Cambridge, UK.

Stone, C. J. (1977). Consistent nonparametric regression. Annals of Statistics, 5:595–645.

Stone, C. J. (1980). Optimal rates of convergence for nonparametric estimators. Annals of Statistics, 8:1348–1360.

Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. Annals of Statistics, 10:1040–1053.

Stone, C. J. (1985). Additive regression and other nonparametric models. Annals of Statistics, 13:689–705.

Stone, C. J. (1994). The use of polynomial splines and their tensor products in multivariate function estimation. Annals of Statistics, 22:118–184.

Stone, C. J., Hansen, M. H., Kooperberg, C., and Truong, Y. K. (1997). Polynomial splines and their tensor product in extended linear modeling. Annals of Statistics, 25:1371–1410.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, 36:111–147.

Stout, W. F. (1974). Almost Sure Convergence. Academic Press, New York.

Stute, W. (1984). Asymptotic normality of nearest neighbor regression function estimates. Annals of Statistics, 12:917–926.

Stute, W. and Wang, J. L. (1993). The strong law under random censoring. Annals of Statistics, 21:1591–1607.

Susarla, V., Tsai, W. Y., and Ryzin, J. V. (1984). A Buckley-James-type estimator for the mean with censored data. Biometrika, 71:624–625.

Szarek, S. J. and Talagrand, M. (1997). On the convexified Sauer-Shelah theorem. Journal of Combinatorial Theory, Series B, 69:183–192.

Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. Annals of Probability, 22:28–76.

Thompson, J. W. and Tapia, R. A. (1990). Nonparametric Function Estimation, Modeling, and Simulation. SIAM, Philadelphia.

Tsybakov, A. B. (1986). Robust reconstruction of functions by the local-approximation method. Problems of Information Transmission, 22:133–146.

Tukey, J. W. (1947). Nonparametric estimation II. Statistically equivalent blocks and tolerance regions. Annals of Mathematical Statistics, 18:529–539.

Tukey, J. W. (1961). Curves as parameters and touch estimation. Proceedings of the Fourth Berkeley Symposium, pages 681–694.

van de Geer, S. (1987). A new approach to least squares estimation, with applications. Annals of Statistics, 15:587–602.

van de Geer, S. (1990). Estimating a regression function. Annals of Statistics, 18:907–924.

van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge University Press, New York.

van de Geer, S. (2001). Least squares estimation with complexity penalties. Report MI 2001-04, Mathematical Institute, Leiden University.

van de Geer, S. and Wegkamp, M. (1996). Consistency for the least squares estimator in nonparametric regression. Annals of Statistics, 24:2513–2523.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes, with Applications to Statistics. Springer-Verlag, New York.

Vapnik, V. N. (1982). Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York.

Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York.

Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280.

Vapnik, V. N. and Chervonenkis, A. Y. (1974). Theory of Pattern Recognition. Nauka, Moscow (in Russian); German translation: Theorie der Zeichenerkennung, Akademie Verlag, Berlin, 1979.

Wahba, G. (1975). Smoothing noisy data with spline functions. Numerische Mathematik, 24:383–393.

Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia, PA.

Wahba, G., Wang, Y., Gu, C., Klein, R., and Klein, B. (1995). Smoothing spline ANOVA for exponential families, with application to the Wisconsin epidemiological study of diabetic retinopathy. Annals of Statistics, 23:1865–1895.

Wahba, G. and Wold, S. (1975). A completely automatic French curve: fitting spline functions by cross-validation. Communications in Statistics – Theory and Methods, 4:1–17.

Walk, H. (1985). Almost sure convergence of stochastic approximation processes. Statistics and Decisions, Supplement Issue No. 2:137–141.

Walk, H. (2001). Strong universal pointwise consistency of recursive regression estimates. Annals of the Institute of Statistical Mathematics, 53:691–707.

Walk, H. (2002a). Almost sure convergence properties of Nadaraya-Watson regression estimates. In Modeling Uncertainty: An Examination of its Theory, Methods and Applications, Dror, M., L'Ecuyer, P., and Szidarovszky, F., editors, pages 201–223. Kluwer Academic Publishers, Dordrecht.

Walk, H. (2002b). On cross-validation in kernel and partitioning regression estimation. Statistics and Probability Letters, (to appear).

Walk, H. (2002c). Strong universal consistency of smooth kernel regression estimates. Preprint 2002-8, Math. Inst. A, Universitat Stuttgart.

Walk, H. and Zsido, L. (1989). Convergence of the Robbins-Monro method for linear problems in a Banach space. Journal of Mathematical Analysis and Applications, 139:152–177.

Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall, London.

Watson, G. S. (1964). Smooth regression analysis. Sankhya Series A, 26:359–372.

Wegman, E. J. and Wright, I. W. (1983). Splines in statistics. Journal of the American Statistical Association, 78:351–365.

Wheeden, R. L. and Zygmund, A. (1977). Measure and Integral. Marcel Dekker, New York.

White, H. (1990). Connectionist nonparametric regression: multilayer feedforward networks can learn arbitrary mappings. Neural Networks, 3:535–549.

White, H. (1991). Nonparametric estimation of conditional quantiles using neural networks. In Proceedings of the 23rd Symposium of the Interface: Computing Science and Statistics, pages 190–199. American Statistical Association, Alexandria, VA.

Whittaker, E. (1923). On a new method of graduation. Proceedings of the Edinburgh Mathematical Society, 41:63–75.

Wolverton, C. T. and Wagner, T. J. (1969a). Asymptotically optimal discriminant functions for pattern classification. IEEE Transactions on Information Theory, 15:258–265.

Wolverton, C. T. and Wagner, T. J. (1969b). Recursive estimates of probability densities. IEEE Transactions on Systems, Science and Cybernetics, 5:246–247.

Wong, W. H. (1983). On the consistency of cross-validation in kernel nonparametric regression. Annals of Statistics, 11:1136–1141.

Xu, L., Krzyzak, A., and Oja, E. (1993). Rival penalized competitive learning for clustering analysis, RBF net and curve detection. IEEE Transactions on Neural Networks, 4:636–649.

Xu, L., Krzyzak, A., and Yuille, A. L. (1994). On radial basis function nets and kernel regression: approximation ability, convergence rate and receptive field size. Neural Networks, 7:609–628.

Yakowitz, S. (1993). Nearest neighbor regression estimation for null-recurrent Markov time series. Stochastic Processes and their Applications, 48:311–318.

Yakowitz, S., Gyorfi, L., Kieffer, J., and Morvai, G. (1999). Strongly consistent nonparametric estimation of smooth regression functions for stationary ergodic sequences. Journal of Multivariate Analysis, 71:24–41.

Yakowitz, S. and Heyde, C. (1998). Stationary Markov time series with long range dependence and long, flat segments with application to nonparametric estimation. Submitted.

Yamato, H. (1971). Sequential estimation of a continuous probability density function and the mode. Bulletin of Mathematical Statistics, 14:1–12.

Zhang, P. (1991). Variable selection in nonparametric regression with continuous covariates. Annals of Statistics, 19:1869–1882.

Zhao, L. C. (1987). Exponential bounds of mean error for the nearest neighbor estimates of regression functions. Journal of Multivariate Analysis, 21:168–178.

Zhu, L. X. (1992). A note on the consistent estimator of nonparametric regression constructed by splines. Computers and Mathematics with Applications, 24:65–70.

Author Index

Agarwal, G. G., 281
Ahmad, I. A., 510
Aizerman, M. A., 605
Akaike, H., 82, 233
Alesker, S., 156
Alexander, K., 156
Alexits, G., 521
Algoet, P., 537, 572, 576, 577, 587, 588
Allen, D. M., 29, 127
Alon, N., 156
Amemiya, T., 457
Andrews, D. W. K., 457
Anthony, M., 16, 156, 326
Antoniadis, A., 378
Antos, A., 50
Anzellotti, G., 346, 347, 350
Asmus, V. V., 5
Azuma, K., 599
Bailey, D., 588
Baraud, Y., 233
Barron, A. R., 108, 110, 233, 320, 327, 335, 343, 350, 587
Bartlett, P. L., 16, 156, 220, 326
Bauer, H., 602
Beck, J., 96
Beirlant, J., 67, 69
Bellman, R. E., 29
Ben-David, S., 156
Beran, R., 563
Bernstein, S. N., 594
Bhattacharya, P. K., 96
Bickel, P. J., 16, 96, 457
Birge, L., 50, 180, 220, 233
Birkhoff, G. D., 587
Bosq, D., 16, 67
Brailovsky, V. L., 29, 127
Braverman, E. M., 605
Breiman, L., 4, 16, 96, 241, 250, 457, 587
Bretagnolle, J., 50
Broomhead, D. S., 350
Burman, P., 457
Cacoullos, T., 82
Carbonez, A., 563
Chen, S., 350
Chen, Z., 457
Cheng, Ph. E., 96
Chernoff, H., 592
Chervonenkis, A. Ya., 16, 156, 180, 233
Chiu, S. T., 127
Chow, Y. S., 606
Chui, C., 378
Clark, R. M., 127
Cleveland, W. S., 82
Collomb, G., 16, 67, 96
Comte, F., 233
Cover, T. M., 28, 50, 96, 98, 568
Cowan, C. F. N., 350
Cox, D. D., 220, 429, 446
Cox, D. R., 545
Csibi, S., 16
Cybenko, G., 327
Dabrowska, D. M., 563
Daniel, C., 180
Darken, J., 350
Daubechies, I., 378
Davidson, R., 457
de Boor, C., 281, 406
Devroye, L., 8, 16, 36, 50, 67, 82, 96, 98, 108, 127, 156, 180, 250, 326, 327, 491, 510, 511, 537, 600
Dippon, J., 5
Donoho, D., 354, 378
Doob, J. L., 602
Draper, N. R., 16
Duchon, J., 429
Dudley, R. M., 156, 327
Dyn, N., 350
Efromovich, S., 16
Efron, B., 542, 600
Eggermont, P. P. B., 16, 429
Emery, M., 108
Engel, J., 378
Etemadi, N., 471
Eubank, R. L., 16, 429
Fan, J., 16, 29, 82
Farago, A., 327
Farebrother, R. W., 16, 29, 180
Feder, M., 588
Fix, E., 28, 96
Friedman, J. H., 4, 16, 28, 53, 241, 250, 296, 455, 457
Fritz, J., 96, 521
Fritz, P., 5
Funahashi, K., 327
Gaenssler, P., 156
Gallant, A. R., 327, 457
Gasser, T., 16
Geman, S., 180
Gessaman, M. P., 250
Gijbels, I., 16, 29, 82
Gill, R. D., 563
Gine, E., 156
Girosi, F., 346, 347, 350
Gladyshev, E. G., 605
Glick, N., 16
Gordon, L., 250
Grant, P. M., 350
Greblicki, W., 82, 510, 511
Green, P. J., 429
Gregoire, G., 378
Grenander, U., 159
Gu, C., 457
Guerre, E., 96
Gyorfi, L., 8, 16, 36, 50, 67, 68, 69, 82, 96, 98, 156, 180, 250, 326, 327, 491, 510, 521, 537, 563, 576, 587, 588, 600
Gyorfi, Z., 96
Hald, A., 29
Hall, P., 127, 378, 457
Hamers, M., 108
Hansen, M. H., 296
Hardle, W., 16, 79, 82, 96, 127, 378, 457, 587
Harrison, D., 4
Hart, D. J., 16
Hart, P. E., 28, 96, 98
Hastie, T., 16, 455, 457
Haussler, D., 156, 180, 220, 327, 588
Hertz, J., 326
Hewitt, E., 308
Heyde, C., 582
Hinton, G. E., 350
Hodges, J. L., 28, 96
Hoeffding, W., 596, 599
Hollig, K., 281
Hornik, K., 301, 327, 328
Horowitz, J. L., 16, 457
Horvath, L., 563
Hristache, M., 457
Hu, T. C., 82
Huang, J., 457
Huber, C., 50
Hwang, C.-R., 180
Ibragimov, I. A., 50
Ichimura, H., 457
Johansen, S., 602
Johnstone, I. M., 354, 378
Jones, M. C., 16
Juditsky, A., 457
Kaplan, E. L., 542
Karasev, A. B., 5
Karush, J., 602
Katkovnik, V. Ya., 82
Kelly, G., 127
Kerkyacharian, G., 16, 378
Ketskemety, L., 5
Khasminskii, R. Z., 50, 602
Kieffer, J., 588
Kivinen, J., 588
Klaasen, C. A. J., 16, 457
Klein, B., 457
Klein, R., 457
Kohler, M., 5, 50, 82, 108, 233, 281, 378, 404, 429, 446, 457, 491, 510, 563
Kooperberg, C., 296
Korostelev, A. P., 16, 50, 82
Kovac, A., 378
Kozek, A. S., 537
Krahl, D., 4
Krogh, A., 326
Krzyzak, A., 82, 96, 233, 350, 429, 446, 491, 510, 511
Kulkarni, S. R., 96
LaRiccia, V. N., 16, 429
Lecoutre, J. P., 16, 67
Ledoux, M., 156
Lee, W. S., 220
Leslie, J. R., 537
Li, K. C., 127, 233
Light, W. A., 350
Lin, P. E., 510
Linder, T., 233, 350
Linton, O. B., 457
Ljung, L., 512, 531
Loftsgaarden, D. O., 28
Lowe, D., 350
Lugosi, G., 8, 16, 36, 50, 96, 98, 108, 156, 180, 233, 250, 326, 327, 350, 491, 588, 600
Lunts, A. L., 29, 127
Mack, Y. P., 96
MacKinnon, J. G., 457
MacQueen, J., 605
Maindonald, J. H., 180
Maiorov, V. E., 327
Maker, P. T., 587
Mallows, C. L., 233
Mammen, E., 429
Manski, C. F., 457
Marron, J. S., 127
Massart, P., 156, 180, 220, 233
Mathe, K., 563
McCaffrey, D. F., 327
McClelland, J. L., 301
McCulloch, W. S., 297
McDiarmid, C., 600
Meier, P., 542
Meir, R., 327
Meyer, Y., 378
Mhaskar, H. N., 327
Mielniczuk, J., 327
Minsky, M. L., 299
Moody, J., 350
Morvai, G., 573, 576, 588
Muller, H.-G., 16
Nadaraya, E. A., 16, 28, 82
Nemirovsky, A. S., 108, 180
Neumann, M. H., 378
Nevelson, M. B., 602
Newey, W. K., 457
Nicoleris, T., 457
Nielsen, J. B., 457
Niemann, H., 350
Nilsson, N. J., 326
Niyogi, P., 350
Nobel, A., 233, 250, 582
Nolan, D., 327
Oja, E., 350
Olshen, R. A., 4, 16, 241, 250
Ornstein, D. S., 588
Palmer, R. G., 326
Papert, S., 299
Park, J., 350
Parzen, E., 82
Pawlak, M., 82, 510, 511
Peterson, A. V., 543, 563
Pflug, G., 512, 531
Picard, D., 16, 378
Pinter, M., 563
Pitts, W., 297
Poggio, T., 350
Pollard, D., 156, 219, 220, 327
Polyak, B. T., 180, 233
Polzehl, J., 457
Posner, S. E., 96
Powel, J. L., 457
Powell, M. J. D., 350
Prakasa Rao, B. L. S., 16
Quesenberry, C. P., 28
Rafajlowicz, E., 220
Rao, R. C., 16
Reinsch, C., 29, 429
Rejto, L., 82
Revesz, P., 82, 537
Rice, J., 446
Ripley, B. D., 326
Ritov, Y., 16, 457
Robbins, H., 605
Robinson, P. M., 457
Rosenblatt, F., 298
Rosenblatt, M., 82, 446
Ross, K. A., 308
Rost, D., 156
Royall, R. M., 28
Rozonoer, L. I., 605
Rubinfeld, D. L., 4
Rudin, W., 302, 316, 327, 346
Rumelhart, D. E., 301, 350
Ryabko, B. Ya., 588
Ryzin, J. Van, 16, 548, 605
Sacks, J., 82, 485
Samarov, A. M., 457
Sandberg, I. W., 350
Sarda, P., 587
Sauer, N., 156
Schafer, D., 68, 350, 446
Schoenberg, I. J., 29, 429
Schumaker, L., 281
Schuster, E. F., 537
Seber, G. A. F., 16
Shelah, S., 156
Shen, X., 220, 446
Shibata, R., 233
Shields, P. C., 588
Shorack, G. R., 156
Siegmund, D., 605
Silverman, B. W., 378, 429
Simonoff, J. S., 16, 127
Singer, A., 588
Smith, H., 16
Speckman, P., 446
Spiegelman, C., 82, 485
Spokoiny, V. G., 378, 457
Steele, J. M., 156, 600
Stein, C., 600
Stein, E. M., 348, 511
Stigler, S. M., 29
Stinchcombe, M., 301, 327, 328
Stock, J. H., 457
Stoer, J., 163
Stoker, T. M., 457
Stone, C. J., 4, 16, 28, 50, 51, 67, 68, 82, 96, 98, 241, 250, 296, 405, 457
Stone, M., 29, 127
Stout, W. F., 606
Studden, W. J., 281
Stuetzle, W., 457
Stute, W., 96, 563
Susarla, V., 548
Szarek, S. J., 156
Talagrand, M., 156
Tapia, R. A., 16
Thompson, J. W., 16
Tibshirani, R. J., 16, 455, 457
Truong, Y. K., 82, 296
Tsai, W. Y., 548
Tsybakov, A. B., 16, 50, 82, 180, 233, 378
Tukey, J. W., 28, 67, 457
Turlach, B. A., 378
Tyrcha, J., 327
Vadasz, V., 5
van de Geer, S., 156, 180, 220, 389, 404, 429, 446
van der Meulen, E. C., 563
van der Vaart, A. W., 133, 156, 404
Vapnik, V. N., 16, 29, 156, 180, 233
Vial, P., 378
Viennet, G., 233
Vieu, Ph., 587
Voiculescu, D., 108
Wagner, T. J., 16, 82, 127, 510
Wahba, G., 29, 127, 429, 446, 457
Walk, H., 68, 127, 485, 486, 491, 510, 512, 521, 531, 537, 600
Wand, M. P., 16
Wang, J. L., 563
Wang, Y., 457
Warmuth, M. K., 588
Watson, G. S., 28, 82
Wegkamp, M., 180
Wegman, E. J., 281
Weiss, G., 511
Wellner, J. A., 16, 133, 156, 404, 457
Whang, Y. J., 457
Wheeden, R. L., 511
White, H., 301, 327, 328
Whittaker, E., 29, 429
Williams, R. J., 350
Williamson, R. C., 220
Windheuser, U., 4
Wise, G. L., 537
Wold, S., 127
Wolverton, C. T., 16, 510
Wong, W. H., 127, 220
Wood, F. S., 180
Wright, I. W., 281
Xu, L., 350
Yakowitz, S., 535, 576, 582, 588
Yamato, H., 510
Yatracos, Y. G., 457
Yuille, A. L., 350
Zeger, K., 180, 327
Zhang, P., 457
Zhao, L. C., 96
Zhu, L. X., 281
Zick, F.-K., 4
Zsido, L., 521
Zygmund, A., 511

Subject Index

L2 error, 2
L2 risk, 1
L2(µ) norm, 184
Lp ε-cover of G on z_1^n, 135
Lp ε-packing of G on z_1^n, 140
Lp error, 3
Lp risk, 2
N∞(ε, G), 132
(p, C)-smooth, 37
a posteriori probabilities, 7, 9
adaptation, 14, 26
additive models, 448, 449
almost supermartingale convergence, 605
applications, 4
approximation
  by neural networks, 301, 307, 322
  by piecewise polynomials, 194
  by polynomial splines, 274, 287
  by radial basis function networks, 334
approximation error, 161
Azuma-Hoeffding's inequality, 599
B-spline basis, 256
B-splines, 256
bandwidth, 70
Bayes decision function, 6
Bernstein's inequality, 594
bias-variance tradeoff, 25
Birkhoff's ergodic theorem, 565
Boston housing values, 4
boxed kernel, 72
Breiman's generalized ergodic theorem, 568
cell, 52
censoring time, 540
Cesaro consistency from ergodic data, 576, 582
chaining, 380
Chernoff's bound, 593
choice of smoothing parameters, 26
clustering scheme, 245
complexity regularization, 28, 222, 227
cone property, 90
consistency, 12
  of data-dependent partitioning estimates, 237
  of empirical orthogonal series estimates, 367
  of kernel estimates, 72, 479, 485
  of kernel estimates from censored data, 553, 562
  of least squares estimates, 164
  of least squares spline estimates, 267, 291
  of linear least squares series estimates, 170
  of nearest neighbor estimates, 88, 486, 489
  of nearest neighbor estimates from censored data, 553, 562
  of neural network estimates, 301
  of partitioning estimates, 60, 459, 466, 470
  of partitioning estimates from censored data, 549, 562
  of penalized least squares estimates, 423, 428
  of piecewise polynomial partitioning estimates, 174
  of radial basis function networks, 333
  of recursive kernel estimates, 517
  of recursive nearest neighbor estimates, 518
  of recursive partitioning estimates, 518
  of recursive series estimates, 520
  of semirecursive kernel estimates, 497
  of semirecursive partitioning estimates, 507
  of static forecasting, 572
consistency from ergodic data, 584, 586, 587
cover, 132, 134
covering, 132, 134
covering number, 132, 134
cross-validation, 27, 112, 113, 127
cubic partition, 52
cubic partitions with data-dependent grid size, 241
curse of dimensionality, 23, 448
data, 3
data-dependent partitioning, 235
denseness, 589
dimension reduction, 448
dynamic forecasting, 568
Efron's redistribution algorithm, 542
Efron-Stein's inequality, 600
empirical L2 risk, 130, 158
empirical norm, 184
empirical orthogonal series estimates, 356
Epanechnikov kernel, 70
error criterion, 3
estimation error, 160
fat-free weight, 5
feedforward neural network, 299
fixed design, 15
Gessaman rule, 245
global modeling, 21
hidden layer, 299
Hoeffding's inequality, 596
individual lower minimax rate of convergence, 44
Kaplan-Meier estimate, 542
kernel estimates, 19, 70
kernel function, 70
knot vector, 253
Kronecker lemma, 608
least squares estimates, 21
least squares principle, 158
linear least squares estimates, 183
linear regression, 9
loan management, 4
local averaging, 18
local averaging estimate, 55
local modeling, 20
local polynomial kernel estimates, 21, 80
lower minimax rate of convergence, 37, 38
marketing, 4
martingale difference sequence, 599
martingales, 598
McDiarmid's inequality, 600
mean squared error, 1
measurability problems, 133
method of sieves, 159
minimax approach, 14
minimax lower bound, 38
Nadaraya-Watson kernel estimates, 19
naive kernel, 19, 70
nearest neighbor, 86
nearest neighbor clustering, 245
nearest neighbor estimates, 19, 86
nested partition, 470
neural networks, 297
observation vector, 1
one-step dynamic forecasting, 571
optimal rate of convergence, 37
optimal scaling, 107, 111, 129
orthogonal series estimates, 353
packing, 140
packing number, 140
parametric estimation, 9
partition, 52
partitioning estimates, 19, 52
partitioning number, 236
pattern recognition, 6
penalized least squares estimates, 22, 408, 425
  computation of multivariate estimates, 427
  computation of univariate estimates, 412
  definition of multivariate estimates, 425
  definition of univariate estimates, 408
penalized modeling, 22
penalty term, 22
perceptron, 298
piecewise polynomial partitioning estimates, 175
plug-in estimate, 7, 9
pointwise consistency, 527
  of kernel estimates, 534
  of nearest neighbor estimates, 536
  of partitioning estimates, 527
  of recursive kernel estimates, 535
  of recursive partitioning estimates, 531
  of semirecursive kernel estimates, 535
  of semirecursive partitioning estimates, 527
pointwise error, 3
projection pursuit, 448, 451
quasi interpolant, 273, 287
radial basis function networks, 329
random design, 15
rate of convergence, 13
  of additive models, 450
  of empirical orthogonal series estimates, 373, 374
  of kernel estimates, 77, 79, 105, 113
  of least squares estimates, 201, 227
  of least squares spline estimates, 278, 281, 294, 295
  of linear least squares estimates, 192
  of nearest neighbor estimates, 93, 106, 127
  of neural networks, 316
  of partitioning estimates, 64, 106, 113
  of penalized least squares estimates, 433, 441
  of piecewise polynomial partitioning estimates, 197, 232, 397, 398, 402
  of projection pursuit, 452
  of radial basis function networks, 342
  of single index models, 456
rectangle partition, 52
recursive kernel estimates, 517
recursive nearest neighbor estimates, 518
recursive partitioning estimates, 518
recursive series estimates, 520
regression analysis, 1
regression function, 2
regression function estimation, 3
regression function estimation from dependent data, 564
response variable, 1
resubstitution method, 26
right censoring, 540
semiparametric models, 449
semirecursive kernel estimates, 496
semirecursive partitioning estimates, 507
set of data, 3
shatter, 143
shatter coefficient, 143
sigmoid function, 299
simulated data, 10
single index models, 449, 456
slow rate of convergence, 32
smoothing spline, 22
smoothing spline estimates, see penalized least squares estimates
spline basis, 253
spline space, 263
splines, 252
splitting the sample, 26, 101, 107
squasher
  arctan, 299
  cosine, 299
  Gaussian, 299
  logistic, 299
  ramp, 299
  threshold, 299
squashing function, 299
static forecasting, 569, 572
stationary and ergodic data, 565
statistically equivalent blocks, 243
Stone's theorem, 56
strong consistency, 13
strong universal consistency, 13
subgraph, 147
supermartingale, 601
supermartingale convergence, 602
supremum norm, 23
supremum norm error, 3
survival analysis, 5, 541
survival functions, 542
survival time, 540
symmetrization, 136
tensor product, 283
tensor product spline space, 283
tensor product splines, 283
thin plate spline estimates, 22
thresholding, 357
tie breaking, 86
time series problem, 564
Toeplitz lemma, 608
truncation, 163
uniform law of large numbers, 153
universal consistency, 13
universal prediction, 578
upcrossing lemma, 602
Vapnik-Chervonenkis dimension, 144
VC dimension, 144
wavelet estimates, 353
wavelets, 355, 378
weak consistency, 13
weak universal consistency, 13
weights, 19
wheat crop prediction, 5
window kernel, 19