IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 8, NO. 5, SEPTEMBER 1997 985
Asymptotic Statistical Theory of Overtraining and Cross-Validation
Shun-ichi Amari, Fellow, IEEE, Noboru Murata, Klaus-Robert Müller, Michael Finke, and Howard Hua Yang, Member, IEEE
Abstract—A statistical theory for overtraining is proposed. The analysis treats general realizable stochastic neural networks, trained with Kullback–Leibler divergence, in the asymptotic case of a large number of training examples. It is shown that the asymptotic gain in the generalization error is small if we perform early stopping, even if we have access to the optimal stopping time. Considering cross-validation stopping, we answer the question: in what ratio should the examples be divided into training and cross-validation sets in order to obtain the optimum performance? Although cross-validated early stopping is useless in the asymptotic region, it surely decreases the generalization error in the nonasymptotic region. Our large-scale simulations done on a CM5 are in nice agreement with our analytical findings.
Index Terms—Asymptotic analysis, cross-validation, early stopping, generalization, overtraining, stochastic neural networks.
I. INTRODUCTION
MULTILAYER NEURAL networks improve their behavior by learning a set of examples showing the desired input–output relation. This training procedure is usually carried out by a gradient descent method minimizing a target function ([1], [27], and many others).
When the number of examples is infinitely large and they are unbiased, the network parameters converge to one of the local minima of the risk function (expected loss) to be minimized. When the number of training examples is finite, the true risk function (generalization error) is different from the empirical risk function (training error). Thus, since the training examples are biased, the network parameters converge to a biased solution. This is known as overfitting or overtraining,1
Manuscript received September 11, 1995; revised October 21, 1996 and May 10, 1997. K.-R. Müller was supported in part by the EC S & T fellowship (FTJ 3-004). This work was supported by the National Institutes of Health (P41RRO 5969) and CNCPST Paris (96JR063).
S. Amari is with the Lab. for Inf. Representation, RIKEN, Wakoshi, Saitama 351-01, Japan. He is also with the Department of Mathematical Engineering and Information Physics, University of Tokyo, Tokyo 113, Japan.
N. Murata is with the Department of Mathematical Engineering and Information Physics, University of Tokyo, Tokyo 113, Japan.
K.-R. Müller is with the Department of Mathematical Engineering and Information Physics, University of Tokyo, Tokyo 113, Japan. He is also with GMD FIRST, 12489 Berlin, Germany.
M. Finke is with the Institut für Logik, University of Karlsruhe, 76128 Karlsruhe, Germany.
H. H. Yang is with the Lab. for Inf. Representation, RIKEN, Wakoshi, Saitama 351-01, Japan.
Publisher Item Identifier S 1045-9227(97)06051-7.
1 The concept of overfitting refers to fitting specific examples too much, thus losing generality. Overtraining addresses the issue of using too many iterations in the learning procedure, which leads to overfitting (see, e.g., [29]).
because the parameter values fit too well the speciality of the biased training examples and are not optimal in the sense of minimizing the generalization error given by the risk function.
There are a number of methods of avoiding overfitting. For example, model selection methods (e.g., [23], [20], [26], and many others), regularization ([25] and others), early stopping ([16], [15], [30], [11], [4], and others), or structural risk minimization (SRM, cf. [32]) can be applied.
Here we will consider early stopping in detail. There is a folklore that the generalization error decreases in an early period of training, reaches a minimum, and then increases as training goes on, while the training error monotonically decreases. Therefore, it is considered better to stop training at an adequate time, a technique often referred to as early stopping. To avoid overtraining, the following simple stopping rule has been proposed based on cross-validation: divide all the available examples into two disjoint sets. One set is used for training. The other set is used for validation, such that the behavior of the trained network is evaluated by using the cross-validation examples and the training is stopped at the point that minimizes the error on the cross-validation set. Note that dividing the available examples into two fixed sets is a strongly simplified implementation of k-fold cross-validation (cf. [12]).2 In our study we will consider only the above-described two-set cross-validation, and we will refer to it as cross-validation in the following.
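As an illustration, the two-set stopping rule just described can be sketched in a few lines of code. The toy model (a Gaussian mean fitted by gradient descent), the 80/20 split, and the bookkeeping below are illustrative assumptions of ours, not the experimental setup used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy realizable problem: estimate a mean vector w0 from noisy samples.
w0 = np.zeros(5)
samples = w0 + rng.normal(size=(100, 5))

# Two-set cross-validation split: one part trains, the other validates.
train, val = samples[:80], samples[80:]

def empirical_risk(w, data):
    # Squared-error loss: the Gaussian negative log-likelihood up to a constant.
    return 0.5 * np.mean(np.sum((data - w) ** 2, axis=1))

w = 5.0 * rng.normal(size=5)            # random (far) initialization
eps = 0.05                               # small learning-rate constant
best_w, best_val = w.copy(), empirical_risk(w, val)
for step in range(2000):
    grad = w - np.mean(train, axis=0)    # gradient of the training risk
    w = w - eps * grad
    v = empirical_risk(w, val)
    if v < best_val:                     # keep the validation minimizer
        best_val, best_w = v, w.copy()
```

Tracking the running validation minimum, rather than halting at the first uptick, avoids being fooled by a nonmonotone validation curve.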
Wang et al. [30] analyzed the average optimal stopping time without cross-validation in the case of linear machines. For the regression case, Sjöberg and Ljung [29] calculated asymptotically that the number of efficient parameters is linked 1) to the regularization parameter if a specific regularization is applied and 2) to the number of iterations of the learning algorithm if early stopping is used. They denote early stopping as implicit regularization. Bishop [9] showed that regularization and early stopping lead to similar solutions and stressed the analogy between the number of iterations and the regularization parameter. Barber et al. [6], [7] considered the evaluation of the generalization error by cross-validation for linear perceptrons.
Recently, Guyon [14] and Kearns [18] derived a VC bound for the optimal split between training and validation set, which shows the same scaling as our result. The VC result scales inversely with the square root of the VC dimension (cf. [31])
2 For example, in leave-one-out cross-validation, a pattern is drawn randomly, its error is validated, then another pattern is drawn, and so on. Finally, the validation error is determined by averaging over all chosen patterns.
1045–9227/97$10.00 1997 IEEE
of the network, which in the case of realizable rules coincides with the number $m$ of parameters. Their result was achieved by bounding the probability of error of the recognizer selected by cross-validation using both classical and VC bounds; the resulting bound is then optimized with respect to the split between training and validation set.
There are various phases in the overtraining phenomena depending on the ratio $t/m$ of the number $t$ of examples to the number $m$ of modifiable parameters (see [24]). When $t$ is smaller than or nearly equal to $m$, the examples can in principle be memorized, and overfitting is remarkable in this phase, in particular around $t \approx m$ ([19], [10]). However, the application of simple two-set cross-validation stopping has serious disadvantages in this case. The simple splitting of an already small set of examples decreases the scarce and valuable information in the small data set. In order to avoid overtraining in this case, we need to use global methods like the above-mentioned regularization, SRM, or k-fold cross-validation rather than simple two-set cross-validation stopping.
In an intermediate phase (cf. [24]) of $t$ larger than $m$, simulations show that cross-validation stopping is effective in general. However, it is difficult to construct a general theory in this phase (see also Section VI for discussion).
In the asymptotic phase where $t$ is sufficiently large, the asymptotic theory of statistics is applicable, and the estimated parameters are approximately normally distributed around the true values.
As the first step toward elucidation of overtraining and cross-validation, the present paper gives a rigorous mathematical analysis of overtraining phenomena for 1) a realizable stochastic machine (Section II); 2) Kullback–Leibler divergence (negative of the log likelihood loss); and 3) a sufficiently large number $t$ of examples (compared with the number $m$ of parameters).
We analyze the relation between the training error and cross-validation error, and also the trajectory of learning, using a quadratic approximation of the risk function around the optimal value in the asymptotic region (Section III). The effect of early stopping is studied on this basis. It is shown that asymptotically we gain little by early stopping even if we had access to the optimal stopping time (Section IV). Since we never have access to the optimal stopping time, the generalization error becomes asymptotically worse, which means that the gain achieved through early stopping is asymptotically smaller than the loss of not using the cross-validation examples for training.
We then answer the question: in what ratio should the examples be divided into training and cross-validation sets in order to obtain the optimum performance (Section V)? We give a definite analytic answer to this problem. When the number $m$ of network parameters is large, the best strategy is to use almost all examples in the training set and to use only a fraction of order $1/\sqrt{2m}$ of the examples in the cross-validation set; e.g., when $m = 100$, this means that only 7% of the training patterns are to be used in the set determining the point for early stopping.
Our results are confirmed by large-scale computer simulations of three-layer feedforward networks with a fixed number of modifiable parameters. When $t$ is sufficiently large, the theory fits well with the simulations, showing that cross-validation is not necessary, because the generalization error becomes worse by using cross-validation examples to obtain an adequate stopping time. For an intermediate range of $t$, overtraining surely occurs, and cross-validation stopping improves the generalization ability strongly (Section VII). Finally, concluding remarks are given in Section VIII.
II. STOCHASTIC FEEDFORWARD NETWORKS
Let us consider a stochastic network which receives input vector $x$ and emits output vector $y$. The network includes a modifiable vector parameter $w = (w_1, \ldots, w_m)$, and the network specified by $w$ is denoted by $N(w)$. The input–output relation of the network $N(w)$ is specified by the conditional probability $p(y \mid x; w)$. It is assumed that input $x$ is randomly chosen from an unknown probability distribution $q(x)$. The joint probability of $(x, y)$ is given by

$p(x, y; w) = q(x)\, p(y \mid x; w).$   (1)
We assume the following.
1) There exists a teacher network $N(w_0)$ which generates the training examples.
2) The Fisher information matrix, defined by

$G(w) = E\!\left[ \frac{\partial \log p(x, y; w)}{\partial w} \left( \frac{\partial \log p(x, y; w)}{\partial w} \right)^{T} \right]$   (2)

has a full rank and is differentiable with respect to $w$, where $E$ denotes the expectation with respect to $p(x, y; w)$.
3) The training set

$D_t = \{ (x_1, y_1), \ldots, (x_t, y_t) \}$

consists of $t$ i.i.d. examples generated by the distribution $p(x, y; w_0)$ of $N(w_0)$.
Let us define the risk and loss functions. When an input–output pair $(x, y)$ is an example generated from network $N(w)$, its loss or error is given by the negative of the log likelihood,

$e(x, y; w) = -\log p(x, y; w).$   (3)
The risk function of network $N(w)$ is the expectation of the loss with respect to the true distribution $p(x, y; w_0)$,

$R(w) = E_0[\, e(x, y; w) \,]$   (4)

where $E_0$ denotes the expectation with respect to $p(x, y; w_0)$. The risk function $R(w)$ is called the generalization error, since it is evaluated by the expectation of $e(x, y; w)$ where the test pair $(x, y)$ is newly generated by $N(w_0)$. It is easy to show

$R(w) = H_0 + D(w_0 \,\|\, w)$   (5)
AMARI et al.: ASYMPTOTIC STATISTICAL THEORY 987
where

$H_0 = -E_0[\, \log p(x, y; w_0) \,]$   (6)

is the entropy of the teacher network and

$D(w_0 \,\|\, w) = E_0\!\left[ \log \frac{p(x, y; w_0)}{p(x, y; w)} \right]$   (7)

is the Kullback–Leibler divergence from probability distribution $p(x, y; w_0)$ to $p(x, y; w)$, or the divergence of $N(w)$ from $N(w_0)$.
Obviously, $D(w_0 \,\|\, w) \ge 0$, and the equality holds when, and only when, $w = w_0$. Hence, the risk $R(w)$ measures the divergence between the true distribution $p(x, y; w_0)$ and the distribution $p(x, y; w)$ of $N(w)$, except for a constant term $H_0$, which denotes the stochastic uncertainty of the teacher machine itself. Minimizing $R(w)$ is equivalent to minimizing $D(w_0 \,\|\, w)$, and the minimum is attained at $w = w_0$.

In the case of multilayer perceptrons with additive Gaussian noise, the output is written as

$y = f(x, w) + n$   (8)

where $f(x, w)$ is the analog function calculated by the multilayer perceptron with a set of parameters $w$, and $n$ is Gaussian noise. When its components are independent, subject to $N(0, \sigma^2)$, we have, for a $k$-dimensional output $y$,

$p(y \mid x; w) = (2\pi\sigma^2)^{-k/2} \exp\!\left\{ -\frac{|y - f(x, w)|^2}{2\sigma^2} \right\}.$   (9)

Hence, the loss is the ordinary squared error

$e(x, y; w) = \frac{1}{2\sigma^2} |y - f(x, w)|^2 + c$   (10)

where $c$ does not depend on $w$.
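As a quick numerical check of this equivalence (our own illustration, with a hypothetical one-parameter model $f(x, w) = wx$), minimizing the Gaussian negative log-likelihood over a parameter grid selects the same parameter as minimizing the plain sum of squared errors:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
w_true, sigma = 1.5, 0.3
y = w_true * x + sigma * rng.normal(size=200)   # additive Gaussian noise

ws = np.linspace(0.5, 2.5, 2001)                # parameter grid
# Negative log-likelihood: sum over examples of (10), constant included.
nll = np.array([np.sum(0.5 * (y - w * x) ** 2 / sigma ** 2
                       + 0.5 * np.log(2 * np.pi * sigma ** 2)) for w in ws])
sse = np.array([np.sum((y - w * x) ** 2) for w in ws])  # ordinary squared error

w_mle, w_lsq = ws[np.argmin(nll)], ws[np.argmin(sse)]
```

Since the two objectives differ only by a positive scale and an additive constant, their minimizers coincide exactly.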
III. ASYMPTOTIC ANALYSIS OF LEARNING
The maximum likelihood estimator (m.l.e.) $\hat{w}$ maximizes the likelihood of producing the training set $D_t$, or equivalently minimizes the empirical risk function

$R_{\mathrm{train}}(w) = \frac{1}{t} \sum_{i=1}^{t} e(x_i, y_i; w).$   (11)

This empirical risk is called the training error since it is evaluated by the training examples themselves. In order to avoid confusion, $R(w)$ is denoted by $R_{\mathrm{gen}}(w)$ when necessary.

The asymptotic theory of statistics proves that the m.l.e. $\hat{w}$ is asymptotically subject to the normal distribution with mean $w_0$ and variance

$\frac{1}{t} G^{-1}$

under certain regularity conditions, where $G^{-1}$ is the inverse of the Fisher information matrix $G = G(w_0)$.
By expanding the risk functions, we have the following asymptotic evaluations of $R_{\mathrm{gen}}(w)$ and $R_{\mathrm{train}}(w)$ in the neighborhood of $w_0$.

Lemma 1: When $w$ belongs to the $1/\sqrt{t}$-neighborhood of $w_0$,

$R_{\mathrm{gen}}(w) = H_0 + \frac{1}{2} (w - w_0)^T G (w - w_0) + O(t^{-3/2})$   (12)

$R_{\mathrm{train}}(w) = \hat{H}_t + \frac{1}{2} (w - \hat{w})^T G (w - \hat{w}) + O(t^{-3/2})$   (13)

where $^T$ denotes the transpose of the column vector, $\hat{H}_t = R_{\mathrm{train}}(\hat{w})$ is the training error of the m.l.e., and $\langle \cdot \rangle$ represents the average with respect to the distribution of the sample [3].

The relation (12) is the Taylor expansion of (4), where the identity

$G = E_0\!\left[ \frac{\partial^2}{\partial w \,\partial w} e(x, y; w_0) \right]$

is used. The proof of (13) is given in Appendix A.

By putting $w = \hat{w}$ in (12) and (13), we have the asymptotic evaluations of the generalization and training errors of $\hat{w}$. They depend on the examples $D_t$ from which the m.l.e. $\hat{w}$ is calculated. We denote by $\langle \cdot \rangle$ the average with respect to the distribution of the sample $D_t$ which determines $\hat{w}$. We then obtain the following universal relation concerning the generalization error and training error. This was first proved by [3]. A similar but different universal property is proved by Amari [2] for deterministic dichotomy machines.
Corollary 1: For the m.l.e. $\hat{w}$, the average training error and generalization error are asymptotically evaluated by the AIC3 type criterion [3]

$\langle R_{\mathrm{train}}(\hat{w}) \rangle = H_0 - \frac{m}{2t}$   (14)

$\langle R_{\mathrm{gen}}(\hat{w}) \rangle = H_0 + \frac{m}{2t}$   (15)

independently of the architecture of networks, where $m$ is the number of modifiable parameters (the dimension number of $w$) and $t$ is the number of training patterns.
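Relations (14) and (15) can be checked by Monte Carlo on the simplest realizable model available: a Gaussian location model with $G = I$, where the m.l.e. is the sample mean and the expected loss has a closed form. The model choice and constants below are illustrative assumptions of ours; Corollary 1 predicts the gap $\langle R_{\mathrm{gen}} \rangle - \langle R_{\mathrm{train}} \rangle = m/t$ independently of them.

```python
import numpy as np

rng = np.random.default_rng(1)
m, t, trials = 8, 50, 4000
w0 = np.zeros(m)
gap = 0.0
for _ in range(trials):
    data = w0 + rng.normal(size=(t, m))     # realizable Gaussian model, G = I
    w_mle = data.mean(axis=0)               # m.l.e. of the mean
    # Losses in the squared-error form of (10), up to the common constant.
    r_train = 0.5 * np.mean(np.sum((data - w_mle) ** 2, axis=1))
    r_gen = 0.5 * (m + np.sum((w_mle - w0) ** 2))  # closed-form test expectation
    gap += (r_gen - r_train) / trials
# Corollary 1 predicts <R_gen> - <R_train> = m/t.
```

The training error underestimates and the generalization error overestimates the entropy term by the same amount $m/2t$, so their gap is $m/t$ exactly in expectation.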
Murata et al. [22], [23] obtained more general versions and proposed the NIC (network information criterion) for model selection by generalizing the AIC [5].
Let us consider the gradient descent learning rule, where the parameter $w_n$ at the $n$th step is modified by

$w_{n+1} = w_n - \varepsilon \frac{\partial R_{\mathrm{train}}(w_n)}{\partial w}$   (16)

where $\varepsilon$ is a small positive constant. More precisely, $R_{\mathrm{train}}$ depends on the sample $D_t$, but we omit this dependence for the sake of simplicity. This is batch learning, where all the training examples are used for each
3Akaike’s information criterion.
iteration of modifying $w_n$. We can alternatively use on-line learning4

$w_{n+1} = w_n - \varepsilon \frac{\partial}{\partial w} e(x_i, y_i; w_n)$   (17)

where $(x_i, y_i)$ is randomly chosen at each step from the training data set $D_t$. The batch process is deterministic and $w_n$ converges to $\hat{w}$,5 provided the initial $w(0)$ is included in its basin of attraction. For large $n$, $w_n$ is in the $1/\sqrt{t}$-neighborhood of $\hat{w}$, and the gradient of $R_{\mathrm{train}}$ is approximated from (13) as

$\frac{\partial R_{\mathrm{train}}(w_n)}{\partial w} \simeq G\,(w_n - \hat{w}).$

Hence, by neglecting the term of order $O(t^{-3/2})$, (16) is approximated by

$w_{n+1} - \hat{w} = (I - \varepsilon G)(w_n - \hat{w}).$

This gives the asymptotic evaluation

$w_n - \hat{w} \simeq (I - \varepsilon G)^n \,(w(0) - \hat{w})$

where $I$ is the identity matrix and $n$ is assumed to be large.
In order to make the analysis easier, we take the coordinate system such that the Fisher information matrix $G$ is equal to the identity matrix

$G = I$   (18)

at $w_0$. This is possible without loss of generality, and the results of the following analysis are the same whichever coordinate system we use. Under this coordinate system, we have

$w_n - \hat{w} \simeq (1 - \varepsilon)^n \,(w(0) - \hat{w})$

showing that the trajectory linearly approaches $\hat{w}$ in the neighborhood of $\hat{w}$.
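This linear approach can be made concrete with a minimal sketch: batch gradient descent on the quadratic training risk in coordinates where the Fisher matrix is the identity, so each step contracts $w_n - \hat{w}$ by exactly the factor $1 - \varepsilon$ without changing its direction. The dimension and constants are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
m, eps = 4, 0.1
w_hat = rng.normal(size=m)               # minimizer of the quadratic training risk
w = w_hat + 3.0 * rng.normal(size=m)     # point already in the "ray" regime

ratios, dirs = [], []
for n in range(60):
    grad = w - w_hat                     # gradient of (1/2)|w - w_hat|^2, i.e., G = I
    w_new = w - eps * grad
    ratios.append(np.linalg.norm(w_new - w_hat) / np.linalg.norm(w - w_hat))
    dirs.append((w_new - w_hat) / np.linalg.norm(w_new - w_hat))
    w = w_new
```

Every contraction ratio equals $1 - \varepsilon$ and the unit direction of $w_n - \hat{w}$ stays fixed: the trajectory is a straight ray into $\hat{w}$.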
We call the trajectory $w_n$ a ray, which approaches $\hat{w}$ linearly when $n$ is large. An interesting question is from which direction the ray approaches $\hat{w}$. Even if the initial $w(0)$ is uniformly distributed, we cannot say that $w_n$ approaches $\hat{w}$ isotropically, since the dynamics (16) is highly nonlinear in an early stage of learning. In other words, the distribution of the rays is not isotropic but may have biased directions.

Although the rays are not isotropically distributed around $\hat{w}$, the quantity $\hat{w} - w_0$ is isotropically distributed around $w_0$, because $G$ is put equal to $I$ at $w_0$. This implies that the relative direction of a ray with respect to the isotropically distributed $\hat{w} - w_0$ is isotropically distributed. This gives us the following lemma, which helps to calculate $\langle R_{\mathrm{gen}} \rangle$ and $\langle R_{\mathrm{train}} \rangle$.

Lemma 2: Although $w_n$ does not necessarily approach $\hat{w}$ isotropically, the ensemble averages $\langle R_{\mathrm{gen}}(w_n) \rangle$ and $\langle R_{\mathrm{train}}(w_n) \rangle$ are the same as those calculated by assuming that $w_n$ approaches $\hat{w}$ isotropically.
4 Its dynamical behavior was studied by Amari [1], Heskes and Kappen [17], and recently by Barkai et al. [8] and Solla and Saad [28].
5 Or to $\hat{w}_t$; but the subscript is omitted hereafter.
Fig. 1. Geometrical picture to determine the optimal stopping point $w^*$.
Proof: The distribution of the ray is not necessarily isotropic, but $\hat{w}$ is distributed isotropically around $w_0$. The average $\langle \cdot \rangle$ is the expectation with respect to the unknown initial $w(0)$, which determines the ray, and with respect to $D_t$, which determines $\hat{w}$. Let us fix a ray and take the average with respect to $\hat{w}$, that is, with respect to $D_t$. Since the relative direction between the fixed ray and all possible $\hat{w}$ is isotropically distributed, it follows that taking the average with respect to $\hat{w}$ for a fixed ray is equivalent to taking the average with respect to isotropically distributed rays and a fixed $\hat{w}$. Therefore, we may calculate the averages by using the isotropically distributed rays instead of the isotropically distributed $\hat{w}$.
IV. VIRTUAL OPTIMAL STOPPING RULE
When the parameter $w_n$ approaches $\hat{w}$ as learning goes on, the generalization behavior of network $N(w_n)$ is evaluated by the sequence

$R_{\mathrm{gen}}(w_n), \quad n = 1, 2, \ldots$   (19)

It is believed that $R_{\mathrm{gen}}(w_n)$ decreases in an early period of learning but increases later. Therefore, there exists an optimal stopping time $n^*$ at which $R_{\mathrm{gen}}(w_n)$ is minimized. The stopping time $n^*$ is a random variable depending on $D_t$ and the initial $w(0)$. We evaluate the ensemble average of $R_{\mathrm{gen}}(w_{n^*})$.

The true $w_0$ and the m.l.e. $\hat{w}$ are in general different, and their distance is of order $1/\sqrt{t}$. Let us compose a sphere $S$ of which the center is at the midpoint $(w_0 + \hat{w})/2$ and which passes through both $w_0$ and $\hat{w}$, as is shown in Fig. 1. Its diameter is denoted by $d$, where

$d = |\hat{w} - w_0|$   (20)

and

$\langle d^2 \rangle = \frac{m}{t}.$   (21)
Fig. 2. Distribution of the angle $\varphi$.
Let $l$ be the ray, that is, the trajectory starting at $w(0)$, which is far from the neighborhood of $\hat{w}$. The optimal stopping point $w^*$ that minimizes

$R_{\mathrm{gen}}(w_n) = H_0 + \frac{1}{2} |w_n - w_0|^2$   (22)

is given by the following lemma.

Lemma 3: The optimal stopping point $w^*$ is asymptotically the first intersection of the ray $l$ and the sphere $S$.

Proof: Since $w^*$ is the point on $l$ such that $w^* - w_0$ is orthogonal to $l$, it lies on the sphere $S$ (Fig. 1). When ray $l$ is approaching $\hat{w}$ from the opposite side of $w_0$ (the right-hand side in the figure), the first intersection point is $\hat{w}$ itself. In this case, the optimal stopping never occurs until the trajectory converges to $\hat{w}$.
Let $\varphi$ be the angle between the ray $l$ and the diameter $\hat{w} - w_0$ of the sphere $S$. We now show the distribution of $\varphi$ when the rays are isotropically distributed relative to $\hat{w} - w_0$.

Lemma 4: When ray $l$ is approaching from the side in which $w_0$ is included, the probability density of $\varphi$, $0 \le \varphi \le \pi/2$, is given by

$p(\varphi) = c_m \sin^{m-2} \varphi$   (23)

where

$c_m = \left[ \int_0^{\pi/2} \sin^{m-2} \varphi \, d\varphi \right]^{-1}.$

Proof: Let us compose a unit $(m-1)$-dimensional sphere centered at $\hat{w}$ (Fig. 2). Since ray $l$ is considered to approach $\hat{w}$ isotropically (Lemma 2), its intersection with this sphere is uniformly distributed on it when $l$ is approaching from the side of $w_0$. Let us consider the area on the sphere such that the angles are between $\varphi$ and $\varphi + d\varphi$. Then, the area is an $(m-2)$-dimensional sphere whose radius is $\sin \varphi$ (Fig. 2). Hence, its volume is $V_{m-2} \sin^{m-2} \varphi \, d\varphi$, where $V_{m-2}$ is the volume of a unit $(m-2)$-sphere. By normalization, the density of $\varphi$ is (23).
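The density (23) can be probed by simulation (our own sanity check, with an arbitrary dimension $m$): for isotropic ray directions in $R^m$, the squared sine of the angle against a fixed diameter averages to $(m-1)/m$, which is the moment used in the proof of Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 10, 200000
u = rng.normal(size=(n, m))
u /= np.linalg.norm(u, axis=1, keepdims=True)   # isotropic ray directions
cos_phi = u[:, 0]                               # angle against a fixed diameter
sin2 = 1.0 - cos_phi ** 2
# Density (23) proportional to sin^{m-2}(phi) implies E[sin^2 phi] = (m-1)/m.
est = sin2.mean()
```

The estimate concentrates sharply around $(m-1)/m$; in high dimension almost all isotropic rays are nearly orthogonal to any fixed diameter.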
Now we have the following theorem.

Theorem 1: The average generalization error at the optimal stopping point is asymptotically given by

$\langle R_{\mathrm{gen}}(w^*) \rangle = H_0 + \frac{1}{2t} \left( m - \frac{1}{2} \right).$   (24)

Proof: When ray $l$ is at angle $\varphi$, $0 \le \varphi \le \pi/2$, the optimal stopping point $w^*$ is on the sphere $S$. It is easily shown that

$|w^* - w_0|^2 = d^2 \sin^2 \varphi.$

This is the case where $l$ approaches from the left-hand side in Fig. 1, which occurs with probability 0.5, and the average of $\sin^2 \varphi$ is

$\langle \sin^2 \varphi \rangle = c_m \int_0^{\pi/2} \sin^m \varphi \, d\varphi = \frac{m-1}{m}.$

Since $\langle d^2 \rangle = m/t$, we have

$\langle |w^* - w_0|^2 \rangle = \frac{m-1}{t}.$

When $\varphi$ is $\pi/2 \le \varphi \le \pi$, that is, $l$ approaches from the opposite side, the trajectory does not stop until it reaches $\hat{w}$, so that

$\langle |w^* - w_0|^2 \rangle = \langle d^2 \rangle = \frac{m}{t}.$

This also occurs with probability 0.5. Hence, from (22), we have proved the theorem.
The theorem shows that, when we could know the optimal stopping time for each trajectory, the generalization error decreases by $1/(4t)$, which has the effect of decreasing the effective number of dimensions by $1/2$. This effect is negligible when $m$ is large. The optimal stopping time is of order $\log t$. However, it is impossible to know the optimal stopping time. If we stop learning at an estimated optimal time, we have some gain when ray $l$ comes from the same side as $w_0$, but we have some loss when ray $l$ comes from the opposite direction.
Wang et al. [30] calculated $\langle R_{\mathrm{gen}}(w_n) \rangle$ in the case of linear machines and defined the optimal average stopping time $\bar{n}$ that minimizes $\langle R_{\mathrm{gen}}(w_n) \rangle$. This is different from the present approach, since our $n^*$ is defined for each trajectory. Hence it is a random variable depending on $D_t$ and $w(0)$. Our average $\langle R_{\mathrm{gen}}(w_{n^*}) \rangle$ is different from $\min_n \langle R_{\mathrm{gen}}(w_n) \rangle$, since $\bar{n}$ is common to all the trajectories while the $n^*$ are different. We can show

$\langle R_{\mathrm{gen}}(w_{n^*}) \rangle \le \langle R_{\mathrm{gen}}(w_{\bar{n}}) \rangle$

in agreement with Wang et al. [30]. This shows that the gain becomes much smaller by using the average stopping time $\bar{n}$. However, the point is that there is no direct means to estimate $n^*$ except for cross-validation. Hence, we need to analyze cross-validation early stopping.
Fig. 3. Optimal stopping point $w^*$ by cross-validation.
V. OPTIMAL STOPPING BY CROSS-VALIDATION
In order to find the optimal stopping time for each trajectory, an idea is to divide the available examples into two disjoint sets: the training set for learning and the cross-validation set for evaluating the generalization error. The training error monotonically decreases with iterations, but according to the folklore, the generalization error evaluated by the cross-validation set decreases in an early period but increases after a critical period. This gives the optimal time to stop training. The present section studies two fundamental problems: 1) Given $t$ examples, how many examples should be used in the training set and how many in the cross-validation set? 2) How much gain can one expect from the above cross-validated stopping?
Let us divide the $t$ examples into $t'$ examples of the training set and $t''$ examples of the cross-validation set, where

$t = t' + t'', \quad t' = (1 - r)t, \quad t'' = rt.$   (25)

Let $\hat{w}$ be the m.l.e. from the $t'$ training examples, and let $\hat{v}$ be the m.l.e. from the other $t''$ cross-validation examples; that is, $\hat{w}$ and $\hat{v}$ minimize the training error function

$R_{\mathrm{train}}(w) = \frac{1}{t'} \sum_{i=1}^{t'} e(x_i, y_i; w)$   (26)

and cross-validation error function

$R_{\mathrm{cross}}(w) = \frac{1}{t''} \sum_{i=t'+1}^{t} e(x_i, y_i; w)$   (27)

respectively, where the summations are taken over the $t'$ training examples and the $t''$ cross-validation examples. Since the training examples and cross-validation examples are independent, both $\hat{w}$ and $\hat{v}$ are asymptotically subject to independent normal distributions with mean $w_0$ and covariance matrices $(1/t')\,G^{-1}$ and $(1/t'')\,G^{-1}$, respectively.
Let us compose the triangle with vertices $w_0$, $\hat{w}$, and $\hat{v}$ (Fig. 3). The trajectory starting at $w(0)$ enters linearly in the neighborhood of $\hat{w}$. The point $w^*$ on the trajectory which minimizes the cross-validation error is the point on the ray that is closest to $\hat{v}$, since the cross-validation error defined by (27) can be expanded from (13) as

$R_{\mathrm{cross}}(w) = R_{\mathrm{cross}}(\hat{v}) + \frac{1}{2} (w - \hat{v})^T (w - \hat{v}).$   (28)

Let $S'$ be the sphere whose center is at the midpoint $(\hat{w} + \hat{v})/2$ and which passes through both $\hat{w}$ and $\hat{v}$. Its diameter $d'$ is given by

$d' = |\hat{w} - \hat{v}|.$   (29)

Then, the optimal stopping point $w^*$ is given by the intersection of the trajectory and sphere $S'$. When the trajectory comes from the opposite side of $\hat{v}$ (the right-hand side in the figure), it does not intersect $S'$ until it converges to $\hat{w}$, so that the optimal point is $\hat{w}$ in this case.

The generalization error of $w^*$ is given by (12), so that we calculate the expectation of $|w^* - w_0|^2$.
Lemma 5:

$\langle |w^* - w_0|^2 \rangle = \frac{2m - 1}{2(1 - r)t} + \frac{1}{2rt}.$

The proof is given in Appendix B. It is immediate to show Lemma 6.

Lemma 6: The average generalization error by the optimal cross-validated early stopping is asymptotically

$\langle R_{\mathrm{gen}}(w^*) \rangle = H_0 + \frac{2m - 1}{4(1 - r)t} + \frac{1}{4rt}.$   (30)
We can then calculate the optimal division rate $r$ of examples which minimizes the generalization error.
Theorem 2: The average generalization error is minimized asymptotically at

$r_{\mathrm{opt}} = \frac{\sqrt{2m - 1} - 1}{2(m - 1)}.$   (31)
The theorem shows the optimal division of examples into training and cross-validation sets. When $m$ is large,

$r_{\mathrm{opt}} \approx \frac{1}{\sqrt{2m}}$   (32)

showing that only a fraction $1/\sqrt{2m}$ of the examples are to be used for cross-validation testing, while the remaining examples are used for training. When $m = 100$, this shows that 93% of the examples are to be used for training and only 7% are to be kept for cross-validation. From this we obtain
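Numerically, the optimum can be recovered by minimizing the $r$-dependent part of the cross-validated generalization error over a grid. The functional form coded below, $f(r) \propto (2m-1)/(1-r) + 1/r$, is our reading of (30); under that assumption the grid minimum, the closed-form rate of (31), and the large-$m$ approximation (32) all agree, and for $m = 100$ the optimal validation share comes out near 7%.

```python
import numpy as np

m = 100
r = np.linspace(1e-4, 0.5, 200000)        # candidate validation fractions
# r-dependent part of the cross-validated generalization error (assumed form).
f = (2 * m - 1) / (1 - r) + 1 / r
r_num = r[np.argmin(f)]                   # numerical optimum on the grid

r_closed = (np.sqrt(2 * m - 1) - 1) / (2 * (m - 1))   # closed-form optimum (31)
r_large_m = 1 / np.sqrt(2 * m)                        # large-m approximation (32)
```

The trade-off is transparent in the code: enlarging the validation set shrinks the $1/r$ term (a better estimate of the stopping point) but inflates the $(1-r)^{-1}$ term (a worse m.l.e. from fewer training examples).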
Theorem 3: The asymptotically optimal generalization error is

$\langle R_{\mathrm{gen}}(w^*) \rangle_{\mathrm{opt}} = H_0 + \frac{1}{4t} \left( 1 + \sqrt{2m - 1} \right)^2.$   (33)

When $m$ is large, we have

$\langle R_{\mathrm{gen}}(w^*) \rangle_{\mathrm{opt}} \approx H_0 + \frac{m}{2t} \left( 1 + \sqrt{\frac{2}{m}} \right).$   (34)

This shows that the generalization error increases slightly by cross-validation early stopping compared with learning which uses all the examples for training. That is,

$\langle R_{\mathrm{gen}}(w^*) \rangle > \langle R_{\mathrm{gen}}(\hat{w}_t) \rangle = H_0 + \frac{m}{2t}$   (35)

for the optimal cross-validated $w^*$ and the m.l.e. $\hat{w}_t$ based on all the $t$ examples without cross-validation.
Fig. 4. Geometrical picture for the intermediate range.
VI. INTERMEDIATE RANGE
So far we have seen that cross-validation stopping is asymptotically not effective. Now we would like to discuss from several viewpoints why cross-validation early stopping is effective in the intermediate range. Note, however, that our explanations are not mathematically rigorous, but rather sketch three possible lines along which our theory can be generalized for the intermediate range: 1) a geometrical picture; 2) the distribution of the initial $w(0)$; and 3) the nonlinearities of the trajectories.
In order to have an intuitive understanding, we draw another picture (Fig. 4). Here, $\hat{w}$ is distributed uniformly on the sphere $S_0$ whose center is the true value $w_0$. Let $w(0)$ be the initial weight and let $D$ be the distance between $w(0)$ and $w_0$. We draw tangent rays from $w(0)$ to the sphere $S_0$. Then, the tangent points on $S_0$ form an $(m-2)$-dimensional sphere that divides $S_0$ into two parts, $A$ (shaded, left side) and $B$ (right side). When $\hat{w}$ lies on $A$, early stopping is not necessary, but when $\hat{w}$ lies on $B$, then early stopping improves the solution.

In the asymptotic range where $t$ is very large, whatever $D$ is, it is far larger than the radius of $S_0$. This implies that $w(0)$ is located almost infinitely far away, so that the $(m-2)$-sphere dividing $S_0$ into $A$ and $B$ is equal to the $(m-2)$-sphere $E$ which is the vertical cut of $S_0$ (the cut at $w_0$ orthogonal to the line connecting $w(0)$ and $w_0$). In this case, $S_0$ is divided into two parts $A$ and $B$ of equal volume. Moreover, when $m$ is large, most of the volume of $S_0$ is concentrated in a neighborhood of $E$, so that the effect of early stopping is not remarkable.

In the intermediate range where $t$ is not so large, the sphere of tangent points is different from $E$ and is located on the left side of $E$. Since most of the volume is concentrated in a neighborhood of $E$, the measure of $A$ is negligible in this case. This implies that early stopping improves the solution with a probability close to one. In the extreme case where $t$ is very small and $w(0)$ is inside $S_0$, immediate stopping without any training is the best strategy.
This shows that, when $t$ is not asymptotically large, we cannot neglect the distribution of the initial $w(0)$, which is not so far from $w_0$. Let $W$ be the parameter space. What is the natural distribution of $w(0)$ and $D$? If we assume a uniform distribution over a very large convex region, $D$ is very large. However, a natural prior distribution is the Jeffreys noninformative distribution,6 which is given by $q(w) = \sqrt{\det G(w)}/V$, where $\sqrt{\det G(w)}\,dw$ is the Riemannian volume element of $W$ and $V = \int_W \sqrt{\det G(w)}\,dw$. In most neural network architectures, the volume $V$ is finite, and this implies that the effect of the distribution of the initial $w(0)$ cannot be neglected when $t$ is not asymptotically large.

It is possible to construct a theory by taking $q(w)$ into account. However, for the theory to be valid where $t$ is not asymptotically large, the nonlinear learning trajectories cannot be neglected, and we need higher-order corrections to the asymptotic properties of the estimator $\hat{w}$ (cf. Amari7).
VII. SIMULATIONS
We use standard feedforward classifier networks with $N$ inputs, $H$ sigmoid hidden units, and softmax outputs (classes). The network parameters $w$ consist of biases and weights. The input layer is connected to the hidden layer via the weight matrix $W^{(1)}$, the hidden layer is connected to the output layer via $W^{(2)}$, and no short-cut connections are present. The output activity $y_k$ of the $k$th output unit is calculated via the softmax squashing function

$y_k = \frac{\exp(u_k)}{\sum_j \exp(u_j)}$

where $u_k = \sum_i W^{(2)}_{ki} h_i + b_k$ is the local field potential and

$h_i = \left[ 1 + \exp\!\left( -\sum_j W^{(1)}_{ij} x_j - b_i \right) \right]^{-1}$

is the activity of the $i$th hidden unit, given input $x$. Each output $y_k$ codes the a posteriori probability of $x$ being in class $k$. Although the network is completely deterministic, it is constructed to approximate class conditional probabilities ([13]).

Therefore, each randomly generated teacher $w_0$ represents by construction a multinomial probability distribution $p(y \mid x; w_0)$ over the classes, given a random input $x$. We use the same network architecture for teacher and student. Thus, we assume that the model is faithful, i.e., the teacher distribution can be exactly represented by a student $N(w)$.
A training and cross-validation set of the form $D_t = \{(x_i, y_i)\}_{i=1}^{t}$ is generated randomly by drawing samples of $x$ from a uniform distribution and forward propagating them through the teacher network. Then, according to the teacher's outputs, one output unit is set to one
6 Note also that the Bayes estimator $\hat{w}_{\mathrm{Bayes}}$ with the Jeffreys prior $\sqrt{g}$ is better than the m.l.e. from the point of view of minimizing Kullback–Leibler divergence, although they are equivalent for large $t$.
7 Differential-Geometrical Methods in Statistics. New York: Springer-Verlag, Lecture Notes in Statistics no. 28, 1985.
Fig. 5. New coordinate system $z$.
stochastically and all others are set to zero, leading to the target vector $y$. A student network is then trying to approximate the teacher, given the example set $D_t$. For training the student network we use, within the backpropagation framework, conjugate gradient descent with a line search to optimize the training error function (26), starting from some random initial vector. The cross-validation error (27) is measured on the cross-validation examples to stop learning. The average generalization ability (4) is approximately estimated by

$R(w) \approx \frac{1}{T} \sum_{i=1}^{T} e(x_i, y_i; w)$   (36)

on a large test set ($T = 50\,000$ patterns) and averaged over 128–256 trials.1 We compare the generalization error for the settings: exhaustive training (no stopping), early stopping (controlled by the cross-validation examples), and optimal stopping (controlled by the large test set). The simulations were performed on a parallel computer (CM5). Every curve in the figures takes about 8 h of computing time on a 128-, respectively, 256-partition of the CM5, i.e., we perform 128–256 parallel trials. This setting enabled us to do extensive statistics (cf. [4], [24]).
Fig. 6 shows the results of simulations, where $N = 8$, $H = 8$, and $M - 1 = 4$, so that the number $m$ of modifiable parameters is fixed. In Fig. 6(a) we see the intermediate range of $t$ patterns (see [24]), where early stopping improves the generalization ability to a large extent, clearly confirming the folklore mentioned above. From Fig. 6 we note that the learning curves and variances are similar in the intermediate range no matter how the split is chosen. Only as we get to small numbers of patterns do we find growing variances for the small splits, which is to be expected.
1 Several sample sets have been used with changing initial vectors. In each trial a sample of size $t$ is generated, and the net is trained starting from a random initialization $w(0)$. As the number of patterns is increased to $t_0$ in subsequent experiments, the newly generated patterns are added to the old set of $t$ patterns.
Fig. 6. Shown is $R(w)$ plotted for different sizes $r$ of the early stopping set for an 8-8-4 classifier network ($N = 8$, $H = 8$, $M - 1 = 4$), (a) in the intermediate and (b) in the asymptotic regime, as a function of $1/t$. An early stopping set of 20% means: 80% of the $t$ patterns in the training set are used for training, while 20% of the $t$ patterns are used to control the early stopping. "opt." denotes the use of a very large test set (50 000), and "no stopping" addresses the case where 100% of the training set is used for exhaustive learning.
From Fig. 6(b) we observe clearly that saturated learning without early stopping is the best in the asymptotic range of $t$, a range which, due to the limited size of the data sets, is often inaccessible in practical applications. Cross-validated early stopping does not improve the generalization error here, so that no overtraining is observed on the average in this range. This result confirms similar findings by Sjöberg and Ljung [29]. In the asymptotic area [Fig. 6(b)] we observe that the smaller the percentage of the cross-validation set used to determine the point of early stopping, the better the generalization ability. Fig. 7 shows that the learning curves for different sizes of the cross-validation set are in good agreement with the theoretical prediction of (30).
Fig. 7. Shown is $R(w)$ from the simulation plotted for different sizes $r$ of the cross-validation stopping set for an 8-8-4 classifier network in the asymptotic regime as a function of $1/t$. The straight-line interpolations are obtained from (30). Note the nice agreement between theory and experiment.

Three systematic contributions to the randomness arise: 1) random examples; 2) initialization of the student weights; and 3) local minima. The part of the variance given by the examples is bounded by the Cramér–Rao bound. Initialization gives only a small contribution, since the results do not change qualitatively if we start with initial weights far enough outside the circle of Fig. 1. Finally, there is a contribution to the variance from local minima distributed around the m.l.e. solution. Note that local minima do not change the validity of our theory. Our simulations essentially measure a coarse distribution of local-minima solutions around the m.l.e. and contain a variance due to this fact (for a further discussion of local minima in learning curve simulations, see [24]).
In Fig. 8(a) we show an exponential interpolation of the learning curve over the whole range of examples in the situation of optimal stopping (controlled by the large test set). The fitted exponent of $-0.49$ indicates a $1/\sqrt{t}$ scaling. In the asymptotic range, as seen from Fig. 8(a) and (b), the $1/\sqrt{t}$ fit fails to hold and an $m/2t$ scaling gives much better interpolation results. An explanation of this effect can be obtained by information-geometrical means: early stopping gives, by definition, a solution which is not a global or local minimum of the empirical risk function (11) on the training patterns. Therefore the gradient terms in the likelihood expansion (see Appendix A) contribute and have to be considered carefully. In the intermediate range the gradient term in the expansion, which scales as $1/\sqrt{t}$, gives the dominant contribution. Asymptotically the gradient term no longer gives large contributions because the solution is very close to a local minimum, and thus the $m/2t$ scaling dominates.
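The two scalings can be distinguished by a small fit on synthetic data (our own sketch; the numbers are illustrative, not the paper's measurements): excess risk generated by the asymptotic law $R_0 + m/2t$ yields a log-log slope of $-1$, whereas intermediate-regime data dominated by the gradient term would show a slope near $-1/2$.

```python
import numpy as np

# Synthetic illustration: excess risk generated by the asymptotic law
# R(t) = R0 + m/(2 t); a power-law fit on log-log axes then recovers
# exponent -1, while a 1/sqrt(t) regime would give roughly -1/2.
m, R0 = 8, 0.1
ts = np.array([100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0])
R = R0 + m / (2 * ts)

# fit log(R - R0) = slope * log(t) + intercept
slope, intercept = np.polyfit(np.log(ts), np.log(R - R0), 1)
# slope ~ -1.0 and exp(intercept) ~ m/2 recover the m/2t law
```

In practice $R_0$ is unknown and must be fitted jointly, but the slope diagnostic is the same.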
VIII. CONCLUDING REMARKS
We proposed an asymptotic theory of overtraining. The analysis treats realizable stochastic neural networks trained with the Kullback–Leibler divergence. However, a generalization to unrealizable cases and other loss functions can be obtained along the lines of our reasoning.
Fig. 8. $R(w)$ plotted as a function of $1/t$ for optimal stopping. (a) Exponential fit $t^{-0.49}$ over the whole range of $t$ and (b) comparison of the exponential and the $m/2t$ fit in the asymptotic regime. Shown is data for an 8-8-4 classifier network.

It is demonstrated both theoretically and in simulations that asymptotically the gain in the generalization error is small if we perform early stopping, even if we have access to the optimal stopping time. For cross-validation stopping we computed the optimal split between training and validation examples and showed, for a large number $m$ of parameters, that optimally only a fraction $1/\sqrt{2m}$ of the examples should be used to determine the point of early stopping in order to obtain the best performance. For example, if $m = 100$ this corresponds to using roughly 93% of the training patterns for training and only 7% for testing where to stop. Yet even if we use the optimal split for cross-validation stopping, the generalization error is always increased compared to exhaustive training. Nevertheless, note that this asymptotic range is often inaccessible in practical applications due to the limited size of the data sets.
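Assuming the asymptotic $1/\sqrt{2m}$ rule discussed above, the split can be computed directly (the function name is our own, and $m$ denotes the number of network parameters):

```python
import math

# Optimal fraction of examples to hold out for cross-validation
# stopping, assuming the asymptotic 1/sqrt(2m) rule; m = number of
# network parameters. Hypothetical helper, not from the paper's code.
def optimal_validation_fraction(m: int) -> float:
    return 1.0 / math.sqrt(2.0 * m)

# m = 100 parameters: hold out about 7%, train on the remaining ~93%
r_opt = optimal_validation_fraction(100)   # r_opt ~ 0.0707
```

Note how weak the dependence on $m$ is: even $m = 10\,000$ still asks for a validation set of about 0.7% of the data, never a fixed large share.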
In the nonasymptotic region our simulations confirm the folklore that cross-validated early stopping always helps to enhance the performance, since it decreases the generalization error. We gave an intuitive explanation of why this is observed (see Section VI, Fig. 4).
Furthermore, for this intermediate range our asymptotic theory provides a guideline which can be used in practice as a heuristic estimate for the choice of the optimal size of the early stopping set, in the same sense as the NIC ([21]–[23]) is used as a guideline for model selection.
In future studies we would like to extend our theory—alongthe lines of Section VI—to incorporate the prior distributionsof the initial weights and the nonlinear learning trajectoriesnecessary to understand the intermediate range.
APPENDIX A
PROOF OF (13)
Since $\hat{R}(w)$ is minimized at $\hat{w}$, we have
$$\nabla \hat{R}(\hat{w}) = 0 \qquad \text{(A.1)}$$
Its second derivative $\nabla\nabla \hat{R}(w)$ converges to its expectation $G = \nabla\nabla R(w_0)$ by the law of large numbers. Hence, by expanding $R(\hat{w})$ at $w_0$, we have
$$R(\hat{w}) = R(w_0) + \frac{1}{2}(\hat{w} - w_0)^{T} G (\hat{w} - w_0) \qquad \text{(A.2)}$$
In order to evaluate $\hat{R}(\hat{w})$, we expand it as
$$\hat{R}(\hat{w}) = \hat{R}(w_0) + \nabla \hat{R}(w_0)^{T}(\hat{w} - w_0) + \frac{1}{2}(\hat{w} - w_0)^{T} G (\hat{w} - w_0) \qquad \text{(A.3)}$$
Expanding (A.1) around $w_0$, we have
$$\nabla \hat{R}(w_0) = -G(\hat{w} - w_0) \qquad \text{(A.4)}$$
By substituting this in (A.3), we have
$$\hat{R}(\hat{w}) = \hat{R}(w_0) - \frac{1}{2}(\hat{w} - w_0)^{T} G (\hat{w} - w_0).$$
Again by substituting this in (A.2), we have (13).
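The expansion above implies the familiar $\pm m/2t$ bias between generalization and training error at the m.l.e. As a numerical sanity check (our own toy example, not from the paper), maximum-likelihood estimation of the mean of an $m$-dimensional unit Gaussian reproduces it:

```python
import numpy as np

# Toy check of the +-m/2t bias: estimate the mean of an m-dimensional
# unit Gaussian (true mean w0 = 0) by maximum likelihood. Up to an
# additive constant, R(w) = |w - w0|^2 / 2 is the generalization error
# and R_emp(w) the training error; asymptotic theory predicts
#   <R(w_hat)     - R(w0)> = +m/(2 t)
#   <R_emp(w_hat) - R(w0)> = -m/(2 t)
rng = np.random.default_rng(1)
m, t, trials = 8, 50, 10000
gen_excess = train_excess = 0.0
for _ in range(trials):
    x = rng.normal(size=(t, m))            # t samples, w0 = 0
    w_hat = x.mean(axis=0)                 # maximum-likelihood estimate
    gen_excess += 0.5 * np.sum(w_hat ** 2)
    train_excess += (0.5 * np.mean(np.sum((x - w_hat) ** 2, axis=1))
                     - 0.5 * np.mean(np.sum(x ** 2, axis=1)))
gen_excess /= trials
train_excess /= trials
# gen_excess ~ +m/(2 t) = 0.08, train_excess ~ -m/(2 t) = -0.08
```

The symmetry of the two excesses around $R(w_0)$ is exactly the substitution step carried out above.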
APPENDIX B
PROOF OF LEMMA 5
We have the triangle $w_0 \hat{w} \tilde{w}$, and let $S$ be the sphere whose diameter is spanned by $\hat{w}$ and $\tilde{w}$. Let $l$ be a ray approaching $\hat{w}$ from the left-hand side, which intersects $S$ at the optimal stopping point. Let $\theta$ be the angle between the ray and the diameter $\hat{w}\tilde{w}$. The probability density of $\theta$ is given by (23). Let us consider the set of points on $S$ whose angles are between $\theta$ and $\theta + d\theta$; it is a sphere of lower dimension on $S$. We then calculate the average of the square of the distance from $w_0$ when the stopping point is on this set (that is, when the angle is $\theta$)
where the integration is taken over this set. This is the case where the ray is from the same side.
For calculation, we introduce an orthogonal coordinate system in the space of (Fig. 5) such that 1) its origin is
put at so that ; 2) its and axes are on theplane of the triangle so that
3) the and axes are chosen such that
and 4) all the other axes are orthogonal to the triangle. Moreover, we assume the former case; the opposite case is analyzed in the same way, giving the same final result for symmetry reasons.
The sphere is written in this coordinate system as
where is the radius of the sphere
and
Hence
Here, we used the following properties:
From
we obtain
Therefore, from
when ray arrives from the left side, we get
When ray arrives from the right side, , so that
holds. Hence, by averaging the above two, we have
In order to obtain we evaluate
giving
which can be expanded for large as
ACKNOWLEDGMENT
The authors would like to thank Y. LeCun, S. Bös, and K. Schulten for valuable discussions. K.-R. M. thanks K. Schulten for the warm hospitality during his stay at the Beckman Institute in Urbana, IL, and for the warm hospitality at RIKEN during the completion of this work. The authors acknowledge computing time on the CM5 in Urbana (NCSA) and in Bonn, Germany. They also acknowledge L. Ljung for communicating his results on regularization and early stopping prior to publication.
REFERENCES
[1] S. Amari, “Theory of adaptive pattern classifiers,” IEEE Trans. Electron. Comput., vol. EC-16, pp. 299–307, 1967.
[2] ——, “A universal theorem on learning curves,” Neural Networks, vol. 6, pp. 161–166, 1993.
[3] S. Amari and N. Murata, “Statistical theory of learning curves under entropic loss criterion,” Neural Computa., vol. 5, pp. 140–153, 1993.
[4] S. Amari, N. Murata, K.-R. Müller, M. Finke, and H. Yang, “Statistical theory of overtraining—Is cross-validation effective?,” in NIPS‘95: Advances in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds. Cambridge, MA: MIT Press, 1996.
[5] H. Akaike, “A new look at statistical model identification,” IEEE Trans. Automat. Contr., vol. AC-19, pp. 716–723, 1974.
[6] D. Barber, D. Saad, and P. Sollich, “Test error fluctuations in finite linear perceptrons,” Neural Computa., vol. 7, pp. 809–821, 1995.
[7] ——, “Finite-size effects and optimal test size in linear perceptrons,” J. Phys. A, vol. 28, pp. 1325–1334, 1995.
[8] N. Barkai, H. S. Seung, and H. Sompolinsky, “On-line learning of dichotomies,” in NIPS 7: Advances in Neural Information Processing Systems, G. Tesauro, D. S. Touretzky, and T. K. Leen, Eds. Cambridge, MA: MIT Press, 1995.
[9] C. M. Bishop, “Regularization and complexity control in feedforward networks,” Aston Univ., Tech. Rep. NCRG/95/022, 1995.
[10] S. Bös, W. Kinzel, and M. Opper, “Generalization ability of perceptrons with continuous outputs,” Phys. Rev. E, vol. 47, pp. 1384–1391, 1993.
[11] ——, “Avoiding overfitting by finite temperature learning and cross-validation,” in Proc. Int. Conf. Artificial Neural Networks ICANN‘95, Paris, 1995, pp. 111–116.
[12] B. Efron and R. Tibshirani, An Introduction to the Bootstrap. London: Chapman and Hall, 1993.
[13] M. Finke and K.-R. Müller, “Estimating a posteriori probabilities using stochastic network models,” in Proc. 1993 Connectionist Models Summer School, M. Mozer, P. Smolensky, D. S. Touretzky, J. L. Elman, and A. S. Weigend, Eds. Hillsdale, NJ: Lawrence Erlbaum, 1994, p. 324.
[14] I. Guyon, “A scaling law for the validation-set training-set ratio,” preprint, 1996.
[15] M. H. Hassoun, Fundamentals of Artificial Neural Networks. Cambridge, MA: MIT Press, 1995.
[16] R. Hecht-Nielsen, Neurocomputing. Reading, MA: Addison-Wesley, 1989.
[17] T. Heskes and B. Kappen, “Learning process in neural networks,” Phys. Rev. A, vol. 44, pp. 2718–2762, 1991.
[18] M. Kearns, “A bound on the error of cross validation using the approximation and estimation rates, with consequences for the training-test split,” in NIPS‘95: Advances in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds. Cambridge, MA: MIT Press, 1996.
[19] A. Krogh and J. Hertz, “Generalization in a linear perceptron in the presence of noise,” J. Phys. A, vol. 25, pp. 1135–1147, 1992.
[20] J. E. Moody, “The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems,” in NIPS 4: Advances in Neural Information Processing Systems, J. E. Moody, S. J. Hanson, and R. P. Lippmann, Eds. San Mateo, CA: Morgan Kaufmann, 1992.
[21] N. Murata, S. Yoshizawa, and S. Amari, “A criterion for determining the number of parameters in an artificial neural-network model,” in Artificial Neural Networks, T. Kohonen et al., Eds. Amsterdam, The Netherlands: Elsevier, 1991, pp. 9–14.
[22] ——, “Learning curves, model selection and complexity of neural networks,” in NIPS 5: Advances in Neural Information Processing Systems, S. J. Hanson et al., Eds. San Mateo, CA: Morgan Kaufmann, 1993.
[23] ——, “Network information criterion—Determining the number of hidden units for an artificial neural-network model,” IEEE Trans. Neural Networks, vol. 5, pp. 865–872, 1994.
[24] K.-R. Müller, M. Finke, N. Murata, K. Schulten, and S. Amari, “A numerical study on learning curves in stochastic multilayer feedforward networks,” Univ. Tokyo, Tech. Rep. METR 03-95, 1995; also Neural Computa., vol. 8, pp. 1085–1106, 1996.
[25] T. Poggio and F. Girosi, “Regularization algorithms for learning that are equivalent to multilayer networks,” Science, vol. 247, pp. 978–982, 1990.
[26] J. Rissanen, “Stochastic complexity and modeling,” Ann. Statist., vol. 14, pp. 1080–1100, 1986.
[27] D. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1—Foundations. Cambridge, MA: MIT Press, 1986.
[28] D. Saad and S. A. Solla, “On-line learning in soft committee machines,” Phys. Rev. E, vol. 52, pp. 4225–4243, 1995, and “Exact solution for on-line learning in multilayer neural networks,” Phys. Rev. Lett., vol. 74, pp. 4337–4340, 1995.
[29] J. Sjöberg and L. Ljung, “Overtraining, regularization and searching for minimum with application to neural networks,” Linköping Univ., Sweden, Tech. Rep. LiTH-ISY-R-1567, 1994.
[30] C. Wang, S. S. Venkatesh, and J. S. Judd, “Optimal stopping and effective machine complexity in learning,” to appear, 1994 (revised and extended version of NIPS, vol. 6, pp. 303–310, 1995).
[31] V. Vapnik, Estimation of Dependences Based on Empirical Data. New York: Springer-Verlag, 1992.
[32] ——, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
Shun-ichi Amari (M’71–SM’92–F’94) received the bachelor’s degree from the University of Tokyo in 1958, majoring in mathematical engineering, and the Dr. Eng. degree from the University of Tokyo in 1963.
He was an Associate Professor at Kyushu Uni-versity, Japan, and a Professor at the University ofTokyo. He is now the Director of the Brain Infor-mation Processing Group in the Frontier ResearchProgram at the Institute of Physical and ChemicalResearch (RIKEN) in Japan. He has been engaged in
research in many areas of mathematical engineering and applied mathematics.In particular, he has devoted himself to mathematical foundations of neuralnetwork theory. Another main subject of his research is the informationgeometry that he himself proposed.
Dr. Amari is the former President of the International Neural NetworkSociety and a member of editorial or advisory editorial boards of more than15 international journals. He was given the IEEE Neural Networks PioneerAward in 1992, Japan Academy Award in 1995, and IEEE Emanuel R. PioreAward in 1997, among others.
Noboru Murata received the B. Eng., M. Eng., and Dr. Eng. degrees in mathematical engineering and information physics from the University of Tokyo in 1987, 1989, and 1992, respectively.
He was a Research Associate at the University of Tokyo, and a Visiting Researcher with the Research Institute for Computer Architecture and Software Technology of the German National Research Center for Information Technology (GMD FIRST) from 1995 to 1996, supported by the Alexander von Humboldt Foundation. He is currently a Frontier Researcher with the Brain Information Processing Group in the Frontier Research Program at the Institute of Physical and Chemical Research (RIKEN) in Japan. His research interests include the theoretical aspects of neural networks, focusing on the dynamics and statistical properties of learning.
Klaus-Robert Müller received the Diplom degree in mathematical physics in 1989 and the Ph.D. degree in theoretical computer science in 1992, both from the University of Karlsruhe, Germany. From 1992 to 1994 he was a Postdoctoral Fellow at the Research Institute for Computer Architecture and Software Technology of the German National Research Center for Information Technology (GMD FIRST) in Berlin.
From 1994 to 1995 he was a European Community STP Research Fellow at the University of Tokyo.
Since 1995 he has been with GMD FIRST in Berlin. He also lectures at Humboldt University and the Technical University of Berlin. He has worked on statistical physics and statistical learning theory of neural networks and time-series analysis. His present interests have expanded to support vector learning machines and nonstationary blind separation techniques.
Michael Finke received the Diplom degree in computer science in 1993 from the University of Karlsruhe, Germany. Since then he has been working toward the Ph.D. degree at the Interactive Systems Laboratories in Karlsruhe.
He was a Research Associate at the Heidelberg Scientific Research Center of IBM from 1989 to 1993. Since 1995 he has been a Visiting Researcher at Carnegie Mellon University (CMU), Pittsburgh, PA. His primary research interests are the theory of neural networks, probabilistic models, and information geometry of neural networks and graphical statistical models. His research at CMU is also focused on developing a speech recognition engine for large-vocabulary conversational speech recognition tasks.
Howard Hua Yang (M’95) received the B.Sc. degree in applied mathematics from Harbin Shipbuilding Engineering Institute in 1982, and the M.Sc. and Ph.D. degrees in probability and statistics in 1984 and 1989, respectively, from Zhongshan University, P.R. China.
From January 1988 to January 1990, he wasa Lecturer in the Department of Mathematics atZhongshan University, P.R. China. From April 1990to March 1992, he was a Research Fellow in theDepartment of Computer Science and Computer
Engineering at La Trobe University, Australia, where he did research and teaching in neural networks and computer science. From April 1992 to December 1994, he was a Research Fellow in the communications and digital signal processing group in the Department of Electrical and Electronic Engineering at the University of Melbourne, Australia. Since January 1995, he has been a Frontier Researcher in the Laboratory for Information Representation in the Frontier Research Program of the Institute of Physical and Chemical Research (RIKEN) in Japan. His research interests include neural networks, statistical signal processing, blind deconvolution/equalization, and estimation theory.