

Journal of Artificial Intelligence Research 12 (2000) 149–198 Submitted 11/99; published 3/00

A Model of Inductive Bias Learning

Jonathan Baxter   JONATHAN.BAXTER@ANU.EDU.AU

Research School of Information Sciences and Engineering, Australian National University, Canberra 0200, Australia

Abstract

A major problem in machine learning is that of inductive bias: how to choose a learner's hypothesis space so that it is large enough to contain a solution to the problem being learnt, yet small enough to ensure reliable generalization from reasonably-sized training sets. Typically such bias is supplied by hand through the skill and insights of experts. In this paper a model for automatically learning bias is investigated. The central assumption of the model is that the learner is embedded within an environment of related learning tasks. Within such an environment the learner can sample from multiple tasks, and hence it can search for a hypothesis space that contains good solutions to many of the problems in the environment. Under certain restrictions on the set of all hypothesis spaces available to the learner, we show that a hypothesis space that performs well on a sufficiently large number of training tasks will also perform well when learning novel tasks in the same environment. Explicit bounds are also derived demonstrating that learning multiple tasks within an environment of related tasks can potentially give much better generalization than learning a single task.

1. Introduction

Often the hardest problem in any machine learning task is the initial choice of hypothesis space; it has to be large enough to contain a solution to the problem at hand, yet small enough to ensure good generalization from a small number of examples (Mitchell, 1991). Once a suitable bias has been found, the actual learning task is often straightforward. Existing methods of bias generally require the input of a human expert in the form of heuristics and domain knowledge (for example, through the selection of an appropriate set of features). Despite their successes, such methods are clearly limited by the accuracy and reliability of the expert's knowledge and also by the extent to which that knowledge can be transferred to the learner. Thus it is natural to search for methods for automatically learning the bias.

In this paper we introduce and analyze a formal model of bias learning that builds upon the PAC model of machine learning and its variants (Vapnik, 1982; Valiant, 1984; Blumer, Ehrenfeucht, Haussler, & Warmuth, 1989; Haussler, 1992). These models typically take the following general form: the learner is supplied with a hypothesis space \(H\) and training data \(z = \{(x_1, y_1), \dots, (x_m, y_m)\}\) drawn independently according to some underlying distribution \(P\) on \(X \times Y\). Based on the information contained in \(z\), the learner's goal is to select a hypothesis \(h: X \to Y\) from \(H\) minimizing some measure \(\mathrm{er}_P(h)\) of expected loss with respect to \(P\) (for example, in the case of squared loss, \(\mathrm{er}_P(h) := \mathbb{E}_{(x,y)\sim P}\,(h(x) - y)^2\)). In such models the learner's bias is represented by the choice of \(H\); if \(H\) does not contain a good solution to the problem, then, regardless of how much data the learner receives, it cannot learn.

Of course, the best way to bias the learner is to supply it with an \(H\) containing just a single optimal hypothesis. But finding such a hypothesis is precisely the original learning problem, so in the PAC model there is no distinction between bias learning and ordinary learning.



Or put differently, the PAC model does not model the process of choosing an inductive bias; it simply takes the hypothesis space \(H\) as given and proceeds from there. To overcome this problem, in this paper we assume that instead of being faced with just a single learning task, the learner is embedded within an environment of related learning tasks. The learner is supplied with a family of hypothesis spaces \(\mathbb{H} = \{H\}\), and its goal is to find a bias (i.e. a hypothesis space \(H \in \mathbb{H}\)) that is appropriate for the entire environment. A simple example is the problem of handwritten character recognition. A preprocessing stage that identifies and removes any (small) rotations, dilations and translations of an image of a character will be advantageous for recognizing all characters. If the set of all individual character recognition problems is viewed as an environment of learning problems (that is, the set of all problems of the form "distinguish 'A' from all other characters", "distinguish 'B' from all other characters", and so on), this preprocessor represents a bias that is appropriate for all problems in the environment. It is likely that there are many other currently unknown biases that are also appropriate for this environment. We would like to be able to learn these automatically.

There are many other examples of learning problems that can be viewed as belonging to environments of related problems. For example, each individual face recognition problem belongs to an (essentially infinite) set of related learning problems (all the other individual face recognition problems); the set of all individual spoken word recognition problems forms another large environment, as does the set of all fingerprint recognition problems, printed Chinese and Japanese character recognition problems, stock price prediction problems and so on. Even medical diagnostic and prognostic problems, where a multitude of diseases are predicted from the same pathology tests, constitute an environment of related learning problems.

In many cases these "environments" are not normally modeled as such; instead they are treated as single, multiple-category learning problems. For example, recognizing a group of faces would normally be viewed as a single learning problem with multiple class labels (one for each face in the group), not as multiple individual learning problems. However, if a reliable classifier for each individual face in the group can be constructed, then they can easily be combined to produce a classifier for the whole group. Furthermore, by viewing the faces as an environment of related learning problems, the results presented here show that bias can be learnt that will be good for learning novel faces, a claim that cannot be made for the traditional approach.

This point goes to the heart of our model: we are not concerned with adjusting a learner's bias so it performs better on some fixed set of learning problems. Such a process is in fact just ordinary learning but with a richer hypothesis space in which some components labelled "bias" are also allowed to vary. Instead, we suppose the learner is faced with a (potentially infinite) stream of tasks, and that by adjusting its bias on some subset of the tasks it improves its learning performance on future, as yet unseen tasks.

Bias that is appropriate for all problems in an environment must be learnt by sampling from many tasks. If only a single task is learnt then the bias extracted is likely to be specific to that task. In the rest of this paper, a general theory of bias learning is developed based upon the idea of learning multiple related tasks. Loosely speaking (formal results are stated in Section 2), there are two main conclusions of the theory presented here:

• Learning multiple related tasks reduces the sampling burden required for good generalization, at least on a number-of-examples-required-per-task basis.

• Bias that is learnt on sufficiently many training tasks is likely to be good for learning novel tasks drawn from the same environment.

The second point shows that a form of meta-generalization is possible in bias learning. Ordinarily, we say a learner generalizes well if, after seeing sufficiently many training examples, it produces a hypothesis that with high probability will perform well on future examples of the same task. However, a bias learner generalizes well if, after seeing sufficiently many training tasks, it produces a hypothesis space that with high probability contains good solutions to novel tasks. Another term that has been used for this process is Learning to Learn (Thrun & Pratt, 1997).

Our main theorems are stated in an agnostic setting (that is, \(\mathbb{H}\) does not necessarily contain a hypothesis space with solutions to all the problems in the environment), but we also give improved bounds in the realizable case. The sample complexity bounds appearing in these results are stated in terms of combinatorial parameters related to the complexity of the set of all hypothesis spaces \(\mathbb{H}\) available to the bias learner. For Boolean learning problems (pattern classification) these parameters are the bias learning analogue of the Vapnik-Chervonenkis dimension (Vapnik, 1982; Blumer et al., 1989).

As an application of the general theory, the problem of learning an appropriate set of neural-network features for an environment of related tasks is formulated as a bias learning problem. In the case of continuous neural-network features we are able to prove upper bounds on the number of training tasks and the number of examples of each training task required to ensure that a set of features that works well for the training tasks will, with high probability, work well on novel tasks drawn from the same environment. The upper bound on the number of tasks scales as \(O(b)\), where \(b\) is a measure of the complexity of the possible feature sets available to the learner, while the upper bound on the number of examples of each task scales as \(O(a + b/n)\), where \(O(a)\) is the number of examples required to learn a task if the "true" set of features (that is, the correct bias) is already known, and \(n\) is the number of tasks. Thus, in this case we see that as the number of related tasks learnt increases, the number of examples required of each task for good generalization decays to the minimum possible. For Boolean neural-network feature maps we are able to show a matching lower bound on the number of examples required per task of the same form.
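To make the scaling explicit, these bounds can be summarised schematically as follows (here \(a\) and \(b\) are the informal complexity parameters just described, not quantities defined formally until Section 3):
\[
n = O(b), \qquad m = O\!\left(a + \frac{b}{n}\right), \qquad nm = O(na + b),
\]
so the total number of examples grows only additively in \(b\), and as \(n \to \infty\) the per-task requirement \(m\) approaches \(O(a)\), the number of examples needed when the correct bias is known in advance.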

1.1 Related Work

There is a large body of previous algorithmic and experimental work in the machine learning and statistics literature addressing the problems of inductive bias learning and improving generalization through multiple task learning. Some of these approaches can be seen as special cases of, or at least closely aligned with, the model described here, while others are more orthogonal. Without being completely exhaustive, in this section we present an overview of the main contributions. See Thrun and Pratt (1997, chapter 1) for a more comprehensive treatment.

• Hierarchical Bayes. The earliest approaches to bias learning come from Hierarchical Bayesian methods in statistics (Berger, 1985; Good, 1980; Gelman, Carlin, Stern, & Rubin, 1995). In contrast to the Bayesian methodology, the present paper takes an essentially empirical process approach to modeling the problem of bias learning. However, a model using a mixture of hierarchical Bayesian and information-theoretic ideas was presented in Baxter (1997a), with similar conclusions to those found here. An empirical study showing the utility of the hierarchical Bayes approach in a domain containing a large number of related tasks was given in Heskes (1998).

• Early machine learning work. In Rendell, Seshu, and Tcheng (1987) "VBMS" or Variable Bias Management System was introduced as a mechanism for selecting amongst different learning algorithms when tackling a new learning problem. "STABB" or Shift To a Better Bias (Utgoff, 1986) was another early scheme for adjusting bias, but unlike VBMS, STABB was not primarily focussed on searching for bias applicable to large problem domains. Our use of an "environment of related tasks" in this paper may also be interpreted as an "environment of analogous tasks" in the sense that conclusions about one task can be arrived at by analogy with (sufficiently many of) the other tasks. For an early discussion of analogy in this context, see Russell (1989, S4.3), in particular the observation that for analogous problems the sampling burden per task can be reduced.

• Metric-based approaches. The metric used in nearest-neighbour classification, and in vector quantization to determine the nearest code-book vector, represents a form of inductive bias. Using the model of the present paper, and under some extra assumptions on the tasks in the environment (specifically, that their marginal input-space distributions are identical and they only differ in the conditional probabilities they assign to class labels), it can be shown that there is an optimal metric or distance measure to use for vector quantization and one-nearest-neighbour classification (Baxter, 1995a, 1997b; Baxter & Bartlett, 1998). This metric can be learnt by sampling from a subset of tasks from the environment, and then used as a distance measure when learning novel tasks drawn from the same environment. Bounds on the number of tasks and examples of each task required to ensure good performance on novel tasks were given in Baxter and Bartlett (1998), along with an experiment in which a metric was successfully trained on examples of a subset of 400 Japanese characters and then used as a fixed distance measure when learning 2600 as yet unseen characters.

A similar approach is described in Thrun and Mitchell (1995) and Thrun (1996), in which a neural network's output was trained to match labels on a novel task, while simultaneously being forced to match its gradient to derivative information generated from a distance metric trained on previous, related tasks. Performance on the novel tasks improved substantially with the use of the derivative information.

Note that there are many other adaptive metric techniques used in machine learning, but these all focus exclusively on adjusting the metric for a fixed set of problems rather than learning a metric suitable for learning novel, related tasks (bias learning).

• Feature learning or learning internal representations. As with adaptive metric techniques, there are many approaches to feature learning that focus on adapting features for a fixed task rather than learning features to be used in novel tasks. One of the few cases where features have been learnt on a subset of tasks with the explicit aim of using them on novel tasks was Intrator and Edelman (1996), in which a low-dimensional representation was learnt for a set of multiple related image-recognition tasks and then used to successfully learn novel tasks of the same kind. The experiments reported in Baxter (1995a, chapter 4), Baxter (1995b) and Baxter and Bartlett (1998) are also of this nature.

• Bias learning in Inductive Logic Programming (ILP). Predicate invention refers to the process in ILP whereby new predicates thought to be useful for the classification task at hand are added to the learner's domain knowledge. By using the new predicates as background domain knowledge when learning novel tasks, predicate invention may be viewed as a form of inductive bias learning. Preliminary results with this approach on a chess domain are reported in Khan, Muggleton, and Parson (1998).

• Improving performance on a fixed reference task. "Multi-task learning" (Caruana, 1997) trains extra neural network outputs to match related tasks in order to improve generalization performance on a fixed reference task. Although this approach does not explicitly identify the extra bias generated by the related tasks in a way that can be used to learn novel tasks, it is an example of exploiting the bias provided by a set of related tasks to improve generalization performance. Other similar approaches include Suddarth and Kergosien (1990), Suddarth and Holden (1991), and Abu-Mostafa (1993).

• Bias as computational complexity. In this paper we consider inductive bias from a sample-complexity perspective: how does the learnt bias decrease the number of examples required of novel tasks for good generalization? A natural alternative line of enquiry is how the running-time or computational complexity of a learning algorithm may be improved by training on related tasks. Some early algorithms for neural networks in this vein are contained in Sharkey and Sharkey (1993) and Pratt (1992).

• Reinforcement Learning. Many control tasks can appropriately be viewed as elements of sets of related tasks, such as learning to navigate to different goal states, or learning a set of complex motor control tasks. A number of papers in the reinforcement learning literature have proposed algorithms both for sharing the information in related tasks to improve average generalization performance across those tasks (Singh, 1992; Ring, 1995), and for learning bias from a set of tasks to improve performance on future tasks (Sutton, 1992; Thrun & Schwartz, 1995).

1.2 Overview of the Paper

In Section 2 the bias learning model is formally defined, and the main sample complexity results are given showing the utility of learning multiple related tasks and the feasibility of bias learning. These results show that the sample complexity is controlled by the size of certain covering numbers associated with the set of all hypothesis spaces available to the bias learner, in much the same way as the sample complexity in learning Boolean functions is controlled by the Vapnik-Chervonenkis dimension (Vapnik, 1982; Blumer et al., 1989). The results of Section 2 are upper bounds on the sample complexity required for good generalization when learning multiple tasks and learning inductive bias.

The general results of Section 2 are specialized to the case of feature learning with neural networks in Section 3, where an algorithm for training features by gradient descent is also presented. For this special case we are able to show matching lower bounds for the sample complexity of multiple task learning. In Section 4 we present some concluding remarks and directions for future research. Many of the proofs are quite lengthy and have been moved to the appendices so as not to interrupt the flow of the main text.

The following tables contain a glossary of the mathematical symbols used in the paper.


\(X\) : input space
\(Y\) : output space
\(P\) : distribution on \(X \times Y\) (learning task)
\(\ell\) : loss function
\(H\) : hypothesis space
\(h\) : hypothesis
\(\mathrm{er}_P(h)\) : error of hypothesis \(h\) on distribution \(P\)
\(z\) : training set
\(A\) : learning algorithm
\(\mathrm{er}_z(h)\) : empirical error of \(h\) on training set \(z\)
\(\mathcal{P}\) : set of all learning tasks \(P\)
\(Q\) : distribution over learning tasks
\(\mathbb{H}\) : family of hypothesis spaces
\(\mathrm{er}_Q(H)\) : loss of hypothesis space \(H\) on environment \(Q\)
\(\mathbf{z}\) : \((n, m)\)-sample
\(\mathrm{er}_{\mathbf{z}}(H)\) : empirical loss of \(H\) on \(\mathbf{z}\)
\(\mathbf{A}\) : bias learning algorithm
\(h_\ell\) : function induced by \(h\) and \(\ell\)
\(H_\ell\) : set of \(h_\ell\)
\((h_1, \dots, h_n)_\ell\) : average of \(h_{1,\ell}, \dots, h_{n,\ell}\)
\(\mathbf{h}_\ell\) : same as \((h_1, \dots, h_n)_\ell\)
\(H^n_\ell\) : set of \((h_1, \dots, h_n)_\ell\)
\(\mathbb{H}^n_\ell\) : set of \(H^n_\ell\)
\(H^*\) : function on probability distributions
\(\mathbb{H}^*\) : set of \(H^*\)
\(d_{\mathbf{P}}\) : pseudo-metric on \(\mathbb{H}^n_\ell\)
\(d_Q\) : pseudo-metric on \(\mathbb{H}^*\)
\(\mathcal{N}(\varepsilon, \mathbb{H}^*, d_Q)\) : covering number of \(\mathbb{H}^*\)
\(\mathcal{C}(\varepsilon, \mathbb{H}^*)\) : capacity of \(\mathbb{H}^*\)
\(\mathcal{N}(\varepsilon, \mathbb{H}^n_\ell, d_{\mathbf{P}})\) : covering number of \(\mathbb{H}^n_\ell\)
\(\mathcal{C}(\varepsilon, \mathbb{H}^n_\ell)\) : capacity of \(\mathbb{H}^n_\ell\)
\(\mathbf{h}\) : sequence of \(n\) hypotheses \((h_1, \dots, h_n)\)
\(\mathbf{P}\) : sequence of \(n\) distributions \((P_1, \dots, P_n)\)
\(\mathrm{er}_{\mathbf{P}}(\mathbf{h})\) : average loss of \(\mathbf{h}\) on \(\mathbf{P}\)
\(\mathrm{er}_{\mathbf{z}}(\mathbf{h})\) : average loss of \(\mathbf{h}\) on \(\mathbf{z}\)
\(\mathcal{F}\) : set of feature maps
\(\mathcal{G}\) : output class, composed with feature maps \(f\)
\(\mathcal{G} \circ f\) : hypothesis space associated with \(f\)
\(\mathcal{G}_\ell\) : loss function class associated with \(\mathcal{G}\)
\(\mathcal{N}(\varepsilon, \mathcal{G}_\ell, d_P)\) : covering number of \(\mathcal{G}_\ell\)
\(\mathcal{C}(\varepsilon, \mathcal{G}_\ell)\) : capacity of \(\mathcal{G}_\ell\)
\(d_{[P,\mathcal{G}_\ell]}(f, f')\) : pseudo-metric on feature maps \(f, f'\)
\(\mathcal{N}(\varepsilon, \mathcal{F}, d_{[P,\mathcal{G}_\ell]})\) : covering number of \(\mathcal{F}\)


\(\mathcal{C}_{\mathcal{G}_\ell}(\varepsilon, \mathcal{F})\) : capacity of \(\mathcal{F}\)
\(H_w\) : neural network hypothesis space
\(H_{|x}\) : \(H\) restricted to vector \(x\)
\(\Pi_H(m)\) : growth function of \(H\)
\(\mathrm{VCdim}(H)\) : Vapnik-Chervonenkis dimension of \(H\)
\(H_{|\mathbf{x}}\) : \(H\) restricted to matrix \(\mathbf{x}\)
\(\mathbb{H}_{|\mathbf{x}}\) : \(\mathbb{H}\) restricted to matrix \(\mathbf{x}\)
\(\Pi_{\mathbb{H}}(n, m)\) : growth function of \(\mathbb{H}\)
\(d_{\mathbb{H}}(n)\) : dimension function of \(\mathbb{H}\)
\(\overline{d}(\mathbb{H})\) : upper dimension function of \(\mathbb{H}\)
\(\underline{d}(\mathbb{H})\) : lower dimension function of \(\mathbb{H}\)
\(\mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n)\) : optimal performance of \(\mathbb{H}^n\) on \(\mathbf{P}\)
\(d_\nu\) : metric on \(\mathbb{R}^+\)
\(h_1 \oplus \cdots \oplus h_n\) : average of \(h_1, \dots, h_n\)
\(H_1 \oplus \cdots \oplus H_n\) : set of \(h_1 \oplus \cdots \oplus h_n\)
\(\Gamma(2m, n)\) : permutations on integer pairs
\(\mathbf{z}^\sigma\) : permuted \(\mathbf{z}\)
\(d_{\mathbf{z}}(\mathbf{h}, \mathbf{h}')\) : empirical \(\ell_1\) metric on functions \(\mathbf{h}\)
\(\mathrm{er}_{\mathbf{P}}(H)\) : optimal average error of \(H\) on \(\mathbf{P}\)

2. The Bias Learning Model

In this section the bias learning model is formally introduced. To motivate the definitions, we first describe the main features of ordinary (single-task) supervised learning models.

2.1 Single-Task Learning

Computational learning theory models of supervised learning usually include the following ingredients:

• an input space \(X\) and an output space \(Y\),
• a probability distribution \(P\) on \(X \times Y\),
• a loss function \(\ell: Y \times Y \to \mathbb{R}\), and
• a hypothesis space \(H\), which is a set of hypotheses or functions \(h: X \to Y\).

As an example, if the problem is to learn to recognize images of Mary's face using a neural network, then \(X\) would be the set of all images (typically represented as a subset of \(\mathbb{R}^d\) where each component is a pixel intensity), \(Y\) would be the set \(\{0, 1\}\), and the distribution \(P\) would be peaked over images of different faces and the correct class labels. The learner's hypothesis space \(H\) would be a class of neural networks mapping the input space \(\mathbb{R}^d\) to \(\{0, 1\}\). The loss in this case would be discrete loss:
\[
\ell(y, y') :=
\begin{cases}
1 & \text{if } y \neq y'\\
0 & \text{if } y = y'
\end{cases}
\tag{1}
\]


Using the loss function \(\ell\) allows us to present a unified treatment of both pattern recognition (\(Y = \{0, 1\}\), \(\ell\) as above) and real-valued function learning (e.g. regression), in which \(Y = \mathbb{R}\) and usually \(\ell(y, y') = (y - y')^2\). The goal of the learner is to select a hypothesis \(h \in H\) with minimum expected loss:
\[
\mathrm{er}_P(h) := \int_{X \times Y} \ell(h(x), y)\, dP(x, y). \tag{2}
\]

Of course, the learner does not know \(P\) and so it cannot search through \(H\) for an \(h\) minimizing \(\mathrm{er}_P(h)\). In practice, the learner samples repeatedly from \(X \times Y\) according to the distribution \(P\) to generate a training set
\[
z := \{(x_1, y_1), \dots, (x_m, y_m)\}. \tag{3}
\]

Based on the information contained in \(z\) the learner produces a hypothesis \(h \in H\). Hence, in general a learner is simply a map \(A\) from the set of all training samples to the hypothesis space \(H\):
\[
A: \bigcup_{m > 0} (X \times Y)^m \to H
\]
(stochastic learners can be treated by assuming a distribution-valued \(A\)).

Many algorithms seek to minimize the empirical loss of \(h\) on \(z\), defined by
\[
\mathrm{er}_z(h) := \frac{1}{m} \sum_{i=1}^{m} \ell(h(x_i), y_i). \tag{4}
\]
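As a concrete illustration of this step, the following sketch (a hypothetical finite hypothesis class and synthetic data, not anything defined in the paper) selects the hypothesis minimizing the empirical loss (4):

```python
import numpy as np

def empirical_loss(h, z, loss):
    """Empirical loss er_z(h) of hypothesis h on training set z (equation (4))."""
    return float(np.mean([loss(h(x), y) for x, y in z]))

def erm(H, z, loss):
    """Return the hypothesis in the finite class H with smallest empirical loss on z."""
    return min(H, key=lambda h: empirical_loss(h, z, loss))

# Hypothetical example: threshold classifiers on the real line, discrete (0/1) loss.
discrete_loss = lambda y_hat, y: float(y_hat != y)
H = [lambda x, t=t: int(x > t) for t in np.linspace(-1.0, 1.0, 21)]

rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=50)
z = [(x, int(x > 0.3)) for x in xs]          # labels generated by the threshold 0.3

h_best = erm(H, z, discrete_loss)
print("smallest empirical loss:", empirical_loss(h_best, z, discrete_loss))
```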

Of course, there are more intelligent things to do with the data than simply minimizing empirical error; for example, one can add regularisation terms to avoid over-fitting.

However the learner chooses its hypothesis \(h\), if we have a uniform bound (over all \(h \in H\)) on the probability of large deviation between \(\mathrm{er}_z(h)\) and \(\mathrm{er}_P(h)\), then we can bound the learner's generalization error \(\mathrm{er}_P(h)\) as a function of its empirical loss on the training set, \(\mathrm{er}_z(h)\). Whether such a bound holds depends upon the "richness" of \(H\). The conditions ensuring convergence between \(\mathrm{er}_z(h)\) and \(\mathrm{er}_P(h)\) are by now well understood; for Boolean function learning (\(Y = \{0, 1\}\), discrete loss), convergence is controlled by the VC-dimension\(^1\) of \(H\):

Theorem 1. Let \(P\) be any probability distribution on \(X \times \{0, 1\}\) and suppose \(z = \{(x_1, y_1), \dots, (x_m, y_m)\}\) is generated by sampling \(m\) times from \(X \times \{0, 1\}\) according to \(P\). Let \(d := \mathrm{VCdim}(H)\). Then with probability at least \(1 - \delta\) (over the choice of the training set \(z\)), all \(h \in H\) will satisfy
\[
\mathrm{er}_P(h) \le \mathrm{er}_z(h) + \left(\frac{32}{m}\left(d \log\frac{2em}{d} + \log\frac{4}{\delta}\right)\right)^{1/2}. \tag{5}
\]

Proofs of this result may be found in Vapnik (1982) and Blumer et al. (1989), and will not be reproduced here.

1. The VC dimension of a class of Boolean functions \(H\) is the largest integer \(d\) such that there exists a subset \(S := \{x_1, \dots, x_d\} \subseteq X\) such that the restriction of \(H\) to \(S\) contains all \(2^d\) Boolean functions on \(S\).
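To get a feel for the numbers, the deviation term in (5) can be evaluated directly. The short sketch below (illustrative values only) does this for a class of VC dimension \(d = 10\); the constants in (5) are loose, so the bound only becomes non-trivial for fairly large \(m\):

```python
import math

def vc_deviation(m, d, delta):
    """Deviation term of equation (5): sqrt((32/m) * (d*log(2em/d) + log(4/delta)))."""
    return math.sqrt((32.0 / m) * (d * math.log(2 * math.e * m / d) + math.log(4.0 / delta)))

d, delta = 10, 0.05
for m in (1_000, 10_000, 100_000, 1_000_000):
    # The bound is vacuous (deviation > 1) until m is in the tens of thousands.
    print(m, round(vc_deviation(m, d, delta), 3))
```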


Theorem 1 only provides conditions under which the deviation between \(\mathrm{er}_P(h)\) and \(\mathrm{er}_z(h)\) is likely to be small; it does not guarantee that the true error \(\mathrm{er}_P(h)\) will actually be small. This is governed by the choice of \(H\). If \(H\) contains a solution with small error and the learner minimizes error on the training set, then with high probability \(\mathrm{er}_P(h)\) will be small. However, a bad choice of \(H\) will mean there is no hope of achieving small error. Thus, the bias of the learner in this model\(^2\) is represented by the choice of hypothesis space \(H\).

2.2 The Bias Learning Model

The main extra assumption of the bias learning model introduced here is that the learner is embedded in an environment of related tasks, and can sample from the environment to generate multiple training sets belonging to multiple different tasks. In the above model of ordinary (single-task) learning, a learning task is represented by a distribution \(P\) on \(X \times Y\). So in the bias learning model, an environment of learning problems is represented by a pair \((\mathcal{P}, Q)\), where \(\mathcal{P}\) is the set of all probability distributions on \(X \times Y\) (i.e., \(\mathcal{P}\) is the set of all possible learning problems) and \(Q\) is a distribution on \(\mathcal{P}\). \(Q\) controls which learning problems the learner is likely to see.\(^3\) For example, if the learner is in a face recognition environment, \(Q\) will be highly peaked over face-recognition-type problems, whereas if the learner is in a character recognition environment \(Q\) will be peaked over character-recognition-type problems (here, as in the introduction, we view these environments as sets of individual classification problems, rather than single, multiple-class classification problems).

Recall from the last paragraph of the previous section that the learner's bias is represented by its choice of hypothesis space \(H\). So to enable the learner to learn the bias, we supply it with a family or set of hypothesis spaces \(\mathbb{H} := \{H\}\).

Putting all this together, formally a learning to learn or bias learning problem consists of:

• an input space \(X\) and an output space \(Y\) (both of which are separable metric spaces),
• a loss function \(\ell: Y \times Y \to \mathbb{R}\),
• an environment \((\mathcal{P}, Q)\), where \(\mathcal{P}\) is the set of all probability distributions on \(X \times Y\) and \(Q\) is a distribution on \(\mathcal{P}\),
• a hypothesis space family \(\mathbb{H} = \{H\}\), where each \(H \in \mathbb{H}\) is a set of functions \(h: X \to Y\).

From now on we will assume the loss function \(\ell\) has range \([0, 1]\); equivalently, after rescaling, we assume that \(\ell\) is bounded.

2. The bias is also governed by how the learner uses the hypothesis space. For example, under some circumstances the learner may choose not to use the full power of \(H\) (a neural network example is early stopping). For simplicity, in this paper we abstract away from such features of the algorithm \(A\) and assume that it uses the entire hypothesis space \(H\).

3. \(Q\)'s domain is a \(\sigma\)-algebra of subsets of \(\mathcal{P}\). A suitable one for our purposes is the Borel \(\sigma\)-algebra \(\mathcal{B}(\mathcal{P})\) generated by the topology of weak convergence on \(\mathcal{P}\). If we assume that \(X\) and \(Y\) are separable metric spaces, then \(\mathcal{P}\) is also a separable metric space in the Prohorov metric (which metrizes the topology of weak convergence) (Parthasarathy, 1967), so there is no problem with the existence of measures on \(\mathcal{B}(\mathcal{P})\). See Appendix D for further discussion, particularly the proof of part 5 in Lemma 32.


We define the goal of a bias learner to be to find a hypothesis space \(H \in \mathbb{H}\) minimizing the following loss:
\[
\mathrm{er}_Q(H) := \int_{\mathcal{P}} \inf_{h \in H} \mathrm{er}_P(h)\, dQ(P)
= \int_{\mathcal{P}} \inf_{h \in H} \int_{X \times Y} \ell(h(x), y)\, dP(x, y)\, dQ(P). \tag{6}
\]
The only way \(\mathrm{er}_Q(H)\) can be small is if, with high \(Q\)-probability, \(H\) contains a good solution \(h\) to any problem \(P\) drawn at random according to \(Q\). In this sense \(\mathrm{er}_Q(H)\) measures how appropriate the bias embodied by \(H\) is for the environment \((\mathcal{P}, Q)\).

In general the learner will not know \(Q\), so it will not be able to find an \(H\) minimizing \(\mathrm{er}_Q(H)\) directly. However, the learner can sample from the environment in the following way:

• Sample \(n\) times from \(\mathcal{P}\) according to \(Q\) to yield \(P_1, \dots, P_n\).
• Sample \(m\) times from \(X \times Y\) according to each \(P_i\) to yield \(z_i = \{(x_{i1}, y_{i1}), \dots, (x_{im}, y_{im})\}\).
• The resulting \(n\) training sets (henceforth called an \((n, m)\)-sample if they are generated by the above process) are supplied to the learner. In the sequel, an \((n, m)\)-sample will be denoted by \(\mathbf{z}\) and written as a matrix:
\[
\mathbf{z} :=
\begin{matrix}
(x_{11}, y_{11}) & \cdots & (x_{1m}, y_{1m}) & = z_1\\
\vdots & & \vdots & \vdots\\
(x_{n1}, y_{n1}) & \cdots & (x_{nm}, y_{nm}) & = z_n
\end{matrix}
\tag{7}
\]

An \((n, m)\)-sample is simply \(n\) training sets \(z_1, \dots, z_n\) sampled from \(n\) different learning tasks \(P_1, \dots, P_n\), where each task is selected according to the environmental probability distribution \(Q\). The size of each training set is kept the same primarily to facilitate the analysis.
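The following toy sketch makes the two-stage sampling concrete: tasks are first drawn from \(Q\), then examples are drawn from each task. Everything in it (threshold tasks, a Beta task distribution) is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """Draw a task P from the environment distribution Q.  Here a task is a
    threshold classification problem on [0, 1]; Q is a Beta distribution over
    the threshold (both choices are arbitrary)."""
    return rng.beta(2.0, 2.0)

def sample_training_set(t, m):
    """Draw z_i = {(x_1, y_1), ..., (x_m, y_m)} i.i.d. from the task with threshold t."""
    x = rng.uniform(0.0, 1.0, size=m)
    y = (x > t).astype(int)
    return list(zip(x, y))

def sample_nm(n, m):
    """Draw an (n, m)-sample: n tasks from Q, then m labelled examples from each."""
    tasks = [sample_task() for _ in range(n)]
    z = [sample_training_set(t, m) for t in tasks]
    return tasks, z

tasks, z = sample_nm(n=5, m=20)
print(len(z), "training sets of", len(z[0]), "examples each")
```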

Based on the information contained in \(\mathbf{z}\), the learner must choose a hypothesis space \(H \in \mathbb{H}\). One way to do this would be for the learner to find an \(H\) minimizing the empirical loss on \(\mathbf{z}\), defined by
\[
\mathrm{er}_{\mathbf{z}}(H) := \frac{1}{n} \sum_{i=1}^{n} \inf_{h \in H} \mathrm{er}_{z_i}(h). \tag{8}
\]

Note that \(\mathrm{er}_{\mathbf{z}}(H)\) is simply the average over the \(n\) training sets of the best possible empirical error achievable on each training set \(z_i\) using a function from \(H\). It is a biased estimate of \(\mathrm{er}_Q(H)\). An unbiased estimate of \(\mathrm{er}_Q(H)\) would require choosing an \(H\) with minimal average error over the \(n\) distributions \(P_1, \dots, P_n\), where this is defined by \(\frac{1}{n} \sum_{i=1}^{n} \inf_{h \in H} \mathrm{er}_{P_i}(h)\).
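For a finite hypothesis space family, the empirical loss (8) and a crude bias learner based on it can be written down directly. The sketch below (hypothetical threshold classifiers again, and a brute-force search that would not scale to the parametric families of Section 3) is intended only to make the definitions concrete:

```python
import numpy as np

def empirical_loss(h, z_i, loss):
    """er_{z_i}(h): average loss of hypothesis h on a single training set z_i."""
    return float(np.mean([loss(h(x), y) for x, y in z_i]))

def er_nm_sample(H, z, loss):
    """Equation (8): average over the n training sets of the best empirical
    loss achievable on each z_i using a hypothesis from H."""
    return float(np.mean([min(empirical_loss(h, z_i, loss) for h in H) for z_i in z]))

def select_bias(family, z, loss):
    """A brute-force bias learner (a crude instance of the map A in (9)):
    return the hypothesis space in the family with smallest er_z(H)."""
    return min(family, key=lambda H: er_nm_sample(H, z, loss))

# Hypothetical usage: two candidate hypothesis spaces of threshold classifiers,
# one with a coarse grid of thresholds and one with a fine grid.
loss01 = lambda y_hat, y: float(y_hat != y)
coarse = [lambda x, t=t: int(x > t) for t in np.linspace(0.0, 1.0, 3)]
fine = [lambda x, t=t: int(x > t) for t in np.linspace(0.0, 1.0, 51)]
z = [[(x, int(x > 0.3)) for x in np.random.default_rng(i).uniform(0.0, 1.0, 30)]
     for i in range(5)]
print("fine grid selected:", select_bias([coarse, fine], z, loss01) is fine)
```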

As with ordinary learning, it is likely there are more intelligent things to do with the training data \(\mathbf{z}\) than minimizing (8). Denoting the set of all \((n, m)\)-samples by \((X \times Y)^{(n,m)}\), a general "bias learner" is a map \(\mathbf{A}\) that takes \((n, m)\)-samples as input and produces hypothesis spaces \(H \in \mathbb{H}\) as output:
\[
\mathbf{A}: \bigcup_{n > 0,\, m > 0} (X \times Y)^{(n,m)} \to \mathbb{H}. \tag{9}
\]


(As stated, \(\mathbf{A}\) is a deterministic bias learner; however, it is trivial to extend our results to stochastic learners.)

Note that in this paper we are concerned only with the sample complexity properties of a bias learner \(\mathbf{A}\); we do not discuss issues of the computability of \(\mathbf{A}\).

Since \(\mathbf{A}\) is searching for entire hypothesis spaces \(H\) within a family of such hypothesis spaces \(\mathbb{H}\), there is an extra representational question in our model of bias learning that is not present in ordinary learning, namely how the family \(\mathbb{H}\) is represented and searched by \(\mathbf{A}\). We defer this discussion until Section 2.5, after the main sample complexity results for this model of bias learning have been introduced. For the specific case of learning a set of features suitable for an environment of related learning problems, see Section 3.

Regardless of how the learner chooses its hypothesis space \(H\), if we have a uniform bound (over all \(H \in \mathbb{H}\)) on the probability of large deviation between \(\mathrm{er}_{\mathbf{z}}(H)\) and \(\mathrm{er}_Q(H)\), and we can compute an upper bound on \(\mathrm{er}_{\mathbf{z}}(H)\), then we can bound the bias learner's "generalization error" \(\mathrm{er}_Q(H)\). With this view, the question of generalization within our bias learning model becomes: how many tasks (\(n\)) and how many examples of each task (\(m\)) are required to ensure that \(\mathrm{er}_{\mathbf{z}}(H)\) and \(\mathrm{er}_Q(H)\) are close with high probability, uniformly over all \(H \in \mathbb{H}\)? Or, informally, how many tasks and how many examples of each task are required to ensure that a hypothesis space with good solutions to all the training tasks will contain good solutions to novel tasks drawn from the same environment?

It turns out that this kind of uniform convergence for bias learning is controlled by the "size" of certain function classes derived from the hypothesis space family \(\mathbb{H}\), in much the same way as the VC-dimension of a hypothesis space \(H\) controls uniform convergence in the case of Boolean function learning (Theorem 1). These "size" measures and other auxiliary definitions needed to state the main theorem are introduced in the following subsection.

2.3 Covering Numbers

Definition 1. For any hypothesis \(h: X \to Y\), define \(h_\ell: X \times Y \to [0, 1]\) by
\[
h_\ell(x, y) := \ell(h(x), y). \tag{10}
\]
For any hypothesis space \(H\) in the hypothesis space family \(\mathbb{H}\), define
\[
H_\ell := \{h_\ell : h \in H\}. \tag{11}
\]
For any sequence of \(n\) hypotheses \((h_1, \dots, h_n)\), define \((h_1, \dots, h_n)_\ell: (X \times Y)^n \to [0, 1]\) by
\[
(h_1, \dots, h_n)_\ell(x_1, y_1, \dots, x_n, y_n) := \frac{1}{n} \sum_{i=1}^{n} \ell(h_i(x_i), y_i). \tag{12}
\]
We will also use \(\mathbf{h}_\ell\) to denote \((h_1, \dots, h_n)_\ell\). For any \(H\) in the hypothesis space family \(\mathbb{H}\), define
\[
H^n_\ell := \{(h_1, \dots, h_n)_\ell : h_1, \dots, h_n \in H\}. \tag{13}
\]
Define
\[
\mathbb{H}^n_\ell := \bigcup_{H \in \mathbb{H}} H^n_\ell. \tag{14}
\]


In the first part of the definition above, hypotheses \(h: X \to Y\) are turned into functions \(h_\ell\) mapping \(X \times Y\) to \([0, 1]\) by composition with the loss function. \(H_\ell\) is then just the collection of all such functions where the original hypotheses come from \(H\); \(H_\ell\) is often called a loss-function class. In our case we are interested in the average loss across \(n\) tasks, where each of the \(n\) hypotheses is chosen from a fixed hypothesis space \(H\). This motivates the definition of \(\mathbf{h}_\ell\) and \(H^n_\ell\). Finally, \(\mathbb{H}^n_\ell\) is the collection of all \((h_1, \dots, h_n)_\ell\), with the restriction that all of \(h_1, \dots, h_n\) belong to a single hypothesis space \(H \in \mathbb{H}\).

Definition 2. For each \(H \in \mathbb{H}\), define \(H^*: \mathcal{P} \to [0, 1]\) by
\[
H^*(P) := \inf_{h \in H} \mathrm{er}_P(h). \tag{15}
\]
For the hypothesis space family \(\mathbb{H}\), define
\[
\mathbb{H}^* := \{H^* : H \in \mathbb{H}\}. \tag{16}
\]

It is the "size" of \(\mathbb{H}^n_\ell\) and \(\mathbb{H}^*\) that controls how large the \((n, m)\)-sample \(\mathbf{z}\) must be to ensure \(\mathrm{er}_{\mathbf{z}}(H)\) and \(\mathrm{er}_Q(H)\) are close uniformly over all \(H \in \mathbb{H}\). Their size will be defined in terms of certain covering numbers, and for this we need to define how to measure the distance between elements of \(\mathbb{H}^n_\ell\) and also between elements of \(\mathbb{H}^*\).

Definition 3. Let \(\mathbf{P} = (P_1, \dots, P_n)\) be any sequence of \(n\) probability distributions on \(X \times Y\). For any \(\mathbf{h}_\ell, \mathbf{h}'_\ell \in \mathbb{H}^n_\ell\), define
\[
d_{\mathbf{P}}(\mathbf{h}_\ell, \mathbf{h}'_\ell) := \int_{(X \times Y)^n} \bigl|\mathbf{h}_\ell(x_1, y_1, \dots, x_n, y_n) - \mathbf{h}'_\ell(x_1, y_1, \dots, x_n, y_n)\bigr|\, dP_1(x_1, y_1) \cdots dP_n(x_n, y_n). \tag{17}
\]
Similarly, for any distribution \(Q\) on \(\mathcal{P}\) and any \(H_1^*, H_2^* \in \mathbb{H}^*\), define
\[
d_Q(H_1^*, H_2^*) := \int_{\mathcal{P}} \bigl|H_1^*(P) - H_2^*(P)\bigr|\, dQ(P). \tag{18}
\]

It is easily verified that \(d_{\mathbf{P}}\) and \(d_Q\) are pseudo-metrics\(^4\) on \(\mathbb{H}^n_\ell\) and \(\mathbb{H}^*\) respectively.

Definition 4. An \(\varepsilon\)-cover of \((\mathbb{H}^*, d_Q)\) is a set \(\{H_1^*, \dots, H_N^*\}\) such that for all \(H^* \in \mathbb{H}^*\), \(d_Q(H^*, H_i^*) \le \varepsilon\) for some \(i = 1, \dots, N\). Note that we do not require the \(H_i^*\) to be contained in \(\mathbb{H}^*\), just that they be measurable functions on \(\mathcal{P}\). Let \(\mathcal{N}(\varepsilon, \mathbb{H}^*, d_Q)\) denote the size of the smallest such cover. Define the capacity of \(\mathbb{H}^*\) by
\[
\mathcal{C}(\varepsilon, \mathbb{H}^*) := \sup_Q \mathcal{N}(\varepsilon, \mathbb{H}^*, d_Q), \tag{19}
\]
where the supremum is over all probability measures on \(\mathcal{P}\). \(\mathcal{N}(\varepsilon, \mathbb{H}^n_\ell, d_{\mathbf{P}})\) is defined in a similar way, using \(d_{\mathbf{P}}\) in place of \(d_Q\). Define the capacity of \(\mathbb{H}^n_\ell\) by
\[
\mathcal{C}(\varepsilon, \mathbb{H}^n_\ell) := \sup_{\mathbf{P}} \mathcal{N}(\varepsilon, \mathbb{H}^n_\ell, d_{\mathbf{P}}), \tag{20}
\]
where now the supremum is over all sequences of \(n\) probability measures on \(X \times Y\).

4. A pseudo-metric \(d\) is a metric without the condition that \(d(x, y) = 0 \Rightarrow x = y\).
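Covering numbers of this kind are rarely computed exactly, but for a finite collection of functions evaluated on a finite set of points a usable upper bound can be obtained greedily. The sketch below is purely illustrative: the empirical \(L_1\) distance over sample points stands in for the integrals defining \(d_{\mathbf{P}}\) and \(d_Q\), and the "functions" are random vectors rather than anything drawn from \(\mathbb{H}^n_\ell\) or \(\mathbb{H}^*\):

```python
import numpy as np

def l1_distance(f_vals, g_vals):
    """Empirical L1 distance between two functions represented by their values
    on a common finite set of points (a stand-in for d_P and d_Q)."""
    return float(np.mean(np.abs(f_vals - g_vals)))

def greedy_cover(values, eps):
    """Greedily choose centres so that every row of `values` (one function per
    row) is within eps of some centre.  The number of centres is an upper
    bound on the eps-covering number of the represented function set."""
    centres = []
    for v in values:
        if not any(l1_distance(v, c) <= eps for c in centres):
            centres.append(v)
    return centres

# Illustrative example: 200 random [0, 1]-valued functions sampled at 50 points.
rng = np.random.default_rng(0)
vals = rng.uniform(0.0, 1.0, size=(200, 50))
print("greedy cover size at eps = 0.25:", len(greedy_cover(vals, 0.25)))
```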


2.4 Uniform Convergence for Bias Learners

Now we have enough machinery to state the main theorem. In the theorem the hypothesis space family is required to be permissible. Permissibility is discussed in detail in Appendix D, but note that it is a weak measure-theoretic condition satisfied by almost all "real-world" hypothesis space families. All logarithms are to base \(e\).

Theorem 2. Suppose \(X\) and \(Y\) are separable metric spaces and let \(Q\) be any probability distribution on \(\mathcal{P}\), the set of all distributions on \(X \times Y\). Suppose \(\mathbf{z}\) is an \((n, m)\)-sample generated by sampling \(n\) times from \(\mathcal{P}\) according to \(Q\) to give \(P_1, \dots, P_n\), and then sampling \(m\) times from each \(P_i\) to generate \(z_i = \{(x_{i1}, y_{i1}), \dots, (x_{im}, y_{im})\}\), \(i = 1, \dots, n\). Let \(\mathbb{H} = \{H\}\) be any permissible hypothesis space family. If the number of tasks \(n\) satisfies
\[
n \ge \max\left(\frac{256}{\varepsilon^2} \log\frac{8\, \mathcal{C}\!\left(\frac{\varepsilon}{32}, \mathbb{H}^*\right)}{\delta},\; \frac{64}{\varepsilon^2}\right), \tag{21}
\]
and the number of examples \(m\) of each task satisfies
\[
m \ge \max\left(\frac{256}{n \varepsilon^2} \log\frac{8\, \mathcal{C}\!\left(\frac{\varepsilon}{32}, \mathbb{H}^n_\ell\right)}{\delta},\; \frac{64}{\varepsilon^2}\right), \tag{22}
\]
then with probability at least \(1 - \delta\) (over the \((n, m)\)-sample \(\mathbf{z}\)), all \(H \in \mathbb{H}\) will satisfy
\[
\mathrm{er}_Q(H) \le \mathrm{er}_{\mathbf{z}}(H) + \varepsilon. \tag{23}
\]

Proof. See Appendix A.
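The bounds (21) and (22) are easy to evaluate once the capacities are known or upper-bounded (for feature classes, bounds of this kind follow from Theorem 6). The sketch below plugs in purely illustrative capacity values, which are assumptions made for the example rather than quantities computed in the paper:

```python
import math

def tasks_required(eps, delta, log_cap_Hstar):
    """Right-hand side of (21), given log C(eps/32, H*)."""
    return max(256.0 / eps ** 2 * (math.log(8.0 / delta) + log_cap_Hstar),
               64.0 / eps ** 2)

def examples_required(eps, delta, n, log_cap_Hnl):
    """Right-hand side of (22), given log C(eps/32, H^n_l)."""
    return max(256.0 / (n * eps ** 2) * (math.log(8.0 / delta) + log_cap_Hnl),
               64.0 / eps ** 2)

eps, delta = 0.1, 0.01
n = tasks_required(eps, delta, log_cap_Hstar=50.0)     # 50.0 is an assumed capacity
for log_cap in (100.0, 1_000.0):                       # assumed growth of log C with n
    m = examples_required(eps, delta, n, log_cap)
    print(f"n >= {n:.0f}, m >= {m:.0f}")
```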

There are several important points to note about Theorem 2:

1. Provided the capacities \(\mathcal{C}(\varepsilon, \mathbb{H}^*)\) and \(\mathcal{C}(\varepsilon, \mathbb{H}^n_\ell)\) are finite, the theorem shows that any bias learner that selects hypothesis spaces from \(\mathbb{H}\) can bound its generalisation error \(\mathrm{er}_Q(H)\) in terms of \(\mathrm{er}_{\mathbf{z}}(H)\) for sufficiently large \((n, m)\)-samples \(\mathbf{z}\). Most bias learners will not find the exact value of \(\mathrm{er}_{\mathbf{z}}(H)\) because it involves finding the smallest error of any hypothesis \(h \in H\) on each of the \(n\) training sets in \(\mathbf{z}\). But any upper bound on \(\mathrm{er}_{\mathbf{z}}(H)\) (found, for example, by gradient descent on some error function) will still give an upper bound on \(\mathrm{er}_Q(H)\). See Section 3.3.1 for a brief discussion on how this can be achieved in a feature learning setting.

2. In order to learn bias (in the sense that \(\mathrm{er}_Q(H)\) and \(\mathrm{er}_{\mathbf{z}}(H)\) are close uniformly over all \(H \in \mathbb{H}\)), both the number of tasks \(n\) and the number of examples \(m\) of each task must be sufficiently large. This is intuitively reasonable because the bias learner must see both sufficiently many tasks to be confident of the nature of the environment, and sufficiently many examples of each task to be confident of the nature of each task.

3. Once the learner has found an \(H \in \mathbb{H}\) with a small value of \(\mathrm{er}_{\mathbf{z}}(H)\), it can then use \(H\) to learn novel tasks \(P\) drawn according to \(Q\). One then has the following theorem bounding the sample complexity required for good generalisation when learning with \(H\) (the proof is very similar to the proof of the bound on \(m\) in Theorem 2).


Theorem 3. Let \(z = \{(x_1, y_1), \dots, (x_m, y_m)\}\) be a training set generated by sampling from \(X \times Y\) according to some distribution \(P\). Let \(H\) be a permissible hypothesis space. For all \(\varepsilon, \delta\) with \(0 < \varepsilon, \delta < 1\), if the number of training examples \(m\) satisfies
\[
m \ge \max\left(\frac{64}{\varepsilon^2} \log\frac{4\, \mathcal{C}\!\left(\frac{\varepsilon}{16}, H_\ell\right)}{\delta},\; \frac{16}{\varepsilon^2}\right), \tag{24}
\]

then with probability at least \(1 - \delta\), all \(h \in H\) will satisfy
\[
\mathrm{er}_P(h) \le \mathrm{er}_z(h) + \varepsilon.
\]
The capacity \(\mathcal{C}(\varepsilon, H_\ell)\) appearing in equation (24) is defined in an analogous fashion to the capacities in Definition 4 (we just use the pseudo-metric \(d_P(h_\ell, h'_\ell) := \int_{X \times Y} |h_\ell(x, y) - h'_\ell(x, y)|\, dP(x, y)\)). The important thing to note about Theorem 3 is that the number of examples required for good generalisation when learning novel tasks is proportional to the logarithm of the capacity of the learnt hypothesis space \(H\). In contrast, if the learner does not do any bias learning, it will have no reason to select one hypothesis space \(H \in \mathbb{H}\) over any other, and consequently it would have to view as a candidate solution any hypothesis in any of the hypothesis spaces \(H \in \mathbb{H}\). Thus, its sample complexity will be proportional to the logarithm of the capacity of \(\bigcup_{H \in \mathbb{H}} H_\ell = \mathbb{H}^1_\ell\), which in general will be considerably larger than the capacity of any individual \(H \in \mathbb{H}\). So by learning \(H\) the learner has learnt to learn in the environment \((\mathcal{P}, Q)\), in the sense that it needs far smaller training sets to learn novel tasks.

4. Having learnt a hypothesis space \(H\) with a small value of \(\mathrm{er}_{\mathbf{z}}(H)\), Theorem 2 tells us that with probability at least \(1 - \delta\), the expected value of \(\inf_{h \in H} \mathrm{er}_P(h)\) on a novel task \(P\) will be less than \(\mathrm{er}_{\mathbf{z}}(H) + \varepsilon\). Of course, this does not rule out really bad performance on some tasks \(P\). However, the probability of generating such "bad" tasks can be bounded. In particular, note that \(\mathrm{er}_Q(H)\) is just the expected value of the function \(H^*\) over \(\mathcal{P}\), and so by Markov's inequality, for \(\gamma > 0\),
\[
\Pr\left\{P : \inf_{h \in H} \mathrm{er}_P(h) \ge \gamma\right\} = \Pr\left\{P : H^*(P) \ge \gamma\right\} \le \frac{\mathbb{E}_Q H^*}{\gamma} = \frac{\mathrm{er}_Q(H)}{\gamma} \le \frac{\mathrm{er}_{\mathbf{z}}(H) + \varepsilon}{\gamma}
\]
(with probability at least \(1 - \delta\)).

5. Keeping the accuracy and confidence parameters \(\varepsilon, \delta\) fixed, note that the number of examples required of each task for good generalisation obeys
\[
m = O\!\left(\frac{1}{n} \log \mathcal{C}(\varepsilon, \mathbb{H}^n_\ell)\right). \tag{25}
\]

So provided \(\log \mathcal{C}(\varepsilon, \mathbb{H}^n_\ell)\) increases sublinearly with \(n\), the upper bound on the number of examples required of each task will decrease as the number of tasks increases. This shows that for suitably constructed hypothesis space families it is possible to share information between tasks. This is discussed further after Theorem 4 below.


2.5 Choosing the Hypothesis Space Family \(\mathbb{H}\)

Theorem 2 only provides conditions under which \(\mathrm{er}_{\mathbf{z}}(H)\) and \(\mathrm{er}_Q(H)\) are close; it does not guarantee that \(\mathrm{er}_Q(H)\) is actually small. This is governed by the choice of \(\mathbb{H}\). If \(\mathbb{H}\) contains a hypothesis space \(H\) with a small value of \(\mathrm{er}_Q(H)\), and the learner is able to find an \(H \in \mathbb{H}\) minimizing error on the \((n, m)\)-sample \(\mathbf{z}\) (i.e., minimizing \(\mathrm{er}_{\mathbf{z}}(H)\)), then, for sufficiently large \(n\) and \(m\), Theorem 2 ensures that with high probability \(\mathrm{er}_Q(H)\) will be small. However, a bad choice of \(\mathbb{H}\) will mean there is no hope of finding an \(H\) with small error. In this sense the choice of \(\mathbb{H}\) represents the hyper-bias of the learner.

Note that from a sample complexity point of view, the optimal hypothesis space family to choose is one containing a single, minimal hypothesis space \(H\) that contains good solutions to all of the problems in the environment (or at least a set of problems with high \(Q\)-probability), and no more. For then there is no bias learning to do (because there is no choice to be made between hypothesis spaces), the output of the bias learning algorithm is guaranteed to be a good hypothesis space for the environment, and since the hypothesis space is minimal, learning any problem within the environment using \(H\) will require the smallest possible number of examples. However, this scenario is analogous to the trivial scenario in ordinary learning in which the learning algorithm contains a single, optimal hypothesis for the problem being learnt. In that case there is no learning to be done, just as there is no bias learning to be done if the correct hypothesis space is already known.

At the other extreme, if \(\mathbb{H}\) contains a single hypothesis space \(H\) consisting of all possible functions from \(X\) to \(Y\), then bias learning is impossible because the bias learner cannot produce a restricted hypothesis space as output, and hence cannot produce a hypothesis space with improved sample complexity requirements on as yet unseen tasks.

Focussing on these two extremes highlights the minimal requirements on \(\mathbb{H}\) for successful bias learning to occur: the hypothesis spaces \(H \in \mathbb{H}\) must be strictly smaller than the space of all functions \(X \to Y\), but not so small or so "skewed" that none of them contain good solutions to a large majority of the problems in the environment.

It may seem that we have simply replaced the problem of selecting the right bias (i.e., selecting the right hypothesis space \(H\)) with the equally difficult problem of selecting the right hyper-bias (i.e., the right hypothesis space family \(\mathbb{H}\)). However, in many cases selecting the right hyper-bias is far easier than selecting the right bias. For example, in Section 3 we will see how the feature selection problem may be viewed as a bias selection problem. Selecting the right features can be extremely difficult if one knows little about the environment, with intelligent trial-and-error typically the best one can do. However, in a bias learning scenario, one only has to specify that a set of features should exist, find a loosely parameterised set of features (for example, neural networks), and then learn the features by sampling from multiple related tasks.

2.6 Learning Multiple Tasks

It may be that the learner is not interested in learning to learn, but just wants to learn a fixed set of \(n\) tasks from the environment \((\mathcal{P}, Q)\). As in the previous section, we assume the learner starts out with a hypothesis space family \(\mathbb{H}\), and also that it receives an \((n, m)\)-sample \(\mathbf{z}\) generated from the \(n\) distributions \(P_1, \dots, P_n\). This time, however, the learner is simply looking for \(n\) hypotheses \((h_1, \dots, h_n)\), all contained in the same hypothesis space \(H\), such that the average generalization error of the \(n\) hypotheses is minimal. Denoting \((h_1, \dots, h_n)\) by \(\mathbf{h}\) and writing \(\mathbf{P} = (P_1, \dots, P_n)\),


this error is given by
\[
\mathrm{er}_{\mathbf{P}}(\mathbf{h}) := \frac{1}{n} \sum_{i=1}^{n} \mathrm{er}_{P_i}(h_i) = \frac{1}{n} \sum_{i=1}^{n} \int_{X \times Y} \ell(h_i(x), y)\, dP_i(x, y), \tag{26}
\]
and the empirical loss of \(\mathbf{h}\) on \(\mathbf{z}\) is
\[
\mathrm{er}_{\mathbf{z}}(\mathbf{h}) := \frac{1}{n} \sum_{i=1}^{n} \mathrm{er}_{z_i}(h_i) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m} \sum_{j=1}^{m} \ell(h_i(x_{ij}), y_{ij}). \tag{27}
\]
As before, regardless of how the learner chooses \((h_1, \dots, h_n)\), if we can prove a uniform bound on the probability of large deviation between \(\mathrm{er}_{\mathbf{z}}(\mathbf{h})\) and \(\mathrm{er}_{\mathbf{P}}(\mathbf{h})\), then any \((h_1, \dots, h_n)\) that performs well on the training sets \(\mathbf{z}\) will with high probability perform well on future examples of the same tasks.

Theorem 4. Let \(\mathbf{P} = (P_1, \dots, P_n)\) be \(n\) probability distributions on \(X \times Y\) and let \(\mathbf{z}\) be an \((n, m)\)-sample generated by sampling \(m\) times from \(X \times Y\) according to each \(P_i\). Let \(\mathbb{H} = \{H\}\) be any permissible hypothesis space family. If the number of examples \(m\) of each task satisfies
\[
m \ge \max\left(\frac{64}{n \varepsilon^2} \log\frac{4\, \mathcal{C}\!\left(\frac{\varepsilon}{16}, \mathbb{H}^n_\ell\right)}{\delta},\; \frac{16}{\varepsilon^2}\right), \tag{28}
\]
then with probability at least \(1 - \delta\) (over the choice of \(\mathbf{z}\)), every \(\mathbf{h} = (h_1, \dots, h_n)\) with all \(h_i\) chosen from a single \(H \in \mathbb{H}\) will satisfy
\[
\mathrm{er}_{\mathbf{P}}(\mathbf{h}) \le \mathrm{er}_{\mathbf{z}}(\mathbf{h}) + \varepsilon \tag{29}
\]
(recall Definition 4 for the meaning of \(\mathcal{C}(\varepsilon, \mathbb{H}^n_\ell)\)).

Proof. Omitted (follow the proof of the bound on \(m\) in Theorem 2).

The bound on \(m\) in Theorem 4 is virtually identical to the bound on \(m\) in Theorem 2, and note again that it depends inversely on the number of tasks \(n\) (assuming that the first part of the "max" expression is the dominant one). Whether this helps depends on the rate of growth of \(\mathcal{C}(\frac{\varepsilon}{16}, \mathbb{H}^n_\ell)\) as a function of \(n\). The following lemma shows that this growth is always small enough to ensure that we never do worse by learning multiple tasks (at least in terms of the upper bound on the number of examples required per task).

Lemma 5. For any hypothesis space family \(\mathbb{H}\),
\[
\mathcal{C}\!\left(\varepsilon, \mathbb{H}^1_\ell\right) \le \mathcal{C}\!\left(\varepsilon, \mathbb{H}^n_\ell\right) \le \mathcal{C}\!\left(\varepsilon, \mathbb{H}^1_\ell\right)^n. \tag{30}
\]


Proof. Let \(K\) denote the set of all functions \((h_1, \dots, h_n)_\ell\) where each \(h_i\) can be a member of any hypothesis space \(H \in \mathbb{H}\) (recall Definition 1). Then \(\mathbb{H}^n_\ell \subseteq K\) and so \(\mathcal{C}(\varepsilon, \mathbb{H}^n_\ell) \le \mathcal{C}(\varepsilon, K)\). By Lemma 29 in Appendix B, \(\mathcal{C}(\varepsilon, K) \le \mathcal{C}(\varepsilon, \mathbb{H}^1_\ell)^n\), and so the right-hand inequality follows.

For the first inequality, let \(P\) be any probability measure on \(X \times Y\) and let \(\mathbf{P}\) be the measure on \((X \times Y)^n\) obtained by using \(P\) on the first copy of \(X \times Y\) in the product, and ignoring all other elements of the product. Let \(N\) be an \(\varepsilon\)-cover for \((\mathbb{H}^n_\ell, d_{\mathbf{P}})\). Pick any \(h_\ell \in \mathbb{H}^1_\ell\) and let \((g_1, \dots, g_n)_\ell \in N\) be such that \(d_{\mathbf{P}}((h, h, \dots, h)_\ell, (g_1, \dots, g_n)_\ell) \le \varepsilon\). But by construction, \(d_{\mathbf{P}}((h, h, \dots, h)_\ell, (g_1, \dots, g_n)_\ell) = d_P(h_\ell, (g_1)_\ell)\), which establishes the first inequality.

By Lemma 5,
\[
\log \mathcal{C}\!\left(\varepsilon, \mathbb{H}^1_\ell\right) \le \log \mathcal{C}\!\left(\varepsilon, \mathbb{H}^n_\ell\right) \le n \log \mathcal{C}\!\left(\varepsilon, \mathbb{H}^1_\ell\right). \tag{31}
\]

So, keeping the accuracy parameters \(\varepsilon\) and \(\delta\) fixed and plugging (31) into (28), we see that the upper bound on the number of examples required of each task never increases with the number of tasks, and at best decreases as \(O(1/n)\). Although only an upper bound, this provides a strong hint that learning multiple related tasks should be advantageous on a "number of examples required per task" basis. In Section 3 it will be shown that for feature learning all types of behavior are possible, from no advantage at all to an \(O(1/n)\) decrease.
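Explicitly, the first term of the max in (28) satisfies, by the right-hand inequality of (31),
\[
\frac{64}{n \varepsilon^2} \log\frac{4\, \mathcal{C}\!\left(\frac{\varepsilon}{16}, \mathbb{H}^n_\ell\right)}{\delta}
\;\le\;
\frac{64}{\varepsilon^2} \log \mathcal{C}\!\left(\frac{\varepsilon}{16}, \mathbb{H}^1_\ell\right)
+ \frac{64}{n \varepsilon^2} \log\frac{4}{\delta},
\]
which is never larger than the corresponding single-task (\(n = 1\)) requirement, while the \(\delta\)-dependent part decays as \(1/n\). At the other extreme permitted by the left-hand inequality of (31), \(\log \mathcal{C}(\frac{\varepsilon}{16}, \mathbb{H}^n_\ell)\) may remain essentially constant in \(n\), in which case the whole term decays as \(O(1/n)\).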

2.7 Dependence on \(\varepsilon\)

In Theorems 2, 3 and 4 the bounds on sample complexity all scale as \(1/\varepsilon^2\). This behavior can be improved to \(1/\varepsilon\) if the empirical loss is always guaranteed to be zero (i.e., we are in the realizable case). The same behavior results if we are interested in the relative deviation between empirical and true loss, rather than the absolute deviation. Formal theorems along these lines are stated in Appendix A.3.

3. Feature Learning

The use of restricted feature sets is nearly ubiquitous as a method of encoding bias in many areas of machine learning and statistics, including classification, regression and density estimation.

In this section we show how the problem of choosing a set of features for an environment of related tasks can be recast as a bias learning problem. Explicit bounds on \(\mathcal{C}(\varepsilon, \mathbb{H}^*)\) and \(\mathcal{C}(\varepsilon, \mathbb{H}^n_\ell)\) are calculated for general feature classes in Section 3.2. These bounds are applied to the problem of learning a neural network feature set in Section 3.3.

3.1 The Feature Learning Model

Consider the following quote from Vapnik (1996):

The classical approach to estimating multidimensional functional dependencies is based on the following belief:

Real-life problems are such that there exists a small number of "strong features," simple functions of which (say linear combinations) approximate well the unknown function. Therefore, it is necessary to carefully choose a low-dimensional feature space and then to use regular statistical techniques to construct an approximation.


In general a set of "strong features" may be viewed as a function \(f: X \to V\) mapping the input space \(X\) into some (typically lower-dimensional) space \(V\). Let \(\mathcal{F} = \{f\}\) be a set of such feature maps (each \(f\) may be viewed as a set of features \((f_1, \dots, f_k)\) if \(V = \mathbb{R}^k\)). It is the \(f\) that must be "carefully chosen" in the above quote. In general, the "simple functions of the features" may be represented as a class of functions \(\mathcal{G}\) mapping \(V\) to \(Y\). If for each \(f \in \mathcal{F}\) we define the hypothesis space \(\mathcal{G} \circ f := \{g \circ f : g \in \mathcal{G}\}\), then we have the hypothesis space family
\[
\mathbb{H} := \{\mathcal{G} \circ f : f \in \mathcal{F}\}. \tag{32}
\]

Now the problem of "carefully choosing" the right features \(f\) is equivalent to the bias learning problem "find the right hypothesis space \(H \in \mathbb{H}\)". Hence, provided the learner is embedded within an environment of related tasks, and the capacities \(\mathcal{C}(\varepsilon, \mathbb{H}^*)\) and \(\mathcal{C}(\varepsilon, \mathbb{H}^n_\ell)\) are finite, Theorem 2 tells us that the feature set \(f\) can be learnt rather than carefully chosen. This represents an important simplification, as choosing a set of features is often the most difficult part of any machine learning problem.

In Section 3.2 we give a theorem bounding \(\mathcal{C}(\varepsilon, \mathbb{H}^*)\) and \(\mathcal{C}(\varepsilon, \mathbb{H}^n_\ell)\) for general feature classes. The theorem is specialized to neural network classes in Section 3.3.

Note that we have forced the function class \(\mathcal{G}\) to be the same for all feature maps \(f\), although this is not necessary. Indeed, variants of the results to follow can be obtained if \(\mathcal{G}\) is allowed to vary with \(f\).

3.2 Capacity Bounds for General Feature Classes

Notationally it is easier to view the feature maps \(f\) as mapping from \(X \times Y\) to \(V \times Y\) by \((x, y) \mapsto (f(x), y)\), and also to absorb the loss function \(\ell\) into the definition of \(\mathcal{G}\) by viewing each \(g \in \mathcal{G}\) as a map from \(V \times Y\) into \([0, 1]\) via \((v, y) \mapsto \ell(g(v), y)\). Previously this latter function would have been denoted \(g_\ell\), but in what follows we will drop the subscript \(\ell\) where this does not cause confusion. The class to which \(g_\ell\) belongs will still be denoted by \(\mathcal{G}_\ell\).

With the above definitions, let \(\mathcal{G}_\ell \circ \mathcal{F} := \{g \circ f : g \in \mathcal{G}_\ell, f \in \mathcal{F}\}\). Define the capacity of \(\mathcal{G}_\ell\) in the usual way,
\[
\mathcal{C}(\varepsilon, \mathcal{G}_\ell) := \sup_P \mathcal{N}(\varepsilon, \mathcal{G}_\ell, d_P),
\]
where the supremum is over all probability measures on \(V \times Y\), and \(d_P(g, g') := \int_{V \times Y} |g(v, y) - g'(v, y)|\, dP(v, y)\). To define the capacity of \(\mathcal{F}\) we first define a pseudo-metric \(d_{[P, \mathcal{G}_\ell]}\) on \(\mathcal{F}\) by "pulling back" the \(L_1\) metric on \(\mathbb{R}\) through \(\mathcal{G}_\ell\) as follows:
\[
d_{[P, \mathcal{G}_\ell]}(f, f') := \int_{X \times Y} \sup_{g \in \mathcal{G}_\ell} \bigl|g \circ f(x, y) - g \circ f'(x, y)\bigr|\, dP(x, y). \tag{33}
\]

It is easily verified that \(d_{[P, \mathcal{G}_\ell]}\) is a pseudo-metric. Note that for \(d_{[P, \mathcal{G}_\ell]}\) to be well defined, the supremum over \(\mathcal{G}_\ell\) in the integrand must be measurable. This is guaranteed if the hypothesis space family \(\mathbb{H} = \{\mathcal{G}_\ell \circ f : f \in \mathcal{F}\}\) is permissible (Lemma 32, part 4). Now define \(\mathcal{N}(\varepsilon, \mathcal{F}, d_{[P, \mathcal{G}_\ell]})\) to be the size of the smallest \(\varepsilon\)-cover of the pseudo-metric space \((\mathcal{F}, d_{[P, \mathcal{G}_\ell]})\), and the \(\varepsilon\)-capacity of \(\mathcal{F}\) (with respect to \(\mathcal{G}_\ell\)) as
\[
\mathcal{C}_{\mathcal{G}_\ell}(\varepsilon, \mathcal{F}) := \sup_P \mathcal{N}(\varepsilon, \mathcal{F}, d_{[P, \mathcal{G}_\ell]}),
\]
where the supremum is over all probability measures on \(X \times Y\). Now we can state the main theorem of this section.

166

Page 19: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

A M ODEL OF INDUCTIVE BIAS LEARNING

Theorem 6. Let H be a hypothesis space family as in equation(32). Then for all"; "1; "2 > 0 with" = "1 + "2, C ("; H nl ) � C ("1;Gl)n CGl ("2;F) (34)C ("; H �) � CGl (";F) (35)

Proof. See Appendix B.

3.3 Learning Neural Network Features

In general, a set of features may be viewed as a map from the (typically high-dimensional) inputspaceRd to a much smaller dimensional spaceRk (k � d). In this section we consider approximat-ing such a feature map by a one-hidden-layer neural network with d input nodes andk output nodes(Figure 1). We denote the set of all such feature maps byf�w = (�w;1; : : : ; �w;k) : w 2 Dg whereD is a bounded subset ofRW (W is the number of weights (parameters) in the first two layers).This set is theF of the previous section.

Each feature�w;i : Rd ! [0; 1℄, i = 1; : : : ; k is defined by�w;i(x) := �0� lXj=1 vijhj(x) + vil+11A (36)

wherehj(x) is the output of thejth node in the first hidden layer,(vi1; : : : ; vil+1) are the outputnode parameters for theith feature and� is a “sigmoid” squashing function� : R ! [0; 1℄. Eachfirst layer hidden nodehi : Rd ! R, i = 1; : : : ; l, computeshi(x) := �0� dXj=1 uijxj + uid+11A (37)

where(ui1; : : : ; uid+1) are the hidden node’s parameters. We assume� is Lipschitz.5 The weightvector for the entire feature map is thusw = (u11; : : : ; u1d+1; : : : ; ul1; : : : ; uld+1; v11; : : : ; v1l+1; : : : ; vk1; : : : ; vkl+1)and the total number of feature parametersW = l(d+ 1) + k(l + 1).

For argument’s sake, assume the “simple functions” of the features (the classG of the previoussection) are squashed affine maps using the same sigmoid function � above (in keeping with the“neural network” flavor of the features). Thus, each settingof the feature weightsw generates ahypothesis space:Hw := (� kXi=1 �i�w;i + �k+1! : (�1; : : : ; �k+1) 2 D0) ; (38)

whereD0 is a bounded subset ofRk+1 . The set of all such hypothesis spaces,H := fHw : w 2 Dg (39)

5. � is Lipschitz if there exists a constantK such thatj�(x)� �(x0)j � Kjx� x0j for all x; x0 2 R.

167

Page 20: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

BAXTER

k

n

l

d

FeatureMap

Input

Multiple Output Classes

Figure 1: Neural network for feature learning. The feature map is implemented by the first twohidden layers. Then output nodes correspond to then different tasks in the(n;m)-samplez. Each node in the network computes a squashed linear function of the nodes inthe previous layer.

is a hypothesis space family. The restrictions on the outputlayer weights(�1; : : : ; �k+1) and featureweightsw, and the restriction to a Lipschitz squashing function are needed to obtain finite upperbounds on the covering numbers in Theorem 2.

Finding a good set of features for the environment(P; Q) is equivalent to finding a good hy-pothesis spaceHw 2 H , which in turn means finding a good set of feature map parametersw.

As in Theorem 2, the correct set of features may be learnt by finding a hypothesis space withsmall error on a sufficiently large(n;m)-samplez. Specializing to squared loss, in the presentframework the empirical loss ofHw onz (equation (8)) is given byerz(Hw) = 1n nXi=1 inf(�0;�1;:::;�k)2D0 1m mXj=1 "� kXl=1 �l�w;l(xij) + �0!� yij#2 (40)

Since our sigmoid function� only has range[0; 1℄, we also restrict the outputsY to this range.

3.3.1 ALGORITHMS FORFINDING A GOOD SET OF FEATURES

Provided the squashing function� is differentiable, gradient descent (with a small variation onbackpropagation to compute the derivatives) can be used to find feature weightsw minimizing (40)(or at least a local minimum of (40)). The only extra difficulty over and above ordinary gradientdescent is the appearance of “inf” in the definition oferz(Hw). The solution is to perform gradientdescent over both the output parameters(�0; : : : ; �k) for each node and the feature weightsw. Formore details see Baxter (1995b) and Baxter (1995a, chapter 4), where empirical results supportingthe theoretical results presented here are also given.

168

Page 21: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

A M ODEL OF INDUCTIVE BIAS LEARNING

3.3.2 SAMPLE COMPLEXITY BOUNDS FORNEURAL-NETWORK FEATURE LEARNING

The size ofz ensuring that the resulting features will be good for learning novel tasks from the sameenvironment is given by Theorem 2. All we have to do is computethe logarithm of the coveringnumbersC("; H nl ) andC("; H �).Theorem 7. Let H = �Hw : w 2 RW be a hypothesis space family where eachHw is of the formHw := (� kXi=1 �i�w;i(�) + �0! : (�1; : : : ; �k) 2 Rk) ;where�w = (�w;1; : : : ; �w;k) is a neural network withW weights mapping fromRd to Rk . If thefeature weightsw and the output weights�0; �1; : : : ; �k are bounded, the squashing function� isLipschitz,l is squared loss, and the output spaceY = [0; 1℄ (any bounded subset ofR will do), thenthere exist constants�; �0 (independent of";W andk) such that for all" > 0,log C("; H nl ) � 2 ((k + 1)n+W ) log �" (41)log C("; H �) � 2W log �0" (42)

(recall that we have specialized to squared loss here).

Proof. See Appendix B.

Noting that our neural network hypothesis space familyH is permissible, plugging (41) and (42)into Theorem 2 gives the following theorem.

Theorem 8. Let H = fHwg be a hypothesis space family where each hypothesis spaceHw is aset of squashed linear maps composed with a neural network feature map, as above. Suppose thenumber of features isk, and the total number of feature weights is W. Assume all feature weights andoutput weights are bounded, and the squashing function� is Lipschitz. Letz be an(n;m)-samplegenerated from the environment(P; Q). Ifn � O� 1"2 �W log 1" + log 1�� ; (43)

and m � O� 1"2 ��k + 1 + Wn � log 1" + 1n log 1Æ �� (44)

then with probability at least1� Æ anyHw 2 H will satisfyerQ(Hw) � erz(Hw) + ": (45)

169

Page 22: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

BAXTER

3.3.3 DISCUSSION

1. Keeping the accuracy and confidence parameters" andÆ fixed, the upper bound on the numberof examples required of each task behaves likeO(k+W=n). If the learner is simply learningn fixed tasks (rather than learning to learn), then the same upper bound also applies (recallTheorem 4).

2. Note that if we do away with the feature map altogether thenW = 0 and the upper bound onm becomesO(k), independent ofn (apart from the less importantÆ term). So in terms of theupper bound, learningn tasks becomes just as hard as learning one task. At the other extreme,if we fix the output weights then effectivelyk = 0 and the number of examples required ofeach task decreases asO(W=n). Thus a range of behavior in the number of examples requiredof each task is possible: from no improvement at all to anO(1=n) decrease as the number oftasksn increases (recall the discussion at the end of Section 2.6).

3. Once the feature map is learnt (which can be achieved usingthe techniques outlined in Baxter,1995b; Baxter & Bartlett, 1998; Baxter, 1995a, chapter 4), only the output weights have to beestimated to learn a novel task. Again keeping the accuracy parameters fixed, this requires nomore thatO(k) examples. Thus, as the number of tasks learnt increases, theupper bound onthe number of examples required of each task decays to the minimum possible,O(k).

4. If the “small number of strong features” assumption is correct, thenk will be small. However,typically we will have very little idea of what the features are, so to be confident that the neuralnetwork is capable of implementing a good feature set it willneed to be very large, implyingW � k. O(k +W=n) decreases most rapidly with increasingn whenW � k, so at least interms of the upper bound on the number of examples required per task, learning small featuresets is an ideal application for bias learning. However, theupper bound on the number oftasks does not fare so well as it scales asO(W ).

3.3.4 COMPARISON WITH TRADITIONAL MULTIPLE-CLASS CLASSIFICATION

A special case of this multi-task framework is one in which the marginal distribution on the inputspacePijX is the same for each taski = 1; : : : ; n, and all that varies between tasks is the conditionaldistribution over the output spaceY . An example would be a multi-class problem such as facerecognition, in whichY = f1; : : : ; ng wheren is the number of faces to be recognized and themarginal distribution onX is simply the “natural” distribution over images of those faces. In thatcase, if for every examplexij we have—in addition to the sampleyij from theith task’s conditionaldistribution onY—samples from the remainingn � 1 conditional distributions onY , then we canview then training sets containingm examples each as one large training set for the multi-classproblem withmn examples altogether. The bound onm in Theorem 8 states thatmn should beO(nk +W ), or proportional to the total number of parameters in the network, a result we wouldexpect from6 (Haussler, 1992).

So when specialized to the traditional multiple-class, single task framework, Theorem 8 is con-sistent with the bounds already known. However, as we have already argued, problems such as facerecognition are not really single-task, multiple-class problems. They are more appropriately viewed

6. If each example can be classified with a “large margin” thennaive parameter counting can be improved upon (Bartlett,1998).

170

Page 23: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

A M ODEL OF INDUCTIVE BIAS LEARNING

as a (potentially infinite) collection of distinct binary classification problems. In that case, the goalof bias learning is not to find a singlen-output network that can classify some subset ofn faceswell. It is to learn a set of features that can reliably be usedas a fixed preprocessing for distinguish-ing any single face from other faces. This is the new thing provided by Theorem 8: it tells us thatprovided we have trained ourn-output neural network on sufficiently many examples ofsufficientlymany tasks, we can be confident that the common feature map learnt for thosen tasks will be goodfor learninganynew, as yet unseen task, provided the new task is drawn from the same distributionthat generated the training tasks. In addition, learning the new task only requires estimating thekoutput node parameters for that task, a vastly easier problem than estimating the parameters of theentire network, from both a sample and computational complexity perspective. Also, since we havehigh confidence that the learnt features will be good for learning novel tasks drawn from the sameenvironment, those features are themselves a candidate forfurther study to learn more about thenature of the environment. The same claim could not be made ifthe features had been learnt on toosmall a set of tasks to guarantee generalization to novel tasks, for then it is likely that the featureswould implement idiosyncrasies specific to those tasks, rather than “invariances” that apply acrossall tasks.

When viewed from a bias (or feature) learning perspective, rather than a traditionaln-classclassification perspective, the boundm on the number of examples required of each task takes ona somewhat different meaning. It tells us that providedn is large (i.e., we are collecting examplesof a large number tasks), then we really only need to collect afew more examples than we wouldotherwise have to collect if the feature map was already known (k+W=n examples vs.k examples).So it tells us that the burden imposed by feature learning canbe made negligibly small, at least whenviewed from the perspective of the sampling burden requiredof each task.

3.4 Learning Multiple Tasks with Boolean Feature Maps

Ignoring the accuracy and confidence parameters" and Æ, Theorem 8 shows that the number ofexamples required of each task when learningn tasks with a common neural-network feature mapis bounded above byO(k + W=n), wherek is the number of features andW is the number ofadjustable parameters in the feature map. SinceO(k) examples are required to learn a single taskonce the true features are known, this shows that the upper bound on the number of examplesrequired of each task decays (in order) to the minimum possible as the number of tasksn increases.This suggests that learning multiple tasks is advantageous, but to be truly convincing we need toprove a lower bound of the same form. Proving lower bounds in areal-valued setting (Y = R)is complicated by the fact that a single example can convey aninfinite amount of information, soone typically has to make extra assumptions, such as that thetargetsy 2 Y are corrupted by anoise process. Rather than concern ourselves with such complications, in this section we restrictour attention to Boolean hypothesis space families (meaning each hypothesish 2 H 1 maps toY = f�1g and we measure error by discrete lossl(h(x); y) = 1 if h(x) 6= y andl(h(x); y) = 0otherwise).

We show that the sample complexity for learningn tasks with a Boolean hypothesis space familyH is controlled by a “VC dimension” type parameterdH (n) (that is, we give nearly matching upperand lower bounds involvingdH (n)). We then derive bounds ondH (n) for the hypothesis spacefamily considered in the previous section with the Lipschitz sigmoid function� replaced by a hardthreshold (linear threshold networks).

171

Page 24: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

BAXTER

As well as the bound on the number of examples required per task for good generalization acrossthose tasks, Theorem 8 also shows that features performing well onO(W ) taskswill generalize wellto novel tasks, whereW is the number of parameters in the feature map. Given that formany featurelearning problemsW is likely to be quite large (recall Note 4 in Section 3.3.3), it would be usefulto know thatO(W ) tasks are in factnecessarywithout further restrictions on the environmentaldistributionsQ generating the tasks. Unfortunately, we have not yet been able to show such a lowerbound.

There is some empirical evidence suggesting that in practice the upper bound on the number oftasks may be very weak. For example, in Baxter and Bartlett (1998) we reported experiments inwhich a set of neural network features learnt on a subset of only 400 Japanese characters turned outto be good enough for classifying some 2600 unseen characters, even though the features containedseveral hundred thousand parameters. Similar results may be found in Intrator and Edelman (1996)and in the experiments reported in Thrun (1996) and Thrun andPratt (1997, chapter 8). Whilethis gap between experiment and theory may be just another example of the looseness inherent ingeneral bounds, it may also be that the analysis can be tightened. In particular, the bound on thenumber of tasks is insensitive to the size of the class of output functions (the classG in Section 3.1),which may be where the looseness has arisen.

3.4.1 UPPER ANDLOWER BOUNDS FORLEARNING n TASKS WITH BOOLEAN HYPOTHESIS

SPACE FAMILIES

First we recall some concepts from the theory of Boolean function learning. LetH be a class ofBoolean functions onX andx = (x1; : : : ; xm) 2 Xm. Hjx is the set of all binary vectors obtainableby applying functions inH to x:Hjx := f(h(x1); : : : ; h(xm)) : h 2 Hg:Clearly jHjxj � 2m. If jHjxj = 2m we sayH shattersx. Thegrowth functionof H is defined by�H(m) := maxx2Xm ��Hjx�� :TheVapnik-Chervonenkis dimensionVCdim(H) is the size of the largest set shattered byH:VCdim(H) := maxfm : �H(m) = 2mg:An important result in the theory of learning Boolean functions is Sauer’s Lemma (Sauer, 1972), ofwhich we will also make use.

Lemma 9 (Sauer’s Lemma). For a Boolean function classH withVCdim(H) = d,�H(m) � dXi=0 �mi � � �emd �d ;for all positive integersm.

We now generalize these concepts to learningn tasks with a Boolean hypothesis space family.

172

Page 25: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

A M ODEL OF INDUCTIVE BIAS LEARNING

Definition 5. Let H be a Boolean hypothesis space family. Denote then �m matrices over theinput spaceX byX(n;m). For eachx 2 X(n;m) andH 2 H , defineHjx to be the set of (binary)matrices, Hjx := 8><>:264 h1(x11) � � � h1(x1m)

..... .

...hn(xn1) � � � hn(xnm) 375 : h1; : : : ; hn 2 H9>=>; :Define H jx := [H2H Hjx:Now for eachn > 0;m > 0, define�H (n;m) by�H (n;m) := maxx2X(n;m) ��H jx�� :Note that�H (n;m) � 2nm. If

��H jx�� = 2nm we sayH shatters the matrixx. For eachn > 0 letdH (n) := maxfm : �H (n;m) = 2nmg:Define d(H ) : = VCdim(H 1) andd(H ) : = maxH2H VCdim(H):Lemma 10. d(H ) � d(H )dH (n) � max��d(H )n � ; d(H )� � 12 ��d(H )n �+ d(H )�Proof. The first inequality is trivial from the definitions. To get the second term in the maximumin the second inequality, choose anH 2 H with VCdim(H) = d(H ) and construct a matrixx 2 X(n;m) whose rows are of lengthd(H ) and are shattered byH. Then clearlyH shattersx. Forthe first term in the maximum take a sequencex = (x1; : : : ; xd(H )) shattered byH 1 (the hypothesisspace consisting of the union over all hypothesis spaces from H ), and distribute its elements equallyamong the rows ofx (throw away any leftovers). The set of matrices8><>:264 h(x11) � � � h(x1m)

.... . .

...h(xn1) � � � h(xnm) 375 : h 2 H 19>=>; :wherem = bd(H )=n is a subset ofH jx and has size2nm.

Lemma 11. �H (n;m) � � emdH (n)�ndH(n)173

Page 26: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

BAXTER

Proof. Observe that for eachn, �H (n;m) = �H(nm) whereH is the collection of all Booleanfunctions on sequencesx1; : : : ; xnm obtained by first choosingn functionsh1; : : : ; hn from someH 2 H , and then applyingh1 to the firstm examples,h2 to the secondm examples and so on. Bythe definition ofdH (n), VCdim(H) = ndH (n), hence the result follows from Lemma 9 applied toH.

If one follows the proof of Theorem 4 (in particular the proofof Theorem 18 in AppendixA) then it is clear that for all� > 0, C(H nl ; ") may be replaced by�H (n; 2m) in the Booleancase. Making this replacement in Theorem 18, and using the choices of�; � from the discussionfollowing Theorem 26, we obtain the following bound on the probability of large deviation betweenempirical and true performance in this Boolean setting.

Theorem 12. LetP = (P1; : : : ; Pn) ben probability distributions onX � f�1g and letz be an(n;m)-sample generated by samplingm times fromX�f�1g according to eachPi. LetH = fHgbe any permissible Boolean hypothesis space family. For all0 < � � 1,Pr fz : 9h 2 H n : erP(h) � erz(h) + "g � 4�H (n; 2m) exp(��2nm=64): (46)

Corollary 13. Under the conditions of Theorem 12, if the number of examplesm of each tasksatisfies m � 88"2 �2dH (n) log 22" + 1n log 4� (47)

then with probability at least1� Æ (over the choice ofz), anyh 2 H n will satisfyerP(h) � erz(h) + " (48)

Proof. Applying Theorem 12, we require4�H (n; 2m) exp(��2nm=64) � Æ;which is satisfied if m � 64�2 �dH (n) log 2emdH (n) + 1n log 4Æ � ; (49)

where we have used Lemma 11. Now, for alla � 1, ifm = �1 + 1e� a log�1 + 1e� a;thenm � a logm. So settinga = 64dH (n)="2, (49) is satisfied ifm � 88"2 �2dH (n) log 22" + 1n log 4Æ � :

174

Page 27: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

A M ODEL OF INDUCTIVE BIAS LEARNING

Corollary 13 shows that any algorithm learningn tasks using the hypothesis space familyHrequires no more than m = O� 1"2 �dH (n) log 1" + 1n log 1Æ �� (50)

examples of each task to ensure that with high probability the average true error of anyn hypothesesit selects fromH n is within " of their average empirical error on the samplez. We now give atheorem showing that if the learning algorithm is required to producen hypotheses whose averagetrue error is within" of thebest possible error(achievable usingH n) for an arbitrary sequence ofdistributionsP1; : : : ; Pn, then within alog 1" factor the number of examples in equation (50) is alsonecessary.

For any sequenceP = (P1; : : : ; Pn) of n probability distributions onX � f�1g, defineoptP(H n) by optP(H n) := infh2H n erP(h):Theorem 14. Let H be a Boolean hypothesis space family such thatH 1 contains at least twofunctions. For eachn = 1; 2; : : : ; letAn be any learning algorithm taking as input(n;m)-samplesz 2 (X � f�1g)(n;m) and producing as outputn hypothesesh = (h1; : : : ; hn) 2 H n. For all0 < " < 1=64 and0 < Æ < 1=64, ifm < 1"2 �dH (n)616 + (1� "2) 1n log� 18Æ(1 � 2Æ)��then there exist distributionsP = (P1; : : : ; Pn) such that with probability at leastÆ (over therandom choice ofz), erP(An(z)) > optP(H n) + "Proof. See Appendix C

3.4.2 LINEAR THRESHOLD NETWORKS

Theorems 13 and 14 show that within constants and alog(1=") factor, the sample complexity oflearningn tasks using the Boolean hypothesis space familyH is controlled by the complexity pa-rameterdH (n). In this section we derive bounds ondH (n) for hypothesis space families constructedas thresholded linear combinations of Boolean feature maps. Specifically, we assumeH is of theform given by (39), (38), (37) and (36), where now the squashing function� is replaced with a hardthreshold: �(x) := (1 if x � 0;�1 otherwise;and we don’t restrict the range of the feature and output layer weights. Note that in this case theproof of Theorem 8 does not carry through because the constants�; �0 in Theorem 7 depend on theLipschitz bound on�.

Theorem 15. LetH be a hypothesis space family of the form given in(39), (38), (37)and(36), witha hard threshold sigmoid function�. Recall that the parametersd, l andk are the input dimension,number of hidden nodes in the feature map and number of features (output nodes in the feature map)

175

Page 28: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

BAXTER

respectively. LetW := l(d + 1) + k(l + 1) (the number of adjustable parameters in the featuremap). Then, dH (n) � 2�Wn + k + 1� log2 (2e(k + l + 1)) :Proof. Recall that for eachw 2 RW , �w : Rd ! Rk denotes the feature map with parametersw.For eachx 2 X(n;m), let�wjx denote the matrix264 �w(x11) � � � �w(x1m)

.... . .

...�w(xn1) � � � �w(xnm) 375 :Note thatH jx is the set of all binaryn �m matrices obtainable by composing thresholded linearfunctions with the elements of�wjx, with the restriction that the same function must be appliedtoeach element in a row (but the functions may differ between rows). With a slight abuse of notation,define ��(n;m) := maxx2X(n;m) ����wjx : w 2 RW �� :Fix x 2 X(n;m). By Sauer’s Lemma, each node in the first hidden layer of the feature map computesat most(emn=(d+ 1))d+1 functions on thenm input vectors inx. Thus, there can be at most(emn=(d+ 1))l(d+1) distinct functions from the input to the output of the first hidden layer onthenm points inx. Fixing the first hidden layer parameters, each node in the second layer of thefeature map computes at most(emn=(l + 1))l+1 functions on the image ofx produced at the outputof the first hidden layer. Thus the second hidden layer computes no more than(emn=(l + 1))k(l+1)functions on the output of the first hidden layer on thenm points inx. So, in total,��(n;m) � � emnd+ 1�l(d+1)� emnl + 1�k(l+1) :Now, for each possible matrix�wjx, the number of functions computable on each row of�wjx by a

thresholded linear combination of the output of the featuremap is at most(em=(k + 1))k+1. Hence,the number of binary sign assignments obtainable by applying linear threshold functions to all therows is at most(em=(k + 1))n(k+1). Thus,�H (n;m) � � emnd+ 1�l(d+1) � emnl + 1�k(l+1)� emnn(k + 1)�n(k+1) :f(x) := x log x is a convex function, hence for alla; b; > 0,f �ka+ lb+ k + l + 1 � � 1k + l + 1 (kf(a) + lf(b) + f( ))) � k + l + 1ka+ lb+ �ka+lb+ � �1a�ka�1b�lb�1 � :Substitutinga = l + 1, b = d+ 1 and = n(k + 1) shows that�H (n;m) � �emn(k + l + 1)W + n(k + 1) �W+n(k+1) : (51)

176

Page 29: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

A M ODEL OF INDUCTIVE BIAS LEARNING

Hence, if m > �Wn + k + 1� log2�emn(k + l + 1)W + n(k + 1) � (52)

then�H (n;m) < 2nm and so by definitiondH (n) � m. For alla > 1, observe thatx > a log2 xif x = 2a log2 2a. Settingx = emn(k + l + 1)=(W + n(k + 1)) anda = e(k + l + 1) shows that(52) is satisfied ifm = 2(W=n+ k + 1) log2(2e(k + l + 1)).Theorem 16. Let H be as in Theorem 15 with the following extra restrictions:d � 3, l � k andk � d. Then dH (n) � 12 ��W2n�+ k + 1�Proof. We boundd(H ) andd(H ) and then apply Lemma 10. In the present settingH 1 contains allthree-layer linear-threshold networks withd input nodes,l hidden nodes in the first hidden layer,khidden nodes in the second hidden layer and one output node. From Theorem 13 in Bartlett (1993),we have VCdim(H 1) � dl + l(k � 1)2 + 1;which under the restrictions stated above is greater thanW=2. Henced(H ) �W=2.

As k � d andl � k we can choose a feature weight assignment so that the featuremap is theidentity onk components of the input vector and insensitive to the setting of the reminaingd � kcomponents. Hence we can generatek + 1 points inX whose image under the feature map isshattered by the linear threshold output node, and sod(H ) = k + 1.

Combining Theorem 15 with Corrolary 13 shows thatm � O� 1"2 ��Wn + k + 1� log 1" + 1n log 1Æ��examples of each task suffice when learningn tasks using a linear threshold hypothesis space family,while combining Theorem 16 with Theorem 14 shows that ifm � � 1"2 ��Wn + k + 1�+ 1n log 1Æ ��then any learning algorithm will fail on some set ofn tasks.

4. Conclusion

The problem of inductive bias is one that has broad significance in machine learning. In this paperwe have introduced a formal model of inductive bias learningthat applies when the learner is ableto sample from multiple related tasks. We proved that provided certain covering numbers computedfrom the set of all hypothesis spaces available to the bias learner are finite, any hypothesis spacethat contains good solutions to sufficiently many training tasks is likely to contain good solutions tonovel tasks drawn from the same environment.

In the specific case of learning a set of features, we showed that the number of examplesmrequired of each task in ann-task training set obeysm = O(k +W=n), wherek is the number of

177

Page 30: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

BAXTER

features andW is a measure of the complexity of the feature class. We showedthat this bound isessentially tight for Boolean feature maps constructed from linear threshold networks. In addition,we proved that the number of tasks required to ensure good performance from the features on noveltasks is no more thanO(W ). We also showed how a good set of features may be found by gradientdescent.

The model of this paper represents a first step towards a formal model of hierarchical approachesto learning. By modelling a learner’s uncertainty concerning its environment in probabilistic terms,we have shown how learning can occur simultaneously at both the base level—learn the tasks athand—and at the meta-level—learn bias that can be transferred to novel tasks. From a technicalperspective, it is the assumption that tasks are distributed probabilstically that allows the perfor-mance guarantees to be proved. From a practical perspective, there are many problem domains thatcan be viewed as probabilistically distributed sets of related tasks. For example, speech recognitionmay be decomposed along many different axes: words, speakers, accents, etc. Face recognitionrepresents a potentially infinite domain of related tasks. Medical diagnosis and prognosis problemsusing the same pathology tests are yet another example. All of these domains should benefit frombeing tackled with a bias learning approach.

Natural avenues for further enquiry include:� Alternative constructions for H . Although widely applicable, the specific example on featurelearning via gradient descent represents just one possibleway of generating and searchingthe hypothesis space familyH . It would be interesting to investigate alternative methods,including decision tree approaches, approaches from Inductive Logic Programming (Khanet al., 1998), and whether more general learning techniquessuch as boosting can be appliedin a bias learning setting.� Algorithms for automatically determining the hypothesis space family H . In our model thestructure ofH is fixed apriori and represents thehyper-biasof the bias learner. It wouldbe interesting to see to what extent this structure can also be learnt.� Algorithms for automatically determining task relatedness. In ordinary learning there is usu-ally little doubt whether an individualexamplebelongs to the same learning task or not.The analogous question in bias learning is whether an individual learning task belongs to agiven set of related tasks, which in contrast to ordinary learning, does not always have sucha clear-cut answer. For most of the examples we have discussed here, such as speech andface recognition, the task-relatedness is not in question,but in other cases such as medicalproblems it is not so clear. Grouping too large a subset of tasks together as related tasks couldclearly have a detrimental impact on bias-learning or multi-task learning, and there is empri-cal evidence to support this (Caruana, 1997). Thus, algorithms for automatically determiningtask-relatedness are a potentially useful avenue for further research. In this context, see Silverand Mercer (1996), Thrun and O’Sullivan (1996). Note that the question of task relatednessis clearly only meaningfulrelativeto a particular hypothesis space familyH (for example, allpossible collections of tasks are related ifH contains every possible hypothesis space).� Extended hierarchies. For an extension of our two-level approach to arbitrarily deep hierarchies,see Langford (1999). An interesting further question is to what extent the hierarchy canbe inferred from data. This is somewhat related to the question of automatic induction ofstructure in graphical models.

178

Page 31: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

A M ODEL OF INDUCTIVE BIAS LEARNING

Acknowledgements

This work was supported at various times by an Australian Postgraduate Award, a Shell Aus-tralia Postgraduate Fellowship, U.K Engineering and Physical Sciences Research Council grantsK70366 and K70373, and an Australian Postdoctoral Fellowship. Along the way, many peoplehave contributed helpful comments and suggestions for improvement including Martin Anthony,Peter Bartlett, Rich Caruana, John Langford, Stuart Russell, John Shawe-Taylor, Sebastian Thrunand several anonymous referees.

Appendix A. Uniform Convergence Results

Theorem 2 provides a bound (uniform over allH 2 H ) on the probability of large deviation betweenerQ(H) anderz(H). To obtain a more general result, we follow Haussler (1992) and introduce thefollowing parameterized class of metrics onR+ :d� [x; y℄ := jx� yjx+ y + � ;where� > 0. Our main theorem will be a uniform bound on the probability of large values ofd� [erQ(H); erz(H)℄, rather thanj erQ(H)� erz(H)j. Theorem 2 will then follow as a corollary, aswill better bounds for the realizable caseerz(H) = 0 (Appendix A.3).

Lemma 17. The following three properties ofd� are easily established:

1. For all r; s � 0, 0 � d� [r; s℄ � 12. For all 0 � r � s � t, d� [r; s℄ � d� [r; t℄ andd� [s; t℄ � d� [r; t℄.3. For 0 � r; s � 1, jr�sj�+2 � d� [r; s℄ � jr�sj�For ease of exposition we have up until now been dealing explicitly with hypothesis spacesH

containing functionsh : X ! Y , and then constructing loss functionshl mappingX � Y ! [0; 1℄byhl(x; y) := l(h(x); y) for some loss functionl : Y�Y ! [0; 1℄. However, in general we can viewhl just as a function from an abstract setZ (X � Y ) to [0; 1℄ and ignore its particular constructionin terms of the loss functionl. So for the remainder of this section, unless otherwise stated, allhypothesis spacesH will be sets of functions mappingZ to [0; 1℄. It will also be considerably moreconvenient to transpose our notation for(n;m)-samples, writing then training sets as columnsinstead of rows: z = z11 : : : z1n

.... . .

...zm1 : : : zmnwhere eachzij 2 Z. Recalling the definition of(X � Y )(n;m) (Equation 9 and prior discussion),with this transpositionz lives in(X � Y )(m;n). The following definition now generalizes quantitieslike erz(H), erP(H) and so on to this new setting.

Definition 6. Let H1; : : : ;Hn be n sets of functions mappingZ into [0; 1℄. For any h1 2H1; : : : ; hn 2 Hn, let h1 � � � � � hn or simplyh denote the maph(~z) = 1=n nXi=1 hi(zi)179

Page 32: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

BAXTER

for all ~z = (z1; : : : ; zn) 2 Zn. Let H1 � � � � � Hn denote the set of all such functions. Givenh 2 H1 � � � � � Hn andm elements of(X � Y )n, (~z1; : : : ; ~zm) (or equivalently an elementz of(X � Y )(m;n) by writing the~zi as rows), defineerz(h) := 1m mXi=1 h(~zi)(recall equation (8)). Similarly, for any product probability measureP = P1 � � � � � Pn on(X � Y )n, define erP(h) := ZZn h(~z) dP(~z)(recall equation (26)). Foranyh;h0 : (X � Y )n ! [0; 1℄ (not necessarily of the formh1�� � ��hn),define dP(h;h0) := ZZn jh(~z)� h0(~z)j dP(~z)(recall equation (17)). For any class of functionsH mapping(X � Y )n to [0; 1℄, defineC (";H) := supP N (";H; dP)where the supremum is over all product probability measureson (X � Y )n andN (";H; dP) is thesize of the smallest"-cover ofH underdP (recall Definition 4).

The following theorem is the main result from which the rest of the uniform convergence resultsin this paper are derived.

Theorem 18. LetH � H1� � � � �Hn be a permissible class of functions mapping(X � Y )n into[0; 1℄. Let z 2 (X � Y )(m;n) be generated bym � 2=(�2�) independent trials from(X � Y )naccording to some product probability measureP = P1 � � � � � Pn. For all � > 0, 0 < � < 1,Pr�z 2 (X � Y )(m;n) : supH d� [erz(h); erP(h)℄ > ��� 4C(��=8;H) exp(��2�nm=8): (53)

The following immediate corollary will also be of use later.

Corollary 19. Under the same conditions as Theorem 18, ifm � max( 8�2�n log 4C ���8 ;H�Æ ; 2�2�) ; (54)

then Pr�z 2 (X � Y )(m;n) : supH d� [erz(h); erP(h)℄ > �� � Æ (55)

A.1 Proof of Theorem 18

The proof is via a double symmetrization argument of the kindgiven in chapter 2 of Pollard (1984).I have also borrowed some ideas from the proof of Theorem 3 in Haussler (1992).

180

Page 33: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

A M ODEL OF INDUCTIVE BIAS LEARNING

A.1.1 FIRST SYMMETRIZATION

An extra piece of notation: for allz 2 (X � Y )(2m;n), let z(1) be the top half ofz andz(2) be thebottom half, viz: z(1) = z11 : : : z1n

.... . .

...zm1 : : : zmn z(2) = zm+1;1 : : : zm+1;n...

. .....z2m;1 : : : z2m;n

The following lemma is the first “symmetrization trick.” We relate the probability of large deviationbetween an empirical estimate of the loss and the true loss tothe probability of large deviationbetween two independent empirical estimates of the loss.

Lemma 20. Let H be a permissible set of functions from(X � Y )n into [0; 1℄ and letP be aprobability measure on(X � Y )n. For all � > 0; 0 < � < 1 andm � 2�2� ,Pr�z 2 (X � Y )(m;n) : supH d� [erz(h); erP (h)℄ > ��� 2Pr�z 2 Z(2m;n) : supH d� �erz(1)(h); erz(2)(h)� > �2� : (56)

Proof. Note first that permissibility ofH guarantees the measurability of suprema overH(Lemma 32 part 5). By the triangle inequality ford� , if d� �erz(1)(h); erP (h)� > � andd� �erz(2)(h); erP (h)� < �=2, thend� �erz(1)(h); erz(2)(h)� > �=2. Thus,Prnz 2 (X � Y )(2m;n) : 9h 2 H : d� �erz(1)(h); erz(2)(h)� > �2o� Prnz 2 (X � Y )(2m;n) : 9h 2 H : d� �erz(1)(h); erP (h)� > � andd� �erz(2)(h); erP (h)� < �=2o: (57)

By Chebyshev’s inequality, for any fixedh,Prnz 2 (X � Y )(m;n) : d� [erz(h); erP (h)℄ < �2o� Pr�z 2 (X � Y )(m;n) : jerz(h)� erP (h)j� < �2�� 1� erP (h)(1 � erP (h))m��2=4� 12asm � 2=(�2�) anderP (h) � 1. Substituting this last expression into the right hand sideof (57)gives the result.

181

Page 34: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

BAXTER

A.1.2 SECOND SYMMETRIZATION

The second symmetrization trick bounds the probability of large deviation between two empiricalestimates of the loss (i.e. the right hand side of (56)) by computing the probability of large devi-ation when elements are randomly permuted between the first and second sample. The followingdefinition introduces the appropriate permutation group for this purpose.

Definition 7. For all integersm;n � 1, let �(2m;n) denote the set of all permutations� of thesequence of pairs of integersf(1; 1); : : : ; (1; n); : : : ; (2m; 1); : : : ; (2m;n)g such that for alli, 1 �i � m, either�(i; j) = (m + i; j) and�(m + i; j) = (i; j) or �(i; j) = (i; j) and�(m + i; j) =(m+ i; j).

For anyz 2 (X � Y )(2m;n) and any� 2 �(2m;n), letz� := z�(1;1) : : : z�(1;n)...

. . ....z�(2m;1) : : : z�(2m;n):

Lemma 21. LetH = H1 � � � � � Hn be a permissible set of functions mapping(X � Y )n into[0; 1℄ (as in the statement of Theorem 18). Fixz 2 (X � Y )(2m;n) and letH := ff1; : : : ; fMg bean��=8-cover for(H; dz), wheredz(h;h0) := 12mP2mi=1 jh(~zi)�h0(~zi)j where the~zi are the rowsof z. Then,Pr�� 2 �(2m;n) : supH d� �erz�(1)(h); erz�(2)(h)� > �2�� MXi=1 Prn� 2 �(2m;n) : d� �erz�(1)(f i); erz�(2)(f i)� > �4o ; (58)

where each� 2 �(2m;n) is chosen uniformly at random.

Proof. Fix � 2 �(2m;n) and leth 2 H be such thatd� �erz�(1)(h); erz�(2)(h)� > �=2 (if there is

no suchh for any� we are already done). Choosef 2 H such thatdz(h; f) � ��=8. Without lossof generality we can assumef is of the formf = f1 � � � � � fn. Now,2� dz(h; f) = P2mi=1 ���Pnj=1 hj(zij)� fj(zij)����mn= P2mi=1 ���Pnj=1 hj(z�(i;j))� fj(z�(i;j))����mn� ���Pmi=1Pnj=1 �hj(z�(i;j))� fj(z�(i;j))�����mn+Pmi=1Pnj=1 �hj(z�(i;j)) + fj(z�(i;j))�+ ���P2mi=m+1Pnj=1 �hj(z�(i;j))� fj(z�(i;j))�����mn+P2mi=m+1Pnj=1 �hj(z�(i;j)) + fj(z�(i;j))�= d� �erz�(1)(h); erz�(1)(f)�+ d� �erz�(2)(h); erz�(2)(f)� :

182

Page 35: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

A M ODEL OF INDUCTIVE BIAS LEARNING

Hence, by the triangle inequality ford� ,2� dz(h; f)+d� �erz�(1)(f); erz�(2)(f)� � d� �erz�(1)(h); erz�(1)(f)�+ d� �erz�(2)(h); erz�(2)(f)�+ d� �erz�(1)(f); erz�(2)(f)�� d� �erz�(1)(h); erz�(2)(h)� : (59)

But 2�dz(h; f) � �=4 by construction andd� �erz�(1)(h); erz�(2)(h)� > �=2 by assumption, so(59) impliesd� �erz�(1)(f); erz�(2)(f)� > �=4. Thus,n� 2 �(2m;n) : 9h 2 H : d� �erz�(1)(h); erz�(2)(h)� > �2o� n� 2 �(2m;n) : 9f 2 H : d� �erz�(1)(f); erz�(2)(f)�) > �4o ;which gives (58).

Now we bound the probability of each term in the right hand side of (58).

Lemma 22. Let f : (X � Y )n ! [0; 1℄ be any function that can be written in the formf =f1 � � � � � fn. For anyz 2 (X � Y )(2m;n),Prn� 2 �(2m;n) : d� �erz�(1)(f); erz�(2)(f)� > �4o � 2 exp���2�mn8 � ; (60)

where each� 2 �(2m;n) is chosen uniformly at random.

Proof. For any� 2 �(2m;n),d� �erz�(1)(f); erz�(2)(f)� = ���Pmi=1Pnj=1 �fj(z�(i;j))� fj(z�(m+i;j))�����mn+P2mi=1Pnj=1 fj(zij) : (61)

To simplify the notation denotefj(zij) by �ij . For each pairij, 1 � i � m, 1 � j � n, letYij be an independent random variable such thatYij = �ij � �m+i;j with probability 1=2 andYij = �m+i;j � �ij with probability1=2. From (61),Prn� 2 �(2m;n) : d� �erz�(1)(f); erz�(2)(f)� > �4o= Pr8<:� 2 �(2m;n) : ������ mXi=1 nXj=1 �fj(z�(i;j))� fj(z�(m+i;j))������� > �4 0��mn+ 2mXi=1 nXj=1 �ij1A9=;= Pr8<:������ mXi=1 nXj=1 Yij������ �4 0��mn+ 2mXi=1 nXj=1 �ij1A9=;For zero-mean independent random variablesY1; : : : ; Yk with bounded rangesai � Yi � bi, Ho-effding’s inequality (Devroye, Gyorfi, & Lugosi, 1996) isPr(����� kXi=1 Yi����� � �) � 2 exp � 2�2Pki=1(bi � ai)2! :

183

Page 36: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

BAXTER

Noting that the range of eachYij is [�j�ij � �i+m;j)j; j�ij � �i+m;j)j℄, we havePr8<:������ mXi=1 nXj=1 Yij������ > �4 0��mn+ 2mXi=1 nXj=1 �ij1A9=; � 2 exp0B�� �2 h�mn+P2mi=1Pnj=1 �iji232Pmi=1Pnj=1(�ij � �i+m;j)21CALet =P2mi=1Pnj=1 �ij . As 0 � �ij � 1,

Pmi=1Pnj=1(�ij � �m+ij)2 � . Hence,2 exp0B�� �2 h�mn+P2mi=1Pnj=1 �iji232Pmi=1Pnj=1(�ij � �i+m;j)21CA � 2 exp���2(�mn+ )232 � :(�mn+ )2= is minimized by setting = �mn giving a value of4�mn. HencePrn� 2 �(2m;n) : d� �erz�(1)(f); erz�(2)(f)� > �4o � 2 exp���2�mn8 � ;as required.

A.1.3 PUTTING IT TOGETHER

For fixedz 2 (X � Y )(2m;n), Lemmas 21 and 22 give:Pr�� 2 �(2m;n) : supH d� �erz�(1)(h); erz�(2)(h)� > �2�� 2N (��=8;H; dz)) exp���2�mn8 � :Note thatdz is simplydP whereP = (P1; : : : ; Pn) and eachPi is the empirical distribution thatputs point mass1=m on eachzji; j = 1; : : : ; 2m (recall Definition 3). Hence,Pr�� 2 �(2m;n); z 2 (X � Y )(2m;n) : supH d� �erz�(1)(h); erz�(2)(h)� > �2�� 2C (��=8;H) exp���2�mn8 � :Now, for a random choice ofz, eachzij in z is independently (but not identically) distributed and�only ever swapszij andzi+m;j (so that� swaps azij drawn according toPj with another componentdrawn according to the same distribution). Thus we can integrate out with respect to the choice of� and writePr�z 2 (X � Y )(2m;n) : supH d� �erz(1)(h); erz(2)(h)� > �2�� 2C (��=8;H) exp���2�mn8 � :Applying Lemma 20 to this expression gives Theorem 18.

184

Page 37: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

A M ODEL OF INDUCTIVE BIAS LEARNING

A.2 Proof of Theorem 2

Another piece of notation is required for the proof. For any hypothesis spaceH and any probabilitymeasuresP = (P1; : : : ; Pn) onZ, leterP(H) := 1n nXi=1 infh2H erPi(h):Note that we have usederP(H) rather thanerP(H) to indicate thaterP(H) is another empiricalestimate oferQ(H).

With the (n;m)-sampling process, in addition to the samplez there is also generated a se-quence of probability measures,P = (P1; : : : ; Pn) although these are not supplied to the learner.This notion is used in the following Lemma, wherePrf(z;P) 2 (X � Y )(n;m) � Pn : Ag means“the probability of generating a sequence of measuresP from the environment(P; Q) and then an(n;m)-samplez according toP such that A holds”.

Lemma 23. IfPr�(z;P) 2 (X � Y )(n;m) �Pn : supH d� [erz(H); erP(H)℄ > �2� � Æ2 ; (62)

and Pr�P 2 Pn : supH d� [erP(H); erQ(H)℄ > �2� � Æ2 ; (63)

then Pr�z 2 (X � Y )(n;m) : supH d� [erz(H); erQ(H)℄ > �� � Æ:Proof. Follows directly from the triangle inequality ford� .

We treat the two inequalities in Lemma 23 separately.

A.2.1 INEQUALITY (62)

In the following Lemma we replace the supremum overH 2 H in inequality (62) with a supremumoverh 2 H n.

Lemma 24.Pr�(z;P) 2 (X � Y )(n;m) �Pn : supH d� [erz(H); erP(H)℄ > ��� Pr((z;P) 2 (X � Y )(n;m) �Pn : supH nl d� [erz(h); erP(h)℄ > �) (64)

185

Page 38: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

BAXTER

Proof. Suppose that(z;P) are such thatsupH d� [erz(H); erP(H)℄ > �. Let H satisfy this in-equality. Suppose first thaterz(H) � erP(H). By the definition oferz(H), for all " > 0 thereexistsh 2 Hn := H � � � � � H such thaterz(h) < erz(H) + ". Hence by property (3) of thed�metric, for all" > 0, there existsh 2 Hn such thatd� [erz(h); erz(H)℄ < ". Pick an arbitraryhsatisfying this inequality. By definition,erP(H) � erP(h), and soerz(H) � erP(H) � erP(h).As d� [erz(H); erP(H)℄ > � (by assumption), by the compatibility ofd� with the ordering on thereals,d� [erz(H); erP(h)℄ > � = �+ Æ, say. By the triangle inequality ford� ,d� [erz(h); erP(h)℄ + d� [erz(h); erz(H)℄ � d� [erz(H); erP(h)℄ = �+ Æ:Thusd� [erz(h); erP(h)℄ > � + Æ � " and for any" > 0 anh satisfying this inequality can befound. Choosing" = Æ shows that there existsh 2 Hn such thatd� [erz(h); erP(h)℄ > �.

If instead,erP(H) < erz(H), then an identical argument can be run with the role ofz andPinterchanged. Thus in both cases,supH d� [erz(H); erP(H)℄ > �) 9h 2 H nl : d� [erz(h); erP(h)℄ > �;which completes the proof of the Lemma.

By the nature of the(n;m) sampling process,Pr((z;P) 2 (X � Y )(n;m) �Pn supH nl : d� [erz(h); erP(h)℄ > �)= ZP2PnPr(z 2 (X � Y )(n;m) : supH nl d� [erz(h); erP(h)℄ > �) dQn(P): (65)

Now H nl � K�� � ��K whereK := fhl : h 2 H : H 2 H g andH nl is permissible by the assumedpermissibility ofH (Lemma 32, Appendix D). HenceH nl satisfies the conditions of Corollary 19and so combining Lemma 24, Equation (65) and substituting�=2 for � andÆ=2 for Æ in Corollary19 gives the following Lemma on the sample size required to ensure (62) holds.

Lemma 25. If m � max� 32�2�n log 8C(��=16; H nl )Æ ; 8�2��then Pr�(z;P) 2 (X � Y )(n;m) �Pn : supH d� [erz(H); erP(H)℄ > �2� � Æ2 :A.2.2 INEQUALITY (63)

Note thaterP(H) = 1nPni=1H�(Pi) anderQ(H) = EP�QH�(P ), i.e the expectation ofH�(P )whereP is distributed according toQ. So to bound the left-hand-side of (63) we can apply Corollary19 withn = 1, m replaced byn,H replaced byH �, � andÆ replaced by�=2 andÆ=2 respectively,P replaced byQ andZ replaced byP. Note thatH � is permissible wheneverH is (Lemma 32).Thus, if n � max� 32�2� log 8C(��=16; H �)Æ ; 8�2�� (66)

186

Page 39: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

A M ODEL OF INDUCTIVE BIAS LEARNING

then inequality (63) is satisfied.Now, putting together Lemma 23, Lemma 25 and Equation 66, we have proved the following

more general version of Theorem 2.

Theorem 26. Let H be a permissible hypothesis space family and letz be an(n;m)-sample gen-erated from the environment(P; Q). For all 0 < �; Æ < 1 and� > 0, ifn � max� 32�2� log 8C(��=16; H �)Æ ; 8�2��

and m � max� 32�2�n log 8C(��=16; H nl )Æ ; 8�2�� ;then Pr�z 2 (X � Y )(n;m) : supH d� [erz(H); erQ(H)℄ > �� � Æ

To get Theorem 2, observe thaterQ(H) > erz(H) + " ) d� [erz(H); erQ(H)℄ > "=(2 + �).Setting� = "=(2 + �) and maximizing�2� gives� = 2. Substituting� = "=4 and� = 2 intoTheorem 26 gives Theorem 2.

A.3 The Realizable Case

In Theorem 2 the sample complexity for bothm andn scales as1="2. This can be improved to1=" if instead of requiringerQ(H) � erz(H) + ", we require only thaterQ(H) � �erz(H) + "for some� > 1. To see this, observe thaterQ(H) > erz(H)(1 + �)=(1 � �) + ��=(1 � �) )d� [erz(H); erQ(H)℄ > �, so setting��=(1 � �) = " in Theorem 26 and treating� as a constantgives:

Corollary 27. Under the same conditions as Theorem 26, for all" > 0 and0 < �; Æ < 1, ifn � max� 32�(1� �)" log 8C((1 � �)"=16; H �)Æ ; 8�(1 � �)"�and m � max� 32�(1� �)"n log 8C((1 � �)"=16; H nl )Æ ; 8�(1� �)"� ;

then Pr�z 2 (X � Y )(n;m) : supH erQ(H) � 1 + �1� � erz(H) + "� � Æ:These bounds are particularly useful if we know thaterz(H) = 0, for then we can set� = 1=2

(which maximizes�(1� �)).Appendix B. Proof of Theorem 6

Recalling Definition 6, forH of the form given in (32),H nl can be writtenH nl = fg1 Æ f � � � � � gn Æ f : g1; : : : ; gn 2 Gl andf 2 Fg :To write H nl as a composition of two function classes note that if for eachf : X ! V we define�f : (X � Y )n ! (V � Y )n by�f(x1; y1; : : : ; xn; yn) := (f(x1); y1; : : : ; f(xn); yn)

187

Page 40: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

BAXTER

theng1 Æ f�� � ��gn Æ f = g1 � � � � � gn Æ �f . Thus, settingGnl := Gl�� � ��Gl andF := f �f : f 2Fg, H nl = Gnl Æ F : (67)

The following two Lemmas will enable us to boundC ("; H nl ).Lemma 28. LetH : X � Y ! [0; 1℄ be of the formH = Gl Æ F whereX � Y F�! V � Y Gl�![0; 1℄. For all "1; "2 > 0, C("1 + "2;H) � CGl("1;F) C("2;Gl):Proof. Fix a measureP on X � Y and letF be a minimum size"1-cover for (F ; d[P;Gl℄). Bydefinition jF j � CGl("1;F). For eachf 2 F let Pf be the measure onV � Y defined byPf (S) =P (f�1(S)) for any setS in the�-algebra onV � Y (f is measurable sof�1(S) is measurable).Let Gf be a minimum size"2-cover for (Gl; dPf ). By definition again,jGf j � C("2;Gl). LetN := fg Æ f : f 2 F andg 2 Gfg. Note thatjN j � CGl("1;F)C("2;Gl) so the Lemma will beproved ifN can be shown to be an"1 + "2-cover for(H; dP ). So, given anyg Æ f 2 H choosef 0 2 F such thatd[P;Gl℄(f; f 0) � "1 andg0 2 Gf 0 such thatdPf 0 (g; g0) � "2. Now,dP (g Æ f; g0 Æ f 0) � dP (g Æ f; g Æ f 0) + dP (g Æ f 0; g0 Æ f 0)� d[P;Gl℄(f; f 0) + dPf 0 (g; g0)� "1 + "2:where the first line follows from the triangle inequality fordP and the second line follows fromthe facts: dP (g Æ f 0; g0 Æ f 0) = dPf (g; g0) and dP (g Æ f; g Æ f 0) � d[P;Gl℄(f; f 0). ThusN is an"1 + "2-cover for(H; dP ) and so the result follows.

Recalling the definition ofH1 � � � � � Hn (Definition 6), we have the following Lemma.

Lemma 29. C(";H1 � � � � � Hn) � nYi=1 C(";Hi)Proof. Fix a product probability measureP = P1 � � � � � Pn on (X � Y )n. Let N1; : : : ; Nn be"-covers of(H1; dP1) : : : ; (Hn; dPn). and letN = N1 � � � � � Nn. Givenh = h1 � � � � � hn 2H1 � � � � �Hn, chooseg1 � � � � � gn 2 N such thatdPi(hi; gi) � " for eachi = 1; : : : ; n. Now,dP(h1 � � � � � hn; g1 � � � � � gn) = 1n ZZn ����� nXi=1 hi(zi)� nXi=1 gi(zi)����� dP(z1; : : : ; zn)� 1n nXi=1 dPi(hi; gi)� ":ThusN is an"-cover forH1 � � � � � Hn and asjN j =Qni=1 jNij the result follows.

188

Page 41: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

A M ODEL OF INDUCTIVE BIAS LEARNING

B.1 BoundingC ("; H nl )From Lemma 28, C �"1 + "2;Gln Æ F� � C ("1;Gln) CGln �"2;F� (68)

and from Lemma 29, C ("1;Gln) � C ("1;Gl)n : (69)

Using similar techniques to those used to prove Lemmas 28 and29, CGln(";F ) can be shown tosatisfy CGln("2;F) � CGl ("2;F) : (70)

Equations (67), (68), (69) and (70) together imply inequality (34).

B.2 BoundingC ("; H �)We wish to prove thatC ("; H �) � CGl (";F) when H is a hypothesis space family of the formH = fGl Æ f : f 2 Fg. Note that eachH� 2 H � corresponds to someGl Æ f , and thatH�(P ) = infg2Gl erP (g Æ f):Any probability measureQ onP induces a probability measureQX�Y onX � Y , defined byQX�Y (S) = ZP P (S) dQ(P )for anyS in the�-algebra onX � Y . Note also that ifh; h0 are bounded, positive functions on anarbitrary setA, then ���� infa2A h(a)� infa2A h0(a)���� � supa2A ��h(a)� h0(a)�� : (71)

Let Q be any probability measure on the spaceP of probability measures onX � Y . LetH�1;H�2be two elements ofH � with corresponding hypothesis spacesGl Æ f1;Gl Æ f2. Then,dQ(H�1;H�2) = ZP ���� infg2Gl erP (g Æ f1)� infg2Gl erP (g Æ f2)���� dQ(P )� ZP supg2Gl jerP (g Æ f1)� erP (g Æ f2)j dQ(P ) (by (71) above)� ZP ZX�Y supg2Gl jg Æ f1(x; y)� g Æ f2(x; y)j dP (x; y) dQ(P )= d[QX�Y ;Gl℄(f1; f2):The measurability ofsupGl g Æ f is guaranteed by the permissibility ofH (Lemma 32 part 4, Ap-pendix D). FromdQ(H�1;H�2) � d[QX�Y ;Gl℄(f1; f2) we have,N ("; H �; dQ) � N �";F ; d[QX�Y ;Gl℄� ; (72)

which gives inequality (35).

189

Page 42: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

BAXTER

B.3 Proof of Theorem 7

In order to prove the bounds in Theorem 7 we have to apply Theorem 6 to the neural networkhypothesis space family of equation (39). In this case the structure isRd F�! Rk G�! [0; 1℄whereG = f(x1; : : : ; xk) 7! � �Pki=1 �ixi + �0� : (�0; �1; : : : ; �k) 2 Ug for some bounded

subsetU of Rk+1 and some Lipschitz squashing function�. The feature classF : Rd ! Rkis the set of all one hidden layer neural networks withd inputs, l hidden nodes,k outputs,� asthe squashing function and weightsw 2 T whereT is a bounded subset ofRW . The Lipschitzrestriction on� and the bounded restrictions on the weights ensure thatF andG are Lipschitzclasses. Hence there existsb < 1 such that for allf 2 F andx; x0 2 Rd , kf(x) � f(x0)k <bkx� x0k and for allg 2 G andx; x0 2 Rk , jg(x)� g(x0)j < bkx� x0k wherek � k is theL1 normin each case. The loss function is squared loss.

Now, gl(x; y) = l(g(x); y) = (g(x) � y)2, hence for allg; g0 2 G and all probability measuresP onRk � [0; 1℄ (recall that we assumed the output spaceY was[0; 1℄),dP (gl; g0l) = ZRk�[0;1℄ ��(g(v) � y)2 � (g0(v)� y)2�� dP (v; y)� 2ZRk ��g(v) � g0(v)�� dPRk(v); (73)

wherePRk is the marginal distribution onRk derived fromP . Similarly, for all f; f 0 2 F andprobability measuresP onRd � [0; 1℄,d[P;Gl℄(f; f 0) � 2bZRd kf(x)� f 0(x)k dPRd(x): (74)

Define C �";G; L1� := supP N �";G; L1(P )� ;where the supremum is over all probability measures on (the Borel subsets of)Rk , andN �";G; L1(P )� is the size of the smallest"-cover ofG under theL1(P ) metric. Similarly set,C �";F ; L1� := supP N �";F ; L1(P )� ;where now the supremum is over all probability measures onRd . Equations (73) and (74) implyC(";Gl) � C �"2 ;G; L1� (75)CGl (";F) � C � "2b ;F ; L1� (76)

Applying Theorem 11 from Haussler (1992), we findC �"2 ;Gl; L1� � �2eb" �2k+2C � "2b ;F ; L1� � �2eb2" �2W :Substituting these two expressions into (75) and (76) and applying Theorem 6 yields Theorem7.

190

Page 43: A Model of Inductive Bias Learning · of examples required to learn a task if the “true” set of featu res (that is, the correct bias) is already known, and n is the number of

A M ODEL OF INDUCTIVE BIAS LEARNING

Appendix C. Proof of Theorem 14

This proof follows a similar argument to the one presented inAnthony and Bartlett (1999) forordinary Boolean function learning.

First we need a technical Lemma.

Lemma 30. Let � be a random variable uniformly distributed onf1=2 + �=2; 1=2 � �=2g, with0 < � < 1. Let �1; : : : ; �m be i.i.d.f1;�1g-valued random variables withPr(�i = 1) = � for alli. For any functionf mappingf1;�1gn ! f1=2 + �=2; 1=2 � �=2g,Pr f�1; : : : ; �m : f(�1; : : : ; �m) 6= �g > 14 "1�q1� e� m�21��2 # :Proof. LetN(�) denote the number of occurences of+1 in the random sequence� = (�1; : : : ; �m).The functionf can be viewed as a decision rule, i.e. based on the observations �, f tries to guesswhether the probability of+1 is 1=2 + �=2 or 1=2 � �=2. The optimal decision rule is the Bayesestimator:f(�1; : : : ; �m) = 1=2+�=2 if N(�) � m=2, andf(�1; : : : ; �m) = 1=2��=2 otherwise.Hence, Pr (f(�) 6= �) � 12 Pr�N(�) � m2 ����� = 12 � �2�+ 12 Pr�N(�) < m2 ����� = 12 + �2�> 12 Pr�N(�) � m2 ����� = 12 � �2�which is half the probability that a binomial(m; 1=2 � �=2) random variable is at leastm=2. BySlud’s inequality (Slud, 1977),Pr (f(�) 6= �) > 12 Pr Z �s m�21� �2!whereZ is normal(0; 1). Tate’s inequality (Tate, 1953) states that for allx � 0,Pr (Z � x) � 12 h1�p1� e�x2i :Combining the last two inequalities completes the proof.

Let x 2 X(n;m) be shattered byH , with m = dH (n). For each rowi in x let Pi be the set ofall 2d distributionsP onX � f�1g such thatP (x; 1) = P (x; 0) = 0 if x is not contained in theith row of x, and for eachj = 1; : : : ; dH (n), P (xij ; 1) = (1 � �)=(2dH (n)) andP (xij ;�1) =(1� �)=(2dH (n)). LetP := P1 � � � � � Pn.

Note that for $P = (P_1,\ldots,P_n) \in \mathcal{P}$, the optimal error $\mathrm{opt}_P(\mathbb{H}^n)$ is achieved by any sequence $h^* = (h^*_1,\ldots,h^*_n)$ such that $h^*_i(x_{ij}) = 1$ if and only if $P_i(x_{ij},1) = (1+\alpha)/(2d_{\mathbb{H}}(n))$, and $\mathbb{H}^n$ always contains such a sequence because $\mathbb{H}$ shatters $x$. The optimal error is then
$$\mathrm{opt}_P(\mathbb{H}^n) = \mathrm{er}_P(h^*) = \frac{1}{n}\sum_{i=1}^n P_i\{h^*_i(x) \ne y\} = \frac{1}{n}\sum_{i=1}^n \sum_{j=1}^{d_{\mathbb{H}}(n)} \frac{1-\alpha}{2d_{\mathbb{H}}(n)} = \frac{1-\alpha}{2},$$


and for any $h = (h_1,\ldots,h_n) \in \mathbb{H}^n$,
$$\mathrm{er}_P(h) = \mathrm{opt}_P(\mathbb{H}^n) + \frac{\alpha}{n\,d_{\mathbb{H}}(n)}\left|\{(i,j) : h_i(x_{ij}) \ne h^*_i(x_{ij})\}\right|. \tag{77}$$
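Equation (77) can be checked mechanically on a toy instance of the construction (hypothetical sizes $n$, $d$ and gap $\alpha$; labels are encoded as 0/1 since only agreement with $h^*$ matters):

```python
# Toy check of (77) with hypothetical n, d = d_H(n) and gap alpha.
import numpy as np

rng = np.random.default_rng(3)
n, d, alpha = 4, 6, 0.2

hstar = rng.integers(0, 2, (n, d))             # optimal labels on the points x_ij
h = rng.integers(0, 2, (n, d))                 # an arbitrary competitor in H^n

def task_error(hi, hstari):
    # Each x_ij carries mass (1+alpha)/(2d) on the h* label and (1-alpha)/(2d)
    # on the other label, so an agreeing point errs with the smaller mass.
    agree = hi == hstari
    return np.where(agree, (1 - alpha) / (2 * d), (1 + alpha) / (2 * d)).sum()

er = np.mean([task_error(h[i], hstar[i]) for i in range(n)])
opt = (1 - alpha) / 2
disagreements = int(np.sum(h != hstar))
print(np.isclose(er, opt + alpha / (n * d) * disagreements))   # True
```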

For any $(n,m)$-sample $z$, let each element $m_{ij}$ in the array
$$m(z) := \begin{pmatrix} m_{11} & \cdots & m_{1d_{\mathbb{H}}(n)} \\ \vdots & \ddots & \vdots \\ m_{n1} & \cdots & m_{nd_{\mathbb{H}}(n)} \end{pmatrix}$$
equal the number of occurrences of $x_{ij}$ in $z$.

Now, if we select $P = (P_1,\ldots,P_n)$ uniformly at random from $\mathcal{P}$, and generate an $(n,m)$-sample $z$ using $P$, then for $h = A_n(z)$ (the output of the learning algorithm) we have:
$$E\left(\left|\{(i,j) : h_i(x_{ij}) \ne h^*_i(x_{ij})\}\right|\right) = \sum_{m} P(m)\, E\left(\left|\{(i,j) : h_i(x_{ij}) \ne h^*_i(x_{ij})\}\right| \,\big|\, m\right) = \sum_{m} P(m) \sum_{i=1}^n \sum_{j=1}^{d_{\mathbb{H}}(n)} P\left(h_i(x_{ij}) \ne h^*_i(x_{ij}) \,\big|\, m_{ij}\right),$$
where $P(m)$ is the probability of generating a configuration $m$ of the $x_{ij}$ under the $(n,m)$-sampling process and the sum is over all possible configurations. From Lemma 30,
$$P\left(h_i(x_{ij}) \ne h^*_i(x_{ij}) \,\big|\, m_{ij}\right) > \frac{1}{4}\left[1 - \sqrt{1 - e^{-\frac{m_{ij}\alpha^2}{1-\alpha^2}}}\right],$$
hence
$$E\left(\frac{1}{n\,d_{\mathbb{H}}(n)}\left|\{(i,j) : h_i(x_{ij}) \ne h^*_i(x_{ij})\}\right|\right) > \frac{1}{n\,d_{\mathbb{H}}(n)}\sum_{m} P(m) \sum_{i=1}^n \sum_{j=1}^{d_{\mathbb{H}}(n)} \frac{1}{4}\left[1 - \sqrt{1 - e^{-\frac{m_{ij}\alpha^2}{1-\alpha^2}}}\right] \ge \frac{1}{4}\left[1 - \sqrt{1 - e^{-\frac{m\alpha^2}{d_{\mathbb{H}}(n)(1-\alpha^2)}}}\right] \tag{78}$$

by Jensen's inequality. Since for any $[0,1]$-valued random variable $Z$, $\Pr(Z > x) \ge EZ - x$, (78) implies:
$$\Pr\left\{\frac{1}{n\,d_{\mathbb{H}}(n)}\left|\{(i,j) : h_i(x_{ij}) \ne h^*_i(x_{ij})\}\right| > \gamma\Gamma\right\} > (1-\gamma)\Gamma,$$
where
$$\Gamma := \frac{1}{4}\left[1 - \sqrt{1 - e^{-\frac{m\alpha^2}{d_{\mathbb{H}}(n)(1-\alpha^2)}}}\right] \tag{79}$$
and $\gamma \in [0,1]$.
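The Jensen step in (78) uses the convexity of $t \mapsto 1-\sqrt{1-e^{-ct}}$ with $c = \alpha^2/(1-\alpha^2)$. A small numerical illustration (random occupancy counts $m_{ij}$ and an illustrative $\alpha$):

```python
# Convexity behind the Jensen step in (78): for phi(t) = 1 - sqrt(1 - exp(-c t)),
# the average of phi over the occupancy counts m_ij dominates phi at their mean.
import numpy as np

rng = np.random.default_rng(4)
alpha = 0.2
c = alpha ** 2 / (1 - alpha ** 2)
m_ij = rng.multinomial(50, np.ones(10) / 10)       # 50 samples spread over 10 points

def phi(t):
    return 1 - np.sqrt(1 - np.exp(-c * t))

print(bool(phi(m_ij).mean() >= phi(m_ij.mean())))  # True, by convexity of phi
```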

Plugging this into (77) shows that
$$\Pr\left\{(P,z) : \mathrm{er}_P(A_n(z)) > \mathrm{opt}_P(\mathbb{H}^n) + \alpha\gamma\Gamma\right\} > (1-\gamma)\Gamma.$$


Since the inequality holds over the random choice of $P$, it must also hold for some specific choice of $P$. Hence for any learning algorithm $A_n$ there is some sequence of distributions $P$ such that
$$\Pr\left\{z : \mathrm{er}_P(A_n(z)) > \mathrm{opt}_P(\mathbb{H}^n) + \alpha\gamma\Gamma\right\} > (1-\gamma)\Gamma.$$
Setting
$$(1-\gamma)\Gamma \ge \delta \quad\text{and}\quad \alpha\gamma\Gamma \ge \varepsilon \tag{80}$$

ensures
$$\Pr\left\{z : \mathrm{er}_P(A_n(z)) > \mathrm{opt}_P(\mathbb{H}^n) + \varepsilon\right\} > \delta. \tag{81}$$

Assuming equality in (80), we get
$$\Gamma = \frac{\delta}{1-\gamma}, \qquad \alpha = \frac{\varepsilon(1-\gamma)}{\gamma\delta}.$$
Solving (79) for $m$, and substituting the above expressions for $\alpha$ and $\Gamma$, shows that (81) is satisfied provided
$$m \le d_{\mathbb{H}}(n)\left[\left(\frac{\gamma\delta}{\varepsilon(1-\gamma)}\right)^2 - 1\right]\log\frac{(1-\gamma)^2}{8\delta(1-\gamma-2\delta)}. \tag{82}$$
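As a sanity check on (82) (which, at equality in (80), is just (79) solved for $m$), the following snippet recovers $\Gamma = \delta/(1-\gamma)$ from the value of $m$ given by (82); the particular $\varepsilon$, $\delta$, $\gamma$ and $d_{\mathbb{H}}(n)$ are arbitrary illustrative values.

```python
# Sanity check of (82): with Gamma = delta/(1-gamma) and
# alpha = eps*(1-gamma)/(gamma*delta), the m defined by (82) plugged back into
# (79) recovers Gamma.  eps, delta, gamma, d are arbitrary.
import math

eps, delta, gamma, d = 0.05, 0.01, 0.9, 30
Gamma = delta / (1 - gamma)
alpha = eps * (1 - gamma) / (gamma * delta)

m = d * ((gamma * delta / (eps * (1 - gamma))) ** 2 - 1) * math.log(
    (1 - gamma) ** 2 / (8 * delta * (1 - gamma - 2 * delta)))

Gamma_check = 0.25 * (1 - math.sqrt(
    1 - math.exp(-m * alpha ** 2 / (d * (1 - alpha ** 2)))))
print(f"Gamma = {Gamma:.6f}, recovered from (79): {Gamma_check:.6f}")
```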

Setting $\gamma = 1 - a\delta$ for some $a > 4$ ($a > 4$ since $\Gamma < 1/4$ and $\Gamma = \delta/(1-\gamma)$), and assuming $\varepsilon, \delta \le 1/(ka)$ for some $k > 2$, (82) becomes
$$m \le \frac{d_{\mathbb{H}}(n)}{a^2\varepsilon^2}\left(1 - \frac{2}{k}\right)\log\frac{a^2}{8(a-2)}. \tag{83}$$

Subject to the constraint $a > 4$, the right-hand side of (83) is approximately maximized at $a = 8.7966$, at which point the value exceeds $d_{\mathbb{H}}(n)(1-2/k)/(220\varepsilon^2)$.
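The maximization behind the constant 220 is easy to reproduce numerically (simple grid search; no data from the paper is involved):

```python
# Reproducing the constant in (84): the right-hand side of (83) is proportional
# to f(a) = log(a^2 / (8(a-2))) / a^2, whose maximum over a > 4 sits near
# a = 8.7966 with 1/f(a*) just below 220.  Plain grid search.
import math

best_a, best_f = None, -1.0
a = 4.001
while a < 50.0:
    f = math.log(a * a / (8 * (a - 2))) / (a * a)
    if f > best_f:
        best_a, best_f = a, f
    a += 0.0005
print(f"a* ~ {best_a:.4f},  1/f(a*) ~ {1 / best_f:.1f}")      # about 8.797 and 219.3
```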

Thus, for all $k \ge 1$, if $\varepsilon, \delta \le 1/(9k)$ and
$$m \le \frac{d_{\mathbb{H}}(n)\left(1 - \frac{2}{k}\right)}{220\varepsilon^2}, \tag{84}$$
then
$$\Pr\left\{z : \mathrm{er}_P(A_n(z)) > \mathrm{opt}_P(\mathbb{H}^n) + \varepsilon\right\} > \delta.$$
To obtain the $\delta$-dependence in Theorem 14, observe that by assumption $\mathbb{H}_1$ contains at least two functions $h_1, h_2$, hence there exists an $x \in X$ such that $h_1(x) \ne h_2(x)$. Let $P_{\pm}$ be two distributions concentrated on $(x,1)$ and $(x,-1)$ such that $P_{\pm}(x, h_1(x)) = (1\pm\varepsilon)/2$ and $P_{\pm}(x, h_2(x)) = (1\mp\varepsilon)/2$. Let $\mathbf{P}_+ := P_+ \times \cdots \times P_+$ and $\mathbf{P}_- := P_- \times \cdots \times P_-$ be the product distributions on $(X \times \{\pm 1\})^n$ generated by $P_{\pm}$, and $\mathbf{h}_1 := (h_1,\ldots,h_1)$, $\mathbf{h}_2 := (h_2,\ldots,h_2)$. Note that $\mathbf{h}_1$ and $\mathbf{h}_2$ are both in $\mathbb{H}^n$. If $\mathbf{P}$ is one of $\mathbf{P}_{\pm}$ and the learning algorithm $A_n$ chooses the wrong hypothesis $\mathbf{h}$, then
$$\mathrm{er}_{\mathbf{P}}(\mathbf{h}) - \mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) = \varepsilon.$$


Now, if we choose $\mathbf{P}$ uniformly at random from $\{\mathbf{P}_+, \mathbf{P}_-\}$ and generate an $(n,m)$-sample $z$ according to $\mathbf{P}$, Lemma 30 shows that
$$\Pr\left\{(\mathbf{P},z) : \mathrm{er}_{\mathbf{P}}(A_n(z)) \ge \mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) + \varepsilon\right\} > \frac{1}{4}\left[1 - \sqrt{1 - e^{-\frac{nm\varepsilon^2}{1-\varepsilon^2}}}\right],$$
which is at least $\delta$ if
$$m < \frac{1-\varepsilon^2}{\varepsilon^2}\,\frac{1}{n}\,\log\frac{1}{8\delta(1-2\delta)}, \tag{85}$$

provided $0 < \delta < 1/4$. Combining the two constraints on $m$, (84) (with $k = 7$) and (85), and using $\max\{x_1,x_2\} \ge \frac{1}{2}(x_1+x_2)$ finishes the proof.
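Finally, (85) is the exact inversion of the Lemma 30 bound used just above; a quick check with illustrative $n$, $\varepsilon$, $\delta$ shows the failure probability crossing $\delta$ at the stated threshold.

```python
# The threshold in (85) is the exact inversion of the Lemma 30 bound above:
# just below it the failure probability exceeds delta, just above it does not.
# n, eps and delta are illustrative.
import math

n, eps, delta = 3, 0.1, 0.05
m_max = (1 - eps ** 2) / eps ** 2 / n * math.log(1 / (8 * delta * (1 - 2 * delta)))
for m in (int(m_max), int(m_max) + 1):
    p = 0.25 * (1 - math.sqrt(1 - math.exp(-n * m * eps ** 2 / (1 - eps ** 2))))
    print(m, p > delta)                         # e.g. "33 True" then "34 False"
```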

Appendix D. Measurability

In order for Theorems 2 and 18 to hold in full generality we had to impose a constraint called “permissibility” on the hypothesis space family $\mathbb{H}$. Permissibility was introduced by Pollard (1984) for ordinary hypothesis classes $H$. His definition is very similar to Dudley's “image admissible Suslin” (Dudley, 1984). We will be extending this definition to cover hypothesis space families.

Throughout this section we assume all functions $h$ map from (the complete separable metric space) $Z$ into $[0,1]$. Let $\mathcal{B}(T)$ denote the Borel $\sigma$-algebra of any topological space $T$. As in Section 2.2, we view $\mathcal{P}$, the set of all probability measures on $Z$, as a topological space by equipping it with the topology of weak convergence. $\mathcal{B}(\mathcal{P})$ is then the $\sigma$-algebra generated by this topology. The following two definitions are taken (with minor modifications) from Pollard (1984).

Definition 8. A set $H$ of $[0,1]$-valued functions on $Z$ is indexed by the set $T$ if there exists a function $f : Z \times T \to [0,1]$ such that
$$H = \{f(\,\cdot\,, t) : t \in T\}.$$

Definition 9. The set $H$ is permissible if it can be indexed by a set $T$ such that

1. $T$ is an analytic subset of a Polish$^7$ space $\bar{T}$, and

2. the function $f : Z \times T \to [0,1]$ indexing $H$ by $T$ is measurable with respect to the product $\sigma$-algebra $\mathcal{B}(Z) \otimes \mathcal{B}(T)$.

An analytic subset $T$ of a Polish space $\bar{T}$ is simply the continuous image of a Borel subset $X$ of another Polish space $\bar{X}$. The analytic subsets of a Polish space include the Borel sets. They are important because projections of analytic sets are analytic, and can be measured in a complete measure space, whereas projections of Borel sets are not necessarily Borel, and hence cannot be measured with a Borel measure. For more details see Dudley (1989), Section 13.2.

Lemma 31. $H_1 \times \cdots \times H_n : (X \times Y)^n \to [0,1]$ is permissible if $H_1, \ldots, H_n$ are all permissible.

Proof. Omitted.

We now define permissibility of hypothesis space families.

7. A topological space is called Polish if it is metrizable by a metric under which it is a complete separable metric space.


Definition 10. A hypothesis space family $\mathbb{H} = \{H\}$ is permissible if there exist sets $S$ and $T$ that are analytic subsets of Polish spaces $\bar{S}$ and $\bar{T}$ respectively, and a function $f : Z \times T \times S \to [0,1]$, measurable with respect to $\mathcal{B}(Z) \otimes \mathcal{B}(T) \otimes \mathcal{B}(S)$, such that
$$\mathbb{H} = \big\{\{f(\,\cdot\,, t, s) : t \in T\} : s \in S\big\}.$$

Let $(X, \Sigma, \mu)$ be a measure space and $T$ be an analytic subset of a Polish space. Let $\mathcal{A}(X)$ denote the analytic subsets of $X$. The following three facts about analytic sets are taken from Pollard (1984), Appendix C.

(a) If $(X, \Sigma, \mu)$ is complete then $\mathcal{A}(X) \subseteq \Sigma$.

(b) $\mathcal{A}(X \times T)$ contains the product $\sigma$-algebra $\Sigma \otimes \mathcal{B}(T)$.

(c) For any set $Y$ in $\mathcal{A}(X \times T)$, the projection $\pi_X Y$ of $Y$ onto $X$ is in $\mathcal{A}(X)$.

Recall Definition 2 for the definition of $\mathbb{H}^*$. In the following lemma we assume that $(Z, \mathcal{B}(Z))$ has been completed with respect to any probability measure $P$, and also that $(\mathcal{P}, \mathcal{B}(\mathcal{P}))$ is complete with respect to the environmental measure $Q$.

Lemma 32. For any permissible hypothesis space family $\mathbb{H}$,

1. $\mathbb{H}^n_l$ is permissible.

2. $\{h \in H : H \in \mathbb{H}\}$ is permissible.

3. $H$ is permissible for all $H \in \mathbb{H}$.

4. $\sup H$ and $\inf H$ are measurable for all $H \in \mathbb{H}$.

5. $H^*$ is measurable for all $H \in \mathbb{H}$.

6. $\mathbb{H}^*$ is permissible.

Proof. As we have absorbed the loss function into the hypotheses $h$, $\mathbb{H}^n_l$ is simply the set of all $n$-fold products $H \times \cdots \times H$ such that $H \in \mathbb{H}$. Thus (1) follows from Lemma 31. (2) and (3) are immediate from the definitions. As $H$ is permissible for all $H \in \mathbb{H}$, (4) can be proved by an identical argument to that used in the “Measurable Suprema” section of Pollard (1984), Appendix C.

For (5), note that for any Borel-measurable $h : Z \to [0,1]$, the function $\bar{h} : \mathcal{P} \to [0,1]$ defined by $\bar{h}(P) := \int_Z h(z)\, dP(z)$ is Borel measurable (Kechris, 1995, Chapter 17). Now, permissibility of $H$ automatically implies permissibility of $\bar{H} := \{\bar{h} : h \in H\}$, and $H^* = \inf \bar{H}$, so $H^*$ is measurable by (4).

Now let $\mathbb{H}$ be indexed by $f : Z \times T \times S \to [0,1]$ in the appropriate way. To prove (6), define $g : \mathcal{P} \times T \times S \to [0,1]$ by $g(P,t,s) := \int_Z f(z,t,s)\, dP(z)$. By Fubini's theorem $g$ is a $\mathcal{B}(\mathcal{P}) \otimes \mathcal{B}(T) \otimes \mathcal{B}(S)$-measurable function. Let $G : \mathcal{P} \times S \to [0,1]$ be defined by $G(P,s) := \inf_{t \in T} g(P,t,s)$. $G$ indexes $\mathbb{H}^*$ in the appropriate way for $\mathbb{H}^*$ to be permissible, provided it can be shown that $G$ is $\mathcal{B}(\mathcal{P}) \otimes \mathcal{B}(S)$-measurable. This is where analyticity becomes important. Let $g_\alpha := \{(P,t,s) : g(P,t,s) < \alpha\}$. By property (b) of analytic sets, $\mathcal{A}(\mathcal{P} \times T \times S)$ contains $g_\alpha$. Since $G$ is an infimum over $t$, the set $G_\alpha := \{(P,s) : G(P,s) < \alpha\}$ is the projection of $g_\alpha$ onto $\mathcal{P} \times S$, which by property (c) is also analytic. As $(\mathcal{P}, \mathcal{B}(\mathcal{P}), Q)$ is assumed complete, $G_\alpha$ is measurable by property (a). Thus $G$ is a measurable function and the permissibility of $\mathbb{H}^*$ follows.


References

Abu-Mostafa, Y. (1993). A method for learning from hints. In Hanson, S. J., Cowan, J. D., & Giles, C. L. (Eds.), Advances in Neural Information Processing Systems 5, pp. 73–80. Morgan Kaufmann, San Mateo, CA.

Anthony, M., & Bartlett, P. L. (1999). Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, UK.

Bartlett, P. L. (1993). Lower bounds on the VC-dimension of multi-layer threshold networks. In Proceedings of the Sixth ACM Conference on Computational Learning Theory, pp. 144–150. ACM Press, New York. Summary appeared in Neural Computation, 5, no. 3.

Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2), 525–536.

Baxter, J. (1995a). Learning Internal Representations. Ph.D. thesis, Department of Mathematics and Statistics, The Flinders University of South Australia. Copy available from http://wwwsyseng.anu.edu.au/~jon/papers/thesis.ps.gz.

Baxter, J. (1995b). Learning internal representations. In Proceedings of the Eighth International Conference on Computational Learning Theory, pp. 311–320. ACM Press. Copy available from http://wwwsyseng.anu.edu.au/~jon/papers/colt95.ps.gz.

Baxter, J. (1997a). A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28, 7–40.

Baxter, J. (1997b). The canonical distortion measure for vector quantization and function approximation. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 39–47. Morgan Kaufmann.

Baxter, J., & Bartlett, P. L. (1998). The canonical distortion measure in feature space and 1-NN classification. In Advances in Neural Information Processing Systems 10, pp. 245–251. MIT Press.

Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.

Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36, 929–965.

Caruana, R. (1997). Multitask learning. Machine Learning, 28, 41–70.

Devroye, L., Gyorfi, L., & Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York.

Dudley, R. M. (1984). A Course on Empirical Processes, Vol. 1097 of Lecture Notes in Mathematics, pp. 2–142. Springer-Verlag.

Dudley, R. M. (1989). Real Analysis and Probability. Wadsworth & Brooks/Cole, California.


Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (Eds.). (1995). Bayesian Data Analysis. Chapman and Hall.

Good, I. J. (1980). Some history of the hierarchical Bayesian methodology. In Bernardo, J. M., DeGroot, M. H., Lindley, D. V., & Smith, A. F. M. (Eds.), Bayesian Statistics II. University Press, Valencia.

Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100, 78–150.

Heskes, T. (1998). Solving a huge number of similar tasks: a combination of multi-task learning and a hierarchical Bayesian approach. In Shavlik, J. (Ed.), Proceedings of the 15th International Conference on Machine Learning (ICML ’98), pp. 233–241. Morgan Kaufmann.

Intrator, N., & Edelman, S. (1996). How to make a low-dimensional representation suitable for diverse tasks. Connection Science, 8.

Kechris, A. S. (1995). Classical Descriptive Set Theory. Springer-Verlag, New York.

Khan, K., Muggleton, S., & Parson, R. (1998). Repeat learning using predicate invention. In Page, C. D. (Ed.), Proceedings of the 8th International Workshop on Inductive Logic Programming (ILP-98), LNAI 1446, pp. 165–174. Springer-Verlag.

Langford, J. C. (1999). Staged learning. Tech. rep., CMU, School of Computer Science. http://www.cs.cmu.edu/~jcl/research/ltol/stagedlatest.ps.

Mitchell, T. M. (1991). The need for biases in learning generalisations. In Dietterich, T. G., & Shavlik, J. (Eds.), Readings in Machine Learning. Morgan Kaufmann.

Parthasarathy, K. R. (1967). Probability Measures on Metric Spaces. Academic Press, London.

Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York.

Pratt, L. Y. (1992). Discriminability-based transfer between neural networks. In Hanson, S. J., Cowan, J. D., & Giles, C. L. (Eds.), Advances in Neural Information Processing Systems 5, pp. 204–211. Morgan Kaufmann.

Rendell, L., Seshu, R., & Tcheng, D. (1987). Layered concept learning and dynamically-variable bias management. In Proceedings of the Tenth International Joint Conference on Artificial Intelligence (IJCAI ’87), pp. 308–314. IJCAI, Inc.

Ring, M. B. (1995). Continual Learning in Reinforcement Environments. R. Oldenbourg Verlag.

Russell, S. (1989). The Use of Knowledge in Analogy and Induction. Morgan Kaufmann.

Sauer, N. (1972). On the density of families of sets. Journal of Combinatorial Theory A, 13, 145–168.

Sharkey, N. E., & Sharkey, A. J. C. (1993). Adaptive generalisation and the transfer of knowledge. Artificial Intelligence Review, 7, 313–328.


Silver, D. L., & Mercer, R. E. (1996). The parallel transfer of task knowledge using dynamic learning rates based on a measure of relatedness. Connection Science, 8, 277–294.

Singh, S. (1992). Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8, 323–339.

Slud, E. (1977). Distribution inequalities for the binomial law. Annals of Probability, 4, 404–412.

Suddarth, S. C., & Holden, A. D. C. (1991). Symbolic-neural systems and the use of hints in developing complex systems. International Journal of Man-Machine Studies, 35, 291–311.

Suddarth, S. C., & Kergosien, Y. L. (1990). Rule-injection hints as a means of improving network performance and learning time. In Proceedings of the EURASIP Workshop on Neural Networks, Portugal. EURASIP.

Sutton, R. (1992). Adapting bias by gradient descent: An incremental version of delta-bar-delta. In Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 171–176. MIT Press.

Tate, R. F. (1953). On a double inequality of the normal distribution. Annals of Mathematical Statistics, 24, 132–134.

Thrun, S. (1996). Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems 8, pp. 640–646. MIT Press.

Thrun, S., & Mitchell, T. M. (1995). Learning one more thing. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 1217–1223. Morgan Kaufmann.

Thrun, S., & O’Sullivan, J. (1996). Discovering structure in multiple learning tasks: The TC algorithm. In Saitta, L. (Ed.), Proceedings of the 13th International Conference on Machine Learning (ICML ’96), pp. 489–497. Morgan Kaufmann.

Thrun, S., & Pratt, L. (Eds.). (1997). Learning to Learn. Kluwer Academic.

Thrun, S., & Schwartz, A. (1995). Finding structure in reinforcement learning. In Tesauro, G., Touretzky, D., & Leen, T. (Eds.), Advances in Neural Information Processing Systems, Vol. 7, pp. 385–392. MIT Press.

Utgoff, P. E. (1986). Shift of bias for inductive concept learning. In Machine Learning: An Artificial Intelligence Approach, pp. 107–147. Morgan Kaufmann.

Valiant, L. G. (1984). A theory of the learnable. Comm. ACM, 27, 1134–1142.

Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York.

Vapnik, V. N. (1996). The Nature of Statistical Learning Theory. Springer-Verlag, New York.
