Machine Learning: Some theoretical and practical problems
Olivier Bousquet
Journées MAS, Lille, 2006
Outline
1 Framework
2 Theoretical Results
3 Consequences
4 Practical Implications
The Setting
Prediction problems: after observing example pairs (X, Y), build a function g : X → Y that predicts well, g(X) ≈ Y
Typical setting is statistical (data assumed to be sampled i.i.d.)
Other setting: on-line adversarial (no assumption on the data generation mechanism)
Goal: find the best algorithm
Theoretical answer: fundamental limits of learning. Practical answer: guidelines for algorithm design.
Definitions
We consider the classification setting: Y = {0, 1} with data sampled i.i.d.
A rule (or learning algorithm) is a mapping g_n : (X × Y)^n × X → Y.
Sample: S_n = {(X_1, Y_1), ..., (X_n, Y_n)}
Misclassification error: L(g) = P(g(X) ≠ Y) (conditional on the sample)
Bayes error: best possible error L* = inf_g L(g) over all measurable functions
Sequence of classification rules {g_n}: defined for any sample size (algorithms are usually defined in this way, possibly with a sample-size-dependent parameter)
Consistency: lim_{n→∞} E L(g_n) = L*
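A minimal numerical sketch of these definitions (the toy distribution, the uniform marginal, and all names are our own choices, not from the talk): for a finite X with known η(x) = P(Y = 1 | X = x), the Bayes error is L* = E[min(η(X), 1 − η(X))], attained by g*(x) = 1{η(x) > 1/2}.

```python
import random

# Toy illustration of the definitions above. The distribution (uniform marginal
# on a finite X, eta(x) = P(Y=1 | X=x) = x/10) is our own choice.
X_VALS = list(range(10))
ETA = {x: 0.1 * x for x in X_VALS}

def bayes_error():
    # L* = E[min(eta(X), 1 - eta(X))] for the 0-1 loss
    return sum(min(ETA[x], 1 - ETA[x]) for x in X_VALS) / len(X_VALS)

def bayes_rule(x):
    # The Bayes classifier g*(x) = 1{eta(x) > 1/2} attains L*
    return 1 if ETA[x] > 0.5 else 0

def error(g, n_mc=100_000, seed=0):
    # Monte Carlo estimate of L(g) = P(g(X) != Y)
    rng = random.Random(seed)
    errs = 0
    for _ in range(n_mc):
        x = rng.choice(X_VALS)
        y = 1 if rng.random() < ETA[x] else 0
        errs += g(x) != y
    return errs / n_mc
```

Here bayes_error() equals 0.25, and error(bayes_rule) should come out close to it; any other rule can only do worse in expectation.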
Outline
1 Framework
2 Theoretical Results
3 Consequences
4 Practical Implications
Consistency
How to build a consistent sequence of rules?
Countable X: very easy, just wait! Eventually every point with non-zero probability is observed an unbounded number of times (i.e. take majority vote over observed x, and random prediction on unobserved ones)
Uncountable X: the observed sample has measure zero (for non-atomic measures), so this trick does not work
Instead, take local majority with two conditions: more and more local, but also more and more points averaged
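The countable-X construction can be sketched in a few lines (a toy implementation under our own naming; the prediction on unseen points is arbitrary, as in the slide):

```python
from collections import Counter, defaultdict

# Sketch of the trivially consistent rule for countable X described above:
# majority vote over the labels seen at x, with an arbitrary constant
# prediction on points never observed.
def majority_vote_rule(sample):
    votes = defaultdict(Counter)
    for x, y in sample:
        votes[x][y] += 1
    def g(x):
        if x in votes:
            return votes[x].most_common(1)[0][0]
        return 0  # arbitrary guess on unseen points
    return g
```

For instance, g = majority_vote_rule([(1, 1), (1, 1), (1, 0), (2, 0)]) predicts 1 at x = 1 and 0 elsewhere.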
Consistency of Histograms
Histogram in R^d: cubic cells of size h_n, prediction is constant over each cell (majority vote)
h_n → 0 and n h_n^d → ∞ are enough for universal consistency
Idea of the proof
Continuous functions with bounded support are dense in L_p(ν). Such functions are uniformly continuous and can thus be approximated by histograms (average of the function on a cell) provided the cell size goes to 0. Since the cells will contain more and more points (second condition), the cell value will eventually converge to the average over the cell.
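The histogram rule admits a short sketch (our own naming; here h is a fixed parameter for illustration, whereas consistency requires a sequence h_n → 0 with n·h_n^d → ∞):

```python
import math
from collections import Counter, defaultdict

# Sketch of the histogram rule: partition R^d into cubic cells of side h and
# predict by majority vote inside the query point's cell.
def histogram_rule(sample, h):
    def cell(x):
        # index of the cubic cell containing x (points given as tuples)
        return tuple(math.floor(xi / h) for xi in x)
    votes = defaultdict(Counter)
    for x, y in sample:
        votes[cell(x)][y] += 1
    def g(x):
        c = votes.get(cell(x))
        return c.most_common(1)[0][0] if c else 0  # arbitrary on empty cells
    return g
```

For instance, with sample [((0.1,), 1), ((0.2,), 1), ((0.9,), 0)] and h = 0.5, the rule predicts 1 on the cell [0, 0.5) and 0 on [0.5, 1).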
No Free Lunch
We can "learn" anything
Is the problem solved?
The question becomes: among the consistent algorithms, which one is the best?
We consider here the special case of classification
Similar phenomena occur for regression or density estimation
Unfortunately, there is no free lunch
No Free Lunch 1
Out-of-sample error: L'(g_n) = P(g_n(X) ≠ Y | X ∉ S_n)
Consider a uniform probability distribution µ over problems, i.e. for all x, E_µ P(Y = 1|X = x) = E_µ P(Y = 0|X = x)
All classifiers have the same average error
Theorem (Wolpert96)
For any classification rule g_n,
E_µ E L'(g_n) = 1/2
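NFL 1 can be illustrated by simulation (the finite setup and all names are ours): draw a "problem" by labeling each point of a finite X uniformly at random; out of sample, a fixed rule can only guess, so its error averages 1/2 over problems, whatever the rule.

```python
import random

# Monte Carlo illustration of NFL 1: average the out-of-sample error of a
# fixed rule over problems drawn uniformly (each label an unbiased coin flip).
def avg_out_of_sample_error(rule, n_points=20, n_train=5,
                            n_problems=2000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_problems):
        truth = {x: rng.randint(0, 1) for x in range(n_points)}  # random problem
        train_x = rng.sample(range(n_points), n_train)
        g = rule([(x, truth[x]) for x in train_x])
        test_x = [x for x in range(n_points) if x not in train_x]  # X not in S_n
        total += sum(g(x) != truth[x] for x in test_x) / len(test_x)
    return total / n_problems

def memorize_then_zero(sample):
    # one arbitrary rule among many; any other choice gives the same average
    seen = dict(sample)
    return lambda x: seen.get(x, 0)
```

avg_out_of_sample_error(memorize_then_zero) comes out near 0.5, and swapping in any other rule changes nothing on average.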
No Free Lunch 2
A consequence of NFL1 is that there are always cases where an algorithm can be beaten.
A stronger version of NFL1: No Super Classifier
Theorem (DGL96)
For every sequence of classification rules {g_n} there is a universally consistent sequence {g'_n} such that for some distribution
L(g_n) > L(g'_n)
for all n.
No Free Lunch 3
A variation of NFL1
Arbitrarily bad error for fixed sample sizes
Theorem (Devroye82)
Fix ε > 0. For any integer n and classification rule g_n, there exists a distribution of (X, Y) with Bayes risk L* = 0 such that
E L(g_n) ≥ 1/2 − ε
No Free Lunch 4
NFL3 possibly considers a different distribution for each n
What happens for a fixed distribution when n increases?
Slow rate phenomenon
Theorem (Devroye82)
Let {a_n} be a sequence of positive numbers converging to zero with 1/16 ≥ a_1 ≥ a_2 ≥ ... For every sequence of classification rules, there exists a distribution of (X, Y) with Bayes risk L* = 0 such that
E L(g_n) ≥ a_n
for all n.
Proofs
The idea is to create a "bad" distribution
It turns out that random ones are bad enough: just create a problem with no structure (the prediction at x unrelated to the prediction at x')
All proofs work on finite (for fixed n) or countable (for varying n) spaces (no need to introduce an uncountable X)
The trick is to make sure that there are enough points that have not been observed yet (on those, the error will be 1/2)
A closer look at consistency
Consider the trivially consistent rule for a countable space (majority vote)
Its error decreases with increasing sample size:
∀n, E L(g_n) ≥ E L(g_{n+1})
Is this true in general for universally consistent rules?
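The monotone-error property of the majority-vote rule can be checked by Monte Carlo (the construction is ours): on a noiseless problem over five points, the expected error is exactly (2/5)(4/5)^n, which decreases with n.

```python
import random

# Monte Carlo check of the monotonicity above for the majority-vote rule:
# X uniform on {0,...,4}, noiseless labels y(x) = x % 2, prediction 0 on
# unseen points. Here E L(g_n) = (2/5) * (4/5)**n exactly.
def expected_error(n, n_reps=4000, seed=0):
    rng = random.Random(seed)
    pts = list(range(5))
    label = lambda x: x % 2
    total = 0.0
    for _ in range(n_reps):
        seen = {}
        for _ in range(n):
            x = rng.choice(pts)
            seen[x] = label(x)  # noiseless, so the majority label is label(x)
        total += sum(seen.get(x, 0) != label(x) for x in pts) / len(pts)
    return total / n_reps
```

The estimates reproduce the decrease, e.g. expected_error(2) ≈ 0.256 versus expected_error(10) ≈ 0.043.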
Smart rules
Consistency for uncountable spaces is not so trivial
Smart rules
Definition
A sequence {g_n} of classification rules is smart if for any distribution and any integer n,
E L(g_n) ≥ E L(g_{n+1})
For uncountable spaces, some of the known universally consistent rules can be shown to be non-smart
Conjecture: on R^d no smart rule is universally consistent
Interpretation: consistency on uncountable spaces requires adapting the degree of smoothness to the sample size; this means that there will be a point at which the smoothness degree is too large
Anti-learning
The average error is 1/2, so there are problems for which the error is much worse than random guessing!
One can indeed construct distributions for which some standard algorithms have E L(g_n) arbitrarily close to 1, even with L* = 0!
Of course this occurs for a fixed sample size
Can one always do that (for any rule)?
The problem should have a structure, but one which is opposite to the ones preferred by the algorithm
Bayes Error Estimation
Assume we just want to estimate L*.
Of course, we could use any universally consistent algorithm and estimate its error. But we get slow rates!
Is there a better way?
Theorem (DGL96)
For every n, for any estimate L_n of the Bayes error L* and for every ε > 0, there exists a distribution of (X, Y) such that
E(|L_n − L*|) ≥ 1/4 − ε
Estimating this single number does not seem easier than estimating the whole set {x : P(Y = 1|x) > 1/2}
Outline
1 Framework
2 Theoretical Results
3 Consequences
4 Practical Implications
What can we hope to prove?
Our framework is too general! Nothing interesting can be said about learning algorithms
Can we prove something interesting under slightly more restrictive assumptions?
Are the distributions used to prove the NFLs pathological? (NFL 4 holds even within classes of "reasonable" distributions!)
If we can define which problems actually occur in real life, we can hope to derive appropriate algorithms (optimal on this class of problems)
The Bayesian Way
Assume something about how the data is generated
Consider an algorithm specifically tuned to this property
Prove that under this assumption the algorithm does well
Most results are going in this direction (sometimes in a subtle way)
Bayesian algorithms
Most minimax results are of this form:
inf_{g_n} sup_{P ∈ P} ( L(g_n) − inf_g L(g) )
Seems reasonable and useful for understanding, but does not provide guarantees
Outline Framework Theoretical Results Consequences Practical Implications
The Bayesian Way
Assume something about how the data is generated
Consider an algorithm specifically tuned to this property
Prove that under this assumption the algorithm does well
Most results are going in this direction (sometimes in a subtle way)
Bayesian algorithms
Most minimax results are of this form
inf{gn}
supP∈P
(L(gn)− inf
gL(g)
)Seems reasonable and useful for understanding but does not provideguarantees
Olivier Bousquet Machine Learning: Some theoretical and practical problems
Outline Framework Theoretical Results Consequences Practical Implications
The Bayesian Way
Assume something about how the data is generated
Consider an algorithm specifically tuned to this property
Prove that under this assumption the algorithm does well
Most results are going in this direction (sometimes in a subtle way)
Bayesian algorithms
Most minimax results are of this form
inf{gn}
supP∈P
(L(gn)− inf
gL(g)
)Seems reasonable and useful for understanding but does not provideguarantees
Olivier Bousquet Machine Learning: Some theoretical and practical problems
Outline Framework Theoretical Results Consequences Practical Implications
The Bayesian Way
Assume something about how the data is generated
Consider an algorithm specifically tuned to this property
Prove that under this assumption the algorithm does well
Most results are going in this direction (sometimes in a subtle way)
Bayesian algorithms
Most minimax results are of this form
inf{gn}
supP∈P
(L(gn)− inf
gL(g)
)Seems reasonable and useful for understanding but does not provideguarantees
Olivier Bousquet Machine Learning: Some theoretical and practical problems
Outline Framework Theoretical Results Consequences Practical Implications
The Bayesian Way
Assume something about how the data is generated
Consider an algorithm specifically tuned to this property
Prove that under this assumption the algorithm does well
Most results are going in this direction (sometimes in a subtle way)
Bayesian algorithms
Most minimax results are of this form
inf{gn}
supP∈P
(L(gn)− inf
gL(g)
)Seems reasonable and useful for understanding but does not provideguarantees
Olivier Bousquet Machine Learning: Some theoretical and practical problems
Outline Framework Theoretical Results Consequences Practical Implications
The Bayesian Way
Assume something about how the data is generated
Consider an algorithm specifically tuned to this property
Prove that under this assumption the algorithm does well
Most results are going in this direction (sometimes in a subtle way)
Bayesian algorithms
Most minimax results are of this form
inf{gn}
supP∈P
(L(gn)− inf
gL(g)
)Seems reasonable and useful for understanding but does not provideguarantees
Olivier Bousquet Machine Learning: Some theoretical and practical problems
Outline Framework Theoretical Results Consequences Practical Implications
The Bayesian Way
Assume something about how the data is generated
Consider an algorithm specifically tuned to this property
Prove that under this assumption the algorithm does well
Most results are going in this direction (sometimes in a subtle way)
Bayesian algorithms
Most minimax results are of this form
inf{gn}
supP∈P
(L(gn)− inf
gL(g)
)Seems reasonable and useful for understanding but does not provideguarantees
Olivier Bousquet Machine Learning: Some theoretical and practical problems
Outline Framework Theoretical Results Consequences Practical Implications
The Worst Case Way
Assume nothing about the data (distribution-free)
Restrict your objectives
Derive an algorithm that reaches this objective no matter how the data is generated
inf_{g_n} sup_P ( L(g_n) − inf_{g∈G} L(g) )

Gives guarantees
In between: adaptation
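A minimal sketch of the worst-case recipe, with G taken (for illustration, not from the talk) to be threshold classifiers on the line: empirical risk minimization over G competes with the best member of G on any sample, whatever the data-generating mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

# Restricted objective: compete with the best classifier in the small
# class G = { x -> 1[x > t] }, by minimizing empirical risk over G.
def erm_threshold(x, y):
    # every labeling of the sample achievable by some threshold is
    # achieved by a threshold at a sample point (or below the minimum)
    candidates = np.concatenate(([x.min() - 1.0], np.sort(x)))
    errors = [np.mean((x > t).astype(int) != y) for t in candidates]
    return candidates[int(np.argmin(errors))]

# toy data: true threshold 0.6, with 10% label noise
x = rng.uniform(0, 1, 200)
y = (x > 0.6).astype(int) ^ (rng.uniform(size=200) < 0.1).astype(int)

t_hat = erm_threshold(x, y)
# by construction, t_hat's training error is <= that of any threshold
```

The guarantee is relative (against the best member of G), not absolute — the price paid for assuming nothing about the data.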
Outline
1 Framework
2 Theoretical Results
3 Consequences
4 Practical Implications
Does this help practically?
We can probably come up with algorithms that work well on most real-world problems
If we have a characterization of these problems, we can even prove something about such algorithms
However, there is no guarantee that a new problem will satisfy this characterization
So there cannot be a formal proof that an algorithm is good or bad
If theory cannot help, what can we do?
Essentially a matter of finding an algorithm that implements the right notion of smoothness for the problem at hand
More an art than a science!
Priors
Algorithm design is composed of two steps
Choosing a preference
This first step is based on knowledge of the problem; this is where guidance (but no theory) is needed.
Exploiting it for inference
The second step can possibly be formalized (optimality with respect to assumptions). The main issue is computational cost.
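The two steps can be sketched with ridge regression (an illustrative choice, not from the talk): the preference is "small-norm linear fits" (equivalently, a Gaussian prior on the weights), and exploiting it for inference reduces to a closed-form computation.

```python
import numpy as np

# Step 1, the preference: among linear fits, prefer small-norm weights
# (equivalently, a Gaussian prior on w).
# Step 2, exploiting it: closed-form ridge solution
#   w = (X'X + lam*I)^{-1} X'y
def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w_true = np.array([2.0, 3.0])
y = X @ w_true

w0 = ridge_fit(X, y, 0.0)     # no preference: plain least squares
w1 = ridge_fit(X, y, 10.0)    # strong preference: weights shrunk
```

Here the first step (why small norm? which features?) encodes problem knowledge; only the second step is mechanical.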
Why can algorithms fail in practice?
1 Data representation (inappropriate features, errors, ...)
2 Data scarcity (not enough data samples)
3 Data overload (too many variables, too much noise)
4 Lack of understanding of the result (impossible validation) / lack of validation data
Examples
Forgot to remove the output variable (or a version of it): algorithm picks it up
An irrelevant variable happens to be discriminative (e.g. date of sample collection)
Error in a measurement (misalignment in the database)
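The first failure above (a leaked copy of the output variable) can be reproduced in a few lines; the data and numbers here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A (near-)copy of the output variable is accidentally left among the
# features; a greedy learner immediately latches onto it.
n = 100
y = rng.integers(0, 2, n)
X = np.column_stack([
    rng.normal(size=n),                   # unrelated noise feature
    y + rng.normal(scale=0.01, size=n),   # leaked version of the label
])

def best_stump(X, y):
    """Return (feature, training error) of the best threshold stump --
    a caricature of greedy feature selection."""
    best = (None, 1.0)
    for j in range(X.shape[1]):
        for t in X[:, j]:
            err = min(np.mean((X[:, j] > t) != y),
                      np.mean((X[:, j] <= t) != y))
            if err < best[1]:
                best = (j, err)
    return best

feature, err = best_stump(X, y)   # picks the leaked feature, near-zero error
```

The near-perfect training score looks like success; only domain knowledge reveals that the "best feature" is the answer itself.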
So, what would be helpful?
Flexible ways to incorporate knowledge/expertise
Provide tools that allow one to formulate prior knowledge in a natural way
Look for other types of prior assumptions that occur in various problems (e.g. manifold structure, clusteredness, analogy...)
Ability to understand what is found by the algorithm (need a language to interact with experts)
Investigate how to improve understandability (simpler models, separate models and language for interaction...)
Improve interaction (understand user's intent)
Computationally efficient algorithms
Scalability, anytime algorithms
Incorporate time complexity in the theoretical analysis (trade complexity for accuracy)
References
L. Devroye: Necessary and Sufficient Conditions for the Almost Everywhere Convergence of Nearest Neighbor Regression Function Estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 61:467-481 (1982)
D. Wolpert: The Lack of A Priori Distinctions Between Learning Algorithms. Neural Computation 8 (1996)
L. Devroye, L. Györfi and G. Lugosi: A Probabilistic Theory of Pattern Recognition. Springer (1996)