Machine Learning: Some theoretical and practical problems
Olivier Bousquet
Journées MAS, Lille, 2006
Outline
1 Framework
2 Theoretical Results
3 Consequences
4 Practical Implications
The Setting
Prediction problems: after observing example pairs (X, Y), build a function g : X → Y that predicts well, g(X) ≈ Y
Typical setting is statistical (data assumed to be sampled i.i.d.)
Other setting: on-line adversarial (no assumption on the data generation mechanism)
Goal: find the best algorithm
Theoretical answer: fundamental limits of learning. Practical answer: guidelines for algorithm design.
Definitions
We consider the classification setting: Y = {0, 1} with data sampled i.i.d.
A rule (or learning algorithm) is a mapping g_n : (X × Y)^n × X → Y.
Sample: S_n = {(X_1, Y_1), ..., (X_n, Y_n)}
Misclassification error: L(g) = P(g(X) ≠ Y) (conditional on the sample)
Bayes error: best possible error L* = inf_g L(g) over all measurable functions
Sequence of classification rules {g_n}: defined for any sample size (algorithms are usually defined in this way, possibly with a sample-size-dependent parameter)
Consistency: lim_{n→∞} E L(g_n) = L*
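A minimal numerical sketch of these definitions (the toy distribution, the uniform marginal, and all names are our own choices, not from the talk): for a finite X with known η(x) = P(Y = 1 | X = x), the Bayes error is L* = E[min(η(X), 1 − η(X))], attained by g*(x) = 1{η(x) > 1/2}.

```python
import random

# Toy illustration of the definitions above. The distribution (uniform marginal
# on a finite X, eta(x) = P(Y=1 | X=x) = x/10) is our own choice.
X_VALS = list(range(10))
ETA = {x: 0.1 * x for x in X_VALS}

def bayes_error():
    # L* = E[min(eta(X), 1 - eta(X))] for the 0-1 loss
    return sum(min(ETA[x], 1 - ETA[x]) for x in X_VALS) / len(X_VALS)

def bayes_rule(x):
    # The Bayes classifier g*(x) = 1{eta(x) > 1/2} attains L*
    return 1 if ETA[x] > 0.5 else 0

def error(g, n_mc=100_000, seed=0):
    # Monte Carlo estimate of L(g) = P(g(X) != Y)
    rng = random.Random(seed)
    errs = 0
    for _ in range(n_mc):
        x = rng.choice(X_VALS)
        y = 1 if rng.random() < ETA[x] else 0
        errs += g(x) != y
    return errs / n_mc
```

Here bayes_error() equals 0.25, and error(bayes_rule) should come out close to it; any other rule can only do worse in expectation.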
Outline
1 Framework
2 Theoretical Results
3 Consequences
4 Practical Implications
Consistency
How to build a consistent sequence of rules?
Countable X: very easy, just wait! Eventually every point with non-zero probability is observed an unbounded number of times (i.e. take majority vote over observed x, and random prediction on unobserved ones)
Uncountable X: the observed sample has measure zero (for non-atomic measures), so this trick does not work
Instead, take local majority with two conditions: more and more local, but also more and more points averaged
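The countable-X construction can be sketched in a few lines (a toy implementation under our own naming; the prediction on unseen points is arbitrary, as in the slide):

```python
from collections import Counter, defaultdict

# Sketch of the trivially consistent rule for countable X described above:
# majority vote over the labels seen at x, with an arbitrary constant
# prediction on points never observed.
def majority_vote_rule(sample):
    votes = defaultdict(Counter)
    for x, y in sample:
        votes[x][y] += 1
    def g(x):
        if x in votes:
            return votes[x].most_common(1)[0][0]
        return 0  # arbitrary guess on unseen points
    return g
```

For instance, g = majority_vote_rule([(1, 1), (1, 1), (1, 0), (2, 0)]) predicts 1 at x = 1 and 0 elsewhere.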
Consistency of Histograms
Histogram in R^d: cubic cells of size h_n, prediction is constant over each cell (majority vote)
h_n → 0 and n h_n^d → ∞ are enough for universal consistency
Idea of the proof
Continuous functions with bounded support are dense in L_p(ν). Such functions are uniformly continuous and can thus be approximated by histograms (average of the function on a cell) provided the cell size goes to 0. Since the cells will contain more and more points (second condition), the cell value will eventually converge to the average over the cell.
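The histogram rule admits a short sketch (our own naming; here h is a fixed parameter for illustration, whereas consistency requires a sequence h_n → 0 with n·h_n^d → ∞):

```python
import math
from collections import Counter, defaultdict

# Sketch of the histogram rule: partition R^d into cubic cells of side h and
# predict by majority vote inside the query point's cell.
def histogram_rule(sample, h):
    def cell(x):
        # index of the cubic cell containing x (points given as tuples)
        return tuple(math.floor(xi / h) for xi in x)
    votes = defaultdict(Counter)
    for x, y in sample:
        votes[cell(x)][y] += 1
    def g(x):
        c = votes.get(cell(x))
        return c.most_common(1)[0][0] if c else 0  # arbitrary on empty cells
    return g
```

For instance, with sample [((0.1,), 1), ((0.2,), 1), ((0.9,), 0)] and h = 0.5, the rule predicts 1 on the cell [0, 0.5) and 0 on [0.5, 1).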
No Free Lunch
We can "learn" anything
Is the problem solved?
The question becomes: among the consistent algorithms, which one is the best?
We consider here the special case of classification
Similar phenomena occur for regression or density estimation
Unfortunately, there is no free lunch
No Free Lunch 1
Out-of-sample error: L'(g_n) = P(g_n(X) ≠ Y | X ∉ S_n)
Consider a uniform probability distribution µ over problems, i.e. for all x, E_µ P(Y = 1|X = x) = E_µ P(Y = 0|X = x)
All classifiers have the same average error
Theorem (Wolpert96)
For any classification rule g_n,
E_µ E L'(g_n) = 1/2
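NFL 1 can be illustrated by simulation (the finite setup and all names are ours): draw a "problem" by labeling each point of a finite X uniformly at random; out of sample, a fixed rule can only guess, so its error averages 1/2 over problems, whatever the rule.

```python
import random

# Monte Carlo illustration of NFL 1: average the out-of-sample error of a
# fixed rule over problems drawn uniformly (each label an unbiased coin flip).
def avg_out_of_sample_error(rule, n_points=20, n_train=5,
                            n_problems=2000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_problems):
        truth = {x: rng.randint(0, 1) for x in range(n_points)}  # random problem
        train_x = rng.sample(range(n_points), n_train)
        g = rule([(x, truth[x]) for x in train_x])
        test_x = [x for x in range(n_points) if x not in train_x]  # X not in S_n
        total += sum(g(x) != truth[x] for x in test_x) / len(test_x)
    return total / n_problems

def memorize_then_zero(sample):
    # one arbitrary rule among many; any other choice gives the same average
    seen = dict(sample)
    return lambda x: seen.get(x, 0)
```

avg_out_of_sample_error(memorize_then_zero) comes out near 0.5, and swapping in any other rule changes nothing on average.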
No Free Lunch 2
A consequence of NFL1 is that there are always cases where an algorithm can be beaten.
A stronger version of NFL1: No Super Classifier
Theorem (DGL96)
For every sequence of classification rules {g_n} there is a universally consistent sequence {g'_n} such that for some distribution
L(g_n) > L(g'_n)
for all n.
No Free Lunch 3
A variation of NFL1
Arbitrarily bad error for fixed sample sizes
Theorem (Devroye82)
Fix ε > 0. For any integer n and classification rule g_n, there exists a distribution of (X, Y) with Bayes risk L* = 0 such that
E L(g_n) ≥ 1/2 − ε
No Free Lunch 4
NFL3 possibly considers a different distribution for each n
What happens for a fixed distribution when n increases?
Slow rate phenomenon
Theorem (Devroye82)
Let {a_n} be a sequence of positive numbers converging to zero with 1/16 ≥ a_1 ≥ a_2 ≥ ... For every sequence of classification rules, there exists a distribution of (X, Y) with Bayes risk L* = 0 such that
E L(g_n) ≥ a_n
for all n.
Proofs
The idea is to create a "bad" distribution
It turns out that random ones are bad enough: just create a problem with no structure (the prediction at x unrelated to the prediction at x')
All proofs work on finite (for fixed n) or countable (for varying n) spaces (no need to introduce an uncountable X)
The trick is to make sure that there are enough points that have not been observed yet (on those, the error will be 1/2)
A closer look at consistency
Consider the trivially consistent rule for a countable space (majority vote)
Its error decreases with increasing sample size:
∀n, E L(g_n) ≥ E L(g_{n+1})
Is this true in general for universally consistent rules?
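The monotone-error property of the majority-vote rule can be checked by Monte Carlo (the construction is ours): on a noiseless problem over five points, the expected error is exactly (2/5)(4/5)^n, which decreases with n.

```python
import random

# Monte Carlo check of the monotonicity above for the majority-vote rule:
# X uniform on {0,...,4}, noiseless labels y(x) = x % 2, prediction 0 on
# unseen points. Here E L(g_n) = (2/5) * (4/5)**n exactly.
def expected_error(n, n_reps=4000, seed=0):
    rng = random.Random(seed)
    pts = list(range(5))
    label = lambda x: x % 2
    total = 0.0
    for _ in range(n_reps):
        seen = {}
        for _ in range(n):
            x = rng.choice(pts)
            seen[x] = label(x)  # noiseless, so the majority label is label(x)
        total += sum(seen.get(x, 0) != label(x) for x in pts) / len(pts)
    return total / n_reps
```

The estimates reproduce the decrease, e.g. expected_error(2) ≈ 0.256 versus expected_error(10) ≈ 0.043.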
Smart rules
Consistency for uncountable spaces is not so trivial
Smart rules
Definition
A sequence {g_n} of classification rules is smart if for any distribution and any integer n,
E L(g_n) ≥ E L(g_{n+1})
For uncountable spaces, some of the known universally consistent rules can be shown to be non-smart
Conjecture: on R^d no smart rule is universally consistent
Interpretation: consistency on uncountable spaces requires adapting the degree of smoothness to the sample size; this means that there will be a point at which the smoothness degree is too large
Anti-learning
The average error is 1/2, so there are problems for which the error is much worse than random guessing!
One can indeed construct distributions for which some standard algorithms have E L(g_n) arbitrarily close to 1, even with L* = 0!
Of course this occurs for a fixed sample size
Can one always do that (for any rule)?
The problem should have a structure, but one which is opposite to the ones preferred by the algorithm
Bayes Error Estimation
Assume we just want to estimate L*.
Of course, we could use any universally consistent algorithm and estimate its error. But we get slow rates!
Is there a better way?
Theorem (DGL96)
For every n, for any estimate L_n of the Bayes error L* and for every ε > 0, there exists a distribution of (X, Y) such that
E(|L_n − L*|) ≥ 1/4 − ε
Estimating this single number does not seem easier than estimating the whole set {x : P(Y = 1|x) > 1/2}
Outline
1 Framework
2 Theoretical Results
3 Consequences
4 Practical Implications
What can we hope to prove?
Our framework is too general! Nothing interesting can be said about learning algorithms
Can we prove something interesting under slightly more restrictive assumptions?
Are the distributions used to prove the NFLs pathological? (NFL 4 holds even within classes of "reasonable" distributions!)
If we can define which problems actually occur in real life, we can hope to derive appropriate algorithms (optimal on this class of problems)
The Bayesian Way
Assume something about how the data is generated
Consider an algorithm specifically tuned to this property
Prove that under this assumption the algorithm does well
Most results are going in this direction (sometimes in a subtle way)
Bayesian algorithms
Most minimax results are of this form:
inf_{g_n} sup_{P ∈ P} ( L(g_n) − inf_g L(g) )
Seems reasonable and useful for understanding, but does not provide guarantees
Outline Framework Theoretical Results Consequences Practical Implications
The Bayesian Way
Assume something about how the data is generated
Consider an algorithm specifically tuned to this property
Prove that under this assumption the algorithm does well
Most results are going in this direction (sometimes in a subtle way)
Bayesian algorithms
Most minimax results are of this form
inf{gn}
supP∈P
(L(gn)− inf
gL(g)
)Seems reasonable and useful for understanding but does not provideguarantees
Olivier Bousquet Machine Learning: Some theoretical and practical problems
Outline Framework Theoretical Results Consequences Practical Implications
The Bayesian Way
Assume something about how the data is generated
Consider an algorithm specifically tuned to this property
Prove that under this assumption the algorithm does well
Most results are going in this direction (sometimes in a subtle way)
Bayesian algorithms
Most minimax results are of this form
inf{gn}
supP∈P
(L(gn)− inf
gL(g)
)Seems reasonable and useful for understanding but does not provideguarantees
Olivier Bousquet Machine Learning: Some theoretical and practical problems
Outline Framework Theoretical Results Consequences Practical Implications
The Bayesian Way
Assume something about how the data is generated
Consider an algorithm specifically tuned to this property
Prove that under this assumption the algorithm does well
Most results are going in this direction (sometimes in a subtle way)
Bayesian algorithms
Most minimax results are of this form
inf{gn}
supP∈P
(L(gn)− inf
gL(g)
)Seems reasonable and useful for understanding but does not provideguarantees
Olivier Bousquet Machine Learning: Some theoretical and practical problems
Outline Framework Theoretical Results Consequences Practical Implications
The Bayesian Way
Assume something about how the data is generated
Consider an algorithm specifically tuned to this property
Prove that under this assumption the algorithm does well
Most results are going in this direction (sometimes in a subtle way)
Bayesian algorithms
Most minimax results are of this form
inf{gn}
supP∈P
(L(gn)− inf
gL(g)
)Seems reasonable and useful for understanding but does not provideguarantees
Olivier Bousquet Machine Learning: Some theoretical and practical problems
Outline Framework Theoretical Results Consequences Practical Implications
The Bayesian Way
Assume something about how the data is generated
Consider an algorithm specifically tuned to this property
Prove that under this assumption the algorithm does well
Most results are going in this direction (sometimes in a subtle way)
Bayesian algorithms
Most minimax results are of this form
inf{gn}
supP∈P
(L(gn)− inf
gL(g)
)Seems reasonable and useful for understanding but does not provideguarantees
Olivier Bousquet Machine Learning: Some theoretical and practical problems
Outline Framework Theoretical Results Consequences Practical Implications
The Bayesian Way
Assume something about how the data is generated
Consider an algorithm specifically tuned to this property
Prove that under this assumption the algorithm does well
Most results are going in this direction (sometimes in a subtle way)
Bayesian algorithms
Most minimax results are of this form
inf{gn}
supP∈P
(L(gn)− inf
gL(g)
)Seems reasonable and useful for understanding but does not provideguarantees
Olivier Bousquet Machine Learning: Some theoretical and practical problems
Outline Framework Theoretical Results Consequences Practical Implications
The Worst Case Way
Assume nothing about the data (distribution-free)
Restrict your objectives
Derive an algorithm that reaches this objective no matter how the data is generated
inf_{g_n} sup_P ( L(g_n) − inf_{g∈G} L(g) )

Gives guarantees
In between: adaptation
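A minimal sketch of the worst-case recipe, with G taken (for illustration, not from the talk) to be threshold classifiers on the line: empirical risk minimization over G competes with the best member of G on any sample, whatever the data-generating mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

# Restricted objective: compete with the best classifier in the small
# class G = { x -> 1[x > t] }, by minimizing empirical risk over G.
def erm_threshold(x, y):
    # every labeling of the sample achievable by some threshold is
    # achieved by a threshold at a sample point (or below the minimum)
    candidates = np.concatenate(([x.min() - 1.0], np.sort(x)))
    errors = [np.mean((x > t).astype(int) != y) for t in candidates]
    return candidates[int(np.argmin(errors))]

# toy data: true threshold 0.6, with 10% label noise
x = rng.uniform(0, 1, 200)
y = (x > 0.6).astype(int) ^ (rng.uniform(size=200) < 0.1).astype(int)

t_hat = erm_threshold(x, y)
# by construction, t_hat's training error is <= that of any threshold
```

The guarantee is relative (against the best member of G), not absolute — the price paid for assuming nothing about the data.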
Outline
1 Framework
2 Theoretical Results
3 Consequences
4 Practical Implications
Does this help practically?
We can probably come up with algorithms that work well on most real-world problems
If we have a characterization of these problems, we can even prove something about such algorithms
However, there is no guarantee that a new problem will satisfy this characterization
So there cannot be a formal proof that an algorithm is good or bad
If theory cannot help, what can we do?
Essentially a matter of finding an algorithm that implements the right notion of smoothness for the problem at hand
More an art than a science!
Priors
Algorithm design is composed of two steps
Choosing a preference
This first step is based on knowledge of the problem; this is where guidance (but no theory) is needed.
Exploiting it for inference
The second step can possibly be formalized (optimality with respect to assumptions). The main issue is computational cost.
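The two steps can be sketched with ridge regression (an illustrative choice, not from the talk): the preference is "small-norm linear fits" (equivalently, a Gaussian prior on the weights), and exploiting it for inference reduces to a closed-form computation.

```python
import numpy as np

# Step 1, the preference: among linear fits, prefer small-norm weights
# (equivalently, a Gaussian prior on w).
# Step 2, exploiting it: closed-form ridge solution
#   w = (X'X + lam*I)^{-1} X'y
def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w_true = np.array([2.0, 3.0])
y = X @ w_true

w0 = ridge_fit(X, y, 0.0)     # no preference: plain least squares
w1 = ridge_fit(X, y, 10.0)    # strong preference: weights shrunk
```

Here the first step (why small norm? which features?) encodes problem knowledge; only the second step is mechanical.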
Why can algorithms fail in practice?
1 Data representation (inappropriate features, errors, ...)
2 Data scarcity (not enough data samples)
3 Data overload (too many variables, too much noise)
4 Lack of understanding of the result (impossible validation) / lack of validation data
Examples
Forgot to remove the output variable (or a version of it): algorithm picks it up
An irrelevant variable happens to be discriminative (e.g. date of sample collection)
Error in a measurement (misalignment in the database)
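The first failure above (a leaked copy of the output variable) can be reproduced in a few lines; the data and numbers here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A (near-)copy of the output variable is accidentally left among the
# features; a greedy learner immediately latches onto it.
n = 100
y = rng.integers(0, 2, n)
X = np.column_stack([
    rng.normal(size=n),                   # unrelated noise feature
    y + rng.normal(scale=0.01, size=n),   # leaked version of the label
])

def best_stump(X, y):
    """Return (feature, training error) of the best threshold stump --
    a caricature of greedy feature selection."""
    best = (None, 1.0)
    for j in range(X.shape[1]):
        for t in X[:, j]:
            err = min(np.mean((X[:, j] > t) != y),
                      np.mean((X[:, j] <= t) != y))
            if err < best[1]:
                best = (j, err)
    return best

feature, err = best_stump(X, y)   # picks the leaked feature, near-zero error
```

The near-perfect training score looks like success; only domain knowledge reveals that the "best feature" is the answer itself.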
So, what would be helpful?
Flexible ways to incorporate knowledge/expertise
Provide tools that allow one to formulate prior knowledge in a natural way
Look for other types of prior assumptions that occur in various problems (e.g. manifold structure, clusteredness, analogy...)
Ability to understand what is found by the algorithm (need a language to interact with experts)
Investigate how to improve understandability (simpler models, separate models and language for interaction...)
Improve interaction (understand user's intent)
Computationally efficient algorithms
Scalability, anytime algorithms
Incorporate time complexity in the theoretical analysis (trade complexity for accuracy)
References
L. Devroye: Necessary and Sufficient Conditions for the Almost Everywhere Convergence of Nearest Neighbor Regression Function Estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 61:467-481 (1982)
D. Wolpert: The Lack of A Priori Distinctions Between Learning Algorithms. Neural Computation 8 (1996)
L. Devroye, L. Györfi and G. Lugosi: A Probabilistic Theory of Pattern Recognition. Springer (1996)