
Page 1

Towards a Unified Theory of Learning and Information


Ibrahim Alabdulmohsin

November 2018

Page 2

Key Takeaway

Q: When can we extrapolate from the past into the future?

A: Traditional approaches include statistical tools (e.g. VC theory, covering numbers, and multiple hypothesis testing) and algorithmic tools (e.g. stability).

However, there is a recent surge of interest in using Information Theory to answer this question.

Advantages: robust guarantees, adaptive learning, general setting, simplified analysis, new tools, unifies results, etc.

Page 3

Background

Page 4

A Bit of History

Ibn Al-Haitham (Alhazen) wrote his celebrated book Optics in around 1015 AD.

1. Introduced the scientific method and emphasized the role of empirical evidence.

2. Established, among other results, that light travels in straight lines, that the eyes are not the source of light, and the laws of refraction and reflection.

Page 5

Ibn Al-Haitham is less known for a key contribution he made in his Optics.

He argued that “perception was inferential”.

What we observe in “nature” may not be a phenomenon of nature. It could be a phenomenon of the human mind.

Page 7

The Moon Illusion

Page 8

How Does it Work?

A possible explanation (Angular Size-Contrast Theory):

1. We can perceive the distance to the horizon (triangulation).

2. Objects at that distance are expected to be barely visible.

3. The moon is not barely visible.

4. Therefore, the moon must be really big. (Interpretation)

Page 9

How does it work?

[Diagram: observation → abstraction → perception. An observation $z$ is mapped through successive abstractions $h_1(z)$, $h_2(z)$, $h_3(z)$ into a percept $s$ ("something").]

Page 10

Today, Ibn Al-Haitham’s model is taken for granted. Many other optical illusions have been discovered and popularized.

Page 11

Why?

Q: Why does the brain create many layers of abstractions?

A: To extract meaningful patterns that will continue to hold in the future. This is called generalization.

It does not help to memorize the past because experience will never be repeated exactly.

Page 12

Generalization is a key goal in any machine learning algorithm.

1. computer vision

2. intrusion detection

3. spam filtering

4. demand forecasting

5. ...

Conclusions based on experience may not always generalize:

E.g. Fear of “Friday the 13th”

Page 13

What are the necessary and sufficient conditions for generalization?

Answering this question will have applications for:

Quantifying Uncertainty

Performing Model Selection

Developing Better Machine Learning Algorithms

Gaining Qualitative Insight

· · ·

Page 14

Learning and Information

Page 15

Mathematical Formalism

A learning algorithm $\mathcal{L}$ maps a training sample $S \in \mathcal{Z}^m$ to a hypothesis $h \in \mathcal{H}$:

$$S \in \mathcal{Z}^m \;\longrightarrow\; \text{Learning} \;\longrightarrow\; h \in \mathcal{H}$$

Future observations $z_1, z_2, z_3, \ldots$ then incur the losses $l(z_1; h),\, l(z_2; h),\, l(z_3; h), \ldots$

Generalization = Future $-$ Past
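Read concretely, the "Future minus Past" line is the usual gap between expected and empirical risk; the following is a plausible rendering in standard notation, not a formula copied from the slides:

```latex
% Generalization gap of a hypothesis h learned from S = (z_1, ..., z_m):
\mathrm{Gen}(h, S)
  \;=\; \underbrace{\mathbb{E}_{z \sim p}\!\left[ l(z; h) \right]}_{\text{Future}}
  \;-\; \underbrace{\frac{1}{m} \sum_{i=1}^{m} l(z_i; h)}_{\text{Past}}
```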

Page 16

Example 1

Consider a classification task

We want to minimize the misclassification error rate.

Early systems used expert advice to create rules manually. Did not work well! In fact, experts disagreed.

Optimization based on data solved the problem successfully.
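For concreteness, the objective in this example is the familiar 0-1 risk; this is standard notation added as an illustration rather than a formula quoted from the slide:

```latex
% Misclassification risk and its empirical estimate on m labeled examples:
R(h) \;=\; \Pr_{(x, y) \sim p}\big[\, h(x) \neq y \,\big]
\;\approx\; \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\{ h(x_i) \neq y_i \}
```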

Page 17

Example 2

Consider a clustering task (e.g. location assignment, market segmentation, news aggregation, topic discovery, community detection).

We want to minimize the intra-cluster distances and maximizethe inter-cluster distances.
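One standard way to write such an objective, used here purely as an illustration (the slide does not commit to a specific formulation), is the k-means objective:

```latex
% k-means: partition the data into clusters C_1, ..., C_k with centroids mu_j,
% minimizing the total intra-cluster squared distance.
\min_{C_1, \ldots, C_k} \; \sum_{j=1}^{k} \sum_{z \in C_j} \lVert z - \mu_j \rVert^2,
\qquad \mu_j = \frac{1}{|C_j|} \sum_{z \in C_j} z
```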

Page 18

Example 3

Consider a sparse coding task:

(image credit: Andrew Ng)

Denoising and Compression

We want to minimize the description length and minimize the information loss.
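A common formulation of this trade-off, added here as an illustrative sketch rather than the slide's own notation, is the dictionary-learning / lasso objective:

```latex
% Sparse coding: reconstruct z from a dictionary D with a sparse code alpha.
\min_{D, \alpha} \; \underbrace{\lVert z - D\alpha \rVert_2^2}_{\text{information loss}}
  \;+\; \lambda \, \underbrace{\lVert \alpha \rVert_1}_{\text{description length (sparsity)}}
```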

Page 19

Information-Theoretic Approach

A learning algorithm can be viewed as a channel from $\mathcal{Z}^m$ to $\mathcal{H}$. From this, we define a “learning capacity” $C(\mathcal{L})$ for machine learning algorithms.

Channel capacity is not restricted to “communication” channels per se. More generally, it is about random variables.

Given two random variables $x$ and $y$, the conditional distribution $p(y \mid x)$ defines a channel.
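For reference, the classical Shannon capacity of such a channel is the quantity below; the learning capacity $C(\mathcal{L})$ of the cited papers plays an analogous role for the channel $p(h \mid S)$, though, as noted on page 23, the precise correspondence involves the total variation distance:

```latex
% Shannon capacity of the channel defined by p(y | x):
C \;=\; \sup_{p(x)} \; I(x;\, y)
% For a learning algorithm, the relevant channel is p(h | S), with S in Z^m and h in H.
```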

Page 20

It can be shown that $C(\mathcal{L})$ captures all of the contributions to the risk of over-fitting.

Finite Domain: If $|\mathcal{Z}| < \infty$, then $C(\mathcal{L}) \lesssim \sqrt{\frac{|\mathcal{Z}| - 1}{2\pi m}}$.

Finite Hypothesis Space: If $|\mathcal{H}| < \infty$, then $C(\mathcal{L}) \leq \sqrt{\frac{\log |\mathcal{H}|}{2m}}$.

Differential Privacy: If $\mathcal{L}$ is $(\epsilon, \delta)$ differentially private, then $C(\mathcal{L}) \leq (e^{\epsilon} - 1 + \delta)/2$.

Stochastic Convex Optimization: $C(\mathcal{L}) \leq \sqrt{d/(2m)}$.
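A small numerical illustration of these bounds; the parameter values below (m, |Z|, |H|, epsilon, delta, d) are hypothetical and chosen only to show the formulas in action:

```python
import math

# Hypothetical parameter values, purely for illustration.
m = 10_000               # training sample size
card_Z = 1_000           # |Z|, size of a finite instance domain
card_H = 2 ** 20         # |H|, size of a finite hypothesis space
eps, delta = 0.1, 1e-6   # differential-privacy parameters
d = 50                   # dimension in stochastic convex optimization

bounds = {
    "finite domain":         math.sqrt((card_Z - 1) / (2 * math.pi * m)),
    "finite hypothesis set": math.sqrt(math.log(card_H) / (2 * m)),
    "differential privacy":  (math.exp(eps) - 1 + delta) / 2,
    "stochastic convex opt": math.sqrt(d / (2 * m)),
}
for name, value in bounds.items():
    print(f"{name:>22s}: C(L) <= {value:.4f}")
```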

Page 21

Qualitative Insight

Stability: A learning algorithm has a small learning capacity if and only if it is algorithmically stable.

Information: A learning algorithm has a small learning capacity if and only if its hypothesis does not reveal a lot of information about the training sample.

Composition: Learning “more” about the training sample, e.g. $(h_1, h_2)$, increases the risk of overfitting, compared to $h_1$.

Universality: A problem is learnable if and only if it is learnable via an algorithm with a vanishing learning capacity.

Page 22

Data Processing Inequality: If $s \to h_1 \to h_2$, then the risk of overfitting in $h_2$ is smaller.

E.g. decision tree pruning and sparsification. This induces a partial order on learning algorithms:

$$\mathcal{L}_2 \subseteq \mathcal{L}_1 \;\Longleftrightarrow\; s \to h_1 \to h_2$$

[Diagram: learning algorithms $\mathcal{L}_1$, $\mathcal{L}_2$, $\mathcal{L}_3$ ordered by this relation.]
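For context, this ordering mirrors the classical data processing inequality from information theory; this is a standard fact stated here for reference, not a line taken from the slides:

```latex
% If s -> h_1 -> h_2 forms a Markov chain (h_2 is computed from h_1 alone), then
I(s;\, h_2) \;\leq\; I(s;\, h_1)
% so post-processing a hypothesis (e.g. pruning or sparsification) cannot increase
% the information it carries about the training sample.
```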

Page 23

Equivalence Result

In Learning Theory: The VC dimension.

Agnostic PAC learnability $\equiv$ finite VC dimension.

[Excerpt on the VC dimension reproduced on the slide:]

Figure 1. Three points in $\mathbb{R}^2$, shattered by oriented lines.

2.3. The VC Dimension and the Number of Parameters

The VC dimension thus gives concreteness to the notion of the capacity of a given set of functions. Intuitively, one might be led to expect that learning machines with many parameters would have high VC dimension, while learning machines with few parameters would have low VC dimension. There is a striking counterexample to this, due to E. Levin and J.S. Denker (Vapnik, 1995): a learning machine with just one parameter, but with infinite VC dimension (a family of classifiers is said to have infinite VC dimension if it can shatter $l$ points, no matter how large $l$). Define the step function $\theta(x)$, $x \in \mathbb{R}$: $\theta(x) = 1$ if $x > 0$; $\theta(x) = -1$ if $x \leq 0$. Consider the one-parameter family of functions defined by

$$f(x, \alpha) \equiv \theta(\sin(\alpha x)), \qquad x, \alpha \in \mathbb{R}. \qquad (4)$$

You choose some number $l$, and present me with the task of finding $l$ points that can be shattered. I choose them to be

$$x_i = 10^{-i}, \qquad i = 1, \ldots, l. \qquad (5)$$

You specify any labels you like:

$$y_1, y_2, \ldots, y_l, \qquad y_i \in \{-1, 1\}. \qquad (6)$$

Then $f(\cdot, \alpha)$ gives this labeling if I choose $\alpha$ to be

$$\alpha = \pi \left( 1 + \sum_{i=1}^{l} \frac{(1 - y_i)\, 10^i}{2} \right). \qquad (7)$$

Thus the VC dimension of this machine is infinite.

Interestingly, even though we can shatter an arbitrarily large number of points, we can also find just four points that cannot be shattered. They simply have to be equally spaced, and assigned labels as shown in Figure 2. This can be seen as follows: write the phase at $x_1$ as $\phi_1 = 2n\pi + \delta$. Then the choice of label $y_1 = 1$ requires $0 < \delta < \pi$. The phase at $x_2$, mod $2\pi$, is $2\delta$; then $y_2 = 1 \Rightarrow 0 < \delta < \pi/2$. Similarly, point $x_3$ forces $\delta > \pi/3$. Then at $x_4$, $\pi/3 < \delta < \pi/2$ implies that $f(x_4, \alpha) = -1$, contrary to the assigned label. These four points are the analogy, for the set of functions in Eq. (4), of the set of three points lying along a line, for oriented hyperplanes in $\mathbb{R}^n$. Neither set can be shattered by the chosen family of functions.
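As a quick sanity check of Eqs. (4)-(7), the following sketch verifies numerically that every labeling of $l = 4$ such points is realized; the helper names are mine, not from the excerpt:

```python
import itertools
import math

# Sketch verifying the infinite-VC-dimension construction in Eqs. (4)-(7):
# f(x, alpha) = theta(sin(alpha * x)) shatters the points x_i = 10^(-i).

def f(x, alpha):
    """One-parameter classifier from Eq. (4), with theta(0) = -1."""
    return 1 if math.sin(alpha * x) > 0 else -1

l = 4
xs = [10.0 ** (-i) for i in range(1, l + 1)]            # Eq. (5)

for labels in itertools.product([-1, 1], repeat=l):     # every labeling, Eq. (6)
    alpha = math.pi * (1 + sum((1 - y) * 10 ** i / 2    # Eq. (7)
                               for i, y in enumerate(labels, start=1)))
    assert all(f(x, alpha) == y for x, y in zip(xs, labels))

print(f"All {2 ** l} labelings of {l} points realized by a single parameter.")
```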

In Information Theory: The Shannon channel capacity, i.e. the highest achievable information rate with arbitrarily small error probability.

Note: An equivalence relation exists between the two quantities if the channel capacity is measured in the total variation distance.

Page 24

Applications

Page 25

Large Scale Optimization

Accuracy = Empirical Risk + Generalization Risk

Stochastic convex optimization can be parallelized trivially (e.g. SVM, logistic regression, least squares, etc.).

Algorithm                            Test Error     Training Time
ERM (m = 50,000)                     14.8 ± 0.3%    2.93 s
ERM (m = 1,000)                      15.3 ± 0.5%    0.03 s
Parallelized (K = 50, m = 1,000)     14.8 ± 0.1%    0.03 s

Table: $\ell_2$-regularized logistic regression on the MiniBooNE Particle Identification problem.
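A minimal sketch of the "trivial" parallelization referred to above: fit K independent models on disjoint chunks and average their parameters. The dataset, hyperparameters, and helper code below are placeholders of mine, not the experiment behind the table:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real dataset (the table used MiniBooNE).
X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)

K = 50  # number of chunks / workers
coefs, intercepts = [], []
for X_k, y_k in zip(np.array_split(X, K), np.array_split(y, K)):
    # Each fit is independent, so in practice it would run on its own worker.
    clf = LogisticRegression(C=1.0, max_iter=1000).fit(X_k, y_k)
    coefs.append(clf.coef_.ravel())
    intercepts.append(clf.intercept_[0])

# Averaged model: for a convex objective, averaging the K solutions keeps the
# empirical risk low while shrinking the over-fitting (generalization) term.
w_avg, b_avg = np.mean(coefs, axis=0), np.mean(intercepts)
accuracy = np.mean((X @ w_avg + b_avg > 0) == y)
print(f"Accuracy of the averaged model: {accuracy:.3f}")
```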

Page 26

Model Selection

Uniform Information Criterion: Model selection according to

$$\mathrm{UIC} = \mathbb{E}_{z \sim S}\big[ l(h, z) \big] + \sqrt{\frac{d}{2m}},$$

where $d$ is the number of learned parameters.

[Figure: three panels comparing AIC, BIC, and UIC as a function of the model order $d$.]

Figure: Least-squares polynomial regression where $X = [-1, +1]$, $y = f(x) + \epsilon$, $f$ is a polynomial of degree $d^\star$ (marked by a dashed line), and $\epsilon$ is white Gaussian noise.
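A hedged sketch of UIC-style model selection for this polynomial-regression setting; it assumes (as the uniform-generalization framework does) a loss bounded in [0, 1], enforced below by crude clipping, and all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m, true_degree = 200, 3
x = rng.uniform(-1, 1, m)
y = np.polyval(rng.normal(size=true_degree + 1), x) + 0.1 * rng.normal(size=m)

def uic(d):
    """Empirical risk plus the sqrt(d / 2m) penalty, for a degree-d polynomial."""
    coeffs = np.polyfit(x, y, d)                          # d + 1 learned parameters
    loss = np.clip((y - np.polyval(coeffs, x)) ** 2, 0.0, 1.0)
    return loss.mean() + np.sqrt((d + 1) / (2 * m))

best = min(range(1, 11), key=uic)
print("Degree selected by UIC:", best)
```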

Page 27

Future Research Directions

Page 28

Adaptive Learning

A typical machine learning project involves multiple rounds of analysis (e.g. feature selection, normalization, cross validation, trying multiple algorithms, etc.). The chain rule provides a simple recipe for analyzing the generalization risk.
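Presumably the chain rule meant here is the chain rule of mutual information over the successive analysis rounds; this reading is mine and is not spelled out on the slide:

```latex
% Chain rule over analysis rounds h_1, h_2, ..., h_k performed on the same sample S:
I(S;\, h_1, h_2, \ldots, h_k) \;=\; \sum_{j=1}^{k} I\big(S;\, h_j \,\big|\, h_1, \ldots, h_{j-1}\big)
```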

Given the analysis conducted (e.g. in the form of code or a flowchart), can we design a procedure for estimating the generalization risk?

Page 29

Information Criterion

Information criteria (such as BIC) are sometimes used in the unsupervised learning setting for model selection (e.g. in k-means clustering).

Given that uniform generalization is developed in the general setting of learning, should the learning capacity serve as a model selection criterion in the unsupervised setting? Why or why not?

Page 30

References

Learning Capacity

Alabdulmohsin, “Algorithmic Stability and Uniform Generalization.” NIPS, 2015.

Alabdulmohsin, “An Information-Theoretic Route from Generalization in Expectation to Generalization in Probability.” AISTATS, 2017.

Alabdulmohsin, “Information-Theoretic Guarantees for ERM with Applications to Model Selection and Large-Scale Optimization.” ICML, 2018.

The Shannon Mutual Information and Adaptive Learning

D. Russo and J. Zou, “Controlling Bias in Adaptive Data Analysis Using Information Theory.” AISTATS, 2017.

Page 31

Generalization via Differential Privacy

C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth, “Preserving Statistical Validity in Adaptive Data Analysis.” STOC, 2015.

Generalization based on leave-one-out mutual information

M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu, “Information-Theoretic Analysis of Stability and Bias of Learning Algorithms.” ITW, 2016.

A. Xu and M. Raginsky, “Information-Theoretic Analysis of Generalization Capability of Learning Algorithms.” NIPS, 2017.