Scaling Learning Algorithms Towards AI
Authors: Yoshua Bengio, Yann LeCun
Presenter: Marilyn Vazquez
George Mason University
February 10, 2017
Outline
1 Curse of Dimensionality
2 Shallow Learning
3 Deep Learning
4 Results
5 Conclusion
Curse of Dimensionality
Curse of Dimensionality
The curse of dimensionality refers to the limitation on data analysis due to the large amount of data, or the large number of parameters, needed to analyze the data.
Curse of Dimensionality
Curse of Dimensionality: Example 1
Kernel density estimation: At a point $x_i \in \mathbb{R}^d$, the kernel density estimate $\hat{q}$ approximates the true density $q$ with high probability, i.e.

\[
E[\hat{q}(x_i)] = E\left[\frac{\sigma^{-d}}{N}\sum_{j=1}^{N} K_\sigma(x_i, x_j)\right]
= E\left[\frac{1}{N}\sum_{j=1}^{N} \frac{e^{-\|x_i - x_j\|^2 / 2\sigma^2}}{(2\pi\sigma^2)^{d/2}}\right]
\to q(x_i) + O\!\left(\sigma^2,\; N^{-1/2}\sigma^{-d/2}\sqrt{q(x)}\right)
\]

where the bias error, $\sigma^2$, is dominant with large data, and the variance error, $N^{-1/2}\sigma^{-d/2}$, blows up if we take $\sigma \to 0$.
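For intuition, here is a minimal NumPy sketch of this estimator (not code from the paper); the standard-normal data, sample size, and bandwidths are illustrative assumptions. A large $\sigma$ shows the bias, a tiny $\sigma$ the variance:

```python
import numpy as np

def kde(x, samples, sigma):
    """Gaussian KDE: (1/N) sum_j exp(-||x-x_j||^2/(2 sigma^2)) / (2 pi sigma^2)^(d/2)."""
    x = np.atleast_2d(x)                 # query points, shape (m, d)
    samples = np.atleast_2d(samples)     # data, shape (N, d)
    d = samples.shape[1]
    sq_dists = ((x[:, None, :] - samples[None, :, :]) ** 2).sum(-1)  # (m, N)
    norm = (2 * np.pi * sigma**2) ** (d / 2)
    return np.exp(-sq_dists / (2 * sigma**2)).mean(axis=1) / norm

rng = np.random.default_rng(0)
samples = rng.standard_normal((1000, 1))   # N = 1000 draws from q = N(0, 1)
true_q = 1 / np.sqrt(2 * np.pi)            # q(0) for the standard normal
for sigma in [1.0, 0.3, 0.01]:             # large sigma -> bias; tiny sigma -> variance
    print(sigma, kde(np.zeros((1, 1)), samples, sigma)[0], true_q)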
Curse of Dimensionality
Error
To find an optimal bandwidth σ, we can balance errors:
\[
\sigma^2 = c_1 N^{-1/2}\sigma^{-d/2}
\;\Longrightarrow\; \sigma^{(4+d)/2} = c_1 N^{-1/2}
\;\Longrightarrow\; \sigma = c_1' N^{-1/(4+d)}
\;\Longrightarrow\; \text{error} = c_2 N^{-2/(4+d)}
\]

So if for $d = 1$ we need $n_1$ points to achieve a fixed error $e_1$, then increasing the dimension to $d$, we need $n_1^{(4+d)/5}$ data points, i.e. exponential in the dimension!
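A two-line check of this scaling, as a sketch; the value of $n_1$ is hypothetical and the constants $c_1, c_2$ are dropped:

```python
# Illustration of the rate error ~ c2 * N^(-2/(4+d)): the sample size needed
# to hold a fixed error grows as n1^((4+d)/5) when moving from d = 1 to d.
n1 = 10_000  # points needed at d = 1 (hypothetical)
for d in [1, 2, 5, 10, 20]:
    print(d, int(n1 ** ((4 + d) / 5)))
```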
Curse of Dimensionality
Curse of Dimensionality: Example 2
Smooth function representation: A Gaussian kernel machine is a representation

\[
f(x) = b + \sum_{i=1}^{n} w_i K(x_i, x)
\]

where the $x_i$ are the base points, the $w_i$ are weights found through regression, and $K(x_i, x)$ is a Gaussian kernel.
Theorem
Let $f : \mathbb{R} \to \mathbb{R}$ be computed by a Gaussian kernel machine with $k$ base points ($k$ non-zero $w_i$'s). Then $f$ has at most $2k$ zeros.
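A quick numerical illustration of the theorem, as a sketch on synthetic data: we fit a Gaussian kernel machine with $k = 20$ base points to a target with many sign changes and count the zero crossings of the fitted $f$, which the theorem bounds by $2k$. The ridge-style regression is an illustrative choice, not the paper's procedure:

```python
import numpy as np

def gaussian_K(a, b, sigma=0.5):
    """Gaussian kernel matrix between 1-d point sets a and b."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / sigma**2)

rng = np.random.default_rng(0)
xi = np.sort(rng.uniform(-3, 3, 20))           # k = 20 base points
y = np.sign(np.sin(4 * np.pi * xi))            # target with many sign changes
K = gaussian_K(xi, xi)
w = np.linalg.solve(K + 1e-6 * np.eye(len(xi)), y)   # ridge-style fit, b = 0

grid = np.linspace(-3, 3, 5000)
f = gaussian_K(grid, xi) @ w
crossings = np.count_nonzero(np.diff(np.sign(f)))
print(f"base points: {len(xi)}, zero crossings: {crossings} (bound: {2 * len(xi)})")
```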
Curse of Dimensionality
Curse of Dimensionality: Example 2
Smooth Function Representation
Curse of Dimensionality
Curse of Dimensionality: Example 2
Recall the Gaussian kernel machine $f(x) = b + \sum_{i=1}^{n} w_i K(x_i, x)$ with base points $x_i$ and weights $w_i$.
Corollary
In $\mathbb{R}^d$, if the learning problem requires $f$ to change sign at least $2k$ times along some straight line, then the kernel machine must have at least $k$ base points ($k$ non-zero $w_i$'s).
Curse of Dimensionality
Curse of Dimensionality: Example 3
Local derivative: For a Gaussian kernel classifier, the normal of the tangent plane of the decision surface at $x$ is constrained to lie approximately in the span of the vectors $(x - x_i)$, where $\|x - x_i\|$ is small compared to $\sigma$ and the $x_i$ are in the training set.
Curse of Dimensionality
Local Derivative
Brief explanation: For

\[
f(x) = b + \sum_{i=1}^{n} w_i K(x, x_i) = b + \sum_{i=1}^{n} w_i e^{-\|x - x_i\|^2/\sigma^2}
\]

we get

\[
\frac{\partial f(x)}{\partial x} = -\sum_{i=1}^{n} \frac{2(x - x_i)\, w_i}{\sigma^2}\, e^{-\|x - x_i\|^2/\sigma^2}.
\]

Note that the dominant terms are those for which $x_i$ is a near neighbor of $x$, so we approximately get

\[
\frac{\partial f(x)}{\partial x} \approx -\sum_{i=1}^{m} w_i'\, e^{-\|x - x_i\|^2/\sigma^2},
\quad \text{where } w_i' = \frac{2(x - x_i)\, w_i}{\sigma^2} \text{ and } m \le n.
\]
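The gradient formula above transcribes directly into NumPy; the synthetic training points and weights below are stand-ins for a trained machine:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, sigma = 5, 50, 1.0
xi = rng.standard_normal((n, d))   # training points
w = rng.standard_normal(n)         # weights (would come from training)

def grad_f(x):
    diffs = x - xi                                 # the n vectors (x - x_i)
    k = np.exp(-(diffs**2).sum(1) / sigma**2)      # kernel values
    return -(2 / sigma**2) * (w * k) @ diffs       # -sum_i 2(x-x_i) w_i K / sigma^2

x = rng.standard_normal(d)
g = grad_f(x)
# g is a linear combination of the rows of (x - xi), dominated by near neighbors,
# so it lies (approximately) in their span.
print(g)
```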
Curse of Dimensionality
Local Derivative Example
Shallow and Deep Learning
Getting Around the Curse of Dimensionality
There is no universal solution to the curse of dimensionality; however, for particular purposes you can make assumptions that may help.
"We hypothesize that many tasks in the AI set may be built around common representations, which can be understood as a set of interrelated concepts"
Translation: use the mathematical idea of composition of functions to build a complicated function from simple parts (a common representation), such as Gaussian kernels or other basis functions.
Shallow and Deep Learning
Shallow Learning
\[
f(x) = b + \sum_{i=1}^{N} w_i \phi_i(x)
\]

where the $w_i$ result from training, and the basis functions $\phi_i$ can either be fixed in advance, such as $K(x_i, x)$, or themselves be a result of the training.
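A minimal sketch of such a shallow learner with a fixed basis, assuming Gaussian bumps on a grid and a least-squares fit (both illustrative choices, not the paper's setup):

```python
import numpy as np

def phi(x, centers, sigma=0.3):
    """Fixed Gaussian basis evaluated at x; the column of ones carries the bias b."""
    feats = np.exp(-(x[:, None] - centers[None, :]) ** 2 / sigma**2)
    return np.hstack([np.ones((len(x), 1)), feats])

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)
centers = np.linspace(0, 1, 15)                   # N = 15 fixed basis functions
Phi = phi(x, centers)
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # coef = [b, w_1, ..., w_N]
print(np.abs(Phi @ coef - y).mean())              # mean training error
```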
Shallow and Deep Learning
Shallow Learning
Shallow and Deep Learning
Deep Learning as Neural Network
Shallow and Deep Learning
Deep Learning as Function Composition
Let $f_{j,k}$ represent the $j$-th feature in the $k$-th layer:

\[
f_{j,1}(x) = b_{j,1} + \sum_i w_{ij1} K(x_{ij1}, x)
\]
\[
f_{j,2}(x) = b_{j,2} + \sum_i w_{ij2} K\big(x_{ij2}, f_{j,1}(x)\big)
\]
\[
\vdots
\]

They conjecture that by allowing these compositions, we will need fewer parameters to fit compared to a shallow representation.
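A toy sketch of the composition idea; random weights stand in for trained parameters, and a $\tanh$ unit replaces the kernel $K$ purely for brevity:

```python
import numpy as np

def layer(h, W, b):
    """One composed layer: each feature is a simple unit of the previous features."""
    return np.tanh(h @ W + b)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10))        # batch of 4 inputs in R^10
h = x
for sizes in [(10, 8), (8, 8), (8, 3)]: # three composed layers
    W = rng.standard_normal(sizes) * 0.5
    b = np.zeros(sizes[1])
    h = layer(h, W, b)                  # f_{.,k} built from f_{.,k-1}
print(h.shape)                          # (4, 3): a composed, deep representation
```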
Shallow and Deep Learning
Deep Learning Steps
Step 1: Initialize via unsupervised learning, with feedback that helps reconstruct the input from the output.
Step 2: Refine via gradient-descent supervised learning.
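A toy, linear version of the two steps, under strong simplifying assumptions (one hidden layer, squared losses, synthetic data); real systems stack many layers and use richer objectives:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 20))
y = (X[:, 0] > 0).astype(float)
W1 = 0.1 * rng.standard_normal((20, 5))   # encoder
W2 = 0.1 * rng.standard_normal((5, 20))   # decoder (used only in step 1)
lr, N = 0.01, len(X)

# Step 1: unsupervised pretraining, so that X @ W1 @ W2 reconstructs X
for _ in range(500):
    E = X @ W1 @ W2 - X                   # reconstruction error
    W2 -= lr * 2 / N * (X @ W1).T @ E
    W1 -= lr * 2 / N * X.T @ (E @ W2.T)

# Step 2: supervised fine-tuning of W1 plus a linear readout v
v = np.zeros(5)
for _ in range(500):
    H = X @ W1
    err = H @ v - y
    v  -= lr * 2 / N * H.T @ err
    W1 -= lr * 2 / N * X.T @ np.outer(err, v)
print(((H @ v > 0.5) == (y > 0.5)).mean())  # training accuracy of the toy model
```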
Shallow and Deep Learning
Deep Learning
\[
c_{i,j,x,y} = \tanh\left(b_{i,j} + \sum_{k} \sum_{p=0}^{P_i - 1} \sum_{q=0}^{Q_i - 1} w_{i,j,k,p,q}\, c_{i-1,k,x+p,y+q}\right)
\]

where $c_{i,j,x,y}$ is the value at position $(x, y)$ of the $j$-th feature map in layer $i$, computed by convolving the layer-$(i-1)$ feature maps with $P_i \times Q_i$ filters $w$ and passing the result through a $\tanh$ non-linearity.
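This formula transcribes almost line-for-line into NumPy; the looped version below favors clarity over speed, and all shapes are illustrative:

```python
import numpy as np

def conv_layer(prev, w, b):
    """prev: (K, H, W) feature maps; w: (J, K, P, Q) filters; b: (J,) biases."""
    J, K, P, Q = w.shape
    H, W = prev.shape[1] - P + 1, prev.shape[2] - Q + 1
    out = np.empty((J, H, W))
    for j in range(J):
        for x in range(H):
            for y in range(W):
                patch = prev[:, x:x + P, y:y + Q]      # c_{(i-1),k,(x+p),(y+q)}
                out[j, x, y] = np.tanh(b[j] + (w[j] * patch).sum())
    return out

rng = np.random.default_rng(0)
maps = rng.standard_normal((3, 8, 8))       # 3 input feature maps
w = rng.standard_normal((4, 3, 3, 3)) * 0.1 # 4 output maps, 3x3 filters
b = np.zeros(4)
print(conv_layer(maps, w, b).shape)         # (4, 6, 6)
```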
Shallow and Deep Learning
Deep Learning
Results
Results
Results
Sample Data
Results
Results
Conclusion
Summary
The curse of dimensionality can limit the amount of data that can be analyzed.
We cannot completely get rid of the curse of dimensionality, but we can get around it if we make some assumptions.
Shallow learning assumes that we can represent functions with smooth functions such as Gaussian kernels.
Deep learning assumes that complicated functions can be built by composing simple functions such as Gaussian kernels.
Deep learning is composed of several two-layer stages, a feature-detection layer followed by a feature-pooling layer, with a non-linear transformation applied in each stage.
The authors show successful results in image classification.
Conclusion
References
Yoshua Bengio and Yann LeCun, Scaling Learning Algorithms towards AI, in Large-Scale Kernel Machines, 2007.
Leslie Lamport, Deep Learning and Convolutional Neural Networks, RSIP Vision Blog, http://www.rsipvision.com/exploring-deep-learning/.
Jianxin Wu, Introduction to Convolutional Neural Networks, National Key Lab for Novel Software Technology, Nanjing University, China, 2016.