Scaling Learning Algorithms Towards AI
Authors: Yoshua Bengio, Yann LeCun
Presenter: Marilyn Vazquez
George Mason University
February 10, 2017
Outline
1 Curse of Dimensionality
2 Shallow Learning
3 Deep Learning
4 Results
5 Conclusion
Curse of Dimensionality
Curse of Dimensionality
The curse of dimensionality refers to the limitation on data analysis due to the large amount of data, or the large number of parameters, needed to analyze the data.
Curse of Dimensionality
Curse of Dimensionality: Example 1
Kernel density estimation: At a point $x_i \in \mathbb{R}^d$, the kernel density estimate $\hat{q}$ approximates the true density $q$ with high probability, i.e.

\[
E[\hat{q}(x_i)] = E\left[\frac{\sigma^{-d}}{N}\sum_{j=1}^{N} K_\sigma(x_i, x_j)\right]
= E\left[\frac{1}{N}\sum_{j=1}^{N} \frac{e^{-\|x_i - x_j\|^2 / 2\sigma^2}}{(2\pi\sigma^2)^{d/2}}\right]
\to q(x_i) + O\!\left(\sigma^2,\; N^{-1/2}\sigma^{-d/2}\sqrt{q(x)}\right)
\]

where the bias error, $\sigma^2$, is dominant with large data, and the variance error, $N^{-1/2}\sigma^{-d/2}$, blows up if we take $\sigma \to 0$.
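For intuition, here is a minimal NumPy sketch of this estimator (not code from the paper); the standard-normal data, sample size, and bandwidths are illustrative assumptions. A large $\sigma$ shows the bias, a tiny $\sigma$ the variance:

```python
import numpy as np

def kde(x, samples, sigma):
    """Gaussian KDE: (1/N) sum_j exp(-||x-x_j||^2/(2 sigma^2)) / (2 pi sigma^2)^(d/2)."""
    x = np.atleast_2d(x)                 # query points, shape (m, d)
    samples = np.atleast_2d(samples)     # data, shape (N, d)
    d = samples.shape[1]
    sq_dists = ((x[:, None, :] - samples[None, :, :]) ** 2).sum(-1)  # (m, N)
    norm = (2 * np.pi * sigma**2) ** (d / 2)
    return np.exp(-sq_dists / (2 * sigma**2)).mean(axis=1) / norm

rng = np.random.default_rng(0)
samples = rng.standard_normal((1000, 1))   # N = 1000 draws from q = N(0, 1)
true_q = 1 / np.sqrt(2 * np.pi)            # q(0) for the standard normal
for sigma in [1.0, 0.3, 0.01]:             # large sigma -> bias; tiny sigma -> variance
    print(sigma, kde(np.zeros((1, 1)), samples, sigma)[0], true_q)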
Curse of Dimensionality
Error
To find an optimal bandwidth σ, we can balance errors:
\[
\sigma^2 = c_1 N^{-1/2}\sigma^{-d/2}
\;\Longrightarrow\; \sigma^{(4+d)/2} = c_1 N^{-1/2}
\;\Longrightarrow\; \sigma = c_1' N^{-1/(4+d)}
\;\Longrightarrow\; \text{error} = c_2 N^{-2/(4+d)}
\]

So if for $d = 1$ we need $n_1$ points to achieve a fixed error $e_1$, then increasing the dimension to $d$, we need $n_1^{(4+d)/5}$ data points, i.e. exponential in the dimension!
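A two-line check of this scaling, as a sketch; the value of $n_1$ is hypothetical and the constants $c_1, c_2$ are dropped:

```python
# Illustration of the rate error ~ c2 * N^(-2/(4+d)): the sample size needed
# to hold a fixed error grows as n1^((4+d)/5) when moving from d = 1 to d.
n1 = 10_000  # points needed at d = 1 (hypothetical)
for d in [1, 2, 5, 10, 20]:
    print(d, int(n1 ** ((4 + d) / 5)))
```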
Curse of Dimensionality
Curse of Dimensionality: Example 2
Smooth function representation: A Gaussian kernel machine is a representation

\[
f(x) = b + \sum_{i=1}^{n} w_i K(x_i, x)
\]

where the $x_i$ are the base points, the $w_i$ are weights found through regression, and $K(x_i, x)$ is a Gaussian kernel.
Theorem
Let $f : \mathbb{R} \to \mathbb{R}$ be computed by a Gaussian kernel machine with $k$ base points ($k$ non-zero $w_i$'s). Then $f$ has at most $2k$ zeros.
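A quick numerical illustration of the theorem, as a sketch on synthetic data: we fit a Gaussian kernel machine with $k = 20$ base points to a target with many sign changes and count the zero crossings of the fitted $f$, which the theorem bounds by $2k$. The ridge-style regression is an illustrative choice, not the paper's procedure:

```python
import numpy as np

def gaussian_K(a, b, sigma=0.5):
    """Gaussian kernel matrix between 1-d point sets a and b."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / sigma**2)

rng = np.random.default_rng(0)
xi = np.sort(rng.uniform(-3, 3, 20))           # k = 20 base points
y = np.sign(np.sin(4 * np.pi * xi))            # target with many sign changes
K = gaussian_K(xi, xi)
w = np.linalg.solve(K + 1e-6 * np.eye(len(xi)), y)   # ridge-style fit, b = 0

grid = np.linspace(-3, 3, 5000)
f = gaussian_K(grid, xi) @ w
crossings = np.count_nonzero(np.diff(np.sign(f)))
print(f"base points: {len(xi)}, zero crossings: {crossings} (bound: {2 * len(xi)})")
```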
Curse of Dimensionality
Curse of Dimensionality: Example 2
Smooth Function Representation
Curse of Dimensionality
Curse of Dimensionality: Example 2
Recall the Gaussian kernel machine $f(x) = b + \sum_{i=1}^{n} w_i K(x_i, x)$ with base points $x_i$ and weights $w_i$.
Corollary
In $\mathbb{R}^d$, if the learning problem requires $f$ to change sign at least $2k$ times along some straight line, then the kernel machine must have at least $k$ base points ($k$ non-zero $w_i$'s).
Curse of Dimensionality
Curse of Dimensionality: Example 3
Local derivative: For a Gaussian kernel classifier, the normal of the tangent plane of the decision surface at $x$ is constrained to lie approximately in the span of the vectors $(x - x_i)$, where $\|x - x_i\|$ is small compared to $\sigma$ and the $x_i$ are in the training set.
Curse of Dimensionality
Local Derivative
Brief explanation: For

\[
f(x) = b + \sum_{i=1}^{n} w_i K(x, x_i) = b + \sum_{i=1}^{n} w_i e^{-\|x - x_i\|^2/\sigma^2}
\]

we get

\[
\frac{\partial f(x)}{\partial x} = -\sum_{i=1}^{n} \frac{2(x - x_i)\, w_i}{\sigma^2}\, e^{-\|x - x_i\|^2/\sigma^2}.
\]

Note that the dominant terms are those for which $x_i$ is a near neighbor of $x$, so we approximately get

\[
\frac{\partial f(x)}{\partial x} \approx -\sum_{i=1}^{m} w_i'\, e^{-\|x - x_i\|^2/\sigma^2},
\quad \text{where } w_i' = \frac{2(x - x_i)\, w_i}{\sigma^2} \text{ and } m \le n.
\]
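The gradient formula above transcribes directly into NumPy; the synthetic training points and weights below are stand-ins for a trained machine:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, sigma = 5, 50, 1.0
xi = rng.standard_normal((n, d))   # training points
w = rng.standard_normal(n)         # weights (would come from training)

def grad_f(x):
    diffs = x - xi                                 # the n vectors (x - x_i)
    k = np.exp(-(diffs**2).sum(1) / sigma**2)      # kernel values
    return -(2 / sigma**2) * (w * k) @ diffs       # -sum_i 2(x-x_i) w_i K / sigma^2

x = rng.standard_normal(d)
g = grad_f(x)
# g is a linear combination of the rows of (x - xi), dominated by near neighbors,
# so it lies (approximately) in their span.
print(g)
```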
Curse of Dimensionality
Local Derivative Example
Shallow and Deep Learning
Getting Around the Curse of Dimensionality
There is no universal solution to the curse of dimensionality; however, for particular purposes you can make assumptions that may help.
"We hypothesize that many tasks in the AI set may be built around common representations, which can be understood as a set of interrelated concepts"
Translation: use the mathematical idea of composition of functions to build a complicated function from simple parts (a common representation), such as Gaussian kernels or other basis functions.
Shallow and Deep Learning
Shallow Learning
\[
f(x) = b + \sum_{i=1}^{N} w_i \phi_i(x)
\]

where the $w_i$ result from training, and the basis functions $\phi_i$ can either be fixed in advance, such as $K(x_i, x)$, or themselves be a result of the training.
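A minimal sketch of such a shallow learner with a fixed basis, assuming Gaussian bumps on a grid and a least-squares fit (both illustrative choices, not the paper's setup):

```python
import numpy as np

def phi(x, centers, sigma=0.3):
    """Fixed Gaussian basis evaluated at x; the column of ones carries the bias b."""
    feats = np.exp(-(x[:, None] - centers[None, :]) ** 2 / sigma**2)
    return np.hstack([np.ones((len(x), 1)), feats])

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)
centers = np.linspace(0, 1, 15)                   # N = 15 fixed basis functions
Phi = phi(x, centers)
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # coef = [b, w_1, ..., w_N]
print(np.abs(Phi @ coef - y).mean())              # mean training error
```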
Shallow and Deep Learning
Shallow Learning
Shallow and Deep Learning
Deep Learning as Neural Network
Shallow and Deep Learning
Deep Learning as Function Composition
Let $f_{j,k}$ represent the $j$-th feature in the $k$-th layer:

\[
f_{j,1}(x) = b_{j,1} + \sum_i w_{ij1} K(x_{ij1}, x)
\]
\[
f_{j,2}(x) = b_{j,2} + \sum_i w_{ij2} K\big(x_{ij2}, f_{j,1}(x)\big)
\]
\[
\vdots
\]

They conjecture that by allowing these compositions, we will need fewer parameters to fit compared to a shallow representation.
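A toy sketch of the composition idea; random weights stand in for trained parameters, and a $\tanh$ unit replaces the kernel $K$ purely for brevity:

```python
import numpy as np

def layer(h, W, b):
    """One composed layer: each feature is a simple unit of the previous features."""
    return np.tanh(h @ W + b)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10))        # batch of 4 inputs in R^10
h = x
for sizes in [(10, 8), (8, 8), (8, 3)]: # three composed layers
    W = rng.standard_normal(sizes) * 0.5
    b = np.zeros(sizes[1])
    h = layer(h, W, b)                  # f_{.,k} built from f_{.,k-1}
print(h.shape)                          # (4, 3): a composed, deep representation
```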
Shallow and Deep Learning
Deep Learning Steps
Step 1: Initialize via unsupervised learning, with feedback that helps reconstruct the input from the output.
Step 2: Refine via gradient-descent supervised learning.
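A toy, linear version of the two steps, under strong simplifying assumptions (one hidden layer, squared losses, synthetic data); real systems stack many layers and use richer objectives:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 20))
y = (X[:, 0] > 0).astype(float)
W1 = 0.1 * rng.standard_normal((20, 5))   # encoder
W2 = 0.1 * rng.standard_normal((5, 20))   # decoder (used only in step 1)
lr, N = 0.01, len(X)

# Step 1: unsupervised pretraining, so that X @ W1 @ W2 reconstructs X
for _ in range(500):
    E = X @ W1 @ W2 - X                   # reconstruction error
    W2 -= lr * 2 / N * (X @ W1).T @ E
    W1 -= lr * 2 / N * X.T @ (E @ W2.T)

# Step 2: supervised fine-tuning of W1 plus a linear readout v
v = np.zeros(5)
for _ in range(500):
    H = X @ W1
    err = H @ v - y
    v  -= lr * 2 / N * H.T @ err
    W1 -= lr * 2 / N * X.T @ np.outer(err, v)
print(((H @ v > 0.5) == (y > 0.5)).mean())  # training accuracy of the toy model
```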
Shallow and Deep Learning
Deep Learning
\[
c_{i,j,x,y} = \tanh\left(b_{i,j} + \sum_{k} \sum_{p=0}^{P_i - 1} \sum_{q=0}^{Q_i - 1} w_{i,j,k,p,q}\, c_{i-1,k,x+p,y+q}\right)
\]

where $c_{i,j,x,y}$ is the value at position $(x, y)$ of the $j$-th feature map in layer $i$, computed by convolving the layer-$(i-1)$ feature maps with $P_i \times Q_i$ filters $w$ and passing the result through a $\tanh$ non-linearity.
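This formula transcribes almost line-for-line into NumPy; the looped version below favors clarity over speed, and all shapes are illustrative:

```python
import numpy as np

def conv_layer(prev, w, b):
    """prev: (K, H, W) feature maps; w: (J, K, P, Q) filters; b: (J,) biases."""
    J, K, P, Q = w.shape
    H, W = prev.shape[1] - P + 1, prev.shape[2] - Q + 1
    out = np.empty((J, H, W))
    for j in range(J):
        for x in range(H):
            for y in range(W):
                patch = prev[:, x:x + P, y:y + Q]      # c_{(i-1),k,(x+p),(y+q)}
                out[j, x, y] = np.tanh(b[j] + (w[j] * patch).sum())
    return out

rng = np.random.default_rng(0)
maps = rng.standard_normal((3, 8, 8))       # 3 input feature maps
w = rng.standard_normal((4, 3, 3, 3)) * 0.1 # 4 output maps, 3x3 filters
b = np.zeros(4)
print(conv_layer(maps, w, b).shape)         # (4, 6, 6)
```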
Shallow and Deep Learning
Deep Learning
Results
Results
Results
Sample Data
Results
Results
Conclusion
Summary
The curse of dimensionality can limit the amount of data that can be analyzed.
We cannot completely get rid of the curse of dimensionality, but we can get around it if we make some assumptions.
Shallow learning assumes that we can represent functions with smooth functions such as Gaussian kernels.
Deep learning assumes that complicated functions can be built by composing simple functions such as Gaussian kernels.
Deep learning is composed of several two-layer stages, a feature-detection layer followed by a feature-pooling layer, with a non-linear transformation applied in each stage.
The authors show successful results in image classification.
Conclusion
References
Yoshua Bengio and Yann LeCun, Scaling Learning Algorithms towards AI, in Large-Scale Kernel Machines, 2007.
Leslie Lamport, Deep Learning and Convolutional Neural Networks, RSIP Vision Blog, http://www.rsipvision.com/exploring-deep-learning/.
Jianxin Wu, Introduction to Convolutional Neural Networks, National Key Lab for Novel Software Technology, Nanjing University, China, 2016.