GANNs: A New Theory of Representation for Nonlinear Bounded Operators
William Guss
Machine Learning at Berkeley
April 22, 2016
Introduction: What’s up with continuous data?
All of the data we deal with is discrete, thanks to Turing.
But most of it models a continuous process.
Examples:
Audio: we take > 100k samples of something we could describe with f : R → R! Trick question: which is easier to use, (a) v ∈ R^100000 or (b) f?
Images: we take 100k × 100k samples of something we could describe with f : R² → R.
Why do we use discrete data? No computer known can really store f. End of story.
Introduction: Abusing continuity
f can’t be that bad. Can it?
If f is smooth it’s easy to draw:
[Figure: plot of a smooth function (roughly x ↦ x²) over −10 ≤ x ≤ 10, with values from 0 to 100.]
I can even name f most of the time: f : x ↦ x², or even super precisely g : x ↦ $\sum_{n} a_n x^n$.
Moral: smooth functions are mostly very manageable.
Introduction: Abusing continuity
So why do we do this:
To classify this:
The Core Idea: Let neural networks abuse continuity and smoothness.
Artificial Neural Networks
Definition
We say N : R^n → R^m is a feed-forward neural network if, for an input vector x,
$$\mathcal{N}:\quad \sigma^{(l+1)}_j = g\Big(\sum_{i \in Z^{(l)}} w^{(l)}_{ij}\,\sigma^{(l)}_i + \beta^{(l)}\Big), \qquad \sigma^{(0)}_i = x_i, \tag{1}$$
where 1 ≤ l ≤ L − 1. Furthermore, we say {N} is the set of all neural networks.
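As a concrete reading of (1), here is a minimal sketch of the feed-forward pass in Python; the NumPy implementation, layer sizes, and random weights are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def sigmoid(x):
    # The squashing nonlinearity g (the sigmoid of Eq. (4)).
    return 1.0 / (1.0 + np.exp(-x))

def feed_forward(x, weights, biases, g=sigmoid):
    """Compute sigma^(L) for a discrete network as in Eq. (1).

    weights[l] has shape (n_{l+1}, n_l); biases[l] has shape (n_{l+1},).
    """
    sigma = x                      # sigma^(0) = x
    for W, b in zip(weights, biases):
        sigma = g(W @ sigma + b)   # sigma^(l+1) = g(W sigma^(l) + beta^(l))
    return sigma

# Hypothetical 3-layer example: R^4 -> R^3 -> R^2.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
biases = [np.zeros(3), np.zeros(2)]
print(feed_forward(rng.normal(size=4), weights, biases))
```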
Operator Neural Networks
Let’s get rid of R^100000 and use f.
Definition
We call O : L^p(X) → L^1(Y) an operator neural network if
$$\mathcal{O}:\quad \sigma^{(l+1)}(j) = g\left(\int_{R^{(l)}} \sigma^{(l)}(i)\, w^{(l)}(i, j)\, di\right), \qquad \sigma^{(0)}(j) = f(j). \tag{2}$$
Furthermore, let {O} denote the set of all operator neural networks.
Well, that was easy. In fact {O} ⊃ {N}. These definitions look really similar; is there some more general category or structure containing them?
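To make (2) concrete, a rough numerical sketch: the operator layer can be approximated on a grid with a quadrature rule. The grid, the Gaussian kernel w(i, j), and the trapezoidal rule below are illustrative assumptions.

```python
import numpy as np

def operator_layer(f, w, grid, g=lambda x: 1.0 / (1.0 + np.exp(-x))):
    """Approximate sigma^(1)(j) = g( int sigma^(0)(i) w(i, j) di ) of Eq. (2) on a grid.

    f is the input function sigma^(0); w is the weight kernel w(i, j).
    """
    I, J = np.meshgrid(grid, grid, indexing="ij")   # i varies down the rows
    integrand = f(I) * w(I, J)                      # sigma^(0)(i) * w(i, j)
    return g(np.trapz(integrand, grid, axis=0))     # quadrature over i, then g

# Hypothetical example: input f(i) = sin(i), Gaussian weight kernel centred at j.
grid = np.linspace(-5.0, 5.0, 501)
out = operator_layer(np.sin, lambda i, j: np.exp(-(i - j) ** 2), grid)
print(out.shape)  # one output value per grid point j
```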
Generalized Artificial Neural Networks
Definition
If A, B are (possibly distinct) Banach spaces over a field F, we say G : A → B is a generalized neural network if and only if
$$\mathcal{G}:\quad \sigma^{(l+1)} = g\left(T_l\big[\sigma^{(l)}\big] + \beta^{(l)}\right), \qquad \sigma^{(0)} = \xi, \tag{3}$$
for some input ξ ∈ A and a linear form T_l.
Claim: “Neural networks” are powerful because they can move bumps anywhere! How? T_l is a linear form: it can move σ^{(l)} anywhere, and g is a bump of some sort.
Moving bumps around
The sigmoid function
$$g(x) = \frac{1}{1 + e^{-x}} \tag{4}$$
is a bump that we can move around with weights!
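A quick illustrative plot (the parameter choices are arbitrary assumptions): a weight w rescales and sharpens the sigmoid, a bias β translates it, and the difference of two shifted sigmoids gives a localized bump.

```python
import numpy as np
import matplotlib.pyplot as plt

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.linspace(-10, 10, 400)

# A weight w rescales/sharpens the sigmoid and a bias b translates it: g(w*x + b).
for w, b in [(1.0, 0.0), (1.0, -5.0), (4.0, 8.0)]:
    plt.plot(x, sigmoid(w * x + b), label=f"g({w}x + {b})")

# The difference of two shifted sigmoids is a localized bump.
plt.plot(x, sigmoid(4 * (x + 1)) - sigmoid(4 * (x - 1)), "--", label="bump")
plt.legend()
plt.show()
```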
T_l as the layer type.
Definition
We suggest several classes of T_l as follows.
T_l is said to be o operational if and only if
$$T_l = o : L^p(R^{(l)}) \to L^1(R^{(l+1)}), \qquad \sigma \mapsto \int_{R^{(l)}} \sigma(i)\, w^{(l)}(i, j)\, di. \tag{5}$$
T_l is said to be n discrete if and only if
$$T_l = n : \mathbb{R}^n \to \mathbb{R}^m, \qquad \vec{\sigma} \mapsto \sum_{j}^{m} \vec{e}_j \sum_{i}^{n} \sigma_i w^{(l)}_{ij}, \tag{6}$$
where $\vec{e}_j$ denotes the jth basis vector in R^m.
T_l as the layer type.
Definition
T_l is said to be n₁ transitional if and only if
$$T_l = n_1 : \mathbb{R}^n \to L^q(R^{(l+1)}), \qquad \vec{\sigma} \mapsto \sum_{i}^{n} \sigma_i w^{(l)}_i(j). \tag{7}$$
T_l is said to be n₂ transitional if and only if
$$T_l = n_2 : L^p(R^{(l)}) \to \mathbb{R}^m, \qquad \sigma(i) \mapsto \sum_{j}^{m} \vec{e}_j \int_{R^{(l)}} \sigma(i)\, w^{(l)}_j(i)\, di. \tag{8}$$
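Read numerically, the n₂ transitional layer of (8) just takes m quadratures of the input function against m weight functions. A minimal sketch; the grid and the bump-shaped weight functions are illustrative assumptions.

```python
import numpy as np

def n2_transitional(sigma, weight_fns, grid):
    """Approximate the n2 transitional map of Eq. (8):
    a function sigma on a grid -> a vector in R^m, one quadrature per w_j."""
    values = sigma(grid)
    return np.array([np.trapz(values * w(grid), grid) for w in weight_fns])

# Hypothetical example: project sigma = cos onto two bump-shaped weight functions.
grid = np.linspace(0.0, 1.0, 201)
weight_fns = [lambda i: np.exp(-50 * (i - 0.25) ** 2),
              lambda i: np.exp(-50 * (i - 0.75) ** 2)]
print(n2_transitional(np.cos, weight_fns, grid))  # a vector in R^2
```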
Neural networks as diagrams!
This generalization is nice from a creative standpoint: I can come up with new sorts of “classifiers” on the fly. Examples:
A three layer neural network is just
$$\mathcal{N}_3 : \mathbb{R}^{10000} \xrightarrow{g \circ n} \mathbb{R}^{30} \xrightarrow{g \circ n} \mathbb{R}^{3}. \tag{9}$$
A three layer operator network is simply
$$\mathcal{O}_3 : L^p(\mathbb{R}) \xrightarrow{g \circ o} L^1(\mathbb{R}) \xrightarrow{g \circ o} C(\mathbb{R}). \tag{10}$$
We can even classify functions!
$$\mathcal{C} : L^p(\mathbb{R}) \xrightarrow{g \circ o} L^1(\mathbb{R}) \xrightarrow{g \circ o} \cdots \xrightarrow{g \circ o} L^1(\mathbb{R}) \xrightarrow{g \circ n_2} \mathbb{R}^n. \tag{11}$$
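Read as a recipe, diagram (11) is just composition of the layer types defined above. Below is a minimal sketch of a tiny function classifier L^p(R) → R², built from one g∘o layer followed by a g∘n₂ layer; the kernel, weight functions, and grid are all illustrative assumptions.

```python
import numpy as np

g = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigmoid nonlinearity
grid = np.linspace(-5.0, 5.0, 401)       # discretization of the layer domain

def o_layer(f_vals, w):
    # operational layer, Eq. (5): sigma(j) = int sigma(i) w(i, j) di
    I, J = np.meshgrid(grid, grid, indexing="ij")
    return np.trapz(f_vals[:, None] * w(I, J), grid, axis=0)

def n2_layer(sigma_vals, weight_fns):
    # n2 transitional layer, Eq. (8): function -> vector in R^m
    return np.array([np.trapz(sigma_vals * w(grid), grid) for w in weight_fns])

def classify(f):
    # C : L^p(R) --g.o--> L^1(R) --g.n2--> R^2, a two-step instance of Eq. (11)
    sigma1 = g(o_layer(f(grid), lambda i, j: np.exp(-(i - j) ** 2)))
    return g(n2_layer(sigma1, [np.sin, np.cos]))

print(classify(np.tanh))  # a "class score" vector in R^2
```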
Results: Did abusing continuity help?
Every operational layer o has a polynomial weight function
$$w^{(l)}(i, j) = \sum_{b}^{Z^{(l)}_Y} \sum_{a}^{Z^{(l)}_X} k^{(l)}_{a,b}\, i^a j^b. \tag{12}$$
Theorem
Let C be a GANN with only one n₂ transitional layer whose weight polynomial is O(1). If a continuous function f(t) is sampled uniformly from t = 0 to t = N, so that x_n = f(n), and C is given the piecewise linear input function
$$\xi = (x_{n+1} - x_n)(z - n) + x_n \tag{13}$$
for n ≤ z < n + 1, then there exists some discrete neural network N with O(N²) weights such that C(ξ) = N(x).
Results: Did abusing continuity help?
WHAT!?!? How did C reduce the number of weights from O(N²) to O(1)?
The infinite-dimensional versions of N, in particular O and C, are invariant to input quality: the weight polynomial stays the same no matter how finely f is sampled. This takes the idea behind ConvNets to an extreme!
This is easy to see.
Results: Representation Theory
How good are continuous classifier networks, {C}, as algorithms?
Theorem
Let X be a compact Hausdorff space. For every ε > 0 and every continuous bounded functional f on L^q(X), there exists a two layer continuous classifier
$$\mathcal{C} : L^q(X) \xrightarrow{g \circ n_2} \mathbb{R}^m \xrightarrow{n} \mathbb{R}^n \tag{14}$$
such that
$$\|f - \mathcal{C}\| < \epsilon. \tag{15}$$
Results: Representation Theory
How good are operator networks and GANNs as algorithms? They should be able to approximate the important operators, e.g. the Fourier transform, the Laplace transform, differentiation, etc.
Theorem
Given an operator neural network O and some layer l ∈ O, let K : C(R^{(l)}) → C(R^{(l)}) be a bounded linear operator. If we denote the operation of layer l on layer l − 1 as σ^{(l+1)} = g(Σ_{l+1}σ^{(l)}), then for every ε > 0 there exists a weight polynomial w^{(l)}(i, j) such that, in the supremum norm over R^{(l)},
$$\big\| K\sigma^{(l)} - \Sigma_{l+1}\sigma^{(l)} \big\|_\infty < \epsilon. \tag{16}$$
Proof.
See paper. Nice!
Results: Stronger Representation Theory
We want to show the following stronger theorem.
Theorem
Given an operator neural network O and some layer l ∈ O, let K : C(R^{(l)}) → C(R^{(l)}) be a bounded continuous operator. If we denote the operation of layer l on layer l − 1 as σ^{(l+1)} = g(Σ_{l+1}σ^{(l)}), then for every ε > 0 there exists a weight polynomial w^{(l)}(i, j) such that, in the supremum norm over R^{(l)},
$$\big\| K\sigma^{(l)} - \Sigma_{l+1}\sigma^{(l)} \big\|_\infty < \epsilon. \tag{17}$$
But how? Dirac spikes!
Results: Stronger Representation Theory
[Commutative diagram: the operator K (to be matched by O) sends ξ ∈ L^p(R) to f ∈ L^1(R); evaluation at a point j ∈ R then gives f(j), so the composite is the functional K_j, which will be approximated by a classifier C_j.]
Proof.
Fix ε > 0. Given K : ξ ↦ f, let K_j : ξ ↦ f(j) be a functional on L^q.
We can find a $\mathcal{C}_j : L^q(\mathbb{R}) \xrightarrow{g \circ n_2} \mathbb{R}^{m(j)} \xrightarrow{n} \mathbb{R}^1$ so that for all ξ,
$$|\mathcal{C}_j(\xi) - K_j(\xi)| = |\mathcal{C}_j(\xi) - f(j)| < \epsilon/2.$$
We know that
$$\mathcal{C}_j(\xi) = \sum_{k=1}^{m(j)} a_{jk}\, g\!\left(\int_{\mathbb{R}} \xi(i)\, w_{kj}(i)\, d\mu(i)\right).$$
We wish to turn C_j into a two layer O. Let
$$w^{(0)}(i, \ell) = \begin{cases} w_{kj}(i), & \text{if } \ell = j + k,\ k \in \{1, \dots, m(j)\}, \\ 0, & \text{otherwise.} \end{cases}$$
Then
$$\mathcal{C}_j(\xi) = \sum_{k=1}^{m} a_{jk}\, g \circ o[\xi](k + j).$$
How do we turn this finite sum into an integral? Dirac time!
We define a Dirac spike as follows for every n:
$$\delta_{nkj}(\ell) = c_n \exp\!\big(-b n^2 |\ell - (j + k)|^2\big),$$
where c, b are set so that $\int_{\mathbb{R}} \delta_{nkj} = 1$.
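As a quick sanity check (a standard Gaussian integral, not spelled out on the slide), the normalization forces
$$\int_{\mathbb{R}} c_n e^{-b n^2 |\ell - (j+k)|^2}\, d\ell = c_n \sqrt{\frac{\pi}{b n^2}} = 1 \quad\Longrightarrow\quad c_n = n\sqrt{\frac{b}{\pi}},$$
so the spikes grow like n and concentrate at ℓ = j + k as n → ∞.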
Now let the second weight function be
$$w^{(1)}_n(\ell, j) = \sum_{k=1}^{m} a_{jk}\, \delta_{nkj}(\ell).$$
Putting everything together, for every n let O_n : L^p(R) → L^1([0, 1]) be
$$\mathcal{O}_n : \xi \mapsto \int_{\mathbb{R}} w^{(1)}_n(\ell, j)\, g \circ o[\xi](\ell)\, d\mu(\ell).$$
Clearly $\mathcal{O}_n[\xi](j) \to \sum_{k=1}^{m} a_{jk}\, g \circ o[\xi](k + j)$ as n → ∞.
Therefore for every ε > 0 there exists an N such that for all n > N, for all ξ, and for all j,
$$|\mathcal{O}_n[\xi](j) - \mathcal{C}_j[\xi]| \le \|\mathcal{O}_n[\cdot](j) - \mathcal{C}_j[\cdot]\| < \epsilon/2.$$
Recall that for every j, $\|K_j - \mathcal{C}_j\| < \epsilon/2$.
By the triangle inequality we have that for all j,
$$\|K_j - \mathcal{O}_n(j)\| = \|K_j - \mathcal{O}_n(j) + \mathcal{C}_j - \mathcal{C}_j\| \le \|K_j - \mathcal{C}_j\| + \|\mathcal{O}_n(j) - \mathcal{C}_j\| < \epsilon.$$
Therefore ‖K − O‖ < ε.