MACHINE LEARNING: A CHALLENGE FOR MATHEMATICS
Steffen Grünewälder [email protected]
Department of Mathematics & Statistics, Lancaster University

WHAT IS MACHINE LEARNING?
Statistical data analysis (statistics). Software that uses data to adapt (computer science). Processing of information/signals (engineering).
Statistics applied to technological problems. Terminology is often biologically inspired.
MACHINE LEARNING
SOME ML HISTORY
CLASSIFICATION
Labels $y_1, \dots, y_n \in \{-1, +1\}$.
Goal: find a function $f : \mathbb{R}^d \to \{-1, +1\}$ that predicts 'well' the labels $y$ of future inputs $x$.
PERCEPTRON
$$f(x) = \mathrm{sgn}(\langle w, x\rangle + b).$$
The perceptron algorithm finds a hyperplane $(w, b)$ that separates the data (if it is separable).¹
¹Rosenblatt, 1957
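To make the update rule concrete, here is a minimal sketch of the perceptron (illustrative code; function and variable names are my own, not from the talk):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """X: (n, d) array of inputs, y: labels in {-1, +1}.
    Returns (w, b) with sign(<w, x> + b) separating the data,
    provided the data are linearly separable."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified point
                w += yi * xi             # Rosenblatt's update
                b += yi
                mistakes += 1
        if mistakes == 0:                # every point correctly classified
            break
    return w, b
```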
The corresponding function class $\mathcal{F}$ (one-hidden-layer neural networks) is dense in $C([0, 1]^d)$.¹
Proof techniques: Stone-Weierstraß and Wiener-Tauberian theorems.
¹Cybenko 89, Hornik 91
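As a numerical illustration of this density statement (not of the proof): fit a one-hidden-layer network with randomly drawn hidden weights to a smooth target by least squares on the output layer. The target function, the number of units, and all names below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target function on [0, 1] and a grid to measure the approximation error.
target = lambda x: np.sin(2 * np.pi * x) + 0.5 * np.cos(5 * x)
x = np.linspace(0, 1, 500)

# One-hidden-layer network sum_j w_j * tanh(a_j * x + c_j) with random (a_j, c_j);
# only the output weights w are fitted, by least squares.
N = 200                                   # number of hidden units
a = rng.normal(scale=10.0, size=N)
c = rng.uniform(-10.0, 10.0, size=N)
Phi = np.tanh(np.outer(x, a) + c)         # (500, N) matrix of hidden-unit outputs

w, *_ = np.linalg.lstsq(Phi, target(x), rcond=None)
err = np.max(np.abs(Phi @ w - target(x)))
print(f"sup-norm error with {N} units: {err:.3e}")
```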
RISK FUNCTIONALS AND OPTIMISATION
How to select a candidate in $\mathcal{F}$? Typically one defines a loss function per pair $(x, y)$,
$$\ell(x, y, f) = (f(x) - y)^2.$$
Risk function
$$R(f) = \int \ell(x, y, f)\, dP(x, y).$$
$P$ is unknown and one uses instead the empirical measure
$$P_n = n^{-1}\sum_{i=1}^{n} \delta_{(x_i, y_i)},$$
giving the empirical risk
$$R_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(x_i, y_i, f).$$
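A minimal sketch of empirical risk minimisation with these definitions (the sample, the affine function class, and all names are illustrative):

```python
import numpy as np

def empirical_risk(f, X, Y):
    """R_n(f) = (1/n) * sum_i (f(x_i) - y_i)^2, the empirical risk."""
    return np.mean((f(X) - Y) ** 2)

# Toy sample (x_i, y_i) drawn from an unknown distribution P.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=50)
Y = 2.0 * X + rng.normal(scale=0.3, size=50)

# F = {affine functions f(x) = w*x + b}; for the squared loss the
# empirical risk minimiser is given by ordinary least squares.
A = np.column_stack([X, np.ones_like(X)])
w, b = np.linalg.lstsq(A, Y, rcond=None)[0]
f_hat = lambda x: w * x + b

print("empirical risk R_n(f_hat):", empirical_risk(f_hat, X, Y))
```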
APPROXIMATION VS. ESTIMATION
$$y = \sin(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1/4).$$
[Figure: the data together with fits from a small, a medium, and a large function class $\mathcal{F}$]
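A sketch reproducing the trade-off numerically under the model above (polynomial degrees stand in for a small, medium, and large $\mathcal{F}$; these choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = rng.uniform(0, 1, size=n)
Y = np.sin(X) + rng.normal(scale=0.5, size=n)          # noise variance 1/4

# Fresh test data to estimate the true risk R(f) of each fit.
X_test = rng.uniform(0, 1, size=10_000)
Y_test = np.sin(X_test) + rng.normal(scale=0.5, size=10_000)

for degree in (1, 4, 12):                               # small / medium / large F
    f_hat = np.poly1d(np.polyfit(X, Y, degree))
    train_risk = np.mean((f_hat(X) - Y) ** 2)           # R_n(f_hat)
    test_risk = np.mean((f_hat(X_test) - Y_test) ** 2)  # estimate of R(f_hat)
    print(f"degree {degree:2d}: R_n = {train_risk:.3f}, R_test = {test_risk:.3f}")
```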
RISK BOUNDS
How do the approximation and estimation error behave as a function of $\mathcal{F}$?
Typically one has a measure of the complexity of $\mathcal{F}$ and tries to link complexity to the two error types.
For neural networks one has¹
$$O(1/N) + O\!\left(\frac{Nd}{n}\log n\right)$$
($N$ number of units / measures complexity; $d$ dimension of $X$; $n$ number of samples).
Balancing the two types of errors ($N = (n/(d\log n))^{1/2}$):
$$O\!\left(n^{-1/2}(d\log n)^{1/2}\right).$$
¹Barron, 1991
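For concreteness, plugging the balancing choice of $N$ into the two terms (a routine check, not spelled out on the slide):
$$\frac{1}{N} = \left(\frac{d\log n}{n}\right)^{1/2},
\qquad
\frac{Nd}{n}\log n
  = \left(\frac{n}{d\log n}\right)^{1/2}\frac{d\log n}{n}
  = \left(\frac{d\log n}{n}\right)^{1/2},$$
so both terms are of order $n^{-1/2}(d\log n)^{1/2}$, which is the stated rate.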
(LINEAR) SUPPORT VECTOR MACHINE
Alternative approach to linear classification, also based on hyperplanes. An SVM finds the hyperplane that maximises the margin between the two classes.¹
¹Vapnik & Chervonenkis, 1963
VC-CLASSES
Vapnik and Chervonenkis then focused on the consistency of learning machines. Let $X$ be some set, $A \subseteq X$ and $\mathcal{C} \subseteq 2^X$. Consider
$$\mathcal{C}_A := \{C \cap A : C \in \mathcal{C}\}.$$
$\mathcal{C}$ is said to shatter $A$ if $\Delta_{\mathcal{C}}(A) := |\mathcal{C}_A| = 2^{|A|}$.
A way to measure the 'complexity' of $\mathcal{C}$ is to consider
$$m_{\mathcal{C}}(n) := \max\{\Delta_{\mathcal{C}}(F) : F \subseteq X,\ |F| = n\}.$$
The index $V(\mathcal{C})$ is the smallest $n$ (possibly infinite) for which $m_{\mathcal{C}}(n) < 2^n$.
$\mathcal{C}$ is a VC-class if the index $V(\mathcal{C})$ is finite.
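A brute-force illustration of these definitions (class and names chosen for illustration): compute $\Delta_{\mathcal{C}}(F)$ and $m_{\mathcal{C}}(n)$ for the class of half-lines $\mathcal{C} = \{(-\infty, t] : t \in \mathbb{R}\}$ on small finite subsets of $\mathbb{R}$.

```python
from itertools import combinations

def traces(points):
    """All subsets of `points` cut out by half-lines (-inf, t]:
    these are exactly the 'prefixes' of the sorted point set."""
    pts = sorted(points)
    return {frozenset(pts[:k]) for k in range(len(pts) + 1)}

def shatter_coefficient(points):
    """Delta_C(F) = number of distinct intersections C ∩ F."""
    return len(traces(points))

def growth(n, universe):
    """m_C(n) = max of Delta_C(F) over size-n subsets F of a finite universe."""
    return max(shatter_coefficient(F) for F in combinations(universe, n))

universe = [0.0, 1.0, 2.0, 3.0, 4.0]
for n in range(1, 5):
    print(n, growth(n, universe), 2 ** n)
# Half-lines shatter single points (2 = 2^1) but never two points
# (only 3 of the 4 labelings are realisable), so V(C) = 2 for this class.
```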
EXAMPLE: HYPERPLANES IN $\mathbb{R}^d$
Consider $\mathcal{C} = \{\text{all hyperplanes in } \mathbb{R}^d\}$. In $\mathbb{R}$ one can use $F = \{0, 1\}$, and $\mathcal{C}$ shatters $F$. But there exists no $F$ with $|F| = 3$ that $\mathcal{C}$ shatters, so $V(\mathcal{C}) = 3$. In $\mathbb{R}^2$ we have $V(\mathcal{C}) = 4$. In $\mathbb{R}^d$ we have $V(\mathcal{C}) = d + 2$ (Radon's theorem).
[Figure: the one-dimensional example]
[Figure: the two-dimensional example]
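A brute-force check of the one-dimensional claim (illustrative code; the class is thresholds $x \mapsto \mathrm{sgn}(s\,(x - t))$ with both orientations, i.e. hyperplane classifiers in $\mathbb{R}$):

```python
from itertools import combinations, product

def realisable_labelings(points):
    """All label patterns on `points` produced by sign(s*(x - t)),
    s in {-1, +1}, t ranging over the real line."""
    pts = sorted(points)
    # Thresholds between (and outside) the points cover all distinct cases.
    thresholds = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
    out = set()
    for s, t in product((-1, 1), thresholds):
        out.add(tuple(1 if s * (x - t) > 0 else -1 for x in points))
    return out

def shattered(points):
    return len(realisable_labelings(points)) == 2 ** len(points)

print(shattered((0.0, 1.0)))   # True: two points can be shattered
print(any(shattered(F) for F in combinations((0.0, 1.0, 2.0, 3.0), 3)))
# False: no three points are shattered, so V(C) = 3 in R
```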
GLIVENKO-CANTELLI THEOREMS
Empirical distribution function
$$F_n(x) = P_n((-\infty, x]).$$
Almost surely $F_n$ converges (uniformly) to the cdf $F$:
$$\|F_n - F\|_\infty = \sup_{x\in\mathbb{R}} |F_n(x) - F(x)| \to 0 \quad \text{(a.s.)}.$$
Extension: $\mathcal{C}$ is called a GC-class if
$$\|P_n - P\|_{\mathcal{C}} = \sup_{A\in\mathcal{C}} |P_n(A) - P(A)| \to 0 \quad \text{(a.s.)}.$$
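A quick numerical illustration of the classical statement (illustrative code, not part of the slides): $\|F_n - F\|_\infty$ for growing $n$, with Uniform$(0,1)$ data so that $F(x) = x$.

```python
import numpy as np

rng = np.random.default_rng(3)

def sup_distance(n):
    """||F_n - F||_inf for n Uniform(0,1) draws, where F(x) = x.
    The supremum is attained at the sample points (just before/after each jump)."""
    x = np.sort(rng.uniform(size=n))
    Fn_right = np.arange(1, n + 1) / n     # F_n(x_i)
    Fn_left = np.arange(0, n) / n          # F_n(x_i^-)
    return max(np.max(np.abs(Fn_right - x)), np.max(np.abs(Fn_left - x)))

for n in (10, 100, 1000, 10000):
    print(n, round(sup_distance(n), 4))    # decreases roughly like n^(-1/2)
```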
One can also consider convergence in mean, since
$$\|P_n - P\|_{\mathcal{C}} \to 0 \ \text{(a.s.)} \quad \text{iff} \quad \mathbb{E}\|P_n - P\|_{\mathcal{C}} \to 0.$$
Let $(X, \Sigma)$ be a measurable space and
$$\mathcal{P} = \{\text{all probability measures on } \Sigma\}.$$
$\mathcal{C}$ is called uGC if $\mathbb{E}_P\|P_n - P\|_{\mathcal{C}} \to 0$ uniformly over $P \in \mathcal{P}$.
$\mathcal{C}$ is a VC-class iff $\mathcal{C}$ is uGC!¹
¹Vapnik & Chervonenkis, 1968.
UCLTs (1982)
EMPIRICAL PROCESS
GC: a LLN that holds uniformly over a (not too large) set $\mathcal{C}$,
$$\sup_{A\in\mathcal{C}} |P_n(A) - P(A)| \to 0.$$
There exists a similar extension of the CLT. Consider the normalised difference (the empirical process)
$$\nu_n := n^{1/2}(P_n - P),$$
indexed by a function space $\mathcal{D}$,
$$\nu_n(f) = n^{1/2}\left(\int f\, dP_n - \int f\, dP\right).$$
For fixed $f$ the classical CLT gives
$$\nu_n(f) \xrightarrow{d} \mathcal{N}(0, \sigma^2) \quad \text{with} \quad \sigma^2 = \int f^2\, dP - \Big(\int f\, dP\Big)^2.$$
UNIFORM CENTRAL LIMIT THEOREM
If $\mathcal{D}$ is suitably restricted in complexity then the CLT holds uniformly over $\mathcal{D}$.
Instead of $\mathcal{N}(0, \sigma^2)$ the limiting distribution is a Gaussian process $G_P$ on $\mathcal{D}$.
It has zero mean and covariance ($f, g \in \mathcal{D}$)
$$\mathrm{cov}(G_P(f), G_P(g)) = \int fg\, dP - \int f\, dP \int g\, dP,$$
and
$$\nu_n \rightsquigarrow G_P.$$
SOME ML HISTORY
Linear SVM (1963)
REPRODUCING KERNEL HILBERT SPACES
RKHS: a Hilbert space $H$ with continuous point evaluation,
$$L_x f = f(x) \quad \text{and} \quad L_x \in H' \cong H.$$
There exists a map $X \to H$ (denoted $k(x, \cdot)$) such that
$$\langle k(x, \cdot), f\rangle = f(x).$$
Can be applied to a variety of statistical problems.²
²Parzen 1960, Wahba & Parzen until the 90s
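A small numerical illustration of these properties (illustrative code; the Gaussian kernel and the sample points are my choice): for $f = \sum_i \alpha_i k(x_i, \cdot)$, point evaluation is controlled by $|f(x)| = |\langle k(x,\cdot), f\rangle| \le \|k(x,\cdot)\|\,\|f\|$.

```python
import numpy as np

def k(x, y, gamma=1.0):
    """Gaussian kernel k(x, y) = exp(-gamma * |x - y|^2)."""
    return np.exp(-gamma * (x - y) ** 2)

# A function in the RKHS: f = sum_i alpha_i * k(x_i, .)
x_pts = np.array([0.0, 0.5, 2.0])
alpha = np.array([1.0, -2.0, 0.7])
f = lambda x: float(np.sum(alpha * k(x, x_pts)))

# RKHS norm: ||f||^2 = alpha^T K alpha, with K the kernel (Gram) matrix.
K = k(x_pts[:, None], x_pts[None, :])
f_norm = np.sqrt(alpha @ K @ alpha)

# Continuity of point evaluation: |f(x)| <= ||k(x,.)|| * ||f||,
# and ||k(x,.)|| = sqrt(k(x, x)) = 1 for the Gaussian kernel.
for x in (0.3, 1.0, 5.0):
    print(f"|f({x})| = {abs(f(x)):.3f} <= {np.sqrt(k(x, x)) * f_norm:.3f}")
```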
RKHS AND THE SVM
'Non-linear' SVM.³
First idea: one can use an arbitrary transformation $\Phi : X \to H$ to make the data 'richer', e.g.
$$x_i \in \mathbb{R} \quad \text{and} \quad \Phi(x_i) = (x_i, x_i^2, x_i^3, \dots).$$
If $k(x, y)$ is positive semi-definite then there exists an RKHS $H$ and a function $\Phi : X \to H$ with
$$k(x, y) = \langle \Phi(x), \Phi(y)\rangle,$$
for example
$$k(x, y) = \exp(-\|x - y\|^2).$$
The SVM can be formulated entirely in terms of $k$, without the need to know $H$ or $\Phi$.
³Boser, Guyon, Vapnik, 1992
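A tiny sketch of this 'kernel trick' (illustrative code; the polynomial kernel is chosen because its feature map is finite-dimensional, so both sides of the identity can be computed):

```python
import numpy as np

# Kernel k(x, y) = (1 + x*y)^2 on R has the explicit finite-dimensional
# feature map Phi(x) = (1, sqrt(2)*x, x^2), so we can check
# k(x, y) = <Phi(x), Phi(y)> directly.
def k(x, y):
    return (1.0 + x * y) ** 2

def phi(x):
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2])

x, y = 0.7, -1.3
print(k(x, y), phi(x) @ phi(y))   # the two numbers agree

# For kernels like exp(-||x - y||^2) the feature space is infinite-dimensional,
# but algorithms only ever need the kernel matrix K[i, j] = k(x_i, x_j).
X = np.array([0.0, 0.5, 1.0, 2.0])
K = (1.0 + np.outer(X, X)) ** 2
print(K.shape)
```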
Linear SVM (1963)
Mean Embeddings
RKHS - DONSKER
The unit ball of an RKHS with a uniformly continuous kernel function $k(x, \cdot) : X \to H$ is a Donsker class.
This implies that for every $\epsilon > 0$ there exists a constant $b > 0$ such that
$$\Pr\Big\{\sup_{\|f\|\le 1} |E_n f - E f| > b\, n^{-1/2}\Big\} < \epsilon, \quad \text{for all } n \ge 1.$$
Can be used for 2-sample tests via
$$\sup_{\|f\|\le 1} |E_P f - E_Q f|.$$
MEAN EMBEDDINGS
If a Banach space $B \subseteq L^1(X, P)$ and $E : B \to \mathbb{R}$ is bounded then
$$\exists\, m \in B' \ \text{with} \ E f = m(f).$$
For Hilbert spaces this implies
$$\sup_{\|f\|\le 1} |\langle f, m_P\rangle - \langle f, m_Q\rangle| = \|m_P - m_Q\|.$$
In an RKHS $\|m_P - m_Q\|$ can be computed in $O(n^2)$.
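A minimal sketch of that $O(n^2)$ computation (illustrative code and kernel choice): the squared RKHS distance between the empirical mean embeddings of two samples, i.e. the (biased) MMD statistic.

```python
import numpy as np

def gaussian_gram(A, B, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def mmd_squared(X, Y, gamma=1.0):
    """||m_P_hat - m_Q_hat||^2 for the empirical embeddings of samples X and Y:
    mean kernel within X + within Y - 2 * across; O(n^2) kernel evaluations."""
    return (gaussian_gram(X, X, gamma).mean()
            + gaussian_gram(Y, Y, gamma).mean()
            - 2 * gaussian_gram(X, Y, gamma).mean())

rng = np.random.default_rng(4)
X = rng.normal(0.0, 1.0, size=(200, 2))
Y = rng.normal(0.5, 1.0, size=(200, 2))                 # shifted distribution
print(mmd_squared(X, X[::-1]), mmd_squared(X, Y))       # ~0 vs clearly positive
```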
APPROXIMATIONS
One might be interested in a 'compact' approximation of $m$.
If we have continuous point evaluators $L_x \in B'$ then
$$m(f) = E f = \int L_x f\, dP = \Big(\int L_x\, dP\Big)(f),$$
with $\int L_x\, dP$ a Bochner integral.
$m = \int L_x\, dP$ then lies in the closed convex hull of the $L_x$:
$$m \in \mathrm{cch}\,\{L_x : x \in X\}.$$
A SIMPLE APPROXIMATION ALGORITHM
It is intuitive to approximate $m$ with convex combinations of the extreme points of $\mathrm{cch}\,\{L_x\}$.
A simple algorithm for an RKHS ($L_x f = \langle k(x, \cdot), f\rangle$):
1. $x_t \in \operatorname{argmax}_{x\in X}\ \langle k(x, \cdot), w_t\rangle$,
2. $w_{t+1} = w_t - (k(x_t, \cdot) - m)$.
If $\|w_t\| \le b$ for all $t$, then the approximation error after $t$ steps is of order $1/t$, and the sample can be compressed to far fewer points 'without' loss of accuracy.
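A sketch of the two-step iteration above in code, for the empirical embedding $m = \frac{1}{n}\sum_i k(z_i, \cdot)$ of a sample (the Gaussian kernel, restricting the argmax to the sample points, and the initialisation $w_0 = m$ are my illustrative choices). Working with coefficient vectors over the sample keeps every quantity computable through the Gram matrix.

```python
import numpy as np

def gram(A, B, gamma=1.0):
    """Gaussian Gram matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(5)
Z = rng.normal(size=(500, 2))        # sample defining m = (1/n) * sum_i k(z_i, .)
n = len(Z)
K = gram(Z, Z)                       # everything below lives in span{k(z_i, .)}

# Coefficient vectors over the z_i: m has coefficients 1/n; k(z_j, .) is e_j.
m = np.full(n, 1.0 / n)
w = m.copy()                         # start the iteration at w_0 = m
chosen = []
for t in range(25):
    j = int(np.argmax(K @ w))        # step 1: argmax of <k(z_j, .), w_t> over the candidates
    chosen.append(j)
    e_j = np.zeros(n); e_j[j] = 1.0
    w = w - (e_j - m)                # step 2: w_{t+1} = w_t - (k(x_t, .) - m)

# Equal-weight combination of the chosen points as an approximation of m.
c = np.bincount(chosen, minlength=n) / len(chosen)
err = np.sqrt((c - m) @ K @ (c - m))   # RKHS norm ||approximation - m||
print(f"||approximation - m|| after {len(chosen)} points: {err:.4f}")
```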
PROOF SKETCH
A density on $X$ with $p(x) > c$ for some constant $c > 0$ and all $x \in X$ implies that a ball $B(m, \epsilon)$ around $m$ is contained in $\mathrm{cch}\,\{L_x\}$ (finite-dimensional case only!).
[Figure: the iterates $w_t$ and the embedding $m$]
SUMMARY
ML is a broad field with many different areas of application. Engineering/money plays a role nowadays, but new ideas can still have massive impact. ML has always been heavily influenced by mathematics.
Estimation / Prob. Theory