MACHINE LEARNING: A CHALLENGE FOR MATHEMATICS
Steffen Grünewälder [email protected]
Department of Mathematics & Statistics, Lancaster University

WHAT IS MACHINE LEARNING?
Statistical data analysis (statistics). Software that uses data to adapt (computer science). Processing of information/signals (engineering).
Statistics applied to technological problems. Terminology is often biologically inspired.
MACHINE LEARNING
SOME ML HISTORY
CLASSIFICATION
Labels $y_1, \dots, y_n \in \{-1, +1\}$.
Goal: find a function $f : \mathbb{R}^d \to \{-1, +1\}$ that predicts 'well' the labels $y$ of future inputs $x$.
PERCEPTRON
$$f(x) = \mathrm{sgn}(\langle w, x\rangle + b).$$
The perceptron algorithm finds a hyperplane $(w, b)$ that separates the data (if it is separable).¹
¹Rosenblatt, 1957
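To make the update rule concrete, here is a minimal sketch of the perceptron (illustrative code; function and variable names are my own, not from the talk):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """X: (n, d) array of inputs, y: labels in {-1, +1}.
    Returns (w, b) with sign(<w, x> + b) separating the data,
    provided the data are linearly separable."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified point
                w += yi * xi             # Rosenblatt's update
                b += yi
                mistakes += 1
        if mistakes == 0:                # every point correctly classified
            break
    return w, b
```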
The corresponding function class $\mathcal{F}$ (one-hidden-layer neural networks) is dense in $C([0, 1]^d)$.¹
Proof techniques: Stone-Weierstraß and Wiener-Tauberian theorems.
¹Cybenko 89, Hornik 91
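As a numerical illustration of this density statement (not of the proof): fit a one-hidden-layer network with randomly drawn hidden weights to a smooth target by least squares on the output layer. The target function, the number of units, and all names below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target function on [0, 1] and a grid to measure the approximation error.
target = lambda x: np.sin(2 * np.pi * x) + 0.5 * np.cos(5 * x)
x = np.linspace(0, 1, 500)

# One-hidden-layer network sum_j w_j * tanh(a_j * x + c_j) with random (a_j, c_j);
# only the output weights w are fitted, by least squares.
N = 200                                   # number of hidden units
a = rng.normal(scale=10.0, size=N)
c = rng.uniform(-10.0, 10.0, size=N)
Phi = np.tanh(np.outer(x, a) + c)         # (500, N) matrix of hidden-unit outputs

w, *_ = np.linalg.lstsq(Phi, target(x), rcond=None)
err = np.max(np.abs(Phi @ w - target(x)))
print(f"sup-norm error with {N} units: {err:.3e}")
```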
RISK FUNCTIONALS AND OPTIMISATION
How to select a candidate in $\mathcal{F}$? Typically one defines a loss function per pair $(x, y)$,
$$\ell(x, y, f) = (f(x) - y)^2.$$
Risk function
$$R(f) = \int \ell(x, y, f)\, dP(x, y).$$
$P$ is unknown and one uses instead the empirical measure
$$P_n = n^{-1}\sum_{i=1}^{n} \delta_{(x_i, y_i)},$$
giving the empirical risk
$$R_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(x_i, y_i, f).$$
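A minimal sketch of empirical risk minimisation with these definitions (the sample, the affine function class, and all names are illustrative):

```python
import numpy as np

def empirical_risk(f, X, Y):
    """R_n(f) = (1/n) * sum_i (f(x_i) - y_i)^2, the empirical risk."""
    return np.mean((f(X) - Y) ** 2)

# Toy sample (x_i, y_i) drawn from an unknown distribution P.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=50)
Y = 2.0 * X + rng.normal(scale=0.3, size=50)

# F = {affine functions f(x) = w*x + b}; for the squared loss the
# empirical risk minimiser is given by ordinary least squares.
A = np.column_stack([X, np.ones_like(X)])
w, b = np.linalg.lstsq(A, Y, rcond=None)[0]
f_hat = lambda x: w * x + b

print("empirical risk R_n(f_hat):", empirical_risk(f_hat, X, Y))
```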
APPROXIMATION VS. ESTIMATION
$$y = \sin(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1/4).$$
[Figure: the data together with fits from a small, a medium, and a large function class $\mathcal{F}$]
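A sketch reproducing the trade-off numerically under the model above (polynomial degrees stand in for a small, medium, and large $\mathcal{F}$; these choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = rng.uniform(0, 1, size=n)
Y = np.sin(X) + rng.normal(scale=0.5, size=n)          # noise variance 1/4

# Fresh test data to estimate the true risk R(f) of each fit.
X_test = rng.uniform(0, 1, size=10_000)
Y_test = np.sin(X_test) + rng.normal(scale=0.5, size=10_000)

for degree in (1, 4, 12):                               # small / medium / large F
    f_hat = np.poly1d(np.polyfit(X, Y, degree))
    train_risk = np.mean((f_hat(X) - Y) ** 2)           # R_n(f_hat)
    test_risk = np.mean((f_hat(X_test) - Y_test) ** 2)  # estimate of R(f_hat)
    print(f"degree {degree:2d}: R_n = {train_risk:.3f}, R_test = {test_risk:.3f}")
```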
RISK BOUNDS
How do the approximation and estimation error behave as a function of $\mathcal{F}$?
Typically one has a measure of the complexity of $\mathcal{F}$ and tries to link complexity to the two error types.
For neural networks one has¹
$$O(1/N) + O\!\left(\frac{Nd}{n}\log n\right)$$
($N$ number of units / measures complexity; $d$ dimension of $X$; $n$ number of samples).
Balancing the two types of errors ($N = (n/(d\log n))^{1/2}$):
$$O\!\left(n^{-1/2}(d\log n)^{1/2}\right).$$
¹Barron, 1991
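For concreteness, plugging the balancing choice of $N$ into the two terms (a routine check, not spelled out on the slide):
$$\frac{1}{N} = \left(\frac{d\log n}{n}\right)^{1/2},
\qquad
\frac{Nd}{n}\log n
  = \left(\frac{n}{d\log n}\right)^{1/2}\frac{d\log n}{n}
  = \left(\frac{d\log n}{n}\right)^{1/2},$$
so both terms are of order $n^{-1/2}(d\log n)^{1/2}$, which is the stated rate.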
(LINEAR) SUPPORT VECTOR MACHINE
Alternative approach to linear classification, also based on hyperplanes. An SVM finds the hyperplane that maximises the margin between the two classes.¹
¹Vapnik & Chervonenkis, 1963
VC-CLASSES
Vapnik and Chervonenkis then focused on the consistency of learning machines. Let $X$ be some set, $A \subseteq X$ and $\mathcal{C} \subseteq 2^X$. Consider
$$\mathcal{C}_A := \{C \cap A : C \in \mathcal{C}\}.$$
$\mathcal{C}$ is said to shatter $A$ if $\Delta_{\mathcal{C}}(A) := |\mathcal{C}_A| = 2^{|A|}$.
A way to measure the 'complexity' of $\mathcal{C}$ is to consider
$$m_{\mathcal{C}}(n) := \max\{\Delta_{\mathcal{C}}(F) : F \subseteq X,\ |F| = n\}.$$
The index $V(\mathcal{C})$ is the smallest $n$ (possibly infinite) for which $m_{\mathcal{C}}(n) < 2^n$.
$\mathcal{C}$ is a VC-class if the index $V(\mathcal{C})$ is finite.
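A brute-force illustration of these definitions (class and names chosen for illustration): compute $\Delta_{\mathcal{C}}(F)$ and $m_{\mathcal{C}}(n)$ for the class of half-lines $\mathcal{C} = \{(-\infty, t] : t \in \mathbb{R}\}$ on small finite subsets of $\mathbb{R}$.

```python
from itertools import combinations

def traces(points):
    """All subsets of `points` cut out by half-lines (-inf, t]:
    these are exactly the 'prefixes' of the sorted point set."""
    pts = sorted(points)
    return {frozenset(pts[:k]) for k in range(len(pts) + 1)}

def shatter_coefficient(points):
    """Delta_C(F) = number of distinct intersections C ∩ F."""
    return len(traces(points))

def growth(n, universe):
    """m_C(n) = max of Delta_C(F) over size-n subsets F of a finite universe."""
    return max(shatter_coefficient(F) for F in combinations(universe, n))

universe = [0.0, 1.0, 2.0, 3.0, 4.0]
for n in range(1, 5):
    print(n, growth(n, universe), 2 ** n)
# Half-lines shatter single points (2 = 2^1) but never two points
# (only 3 of the 4 labelings are realisable), so V(C) = 2 for this class.
```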
EXAMPLE: HYPERPLANES IN $\mathbb{R}^d$
Consider $\mathcal{C} = \{\text{all hyperplanes in } \mathbb{R}^d\}$. In $\mathbb{R}$ one can use $F = \{0, 1\}$, and $\mathcal{C}$ shatters $F$. But there exists no $F$ with $|F| = 3$ that $\mathcal{C}$ shatters, so $V(\mathcal{C}) = 3$. In $\mathbb{R}^2$ we have $V(\mathcal{C}) = 4$. In $\mathbb{R}^d$ we have $V(\mathcal{C}) = d + 2$ (Radon's theorem).
[Figure: the one-dimensional example]
[Figure: the two-dimensional example]
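A brute-force check of the one-dimensional claim (illustrative code; the class is thresholds $x \mapsto \mathrm{sgn}(s\,(x - t))$ with both orientations, i.e. hyperplane classifiers in $\mathbb{R}$):

```python
from itertools import combinations, product

def realisable_labelings(points):
    """All label patterns on `points` produced by sign(s*(x - t)),
    s in {-1, +1}, t ranging over the real line."""
    pts = sorted(points)
    # Thresholds between (and outside) the points cover all distinct cases.
    thresholds = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
    out = set()
    for s, t in product((-1, 1), thresholds):
        out.add(tuple(1 if s * (x - t) > 0 else -1 for x in points))
    return out

def shattered(points):
    return len(realisable_labelings(points)) == 2 ** len(points)

print(shattered((0.0, 1.0)))   # True: two points can be shattered
print(any(shattered(F) for F in combinations((0.0, 1.0, 2.0, 3.0), 3)))
# False: no three points are shattered, so V(C) = 3 in R
```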
GLIVENKO-CANTELLI THEOREMS
Empirical distribution function
$$F_n(x) = P_n((-\infty, x]).$$
Almost surely $F_n$ converges (uniformly) to the cdf $F$:
$$\|F_n - F\|_\infty = \sup_{x\in\mathbb{R}} |F_n(x) - F(x)| \to 0 \quad \text{(a.s.)}.$$
Extension: $\mathcal{C}$ is called a GC-class if
$$\|P_n - P\|_{\mathcal{C}} = \sup_{A\in\mathcal{C}} |P_n(A) - P(A)| \to 0 \quad \text{(a.s.)}.$$
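A quick numerical illustration of the classical statement (illustrative code, not part of the slides): $\|F_n - F\|_\infty$ for growing $n$, with Uniform$(0,1)$ data so that $F(x) = x$.

```python
import numpy as np

rng = np.random.default_rng(3)

def sup_distance(n):
    """||F_n - F||_inf for n Uniform(0,1) draws, where F(x) = x.
    The supremum is attained at the sample points (just before/after each jump)."""
    x = np.sort(rng.uniform(size=n))
    Fn_right = np.arange(1, n + 1) / n     # F_n(x_i)
    Fn_left = np.arange(0, n) / n          # F_n(x_i^-)
    return max(np.max(np.abs(Fn_right - x)), np.max(np.abs(Fn_left - x)))

for n in (10, 100, 1000, 10000):
    print(n, round(sup_distance(n), 4))    # decreases roughly like n^(-1/2)
```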
One can also consider convergence in mean, since
$$\|P_n - P\|_{\mathcal{C}} \to 0 \ \text{(a.s.)} \quad \text{iff} \quad \mathbb{E}\|P_n - P\|_{\mathcal{C}} \to 0.$$
Let $(X, \Sigma)$ be a measurable space and
$$\mathcal{P} = \{\text{all probability measures on } \Sigma\}.$$
$\mathcal{C}$ is called uGC if $\mathbb{E}_P\|P_n - P\|_{\mathcal{C}} \to 0$ uniformly over $P \in \mathcal{P}$.
$\mathcal{C}$ is a VC-class iff $\mathcal{C}$ is uGC!¹
¹Vapnik & Chervonenkis, 1968.
UCLTs (1982)
EMPIRICAL PROCESS
GC: a LLN that holds uniformly over a (not too large) set $\mathcal{C}$,
$$\sup_{A\in\mathcal{C}} |P_n(A) - P(A)| \to 0.$$
There exists a similar extension of the CLT. Consider the normalised difference (the empirical process)
$$\nu_n := n^{1/2}(P_n - P),$$
indexed by a function space $\mathcal{D}$,
$$\nu_n(f) = n^{1/2}\left(\int f\, dP_n - \int f\, dP\right).$$
For fixed $f$ the classical CLT gives
$$\nu_n(f) \xrightarrow{d} \mathcal{N}(0, \sigma^2) \quad \text{with} \quad \sigma^2 = \int f^2\, dP - \Big(\int f\, dP\Big)^2.$$
UNIFORM CENTRAL LIMIT THEOREM
If $\mathcal{D}$ is suitably restricted in complexity then the CLT holds uniformly over $\mathcal{D}$.
Instead of $\mathcal{N}(0, \sigma^2)$ the limiting distribution is a Gaussian process $G_P$ on $\mathcal{D}$.
It has zero mean and covariance ($f, g \in \mathcal{D}$)
$$\mathrm{cov}(G_P(f), G_P(g)) = \int fg\, dP - \int f\, dP \int g\, dP,$$
and
$$\nu_n \rightsquigarrow G_P.$$
SOME ML HISTORY
Linear SVM (1963)
REPRODUCING KERNEL HILBERT SPACES
RKHS: a Hilbert space $H$ with continuous point evaluation,
$$L_x f = f(x) \quad \text{and} \quad L_x \in H' \cong H.$$
There exists a map $X \to H$ (denoted $k(x, \cdot)$) such that
$$\langle k(x, \cdot), f\rangle = f(x).$$
Can be applied to a variety of statistical problems.²
²Parzen 1960, Wahba & Parzen until the 90s
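A small numerical illustration of these properties (illustrative code; the Gaussian kernel and the sample points are my choice): for $f = \sum_i \alpha_i k(x_i, \cdot)$, point evaluation is controlled by $|f(x)| = |\langle k(x,\cdot), f\rangle| \le \|k(x,\cdot)\|\,\|f\|$.

```python
import numpy as np

def k(x, y, gamma=1.0):
    """Gaussian kernel k(x, y) = exp(-gamma * |x - y|^2)."""
    return np.exp(-gamma * (x - y) ** 2)

# A function in the RKHS: f = sum_i alpha_i * k(x_i, .)
x_pts = np.array([0.0, 0.5, 2.0])
alpha = np.array([1.0, -2.0, 0.7])
f = lambda x: float(np.sum(alpha * k(x, x_pts)))

# RKHS norm: ||f||^2 = alpha^T K alpha, with K the kernel (Gram) matrix.
K = k(x_pts[:, None], x_pts[None, :])
f_norm = np.sqrt(alpha @ K @ alpha)

# Continuity of point evaluation: |f(x)| <= ||k(x,.)|| * ||f||,
# and ||k(x,.)|| = sqrt(k(x, x)) = 1 for the Gaussian kernel.
for x in (0.3, 1.0, 5.0):
    print(f"|f({x})| = {abs(f(x)):.3f} <= {np.sqrt(k(x, x)) * f_norm:.3f}")
```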
RKHS AND THE SVM
'Non-linear' SVM.³
First idea: one can use an arbitrary transformation $\Phi : X \to H$ to make the data 'richer', e.g.
$$x_i \in \mathbb{R} \quad \text{and} \quad \Phi(x_i) = (x_i, x_i^2, x_i^3, \dots).$$
If $k(x, y)$ is positive semi-definite then there exists an RKHS $H$ and a function $\Phi : X \to H$ with
$$k(x, y) = \langle \Phi(x), \Phi(y)\rangle,$$
for example
$$k(x, y) = \exp(-\|x - y\|^2).$$
The SVM can be formulated entirely in terms of $k$, without the need to know $H$ or $\Phi$.
³Boser, Guyon, Vapnik, 1992
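A tiny sketch of this 'kernel trick' (illustrative code; the polynomial kernel is chosen because its feature map is finite-dimensional, so both sides of the identity can be computed):

```python
import numpy as np

# Kernel k(x, y) = (1 + x*y)^2 on R has the explicit finite-dimensional
# feature map Phi(x) = (1, sqrt(2)*x, x^2), so we can check
# k(x, y) = <Phi(x), Phi(y)> directly.
def k(x, y):
    return (1.0 + x * y) ** 2

def phi(x):
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2])

x, y = 0.7, -1.3
print(k(x, y), phi(x) @ phi(y))   # the two numbers agree

# For kernels like exp(-||x - y||^2) the feature space is infinite-dimensional,
# but algorithms only ever need the kernel matrix K[i, j] = k(x_i, x_j).
X = np.array([0.0, 0.5, 1.0, 2.0])
K = (1.0 + np.outer(X, X)) ** 2
print(K.shape)
```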
Linear SVM (1963)
Mean Embeddings
RKHS - DONSKER
The unit ball of an RKHS with a uniformly continuous kernel function $k(x, \cdot) : X \to H$ is a Donsker class.
This implies that for every $\epsilon > 0$ there exists a constant $b > 0$ such that
$$\Pr\Big\{\sup_{\|f\|\le 1} |E_n f - E f| > b\, n^{-1/2}\Big\} < \epsilon, \quad \text{for all } n \ge 1.$$
Can be used for 2-sample tests via
$$\sup_{\|f\|\le 1} |E_P f - E_Q f|.$$
MEAN EMBEDDINGS
If a Banach space $B \subseteq L^1(X, P)$ and $E : B \to \mathbb{R}$ is bounded then
$$\exists\, m \in B' \ \text{with} \ E f = m(f).$$
For Hilbert spaces this implies
$$\sup_{\|f\|\le 1} |\langle f, m_P\rangle - \langle f, m_Q\rangle| = \|m_P - m_Q\|.$$
In an RKHS $\|m_P - m_Q\|$ can be computed in $O(n^2)$.
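A minimal sketch of that $O(n^2)$ computation (illustrative code and kernel choice): the squared RKHS distance between the empirical mean embeddings of two samples, i.e. the (biased) MMD statistic.

```python
import numpy as np

def gaussian_gram(A, B, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def mmd_squared(X, Y, gamma=1.0):
    """||m_P_hat - m_Q_hat||^2 for the empirical embeddings of samples X and Y:
    mean kernel within X + within Y - 2 * across; O(n^2) kernel evaluations."""
    return (gaussian_gram(X, X, gamma).mean()
            + gaussian_gram(Y, Y, gamma).mean()
            - 2 * gaussian_gram(X, Y, gamma).mean())

rng = np.random.default_rng(4)
X = rng.normal(0.0, 1.0, size=(200, 2))
Y = rng.normal(0.5, 1.0, size=(200, 2))                 # shifted distribution
print(mmd_squared(X, X[::-1]), mmd_squared(X, Y))       # ~0 vs clearly positive
```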
APPROXIMATIONS
One might be interested in a 'compact' approximation of $m$.
If we have continuous point evaluators $L_x \in B'$ then
$$m(f) = E f = \int L_x f\, dP = \Big(\int L_x\, dP\Big)(f),$$
with $\int L_x\, dP$ a Bochner integral.
$m = \int L_x\, dP$ then lies in the closed convex hull of the $L_x$:
$$m \in \mathrm{cch}\,\{L_x : x \in X\}.$$
A SIMPLE APPROXIMATION ALGORITHM
It is intuitive to approximate $m$ with convex combinations of the extreme points of $\mathrm{cch}\,\{L_x\}$.
A simple algorithm for an RKHS ($L_x f = \langle k(x, \cdot), f\rangle$):
1. $x_t \in \operatorname{argmax}_{x\in X}\ \langle k(x, \cdot), w_t\rangle$,
2. $w_{t+1} = w_t - (k(x_t, \cdot) - m)$.
If $\|w_t\| \le b$ for all $t$, then the approximation error after $t$ steps is of order $1/t$, and the sample can be compressed to far fewer points 'without' loss of accuracy.
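A sketch of the two-step iteration above in code, for the empirical embedding $m = \frac{1}{n}\sum_i k(z_i, \cdot)$ of a sample (the Gaussian kernel, restricting the argmax to the sample points, and the initialisation $w_0 = m$ are my illustrative choices). Working with coefficient vectors over the sample keeps every quantity computable through the Gram matrix.

```python
import numpy as np

def gram(A, B, gamma=1.0):
    """Gaussian Gram matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(5)
Z = rng.normal(size=(500, 2))        # sample defining m = (1/n) * sum_i k(z_i, .)
n = len(Z)
K = gram(Z, Z)                       # everything below lives in span{k(z_i, .)}

# Coefficient vectors over the z_i: m has coefficients 1/n; k(z_j, .) is e_j.
m = np.full(n, 1.0 / n)
w = m.copy()                         # start the iteration at w_0 = m
chosen = []
for t in range(25):
    j = int(np.argmax(K @ w))        # step 1: argmax of <k(z_j, .), w_t> over the candidates
    chosen.append(j)
    e_j = np.zeros(n); e_j[j] = 1.0
    w = w - (e_j - m)                # step 2: w_{t+1} = w_t - (k(x_t, .) - m)

# Equal-weight combination of the chosen points as an approximation of m.
c = np.bincount(chosen, minlength=n) / len(chosen)
err = np.sqrt((c - m) @ K @ (c - m))   # RKHS norm ||approximation - m||
print(f"||approximation - m|| after {len(chosen)} points: {err:.4f}")
```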
PROOF SKETCH
A density on $X$ with $p(x) > c$ for some constant $c > 0$ and all $x \in X$ implies that a ball $B(m, \epsilon)$ around $m$ is contained in $\mathrm{cch}\,\{L_x\}$ (finite-dimensional case only!).
[Figure: the iterates $w_t$ and the embedding $m$]
SUMMARY
ML is a broad field with many different areas of application. Engineering/money plays a role nowadays, but new ideas can still have massive impact. ML has always been heavily influenced by mathematics.
Estimation / Prob. Theory