
Page 1

PAC-Bayesian Learning and Neural Networks
The Binary Activated Case

Pascal Germain 1,2

joint work with Gaël Letarte 3, Benjamin Guedj 1,2,4, François Laviolette 3

1 Inria Lille - Nord Europe (MODAL project-team)
2 Laboratoire Paul Painlevé (Probability and Statistics team)
3 Université Laval (GRAAL team)
4 University College London (CSML group)

51es Journées de Statistique de la SFdS (Nancy)

June 6, 2019


Page 2

Plan

1 PAC-Bayesian Learning

2 Standard Neural Networks

3 Binary Activated Neural Networks
   One Layer (Linear predictor)
   Two Layers (shallow)
   More Layers (deep)

4 Empirical results


Page 4

Setting

Assumption: learning samples are generated i.i.d. from a data distribution $D$:

$S = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \} \sim D^n$

Objective

Given a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, find a predictor $f \in \mathcal{F}$ that minimizes the generalization loss on $D$:

$\mathcal{L}_D(f) := \mathbb{E}_{(x,y)\sim D}\; \ell(f(x), y)$

Challenge

The learning algorithm only has access to the empirical loss on $S$:

$\mathcal{L}_S(f) := \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)$

Page 5

PAC-Bayesian Theory

Pioneered by Shawe-Taylor and Williamson (1997), McAllester (1999), and Catoni (2003), PAC-Bayesian theory gives PAC generalization guarantees to "Bayesian-like" algorithms.

PAC guarantees (Probably Approximately Correct)

With probability at least $1-\delta$, the loss of predictor $f$ is at most $\varepsilon$:

$\Pr_{S\sim D^n}\Big( \mathcal{L}_D(f) \,\le\, \varepsilon(\mathcal{L}_S(f), n, \delta, \ldots) \Big) \ge 1-\delta$

Bayesian flavor

Given:
- a prior distribution $P$ on $\mathcal{F}$,
- a posterior distribution $Q$ on $\mathcal{F}$:

$\Pr_{S\sim D^n}\Big( \mathbb{E}_{f\sim Q} \mathcal{L}_D(f) \,\le\, \varepsilon\big(\mathbb{E}_{f\sim Q} \mathcal{L}_S(f), n, \delta, P, \ldots\big) \Big) \ge 1-\delta$

Page 6

A Classical PAC-Bayesian Theorem

PAC-Bayesian theorem (adapted from McAllester 1999; McAllester 2003)

For any distribution $P$ on $\mathcal{F}$ and any $\delta \in (0, 1]$, we have

$\Pr_{S\sim D^n}\Bigg( \forall Q \text{ on } \mathcal{F}:\; \mathbb{E}_{f\sim Q} \mathcal{L}_D(f) \,\le\, \underbrace{\mathbb{E}_{f\sim Q} \mathcal{L}_S(f)}_{\text{empirical loss}} \,+\, \underbrace{\sqrt{\frac{1}{2n}\Big[\mathrm{KL}(Q\|P) + \ln\frac{2\sqrt{n}}{\delta}\Big]}}_{\text{complexity term}} \Bigg) \ge 1-\delta\,,$

where $\mathrm{KL}(Q\|P) = \mathbb{E}_{f\sim Q} \ln\frac{Q(f)}{P(f)}$ is the Kullback-Leibler divergence.
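The right-hand side is simple to evaluate numerically; here is a minimal sketch in plain Python, with made-up values for the empirical loss and the KL term:

```python
import math

def mcallester_bound(emp_loss, kl, n, delta):
    # Empirical loss + sqrt( (KL(Q||P) + ln(2 sqrt(n)/delta)) / (2n) )
    return emp_loss + math.sqrt((kl + math.log(2 * math.sqrt(n) / delta)) / (2 * n))

# Hypothetical values: n = 10000 examples, empirical loss 0.05, KL = 20, delta = 0.05
print(mcallester_bound(0.05, 20.0, 10000, 0.05))  # about 0.088
```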

Training bound

Gives generalization guarantees that are not based on a testing sample.

Valid for all posteriors $Q$ on $\mathcal{F}$: an inspiration for designing new learning algorithms.


Page 8

Standard Neural Networks (Multilayer perceptrons, or MLPs)

Classification setting:
- $\mathbf{x} \in \mathbb{R}^d$
- $y \in \{-1, 1\}$

Architecture:
- $L$ fully connected layers
- $d_k$ denotes the number of neurons of the $k$-th layer
- $\sigma : \mathbb{R} \to \mathbb{R}$ is the activation function

Parameters:
- $W_k \in \mathbb{R}^{d_k \times d_{k-1}}$ denotes the weight matrices
- $\theta = \mathrm{vec}\big(\{W_k\}_{k=1}^L\big) \in \mathbb{R}^D$

[Figure: fully connected network; each hidden and output neuron applies $\sigma$]

Prediction

$f_\theta(\mathbf{x}) = \sigma\Big(\mathbf{w}_L \cdot \sigma\big(W_{L-1}\, \sigma( \cdots \sigma(W_1 \mathbf{x}))\big)\Big)\,.$
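As a point of reference before moving to the binary activated case, a minimal sketch of this forward pass (NumPy; `weights` is a hypothetical list `[W1, ..., WL]`, with the last element stored as a 1 × d_{L-1} matrix):

```python
import numpy as np

def forward(x, weights, sigma=np.tanh):
    # f_theta(x) = sigma(w_L . sigma(W_{L-1} sigma(... sigma(W_1 x))))
    h = x
    for W in weights:
        h = sigma(W @ h)
    return h
```

The binary activated networks below replace sigma by the sign function, e.g. `forward(x, weights, sigma=np.sign)`; note that `np.sign` maps 0 to 0, whereas the slides define sgn(0) = -1.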

Page 9

First PAC-Bayesian bounds for Stochastic Neural Networks

"(Not) Bounding the True Error" (Langford and Caruana 2001)
- Shallow networks ($L = 2$)
- Sigmoid activation functions [figure: sigmoid curve]

"Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks" (Dziugaite and Roy 2017)
- Deep networks ($L > 2$)
- ReLU activation functions [figure: ReLU curve]

Idea: bound the expected loss of the network under a Gaussian perturbation of the weights.

- Empirical loss: $\mathbb{E}_{\theta'\sim\mathcal{N}(\theta,\Sigma)}\, \mathcal{L}_S(f_{\theta'})$ → estimated by sampling
- Complexity term: $\mathrm{KL}\big(\mathcal{N}(\theta,\Sigma)\,\|\,\mathcal{N}(\theta_p,\Sigma_p)\big)$ → closed form
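The closed form is the usual Gaussian-to-Gaussian KL divergence; a minimal sketch for the special case of diagonal covariances (an assumption made here to keep the code short; the slide leaves $\Sigma$ general):

```python
import numpy as np

def kl_gaussians_diag(mu_q, var_q, mu_p, var_p):
    # KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), arrays of shape (D,)
    return 0.5 * np.sum(
        var_q / var_p + (mu_p - mu_q) ** 2 / var_p - 1.0 + np.log(var_p / var_q)
    )
```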


Page 11

Binary Activated Neural Networks

Classification setting:
- $\mathbf{x} \in \mathbb{R}^d$
- $y \in \{-1, 1\}$

Architecture:
- $L$ fully connected layers
- $d_k$ denotes the number of neurons of the $k$-th layer
- $\mathrm{sgn}(a) = 1$ if $a > 0$ and $\mathrm{sgn}(a) = -1$ otherwise

Parameters:
- $W_k \in \mathbb{R}^{d_k \times d_{k-1}}$ denotes the weight matrices
- $\theta = \mathrm{vec}\big(\{W_k\}_{k=1}^L\big) \in \mathbb{R}^D$

[Figure: fully connected network; each hidden and output neuron applies sgn]

Prediction

$f_\theta(\mathbf{x}) = \mathrm{sgn}\Big(\mathbf{w}_L \cdot \mathrm{sgn}\big(W_{L-1}\, \mathrm{sgn}( \cdots \mathrm{sgn}(W_1 \mathbf{x}))\big)\Big)\,.$


Page 13

One Layer

"PAC-Bayesian Learning of Linear Classifiers" (Germain et al. 2009)

$f_\mathbf{w}(\mathbf{x}) := \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x})$, with $\mathbf{w} \in \mathbb{R}^d$.

[Figure: a single neuron computing sgn(w · x)]

Page 14

One Layer

"PAC-Bayesian Learning of Linear Classifiers" (Germain et al. 2009)

$f_\mathbf{w}(\mathbf{x}) := \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x})$, with $\mathbf{w} \in \mathbb{R}^d$.

PAC-Bayes analysis:
- space of all linear classifiers $\mathcal{F}_d := \{f_\mathbf{v} \mid \mathbf{v} \in \mathbb{R}^d\}$
- Gaussian posterior $Q_\mathbf{w} := \mathcal{N}(\mathbf{w}, I_d)$ over $\mathcal{F}_d$
- Gaussian prior $P_{\mathbf{w}_0} := \mathcal{N}(\mathbf{w}_0, I_d)$ over $\mathcal{F}_d$

Predictor

$F_\mathbf{w}(\mathbf{x}) := \mathbb{E}_{\mathbf{v}\sim Q_\mathbf{w}} f_\mathbf{v}(\mathbf{x}) = \mathrm{erf}\Big(\frac{\mathbf{w}\cdot\mathbf{x}}{\sqrt{2}\,\|\mathbf{x}\|}\Big)$

[Figure: erf(x), tanh(x) and sgn(x) curves]

Bound minimization: under the linear loss $\ell(y, y') := \frac{1}{2}(1 - y y')$, minimizing $C\, n\, \mathcal{L}_S(F_\mathbf{w}) + \mathrm{KL}(Q_\mathbf{w}\|P_{\mathbf{w}_0})$ amounts (up to an additive constant) to minimizing

$C \sum_{i=1}^n \frac{1}{2}\, \mathrm{erf}\Big(\frac{-y_i\, \mathbf{w}\cdot\mathbf{x}_i}{\sqrt{2}\,\|\mathbf{x}_i\|}\Big) + \frac{1}{2}\|\mathbf{w} - \mathbf{w}_0\|^2\,.$
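A minimal sketch of this predictor and objective (NumPy/SciPy; the function names are mine):

```python
import numpy as np
from scipy.special import erf

def F(w, x):
    # F_w(x) = erf( w.x / (sqrt(2) ||x||) )
    return erf(w @ x / (np.sqrt(2) * np.linalg.norm(x)))

def objective(w, w0, X, y, C):
    # C sum_i (1/2) erf( -y_i w.x_i / (sqrt(2) ||x_i||) ) + (1/2) ||w - w0||^2
    z = y * (X @ w) / (np.sqrt(2) * np.linalg.norm(X, axis=1))
    return C * np.sum(0.5 * erf(-z)) + 0.5 * np.sum((w - w0) ** 2)
```

The objective is smooth in $\mathbf{w}$, so it can be handed to any generic gradient-based optimizer.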


Page 16

Two Layers

Posterior $Q_\theta = \mathcal{N}(\theta, I_D)$ over the family of all networks $\mathcal{F}_D = \{f_\theta \mid \theta \in \mathbb{R}^D\}$, where

$f_\theta(\mathbf{x}) = \mathrm{sgn}\big(\mathbf{w}_2 \cdot \mathrm{sgn}(W_1 \mathbf{x})\big)\,.$

$F_\theta(\mathbf{x}) = \mathbb{E}_{\tilde\theta\sim Q_\theta} f_{\tilde\theta}(\mathbf{x})$
$= \int_{\mathbb{R}^{d_1\times d_0}} Q_1(V_1) \int_{\mathbb{R}^{d_1}} Q_2(\mathbf{v}_2)\, \mathrm{sgn}\big(\mathbf{v}_2\cdot\mathrm{sgn}(V_1\mathbf{x})\big)\, d\mathbf{v}_2\, dV_1$
$= \int_{\mathbb{R}^{d_1\times d_0}} Q_1(V_1)\, \mathrm{erf}\Big(\frac{\mathbf{w}_2\cdot\mathrm{sgn}(V_1\mathbf{x})}{\sqrt{2}\,\|\mathrm{sgn}(V_1\mathbf{x})\|}\Big)\, dV_1$
$= \sum_{\mathbf{s}\in\{-1,1\}^{d_1}} \mathrm{erf}\Big(\frac{\mathbf{w}_2\cdot\mathbf{s}}{\sqrt{2 d_1}}\Big) \int_{\mathbb{R}^{d_1\times d_0}} \mathbb{1}[\mathbf{s} = \mathrm{sgn}(V_1\mathbf{x})]\, Q_1(V_1)\, dV_1$
$= \sum_{\mathbf{s}\in\{-1,1\}^{d_1}} \underbrace{\mathrm{erf}\Big(\frac{\mathbf{w}_2\cdot\mathbf{s}}{\sqrt{2 d_1}}\Big)}_{F_{\mathbf{w}_2}(\mathbf{s})}\; \underbrace{\prod_{i=1}^{d_1}\Big[\frac{1}{2} + \frac{s_i}{2}\,\mathrm{erf}\Big(\frac{\mathbf{w}_1^i\cdot\mathbf{x}}{\sqrt{2}\,\|\mathbf{x}\|}\Big)\Big]}_{\Pr(\mathbf{s}|\mathbf{x},W_1)}\,.$

[Figure: one-hidden-layer binary activated network]
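The final sum has $2^{d_1}$ terms, so it can be computed exactly for small hidden layers; a minimal sketch (NumPy/SciPy, names mine):

```python
import itertools
import numpy as np
from scipy.special import erf

def F_theta(W1, w2, x):
    # Exact sum over all hidden activation patterns s in {-1,1}^{d1}
    d1 = W1.shape[0]
    # Pr(s_i = 1 | x, w_1^i) = 1/2 + (1/2) erf( w_1^i . x / (sqrt(2) ||x||) )
    p = 0.5 + 0.5 * erf(W1 @ x / (np.sqrt(2) * np.linalg.norm(x)))
    total = 0.0
    for s in itertools.product([-1.0, 1.0], repeat=d1):
        s = np.array(s)
        prob = np.prod(np.where(s > 0, p, 1.0 - p))     # Pr(s | x, W1)
        total += erf(w2 @ s / np.sqrt(2 * d1)) * prob   # F_{w2}(s), weighted
    return total
```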

Page 17

PAC-Bayesian Ingredients

Empirical loss

$\mathcal{L}_S(F_\theta) = \mathbb{E}_{\theta'\sim Q_\theta} \mathcal{L}_S(f_{\theta'}) = \frac{1}{n}\sum_{i=1}^n \Big[\frac{1}{2} - \frac{1}{2}\, y_i\, F_\theta(\mathbf{x}_i)\Big]\,.$

Complexity term

$\mathrm{KL}(Q_\theta\|P_{\theta_0}) = \frac{1}{2}\|\theta - \theta_0\|^2\,.$

Page 18

Derivatives

Chain rule:

$\frac{\partial \mathcal{L}_S(F_\theta)}{\partial \theta} = \frac{1}{n}\sum_{i=1}^n \frac{\partial\, \ell(F_\theta(\mathbf{x}_i), y_i)}{\partial \theta} = \frac{1}{n}\sum_{i=1}^n \frac{\partial F_\theta(\mathbf{x}_i)}{\partial \theta}\; \ell'(F_\theta(\mathbf{x}_i), y_i)\,.$

Hidden layer partial derivatives:

$\frac{\partial}{\partial \mathbf{w}_1^k} F_\theta(\mathbf{x}) = \frac{\mathbf{x}}{2^{3/2}\,\|\mathbf{x}\|}\; \mathrm{erf}'\Big(\frac{\mathbf{w}_1^k\cdot\mathbf{x}}{\sqrt{2}\,\|\mathbf{x}\|}\Big) \sum_{\mathbf{s}\in\{-1,1\}^{d_1}} s_k\, \mathrm{erf}\Big(\frac{\mathbf{w}_2\cdot\mathbf{s}}{\sqrt{2 d_1}}\Big) \Big[\frac{\Pr(\mathbf{s}|\mathbf{x},W_1)}{\Pr(s_k|\mathbf{x},\mathbf{w}_1^k)}\Big]\,,$

with $\mathrm{erf}'(x) := \frac{2}{\sqrt{\pi}}\, e^{-x^2}$.

Page 19

Stochastic Approximation

$F_\theta(\mathbf{x}) = \sum_{\mathbf{s}\in\{-1,1\}^{d_1}} F_{\mathbf{w}_2}(\mathbf{s})\, \Pr(\mathbf{s}|\mathbf{x},W_1)$

Monte Carlo sampling

We generate $T$ random binary vectors $\{\mathbf{s}^t\}_{t=1}^T$ according to $\Pr(\mathbf{s}|\mathbf{x},W_1)$.

Prediction:

$F_\theta(\mathbf{x}) \approx \frac{1}{T}\sum_{t=1}^T F_{\mathbf{w}_2}(\mathbf{s}^t)\,.$

Derivatives:

$\frac{\partial}{\partial \mathbf{w}_1^k} F_\theta(\mathbf{x}) \approx \frac{\mathbf{x}}{2^{3/2}\,\|\mathbf{x}\|}\; \mathrm{erf}'\Big(\frac{\mathbf{w}_1^k\cdot\mathbf{x}}{\sqrt{2}\,\|\mathbf{x}\|}\Big)\, \frac{1}{T}\sum_{t=1}^T \frac{s_k^t}{\Pr(s_k^t|\mathbf{x},\mathbf{w}_1^k)}\, F_{\mathbf{w}_2}(\mathbf{s}^t)\,.$
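A minimal sketch of the Monte Carlo prediction (NumPy/SciPy, names mine; the derivative estimate reuses the same samples):

```python
import numpy as np
from scipy.special import erf

def F_theta_mc(W1, w2, x, T=100, rng=None):
    # Sample T hidden patterns s^t ~ Pr(s | x, W1), then average F_{w2}(s^t)
    rng = rng or np.random.default_rng()
    d1 = W1.shape[0]
    p = 0.5 + 0.5 * erf(W1 @ x / (np.sqrt(2) * np.linalg.norm(x)))
    S = np.where(rng.random((T, d1)) < p, 1.0, -1.0)  # each row is one s^t
    return np.mean(erf(S @ w2 / np.sqrt(2 * d1)))
```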


Page 21

More Layers

[Figure: deep binary activated network and its tree-structured computation]

$F_1^{(j)}(\mathbf{x}) = \mathrm{erf}\Big(\frac{\mathbf{w}_1^j\cdot\mathbf{x}}{\sqrt{2}\,\|\mathbf{x}\|}\Big)\,, \qquad F_{k+1}^{(j)}(\mathbf{x}) = \sum_{\mathbf{s}\in\{-1,1\}^{d_k}} \mathrm{erf}\Big(\frac{\mathbf{w}_{k+1}^j\cdot\mathbf{s}}{\sqrt{2 d_k}}\Big) \prod_{i=1}^{d_k}\Big(\frac{1}{2} + \frac{1}{2}\, s_i\, F_k^{(i)}(\mathbf{x})\Big)$
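A minimal sketch of this recursion (NumPy/SciPy, names mine); each layer costs on the order of $2^{d_k} d_{k+1}$ operations, so it is only for illustration on tiny widths:

```python
import itertools
import numpy as np
from scipy.special import erf

def F_layers(weights, x):
    # weights = [W1, ..., WL]; row j of W_{k+1} is w_{k+1}^j
    F = erf(weights[0] @ x / (np.sqrt(2) * np.linalg.norm(x)))  # F_1^{(j)}(x)
    for W in weights[1:]:
        dk = F.shape[0]
        out = np.zeros(W.shape[0])
        for s in itertools.product([-1.0, 1.0], repeat=dk):
            s = np.array(s)
            prob = np.prod(0.5 + 0.5 * s * F)        # prod_i (1/2 + s_i F_k^{(i)}(x) / 2)
            out += erf(W @ s / np.sqrt(2 * dk)) * prob
        F = out                                      # F_{k+1}^{(j)}(x)
    return F
```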


Page 23

Empirical results

Experimental protocol:

  Model name    Cost function                Train split  Valid split  Model selection    Prior
  MLP-tanh      linear loss, L2 regularized  80%          20%          valid linear loss  -
  PBGNetℓ       linear loss, L2 regularized  80%          20%          valid linear loss  random init
  PBGNet        PAC-Bayes bound              100%         -            PAC-Bayes bound    random init
  PBGNetpre
   - pretrain   linear loss (20 epochs)      50%          -            -                  random init
   - final      PAC-Bayes bound              50%          -            PAC-Bayes bound    pretrain

Errors (ES: on the training set S; ET: on the test set) and bound values:

  Dataset   MLP-tanh       PBGNetℓ        PBGNet                 PBGNetpre
            ES     ET      ES     ET      ES     ET     Bound    ES     ET     Bound
  ads       0.021  0.037   0.018  0.032   0.024  0.038  0.283    0.034  0.033  0.058
  adult     0.128  0.149   0.136  0.148   0.158  0.154  0.227    0.153  0.151  0.165
  mnist17   0.003  0.004   0.008  0.005   0.007  0.009  0.067    0.003  0.005  0.009
  mnist49   0.002  0.013   0.003  0.018   0.034  0.039  0.153    0.018  0.021  0.030
  mnist56   0.002  0.009   0.002  0.009   0.022  0.026  0.103    0.008  0.008  0.017
  mnistLH   0.004  0.017   0.005  0.019   0.071  0.073  0.186    0.026  0.026  0.033

Page 24

Perspectives

Transfer learning

Multiclass

CNN


Page 25

Thank you!

https://arxiv.org/abs/1905.10259


Page 26

References

Shawe-Taylor, John and Robert C. Williamson (1997). "A PAC Analysis of a Bayesian Estimator". In: COLT.

McAllester, David (1999). "Some PAC-Bayesian Theorems". In: Machine Learning 37.3.

Catoni, Olivier (2003). "A PAC-Bayesian approach to adaptive classification". In: preprint 840.

McAllester, David (2003). "PAC-Bayesian Stochastic Model Selection". In: Machine Learning 51.1.

Langford, John and Rich Caruana (2001). "(Not) Bounding the True Error". In: NIPS. MIT Press, pp. 809-816.

Dziugaite, Gintare Karolina and Daniel M. Roy (2017). "Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data". In: UAI. AUAI Press.

Germain, Pascal et al. (2009). "PAC-Bayesian learning of linear classifiers". In: ICML. ACM, pp. 353-360.
