A Statistical Mechanical Analysis of Online Learning
Seiji MIYOSHI, Kobe City College of Technology
Background (1)
• Batch Learning
  – Examples are used repeatedly
  – Correct answers for all examples
  – Long time
  – Large memory
• Online Learning
  – Examples used once are discarded
  – Cannot give correct answers for all examples
  – Large memory isn't necessary
  – Time-variant teacher
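The contrast between the two regimes can be sketched with a toy 1-D linear teacher (entirely illustrative and not part of the talk; the teacher y = 2x, the learning rates, and the function names are made up):

```python
import random

random.seed(0)

# Toy teacher: y = 2.0 * x. Both learners estimate the weight w.
TRUE_W = 2.0

def make_example():
    x = random.gauss(0.0, 1.0)
    return x, TRUE_W * x

# Batch learning: all examples are stored and reused over many epochs.
def batch_learn(n_examples=50, epochs=100, eta=0.5):
    data = [make_example() for _ in range(n_examples)]  # large memory
    w = 0.0
    for _ in range(epochs):                             # examples used repeatedly
        grad = sum((y - w * x) * x for x, y in data) / n_examples
        w += eta * grad
    return w

# Online learning: each example is used once and then discarded.
def online_learn(n_examples=500, eta=0.1):
    w = 0.0
    for _ in range(n_examples):
        x, y = make_example()       # fresh example
        w += eta * (y - w * x) * x  # single update; the example is then dropped
    return w

w_batch = batch_learn()
w_online = online_learn()
```

Both learners converge here; the point is the resource trade-off: the batch learner must store all examples and sweep them repeatedly, while the online learner keeps nothing but the current weight, which is also what lets it track a time-variant teacher.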
Can Student be more Clever than Teacher?
Jan. 2006
[Figure: true teacher A, moving teacher B, and student J]
Many Teachers or Few Teachers?
[Figure: true teacher A, ensemble teachers B_1, …, B_k, B_k', …, B_K, and student J]
PURPOSE
• To analyze the generalization performance of a model composed of a student, a true teacher, and K teachers (ensemble teachers) who exist around the true teacher
• To discuss the relationship between the number and diversity of the ensemble teachers and the generalization error
MODEL (1/4)
• The student J learns B_1, B_2, … in turn.
• J cannot learn the true teacher A directly.
• A, B_1, B_2, …, and J are linear perceptrons with noise.
Simple Perceptron
• Inputs: x_1, …, x_N
• Connection weights: J_1, …, J_N
• Output = sgn( Σ_{i=1}^N J_i x_i ), taking the value +1 or −1
Linear Perceptron
• Same inputs x_1, …, x_N and connection weights J_1, …, J_N as the simple perceptron
• Simple perceptron: Output = sgn( Σ_{i=1}^N J_i x_i )
• Linear perceptron: Output = Σ_{i=1}^N J_i x_i
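The two output rules can be sketched directly (variable names and sizes are illustrative):

```python
import random

random.seed(1)

N = 5
J = [random.gauss(0.0, 1.0) for _ in range(N)]  # connection weights J_1..J_N
x = [random.gauss(0.0, 1.0) for _ in range(N)]  # inputs x_1..x_N

# Linear perceptron: the output is the weighted sum itself.
def linear_output(J, x):
    return sum(Ji * xi for Ji, xi in zip(J, x))

# Simple perceptron: the output is the sign of the weighted sum (+1 or -1).
def simple_output(J, x):
    return 1 if linear_output(J, x) >= 0 else -1

u = linear_output(J, x)  # real-valued
s = simple_output(J, x)  # +1 or -1
```

The linear perceptron keeps the real-valued sum, which is what makes the model of this talk analytically tractable.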
MODEL (2/4)
• A, B_1, …, B_K, and J are linear perceptrons with noise: each output is the weighted sum of the inputs plus additive noise (variances σ_A², σ_B², and σ_J², respectively).
MODEL (3/4)
• Inputs: x
• Initial value of student: J
• True teacher: A
• Ensemble teachers: B_1, …, B_K
• N → ∞ (thermodynamic limit)
• Order parameters
  – Length of student: l
  – Direction cosines
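These order parameters can be sketched numerically (the vectors below are hypothetical, and the common normalization in which teacher weight vectors have length √N is an assumption of this sketch):

```python
import math
import random

random.seed(2)
N = 2000  # a large N stands in for the thermodynamic limit N -> infinity

# Hypothetical weight vectors for the true teacher A and the student J.
A = [random.gauss(0.0, 1.0) for _ in range(N)]
J = [random.gauss(0.0, 1.0) for _ in range(N)]

def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))

# Length of the student: l = |J| / sqrt(N)
# (teachers normalized to length sqrt(N) then have length parameter 1).
l = norm(J) / math.sqrt(N)

# Direction cosine between two weight vectors, e.g. R_J between A and J.
def direction_cosine(u, v):
    return sum(ui * vi for ui, vi in zip(u, v)) / (norm(u) * norm(v))

R_J = direction_cosine(A, J)
```

For independent random vectors the direction cosine is near 0; learning drives R_J toward 1 as the student aligns with the true teacher.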
[Figure: order parameters of the model — direction cosines R_J between A and J, R_B between A and B_k, q between B_k and B_k', and R_{B_kJ} between B_k and J]
MODEL (4/4)
• The student learns the K ensemble teachers in turn.
• Gradient method: at each step m the student's weights move against the gradient of the squared error between its output and the current teacher's output, with update amount f_k^m.
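A minimal noiseless sketch of one such gradient step (the variable names and the input scaling are assumptions, and the noise terms of the model are omitted):

```python
import math
import random

random.seed(3)
N = 100
eta = 0.3  # learning rate

# Hypothetical teacher B and student J; inputs scaled so that |x| is about 1.
B = [random.gauss(0.0, 1.0) for _ in range(N)]
J = [random.gauss(0.0, 1.0) for _ in range(N)]
x = [random.gauss(0.0, 1.0) / math.sqrt(N) for _ in range(N)]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

v = dot(B, x)         # teacher output (noise omitted in this sketch)
u_before = dot(J, x)  # student output before the update

# Gradient step on the squared error (1/2)*(v - u)^2 with respect to J:
#   J <- J + eta * (v - u) * x
J = [Ji + eta * (v - u_before) * xi for Ji, xi in zip(J, x)]
u_after = dot(J, x)

err_before = 0.5 * (v - u_before) ** 2
err_after = 0.5 * (v - u_after) ** 2
```

Because the error is quadratic in J, one step shrinks the error on the current example by the factor (1 − η|x|²)², a contraction for 0 < η < 2, which is consistent with the divergence condition in the steady-state analysis.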
GENERALIZATION ERROR
• A goal of statistical learning theory is to obtain the generalization error theoretically.
• Generalization error = mean of the error over the distribution of new inputs.
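This definition can be illustrated for noiseless linear perceptrons: the Monte Carlo mean of the squared error over fresh inputs should match the closed form implied by the order parameters l and R_J. The setup below, including the input scaling x_i ~ N(0, 1/N) and the vector names, is an illustrative assumption:

```python
import math
import random

random.seed(4)
N = 200
M = 4000  # number of fresh inputs for the Monte Carlo average

# Hypothetical noiseless linear perceptrons: true teacher A and student J.
A = [random.gauss(0.0, 1.0) for _ in range(N)]
J = [random.gauss(0.0, 1.0) for _ in range(N)]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Monte Carlo: mean of the squared error over the distribution of new inputs,
# with x_i ~ N(0, 1/N) so that the outputs are O(1).
total = 0.0
for _ in range(M):
    x = [random.gauss(0.0, 1.0) / math.sqrt(N) for _ in range(N)]
    total += 0.5 * (dot(A, x) - dot(J, x)) ** 2
eps_mc = total / M

# Closed form: E[((A - J) . x)^2] = |A - J|^2 / N, expressed via the
# order parameters (norm_A ~ 1 for the sampled A).
l = math.sqrt(dot(J, J) / N)                        # length of student
R_J = dot(A, J) / math.sqrt(dot(A, A) * dot(J, J))  # direction cosine
norm_A = dot(A, A) / N
eps_theory = 0.5 * (norm_A + l * l - 2.0 * math.sqrt(norm_A) * l * R_J)
```

The agreement between the sampled mean and the closed form is what lets the talk track the generalization error purely through the order parameters.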
Simultaneous differential equations in deterministic forms, which describe the dynamical behaviors of the order parameters
Analytical solutions of order parameters
Dynamical behaviors of generalization error, R_J and l
( η = 0.3, K = 3, R_B = 0.7, σ_A² = 0.0, σ_B² = 0.1, σ_J² = 0.2 )
[Figure: left panel shows the order parameters (l, R) vs. t = m/N; right panel shows the generalization errors of the student and of the ensemble teachers vs. t = m/N; curves for q = 1.00, 0.80, 0.60, 0.49]
Analytical solutions of order parameters
Steady state analysis ( t → ∞ )
• If η < 0 or η > 2: the generalization error and the length of the student diverge.
• If 0 < η < 2:
  – If η < 1, the more teachers there are, or the richer their diversity is, the cleverer the student can become.
  – If η > 1, the fewer teachers there are, or the poorer their diversity is, the cleverer the student can become.
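The η < 1 claim can be probed with a small finite-N simulation. This is a sketch, not the talk's analysis: the input scaling, the explicit construction of teachers at prescribed R_B and mutual q, and the restriction to the noise-free part of the error are my assumptions. With the deck's parameters (η = 0.3, K = 3, R_B = 0.7, σ_B² = 0.1, σ_J² = 0.2), a student trained by diverse teachers (q = R_B² = 0.49) should end up with a smaller error than one trained by identical teachers (q = 1):

```python
import math
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def gauss_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def normalized(w):
    # rescale to length sqrt(N) so overlaps match the target direction cosines
    scale = math.sqrt(len(w) / dot(w, w))
    return [wi * scale for wi in w]

# Finite-N sketch of the model: student J learns K noisy linear-perceptron
# teachers B_k in turn by the gradient method. Each teacher has direction
# cosine R_B with the true teacher A and mutual direction cosine q with the
# others (the construction below assumes q >= R_B**2). Returns a late-time
# average of the clean generalization error 0.5*(1 + l^2 - 2*l*R_J).
def run(q, K=3, R_B=0.7, eta=0.3, sig_B=math.sqrt(0.1), sig_J=math.sqrt(0.2),
        N=300, t_max=20):
    A = normalized(gauss_vec(N))
    c = normalized(gauss_vec(N))  # shared component: teachers' mutual overlap
    Bs = []
    for _ in range(K):
        d = normalized(gauss_vec(N))  # private component: teacher diversity
        Bs.append([R_B * ai + math.sqrt(q - R_B * R_B) * ci
                   + math.sqrt(1.0 - q) * di
                   for ai, ci, di in zip(A, c, d)])
    J = gauss_vec(N)
    steps = t_max * N
    tail = []
    for m in range(steps):
        B = Bs[m % K]  # the teachers are used in turn
        x = [random.gauss(0.0, 1.0) / math.sqrt(N) for _ in range(N)]
        u = dot(J, x) + random.gauss(0.0, sig_J)  # noisy student output
        v = dot(B, x) + random.gauss(0.0, sig_B)  # noisy teacher output
        J = [Ji + eta * (v - u) * xi for Ji, xi in zip(J, x)]
        if m >= steps // 2 and m % 30 == 0:  # sample the steady regime
            l = math.sqrt(dot(J, J) / N)
            R_J = dot(A, J) / math.sqrt(dot(A, A) * dot(J, J))
            tail.append(0.5 * (1.0 + l * l - 2.0 * l * R_J))
    return sum(tail) / len(tail)

random.seed(5)
err_rich = run(q=0.49)  # rich diversity: q = R_B^2, teachers spread around A
err_same = run(q=1.00)  # no diversity: all K teachers are identical
```

At finite N the result is only qualitative, but with η = 0.3 < 1 the diverse-teacher run should show the smaller steady error, in line with the statement above.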
Steady value of generalization error, R_J and l
( K = 3, R_B = 0.7, σ_A² = 0.0, σ_B² = 0.1, σ_J² = 0.2 )
[Figure: R and the generalization error vs. learning rate η, for q = 1.00, 0.80, 0.60, 0.49]
Steady value of generalization error, R_J and l
( q = 0.49, R_B = 0.7, σ_A² = 0.0, σ_B² = 0.1, σ_J² = 0.2 )
[Figure: R and the generalization error vs. learning rate η, for K = 1, 3, 10, 30]
CONCLUSIONS
We have analyzed the generalization performance of a student in a model composed of linear perceptrons: a true teacher, K ensemble teachers, and the student.
Calculating the student's generalization error analytically, using statistical mechanics in the framework of online learning, we have proven that when the learning rate satisfies η < 1, the larger the number K and the greater the diversity of the teachers, the smaller the generalization error becomes. When η > 1, these properties are completely reversed.
If the diversity of the K teachers is rich enough, the direction cosine between the true teacher and the student approaches unity in the limit of η → 0 and K → ∞.