A Statistical Mechanical Analysis of Online Learning
Seiji MIYOSHI, Kobe City College of Technology
Background (1)
• Batch Learning
  – Examples are used repeatedly
  – Correct answers for all examples
  – Long time
  – Large memory
• Online Learning
  – Examples used once are discarded
  – Cannot give correct answers for all examples
  – Large memory isn't necessary
  – Time-variant teacher
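The contrast between the two regimes can be sketched with a toy 1-D linear teacher (entirely illustrative and not part of the talk; the teacher y = 2x, the learning rates, and the function names are made up):

```python
import random

random.seed(0)

# Toy teacher: y = 2.0 * x. Both learners estimate the weight w.
TRUE_W = 2.0

def make_example():
    x = random.gauss(0.0, 1.0)
    return x, TRUE_W * x

# Batch learning: all examples are stored and reused over many epochs.
def batch_learn(n_examples=50, epochs=100, eta=0.5):
    data = [make_example() for _ in range(n_examples)]  # large memory
    w = 0.0
    for _ in range(epochs):                             # examples used repeatedly
        grad = sum((y - w * x) * x for x, y in data) / n_examples
        w += eta * grad
    return w

# Online learning: each example is used once and then discarded.
def online_learn(n_examples=500, eta=0.1):
    w = 0.0
    for _ in range(n_examples):
        x, y = make_example()       # fresh example
        w += eta * (y - w * x) * x  # single update; the example is then dropped
    return w

w_batch = batch_learn()
w_online = online_learn()
```

Both learners converge here; the point is the resource trade-off: the batch learner must store all examples and sweep them repeatedly, while the online learner keeps nothing but the current weight, which is also what lets it track a time-variant teacher.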
Can Student be more Clever than Teacher?
Jan. 2006
[Figure: true teacher A, moving teacher B, and student J]
Many Teachers or Few Teachers?
[Figure: true teacher A, ensemble teachers B_1, …, B_k, B_k', …, B_K, and student J]
PURPOSE
• To analyze the generalization performance of a model composed of a student, a true teacher, and K teachers (ensemble teachers) who exist around the true teacher
• To discuss the relationship between the number and diversity of the ensemble teachers and the generalization error
MODEL (1/4)
• The student J learns B_1, B_2, … in turn.
• J cannot learn the true teacher A directly.
• A, B_1, B_2, …, and J are linear perceptrons with noise.
Simple Perceptron
• Inputs: x_1, …, x_N
• Connection weights: J_1, …, J_N
• Output = sgn( Σ_{i=1}^N J_i x_i ), taking the value +1 or −1
Linear Perceptron
• Same inputs x_1, …, x_N and connection weights J_1, …, J_N as the simple perceptron
• Simple perceptron: Output = sgn( Σ_{i=1}^N J_i x_i )
• Linear perceptron: Output = Σ_{i=1}^N J_i x_i
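The two output rules can be sketched directly (variable names and sizes are illustrative):

```python
import random

random.seed(1)

N = 5
J = [random.gauss(0.0, 1.0) for _ in range(N)]  # connection weights J_1..J_N
x = [random.gauss(0.0, 1.0) for _ in range(N)]  # inputs x_1..x_N

# Linear perceptron: the output is the weighted sum itself.
def linear_output(J, x):
    return sum(Ji * xi for Ji, xi in zip(J, x))

# Simple perceptron: the output is the sign of the weighted sum (+1 or -1).
def simple_output(J, x):
    return 1 if linear_output(J, x) >= 0 else -1

u = linear_output(J, x)  # real-valued
s = simple_output(J, x)  # +1 or -1
```

The linear perceptron keeps the real-valued sum, which is what makes the model of this talk analytically tractable.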
MODEL (2/4)
• A, B_1, …, B_K, and J are linear perceptrons with noise: each output is the weighted sum of the inputs plus additive noise (variances σ_A², σ_B², and σ_J², respectively).
MODEL (3/4)
• Inputs: x
• Initial value of student: J
• True teacher: A
• Ensemble teachers: B_1, …, B_K
• N → ∞ (thermodynamic limit)
• Order parameters
  – Length of student: l
  – Direction cosines
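These order parameters can be sketched numerically (the vectors below are hypothetical, and the common normalization in which teacher weight vectors have length √N is an assumption of this sketch):

```python
import math
import random

random.seed(2)
N = 2000  # a large N stands in for the thermodynamic limit N -> infinity

# Hypothetical weight vectors for the true teacher A and the student J.
A = [random.gauss(0.0, 1.0) for _ in range(N)]
J = [random.gauss(0.0, 1.0) for _ in range(N)]

def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))

# Length of the student: l = |J| / sqrt(N)
# (teachers normalized to length sqrt(N) then have length parameter 1).
l = norm(J) / math.sqrt(N)

# Direction cosine between two weight vectors, e.g. R_J between A and J.
def direction_cosine(u, v):
    return sum(ui * vi for ui, vi in zip(u, v)) / (norm(u) * norm(v))

R_J = direction_cosine(A, J)
```

For independent random vectors the direction cosine is near 0; learning drives R_J toward 1 as the student aligns with the true teacher.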
[Figure: order parameters of the model — direction cosines R_J between A and J, R_B between A and B_k, q between B_k and B_k', and R_{B_kJ} between B_k and J]
MODEL (4/4)
• The student learns the K ensemble teachers in turn.
• Gradient method: at each step m the student's weights move against the gradient of the squared error between its output and the current teacher's output, with update amount f_k^m.
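A minimal noiseless sketch of one such gradient step (the variable names and the input scaling are assumptions, and the noise terms of the model are omitted):

```python
import math
import random

random.seed(3)
N = 100
eta = 0.3  # learning rate

# Hypothetical teacher B and student J; inputs scaled so that |x| is about 1.
B = [random.gauss(0.0, 1.0) for _ in range(N)]
J = [random.gauss(0.0, 1.0) for _ in range(N)]
x = [random.gauss(0.0, 1.0) / math.sqrt(N) for _ in range(N)]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

v = dot(B, x)         # teacher output (noise omitted in this sketch)
u_before = dot(J, x)  # student output before the update

# Gradient step on the squared error (1/2)*(v - u)^2 with respect to J:
#   J <- J + eta * (v - u) * x
J = [Ji + eta * (v - u_before) * xi for Ji, xi in zip(J, x)]
u_after = dot(J, x)

err_before = 0.5 * (v - u_before) ** 2
err_after = 0.5 * (v - u_after) ** 2
```

Because the error is quadratic in J, one step shrinks the error on the current example by the factor (1 − η|x|²)², a contraction for 0 < η < 2, which is consistent with the divergence condition in the steady-state analysis.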
GENERALIZATION ERROR
• A goal of statistical learning theory is to obtain the generalization error theoretically.
• Generalization error = mean of the error over the distribution of new inputs.
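This definition can be illustrated for noiseless linear perceptrons: the Monte Carlo mean of the squared error over fresh inputs should match the closed form implied by the order parameters l and R_J. The setup below, including the input scaling x_i ~ N(0, 1/N) and the vector names, is an illustrative assumption:

```python
import math
import random

random.seed(4)
N = 200
M = 4000  # number of fresh inputs for the Monte Carlo average

# Hypothetical noiseless linear perceptrons: true teacher A and student J.
A = [random.gauss(0.0, 1.0) for _ in range(N)]
J = [random.gauss(0.0, 1.0) for _ in range(N)]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Monte Carlo: mean of the squared error over the distribution of new inputs,
# with x_i ~ N(0, 1/N) so that the outputs are O(1).
total = 0.0
for _ in range(M):
    x = [random.gauss(0.0, 1.0) / math.sqrt(N) for _ in range(N)]
    total += 0.5 * (dot(A, x) - dot(J, x)) ** 2
eps_mc = total / M

# Closed form: E[((A - J) . x)^2] = |A - J|^2 / N, expressed via the
# order parameters (norm_A ~ 1 for the sampled A).
l = math.sqrt(dot(J, J) / N)                        # length of student
R_J = dot(A, J) / math.sqrt(dot(A, A) * dot(J, J))  # direction cosine
norm_A = dot(A, A) / N
eps_theory = 0.5 * (norm_A + l * l - 2.0 * math.sqrt(norm_A) * l * R_J)
```

The agreement between the sampled mean and the closed form is what lets the talk track the generalization error purely through the order parameters.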
Simultaneous differential equations in deterministic forms, which describe the dynamical behaviors of the order parameters
Analytical solutions of order parameters
Dynamical behaviors of generalization error, R_J and l
( η = 0.3, K = 3, R_B = 0.7, σ_A² = 0.0, σ_B² = 0.1, σ_J² = 0.2 )
[Figure: left panel shows the order parameters (l, R) vs. t = m/N; right panel shows the generalization errors of the student and of the ensemble teachers vs. t = m/N; curves for q = 1.00, 0.80, 0.60, 0.49]
Analytical solutions of order parameters
Steady state analysis ( t → ∞ )
• If η < 0 or η > 2: the generalization error and the length of the student diverge.
• If 0 < η < 2:
  – If η < 1, the more teachers there are, or the richer their diversity is, the cleverer the student can become.
  – If η > 1, the fewer teachers there are, or the poorer their diversity is, the cleverer the student can become.
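The η < 1 claim can be probed with a small finite-N simulation. This is a sketch, not the talk's analysis: the input scaling, the explicit construction of teachers at prescribed R_B and mutual q, and the restriction to the noise-free part of the error are my assumptions. With the deck's parameters (η = 0.3, K = 3, R_B = 0.7, σ_B² = 0.1, σ_J² = 0.2), a student trained by diverse teachers (q = R_B² = 0.49) should end up with a smaller error than one trained by identical teachers (q = 1):

```python
import math
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def gauss_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def normalized(w):
    # rescale to length sqrt(N) so overlaps match the target direction cosines
    scale = math.sqrt(len(w) / dot(w, w))
    return [wi * scale for wi in w]

# Finite-N sketch of the model: student J learns K noisy linear-perceptron
# teachers B_k in turn by the gradient method. Each teacher has direction
# cosine R_B with the true teacher A and mutual direction cosine q with the
# others (the construction below assumes q >= R_B**2). Returns a late-time
# average of the clean generalization error 0.5*(1 + l^2 - 2*l*R_J).
def run(q, K=3, R_B=0.7, eta=0.3, sig_B=math.sqrt(0.1), sig_J=math.sqrt(0.2),
        N=300, t_max=20):
    A = normalized(gauss_vec(N))
    c = normalized(gauss_vec(N))  # shared component: teachers' mutual overlap
    Bs = []
    for _ in range(K):
        d = normalized(gauss_vec(N))  # private component: teacher diversity
        Bs.append([R_B * ai + math.sqrt(q - R_B * R_B) * ci
                   + math.sqrt(1.0 - q) * di
                   for ai, ci, di in zip(A, c, d)])
    J = gauss_vec(N)
    steps = t_max * N
    tail = []
    for m in range(steps):
        B = Bs[m % K]  # the teachers are used in turn
        x = [random.gauss(0.0, 1.0) / math.sqrt(N) for _ in range(N)]
        u = dot(J, x) + random.gauss(0.0, sig_J)  # noisy student output
        v = dot(B, x) + random.gauss(0.0, sig_B)  # noisy teacher output
        J = [Ji + eta * (v - u) * xi for Ji, xi in zip(J, x)]
        if m >= steps // 2 and m % 30 == 0:  # sample the steady regime
            l = math.sqrt(dot(J, J) / N)
            R_J = dot(A, J) / math.sqrt(dot(A, A) * dot(J, J))
            tail.append(0.5 * (1.0 + l * l - 2.0 * l * R_J))
    return sum(tail) / len(tail)

random.seed(5)
err_rich = run(q=0.49)  # rich diversity: q = R_B^2, teachers spread around A
err_same = run(q=1.00)  # no diversity: all K teachers are identical
```

At finite N the result is only qualitative, but with η = 0.3 < 1 the diverse-teacher run should show the smaller steady error, in line with the statement above.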
Steady value of generalization error, R_J and l
( K = 3, R_B = 0.7, σ_A² = 0.0, σ_B² = 0.1, σ_J² = 0.2 )
[Figure: R and the generalization error vs. learning rate η, for q = 1.00, 0.80, 0.60, 0.49]
Steady value of generalization error, R_J and l
( q = 0.49, R_B = 0.7, σ_A² = 0.0, σ_B² = 0.1, σ_J² = 0.2 )
[Figure: R and the generalization error vs. learning rate η, for K = 1, 3, 10, 30]
CONCLUSIONS
We have analyzed the generalization performance of a student in a model composed of linear perceptrons: a true teacher, K ensemble teachers, and the student.
Calculating the student's generalization error analytically, using statistical mechanics in the framework of online learning, we have proven that when the learning rate satisfies η < 1, the larger the number K and the greater the diversity of the teachers, the smaller the generalization error becomes. When η > 1, these properties are completely reversed.
If the diversity of the K teachers is rich enough, the direction cosine between the true teacher and the student approaches unity in the limit of η → 0 and K → ∞.