Learning from Infinite Training Examples
3.18.2009, 3.19.2009
Prepared for NKU and NUTN seminars
Presenter: Chun-Nan Hsu (許鈞南)
Institute of Information Science, Academia Sinica, Taipei, Taiwan
The Ever Growing Web (Zhuangzi, ca. 400 BC)
Human life is finite, but knowledge is infinite. Following the infinite with the finite is doomed to fail.
人之生也有涯,而知也無涯。以有涯隨無涯,殆矣。
Zhuangzi (莊子), ca. 400 BC
Analogously…
Computing power is finite, but the Web is infinite. Mining the infinite Web with finite computing power…
is doomed to fail?
Other “holy grails” in Artificial Intelligence
Learning to understand natural languages
Learning to recognize millions of objects in computer vision
Speech recognition in noisy environments, such as in a car
On-Line Learning vs. Off-Line Learning
Nothing to do with human learning by browsing the web
Definition: given a set of new training data, an on-line learner can update its model without re-reading old data while improving its performance.
By contrast, an off-line learner must combine the old and new data and start learning all over again; otherwise its performance will suffer.
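As a toy illustration of this distinction (my own sketch, not from the talk), a perceptron-style learner can absorb a new batch of examples without revisiting anything it saw before; the function name and data layout below are illustrative assumptions.

```python
import numpy as np

def online_update(w, new_batch):
    """Perceptron-style on-line update: consumes only the new examples.

    w         : current weight vector (the model learned so far)
    new_batch : iterable of (x, y) pairs with y in {-1, +1}
    Old data is never revisited; the returned w is the updated model.
    """
    for x, y in new_batch:
        if y * np.dot(w, x) <= 0:   # misclassified -> adjust the weights
            w = w + y * x
    return w

# An off-line learner would instead have to refit on old_data + new_batch from scratch.
```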
Off-Line Learning
Nearly all popular ML algorithms are off-line today
They scan the training examples iteratively for many passes until an objective function is minimized
For example: the SMO algorithm for SVMs, the L-BFGS algorithm for CRFs, the EM algorithm for HMMs and GMMs, etc.
Why on-line learning?
Single-pass on-line learning
The key for on-line learning to win is to achieve satisfactory performance right after scanning the new training examples in a single pass.
Previous work on on-line learning
Perceptron: Rosenblatt 1957
Stochastic Gradient Descent: Widrow & Hoff 1960
Bregman Divergence: Azoury & Warmuth 2001
MIRA (large margin): Crammer & Singer 2003
LaRank: Bordes & Bottou 2005, 2007
EG (Exponentiated Gradient): Collins, Bartlett et al. 2008
Stochastic Gradient Descent (SGD)
Learning is to minimize a loss function $L(\theta; D)$ given training examples $D$; at the optimum, the gradient $g(\theta) = \nabla L(\theta; D)$ vanishes: $g(\theta) = 0$.
SGD updates the parameters using the gradient computed on a single example (or small batch) $B$:
$$\theta^{(t+1)} = \theta^{(t)} - \eta^{(t)}\,\nabla L(\theta^{(t)}; B)$$
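A minimal sketch of the update rule above (not from the slides); `grad_fn` is an assumed user-supplied per-example gradient and `eta` a fixed step size, which the following slides replace with an adaptive, second-order one.

```python
import numpy as np

def sgd(grad_fn, theta0, examples, eta=0.01):
    """Plain SGD: one parameter update per training example.

    grad_fn(theta, example) returns the gradient of the per-example loss.
    """
    theta = np.asarray(theta0, dtype=float)
    for example in examples:
        # theta^(t+1) = theta^(t) - eta * grad L(theta^(t); example)
        theta = theta - eta * grad_fn(theta, example)
    return theta
```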
Optimal Step Size (Benveniste et al. 1990, Murata et al. 1998)
Solving $\nabla L = 0$ by Newton's method gives the update
$$\theta^{(t+1)} = \theta^{(t)} - H^{-1}\,\nabla L(\theta^{(t)}; D).$$
The step size is asymptotically optimal if it approaches
$$\eta^{(t)} \to \frac{1}{t}H^{-1},$$
where $H$ is the Hessian of the loss.
Single-Pass Result (Bottou & LeCun 2004)
Optimum for n+1 examples is a Newton step away from the optimum for n examples
$$\theta^{*}_{n+1} = \theta^{*}_{n} - \frac{1}{n+1}\,H^{-1}\,\nabla L(\theta^{*}_{n}; B_{n+1}) + O\!\left(\frac{1}{n^{2}}\right),$$
where $\theta^{*}_{n}$ and $\theta^{*}_{n+1}$ are the minimizers of $L(\theta; D_{n})$ and $L(\theta; D_{n+1})$, respectively.
2nd Order SGD
2nd-order SGD (2SGD): adjust the step size so that it approaches the scaled inverse Hessian $\frac{1}{t}H^{-1}$.
Good News: from previous work, given a sufficiently large number of training examples, 2SGD achieves the empirical optimum in a single pass!
Bad News: it is prohibitively expensive to compute $H^{-1}$.
E.g., with 10K features, $H$ is a 10K-by-10K matrix, i.e., a 100M-element floating-point array.
How about 1M features?
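A quick back-of-the-envelope check of that memory cost (my own arithmetic, assuming dense storage in double precision):

```python
def hessian_bytes(n_features, bytes_per_float=8):
    """Memory needed to store a dense n x n Hessian."""
    return n_features ** 2 * bytes_per_float

print(hessian_bytes(10_000) / 1e9)      # 0.8  -> roughly 0.8 GB for 10K features
print(hessian_bytes(1_000_000) / 1e12)  # 8.0  -> roughly 8 TB for 1M features
```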
Approximating Jacobian (Aitken 1925, Schafer 1997)
Learning algorithms can be considered as fixed-point iterations with mapping $\theta = M(\theta)$.
Taylor expansion gives
$$\theta^{(t+1)} = M(\theta^{(t)}) \approx M(\theta^{*}) + J\,(\theta^{(t)} - \theta^{*}) = \theta^{*} + J\,(\theta^{(t)} - \theta^{*}).$$
The eigenvalues of $J$ can be approximated component-wise by the ratio of successive differences:
$$\gamma_{i} \approx \frac{\theta_{i}^{(t+2)} - \theta_{i}^{(t+1)}}{\theta_{i}^{(t+1)} - \theta_{i}^{(t)}}.$$
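A tiny numeric illustration of this ratio (my own example, not from the slides) on a deterministic linear fixed-point map whose Jacobian eigenvalue is known exactly:

```python
# Fixed-point iteration x = M(x) = 0.5*x + 1 has fixed point x* = 2 and J = 0.5.
def M(x):
    return 0.5 * x + 1.0

x0 = 10.0
x1, x2, x3 = M(x0), M(M(x0)), M(M(M(x0)))
gamma = (x3 - x2) / (x2 - x1)   # ratio of successive differences
print(gamma)                    # 0.5, recovering the eigenvalue of J exactly
```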
Approximating Hessian
Consider the SGD mapping as a fixed-point iteration, too:
$$M(\theta^{(t)}) = \theta^{(t)} - \eta\,\nabla L(\theta^{(t)}; B).$$
Since $J = M' = I - \eta H$, we have $\mathrm{eig}(J) = \mathrm{eig}(M') = \mathrm{eig}(I - \eta H)$; therefore, since $H$ is symmetric, $\mathrm{eig}(J) = 1 - \eta\,\mathrm{eig}(H)$ and
$$\mathrm{eig}(H^{-1}) = \frac{\eta}{1 - \mathrm{eig}(J)} = \frac{\eta}{1 - \gamma}.$$
Estimating Eigenvalue Periodically
Since the mapping of SGD is stochastic, estimating the eigenvalues at each iteration may yield inaccurate estimates.
To make the mapping more stationary, we use the composition of $b$ consecutive mappings, $M^{b}(\theta) = M(M(\cdots M(\theta)\cdots))$.
By the law of large numbers, $M^{b}$ will be less "stochastic".
Applying the same approximation to $M^{b}$, we can estimate $\mathrm{eig}(J^{b})$ by
$$\gamma_{i}^{b} \approx \frac{\theta_{i}^{(t+2b)} - \theta_{i}^{(t+b)}}{\theta_{i}^{(t+b)} - \theta_{i}^{(t)}}.$$
The PSA algorithm (Huang, Chang & Hsu 2007)
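The original slide presumably showed the algorithm's pseudocode, which is lost here. Below is only a minimal sketch of the periodic step-size adaptation idea built from the preceding slides, not the authors' exact PSA; the function name, the snapshot bookkeeping, the signed b-th root, and the clipping constant `gamma_max` are my illustrative assumptions.

```python
import numpy as np

def psa_sgd_sketch(grad_fn, theta0, examples, eta0=0.1, b=10, gamma_max=0.9):
    """Single-pass SGD with periodic, per-coordinate step-size adaptation.

    Runs fixed-rate SGD and, every 2*b updates, estimates per-coordinate
    eigenvalue ratios from snapshots of theta at t, t+b, t+2b, then rescales
    the step sizes toward eta / (1 - eig(J)) ~ eig(H^-1).
    """
    theta = np.asarray(theta0, dtype=float)
    eta = np.full_like(theta, eta0)          # per-coordinate step sizes
    snapshots = [theta.copy()]               # theta at the start of the period
    for step, example in enumerate(examples, start=1):
        theta = theta - eta * grad_fn(theta, example)   # fixed-rate SGD update
        if step % b == 0:
            snapshots.append(theta.copy())
            if len(snapshots) == 3:          # have theta at t, t+b, t+2b
                d1 = snapshots[1] - snapshots[0]
                d2 = snapshots[2] - snapshots[1]
                gamma_b = np.divide(d2, d1, out=np.zeros_like(d1),
                                    where=np.abs(d1) > 1e-12)   # ~ eig(J^b)
                # heuristic signed b-th root to get back to eig(J)
                gamma = np.sign(gamma_b) * np.abs(gamma_b) ** (1.0 / b)
                gamma = np.clip(gamma, -gamma_max, gamma_max)   # keep 1 - gamma > 0
                eta = eta / (1.0 - gamma)    # approximate Newton-style rescaling
                snapshots = [theta.copy()]   # start the next measurement period
    return theta
```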
Experimental Results
Conditional Random Fields (CRF) (Lafferty et al. 2001)
Sequence labeling problems – gene mention tagging
Conditional Random Fields
In effect, a CRF encodes a probabilistic rule-based system with rules of the form:
If $f_{j_1}(X,Y)$ & $f_{j_2}(X,Y)$ & … & $f_{j_n}(X,Y)$ are non-zero,
then the labels of the sequence are $Y$,
with score $P(Y|X)$.
If we have $d$ features and consider a context window of width $w$, then an order-1 CRF encodes on the order of
$$2\,d\,w\,|x|\,|y|^{2}$$
rules.
Tasks and Setups
CoNLL 2000 base NP: tag noun phrases; 8,936 training, 2,012 test; 3 tags, 1,015,662 features
CoNLL 2000 chunking: tag 11 chunk types; 8,936 training, 2,012 test; 23 tags, 7,448,606 features
BioNLP/NLPBA 2004: tag 5 types of bio-entities (e.g., gene, protein, cell lines); 18,546 training, 3,856 test; 5,977,675 features
BioCreative 2: tag gene names; 15,000 training, 5,000 test; 10,242,972 features
Performance measure: F-score
$$F = \frac{2pr}{p + r}, \qquad p = \frac{TP}{TP + FP}, \qquad r = \frac{TP}{TP + FN}$$
Feature types for BioCreative 2
O(22M) rules are encoded in our CRF model!
Convergence performance (plots): CoNLL 2000 base NP; CoNLL chunking; BioNLP/NLPBA 2004; BioCreative 2
Execution time (first pass):
CoNLL 2000 base NP: 23.74 sec
CoNLL chunking: 196.44 sec
BioNLP/NLPBA 2004: 287.48 sec
BioCreative 2: 394.04 sec
Experimental results for linear SVM and convolutional neural net
Data sets
Linear SVM
Convolutional Neural Net (5 layers)
Layer trick: step sizes in the lower layers should be larger than in the higher layers.
Mini-conclusion: Single-Pass
By approximating the Jacobian, we can approximate the Hessian, too.
By approximating the Hessian, we can achieve near-optimal single-pass performance in practice.
With a single-pass on-line learner, virtually infinitely many training examples can be used.
PSA is a member of the family of "discretized Newton methods".
Other well-known members include the secant method (a.k.a. Quickprop) and Steffensen's method (a.k.a. Triple Jump).
The general form of these methods is
$$\theta^{(t+1)} = \theta^{(t)} - A[\theta^{(t)}, h]^{-1}\, g(\theta^{(t)}),$$
where $A$ is a matrix designed to approximate the Hessian without actually computing the derivative.
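As a concrete 1-D instance of this general form (my own illustration, not from the slides), the secant method replaces the derivative in Newton's iteration with a finite difference built from the two most recent iterates:

```python
def secant(g, x0, x1, tol=1e-10, max_iter=100):
    """Secant method for g(x) = 0: a discretized Newton iteration in which
    g'(x) is approximated by (g(x1) - g(x0)) / (x1 - x0)."""
    for _ in range(max_iter):
        g0, g1 = g(x0), g(x1)
        if g1 == g0:                              # avoid division by zero
            break
        x2 = x1 - g1 * (x1 - x0) / (g1 - g0)      # Newton step with approximated slope
        if abs(x2 - x1) < tol:
            return x2
        x0, x1 = x1, x2
    return x1

# Example: the root of g(x) = x**2 - 2 is sqrt(2) ~ 1.4142135...
print(secant(lambda x: x * x - 2.0, 1.0, 2.0))
```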
Analysis of PSA
PSA
PSA is neither the secant method nor Steffensen's method.
PSA iterates a 2b-step "parallel chord" method (i.e., fixed-rate SGD) followed by an approximated Newton step.
The off-line 2-step parallel chord method is known to have order-4 convergence.
Convergence analysis of PSA
Are we there yet?
With single-pass on-line learning, we can learn from infinite training examples now, at least in theory
A cheaper, quicker method to annotate labels for training examples
Plus a lot of computers…
Human life is finite, but knowledge is infinite.
Learning from infinite examples by applying PSA to 2nd-order SGD is a good idea!
Thank you for your attention!
http://aiia.iis.sinica.edu.tw
http://chunnan.iis.sinica.edu.tw/~chunnan
This research is supported mostly by NRPGM’s advanced bioinformatics core facility grant 2005-2011.