Learning from Infinite Training Examples
3.18.2009, 3.19.2009
Prepared for NKU and NUTN seminars
Presenter: Chun-Nan Hsu (許鈞南)
Institute of Information Science, Academia Sinica, Taipei, Taiwan
The Ever Growing Web (Zhuangzi, ca. 400 BC)
Human life is finite, but knowledge is infinite. Following the infinite with the finite is doomed to fail.
人之生也有涯,而知也無涯。以有涯隨無涯,殆矣。
Zhuangzi (莊子), ca. 400 BC
Analogously…
Computing power is finite, but the Web is infinite. Mining the infinite Web with finite computing power…
is doomed to fail?
Other “holy grails” in Artificial Intelligence
Learning to understand natural languages
Learning to recognize millions of objects in computer vision
Speech recognition in noisy environments, such as in a car
On-Line Learning vs. Off-Line Learning
Nothing to do with human learning by browsing the web
Definition: given a set of new training data, an on-line learner can update its model without re-reading old data while improving its performance.
By contrast, an off-line learner must combine the old and new data and start learning all over again; otherwise its performance will suffer.
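As a toy illustration of this distinction (my own sketch, not from the talk), a perceptron-style learner can absorb a new batch of examples without revisiting anything it saw before; the function name and data layout below are illustrative assumptions.

```python
import numpy as np

def online_update(w, new_batch):
    """Perceptron-style on-line update: consumes only the new examples.

    w         : current weight vector (the model learned so far)
    new_batch : iterable of (x, y) pairs with y in {-1, +1}
    Old data is never revisited; the returned w is the updated model.
    """
    for x, y in new_batch:
        if y * np.dot(w, x) <= 0:   # misclassified -> adjust the weights
            w = w + y * x
    return w

# An off-line learner would instead have to refit on old_data + new_batch from scratch.
```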
Off-Line Learning
Nearly all popular ML algorithms are off-line today
They scan the training examples iteratively for many passes until an objective function is minimized
For example: the SMO algorithm for SVMs, the L-BFGS algorithm for CRFs, the EM algorithm for HMMs and GMMs, etc.
Why on-line learning?
Single-pass on-line learning
The key for on-line learning to win is to achieve satisfactory performance right after scanning the new training examples in a single pass.
Previous work on on-line learning
Perceptron: Rosenblatt 1957
Stochastic Gradient Descent: Widrow & Hoff 1960
Bregman Divergence: Azoury & Warmuth 2001
MIRA (large margin): Crammer & Singer 2003
LaRank: Bordes & Bottou 2005, 2007
EG (Exponentiated Gradient): Collins, Bartlett et al. 2008
Stochastic Gradient Descent (SGD)
Learning is to minimize a loss function $L(\theta; D)$ given training examples $D$; at the optimum, the gradient $g(\theta) = \nabla L(\theta; D)$ vanishes: $g(\theta) = 0$.
SGD updates the parameters using the gradient computed on a single example (or small batch) $B$:
$$\theta^{(t+1)} = \theta^{(t)} - \eta^{(t)}\,\nabla L(\theta^{(t)}; B)$$
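A minimal sketch of the update rule above (not from the slides); `grad_fn` is an assumed user-supplied per-example gradient and `eta` a fixed step size, which the following slides replace with an adaptive, second-order one.

```python
import numpy as np

def sgd(grad_fn, theta0, examples, eta=0.01):
    """Plain SGD: one parameter update per training example.

    grad_fn(theta, example) returns the gradient of the per-example loss.
    """
    theta = np.asarray(theta0, dtype=float)
    for example in examples:
        # theta^(t+1) = theta^(t) - eta * grad L(theta^(t); example)
        theta = theta - eta * grad_fn(theta, example)
    return theta
```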
Optimal Step Size (Benveniste et al. 1990, Murata et al. 1998)
Solving $\nabla L = 0$ by Newton's method gives the update
$$\theta^{(t+1)} = \theta^{(t)} - H^{-1}\,\nabla L(\theta^{(t)}; D).$$
The step size is asymptotically optimal if it approaches
$$\eta^{(t)} \to \frac{1}{t}H^{-1},$$
where $H$ is the Hessian of the loss.
Single-Pass Result (Bottou & LeCun 2004)
Optimum for n+1 examples is a Newton step away from the optimum for n examples
$$\theta^{*}_{n+1} = \theta^{*}_{n} - \frac{1}{n+1}\,H^{-1}\,\nabla L(\theta^{*}_{n}; B_{n+1}) + O\!\left(\frac{1}{n^{2}}\right),$$
where $\theta^{*}_{n}$ and $\theta^{*}_{n+1}$ are the minimizers of $L(\theta; D_{n})$ and $L(\theta; D_{n+1})$, respectively.
2nd Order SGD
2nd-order SGD (2SGD): adjust the step size so that it approaches the scaled inverse Hessian $\frac{1}{t}H^{-1}$.
Good News: from previous work, given a sufficiently large number of training examples, 2SGD achieves the empirical optimum in a single pass!
Bad News: it is prohibitively expensive to compute $H^{-1}$.
E.g., with 10K features, $H$ is a 10K-by-10K matrix, i.e., a 100M-element floating-point array.
How about 1M features?
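A quick back-of-the-envelope check of that memory cost (my own arithmetic, assuming dense storage in double precision):

```python
def hessian_bytes(n_features, bytes_per_float=8):
    """Memory needed to store a dense n x n Hessian."""
    return n_features ** 2 * bytes_per_float

print(hessian_bytes(10_000) / 1e9)      # 0.8  -> roughly 0.8 GB for 10K features
print(hessian_bytes(1_000_000) / 1e12)  # 8.0  -> roughly 8 TB for 1M features
```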
Approximating Jacobian (Aitken 1925, Schafer 1997)
Learning algorithms can be considered as fixed-point iterations with mapping $\theta = M(\theta)$.
Taylor expansion gives
$$\theta^{(t+1)} = M(\theta^{(t)}) \approx M(\theta^{*}) + J\,(\theta^{(t)} - \theta^{*}) = \theta^{*} + J\,(\theta^{(t)} - \theta^{*}).$$
The eigenvalues of $J$ can be approximated component-wise by the ratio of successive differences:
$$\gamma_{i} \approx \frac{\theta_{i}^{(t+2)} - \theta_{i}^{(t+1)}}{\theta_{i}^{(t+1)} - \theta_{i}^{(t)}}.$$
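A tiny numeric illustration of this ratio (my own example, not from the slides) on a deterministic linear fixed-point map whose Jacobian eigenvalue is known exactly:

```python
# Fixed-point iteration x = M(x) = 0.5*x + 1 has fixed point x* = 2 and J = 0.5.
def M(x):
    return 0.5 * x + 1.0

x0 = 10.0
x1, x2, x3 = M(x0), M(M(x0)), M(M(M(x0)))
gamma = (x3 - x2) / (x2 - x1)   # ratio of successive differences
print(gamma)                    # 0.5, recovering the eigenvalue of J exactly
```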
Approximating Hessian
Consider the SGD mapping as a fixed-point iteration, too:
$$M(\theta^{(t)}) = \theta^{(t)} - \eta\,\nabla L(\theta^{(t)}; B).$$
Since $J = M' = I - \eta H$, we have $\mathrm{eig}(J) = \mathrm{eig}(M') = \mathrm{eig}(I - \eta H)$; therefore, since $H$ is symmetric, $\mathrm{eig}(J) = 1 - \eta\,\mathrm{eig}(H)$ and
$$\mathrm{eig}(H^{-1}) = \frac{\eta}{1 - \mathrm{eig}(J)} = \frac{\eta}{1 - \gamma}.$$
Estimating Eigenvalue Periodically
Since the mapping of SGD is stochastic, estimating the eigenvalues at each iteration may yield inaccurate estimates.
To make the mapping more stationary, we use the composition of $b$ consecutive mappings, $M^{b}(\theta) = M(M(\cdots M(\theta)\cdots))$.
By the law of large numbers, $M^{b}$ will be less "stochastic".
Applying the same approximation to $M^{b}$, we can estimate $\mathrm{eig}(J^{b})$ by
$$\gamma_{i}^{b} \approx \frac{\theta_{i}^{(t+2b)} - \theta_{i}^{(t+b)}}{\theta_{i}^{(t+b)} - \theta_{i}^{(t)}}.$$
The PSA algorithm (Huang, Chang & Hsu 2007)
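The original slide presumably showed the algorithm's pseudocode, which is lost here. Below is only a minimal sketch of the periodic step-size adaptation idea built from the preceding slides, not the authors' exact PSA; the function name, the snapshot bookkeeping, the signed b-th root, and the clipping constant `gamma_max` are my illustrative assumptions.

```python
import numpy as np

def psa_sgd_sketch(grad_fn, theta0, examples, eta0=0.1, b=10, gamma_max=0.9):
    """Single-pass SGD with periodic, per-coordinate step-size adaptation.

    Runs fixed-rate SGD and, every 2*b updates, estimates per-coordinate
    eigenvalue ratios from snapshots of theta at t, t+b, t+2b, then rescales
    the step sizes toward eta / (1 - eig(J)) ~ eig(H^-1).
    """
    theta = np.asarray(theta0, dtype=float)
    eta = np.full_like(theta, eta0)          # per-coordinate step sizes
    snapshots = [theta.copy()]               # theta at the start of the period
    for step, example in enumerate(examples, start=1):
        theta = theta - eta * grad_fn(theta, example)   # fixed-rate SGD update
        if step % b == 0:
            snapshots.append(theta.copy())
            if len(snapshots) == 3:          # have theta at t, t+b, t+2b
                d1 = snapshots[1] - snapshots[0]
                d2 = snapshots[2] - snapshots[1]
                gamma_b = np.divide(d2, d1, out=np.zeros_like(d1),
                                    where=np.abs(d1) > 1e-12)   # ~ eig(J^b)
                # heuristic signed b-th root to get back to eig(J)
                gamma = np.sign(gamma_b) * np.abs(gamma_b) ** (1.0 / b)
                gamma = np.clip(gamma, -gamma_max, gamma_max)   # keep 1 - gamma > 0
                eta = eta / (1.0 - gamma)    # approximate Newton-style rescaling
                snapshots = [theta.copy()]   # start the next measurement period
    return theta
```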
Experimental Results
Conditional Random Fields (CRF) (Lafferty et al. 2001)
Sequence labeling problems – gene mention tagging
Conditional Random Fields
In effect, a CRF encodes a probabilistic rule-based system with rules of the form:
If $f_{j_1}(X,Y)$ & $f_{j_2}(X,Y)$ & … & $f_{j_n}(X,Y)$ are non-zero,
then the labels of the sequence are $Y$,
with score $P(Y|X)$.
If we have $d$ features and consider a context window of width $w$, then an order-1 CRF encodes on the order of
$$2\,d\,w\,|x|\,|y|^{2}$$
rules.
Tasks and Setups
CoNLL 2000 base NP: tag noun phrases; 8,936 training, 2,012 test; 3 tags, 1,015,662 features
CoNLL 2000 chunking: tag 11 chunk types; 8,936 training, 2,012 test; 23 tags, 7,448,606 features
BioNLP/NLPBA 2004: tag 5 types of bio-entities (e.g., gene, protein, cell lines); 18,546 training, 3,856 test; 5,977,675 features
BioCreative 2: tag gene names; 15,000 training, 5,000 test; 10,242,972 features
Performance measure: F-score
$$F = \frac{2pr}{p + r}, \qquad p = \frac{TP}{TP + FP}, \qquad r = \frac{TP}{TP + FN}$$
Feature types for BioCreative 2
O(22M) rules are encoded in our CRF model!
Convergence performance (plots): CoNLL 2000 base NP; CoNLL chunking; BioNLP/NLPBA 2004; BioCreative 2
Execution time (first pass):
CoNLL 2000 base NP: 23.74 sec
CoNLL chunking: 196.44 sec
BioNLP/NLPBA 2004: 287.48 sec
BioCreative 2: 394.04 sec
Experimental results for linear SVM and convolutional neural net
Data sets
Linear SVM
Convolutional Neural Net (5 layers)
Layer trick: step sizes in the lower layers should be larger than in the higher layers.
Mini-conclusion: Single-Pass
By approximating the Jacobian, we can approximate the Hessian, too.
By approximating the Hessian, we can achieve near-optimal single-pass performance in practice.
With a single-pass on-line learner, virtually infinitely many training examples can be used.
PSA is a member of the family of "discretized Newton methods".
Other well-known members include the secant method (a.k.a. Quickprop) and Steffensen's method (a.k.a. Triple Jump).
The general form of these methods is
$$\theta^{(t+1)} = \theta^{(t)} - A[\theta^{(t)}, h]^{-1}\, g(\theta^{(t)}),$$
where $A$ is a matrix designed to approximate the Hessian without actually computing the derivative.
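As a concrete 1-D instance of this general form (my own illustration, not from the slides), the secant method replaces the derivative in Newton's iteration with a finite difference built from the two most recent iterates:

```python
def secant(g, x0, x1, tol=1e-10, max_iter=100):
    """Secant method for g(x) = 0: a discretized Newton iteration in which
    g'(x) is approximated by (g(x1) - g(x0)) / (x1 - x0)."""
    for _ in range(max_iter):
        g0, g1 = g(x0), g(x1)
        if g1 == g0:                              # avoid division by zero
            break
        x2 = x1 - g1 * (x1 - x0) / (g1 - g0)      # Newton step with approximated slope
        if abs(x2 - x1) < tol:
            return x2
        x0, x1 = x1, x2
    return x1

# Example: the root of g(x) = x**2 - 2 is sqrt(2) ~ 1.4142135...
print(secant(lambda x: x * x - 2.0, 1.0, 2.0))
```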
Analysis of PSA
PSA
PSA is neither the secant method nor Steffensen's method.
PSA iterates a 2b-step "parallel chord" method (i.e., fixed-rate SGD) followed by an approximated Newton step.
The off-line 2-step parallel chord method is known to have order-4 convergence.
Convergence analysis of PSA
Are we there yet?
With single-pass on-line learning, we can learn from infinite training examples now, at least in theory
A cheaper, quicker method to annotate labels for training examples
Plus a lot of computers…
Human life is finite, but knowledge is infinite.
Learning from infinite examples by applying PSA to 2nd-order SGD is a good idea!
Thank you for your attention!
http://aiia.iis.sinica.edu.tw
http://chunnan.iis.sinica.edu.tw/~chunnan
This research is supported mostly by NRPGM’s advanced bioinformatics core facility grant 2005-2011.