
Learning from Infinite Training Examples


Page 1: Learning from  Infinite Training Examples

Learning from Infinite Training Examples

3.18.2009, 3.19.2009

Prepared for NKU and NUTN seminars

Presenter: Chun-Nan Hsu (許鈞南), Institute of Information Science, Academia Sinica, Taipei, Taiwan

Page 2: Learning from  Infinite Training Examples

The Ever-Growing Web (Zhuangzi, ca. 400 BC)

Human life is finite, but knowledge is infinite. Following the infinite with the finite is doomed to fail.

人之生也有涯,而知也無涯。以有涯隨無涯,殆矣。(Zhuangzi, ca. 400 BC)

Page 3: Learning from  Infinite Training Examples


Analogously…

Computing power is finite, but the Web is infinite. Mining the infinite Web with finite computing power…

is doomed to fail?

Page 4: Learning from  Infinite Training Examples

Other “holy grails” in Artificial Intelligence

Learning to understand natural languages

Learning to recognize millions of objects in computer vision

Speech recognition in noisy environments, such as in a car

Page 5: Learning from  Infinite Training Examples


On-Line Learning vs. Off-Line Learning

Nothing to do with human learning by browsing the web

Definition: Given a set of new training data, an on-line learner can update its model without reading old data while improving its performance.

By contrast, an off-line learner must combine old and new data and start the learning all over again; otherwise the performance will suffer.
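To make the distinction concrete, here is a minimal sketch (not from the talk) contrasting the two interfaces, with logistic-loss SGD as the underlying learner; the class names and parameters are illustrative assumptions.

```python
import numpy as np

class OnlineLearner:
    """Stores only the model parameters; update() never re-reads old data."""
    def __init__(self, n_features, eta=0.1):
        self.w = np.zeros(n_features)
        self.eta = eta

    def update(self, X_new, y_new):
        # One pass of SGD over the NEW examples only (logistic loss, labels in {-1, +1}).
        for x, y in zip(X_new, y_new):
            self.w += self.eta * y * x / (1.0 + np.exp(y * (x @ self.w)))


class OfflineLearner:
    """Must be refit on old + new data, or performance will suffer."""
    def __init__(self, n_features, eta=0.1, n_passes=50):
        self.w = np.zeros(n_features)
        self.eta, self.n_passes = eta, n_passes
        self._X, self._y = [], []              # has to remember everything seen so far

    def fit(self, X_new, y_new):
        self._X.append(np.asarray(X_new))
        self._y.append(np.asarray(y_new))
        X, y = np.vstack(self._X), np.concatenate(self._y)
        self.w[:] = 0.0
        for _ in range(self.n_passes):         # many passes over ALL accumulated data
            for xi, yi in zip(X, y):
                self.w += self.eta * yi * xi / (1.0 + np.exp(yi * (xi @ self.w)))
```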


Page 6: Learning from  Infinite Training Examples


Off-Line Learning

Nearly all popular ML algorithms are off-line today

They iteratively scan the training examples for many passes until an objective function is minimized.

For example: the SMO algorithm for SVM, the L-BFGS algorithm for CRF, the EM algorithm for HMM and GMM, etc.


Page 7: Learning from  Infinite Training Examples


Why on-line learning?


Page 8: Learning from  Infinite Training Examples


Single-pass on-line learning

The key for on-line learning to win is to achieve satisfactory performance right after scanning the new training examples for a single pass only.


Page 9: Learning from  Infinite Training Examples


Previous work on on-line learning

Perceptron (Rosenblatt 1957)

Stochastic Gradient Descent (Widrow & Hoff 1960)

Bregman Divergence (Azoury & Warmuth 2001)

MIRA (Large Margin) (Crammer & Singer 2003)

LaRank (Bordes & Bottou 2005, 2007)

EG (Collins, Bartlett et al. 2008)


Page 10: Learning from  Infinite Training Examples


Stochastic Gradient Descent (SGD)

Learning is to minimize a loss function given training examples

Minimize the loss $L(\theta; D)$ over the training set $D$; SGD updates the parameters with one example (or small batch) $B_t$ at a time:

$$\theta^{(t+1)} = \theta^{(t)} - \eta\, \nabla L(\theta^{(t)}; B_t)$$
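A minimal sketch of this update (illustrative; the gradient function and batch format are assumptions, not part of the talk):

```python
import numpy as np

def sgd(grad_fn, theta0, batches, eta=0.01):
    """First-order SGD: theta <- theta - eta * grad L(theta; B_t) for each batch B_t.

    grad_fn(theta, batch) returns the gradient of the loss on one example or
    mini-batch; `batches` is any (possibly endless) iterable of such batches.
    """
    theta = np.array(theta0, dtype=float)
    for batch in batches:
        theta -= eta * grad_fn(theta, batch)
    return theta

# Example gradient for least squares, L(theta; (x, y)) = 0.5 * (x @ theta - y) ** 2
def lsq_grad(theta, batch):
    x, y = batch
    return (x @ theta - y) * x
```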

Page 11: Learning from  Infinite Training Examples


Optimal Step Size (Benveniste et al. 1990, Murata et al. 1998)

Solving gradient = 0 by Newton's method:

$$\theta^{(t+1)} = \theta^{(t)} - H^{-1}\, \nabla L(\theta^{(t)}; D)$$

The step size is asymptotically optimal if it approaches

$$\eta^{(t)} \to \frac{1}{t} H^{-1}$$

Page 12: Learning from  Infinite Training Examples


Single-Pass Result (Bottou & LeCun 2004)

Optimum for n+1 examples is a Newton step away from the optimum for n examples

Let $\theta^*_n$ be the minimizer of $L(\theta; D_n)$ and $\theta^*_{n+1}$ the minimizer of $L(\theta; D_{n+1})$. Then

$$\theta^*_{n+1} = \theta^*_n - \frac{1}{n+1}\, H^{-1}\, \nabla L(\theta^*_n; B_{n+1}) + O\!\left(\frac{1}{n^2}\right)$$

Page 13: Learning from  Infinite Training Examples


2nd Order SGD

2nd order SGD (2SGD): adjust the step size so that it approaches the scaled inverse Hessian, $\eta^{(t)} \to \frac{1}{t} H^{-1}$

Good news: from previous work, given sufficiently many training examples, 2SGD achieves the empirical optimum in a single pass!

Bad news: it is prohibitively expensive to compute $H^{-1}$

e.g. 10K features, H will be a 10K by 10K matrix = 100M floating point array

How about 1M features?
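To make the cost concrete, a back-of-the-envelope estimate (assuming 4-byte floats, consistent with the "100M floating point array" above):

$$10^4 \times 10^4 = 10^8 \text{ entries} \approx 400\ \text{MB}, \qquad 10^6 \times 10^6 = 10^{12} \text{ entries} \approx 4\ \text{TB}$$

so storing (let alone inverting) a dense Hessian is hopeless for models with millions of features.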

Page 14: Learning from  Infinite Training Examples


Approximating the Jacobian (Aitken 1925, Schafer 1997)

Learning algorithms can be considered as a fixed-point iteration with mapping $\theta = M(\theta)$.

Taylor expansion around the fixed point $\theta^*$ gives

$$\theta^{(t+1)} = M(\theta^{(t)}) \approx \theta^* + J\, (\theta^{(t)} - \theta^*)$$

Eigenvalues of $J$ can be approximated component-wise by

$$\gamma_i \approx \frac{\theta_i^{(t+2)} - \theta_i^{(t+1)}}{\theta_i^{(t+1)} - \theta_i^{(t)}}$$
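A small numpy sketch of this component-wise estimate (illustrative; the function name and the linear test mapping are assumptions):

```python
import numpy as np

def aitken_eigenvalues(theta_t, theta_t1, theta_t2, eps=1e-12):
    """Component-wise estimate of eig(J) for a fixed-point mapping M, given three
    consecutive iterates theta_t, theta_t1 = M(theta_t), theta_t2 = M(theta_t1):
        gamma_i ~ (theta_i^(t+2) - theta_i^(t+1)) / (theta_i^(t+1) - theta_i^(t))
    """
    num = theta_t2 - theta_t1
    den = theta_t1 - theta_t
    return num / np.where(np.abs(den) < eps, eps, den)

# Quick check with a linear mapping M(theta) = 0.5 * theta + 1 (Jacobian eigenvalues 0.5):
t0 = np.array([0.0, 10.0])
t1 = 0.5 * t0 + 1
t2 = 0.5 * t1 + 1
print(aitken_eigenvalues(t0, t1, t2))   # -> [0.5 0.5]
```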

Page 15: Learning from  Infinite Training Examples


Approximating Hessian

Consider the SGD mapping as a fixed-point iteration, too:

$$M(\theta^{(t)}) = \theta^{(t)} - \eta\, \nabla L(\theta^{(t)}; B)$$

Since $J = M' = I - \eta H$, we have $\operatorname{eig}(J) = \operatorname{eig}(M') = \operatorname{eig}(I - \eta H)$. Therefore (since $H$ is symmetric) $\operatorname{eig}(J) = 1 - \eta \operatorname{eig}(H)$, and

$$\operatorname{eig}(H^{-1}) = \frac{\eta}{1 - \operatorname{eig}(J)} = \frac{\eta}{1 - \gamma}$$

Page 16: Learning from  Infinite Training Examples


Estimating Eigenvalue Periodically

Since the mapping of SGD is stochastic, estimating the eigenvalues at each iteration may yield inaccurate estimates.

To make the mapping more stationary, we use $M^b(\theta) = M(M(\cdots M(\theta)\cdots))$, i.e., $b$ consecutive applications of $M$.

By the law of large numbers, $b$ consecutive mappings $M^b$ will be less "stochastic".

From Equation (4), we can estimate $\operatorname{eig}(J^b)$ by

$$\gamma_i^{\,b} \approx \frac{\theta_i^{(t+2b)} - \theta_i^{(t+b)}}{\theta_i^{(t+b)} - \theta_i^{(t)}}$$

Page 17: Learning from  Infinite Training Examples


The PSA algorithm (Huang, Chang & Hsu 2007)
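The PSA pseudocode on this slide is an image and does not survive in this transcript. Below is a rough sketch of the idea as derived on the preceding slides: run fixed-rate SGD, periodically estimate eig(J^b) from iterates b steps apart, convert it to eig(H^{-1}) ≈ η/(1−γ), and move the per-coordinate step sizes toward the (1/t)H^{-1} recipe of 2SGD. This is not the authors' exact algorithm; all names, constants, and clipping choices are assumptions.

```python
import numpy as np

def psa_sketch(grad_fn, theta0, examples, eta0=0.1, b=20, gamma_max=0.99, eps=1e-12):
    """Rough sketch of periodic step-size adaptation (PSA) for SGD."""
    theta = np.array(theta0, dtype=float)
    eta = np.full_like(theta, eta0)            # per-coordinate step sizes
    snaps = [theta.copy()]                     # iterates at t, t+b, t+2b

    for t, x in enumerate(examples, start=1):
        theta -= eta * grad_fn(theta, x)       # ordinary SGD step
        if t % b == 0:
            snaps.append(theta.copy())
        if len(snaps) == 3:                    # a full 2b-step measurement window
            d1, d2 = snaps[1] - snaps[0], snaps[2] - snaps[1]
            gamma_b = d2 / np.where(np.abs(d1) < eps, eps, d1)        # ~ eig(J^b)
            gamma = np.sign(gamma_b) * np.abs(gamma_b) ** (1.0 / b)   # ~ eig(J)
            gamma = np.clip(gamma, -gamma_max, gamma_max)             # keep 1 - gamma > 0
            inv_hess = eta / (1.0 - gamma)                            # ~ eig(H^{-1})
            eta = np.clip(inv_hess / t, eps, eta0)                    # ~ (1/t) H^{-1}
            snaps = [theta.copy()]
    return theta
```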

Page 18: Learning from  Infinite Training Examples


Experimental Results

Conditional Random Fields (CRF) (Lafferty et al. 2001)

Sequence labeling problems – gene mention tagging

Page 19: Learning from  Infinite Training Examples

Conditional Random Fields


Page 20: Learning from  Infinite Training Examples

In effect, CRF encodes a probabilistic rule-based system with rules of the form:

If $f_{j_1}(X,Y)$ & $f_{j_2}(X,Y)$ & … & $f_{j_n}(X,Y)$ are non-zero,

then the labels of the sequence are $Y$,

with score $P(Y|X)$

If we have $d$ features and consider a context window of size $w$, then an order-1 CRF encodes this many rules:

$$2^{\,d\, w\, |x|\, |y|^2}$$

Page 21: Learning from  Infinite Training Examples


Tasks and Setups

CoNLL 2000 base NP: tag noun phrases; 8,936 training / 2,012 test sentences; 3 tags, 1,015,662 features

CoNLL 2000 chunking: tag 11 chunk types; 8,936 training / 2,012 test sentences; 23 tags, 7,448,606 features

BioNLP/NLPBA 2004: tag 5 types of bio-entities (e.g., gene, protein, cell line); 18,546 training / 3,856 test sentences; 5,977,675 features

BioCreative 2: tag gene names; 15,000 training / 5,000 test sentences; 10,242,972 features

Performance measure: F-score

$$F = \frac{2pr}{p + r}, \qquad p = \frac{TP}{TP + FP}, \qquad r = \frac{TP}{TP + FN}$$
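A tiny helper matching the formula above (illustrative only):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F-score from true-positive, false-positive,
    and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

print(precision_recall_f1(tp=80, fp=20, fn=40))   # -> (0.8, 0.666..., 0.727...)
```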

Page 22: Learning from  Infinite Training Examples

Feature types for BioCreative 2

O(22M) rules are encoded in our CRF model!!!

Page 23: Learning from  Infinite Training Examples

Convergence Performance: CoNLL 2000 base NP

Page 24: Learning from  Infinite Training Examples

Convergence Performance: CoNLL chunking

Page 25: Learning from  Infinite Training Examples

Convergence Performance: BioNLP/NLPBA 2004

Page 26: Learning from  Infinite Training Examples

Convergence Performance: BioCreative 2

Page 27: Learning from  Infinite Training Examples

Execution Time: CoNLL 2000 base NP

First Pass 23.74 sec

Page 28: Learning from  Infinite Training Examples

Execution Time: CoNLL chunking

First Pass 196.44 sec

Page 29: Learning from  Infinite Training Examples

Execution Time: BioNLP/NLPBA 2004

First Pass 287.48 sec

Page 30: Learning from  Infinite Training Examples

Execution Time: BioCreative 2

First Pass 394.04 sec

Page 31: Learning from  Infinite Training Examples

Experimental results for linear SVM and convolutional neural net

Data sets


Page 32: Learning from  Infinite Training Examples

Linear SVM

Convolutional Neural Net (5 layers)


** Layer trick: step sizes in the lower layers should be larger than in the higher layers.

Page 33: Learning from  Infinite Training Examples


Mini-conclusion: Single-Pass

By approximating the Jacobian, we can approximate the Hessian, too.

By approximating the Hessian, we can achieve near-optimal single-pass performance in practice.

With a single-pass on-line learner, virtually infinitely many training examples can be used

Page 34: Learning from  Infinite Training Examples

Analysis of PSA

PSA is a member of the family of "discretized Newton methods".

Other well-known members include the secant method (a.k.a. Quickprop) and Steffensen's method (a.k.a. Triple Jump).

General form of these methods:

$$\theta^{(t+1)} = \theta^{(t)} - A[h^{(t)}, g^{(t)}]^{-1}\, g(\theta^{(t)})$$

where $g$ is the gradient and $A$ is a matrix designed to approximate the Hessian without actually computing second derivatives.
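As a concrete one-dimensional, off-line illustration of this family (not part of the talk), here is the secant method applied to the gradient; the finite difference of gradients plays the role of the matrix A:

```python
def secant_newton(g, t0, t1, tol=1e-10, max_iter=100):
    """Secant method for g(theta) = 0: a discretized Newton iteration in which the
    second derivative is replaced by the finite difference (g(t1) - g(t0)) / (t1 - t0)."""
    for _ in range(max_iter):
        g0, g1 = g(t0), g(t1)
        if abs(t1 - t0) < tol or g1 == g0:
            break
        a = (g1 - g0) / (t1 - t0)      # 1-D stand-in for the Hessian approximation A
        t0, t1 = t1, t1 - g1 / a       # discretized Newton step
    return t1

# Minimize f(x) = (x - 3)**2 by driving its gradient g(x) = 2 * (x - 3) to zero:
print(secant_newton(lambda x: 2 * (x - 3), 0.0, 1.0))   # -> 3.0
```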

Page 35: Learning from  Infinite Training Examples

PSA

PSA is neither the secant method nor Steffensen's method.

PSA iterates a 2b-step "parallel chord" method (i.e., fixed-rate SGD) followed by an approximate Newton step. The off-line 2-step parallel chord method is known to have order-4 convergence.


Page 36: Learning from  Infinite Training Examples

Convergence analysis of PSA


Page 37: Learning from  Infinite Training Examples


Are we there yet?

With single-pass on-line learning, we can learn from infinite training examples now, at least in theory

A cheaper, quicker method to annotate labels for training examples

Plus a lot of computers…

Page 38: Learning from  Infinite Training Examples


Human life is finite, but knowledge is infinite.

Learning from infinite examples by applying PSA to 2nd-order SGD

is a good idea!

Page 39: Learning from  Infinite Training Examples

Thank you for your attention!

http://aiia.iis.sinica.edu.tw
http://chunnan.iis.sinica.edu.tw/~chunnan

This research is supported mostly by NRPGM’s advanced bioinformatics core facility grant 2005-2011.