Learning from Infinite Training Examples

3.18.2009, 3.19.2009

Prepared for NKU and NUTN seminars

Presenter: Chun-Nan Hsu (許鈞南), Institute of Information Science, Academia Sinica, Taipei, Taiwan

The Ever-Growing Web (Zhuangzi, c. 400 BC)

Human life is finite, but knowledge is infinite. Following the infinite with the finite is doomed to fail.

人之生也有涯,而知也無涯。以有涯隨無涯,殆矣。 — Zhuangzi (莊子), c. 400 BC

Analogously…

Computing power is finite, but the Web is infinite. Mining the infinite Web with finite computing power…

is doomed to fail?

Other “holy grails” in Artificial Intelligence

Learning to understand natural languages

Learning to recognize millions of objects in computer vision

Speech recognition in noisy environments, such as in a car

On-Line Learning vs. Off-Line Learning

"On-line" here has nothing to do with humans learning by browsing the web.

Definition: given a set of new training data, an on-line learner can update its model without re-reading the old data, while still improving its performance.

By contrast, an off-line learner must combine the old and new data and restart learning from scratch; otherwise its performance will suffer.
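A minimal sketch of this contrast using scikit-learn (the model choice and the synthetic data are illustrative assumptions): the on-line update via partial_fit sees only the new batch, while the off-line fit must retrain on everything.

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X_old, y_old = rng.randn(1000, 20), rng.randint(0, 2, 1000)   # old training data
X_new, y_new = rng.randn(100, 20), rng.randint(0, 2, 100)     # newly arrived data

# On-line learner: update the existing model with the new batch only.
online = SGDClassifier(random_state=0)
online.partial_fit(X_old, y_old, classes=np.array([0, 1]))
online.partial_fit(X_new, y_new)                 # never re-reads X_old

# Off-line learner: must combine old and new data and retrain from scratch.
offline = SGDClassifier(random_state=0)
offline.fit(np.vstack([X_old, X_new]), np.concatenate([y_old, y_new]))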

Off-Line Learning

Nearly all popular ML algorithms are off-line today

They iterate over the training examples for many passes until an objective function is minimized.

For example: the SMO algorithm for SVMs, the L-BFGS algorithm for CRFs, the EM algorithm for HMMs and GMMs, etc.

Why on-line learning?

Single-pass on-line learning

The key for on-line learning to win is to achieve satisfactory performance right after scanning the new training examples for a single pass only.

Previous work on on-line learning

Perceptron (Rosenblatt 1957)

Stochastic gradient descent (Widrow & Hoff 1960)

Bregman divergence (Azoury & Warmuth 2001)

MIRA (large margin) (Crammer & Singer 2003)

LaRank (Bordes & Bottou 2005, 2007)

EG (Collins, Bartlett et al. 2008)

Stochastic Gradient Descent (SGD)

Learning is to minimize a loss function $L(\theta; D)$ given the training examples $D$; at the optimum, $\nabla L(\theta^*; D) = 0$.

SGD updates the parameters with one example (or a small batch) $z_t$ at a time:

$\theta^{(t+1)} = \theta^{(t)} - \eta_t\, \nabla L(\theta^{(t)}; z_t)$
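A minimal sketch of this update rule on a toy least-squares problem (the quadratic loss, the synthetic data, and the fixed step size are illustrative assumptions):

import numpy as np

# SGD for linear least squares: L(theta; D) = 0.5 * sum_i (x_i . theta - y_i)^2
rng = np.random.RandomState(0)
X = rng.randn(500, 5)
true_theta = rng.randn(5)
y = X @ true_theta + 0.01 * rng.randn(500)

theta = np.zeros(5)
eta = 0.05                              # fixed step size eta_t
for x_t, y_t in zip(X, y):
    grad = (x_t @ theta - y_t) * x_t    # gradient of the loss on example z_t = (x_t, y_t)
    theta = theta - eta * grad          # theta^(t+1) = theta^(t) - eta * grad
print("distance to true theta:", np.linalg.norm(theta - true_theta))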

Optimal Step Size(Benveniste et al. 1990, Murata et al. 1998)

Solving $\nabla L(\theta; D) = 0$ by Newton's method:

$\theta^{(t+1)} = \theta^{(t)} - H_t^{-1}\, \nabla L(\theta^{(t)}; D)$

The step size is asymptotically optimal if it approaches the matrix-valued schedule

$\eta^{(t)} \to \frac{1}{t} H^{-1}$

Single-Pass Result (Bottou & LeCun 2004)

Optimum for n+1 examples is a Newton step away from the optimum for n examples

$\theta^*_{n+1} = \theta^*_n - \frac{1}{n+1} H^{-1}\, \nabla L(\theta^*_n; z_{n+1}) + O\!\left(\frac{1}{n^2}\right)$

where $\theta^*_n$ minimizes $L(\theta; D_n)$ and $\theta^*_{n+1}$ minimizes $L(\theta; D_{n+1})$.

2nd Order SGD

2nd-order SGD (2SGD): adjust the step size so that it approaches the scaled inverse Hessian $\frac{1}{t} H^{-1}$.

Good news: from the result above, given sufficiently many training examples, 2SGD achieves the empirical optimum in a single pass!

Bad news: it is prohibitively expensive to compute $H^{-1}$.

E.g., with 10K features, H is a 10K × 10K matrix, i.e., an array of 100M floating-point numbers.

What about 1M features? (a 1M × 1M matrix, i.e., $10^{12}$ entries)

Approximating the Jacobian (Aitken 1925, Schafer 1997)

Learning algorithms can be viewed as a fixed-point iteration with mapping $\theta = M(\theta)$.

Taylor expansion gives

$\theta^{(t+1)} = M(\theta^{(t)}) \approx M(\theta^*) + J\,(\theta^{(t)} - \theta^*) = \theta^* + J\,(\theta^{(t)} - \theta^*)$

The eigenvalues of J can be approximated component-wise by

$\gamma_i \approx \dfrac{\theta_i^{(t+2)} - \theta_i^{(t+1)}}{\theta_i^{(t+1)} - \theta_i^{(t)}}$
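A small sketch of this component-wise estimate on a linear fixed-point map with a known Jacobian, so the estimate can be checked (the map and its eigenvalues are illustrative assumptions):

import numpy as np

# Fixed-point iteration theta <- M(theta) for a linear map with a known Jacobian J,
# so the Aitken-style estimate can be compared against the true eigenvalues.
J = np.diag([0.9, 0.5, 0.2])
theta_star = np.array([1.0, 2.0, 3.0])
M = lambda th: theta_star + J @ (th - theta_star)

th0 = np.zeros(3)
th1 = M(th0)
th2 = M(th1)

gamma = (th2 - th1) / (th1 - th0)       # component-wise eigenvalue estimate
print(gamma)                            # ~ [0.9, 0.5, 0.2]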

Approximating Hessian

Consider the SGD mapping as a fixed-point iteration, too:

$M(\theta^{(t)}) = \theta^{(t)} - \eta\, \nabla L(\theta^{(t)}; B)$

Since $J = M' = I - \eta H$, we have $\mathrm{eig}(J) = \mathrm{eig}(I - \eta H)$; therefore (since H is symmetric) $\mathrm{eig}(J) = 1 - \eta\,\mathrm{eig}(H)$, and hence $\mathrm{eig}(H^{-1}) = \dfrac{\eta}{1 - \mathrm{eig}(J)} = \dfrac{\eta}{1 - \gamma}$.

Estimating Eigenvalue Periodically

Since the mapping of SGD is stochastic, estimating the eigenvalues at every iteration may yield inaccurate estimates.

To make the mapping more stationary, we use the b-fold composition $M^b(\theta) = M(M(\cdots M(\theta)\cdots))$.

By the law of large numbers, b consecutive mappings, $M^b$, will be less "stochastic".

Applying the same approximation as before, we can estimate $\mathrm{eig}(J^b)$ by

$\gamma_i^b \approx \dfrac{\theta_i^{(t+2b)} - \theta_i^{(t+b)}}{\theta_i^{(t+b)} - \theta_i^{(t)}}$

The PSA algorithm (Huang, Chang & Hsu 2007)
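A minimal Python sketch of the periodic step-size adaptation idea described on the preceding slides; the toy quadratic objective, the constants, and the exact update schedule are illustrative assumptions rather than the authors' implementation:

import numpy as np

# Toy quadratic objective: L(theta; z) = 0.5 * (theta - z)' A (theta - z),
# so the Hessian is A and each "example" z is a noisy observation of theta_true.
rng = np.random.RandomState(0)
A = np.diag([10.0, 1.0, 0.1])                   # ill-conditioned Hessian
theta_true = np.array([1.0, -2.0, 0.5])

def grad(theta, z):
    return A @ (theta - z)                      # per-example gradient

b = 20                                          # period between snapshots
eta = np.full(3, 0.02)                          # per-component step sizes
theta = np.zeros(3)
snapshots = [theta.copy()]

for t in range(1, 2001):
    z = theta_true + 0.1 * rng.randn(3)
    theta = theta - eta * grad(theta, z)        # fixed-rate SGD ("parallel chord") step
    if t % b == 0:
        snapshots.append(theta.copy())
    if len(snapshots) == 3:                     # have theta^(t), theta^(t+b), theta^(t+2b)
        s0, s1, s2 = snapshots
        gamma_b = (s2 - s1) / (s1 - s0 + 1e-12)           # estimate of eig(J^b) = eig(J)^b
        gamma = np.sign(gamma_b) * np.abs(gamma_b) ** (1.0 / b)
        h_inv = eta / np.maximum(1.0 - gamma, 1e-3)       # eig(H^{-1}) ~ eta / (1 - gamma)
        eta = h_inv / t                         # move toward the 2SGD target (1/t) H^{-1}
        snapshots = [theta.copy()]

print("theta:", theta, " (true:", theta_true, ")")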

Experimental Results

Conditional Random Fields (CRF) (Lafferty et al. 2001)

Sequence labeling problems – gene mention tagging

Conditional Random Fields

In effect, a CRF encodes a probabilistic rule-based system with rules of the form:

If f_{j1}(X,Y) & f_{j2}(X,Y) & … & f_{jn}(X,Y) are non-zero,

then the labels of the sequence are Y

with score P(Y|X)

If we have d features and consider a context window of size w, then an order-1 CRF encodes this many rules:

$2^{\,d\, w\, |x|\, |y|^2}$

Tasks and Setups

CoNLL 2000 base NP: tag noun phrases; 8,936 training / 2,012 test examples; 3 tags, 1,015,662 features

CoNLL 2000 chunking: tag 11 chunk types; 8,936 training / 2,012 test examples; 23 tags, 7,448,606 features

BioNLP/NLPBA 2004: tag 5 types of bio-entities (e.g., gene, protein, cell line); 18,546 training / 3,856 test examples; 5,977,675 features

BioCreative 2: tag gene names; 15,000 training / 5,000 test examples; 10,242,972 features

Performance measure: F-score

$F = \dfrac{2pr}{p + r}$, where $p = \dfrac{TP}{TP + FP}$ and $r = \dfrac{TP}{TP + FN}$
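A quick check of the F-score definition above (the counts are made up for illustration):

def f_score(tp, fp, fn):
    # F = 2pr / (p + r), with p = TP / (TP + FP) and r = TP / (TP + FN)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

print(f_score(tp=90, fp=10, fn=30))     # precision 0.9, recall 0.75 -> F ~ 0.818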

Feature types for BioCreative 2

$O(2^{2\mathrm{M}})$ rules are encoded in our CRF model!

Convergence Performance: CoNLL 2000 base NP

Convergence Performance: CoNLL chunking

Convergence Performance: BioNLP/NLPBA 2004

Convergence Performance: BioCreative 2

Execution Time: CoNLL 2000 base NP

First Pass 23.74 sec

Execution Time: CoNLL chunking

First Pass 196.44 sec

Execution Time: BioNLP/NLPBA 2004

First Pass 287.48 sec

Execution Time: BioCreative 2

First Pass 394.04 sec

Experimental results for linear SVM and convolutional neural net

Data sets

Linear SVM

Convolutional Neural Net (5 layers)

Layer trick: step sizes in the lower layers should be larger than those in the higher layers.

Mini-conclusion: Single-Pass

By approximating the Jacobian, we can approximate the Hessian, too.

By approximating the Hessian, we can achieve near-optimal single-pass performance in practice.

With a single-pass on-line learner, virtually infinitely many training examples can be used

PSA is a member of the family of “discretized Newton methods”.

Other well-known members include the secant method (a.k.a. Quickprop) and Steffensen’s method (a.k.a. Triple Jump).

General form of these methods:

$\theta^{(t+1)} = \theta^{(t)} - A[\theta^{(t)}, h]^{-1}\, g(\theta^{(t)})$

where g is the gradient and A is a matrix designed to approximate the Hessian without actually computing the derivative.
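For concreteness, here is the simplest one-dimensional member of this family, the secant method, where A reduces to a finite-difference slope used in place of the true derivative (the example function is an illustrative assumption):

# Secant method in 1-D: the matrix A reduces to a finite-difference slope.
def secant(g, x0, x1, tol=1e-10, max_iter=50):
    for _ in range(max_iter):
        a = (g(x1) - g(x0)) / (x1 - x0)      # discretized derivative, no analytic g'
        x0, x1 = x1, x1 - g(x1) / a          # x^(t+1) = x^(t) - A^{-1} g(x^(t))
        if abs(x1 - x0) < tol:
            break
    return x1

# Solve g(x) = x**3 - 2 = 0 (g playing the role of the gradient).
print(secant(lambda x: x**3 - 2.0, x0=1.0, x1=2.0))   # ~ 2 ** (1/3) = 1.2599...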

Analysis of PSA

PSA

PSA is neither the secant method nor Steffensen’s method.

PSA iterates a 2b-step “parallel chord” method (i.e., fixed-rate SGD) followed by an approximate Newton step. The off-line 2-step parallel chord method is known to have order-4 convergence.

Convergence analysis of PSA

Are we there yet?

With single-pass on-line learning, we can now learn from infinite training examples, at least in theory.

What we still need: a cheaper, quicker method to annotate labels for training examples,

plus a lot of computers…

Human life is finite, but knowledge is infinite. Learning from infinite examples by applying PSA to 2nd-order SGD is a good idea!

Thank you for your attention!

http://aiia.iis.sinica.edu.tw

http://chunnan.iis.sinica.edu.tw/~chunnan

This research is supported mostly by NRPGM’s advanced bioinformatics core facility grant 2005-2011.
