Natural Language Processing
(for smart people who know nothing about natural language processing)

Hal Daumé III | UMD | [email protected] | @haldaume3
Machine Learning Summer School 2012 @ UC Santa Cruz
Source: legacydirs.umiacs.umd.edu/~hal/tmp/mlss.pdf

Title cred: Persi Diaconis

What is NLP?

- Fundamental goal: deep understanding of text
  - Not just string processing or keyword matching
- End systems that we want to build
  - Simple: spelling correction, text categorization, etc.
  - Complex: speech recognition, machine translation, information extraction, dialog interfaces, question answering
  - Unknown: human-level comprehension (might be more than just NLP)

Slide cred: Dan Klein

Topology of the Field

[Venn diagram: Human Language Technologies as overlapping fields — Automatic Speech Recognition (conference: ICASSP), Information Retrieval (SIGIR), Computational Linguistics, and Natural Language Processing (ACL); one region is labeled only "???"]

Topology of the Field

[Venn diagram, zoomed in on NLP within Human Language Technologies (Automatic Speech Recognition, Information Retrieval, Computational Linguistics): NLP comprises Machine Translation, Summarization, Question Answering, Information Extraction, Parsing, "Understanding", and Generation]

What can I teach you in 6 hours...

- Language is not just a bag of words (no math, no machine learning)
- NLP through the eyes of machine translation (a little math, a little machine learning)
- Machine learning for complex structured prediction problems, and then some (lotsa math, lotsa machine learning)
- How to learn language by interacting with the world (some math, some machine learning)

Speech

- Automatic speech recognition (ASR)
  - Audio in, text out
  - SOTA: 0.3% error for digits, 10% for dictation, 50%+ for TV
- Text to speech (TTS)
  - Text in, audio out
  - SOTA: completely intelligible (sometimes unnatural)

"Speech Lab" [demo]

Slide cred: Dan Klein

Question answering

- More than search
- Can be really easy: "What is the capital of Wyoming?"
- Can also be hard
- Can be open ended: "What are the main issues in the global warming debate?"
- SOTA: can do factoids, can't do much else

Slide cred: Dan Klein


Information Extraction

- Unstructured text to database entries
- SOTA: 40% for hard things, 80% accuracy for multi-sentence templates, 90%+ for single easy fields

New York Times Co. named Russell T. Lewis, 45, president and general manager of its flagship New York Times newspaper, responsible for all business-side activities. He was executive vice president and deputy general manager. He succeeds Lance R. Primis, who in September was named president and chief operating officer of the parent.

Extracted template:

  Person            Company                   Post                           State
  Russell T. Lewis  New York Times newspaper  president and general manager  start
  Russell T. Lewis  New York Times newspaper  executive vice president       end
  Lance R. Primis   New York Times Co.        president and CEO              start

Slide cred: Dan Klein

Summarization

- Condensing documents
  - Single or multiple documents
  - Extractive or abstractive
  - Indicative or informative
  - Etc...
- Very context (and user) dependent
- Involves analysis and generation
- SOTA: who knows...

Slide cred: Dan Klein


Machine translation


Language comprehension

But man is not destined to vanish. He can be killed, but he cannot be destroyed, because his soul is deathless and his spirit is irrepressible. Therefore, though the situation seems dark in the context of the confrontation between the superpowers, the silver lining is provided by amazing phenomenon that the very nations which have spent incalculable resources and energy for the production of deadly weapons are desperately trying to find out how they might never be used. They threaten each other, intimidate each other and go to the brink, but before the total hour arrives they withdraw from the brink.

The main point from the author's view is:

1. Man's soul and spirit can not be destroyed by superpowers.
2. Man's destiny is not fully clear or visible.
3. Man's soul and spirit are immortal.
4. Man's safety is assured by the delicate balance of power in terms of nuclear weapons.
5. Human society will survive despite the serious threat of total annihilation.

Why is language hard?

- Ambiguity abounds (some headlines):
  - Iraqi Head Seeks Arms
  - Teacher Strikes Idle Kids
  - Kids Make Nutritious Snacks
  - Stolen Painting Found by Tree
  - Local HS Dropouts Cut in Half
  - Enraged Cow Injures Farmer with Ax
  - Hospitals are Sued by 7 Foot Doctors
  - Ban on Nude Dancing on Governor's Desk
- Why are these funny?

Slide cred: Dan Klein

Why (else) is language hard?

- Data sparsity
  - NYT: 500M+ words of newswire text
  - Brown Corpus: 1M words of tagged "balanced" text
  - Penn Treebank: 1M words of parsed WSJ text
  - Canadian Hansard: 10M+ words of parallel French/English text
  - The Web: billions of words of who knows what...

Brain teaser: You've already seen N tokens of text. This contained a bunch of word types. What is the probability that the N+1st word will be new?

Slide cred: Dan Klein
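
One classical answer to the brain teaser is the Good-Turing estimate: the chance that the next token is a new type is roughly the fraction of tokens whose type has been seen exactly once. A minimal sketch (toy data, not from the slides):

```python
from collections import Counter

def prob_next_word_is_new(tokens):
    """Good-Turing estimate: P(unseen type next) ~= (# singleton types) / N."""
    counts = Counter(tokens)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(tokens)

tokens = "the cat sat on the mat and the dog sat".split()
print(prob_next_word_is_new(tokens))  # 5 singleton types / 10 tokens = 0.5
```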

Why (else) is language hard?

[Plot: fraction of word types already seen (0 to 1) vs. number of words observed (0 to 1,000,000), with one curve for unigrams and one for bigrams; the unigram curve saturates far faster than the bigram curve]

Structural ambiguity

Time flies like an arrow.

I saw the Grand Canyon flying to New York.

I saw the man on the hill with the telescope.

I ate spaghetti with...
  ... meatballs
  ... a fork

Slide cred: Ellen Riloff


Structural ambiguity

Hurricane Emily howled toward Mexico's Caribbean coast on Sunday packing 135 mph winds and torrential rain and causing panic in Cancun, where frightened tourists squeezed into musty shelters.

SOTA: 90% accuracy for mid-1980s WSJ text

Slide cred: Dan Klein


Syntax doesn't nail down semantics

- Every morning someone's alarm clock wakes me up

- Every person on this island speaks two languages

- Please come see me next Friday

Slide cred: Dan Klein


Discourse: coreference

President John F. Kennedy was assassinated.

The President was shot yesterday.

Relatives said that John was a good father.

JFK was the youngest president in history.

His family will bury him tomorrow.

Friends of the Massachusetts native will hold

a candlelight service in Mr. Kennedy's

home town.

SOTA: 70-80% accuracy for newswire text

Slide cred: Ellen Riloff


Discourse structure

� I only like traveling to Europe.

So I submitted a paper to ACL.

� I only like traveling to Europe.

Nevertheless I submitted a paper to ACL.


Pragmatics

Rules of conversation:

- Can you tell me what time it is?

- Could you pass the salt?

Speech acts:

- I bet you $50 that the Jazz will win.

- Will you marry me?

Slide cred: Ellen Riloff


Pragmatics

Scene 1: Penn Station, NY

(Hal) Long Beach?

(Passerby) Downstairs, LIRR Station.

Scene 2: Ticket Counter, LIRR

(Hal) Long Beach?

(Clerk) $4.50.

Scene 3: Info. Booth, LIRR Station

(Hal) Long Beach?

(Clerk) 4:19, Track 17.

Scene 4: On train, vicinity of Forest Hills

(Hal) Long Beach?

(Conductor) Change at Jamaica.

Scene 5: Next train, near Lynbrook

(Hal) Long Beach?

(Conductor) Right after Island Park.

Slide cred: Kathy McKeown


Translate Centauri → Arcturan

1a. ok-voon ororok sprok .
1b. at-voon bichat dat .

2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .

3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .

4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .

5a. wiwok farok izok stok .
5b. totat jjat quat cat .

6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .

7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .

8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .

9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .

10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .

11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .

12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .

Your assignment: translate this Centauri sentence to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Slide cred: Kevin Knight

A Bit of History

1940s: Computation begins, AI hot, Turing test; machine translation = code-breaking?
1950s: Cold war continues
1960s: Chomsky and statistics, ALPAC report
1970s: Dry spell
1980s: Statistics makes significant advances in speech
1990s: Web arrives; statistical revolution in machine translation, parsing, IE, etc.; serious "corpus" work, increasing focus on evaluation
2000s: Focus on optimizing loss functions, reranking; even more focus on evaluation, automatic metrics; huge progress in machine translation; gigantic corpora become available, scaling; new challenges

Data-driven NLP

- I could hand-code that:
  - rope is a way of committing suicide
  - meatballs are themes and forks are means
  - canyons don't fly
  - some bugs fly and others are planted
- Or I could try to learn/infer it from data:
  - ("committing suicide" with Google)
  - ("flying canyons" with Google)

What is Nearby NLP?

- Computational Linguistics
  - Using computational methods to learn more about how language works
  - We end up doing this and using it
- Cognitive Science
  - Figuring out how the human brain works
  - Includes the bits that do language
  - Humans: the only working NLP prototype!
- Speech?
  - Mapping audio signals to text
  - Traditionally separate from NLP, converging?
  - Two components: acoustic models and language models
  - Language models in the domain of stat NLP

Slide cred: Dan Klein

English is an outlier...

  uyuyorum        I am sleeping
  uyuyorsun       you are sleeping
  uyuyor          he/she/it is sleeping
  uyuyoruz        we are sleeping
  uyuyorsunuz     you (pl.) are sleeping
  uyuyorlar       they are sleeping
  uyuduk          we slept
  uyumaliyiz      we must sleep
  uyuman          your sleeping
  uyumadan        without sleeping
  uyudukca        as long as (somebody) sleeps
  uyurken         while (somebody) is sleeping
  uyuyunca        when (somebody) sleeps
  uyutmak         to cause somebody to sleep
  uyuturmak       to cause somebody to cause another to sleep
  uyutturtturmak  to cause somebody to cause another to cause yet another to sleep

Classical MT (1970s and 1980s)

[Diagram: source text → source language analysis (using a source lexicon) → transfer/interlingua representation (using transfer rules and a knowledge base) → target language generation (using a target lexicon) → target text]

How Much Analysis?

[Diagram (Vauquois-style triangle): analysis climbs from source words through source morphology, source syntax, and source semantics toward an interlingua; generation descends through target semantics, target syntax, and target morphology to target words. Annotations: word-to-word transfer at the bottom is "direct"; mid-level transfer is "classical"; the full interlingua is "not practical for open domain"; deeper transfer is "becoming popular"]

Syntactic Transfer

- Now, just get a bunch of linguists to sit down and write rules and grammars

[Example parse trees: "The student will see the man" (D N AX V D N; S → NP[SUB] AX V NP[OB]) transfers to "Der Student wird den Mann sehen" (D N AX D N V; S → NP[SUB] AX NP[OB] V) — the object NP moves before the verb]

While We're Busy Writing Grammars

Claude Shannon's noisy channel:

  information source → noisy channel → corrupted output
        p(s)              p(o | s)       p(s | o) ∝ p(s) · p(o | s)

For speech:

  "imagined" words → speech process → acoustic signal

Need: p(word sequence) and p(signal | word sequence)
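
A toy sketch of the noisy-channel recipe, cast as spelling correction (one of the "simple" end systems named earlier). All probabilities here are invented for illustration; a real system would estimate p(s) from a corpus and p(o | s) from edit statistics:

```python
prior = {"the": 0.30, "then": 0.05, "than": 0.04}    # p(s): source model
channel = {                                           # p(o | s): corruption model
    ("the", "teh"): 0.10, ("then", "teh"): 0.01, ("than", "teh"): 0.01,
}

def decode(o):
    """argmax_s p(s) * p(o | s), i.e. the Bayes-rule numerator."""
    return max(prior, key=lambda s: prior[s] * channel.get((s, o), 0.0))

print(decode("teh"))  # -> 'the'
```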

Acoustic Modeling: p(a | w)

Signal: [spectrogram]
Transcription: the man ate

- Key notion: acoustic-word alignments
- Chicken-and-egg problem that we can solve using Expectation Maximization

Language Modeling: p(w)

- Sentence = sequence of symbols from an alphabet

  p(w₁, w₂, ..., w_I) = ∏ᵢ₌₁..I p(wᵢ | w₁, ..., wᵢ₋₁)
                      ≈ ∏ᵢ₌₁..I p(wᵢ | wᵢ₋ₖ, ..., wᵢ₋₁)

The beloved n-gram language model.

In practice, probabilities are estimated from a large corpus, but are "smoothed" intelligently to avoid zero-probability n-grams. Language modeling is often the art of good smoothing.

See [Goodman 1998]
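
A minimal bigram LM sketch of the factorization above. Add-one smoothing is the crudest possible choice, used here only to keep the sketch short; see [Goodman 1998] for the smoothing methods that actually work well:

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Estimate p(w_i | w_{i-1}) from text, with add-one smoothing."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    V = len(unigrams) + 1
    def p(w, prev):
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)
    return p

p = train_bigram_lm(["the man ate", "the man slept"])
print(p("man", "the"))  # high: 3/8
print(p("ate", "the"))  # low, but nonzero thanks to smoothing: 1/8
```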

Speech Rec = Machine Translation? ("Brown 93")

- Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, Robert Mercer
- "The Mathematics of Statistical Machine Translation: Parameter Estimation"
- Computational Linguistics 19 (2), June 1993
- Maybe the most important paper in NLP in the last 20 years


Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .
1b. at-voon bichat dat .

2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .

3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .

4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .

5a. wiwok farok izok stok .
5b. totat jjat quat cat .

6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .

7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .

8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .

9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .

10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .

11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .

12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

(Pages 36-45 repeat this slide and work the assignment out step by step; the annotations along the way note a process of elimination, a possible cognate, and a zero-fertility word.)

Unsupervised EM Training

  ... la maison ...    ... la maison bleue ...    ... la fleur ...
  ... the house ...    ... the blue house ...     ... the flower ...

All P(french-word | english-word) equally likely.

Unsupervised EM Training

  ... la maison ...    ... la maison bleue ...    ... la fleur ...
  ... the house ...    ... the blue house ...     ... the flower ...

"la" and "the" are observed to co-occur frequently, so P(la | the) is increased.

Unsupervised EM Training

  ... la maison ...    ... la maison bleue ...    ... la fleur ...
  ... the house ...    ... the blue house ...     ... the flower ...

"maison" co-occurs with both "the" and "house", but P(maison | house) can be raised without limit, to 1.0, while P(maison | the) is limited because of "la" (pigeonhole principle).

Unsupervised EM Training

  ... la maison ...    ... la maison bleue ...    ... la fleur ...
  ... the house ...    ... the blue house ...     ... the flower ...

Settling down after another iteration.

Unsupervised EM Training

  ... la maison ...    ... la maison bleue ...    ... la fleur ...
  ... the house ...    ... the blue house ...     ... the flower ...

Inherent hidden structure revealed by EM training!

- "A Statistical MT Tutorial Workbook" (Knight, 1999). Promises free beer.
- "The Mathematics of Statistical Machine Translation" (Brown et al., 1993)
- Software: GIZA++
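
A minimal sketch of the EM procedure these slides animate — essentially IBM Model 1 without NULL words, nowhere near full GIZA++ — run on the toy corpus above:

```python
from collections import defaultdict

corpus = [
    (["la", "maison"], ["the", "house"]),
    (["la", "maison", "bleue"], ["the", "blue", "house"]),
    (["la", "fleur"], ["the", "flower"]),
]

# Initialize t(f | e) uniformly: all translations equally likely.
f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(10):                                  # EM iterations
    count = defaultdict(float)                       # expected counts c(f, e)
    total = defaultdict(float)                       # expected counts c(e)
    for fs, es in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)           # E-step normalizer
            for e in es:
                p = t[(f, e)] / z                    # posterior that e generated f
                count[(f, e)] += p
                total[e] += p
    for (f, e), c in count.items():                  # M-step: renormalize
        t[(f, e)] = c / total[e]

print(round(t[("maison", "house")], 3))  # climbs toward 1.0
print(round(t[("maison", "the")], 3))    # collapses toward 0.0
```

After a few iterations t(maison | house) approaches 1.0 while t(maison | the) vanishes — exactly the pigeonhole effect described two slides back.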

The IBM model [Brown93]

  Mary did not slap the green witch
    | n(3 | slap)                       (fertility)
  Mary not slap slap slap the green witch
    | P(NULL)                           (spurious word insertion)
  Mary not slap slap slap NULL the green witch
    | t(la | the)                       (word translation)
  Maria no daba una bofetada a la verde bruja
    | d(j | i)                          (distortion)
  Maria no daba una bofetada a la bruja verde

Use the EM algorithm for training the parameters.

Ready-to-use Data

[Chart: millions of words of parallel text (English side) available from 1994 to 2004 for French-English, Chinese-English, and Arabic-English; y-axis from 0 to 180 million]

Decoding for Machine Translation

  English source → translation model → French output
       p(e)            p(f | e)          p(e | f) ∝ p(e) · p(f | e)

  Decoding: ê = argmax_e p(e) p(f | e)

The problem is NP-hard; use search:

- Beam search
- Greedy search
- Integer programming
- A* search
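
A toy monotone beam-search sketch of that argmax. The phrase table and language model below are invented stand-ins; real decoders also handle reordering, hypothesis recombination, and future-cost estimates:

```python
phrase_table = {  # f-phrase -> [(e-phrase, log p(f | e))], numbers invented
    ("la",): [("the", -0.1)],
    ("maison",): [("house", -0.2), ("home", -0.9)],
    ("la", "maison"): [("the house", -0.4)],
}

def lm_logprob(words):
    return -0.5 * len(words)  # stand-in for log p(e)

def beam_decode(f_words, beam_size=5):
    # beams[k] = hypotheses (partial translation, score) covering first k words
    beams = [[] for _ in range(len(f_words) + 1)]
    beams[0] = [("", 0.0)]
    for k in range(len(f_words)):
        beams[k].sort(key=lambda h: -h[1])
        for partial, score in beams[k][:beam_size]:      # prune to the beam
            for j in range(k + 1, len(f_words) + 1):
                for e, tm in phrase_table.get(tuple(f_words[k:j]), []):
                    hyp = ((partial + " " + e).strip(),
                           score + tm + lm_logprob(e.split()))
                    beams[j].append(hyp)
    return max(beams[-1], key=lambda h: h[1])

print(beam_decode("la maison".split()))  # -> ('the house', -1.3)
```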

Progress in machine translation

2002 system output (reproduced verbatim):

  insistent Wednesday may recurred her trips to Libya tomorrow for flying

  Cairo 6-4 ( AFP ) - an official announced today in the Egyptian lines company for flying Tuesday is a company " insistent for flying " may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment .

  And said the official " the institution sent a speech to Ministry of Foreign Affairs of lifting on Libya air , a situation her receiving replying are so a trip will pull to Libya a morning Wednesday " .

2003 system output (reproduced verbatim):

  Egyptair Has Tomorrow to Resume Its Flights to Libya

  Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.

  " The official said that the company had sent a letter to the Ministry of Foreign Affairs, information on the lifting of the air embargo on Libya, where it had received a response, the first take off a trip to Libya on Wednesday morning ".

Phrase-Based Translation [Koehn, Och and Marcu, NAACL03]

[Figure: a Chinese sentence (characters lost in extraction) segmented into phrases, each phrase aligned to an English phrase; the phrases assemble into "Australia is one of the few countries that have diplomatic relations with North Korea"]

Training Phrase-Based MT Systems [Koehn, Och and Marcu, NAACL03]

[Figure: word-alignment grid between "Maria no daba una bofetada a la bruja verde" and "Mary did not slap the green witch"; phrase pairs are read off consistent blocks of the grid]

Decoding Phrase-Based MT [Koehn, Och and Marcu, NAACL03]

  Maria no daba una bofetada a la bruja verde
  Mary did not slap the green witch

- Each step induces a cost attributed to:
  - Language model probability: p(slap | did not)
  - T-table probability: p(the | a la) and p(a la | the)
  - Distortion probability: p(skip 1) [for a la --> verde]
  - Length penalty
  - ...

Hierarchical Phrase-Based MT [Chiang, ACL05]

[Figure: phrase pairs with gaps (the Chinese characters were lost in extraction); sub-phrases such as "few countries" and "North Korea" plug into a larger pattern covering "Australia is ... have diplomatic relations with ..."]

Hierarchical Phrase-Based MT [Chiang, ACL05]

[Figure, continued: the hierarchical rules compose, step by step, into "Australia is one of the few countries that have diplomatic relations with North Korea"]

Syntax for MT

[Figure: a parse tree for "I ate lunch" (S → NP VP, with PRO "I", VBD "ate", NN "lunch") is transformed into Japanese "watashi wa hirugohan wo tabeta" via tree-transformation rules such as "x y → x wa y" and "x y → y wo x", which insert particles and reorder verb and object; watashi = I, hirugohan = lunch, tabeta = ate]

Kevin Knight, Daniel Marcu, Ignacio Thayer, Jonathan Graehl, Jon May, Steve DeNeefe

Syntax for MT

- Decoding:
  - Tree-to-tree/string automata
  - CKY parsing algorithm
  - Currently only unigram LM and trigram reranking
  - No syntax-based LM
- Rule learning:
  - Parsed English corpus
  - Aligned data (GIZA++)
  - Extract rules and assign probabilities

[Chart: BLEU over time for phrase-based MT vs. syntax-based MT]

Traditional NLP

- The process of building computational models for understanding natural language
- INPUT: natural language text (or speech)
- OUTPUT: representation of the meaning of the text
- Sometimes called natural language understanding (NLU)

Big Question: What does "understand" mean?


Examples of structured problems


Examples of demonstrations



NLP as transduction

Task: Machine Translation
  Input:  Ces deux principes se tiennent à la croisée de la philosophie, de la politique, de l'économie, de la sociologie et du droit.
  Output: Both principles lie at the crossroads of philosophy, politics, economics, sociology, and law.

Task: Document Summarization
  Input:  Argentina was still obsessed with the Falkland Islands even in 1994, 12 years after its defeat in the 74-day war with Britain. The country's overriding foreign policy aim continued to be winning sovereignty over the islands.
  Output: The Falkland islands war, in 1982, was fought between Britain and Argentina.

Task: Syntactic Analysis
  Input:  The man ate a big sandwich.
  Output: [parse tree for "The man ate a big sandwich"]

...many more...


Structured prediction 101

Learn a function f : X → Y mapping inputs to complex outputs (Input Space → Decoding → Output Space):

- Sequence Labeling: "I can can a can" → one tag sequence out of many candidates:
    Pro Md Vb Dt Nn
    Pro Md Vb Dt Vb
    Pro Md Vb Dt Md
    Pro Md Nn Dt Nn
    Pro Md Nn Dt Vb
    Pro Md Nn Dt Md
    Pro Md Md Dt Nn
    Pro Md Md Dt Vb
- Parsing
- Coreference Resolution: Barack Obama / John McCain / he / the President
- Machine Translation: "Mary did not slap the green witch ." → "Mary no dio una bofetada a la bruja verde ."



Why is structure important?

- Correlations among outputs
  - Determiners often precede nouns
  - Sentences usually have verbs
- Global coherence
  - It just doesn't make sense to have three determiners next to each other
- My objective (aka "loss function") forces it
  - Translations should have good sequences of words
  - Summaries should be coherent


Outline: Part II

- Learning to Search
  - Incremental parsing
  - Learning to queue
- Refresher on Markov Decision Processes
- Inverse Reinforcement Learning
  - Determining rewards given policies
  - Maximum margin planning
- Learning by Demonstration
  - Searn
  - Dagger
- Discussion


Refresher on Binary Classification


What does it mean to learn?

- Informally:
  - to predict the future based on the past
- Slightly-less-informally:
  - to take labeled examples and construct a function that will label them as a human would
- Formally:
  - Given:
    - a fixed unknown distribution D over X × Y
    - a loss function over Y × Y
    - a finite sample of (x, y) pairs drawn i.i.d. from D
  - Construct a function f that has low expected loss with respect to D


Feature extractors

- A feature extractor φ maps examples to vectors: φ : X → ℝᴰ
- Feature vectors in NLP are frequently sparse

Example input:

  "Dear Sir.
  First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. ..."

Extracted features:

  W=dear     : 1
  W=sir      : 1
  W=this     : 2
  ...
  W=wish     : 0
  ...
  MISSPELLED : 2
  NAMELESS   : 1
  ALL_CAPS   : 0
  NUM_URLS   : 0
  ...
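
A minimal sketch of such an extractor over raw text. The W=word counts match the slide; the other feature definitions here are illustrative guesses, not Hal's exact ones:

```python
import re

def extract_features(text):
    """Map an example to a sparse feature vector (a dict of nonzero features)."""
    feats = {}
    for w in re.findall(r"[a-z]+", text.lower()):
        feats["W=" + w] = feats.get("W=" + w, 0) + 1
    feats["NUM_URLS"] = text.count("http")  # crude URL count
    feats["ALL_CAPS"] = sum(1 for t in text.split() if t.isupper() and len(t) > 1)
    return feats

print(extract_features("Dear Sir. First, I must solicit your confidence ..."))
```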


Linear models for binary classification

- Decision boundary is the set of "uncertain" points
- Linear decision boundaries are characterized by weight vectors

[Figure: points in feature space (axes f1, f2) separated by a linear boundary]

Example, x = "free money":

  φ(x):              w:
    BIAS  : 1          BIAS  : -3
    free  : 1          free  : 4
    money : 1          money : 2
    the   : 0          the   : 0
    ...                ...

  Σᵢ wᵢ φᵢ(x) = (-3)(1) + (4)(1) + (2)(1) = 3 > 0


The perceptron

- Inputs = feature values φ₁, φ₂, φ₃, ...
- Params = weights w₁, w₂, w₃, ...
- The sum Σᵢ wᵢ φᵢ(x) is the response
- If the response is:
  - positive, output +1
  - negative, output -1
- When training, update on errors:

  "Error" when  y (w·φ(x)) ≤ 0;  update  w = w + y φ(x)
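
A minimal sketch of that training loop over sparse feature dicts (toy data; the dot helper is reused by the later sketches):

```python
def dot(w, feats):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def train_perceptron(data, epochs=10):
    w = {}
    for _ in range(epochs):
        for feats, y in data:            # y is +1 or -1
            if y * dot(w, feats) <= 0:   # error: response has the wrong sign
                for f, v in feats.items():
                    w[f] = w.get(f, 0.0) + y * v  # w = w + y phi(x)
    return w

data = [({"BIAS": 1, "free": 1, "money": 1}, +1),
        ({"BIAS": 1, "the": 1, "meeting": 1}, -1)]
w = train_perceptron(data)
print(sorted(w.items()))
```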


Why does that update work?

When y (w_old·φ(x)) ≤ 0, update: w_new = w_old + y φ(x). Then:

  y (w_new·φ(x)) = y (w_old + y φ(x))·φ(x)
                 = y (w_old·φ(x)) + y² (φ(x)·φ(x))
                        [≤ 0]          [> 0]

so each update strictly increases the response on the misclassified example.


Support vector machines

- Explicitly optimize the margin
- Enforce that all training points are correctly classified

[Figure: linearly separable points (axes f1, f2) with a maximum-margin boundary]

  max_w margin   s.t. all points are correctly classified

  max_w margin   s.t.  yₙ (w·φ(xₙ)) ≥ 1, ∀n

  min_w ‖w‖²     s.t.  yₙ (w·φ(xₙ)) ≥ 1, ∀n


Support vector machines with slack

- Explicitly optimize the margin
- Allow some "noisy" points to be misclassified

[Figure: mostly separable points with a few slack points inside the margin]

  min_{w,ξ}  ½ ‖w‖² + C Σₙ ξₙ
  s.t.       yₙ (w·φ(xₙ)) + ξₙ ≥ 1, ∀n
             ξₙ ≥ 0, ∀n


Batch versus stochastic optimization

- Batch = read in all the data, then process it:

  min_{w,ξ}  ½ ‖w‖² + C Σₙ ξₙ
  s.t.       yₙ (w·φ(xₙ)) + ξₙ ≥ 1, ∀n;  ξₙ ≥ 0, ∀n

- Stochastic = (roughly) process a bit at a time:

  For n = 1..N:
    If yₙ (w·φ(xₙ)) ≤ 0:
      w = w + yₙ φ(xₙ)


Stochastically optimized SVMs

Perceptron update:                 SVM objective, after SOME MATH:

  For n = 1..N:                      For n = 1..N:
    If yₙ (w·φ(xₙ)) ≤ 0:               w = (1 - 1/(CN)) w
      w = w + yₙ φ(xₙ)                 If yₙ (w·φ(xₙ)) ≤ 1:
                                         w = w + yₙ φ(xₙ)

Implementation note: weight shrinkage is SLOW. Implement it lazily, at the cost of double storage.
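
A sketch of that stochastic SVM update, with the shrinkage applied eagerly for clarity; the lazy trick from the implementation note would instead keep a single global scale factor alongside w. Reuses dot from the perceptron sketch:

```python
def train_svm_sgd(data, C=1.0, epochs=10):
    N = len(data)
    w = {}
    for _ in range(epochs):
        for feats, y in data:
            for f in w:                        # w = (1 - 1/(CN)) w
                w[f] *= 1.0 - 1.0 / (C * N)
            if y * dot(w, feats) <= 1:         # margin violation, not just error
                for f, v in feats.items():
                    w[f] = w.get(f, 0.0) + y * v
    return w
```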

From Perceptron to Structured Perceptron


Perceptron with multiple classes

- Store a separate weight vector for each class: w₁, w₂, ..., w_K

[Figure: feature space carved into regions where w₁·φ(x), w₂·φ(x), or w₃·φ(x) is biggest]

  For n = 1..N:
    Predict: ŷ = argmax_k (w_k·φ(xₙ))
    If ŷ ≠ yₙ:
      w_ŷ  = w_ŷ  - φ(xₙ)
      w_yₙ = w_yₙ + φ(xₙ)

Why does this do the right thing??!


Perceptron with multiple classes v2

- Originally (one weight vector per class, w₁ w₂ w₃):

  For n = 1..N:
    Predict: ŷ = argmax_k (w_k·φ(xₙ))
    If ŷ ≠ yₙ:  w_ŷ = w_ŷ - φ(xₙ);  w_yₙ = w_yₙ + φ(xₙ)

- Equivalently (one weight vector w, joint features φ(x, k)):

  For n = 1..N:
    Predict: ŷ = argmax_k (w·φ(xₙ, k))
    If ŷ ≠ yₙ:  w = w + φ(xₙ, yₙ) - φ(xₙ, ŷ)


Perceptron with multiple classes v2 (continued)

Same algorithms as above; the joint features just stack one copy of φ(x) per class. For x = "free money":

  φ(x, 1):             φ(x, 2):
    spam_BIAS  : 1       ham_BIAS  : 1
    spam_free  : 1       ham_free  : 1
    spam_money : 1       ham_money : 1
    spam_the   : 0       ham_the   : 0
    ...                  ...


Features for structured prediction

  I   can  can  a   can
  Pro Md   Vb   Dt  Nn

  φ(x, y) =
    Output features:    Markov features:    Other features:
      I_Pro   : 1         <s>-Pro : 1         has_verb   : 1
      can_Md  : 1         Pro-Md  : 1         has_nn_lft : 0
      can_Vb  : 1         Md-Vb   : 1         has_n_lft  : 1
      a_Dt    : 1         Vb-Dt   : 1         has_nn_rgt : 1
      can_Nn  : 1         Dt-Nn   : 1         has_n_rgt  : 1
      ...                 Nn-</s> : 1         ...

- Allowed to encode anything you want
- Output features, Markov features, other features
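
A sketch of the output and Markov parts of φ(x, y) as a sparse dict (the "other features" column is omitted):

```python
def phi(words, tags):
    """Joint features for sequence labeling: word_tag and tag-tag counts."""
    feats = {}
    for w, t in zip(words, tags):                   # output features
        feats[w + "_" + t] = feats.get(w + "_" + t, 0) + 1
    padded = ["<s>"] + list(tags) + ["</s>"]
    for a, b in zip(padded, padded[1:]):            # Markov features
        feats[a + "-" + b] = feats.get(a + "-" + b, 0) + 1
    return feats

print(sorted(phi("I can can a can".split(), ["Pro", "Md", "Vb", "Dt", "Nn"])))
```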


Structured perceptron

  Multiclass (enumeration over 1..K):     Structured (enumeration over all outputs):

    For n = 1..N:                           For n = 1..N:
      Predict: ŷ = argmax_k (w·φ(xₙ, k))      Predict: ŷ = argmax_y (w·φ(xₙ, y))
      If ŷ ≠ yₙ:                              If ŷ ≠ yₙ:
        w = w + φ(xₙ, yₙ) - φ(xₙ, ŷ)            w = w + φ(xₙ, yₙ) - φ(xₙ, ŷ)

[Collins, EMNLP02]
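
A sketch of this loop with a brute-force argmax that literally enumerates all outputs — fine for a five-word toy example, and replaced by Viterbi on the next slide. Reuses dot and phi from the sketches above:

```python
import itertools

TAGS = ["Pro", "Md", "Vb", "Dt", "Nn"]

def argmax_y(w, words):
    """Enumerate every tag sequence and keep the highest-scoring one."""
    return list(max(itertools.product(TAGS, repeat=len(words)),
                    key=lambda tags: dot(w, phi(words, tags))))

def train_structured_perceptron(data, epochs=5):
    w = {}
    for _ in range(epochs):
        for words, gold in data:
            pred = argmax_y(w, words)
            if pred != list(gold):
                for f, v in phi(words, gold).items():
                    w[f] = w.get(f, 0.0) + v      # promote the truth
                for f, v in phi(words, pred).items():
                    w[f] = w.get(f, 0.0) - v      # demote the prediction
    return w

w = train_structured_perceptron(
    [("I can can a can".split(), ["Pro", "Md", "Vb", "Dt", "Nn"])])
```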


Argmax for sequences

- If we only have output and Markov features, we can use the Viterbi algorithm (plus some work to account for boundary conditions):

[Trellis for "I can can": one column of nodes per word, scored by output features w·φ[I_Pro], w·φ[I_Md], w·φ[I_Vb], w·φ[can_Pro], ...; edges between columns scored by Markov features w·φ[Pro-Pro], w·φ[Pro-Md], w·φ[Pro-Vb], ...]
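
A sketch of that trellis search: node scores from output features, edge scores from Markov features, boundary handling omitted. It plugs into the structured perceptron sketch in place of argmax_y:

```python
def viterbi(w, words, tags):
    """Best tag sequence under output (word_tag) and Markov (tag-tag) scores."""
    best = {t: w.get(words[0] + "_" + t, 0.0) for t in tags}  # first column
    back = []
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            node = w.get(word + "_" + t, 0.0)
            prev, s = max(((p, best[p] + w.get(p + "-" + t, 0.0) + node)
                           for p in tags), key=lambda pair: pair[1])
            scores[t], pointers[t] = s, prev
        best, back = scores, back + [pointers]
    tag = max(best, key=best.get)
    seq = [tag]
    for pointers in reversed(back):   # follow backpointers
        tag = pointers[tag]
        seq.append(tag)
    return list(reversed(seq))

print(viterbi(w, "I can can a can".split(), TAGS))  # w, TAGS from above
```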


Structured perceptron as ranking

  For n = 1..N:
    Run Viterbi: ŷ = argmax_y (w·φ(xₙ, y))
    If ŷ ≠ yₙ:  w = w + φ(xₙ, yₙ) - φ(xₙ, ŷ)

- When does this make an update?

  I can can a can  (truth first):
    Pro Md Vb Dt Nn
    Pro Md Vb Dt Vb
    Pro Md Vb Dt Md
    Pro Md Nn Dt Nn
    Pro Md Nn Dt Md
    Pro Md Md Dt Nn
    Pro Md Md Dt Vb


From perceptron to margins

Binary SVM (maximize margin, minimize errors; each point is correctly classified, modulo ξ):

  min_{w,ξ}  ½ ‖w‖² + C Σₙ ξₙ
  s.t.       yₙ (w·φ(xₙ)) + ξₙ ≥ 1, ∀n

Structured version (each true output is more highly ranked than every other output, modulo ξ; response for truth vs. response for other):

  min_{w,ξ}  ½ ‖w‖² + C Σₙ ξₙ
  s.t.       w·φ(xₙ, yₙ) - w·φ(xₙ, y) + ξₙ ≥ 1, ∀n, ∀y ≠ yₙ

[Taskar et al., JMLR05; Tsochantaridis et al., JMLR05]


From perceptron to margins

$\min_{w,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_n \xi_n$
s.t. $w \cdot \phi(x_n, y_n) - w \cdot \phi(x_n, y) \geq 1 - \xi_n, \;\forall n, \forall y \neq y_n$

[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]

[Figure: for each example, the true output's feature vector φ(x1,y1) (resp. φ(x2,y2)) is separated, by a margin, from the feature vectors φ(x1,y) (resp. φ(x2,y)) of all other outputs.]


Ranking margins

- Some errors are worse than others...

I can can a can

Pro Md Vb Dt Nn

Pro Md Vb Dt Vb

Pro Md Vb Dt Md

Pro Md Nn Dt Nn

Pro Md Nn Dt Md

Pro Md Md Dt Nn

Pro Md Md Dt Vb

Margin of one

[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]


Accounting for a loss function

- Some errors are worse than others...

I can can a can

Pro Md Vb Dt Nn

Margin of l(y,y')

[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]

Pro Md Vb Dt Vb
Pro Md Vb Nn Md
Nn Md Vb Nn Md
Pro Md Md Dt Nn    Pro Md Nn Dt Nn
Nn Md Vb Dt Md     Pro Vb Nn Dt Nn
Nn Md Vb Md Md     Pro Vb Nn Vb Nn


Accounting for a loss function

The constraint with a unit margin,
$w \cdot \phi(x_n, y_n) - w \cdot \phi(x_n, y) \geq 1 - \xi_n$,
becomes, with a loss-scaled margin,
$\forall y: \; w \cdot \phi(x_n, y_n) - w \cdot \phi(x_n, y) \geq l(y_n, y) - \xi_n$,
which is equivalent to
$\xi_n \geq \max_y \left[ l(y_n, y) - w \cdot \phi(x_n, y_n) + w \cdot \phi(x_n, y) \right]$

[Figure: the same separation picture as before, but each wrong output must now be separated by a margin that scales with l(y_n, y).]

[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]


Augmented argmax for sequences

- Add "loss" to each wrong node!

[Figure: the same trellis over "I can can", with +1 added to the score of every node whose tag disagrees with the truth.]

[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]

What are we assuming here??!
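With the `extra` hook from the Viterbi sketch above, the loss-augmented argmax for a per-position (Hamming) loss is a one-liner; that per-position decomposability is exactly the assumption the question is pointing at:

```python
def loss_augmented_viterbi(words, tags, w, gold):
    """argmax_y [ w . phi(x, y) + hamming(gold, y) ]: +1 on every wrong node."""
    return viterbi(words, tags, w,
                   extra=lambda i, t: 0.0 if t == gold[i] else 1.0)
```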


Stochastically optimizing Markov nets

SOME MATH

M3N Objective

- For n = 1..N:
  - Viterbi: $y = \arg\max_{y'} w \cdot \phi(x_n, y')$
  - If $y \neq y_n$: $w \leftarrow w - \phi(x_n, y) + \phi(x_n, y_n)$

- For n = 1..N:
  - Augmented Viterbi: $y = \arg\max_{y'} \left[ w \cdot \phi(x_n, y') + l(y_n, y') \right]$
  - If $y \neq y_n$: $w \leftarrow w - \phi(x_n, y) + \phi(x_n, y_n)$
  - Shrink: $w \leftarrow \left(1 - \frac{1}{CN}\right) w$

[Ratliff+al, AIStats07]
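A sketch of the resulting stochastic subgradient step, reusing the hypothetical helpers from the earlier snippets (the learning rate is an added assumption; the slides' update uses a unit step):

```python
def m3n_epoch(data, tags, w, C=1.0, lr=1.0):
    """One pass of the stochastic subgradient method for the M3N objective,
    reusing structured_features and loss_augmented_viterbi from above."""
    N = len(data)
    for words, gold in data:
        y_hat = loss_augmented_viterbi(words, tags, w, gold)
        if y_hat != gold:
            for f, v in structured_features(words, gold).items():
                w[f] = w.get(f, 0.0) + lr * v
            for f, v in structured_features(words, y_hat).items():
                w[f] = w.get(f, 0.0) - lr * v
        shrink = 1.0 - 1.0 / (C * N)        # w <- (1 - 1/(CN)) w
        for f in list(w):
            w[f] *= shrink
    return w
```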


Stacking

- Structured models: accurate but slow
- Independent models: less accurate but fast
- Stacking: multiple independent models

[Figure: independent models map input features to output labels; stacking feeds those outputs, along with the input features, into a second model that produces "Output Labels 2".]


Training a stacked model

- Train independent classifier f1 on input features
- Train independent classifier f2 on input features + f1's output
- Danger: overfitting!
- Solution: cross-validation (see the sketch below)

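A sketch of the cross-validation fix using scikit-learn (cross_val_predict does produce held-out predictions; numeric labels and feature matrices are assumed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def train_stacked(X, y):
    """f2 is trained on [input features + f1's held-out predictions]:
    cross_val_predict gives each row a prediction from a model that never
    saw that row, so f2 does not overfit to f1's training-set accuracy."""
    f1 = LogisticRegression(max_iter=1000)
    f1_out = cross_val_predict(f1, X, y, cv=5).reshape(-1, 1)  # held-out predictions
    f1.fit(X, y)                     # refit f1 on all data for test time
    f2 = LogisticRegression(max_iter=1000)
    f2.fit(np.hstack([X, f1_out]), y)
    return f1, f2

def predict_stacked(f1, f2, X):
    return f2.predict(np.hstack([X, f1.predict(X).reshape(-1, 1)]))
```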


Do we really need structure?

- Structured models: accurate but slow
- Independent models: less accurate but fast
- Goal: transfer power to get fast + accurate
- Questions: are independent models...
  - ... expressive enough? (approximation error)
  - ... easy to learn? (estimation error)

[Liang+D+Klein, ICML08]


"Compiling" structure out [Liang+D+Klein, ICML08]

Labeled data; f1 = words/prefixes/suffixes/forms
  CRF(f1): POS 95.0%, NER 75.3%
  IRL(f1): POS 91.7%, NER 69.1%


"Compiling" structure out [Liang+D+Klein, ICML08]

Labeled data + auto-labeled data; f1 = words/prefixes/suffixes/forms, f2 = f1 applied to a larger window
  CRF(f1): POS 95.0%, NER 75.3%
  IRL(f1): POS 91.7%, NER 69.1%
  IRL(f2): POS 94.4%, NER 66.2%
  CompIRL(f2): POS 95.0%, NER 72.7%


Decomposition of errors [Liang+D+Klein, ICML08]

CRF(f1): pC → marginalized CRF(f1): pMC → IRL(f∞): pA* → IRL(f2): p1*
(the gaps between successive distributions correspond to global information, nonlinearities, and coherence)

Theorem: KL(pC || p1*) = KL(pC || pMC) + KL(pMC || pA*) + KL(pA* || p1*)

- Coherence: sum of MI on edges; POS = .003 (95.0% → 95.0%), NER = .009 (76.3% → 76.0%)
- Train a truncated CRF: NER 76.0% → 72.7%
- Train a marginalized CRF: NER 76.0% → 76.0%


Structure compilation results

[Figure: accuracy vs. amount of training data for structured vs. independent models, on three tasks: part of speech, named entity, and parsing; the gap narrows as data grows.]

[Liang+D+Klein, ICML08]


Learning to Search


Argmax is hard!

- Classic formulation of structured prediction:
  $\text{score}(x, y) = \ldots$ something we learn, to make "good" (x, y) pairs score highly
- At test time: $f(x) = \arg\max_{y \in Y} \text{score}(x, y)$
  - Combinatorial optimization problem; efficient only in very limiting cases
  - Solved by heuristic search: beam + A* + local search



Order these words: bart better I madonna say than ,




Best search (32.3): I say better than bart madonna ,

Original (41.6): better bart than madonna , I say


Best search (51.6): and so could really be a neural apparently thought things as dissimilar firing two identical


Original (64.3): could two things so apparently dissimilar as a thought and neural firing really be identical

[Soricut, PhD Thesis, USC 2007]


Incremental parsing, early 90s style

I / Pro   can / Md   can / Vb   a / Dt   can / Nn

[Figure: incremental derivation building NP, NP, VP, S over the sentence via actions Unary, Left, Up, Right.]

Train a classifier to make decisions

[Magerman, ACL95]


Incremental parsing, mid 2000s style

I / Pro   can / Md   can / Vb   a / Dt   can / Nn

[Figure: incremental derivation building NP, VP, S, NP.]

Train a classifier to make decisions

[Collins+Roark, ACL04]


Learning to beam-search

I / Pro   can / Md   can / Vb   a / Dt   can / Nn   (NP, VP, S, NP)

- For n = 1..N:
  - Viterbi: $y = \arg\max_{y'} w \cdot \phi(x_n, y')$
  - If $y \neq y_n$: $w \leftarrow w - \phi(x_n, y) + \phi(x_n, y_n)$

Replace exact Viterbi with beam search.

[Collins+Roark, ACL04]


Learning to beam-search

- For n = 1..N:
  - Run beam search until the truth falls out of the beam
  - Update weights immediately! (see the sketch below)

[Collins+Roark, ACL04]
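A sketch of the early-update idea; everything here is schematic (`expand`, `features`, and the per-example list of gold states are assumed helpers, not the paper's code):

```python
def score(w, feats):
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def early_update_epoch(data, w, expand, features, beam_size=5):
    """Collins+Roark-style training. For each example, run beam search; the
    moment the gold prefix falls out of the beam, update toward the gold
    prefix and away from the current best, then stop this example.
    `expand(x, state)` yields successor states; `features(x, state)` is a
    dict; gold_states[i] is the correct state after i steps."""
    for x, gold_states in data:
        beam = [gold_states[0]]
        for step in range(1, len(gold_states)):
            candidates = [s2 for s in beam for s2 in expand(x, s)]
            candidates.sort(key=lambda s: score(w, features(x, s)), reverse=True)
            beam = candidates[:beam_size]
            if gold_states[step] not in beam:
                for f, v in features(x, gold_states[step]).items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in features(x, beam[0]).items():
                    w[f] = w.get(f, 0.0) - v
                break   # early update; LaSO instead restarts from the truth
    return w
```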


Learning to beam-search

- For n = 1..N:
  - Run beam search until the truth falls out of the beam
  - Update weights immediately!
  - Restart at truth

[D+Marcu, ICML05; Xu+al, JMLR09]


Incremental parsing results [Collins+Roark, ACL04]


Generic Search Formulation

- Search Problem:
  - Search space
  - Operators
  - Goal-test function
  - Path-cost function
- Search Variable:
  - Enqueue function

  nodes := MakeQueue(S0)
  while nodes is not empty:
    node := RemoveFront(nodes)
    if node is a goal state: return node
    next := Operators(node)
    nodes := Enqueue(nodes, next)
  fail

Varying the Enqueue function can give us DFS, BFS, beam search, A* search, etc.

[D+Marcu, ICML05; Xu+al, JMLR09]
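The same skeleton as a Python sketch; only `enqueue` changes between strategies (all helper names here are assumptions):

```python
def generic_search(start, operators, is_goal, enqueue):
    """Vary `enqueue` to get DFS, BFS, beam search, A*, ..."""
    nodes = [start]
    while nodes:
        node = nodes.pop(0)                     # RemoveFront
        if is_goal(node):
            return node
        nodes = enqueue(nodes, operators(node))
    return None                                 # fail

bfs_enqueue = lambda nodes, successors: nodes + list(successors)    # FIFO
dfs_enqueue = lambda nodes, successors: list(successors) + nodes    # LIFO

def beam_enqueue(score, k):
    """Keep only the k best nodes by score."""
    return lambda nodes, successors: sorted(
        nodes + list(successors), key=score, reverse=True)[:k]
```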


Online Learning Framework (LaSO)

  nodes := MakeQueue(S0)
  while nodes is not empty:
    node := RemoveFront(nodes)
    if none of {node} ∪ nodes is y-good, or node is a goal & not y-good:
      sibs := siblings(node, y)
      w := update(w, x, sibs, {node} ∪ nodes)
      nodes := MakeQueue(sibs)
    else:
      if node is a goal state: return w
      next := Operators(node)
      nodes := Enqueue(nodes, next)

Monotonicity: for any node, we can tell whether it can lead to the correct solution or not.

If we erred: where should we have gone? Update the weights based on the good and the bad choices, then continue the search...

[D+Marcu, ICML05; Xu+al, JMLR09]


Search-based Margin

- The margin is the amount by which we are correct:
- Note that the margin, and hence linear separability, is also a function of the search algorithm!

[Figure: along the weight direction u, the scores u·φ(x, g1) and u·φ(x, g2) of good search nodes sit above the scores u·φ(x, b1) and u·φ(x, b2) of bad ones.]

[D+Marcu, ICML05; Xu+al, JMLR09]


Syntactic chunking results

[Figure: F-score vs. training time in minutes, with points at 33, 22, 24, and 4 minutes, compared against [Collins 2002] and [Zhang+Damerau+Johnson 2002] (timing unknown).]

[D+Marcu, ICML05; Xu+al, JMLR09]


Tagging+chunking results

[Figure: joint tagging/chunking accuracy vs. training time in hours (log scale), with points at 23, 7, 3, and 1 minutes, compared against [Sutton+McCallum 2004].]

[D+Marcu, ICML05; Xu+al, JMLR09]


Variations on a beam

- Observation: we needn't use the same beam size for training and decoding
- Varying these values independently yields:

                       Decoding beam
  Training beam      1     5    10    25    50
              1   93.9  92.8  91.9  91.3  90.9
              5   90.5  94.3  94.4  94.1  94.1
             10   89.5  94.3  94.4  94.2  94.2
             25   88.7  94.2  94.5  94.3  94.3
             50   88.4  94.2  94.4  94.2  94.4

[D+Marcu, ICML05; Xu+al, JMLR09]


What if our model sucks?

- Sometimes our model cannot produce the "correct" output
  - Canonical example: machine translation

[Figure: the set of model outputs overlaps the set of good outputs, but the reference may lie outside what the model can produce. A "local" update moves the current hypothesis toward the best output in an n-best list (or "optimal decoding", or ...); a "bold" update moves it toward the best achievable output.]

[Och, ACL03; Liang+al, ACL06]


Local versus bold updating...

[Figure: machine translation performance (Bleu, roughly 32.5-35.5) under monotonic and distortion decoding, comparing bold updates, local updates, and Pharaoh.]


Refresher on Markov Decision Processes


Reinforcement learning

- Basic idea:
  - Receive feedback in the form of rewards
  - The agent's utility is defined by the reward function
  - Must learn to act so as to maximize expected rewards
  - Change the rewards, change the learned behavior
- Examples:
  - Playing a game: reward at the end for the outcome
  - Vacuuming: reward for each piece of dirt picked up
  - Driving a taxi: reward for each passenger delivered


Markov decision processes

What are the values (expected future rewards) of states and actions?

[Figure: a small MDP over states 1-5 with stochastic transitions (probabilities 0.6/0.4, 0.1/0.9, 1.0); Q*(s,a1) = 30, Q*(s,a2) = 23, Q*(s,a3) = 17, so V*(s) = 30.]


Markov Decision Processes

- An MDP is defined by:
  - A set of states s ∈ S
  - A set of actions a ∈ A
  - A transition function T(s, a, s'): the probability that a from s leads to s', i.e. P(s' | s, a); also called the model
  - A reward function R(s, a, s'); sometimes just R(s) or R(s')
  - A start state (or distribution), and maybe a terminal state
- MDPs are a family of non-deterministic search problems
- Total utility is one of: $\sum_t r_t$ or $\sum_t \gamma^t r_t$


Solving MDPs

- In a deterministic single-agent search problem, we want an optimal plan, or sequence of actions, from the start to a goal
- In an MDP, we want an optimal policy π(s)
  - A policy gives an action for each state
  - The optimal policy maximizes expected utility if followed
  - Defines a reflex agent

Optimal policy when R(s, a, s') = -0.04 for all non-terminal s


Example Optimal Policies

[Figure: gridworld optimal policies for R(s) = -2.0, -0.4, -0.03, and -0.01.]


Optimal Utilities

- Fundamental operation: compute the optimal utilities of all states s at once
- Why? Optimal values define optimal policies!
- Define the utility of a state s: V*(s) = expected return starting in s and acting optimally
- Define the utility of a q-state (s,a): Q*(s,a) = expected return starting in s, taking action a, and thereafter acting optimally
- Define the optimal policy: π*(s) = optimal action from state s

[Figure: a lookahead tree alternating states s, q-states (s,a), transitions (s,a,s'), and successor states s'.]


The Bellman Equations

- The definition of utility leads to a simple one-step lookahead relationship amongst optimal utility values: optimal rewards = maximize over the first action and then follow the optimal policy
- Formally:
  $V^*(s) = \max_a Q^*(s, a)$
  $Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$


Solving MDPs / memoized recursion

- Recurrences: $V^*(s) = \max_a Q^*(s, a)$ and $Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$
- Cache all function call results so you never repeat work
- What happened to the evaluation function?


Q-Value Iteration

- Value iteration: iterate approximately optimal values
  - Start with V0*(s) = 0, which we know is right (why?)
  - Given Vi*, calculate the values for all states for depth i+1:
    $V_{i+1}^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_i^*(s') \right]$
- But Q-values are more useful!
  - Start with Q0*(s,a) = 0, which we know is right (why?)
  - Given Qi*, calculate the q-values for all q-states for depth i+1:
    $Q_{i+1}^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \max_{a'} Q_i^*(s', a') \right]$
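A tabular q-value iteration sketch (the representation of T, R, and gamma is an assumption about how the MDP is stored):

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, iters=100):
    """Tabular q-value iteration. Assumed representation: T(s, a) returns a
    list of (prob, s') pairs and R(s, a, s') returns a reward."""
    Q = {(s, a): 0.0 for s in states for a in actions}   # Q_0* = 0
    for _ in range(iters):
        V = {s: max(Q[(s, a)] for a in actions) for s in states}
        Q = {(s, a): sum(p * (R(s, a, s2) + gamma * V[s2]) for p, s2 in T(s, a))
             for s in states for a in actions}
    return Q
```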


RL = Unknown MDPs

- If we knew the MDP (i.e., the reward function and transition function):
  - Value iteration leads to optimal values
  - Will always converge to the truth
- Reinforcement learning is what we do when we do not know the MDP
  - All we observe is a trajectory (s1,a1,r1, s2,a2,r2, s3,a3,r3, ...)
- Many algorithms exist for this problem; see Sutton+Barto's excellent book!


Q-Learning

- Learn Q*(s,a) values
  - Receive a sample (s, a, s', r)
  - Consider your old estimate: Q(s, a)
  - Consider your new sample estimate: $r + \gamma \max_{a'} Q(s', a')$
  - Incorporate the new estimate into a running average:
    $Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') \right]$


Exploration / Exploitation

- Several schemes for forcing exploration
  - Simplest: random actions (ε-greedy)
    - Every time step, flip a coin
    - With probability ε, act randomly
    - With probability 1-ε, act according to the current policy
- Problems with random actions?
  - You do explore the space, but keep thrashing around once learning is done
  - One solution: lower ε over time
  - Another solution: exploration functions
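Putting the last two slides together, a sketch of tabular Q-learning with ε-greedy exploration (the `env` interface is an assumption, loosely gym-style):

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning with epsilon-greedy exploration. The `env` interface
    is an assumption: reset() -> s and step(a) -> (s', r, done)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:                          # explore
                a = random.choice(actions)
            else:                                              # exploit
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            sample = r + gamma * max(Q[(s2, act)] for act in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample   # running average
            s = s2
    return Q
```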


Q-Learning

- In realistic situations, we cannot possibly learn about every single state!
  - Too many states to visit them all in training
  - Too many states to hold the q-tables in memory
- Instead, we want to generalize:
  - Learn about some small number of training states from experience
  - Generalize that experience to new, similar states
- Very simple stochastic updates, with a linear q-function $Q(s, a) = w \cdot f(s, a)$:
  $w_i \leftarrow w_i + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] f_i(s, a)$


Inverse Reinforcement Learning (aka Inverse Optimal Control)


Inverse RL: Task

- Given:
  - measurements of an agent's behavior over time, in a variety of circumstances
  - if needed, measurements of the sensory inputs to that agent
  - if available, a model of the environment
- Determine: the reward function being optimized
- Proposed by [Kalman68]
- First solution by [Boyd94]


Why inverse RL?

- Computational models for animal learning
  - "In examining animal and human behavior we must consider the reward function as an unknown to be ascertained through empirical investigation."
- Agent construction
  - "An agent designer [...] may only have a very rough idea of the reward function whose optimization would generate 'desirable' behavior."
  - e.g., "driving well"
- Multi-agent systems and mechanism design
  - learning opponents' reward functions that guide their actions, to devise strategies against them


IRL from Sample Trajectories [Ng+Russell, ICML00]

- Optimal policy available through sampled trajectories (e.g., driving a car)
- Want to find a reward function that makes this policy look as good as possible
- Write the reward so it is linear, $R_w(s) = w \cdot \phi(s)$, and let $V_w^\pi(s_0)$ be the value of the starting state
- $\max_w \sum_{k=1}^K f\!\left( V_w^{\pi^*}(s_0) - V_w^{\pi_k}(s_0) \right)$
  (how good does the "optimal policy" look, versus how good does some other policy look?)
- Warning: need to be careful to avoid trivial solutions!


Apprenticeship Learning via IRL

- For t = 1, 2, ...
  - Inverse RL step: estimate the expert's reward function R(s) = w^T f(s) such that, under R(s), the expert performs better than all previously found policies {π_i}
  - RL step: compute the optimal policy π_t for the estimated reward w

[Abbeel+Ng, ICML04]


Car Driving Experiment

- No explicit reward function at all!
- Expert demonstrates the proper policy via 2 min. of driving time on a simulator (1200 data points)
- 5 different "driver types" tried
- Features: which lane the car is in, distance to the closest car in the current lane
- Algorithm run for 30 iterations, policy hand-picked
- Movie time! (Expert left, IRL right)

[Abbeel+Ng, ICML04]


"Nice" driver


"Evil" driver


Maxent IRL [Ziebart+al, AAAI08]

- Distribution over trajectories: P(τ)
- Match the reward of observed behavior: $\sum_\tau P(\tau)\, f_\tau = f_{\text{dem}}$
- Maximize the causal entropy over trajectories given stochastic outcomes: $\max H(P(\tau) \,\|\, O)$
  (as uniform as possible; condition on random uncontrolled outcomes, but only after they happen)


Data collection [Ziebart+al, AAAI08]

- Features: length, speed, road type, lanes, accidents, construction, congestion, time of day
- 25 taxi drivers, over 100,000 miles


Predicting destinations....


Planning as structured prediction [Ratliff+al, NIPS05]


Maximum margin planning

- Let μ(s,a) denote the probability of reaching q-state (s,a) under the current model w

$\max_w$ margin, s.t. the planner run with w yields the human output:

$\min_w \; \frac{1}{2}\|w\|^2$
s.t. $\sum_{s,a} \mu(s,a)\, w \cdot \phi(x_n, s, a) \;-\; \sum_{s,a} \hat\mu(s,a)\, w \cdot \phi(x_n, s, a) \geq 1, \;\forall n$
(μ: q-state visitation frequency of the human; μ̂: q-state visitation frequency of the planner; over all trajectories and all q-states)

[Ratliff+al, NIPS05]


Optimizing MMP

SOME MATH

MMP Objective

- For n = 1..N:
  - Augmented planning: run A* on the current (augmented) cost map to get q-state visitation frequencies $\hat\mu(s,a)$
  - Update: $w \leftarrow w + \sum_s \sum_a \left[ \mu(s,a) - \hat\mu(s,a) \right] \phi(x_n, s, a)$
  - Shrink: $w \leftarrow \left(1 - \frac{1}{CN}\right) w$

[Ratliff+al, NIPS05]
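A sketch of one MMP step (the loss-augmented planner and the feature map are assumed interfaces, and the learning rate is an added assumption; visitation frequencies are dicts keyed by (s, a)):

```python
def mmp_step(w, phi, mu_human, augmented_planner, x, C, N, lr=0.1):
    """One (sub)gradient step of maximum margin planning, as sketched above.
    Assumed interfaces: augmented_planner(w, x) -> dict of q-state visitation
    frequencies under the loss-augmented cost map; phi(x, s, a) -> feature
    dict; mu_human -> the demonstrated visitation frequencies."""
    mu_plan = augmented_planner(w, x)
    for (s, a) in set(mu_human) | set(mu_plan):
        diff = mu_human.get((s, a), 0.0) - mu_plan.get((s, a), 0.0)
        for f, v in phi(x, s, a).items():
            # push scores of human-visited q-states up, planner-visited ones down
            w[f] = w.get(f, 0.0) + lr * diff * v
    shrink = 1.0 - 1.0 / (C * N)        # w <- (1 - 1/(CN)) w
    for f in list(w):
        w[f] *= shrink
    return w
```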


Maximum margin planning movies [Ratliff+al, NIPS05]


Parsing via inverse optimal control

- State space = all partial parse trees over the full sentence labeled "S"
- Actions: take a partial parse and split it anywhere in the middle
- Transitions: obvious
- Terminal states: when there are no actions left
- Reward: parse score at completion

[Neu+Szepesvari, MLJ09]


Parsing via inverse optimal control

[Figure: parsing accuracy (roughly 60-90) on small, medium, and large datasets for Maximum Likelihood, Projection Perceptron, Apprenticeship Learning, Maximum Margin, Maximum Entropy, and Policy Matching.]

[Neu+Szepesvari, MLJ09]


Learning by Demonstration


Integrating search and learning

Input: Le homme mange l' croissant.
Output: The man ate a croissant.

[Figure: a search over partial hypotheses, each paired with its coverage of the source "Le homme mange l' croissant.": "The man ate" expands to "The man ate a", "The man ate happy", "The man ate a fox", and eventually "The man ate a croissant"; a classifier 'h' chooses each expansion.]

[D+Marcu, ICML05; D+Langford+Marcu, MLJ09]


Reducing search to classification

- Natural chicken and egg problem:
  - Want h to get low expected future loss
  - ... on future decisions made by h
  - ... and starting from states visited by h
- Iterative solution

[Figure: the same search states over "Le homme mange l' croissant." / "The man ate a croissant.", now scored by the losses of successive policies h(t-1) and h(t): Loss = 0, 0, 1.8, 1.2, 0.5, 0.]

[D+Langford+Marcu, MLJ09]


Reduction for Structured Prediction

- Idea: view structured prediction in light of search

I sing merrily

[Figure: sequence labeling as search: at each word, choose a tag from {A, N, D, V, R}; each step here looks like it could be represented as a weighted multi-class problem.]

Can we formalize this idea?

Loss function: L([N V R], [N V R]) = 0; L([N V R], [N V V]) = 1/3; ...

[D+Langford+Marcu, MLJ09]


Reducing Structured Prediction

I sing merrily

[Figure: the same search over tags {A, N, D, V, R}.]

Key Assumption: an optimal policy for the training data (weak!)
  Given: input, true output, and state; Return: best successor state

Desired: a good policy on test data (i.e., given only the input string)

[D+Langford+Marcu, MLJ09]


How to Learn in Search

Idea: train based only on the optimal path (à la MEMM).

Better idea: train based only on the optimal policy; then train based on the optimal policy + a little learned policy; then train based on the optimal policy + a little more learned policy; then ... eventually use only the learned policy.

[D+Langford+Marcu, MLJ09]


How to Learn in Search

- Translating DSP into Searn(DSP, loss, π):
  - Draw x ~ DSP
  - Run π on x to get a path
  - Pick a position uniformly on the path
  - Generate an example with costs given by expected (w.r.t. π) completion costs for "loss"

I sing merrily
[Figure: at the word "sing", the candidate actions have completion costs c = 0, 1, 2, 1, 1.]

[D+Langford+Marcu, MLJ09]


Searn

Algorithm: Searn-Learn(A, DSP, loss, π*, β)
1: Initialize: π = π*
2: while not converged do
3:   Sample: D ~ Searn(DSP, loss, π)
4:   Learn: h ← A(D)
5:   Update: π ← (1-β) π + β h
6: end while
7: return π without π*

Theorem: L(π) ≤ L(π*) + 2 avg(loss) T ln T + c (1 + ln T) / T

Ingredients for Searn:
- Input space (X) and output space (Y), data from X
- Loss function loss(y, y') and features
- "Optimal" policy π*(x, y0)

[D+Langford+Marcu, MLJ09]
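A sketch of the Searn loop; every interface here is an assumption, since the algorithm is agnostic to how states, policies, and rollouts are implemented:

```python
import random

def searn_train(train, pi_star, learner, actions, initial_state, step, rollout,
                beta=0.1, iterations=10):
    """A Searn sketch. Assumed interfaces: a policy maps (x, state) -> action;
    initial_state(x) -> start state; step(x, s, a) -> next state or None when
    done; rollout(policy, x, y, s, a) -> expected completion loss of taking a
    in s and then following policy."""
    learned = []

    def mixture(x, state):
        # pi <- (1-beta) pi + beta h, unrolled as a stochastic choice over h's
        for h in reversed(learned):
            if random.random() < beta:
                return h(x, state)
        return pi_star(x, state)

    for _ in range(iterations):
        D = []
        for x, y in train:
            states, s = [], initial_state(x)
            while s is not None:                    # run the current policy
                states.append(s)
                s = step(x, s, mixture(x, s))
            s = random.choice(states)               # pick a position uniformly
            costs = {a: rollout(mixture, x, y, s, a) for a in actions}
            D.append((s, costs))                    # cost-sensitive example
        learned.append(learner(D))                  # h <- A(D)
    return learned    # test time: the same mixture, without falling back to pi*
```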


But what about demonstrations?

- What did we assume before?
- We can have a human (or system) demonstrate, thus giving us an optimal policy

Key Assumption: an optimal policy for the training data
  Given: input, true output, and state; Return: best successor state


3D racing game (TuxKart) [Ross+Gordon+Bagnell, AIStats11]

- Input: game screen, resized to 25x19 pixels (1425 features)
- Output: steering in [-1,1]


DAgger: Dataset Aggregation

- Collect trajectories from the expert π*
- Dataset D0 = { (s, π*(s)) | s ~ π* }
- Train π1 on D0
- Collect new trajectories from π1, but let the expert steer!
- Dataset D1 = { (s, π*(s)) | s ~ π1 }
- Train π2 on D0 ∪ D1
- In general:
  - Dn = { (s, π*(s)) | s ~ πn }
  - Train πn on ∪_{i<n} Di

If N = T log T, then L(πn) < T εN + O(1) for some n

[Ross+Gordon+Bagnell, AIStats11]
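A sketch of the DAgger loop itself (the expert, learner, and trajectory collection are assumed interfaces):

```python
def dagger(expert, learner, run_policy, n_iterations=10):
    """DAgger sketch. Assumed interfaces: expert(s) -> action;
    run_policy(policy) -> list of states visited while executing policy;
    learner(dataset) -> policy. The expert labels every visited state."""
    data, policies = [], []
    policy = expert                       # D_0 comes from the expert's own trajectories
    for _ in range(n_iterations):
        for s in run_policy(policy):
            data.append((s, expert(s)))   # D_n = { (s, pi*(s)) | s ~ pi_n }
        policy = learner(data)            # train pi_{n+1} on the union of all D_i
        policies.append(policy)
    return policies                       # pick the best iterate on validation data
```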


Experiments: Racing Game [Ross+Gordon+Bagnell, AIStats11]

- Input: game screen, resized to 25x19 pixels (1425 features); output: steering in [-1,1]


[Figure: average falls per lap; lower is better.] [Ross+Gordon+Bagnell, AIStats11]


Super Mario Bros. [Ross+Gordon+Bagnell, AIStats11]

- From the Mario AI competition 2009
- Input: 27K+ binary features extracted from the last 4 observations (14 binary features for every cell)
- Output: Jump in {0,1}, Right in {0,1}, Left in {0,1}, Speed in {0,1}


Training (expert) [Ross+Gordon+Bagnell, AIStats11]


Test-time execution (classifier) [Ross+Gordon+Bagnell, AIStats11]


Test-time execution (DAgger) [Ross+Gordon+Bagnell, AIStats11]


[Figure: average distance per stage; higher is better.] [Ross+Gordon+Bagnell, AIStats11]


Perceptron vs. LaSO vs. Searn

[Figure: search trees contrasting the incremental perceptron, LaSO, and Searn/DAgger when they hit an un-learnable decision.]


Discussion


Relationship between SP and IRL

- Formally, they're (nearly) the same problem:
  - See humans performing some task
  - Define some loss function
  - Try to mimic the humans
- The difference is in philosophy:
  - (I)RL has little notion of beam search or dynamic programming
  - SP doesn't think about separating reward estimation from solving the prediction problem
  - (I)RL has to deal with stochasticity in MDPs


Important Concepts

- Search and loss-augmented search for margin-based methods
- Bold versus local updates for approximate search
- Training on-path versus off-path
- Stochastic versus deterministic worlds
- Q-states / values
- Learning reward functions vs. matching behavior


Language comprehension

But man is not destined to vanish. He can be killed, but he cannot be destroyed, because his soul is deathless and his spirit is irrepressible. Therefore, though the situation seems dark in the context of the confrontation between the superpowers, the silver lining is provided by amazing phenomenon that the very nations which have spent incalculable resources and energy for the production of deadly weapons are desperately trying to find out how they might never be used. They threaten each other, intimidate each other and go to the brink, but before the total hour arrives they withdraw from the brink.

The main point from the author's view is:

1. Man's soul and spirit can not be destroyed by superpowers.
2. Man's destiny is not fully clear or visible.
3. Man's soul and spirit are immortal.
4. Man's safety is assured by the delicate balance of power in terms of nuclear weapons.
5. Human society will survive despite the serious threat of total annihilation.


Language and the world

[Figure: "Language" connected to "The World".]

- Cognitive scientists call this "grounding"
- "Understanding" is being able to interact with the world based on language inputs


[Figure: "Language" connected to "The World".]

...but where do we get data?


Walkthroughs

You will appear on screen, and you will see that you have no weapon in hand whatsoever! Walk north to enter the open cave above you, and an Old Man will be inside. He addresses you with the following, insightful sentence: "It is dangerous to go alone! Take this." At this point you should step forward to grasp the Wooden Sword (used with the A Button), and then head back into the overworld.


Go to the small opening in the rock face on your right to arrive at H9, where you should kill the four Red Octoroks before going through the opening on the right. H10 has 5 Blue Tektites waiting for you, so be sure to kill them off before leaving the screen through the right as these fellows are an excellent source of Rupees, Fairies, and Hearts. H11 has four Blue Tektites, and after going right to ...


Why is this so hard?

You will appear on screen, and you will see that you have no weapon in hand whatsoever!

Walk north to enter the open cave above you, and an Old Man will be inside.

He addresses you with the following, insightful sentence: "It is dangerous to go alone! Take this."

At this point you should step forward to grasp the Wooden Sword (used with the A Button), and then head back into the overworld.


Walk north to enter the cave...

[Figure: the game state encoded as a grid of tile IDs:
1111111001111111
1111213001111111
1113000001111111
1130000001111111
1300000004111111
0000000000000000
5600000000000011
1100000000000011
1100000000000011
1111111111111111
1111111111111111
with variables a=8 b=8 c=0 d=0 e=0 f=3 g=3 h=8 i=6 j=0 k=0, and candidate action sequences such as L L L U U U U / L L U L U U U / U U U L L U U / L U L U L U U / U L L U U L U.]


Hal's Wager

• Give me a structured prediction problem where:
  - Annotations are at the lexical level
  - Humans can do the annotation with reasonable agreement
  - You give me a few thousand labeled sentences

• Then I can learn reasonably well...
  - ...using one of the algorithms we talked about (see the sketch below)

• Why do I say this?
  - Lots of positive experience
  - I'm an optimist

• I want your counter-examples!
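To make the wager concrete, here is a minimal sketch (Python, illustrative only) of the kind of learner it presumes: a structured perceptron with Viterbi decoding (Collins, EMNLP 2002; in the references). The feature templates, the toy PER/LOC/O tag set, and the one-sentence training "corpus" are assumptions made up for this example, not anything from these slides.

    from collections import defaultdict

    LABELS = ["O", "PER", "LOC"]   # assumed toy tag set

    def features(words, i, label, prev_label):
        # emission + first-order transition indicator features
        return ["w=%s,y=%s" % (words[i], label),
                "prev=%s,y=%s" % (prev_label, label)]

    def score(weights, feats):
        return sum(weights[f] for f in feats)

    def viterbi(weights, words):
        # exact argmax over label sequences under first-order features
        n = len(words)
        delta = [{y: score(weights, features(words, 0, y, "<s>")) for y in LABELS}]
        back = [{}]
        for i in range(1, n):
            delta.append({}); back.append({})
            for y in LABELS:
                prev, best = max(((yp, delta[i-1][yp] +
                                   score(weights, features(words, i, y, yp)))
                                  for yp in LABELS), key=lambda t: t[1])
                delta[i][y], back[i][y] = best, prev
        y = max(delta[-1], key=delta[-1].get)
        path = [y]
        for i in range(n - 1, 0, -1):
            y = back[i][y]
            path.append(y)
        return list(reversed(path))

    def train(data, epochs=5):
        weights = defaultdict(float)
        for _ in range(epochs):
            for words, gold in data:
                pred = viterbi(weights, words)
                if pred != gold:
                    # perceptron update: reward gold features, punish predicted
                    for i in range(len(words)):
                        prev_g = gold[i-1] if i else "<s>"
                        prev_p = pred[i-1] if i else "<s>"
                        for f in features(words, i, gold[i], prev_g):
                            weights[f] += 1.0
                        for f in features(words, i, pred[i], prev_p):
                            weights[f] -= 1.0
        return weights

    data = [(["John", "lives", "in", "Paris"], ["PER", "O", "O", "LOC"])]
    w = train(data)
    print(viterbi(w, ["John", "lives", "in", "Paris"]))

With a few thousand real labeled sentences and richer feature templates, a learner of roughly this shape is what the wager bets on.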


Open problems

• How to do structured prediction when argmax is intractable...
  - Bad: simple algorithms diverge [Kulesza+Pereira, NIPS07]
  - Good: some work well [Finley+Joachims, ICML08]
  - And you can make it fast! [Meshi+al, ICML10]

• How to do structured prediction with delayed feedback (credit assignment)
  - Kinda just works sometimes [D, ICML09; Chang+al, ICML10]
  - Generic RL also works [Branavan+al, ACL09; Liang+al, ACL09]

• What role does structure actually play?
  - Little: only constrains outputs [Punyakanok+al, IJCAI05]
  - Little: only introduces non-linearities [Liang+al, ICML08]

• Role of experts? (a DAgger-style sketch follows this list)
  - What if your expert isn't actually optimal?
  - What if you have more than one expert?
  - What if you only have trajectories, not the expert?
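Most of the "role of experts" questions are probing the expert-driven iteration behind Searn/DAgger. As a point of reference, here is a minimal sketch (Python, illustrative only) of the DAgger-style loop [Ross+Gordon+Bagnell, AIStats 2011; in the references]: roll out the current policy, ask the expert what it would have done in the states actually visited, aggregate, retrain. The `expert`, `learn`, and `rollout` callables and the nearest-neighbor toy demo are assumptions for the example.

    def dagger(expert, learn, rollout, initial_data, iterations=5):
        # data holds (state, expert_action) pairs; DAgger aggregates, never discards
        data = list(initial_data)
        policy = learn(data)
        for _ in range(iterations):
            for state in rollout(policy):            # states the current policy visits
                data.append((state, expert(state)))  # ...labeled by the expert
            policy = learn(data)                     # retrain on everything so far
        return policy

    # toy demo: states are numbers, the expert labels them by sign
    import random

    def expert(s):
        return int(s > 0)

    def learn(data):
        snapshot = list(data)                        # freeze the current dataset
        return lambda s: min(snapshot, key=lambda p: abs(p[0] - s))[1]  # 1-NN

    def rollout(policy):
        # a real rollout would execute `policy`; random states keep the demo tiny
        return [random.uniform(-1, 1) for _ in range(10)]

    print(dagger(expert, learn, rollout, [(-0.5, 0), (0.5, 1)])(0.7))   # -> 1

The open questions above are exactly about when this loop breaks: a suboptimal expert poisons `data`, and trajectories without an expert give you no one to query.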


Things I have no idea how to solve...

all :: (a -> Bool) -> [a] -> Bool

Applied to a predicate and a list, returns `True' if all elements of the list satisfy the predicate, and `False' otherwise.

all p []     = True
all p (x:xs) = if p x then all p xs else False

%module main:MyPrelude
  %data main:MyPrelude.MyList aadj =
    {main:MyPrelude.Nil;
     main:MyPrelude.Cons aadj ((main:MyPrelude.MyList aadj))};
  %rec
  {main:MyPrelude.myzuall ::
     %forall tadA . (tadA -> ghczmprim:GHCziBool.Bool) ->
       (main:MyPrelude.MyList tadA) -> ghczmprim:GHCziBool.Bool =
     \ @ tadA (padk::tadA -> ghczmprim:GHCziBool.Bool)
         (dsddE::(main:MyPrelude.MyList tadA)) ->
       %case ghczmprim:GHCziBool.Bool dsddE %of
         (wildB1::(main:MyPrelude.MyList tadA))
         {main:MyPrelude.Nil -> ghczmprim:GHCziBool.True;
          main:MyPrelude.Cons (xadm::tadA) (xsadn::(main:MyPrelude.MyList tadA)) ->
            %case ghczmprim:GHCziBool.Bool (padk xadm) %of
              (wild1Xc::ghczmprim:GHCziBool.Bool)
              {ghczmprim:GHCziBool.False -> ghczmprim:GHCziBool.False;
               ghczmprim:GHCziBool.True -> main:MyPrelude.myzuall @ tadA padk xsadn}}};


Things I have no idea how to solve...

(s1) A father had a family of sons who were perpetually quarreling among themselves. (s2) When he failed to heal their disputes by his exhortations, he determined to give them a practical illustration of the evils of disunion; and for this purpose he one day told them to bring him a bundle of sticks. (s3) When they had done so, he placed the faggot into the hands of each of them in succession, and ordered them to break it in pieces. (s4) They tried with all their strength, and were not able to do it. (s5) He next opened the faggot, took the sticks separately, one by one, and again put them into his sons' hands, upon which they broke them easily. (s6) He then addressed them in these words: "My sons, if you are of one mind, and unite to assist each other, you will be as this faggot, uninjured by all the attempts of your enemies; but if you are divided among yourselves, you will be broken as easily as these sticks."


Software

• Sequence labeling
  - Mallet: http://mallet.cs.umass.edu
  - CRF++: http://crfpp.sourceforge.net

• Search-based structured prediction
  - LaSO: http://hal3.name/TagChunk
  - Searn: http://hal3.name/searn

• Higher-level "feature template" approaches
  - Alchemy: http://alchemy.cs.washington.edu
  - Factorie: http://code.google.com/p/factorie


Summary

• Structured prediction is easy if you can do argmax search (especially loss-augmented argmax; see the sketch below)

• Label-bias can kill you, so iterate (Searn/DAgger)

• Stochastic worlds are modeled by MDPs

• IRL is all about learning reward functions

• IRL has fewer assumptions:
  - More general
  - Less likely to work on easy problems

• We're a long way from a complete solution

• Hal's wager: we can learn pretty much anything
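What "loss-augmented" means, in a minimal sketch (Python, illustrative only): for Hamming loss, add each wrong label's unit cost to its score, then run the ordinary argmax (e.g., Viterbi) on the augmented scores. This is the decoding step inside max-margin training [Taskar+al; Tsochantaridis+al; both in the references]. The dense score matrix and label ids below are assumptions for the example.

    import numpy as np

    def loss_augment(unary_scores, gold):
        # unary_scores: (n_positions, n_labels); gold: gold label id per position.
        # Each wrong label's score gains +1, its contribution to the Hamming
        # loss, so the argmax favors outputs that score high AND lose big.
        aug = unary_scores + 1.0                  # +1 everywhere...
        aug[np.arange(len(gold)), gold] -= 1.0    # ...then undo it on gold labels
        return aug

    scores = np.zeros((3, 2))                     # 3 positions, 2 labels, toy numbers
    print(loss_augment(scores, [0, 1, 0]))        # gold entries 0, all others 1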

Thanks! Questions?



References

See also:
http://www.cs.utah.edu/~suresh/mediawiki/index.php/MLRG
http://braque.cc/ShowChannel?handle=P5BVAC34


Stuff we talked about explicitly

• Apprenticeship learning via inverse reinforcement learning. P. Abbeel and A. Ng. ICML 2004.
• Incremental parsing with the Perceptron algorithm. M. Collins and B. Roark. ACL 2004.
• Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. M. Collins. EMNLP 2002.
• Search-based Structured Prediction. H. Daumé III, J. Langford and D. Marcu. Machine Learning, 2009.
• Learning as Search Optimization: Approximate Large Margin Methods for Structured Prediction. H. Daumé III and D. Marcu. ICML 2005.
• An End-to-end Discriminative Approach to Machine Translation. P. Liang, A. Bouchard-Côté, D. Klein and B. Taskar. ACL 2006.
• Statistical Decision-Tree Models for Parsing. D. Magerman. ACL 1995.
• Training Parsers by Inverse Reinforcement Learning. G. Neu and Cs. Szepesvári. Machine Learning 77, 2009.
• Algorithms for inverse reinforcement learning. A. Ng and S. Russell. ICML 2000.
• (Online) Subgradient Methods for Structured Prediction. N. Ratliff, J. Bagnell and M. Zinkevich. AIStats 2007.
• Maximum margin planning. N. Ratliff, J. Bagnell and M. Zinkevich. ICML 2006.
• Learning to search: Functional gradient techniques for imitation learning. N. Ratliff, D. Silver and J. Bagnell. Autonomous Robots 27(1), 2009.
• A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. S. Ross, G. Gordon and J. Bagnell. AIStats 2011.
• Max-Margin Markov Networks. B. Taskar, C. Guestrin, V. Chatalbashev and D. Koller. JMLR 2005.
• Large Margin Methods for Structured and Interdependent Output Variables. I. Tsochantaridis, T. Joachims, T. Hofmann and Y. Altun. JMLR 2005.
• Learning Linear Ranking Functions for Beam Search with Application to Planning. Y. Xu, A. Fern and S. Yoon. JMLR 2009.
• Maximum Entropy Inverse Reinforcement Learning. B. Ziebart, A. Maas, J. Bagnell and A. Dey. AAAI 2008.


Other good stuff

• Reinforcement learning for mapping instructions to actions. S.R.K. Branavan, H. Chen, L. Zettlemoyer and R. Barzilay. ACL 2009.
• Driving semantic parsing from the world's response. J. Clarke, D. Goldwasser, M.-W. Chang and D. Roth. CoNLL 2010.
• New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron. M. Collins and N. Duffy. ACL 2002.
• Unsupervised Search-based Structured Prediction. H. Daumé III. ICML 2009.
• Training structural SVMs when exact inference is intractable. T. Finley and T. Joachims. ICML 2008.
• Structured learning with approximate inference. A. Kulesza and F. Pereira. NIPS 2007.
• Conditional random fields: Probabilistic models for segmenting and labeling sequence data. J. Lafferty, A. McCallum and F. Pereira. ICML 2001.
• Structure compilation: trading structure for features. P. Liang, H. Daumé III and D. Klein. ICML 2008.
• Learning semantic correspondences with less supervision. P. Liang, M. Jordan and D. Klein. ACL 2009.
• Generalization Bounds and Consistency for Structured Labeling. D. McAllester. In Predicting Structured Data, 2007.
• Maximum entropy Markov models for information extraction and segmentation. A. McCallum, D. Freitag and F. Pereira. ICML 2000.
• FACTORIE: Efficient Probabilistic Programming for Relational Factor Graphs via Imperative Declarations of Structure, Inference and Learning. A. McCallum, K. Rohanimanesh, M. Wick, K. Schultz and S. Singh. NIPS Workshop on Probabilistic Programming, 2008.
• Learning efficiently with approximate inference via dual losses. O. Meshi, D. Sontag, T. Jaakkola and A. Globerson. ICML 2010.
• Learning and inference over constrained output. V. Punyakanok, D. Roth, W. Yih and D. Zimak. IJCAI 2005.
• Boosting Structured Prediction for Imitation Learning. N. Ratliff, D. Bradley, J. Bagnell and J. Chestnutt. NIPS 2007.
• Efficient Reductions for Imitation Learning. S. Ross and J. Bagnell. AIStats 2010.
• Kernel Dependency Estimation. J. Weston, O. Chapelle, A. Elisseeff, B. Schölkopf and V. Vapnik. NIPS 2002.