A Whirlwind Tour of Natural Language Processing
Mark Sammons
Cognitive Computation Group, UIUC
Who Cares about NLP?
…Eddie Izzard,
that’s who…
(Those of a sensitive disposition toward explicit language should probably cover their ears…)
Remember Star Trek? HAL in 2001? The Heart of Gold in Hitch-hiker’s Guide…?
Grand vision of Artificial Intelligence: computers that actively communicate.
A substantial effort has been devoted to achieving AI. But how do we decide whether a machine is smart? IBM’s Deep Blue plays a mean game of chess…
…but is it intelligent?
Early idea of evaluation: the Turing Test. If a human can’t tell that it’s a machine…
AI philosophy: is the *appearance* of intelligent behavior the same as intelligence?
General assumption: NLP is AI-complete (a play on the concept of NP-completeness), i.e. you need Intelligence to properly solve NLP.
More Realistically… where does NLP help?
Already here:
- Context-sensitive spelling and grammar checkers in text editors
- Machine Translation, e.g. in web browsers
- Automated phone trees (by some definition of “help”)
- Web search
Under development:
- Better Machine Translation
- Better search
- Voice command, e.g. in cars
Outline
Why NLP is hard
NLP domains: Speech vs. Text
Attacking NLP problems
Linguistics: building explanatory models
Statistics: data-driven approaches
Machine Learning & NLP
NLP Problems and Solutions
Why is NLP so hard?
The mapping between Meaning and Language suffers from two core problems:
- Ambiguity: one surface form, many possible meanings
- Variability: one meaning, many possible surface forms
Variability
Example: Relation Extraction: “Works for”
Jim Carpenter works for the U.S. Government.
The American government employed Jim Carpenter.
Jim Carpenter was fired by the US Government.
Jim Carpenter worked in a number of important positions. … As a press liaison for the IRS, he made contacts in the White House.
Top Russian interior minister Yevgeny Topolov met yesterday with his US counterpart, Jim Carpenter.
Former US Secretary of Defense Jim Carpenter spoke today…
Page 7
Context Sensitive Paraphrasing [2]
He used a Phillips head to tighten the screw.
The bank owner tightened security after a spate of local crimes.
The Federal Reserve will aggressively tighten monetary policy.
…
Candidate replacements for “tighten”: Loosen, Strengthen, Step up, Toughen, Improve, Fasten, Impose, Intensify, Ease, Beef up, Simplify, Curb, Reduce
Ambiguity
Domain Size
Ideal goal: must handle all well-formed strings of text.
Problem: infinite domain.
Sequential modifiers:
I saw Martin Sheen in a movie
I saw Martin Sheen in a movie in Paris
I saw Martin Sheen in a movie in Paris in the Spring
I saw Martin Sheen in a movie in Paris in the Spring with my friend…
Unbounded relative clauses:
I saw Martin Sheen, who was with a friend I knew from high school, which was well known for its long, storied history of …, in a movie…
Outline
Why NLP is hard
NLP domains: Speech vs. Text
Attacking NLP problems
Linguistics: building explanatory models
Statistics: data-driven approaches
Machine Learning & NLP
NLP Problems and Solutions
Speech Recognition
NOT “voice recognition”
How hard can it be?
First image: “Fix the wing”. Second image: the same utterance in a noisy airport maintenance environment.
Speech Recognition – yup, it’s hard…
“Yuhgudda unnuhstahn sheeguhnuhbeeyah, yunoewaah, dissappointed.”
“You’ve got to understand she’s going to be, ah, you know, ah, disappointed.”
Difficult to recognize words and word boundaries.
Even given word boundaries, utterances are ill-formed (compared to text), and there are multiple pronunciation variations for a single word.
Hesitations, repetitions, fragmentary sentences, self-interruptions, poor word choice, sound quality…
LBJ/Mansfield audio sample
Development and Evaluation for Speech Recognition
Switchboard (and other) corpora:
- Large set of phone conversations
- Audio signals aligned with transcriptions of utterances (phone sequences)
- Dictionaries aligning words with phone-sequence equivalents
Typically, machine learning approaches are applied:
- Signal processing techniques extract features from the audio signals
- Statistical methods relate these features to particular phones, creating a model
- Analyze new signals, and use the model to identify plausible phone sequences
- Choose the most likely sequence given another statistical model (a language model), as sketched below
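To make the decoding step concrete, here is a toy sketch (my own illustration, not from the slides: the phones, frame scores, and transition probabilities are all invented) of Viterbi decoding, which picks the most likely phone sequence given per-frame acoustic scores and a bigram phone model:

```python
# Toy Viterbi decoder: choose the most likely phone sequence given
# per-frame acoustic likelihoods and bigram transition probabilities.
# All numbers here are made up for illustration.

phones = ["f", "ih", "k", "s"]

# acoustic[t][p]: P(frame t's features | phone p), from signal processing
acoustic = [
    {"f": 0.7, "ih": 0.1, "k": 0.1, "s": 0.1},
    {"f": 0.2, "ih": 0.6, "k": 0.1, "s": 0.1},
    {"f": 0.1, "ih": 0.2, "k": 0.5, "s": 0.2},
    {"f": 0.1, "ih": 0.1, "k": 0.2, "s": 0.6},
]

# trans[p1][p2]: P(next phone = p2 | current phone = p1); uniform here
trans = {p1: {p2: 0.25 for p2 in phones} for p1 in phones}

def viterbi(acoustic, trans, phones):
    # best[p] = (score of the best path ending in phone p, that path)
    best = {p: (acoustic[0][p], [p]) for p in phones}
    for frame in acoustic[1:]:
        best = {
            p: max(
                (score * trans[prev][p] * frame[p], path + [p])
                for prev, (score, path) in best.items()
            )
            for p in phones
        }
    return max(best.values())  # the highest-scoring (score, path) pair

score, path = viterbi(acoustic, trans, phones)
print(path)  # ['f', 'ih', 'k', 's']
```

A real recognizer works over HMM states and log probabilities, but the dynamic-programming pattern is the same.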
Speech Recognition System (Courtesy of ComputerWorld…)
The State of the Art in Speech-to-Text Translation
Current performance on known tasks:
- Dictation: 98% word accuracy, under very controlled circumstances
State of the art for spontaneous speech:
- News broadcast: ~90%
- Switchboard (phone conversations): ~80%
A lot of work even to get to a clean text representation of signal
Notice that I haven’t even begun to address tasks like search using this input
(Note also that there are many other research directions in speech processing – e.g. speaker identification)
What about Text?
A lot of overlap: if you can solve NLP in text, and can accurately parse speech into text, the two problems are the same.
The text domain has some nice characteristics:
- Paragraph, sentence, and word segmentation are already present
- Well-formed utterances (in many/most sub-domains)
- Little regional variation
- Most information is already in the form of text
Outline
Why NLP is hard
NLP domains: Speech vs. Text
Attacking NLP problems
Linguistics: building explanatory models
Statistics: data-driven approaches
Machine Learning & NLP
NLP Problems and Solutions
Linguistics
Linguists: meaning through structure + lexical knowledge
“Colorless green ideas sleep furiously”
Discover the rules of language (a grammar):
- Prescriptive grammar: rules describe what you shouldn’t do.
- Generative grammar: a finite set of rules that can generate all possible strings in a language, and only those strings that are valid in that language [3]
- “Generate” here means “assign a structural description to”
- Attempts to move beyond simplistic linear models, where words depend only on the preceding words
Divide and Conquer: Morphology
Consider the sub-problem of recognizing well-formed variations of words.
Popular method: Finite State Automata/Transducers
- Automaton: recognizes patterns
- Transducer: maps from an input pattern to an output pattern, e.g. to indicate whether a noun is plural
Morphology Example: plurals [5]
[FST diagram with states q0, q1, q2: arcs for regular nouns, irregular singular nouns, and irregular plural nouns; outputs mark the noun (N) and plural (+PL) features, e.g. a regular noun’s final “-s” maps to “+PL”.]
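A minimal sketch of the same idea in code (my own illustration; the word lists and feature notation are hypothetical, not a real lexicon):

```python
# Toy morphological analyzer in the spirit of the plural FST above.
# Word lists and the feature notation are hypothetical illustrations.

REGULAR_NOUNS = {"cat", "dog", "screw"}
IRREGULAR_SINGULAR = {"goose", "mouse"}
IRREGULAR_PLURAL = {"geese", "mice"}

def analyze(surface):
    """Map a surface form to a lexical form with N/+PL features, or None."""
    if surface in IRREGULAR_SINGULAR:
        return surface + " +N"
    if surface in IRREGULAR_PLURAL:
        return surface + " +N +PL"
    if surface in REGULAR_NOUNS:
        return surface + " +N"
    # The "-s : +PL" arc: strip the suffix and check the remaining stem.
    if surface.endswith("s") and surface[:-1] in REGULAR_NOUNS:
        return surface[:-1] + " +N +PL"
    return None  # rejected: not a well-formed noun in this tiny lexicon

for word in ["cats", "cat", "geese", "goose", "xyz"]:
    print(word, "->", analyze(word))
```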
Basic Generative Grammar: Context-Free Grammar
Accomplishes the goal of a finite description of infinite domain, at least for syntactic structure
Generate parse trees, decompose into constituents, infer generative rules:
S => NP VP
VP => V VP
VP => VP PP
VP => V ADJP
NP => PRO
PRO => He
V => wants
PP => to …
[4]
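For a hands-on version, here is a sketch using the NLTK toolkit (my choice of library, not something the slides specify); the toy grammar and sentence are for illustration only:

```python
# A tiny context-free grammar, parsed with NLTK's chart parser.
# Requires: pip install nltk
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    VP -> V NP | VP PP
    NP -> DT N | PRO | NP PP
    PP -> P NP
    PRO -> 'I'
    DT -> 'the'
    N -> 'man' | 'telescope'
    V -> 'saw'
    P -> 'with'
""")

parser = nltk.ChartParser(grammar)
sentence = "I saw the man with the telescope".split()

# An ambiguous sentence gets one tree per valid derivation.
for tree in parser.parse(sentence):
    print(tree)
```

The ambiguous PP attachment yields two trees, previewing the point about multiple valid parses below.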
Context-Free Grammar
Drawbacks to CFGs:
- Real natural language may not be context-free
- Hard to model some phenomena, e.g. limits on nesting:
  The cat ran away.
  The cat the dog bit ran away.
  The cat the dog the horse kicked bit ran away.
- Phenomena like agreement, morphology, and long-distance dependencies require a very complex set of rules
- What about unseen words/phrases/sentences?
- Given a sentence, there may be multiple ways to explain it:
I pointed to the man with the crutch.
That doesn’t deter Real Linguists…
A range of formalisms has been developed:
- Different ways of tackling the composition of words, phrases, and clauses
- Trade-offs between the importance of sentence structure and of individual words
- Strong emphasis on generality, particularly across languages
- Typically much more involved than the simplistic CFG in the previous example
There is ongoing work to encode a hand-written grammar of English, the English Resource Grammar:
- Uses Head-driven Phrase Structure Grammar (HPSG)
- Explains syntax via a Typed Feature Structure model
HPSG sample Feature Structure (for one word)
General Points
Much work on analyzing languages for structure
Wide range of theories; all have some descriptive power
All assume close relation between structure and meaning
We will see CFGs again later…
Outline
Why NLP is hard
NLP domains: Speech vs. Text
Attacking NLP problems: 4 research strands
Linguistics: building explanatory models
Logic: defining meaning and reasoning
Statistics: data-driven approaches
Machine Learning & NLP
NLP Problems and Solutions
Data-Driven Approaches
Consider a partially completed sentence… some words are far more likely than others to come next.
We can capture some measure of this intuitive restriction on word choice using probabilities: bigrams, trigrams, n-grams.
Effect of added complexity on storage requirements? With a 50,000-word vocabulary, 50,000^2 = 2.5 billion possible bigrams.
We can estimate these probabilities directly from a corpus (a body of text): P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
Applications: spelling checker, augmentative communication systems, speech processing…
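As a concrete sketch (the tiny “corpus” below is invented for illustration), these bigram estimates reduce to simple counting:

```python
# Estimate bigram probabilities P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
# from a toy corpus. The corpus here is invented for illustration.
from collections import Counter

corpus = "the man spoke briefly . the man climbed up the tree .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(word, prev):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p("man", "the"))    # 2/3: "the man" occurs twice, "the" three times
print(p("tree", "the"))   # 1/3
```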
N-gram model samples
The following sentences were generated using n-gram models trained on Shakespeare’s works (~885,000 words, ~29,000 types) [5]:
1-gram: Every enter now severally so, let
2-gram: What means sir. I confess she? Then all sorts, he is trim, captain.
3-gram: This shall forbid it should be branded, if renown made it empty.
4-gram: Enter Leonato’s brother Antonio, and the rest, but seek the weary beds of people sick.
N-Gram Modeling
What’s it good for?
Determine the plausibility of a new sentence:
The man spoke briefly…
The dog spoke briefly…
The spoke briefly man…
The wheel spoke briefly…
Given N-gram models of two domains, identify most likely source:
ACENOR stocks caught fire today on word of a take-over….
Teen pop sensation Tilde Greengrass roared into Austin today…
Teen Angst Poetry and Band Names…
Drawbacks: how to handle unseen sequences? One standard answer is smoothing, sketched below.
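A minimal sketch of add-one (Laplace) smoothing, one standard fix for unseen sequences (again with an invented toy corpus):

```python
# Add-one (Laplace) smoothing: every bigram gets a pseudo-count of 1, so
# unseen sequences receive a small but non-zero probability.
from collections import Counter

corpus = "the man spoke briefly . the man climbed up the tree .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)

def p_laplace(word, prev):
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

print(p_laplace("man", "the"))    # seen bigram: still fairly likely
print(p_laplace("wheel", "the"))  # unseen bigram: small, but not zero
```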
Computational Linguistics
We just used very elementary statistics to make some potentially interesting discoveries about language.
In fact, given the right resources, we can use statistics to build automated tools for linguistic analysis (see the sketch below):
- Part-of-speech tagging:
  (DT the) (NN man) (VBD climbed) (IN up) (DT the) (NN tree)
- Phrase boundary detection & phrase labeling:
  (NP the man) (VP climbed) (PP up the tree)
- Parsing…
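Off-the-shelf toolkits make these analyses easy to try. Below is a sketch using NLTK (my choice of toolkit, not one the slides prescribe); the chunk grammar is deliberately simplistic:

```python
# POS tagging and simple NP chunking with NLTK.
# Requires: pip install nltk, plus the punkt and
# averaged_perceptron_tagger resources via nltk.download().
import nltk

tokens = nltk.word_tokenize("the man climbed up the tree")
tagged = nltk.pos_tag(tokens)
print(tagged)  # [('the', 'DT'), ('man', 'NN'), ('climbed', 'VBD'), ...]

# A toy chunk grammar: an NP is an optional determiner,
# any adjectives, and a noun.
chunker = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN.*>}")
print(chunker.parse(tagged))
```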
Parsing Revisited
We saw earlier an outline of a Context-Free Grammar model of language:
S => NP VP
VP => V NP
VP => VP PP
NP => NP PP
NP => DT NN
(NP I) (VP saw) (NP the man) (PP with the telescope)
(NP I) (VP saw) (NP the man) (PP with the book)
Two valid parses for each… are they equally valid?
Probabilistic CFGs
In the n-gram modeling example, we derived probabilities from a corpus. Can we do the same for CFG rules?
- Not the same problem: for n-gram modeling, the words alone were sufficient
- We need a corpus with additional information: the parse trees
- Given such a corpus, we can use statistical analysis to derive the rules themselves, along with the relative probabilities of the rules
This pattern, applying statistical methods to a labeled data set to extract a predictive model, is common in Machine Learning (see the sketch below).
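NLTK can induce a PCFG from a treebank in a few lines. The sketch below is my illustration, assuming the small Penn Treebank sample that ships with NLTK:

```python
# Induce a probabilistic CFG from parsed sentences (a treebank).
# Requires: pip install nltk, plus nltk.download('treebank').
import nltk
from nltk.corpus import treebank

# Collect grammar productions from the trees in the corpus sample.
productions = []
for tree in treebank.parsed_sents()[:200]:
    productions.extend(tree.productions())

# Relative-frequency estimates give each rule a probability.
grammar = nltk.induce_pcfg(nltk.Nonterminal("S"), productions)
for prod in grammar.productions()[:10]:
    print(prod)  # e.g. a rule like NP -> DT NN with its probability
```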
Outline
Why NLP is hard
NLP domains: Speech vs. Text
Attacking NLP problems: 4 research strands
Linguistics: building explanatory models
Logic: defining meaning and reasoning
Statistics: data-driven approaches
Machine Learning & NLP
NLP Problems and Solutions
Machine Learning: Classification
Training examples D: (x, y), (x, y), (x, y), …
Learning algorithm: takes D and produces a classifier h: X -> Y
At prediction time: input x, output y = h(x)
Machine Learning (supervised)
Given some labeled data, and assuming some set of models, find the model that best maps each example to its label.
Statistically: represent examples using some abstraction (a set of features), and compute the relation between features and labels.
The choice of model affects the best possible performance:
- Complex model: may get better results (more expressive), but requires much more data to train (and labeled data is expensive)
- Simple model: fewer parameters, so less expressive, but easier to learn
Some examples… (one concrete sketch follows)
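As one concrete sketch (entirely my own illustration: the task framing, features, and data are invented), here is a minimal feature-based classifier in the spirit of context-sensitive spelling correction, using scikit-learn:

```python
# A minimal supervised classifier: predict "their" vs "there" from
# the surrounding words. Data and features are toy examples.
# Requires: pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each example: the context around the confused word; label: correct word.
contexts = [
    "they parked __ car outside",
    "we drove over __ yesterday",
    "__ dog barked all night",
    "put the box down over __",
]
labels = ["their", "there", "their", "there"]

# Features: bag of context words; model: a simple linear classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(contexts, labels)

# Shares context words ("they", "car") with a "their" training example.
print(model.predict(["they washed __ car"]))  # likely ['their']
```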
Outline
Why NLP is hard
NLP domains: Speech vs. Text
Attacking NLP problems: 4 research strands
Linguistics: building explanatory models
Logic: defining meaning and reasoning
Statistics: data-driven approaches
Machine Learning & NLP
NLP Problems and Solutions
NLP Problems and Solutions (focused)
- Part-of-Speech tagging
- Context-Sensitive Spelling Correction
- Named Entity Recognition
- Relation detection
- Comma Resolution
- Verb and Noun Phrase Chunking
- Prepositional Phrase Attachment
- Coreference Resolution
- Statistical Parsing
- Semantic Role Labeling
- Emotion and Subjectivity detection
Example: Named Entity Recognition
Entities are inherently ambiguous (e.g. JFK can be a location or a person, depending on the context).
Entities can appear in various forms, and can be nested.
Using lists is not sufficient: new entities are always being introduced.
A lot of Machine Learning work; significant over-fitting.
Key difficulty is adaptation to:
- New domains/corpora
- Slightly new definitions of an entity
- New languages
- New types of entities
How can we reduce the resources needed to produce a semantic categorization for a new domain, new language, or new type of entity?
[Figure: new NEs seen as more text is processed]
Grand Challenges
Machine Translation
Message Understanding (Information Extraction)
Question Answering
Information Retrieval & Data Mining
Textual Entailment
Textual Entailment
Work at the level of meaning.
Frame the task of understanding text as recognizing when two text fragments mean the same thing (one meaning ‘contains’ the other).
Dagan and Glickman (2004) [7] pose this problem as Recognizing Textual Entailment.
Now we can recast many problems in terms of TE. Does each of these texts entail the hypothesis “Jim Carpenter works for the U.S. Government.”?
- The American government employed Jim Carpenter.
- Top Russian interior minister Yevgeny Topolov met yesterday with his US counterpart, Jim Carpenter.
- Former US Secretary of Defence Jim Carpenter spoke today…
PASCAL RTE Challenges (2004-present)
Move away from the strict definition (Chierchia & McConnell-Ginet, 2001 [6]):
A text T entails a hypothesis H if H is true in every circumstance (possible world) in which T is true.
‘Applied’ definition (Dagan & Glickman, 2004 [7]):
T entails H (T ⇒ H) if humans reading T will infer that H is most likely true.
800 development pairs and 800 test pairs for each challenge.
Some Examples (2nd RTE Challenge)
1. TEXT: Reagan attended a ceremony in Washington to commemorate the landings in Normandy.
   HYPOTHESIS: Washington is located in Normandy.
   TASK: IE   ENTAILMENT: False
2. TEXT: Google files for its long awaited IPO.
   HYPOTHESIS: Google goes public.
   TASK: IR   ENTAILMENT: True
3. TEXT: … a shootout at the Guadalajara airport in May, 1993, that killed Cardinal Juan Jesus Posadas Ocampo and six others.
   HYPOTHESIS: Cardinal Juan Jesus Posadas Ocampo died in 1993.
   TASK: QA   ENTAILMENT: True
4. TEXT: The SPD got just 21.5% of the vote in the European Parliament elections, while the conservative opposition parties polled 44.5%.
   HYPOTHESIS: The SPD is defeated by the opposition parties.
   TASK: IE   ENTAILMENT: True
Incomplete List of Citations
1. Peter Bell and Simon King. Sparse Gaussian graphical models for speech recognition. In Proc. Interspeech 2007, Antwerp, Belgium, August 2007.
2. Connor & Roth. ECML 2007.
3. Chomsky, Noam (1957/2002). Syntactic Structures. Mouton de Gruyter.
4. Image courtesy of Bill Wilson, Univ. New South Wales, Australia. http://www.cse.unsw.edu.au/~billw/
5. Jurafsky and Martin. Speech and Language Processing. Prentice-Hall, 2000.
6. Chierchia & McConnell-Ginet. Meaning and Grammar: An Introduction to Semantics (rev. 2nd ed.), 2000.
7. Dagan & Glickman. Probabilistic textual entailment: Generic applied modeling of language variability. PASCAL Workshop on Text Understanding and Mining, 2004.
Some slides came from Prof. Dan Roth, University of Illinois.