1/18/08 1
CSCI 5832 Natural Language Processing
Lecture 2
Jim Martin

Today 1/17
• Wrap up last time
• Knowledge of language
• Ambiguity
• Models and algorithms
• Generative paradigm
• Finite-state methods

Course Material
• We’ll be intermingling discussions of:
  • Linguistic topics
    • E.g. morphology, syntax, discourse structure
  • Formal systems
    • E.g. regular languages, context-free grammars
  • Applications
    • E.g. machine translation, information extraction

Linguistics Topics
• Word-level processing
• Syntactic processing
• Lexical and compositional semantics
• Discourse processing

Topics: Techniques
• Finite-state methods
• Context-free methods
• Augmented grammars
  • Unification
  • Lambda calculus
• First order logic
• Probability models
• Supervised machine learning methods

Topics: Applications
• Small
  • Spelling correction
  • Hyphenation
• Medium
  • Word-sense disambiguation
  • Named entity recognition
  • Information retrieval
• Large
  • Question answering
  • Conversational agents
  • Machine translation
• Stand-alone
• Enabling applications
• Funding/Business plans

Just English?
• The examples in this class will for the most part be English
  • Only because it happens to be what I know.
• This leads to an (over?)-emphasis on certain topics (syntax) to the detriment of others (morphology) due to the properties of English
• We’ll cover other languages primarily in the context of machine translation

Commercial World
• Lots of exciting stuff going on…

Google Translate

Web Q/A

Summarization
• Current web-based Q/A is limited to returning simple fact-like (factoid) answers (names, dates, places, etc.).
• Multi-document summarization can be used to address more complex kinds of questions.
  • Circa 2002: What’s going on with the Hubble?

NewsBlaster Example
The U.S. orbiter Columbia has touched down at the Kennedy Space Center after an 11-day mission to upgrade the Hubble observatory. The astronauts on Columbia gave the space telescope new solar wings, a better central power unit and the most advanced optical camera. The astronauts added an experimental refrigeration system that will revive a disabled infrared camera. “Unbelievable that we got everything we set out to do accomplished,” shuttle commander Scott Altman said. Hubble is scheduled for one more servicing mission in 2004.

Weblog Analytics
• Text mining weblogs, discussion forums, message boards, user groups, and other forms of user-generated media.
  • Product marketing information
  • Political opinion tracking
  • Social network analysis
  • Buzz analysis (what’s hot, what topics are people talking about right now).

Web Analytics

Categories of Knowledge
• Phonology
• Morphology
• Syntax
• Semantics
• Pragmatics
• Discourse
Each kind of knowledge has associated with it an encapsulated set of processes that make use of it. Interfaces are defined that allow the various levels to communicate. This usually leads to a pipeline architecture.

Ambiguity
• I made her duck
• Sources
  • Lexical (syntactic)
    • Part of speech
    • Subcat
  • Lexical (semantic)
  • Syntactic
    • Different parses

Dealing with Ambiguity
• Four possible approaches:
  1. Tightly coupled interaction among processing levels; knowledge from other levels can help decide among choices at ambiguous levels.
  2. Pipeline processing that ignores ambiguity as it occurs and hopes that other levels can eliminate incorrect structures.
  3. Probabilistic approaches based on making the most likely choices
  4. Don’t do anything, maybe it won’t matter
     • We’ll leave when the duck is ready to eat.
     • The duck is ready to eat now.
     • Does the ambiguity matter?

Models and Algorithms
• By models I mean the formalisms that are used to capture the various kinds of linguistic knowledge we need.
• Algorithms are then used to manipulate the knowledge representations needed to tackle the task at hand.

Models
• State machines
• Rule-based approaches
• Logical formalisms
• Probabilistic models

Algorithms
• Many of the algorithms that we’ll study will turn out to be transducers: algorithms that take one kind of structure as input and output another.
• Unfortunately, ambiguity makes this process difficult. This leads us to employ algorithms that are designed to handle ambiguity of various kinds.

Paradigms
• In particular...
  • State-space search
    • To manage the problem of making choices during processing when we lack the information needed to make the right choice
  • Dynamic programming
    • To avoid having to redo work during the course of a state-space search
    • CKY, Earley, Minimum Edit Distance, Viterbi, Baum-Welch
  • Classifiers
    • Machine learning based classifiers that are trained to make decisions based on features extracted from the local context

State Space Search
• States represent pairings of partially processed inputs with partially constructed representations.
• Goals are inputs paired with completed representations that satisfy some criteria.
• As with most interesting problems the spaces are normally too large to exhaustively explore.
  • We need heuristics to guide the search
  • Criteria to trim the space

Dynamic Programming
• Don’t do the same work over and over.
• Avoid this by building and making use of solutions to sub-problems that must be invariant across all parts of the space.
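
As one illustration of reusing sub-problem solutions, here is a minimal sketch of minimum edit distance, one of the dynamic programming algorithms named on the Paradigms slide. It assumes unit costs for insertion, deletion and substitution; other cost schemes exist, and the function name is only illustrative.

```python
def min_edit_distance(source: str, target: str) -> int:
    """Edit distance with unit costs (an assumption of this sketch)."""
    n, m = len(source), len(target)
    # dist[i][j] holds the cheapest way to turn source[:i] into target[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete every character of source[:i]
    for j in range(1, m + 1):
        dist[0][j] = j  # insert every character of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            # Each cell reuses three already-solved sub-problems.
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # substitution or match
            )
    return dist[n][m]


print(min_edit_distance("kitten", "sitting"))
```

Each table cell is filled once and then looked up, which is the "don't do the same work over and over" idea in concrete form.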

Administrative Stuff
• Mailing list
  • If you’re registered you’re on it with your CU account
  • I sent out mail this morning. Check to see if you’ve received it
• The textbook is now in the bookstore

First Assignment
• Two parts
  1. Answer the following question: How many words do you know?
  2. Write a Python program that takes a newspaper article (plain text that I will provide) and returns the number of:
     • Words
     • Sentences
     • Paragraphs

First Assignment Details
• For the first part I want…
  • An actual number and an explanation of how you arrived at the answer
  • Hardcopy. Bring to class.
• For the second part, email me your code and your answers to the test text that I will send out shortly before the HW is due.

First Assignment
• In doing this assignment you should think ahead… having access to the words, sentences and paragraphs will be useful in future assignments.
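
A deliberately naive starting point for the counting task might look like the sketch below. The definitions used here are assumptions, not the assignment's: a paragraph is text separated by a blank line, a sentence ends in `.`, `!` or `?`, and a word is a run of letters or digits with optional internal apostrophes or hyphens. Abbreviations such as "U.S." will be miscounted, which is part of what makes the task interesting.

```python
import re


def count_units(text: str) -> dict:
    """Return naive counts of words, sentences and paragraphs."""
    # Paragraphs: chunks separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentences: any run of text ending in ., ! or ? (no abbreviation handling).
    sentences = re.findall(r"[^.!?]+[.!?]+", text)
    # Words: alphanumeric runs, allowing internal apostrophes and hyphens.
    words = re.findall(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*", text)
    return {
        "words": len(words),
        "sentences": len(sentences),
        "paragraphs": len(paragraphs),
    }


sample = "The duck is ready. Is it?\n\nYes, it's ready now!"
print(count_units(sample))
```

Returning the counts from one function that also has the underlying lists in hand makes it easy to expose the words, sentences and paragraphs themselves later.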

Getting Going
• The next two lectures will cover material from Chapters 2 and 3
  • Finite state automata
  • Finite state transducers
  • English morphology

Regular Expressions and Text Searching
• Everybody does it
  • Emacs, vi, perl, grep, etc.
• Regular expressions are a compact textual representation of a set of strings representing a language.

Example
• Find me all instances of the word “the” in a text.
  • /the/
  • /[tT]he/
  • /\b[tT]he\b/
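
The three patterns can be tried out with Python's standard `re` module. The sample sentence below is made up for illustration; the patterns are the ones from the slide.

```python
import re

# Hypothetical sample text containing "The", "the" and words that embed "the".
text = "The other day, then there was the cat. the end"

patterns = [r"the", r"[tT]he", r"\b[tT]he\b"]

# Map each pattern to the list of substrings it matches in the sample.
matches = {pattern: re.findall(pattern, text) for pattern in patterns}

for pattern, found in matches.items():
    print(f"/{pattern}/ matched {len(found)} time(s): {found}")
```

The first pattern misses "The" and hits "other", "then" and "there"; the second picks up "The" but keeps the spurious hits; the third uses word boundaries to keep only the whole word.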

Errors
• The process we just went through was based on fixing two kinds of errors
  • Matching strings that we should not have matched (there, then, other)
    • False positives (Type I)
  • Not matching things that we should have matched (The)
    • False negatives (Type II)

Errors
• We’ll be telling the same story for many tasks, all semester. Reducing the error rate for an application often involves two antagonistic efforts:
  • Increasing accuracy, or precision (minimizing false positives)
  • Increasing coverage, or recall (minimizing false negatives).
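
Using the standard definitions, precision is the fraction of reported matches that are correct and recall is the fraction of correct items that were reported. The sketch below computes both from raw counts; the example counts are hypothetical, chosen to resemble a bare /the/ pattern that finds 2 real instances, 3 spurious ones, and misses 1.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple:
    """Return (precision, recall); either is 0.0 when its denominator is 0."""
    reported = true_pos + false_pos   # everything the matcher returned
    relevant = true_pos + false_neg   # everything it should have returned
    precision = true_pos / reported if reported else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall


precision, recall = precision_recall(true_pos=2, false_pos=3, false_neg=1)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Tightening a pattern tends to raise precision at the cost of recall, and loosening it does the reverse, which is why the two efforts pull against each other.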

Finite State Automata
• Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata.
• FSAs and their probabilistic relatives are at the core of what we’ll be doing all semester.
• They also conveniently (?) correspond to exactly what linguists say we need for morphology and parts of syntax.
  • Coincidence?

FSAs as Graphs
• Let’s start with the sheep language from the text
  • /baa+!/

Sheep FSA
• We can say the following things about this machine
  • It has 5 states
  • b, a, and ! are in its alphabet
  • q0 is the start state
  • q4 is an accept state
  • It has 5 transitions

But note
• There are other machines that correspond to this same language
• More on this one later

More Formally
• You can specify an FSA by enumerating the following things.
  • The set of states: Q
  • A finite alphabet: Σ
  • A start state
  • A set of accept/final states
  • A transition function that maps Q×Σ to Q
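
The five components can be written down directly as a small data structure. The sketch below encodes a deterministic machine for /baa+!/ with 5 states and 5 transitions; since the slide's diagram is not reproduced in this text, the exact arcs are an assumption, and the class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class FSA:
    states: frozenset    # Q
    alphabet: frozenset  # Σ
    start: int           # start state
    accept: frozenset    # accept/final states
    delta: dict          # transition function: (state, symbol) -> state


# Assumed deterministic sheep machine for /baa+!/.
SHEEP = FSA(
    states=frozenset(range(5)),
    alphabet=frozenset("ba!"),
    start=0,
    accept=frozenset({4}),
    delta={
        (0, "b"): 1,
        (1, "a"): 2,
        (2, "a"): 3,
        (3, "a"): 3,  # loop that allows any number of extra a's
        (3, "!"): 4,
    },
)

print(len(SHEEP.states), "states,", len(SHEEP.delta), "transitions")
```

Here `delta` is a partial function: missing (state, symbol) pairs simply mean there is no legal move.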

About Alphabets
• Don’t take that word too narrowly; it just means we need a finite set of symbols in the input.
• These symbols can and will stand for bigger objects that can have internal structure.

Dollars and Cents

Yet Another View
• The guts of FSAs are ultimately represented as tables

State   b     a     !     ε
0       1     -     -     -
1       -     2     -     -
2       -     2,3   -     -
3       -     -     4     -
4       -     -     -     -

Recognition
• Recognition is the process of determining if a string should be accepted by a machine
• Or… it’s the process of determining if a string is in the language we’re defining with the machine
• Or… it’s the process of determining if a regular expression matches a string
• Those all amount to the same thing in the end

Recognition
• Traditionally (Turing’s idea), this process is depicted with a tape.

Recognition
• Simply a process of starting in the start state
• Examining the current input
• Consulting the table
• Going to a new state and updating the tape pointer.
• Until you run out of tape.

D-Recognize
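
The slide's figure is not reproduced in this text, so the following is only a minimal sketch of a table-driven deterministic recognizer in the spirit of the steps listed above. The table is an assumed deterministic machine for /baa+!/, and the function and variable names are illustrative.

```python
# Assumed deterministic table for /baa+!/: state -> {symbol: next state}.
SHEEP_TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}


def d_recognize(tape, table, start=0, accept=frozenset({4})):
    """Return True if the machine described by `table` accepts `tape`."""
    state = start
    for symbol in tape:                       # advance the tape pointer
        next_state = table[state].get(symbol)  # consult the table
        if next_state is None:                # no legal move: reject
            return False
        state = next_state
    # Out of tape: accept only if we stopped in an accept state.
    return state in accept


for candidate in ["baa!", "baaaa!", "ba!", "baa"]:
    print(candidate, d_recognize(candidate, SHEEP_TABLE))
```

The recognizer itself knows nothing about sheep talk; swapping in a different table gives a recognizer for a different language.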

Key Points
• Deterministic means that at each point in processing there is always one unique thing to do (no choices).
• D-recognize is a simple table-driven interpreter
• The algorithm is universal for all unambiguous regular languages.
  • To change the machine, you just change the table.

Key Points
• Crudely therefore… matching strings with regular expressions (à la Perl, grep, etc.) is a matter of
  • translating the regular expression into a machine (a table) and
  • passing the table to an interpreter

Recognition as Search
• You can view this algorithm as a trivial kind of state-space search.
• States are pairings of tape positions and state numbers.
• Operators are compiled into the table
• Goal state is a pairing with the end of tape position and a final accept state
• It’s trivial because?

Generative Formalisms
• Formal Languages are sets of strings composed of symbols from a finite set of symbols.
• Finite-state automata define formal languages (without having to enumerate all the strings in the language)
• The term Generative is based on the view that you can run the machine as a generator to get strings from the language.

Generative Formalisms
• FSAs can be viewed from two perspectives:
  • Acceptors that can tell you if a string is in the language
  • Generators to produce all and only the strings in the language
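
To make the generator view concrete, the sketch below walks an FSA's transition table breadth-first and emits every accepted string up to a length bound (the bound is needed because the language is infinite). The table is an assumed deterministic machine for /baa+!/; names are illustrative.

```python
from collections import deque

# Assumed deterministic table for /baa+!/: state -> {symbol: next state}.
SHEEP_TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}


def generate(table, start, accept, max_len):
    """Return all strings of length <= max_len that the machine accepts."""
    results = []
    queue = deque([(start, "")])  # (current state, string emitted so far)
    while queue:
        state, prefix = queue.popleft()
        if state in accept:
            results.append(prefix)
        if len(prefix) < max_len:
            # Follow every outgoing arc, emitting its symbol.
            for symbol, next_state in sorted(table[state].items()):
                queue.append((next_state, prefix + symbol))
    return results


print(generate(SHEEP_TABLE, start=0, accept={4}, max_len=6))
```

The same table serves both perspectives: read symbols off a tape to accept, or write symbols out while following arcs to generate.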

Non-Determinism

Non-Determinism cont.
• Yet another technique
  • Epsilon transitions
  • Key point: these transitions do not examine or advance the tape during recognition
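
Because epsilon transitions consume no input, a recognizer has to treat every state reachable through them as "also current". A common way to express this is an epsilon closure, sketched below. The machine is a hypothetical sheep variant with an epsilon arc from state 3 back to state 2, and the empty string is used as an assumed marker for epsilon.

```python
EPSILON = ""  # assumed marker for an epsilon arc in this sketch

# Hypothetical machine: (state, symbol) -> set of next states.
DELTA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {3},
    (3, "!"): {4},
    (3, EPSILON): {2},  # jump back without reading any input
}


def epsilon_closure(states, delta):
    """Return all states reachable from `states` using only epsilon arcs."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for next_state in delta.get((state, EPSILON), set()):
            if next_state not in closure:
                closure.add(next_state)
                stack.append(next_state)
    return frozenset(closure)


print(sorted(epsilon_closure({3}, DELTA)))
```

Note that the tape position never appears in this function, which mirrors the key point that epsilon moves neither examine nor advance the tape.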

Equivalence
• Non-deterministic machines can be converted to deterministic ones with a fairly simple construction
• That means that they have the same power; non-deterministic machines are not more powerful than deterministic ones in terms of the languages they can accept
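
The slide does not name the construction; one standard choice is the subset construction, where each deterministic state stands for a set of non-deterministic states. The sketch below assumes a machine without epsilon arcs (the non-deterministic sheep machine, with state 2 going to either 2 or 3 on `a`); all names are illustrative.

```python
# Assumed non-deterministic sheep machine: (state, symbol) -> set of states.
NFA_DELTA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}


def determinize(nfa_delta, alphabet, start, accept):
    """Subset construction for an NFA with no epsilon arcs."""
    start_set = frozenset({start})
    dfa_delta, dfa_accept = {}, set()
    agenda, seen = [start_set], {start_set}
    while agenda:
        current = agenda.pop()
        if current & accept:  # any member accepting makes the set accepting
            dfa_accept.add(current)
        for symbol in alphabet:
            target = frozenset(
                t for s in current for t in nfa_delta.get((s, symbol), ())
            )
            if not target:
                continue
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                agenda.append(target)
    return dfa_delta, start_set, dfa_accept


def dfa_accepts(tape, dfa_delta, start_set, dfa_accept):
    """Run the determinized machine on `tape`."""
    current = start_set
    for symbol in tape:
        current = dfa_delta.get((current, symbol))
        if current is None:
            return False
    return current in dfa_accept


DFA_DELTA, DFA_START, DFA_ACCEPT = determinize(NFA_DELTA, "ba!", 0, {4})
print(dfa_accepts("baaa!", DFA_DELTA, DFA_START, DFA_ACCEPT))
```

In the worst case the number of subsets can grow exponentially, which is one reason the search-based alternative is also used in practice.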

ND Recognition
• Two basic approaches (used in all major implementations of Regular Expressions)
  1. Either take a ND machine and convert it to a D machine and then do recognition with that.
  2. Or explicitly manage the process of recognition as a state-space search (leaving the machine as is).

Implementations

Non-Deterministic Recognition: Search
• In a ND FSA there exists at least one path through the machine for a string that is in the language defined by the machine.
• But not all paths directed through the machine for an accept string lead to an accept state.
• No paths through the machine lead to an accept state for a string not in the language.

Non-Deterministic Recognition
• So success in a non-deterministic recognition occurs when a path is found through the machine that ends in an accept.
• Failure occurs when all of the possible paths lead to failure.

Example
b a a a !
q0 q1 q2 q2 q3 q4

Key Points
• States in the search space are pairings of tape positions and states in the machine.
• By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input.
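
These two points can be sketched as an agenda-based recognizer: each search state is a (machine state, tape position) pair, and the agenda holds the pairs not yet explored. The machine below is the assumed non-deterministic sheep machine (state 2 goes to 2 or 3 on `a`), the empty string is an assumed marker for epsilon arcs, and the depth-first order is just one possible choice.

```python
# Assumed non-deterministic sheep machine: (state, symbol) -> set of states.
NFA_DELTA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}


def nd_recognize(tape, delta, start, accept):
    """Search over (machine state, tape position) pairs using an agenda."""
    agenda = [(start, 0)]  # unexplored search states
    seen = set()           # avoids re-exploring a pair (and epsilon loops)
    while agenda:
        state, pos = agenda.pop()  # stack discipline: depth-first
        if (state, pos) in seen:
            continue
        seen.add((state, pos))
        if pos == len(tape) and state in accept:
            return True            # one successful path is enough
        for next_state in delta.get((state, ""), ()):  # epsilon arcs
            agenda.append((next_state, pos))
        if pos < len(tape):
            for next_state in delta.get((state, tape[pos]), ()):
                agenda.append((next_state, pos + 1))
    return False                   # every path led to failure


for candidate in ["baaa!", "baa!", "ba!", "baa"]:
    print(candidate, nd_recognize(candidate, NFA_DELTA, 0, {4}))
```

Replacing the stack with a queue would turn the same code into a breadth-first search without changing which strings are accepted.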

Next Time
• Finish reading Chapter 2, start on 3.
  • Make sure you have the book
• Make sure you have access to Python