Lecture 1 Overview. CSCE 771 Natural Language Processing. Topics: Overview. Readings: Chapters 1, 2. January 14, 2013. Overview: Pragmatic issues; Course Plans; Foundation for research; Today: Challenge of 2001’s HAL, Areas of Research, Examples of Language Processing.
Lecture 1 Overview
Topics: Overview
Readings: Chapters 1, 2
January 14, 2013
CSCE 771 Natural Language Processing
– 2 – CSCE 771 Spring 2013
Overview
• Pragmatic issues
• Course Plans
  • Foundation for research
• Today
  • Challenge of 2001’s HAL
  • Areas of Research
  • Examples of Language Processing
NLP: Why Should You Care?
Two trends:
1. An enormous amount of knowledge is now available in machine-readable form as natural language text.
2. Conversational agents are becoming an important form of human-computer communication.
Much of human-human communication is now mediated by computers.
Slide from: Speech and Language Processing, Jurafsky and Martin
Commercial World
Lots of exciting stuff going on…
• Powerset
Slide from: Speech and Language Processing Jurafsky and Martin
Google Translate
Slide from: Speech and Language Processing Jurafsky and Martin
Web Q/A
Slide from: Speech and Language Processing Jurafsky and Martin
HAL 9000 of 2001: A Space Odyssey
A scene from Arthur Clarke and Stanley Kubrick’s 2001:
DAVE: Open the pod bay doors, HAL.
HAL: I’m sorry Dave, I’m afraid I can’t do that.
Notes on Context:
• HAL is the main computer on the spaceship
• HAL is paranoid and decides to kill off the crew
Clarke a Little Too Optimistic
• We don’t have a HAL today in 2009.
• How close are we?
  • Computers replaced bank tellers (in many instances)
  • But the NASA computers don’t talk yet
  • Microsoft XP/Vista’s voice commands
  • Adobe Reader reading PDF documents
• But can they understand spoken commands?
Challenges in developing HAL
So what are the major challenges in developing HAL?
• Speech recognition
• Natural language understanding
• Information retrieval
• Information extraction
• Inference
• Speech generation
Samples of Language Processing
Text processing (in Unix)
• wc – word count
• grep regexpr files – print lines in the files that match re
• find
More knowledgeable processing
• spelling checking/correcting
• grammar checking
Information retrieval
• Find all documents on decomposition by David Parnas
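To make the two Unix tools concrete, here is a minimal Python 3 sketch of what wc and grep compute. The function names wc and grep and the sample text are illustrative assumptions; the real tools also handle files, bytes, and many options that this sketch ignores.

```python
import re

def wc(text):
    """Return (lines, words, characters), the three counts wc reports."""
    return text.count("\n"), len(text.split()), len(text)

def grep(pattern, lines):
    """Return the lines that contain a match for the regular expression."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Hypothetical sample input, not taken from the slides.
sample = "Unix is a trademark\nI like unix tools\nNothing to see here\n"

print(wc(sample))
print(grep(r"[uU]nix", sample.splitlines()))
```

Note that grep uses search rather than match, so the pattern may occur anywhere in the line.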
Even More Knowledgeable Processing
Information extraction
• Reading the “online” Wall Street Journal
  • What was the dividend paid by GM last year?
• USC Handbook
  • How many hours does it take to get a PhD in CSE?
Machine translation
• The spirit is willing but the body is weak.
• To Russian: Sprit охотно готово но тело слабо.
• Back to English: Vodka is good but the meat is rotten. (Rich 86)
• Babelfish - http://world.altavista.com/tr
• Back to English: Sprit is willingly prepared but body weakly.
Even Deeper Understanding
Email access over the phone
• Respond to commands: “list all emails from Bob”
• Read email message 8
• Text to speech
Assistants
• Agents reading the net, summarizing a topic
Subcategories of Knowledge in S&L
• Phonetics/phonology
• Morphology – shape and behavior of words in contexts
• Syntax – the legitimate sequences of words
• Semantics – the meanings of words, phrases, sentences and documents
• Pragmatics – the appropriate use of language – politeness, direct/indirectness
• Discourse conventions – correctly structuring conversations
Ambiguity: I made her duck.
1. …
2. …
3. …
4. …
5. …
Word Ambiguity
• Her – who is this?
• Made
  • Verb with meanings: 1) create 2) cook 3) force
• Duck
  • Noun: the waterfowl, the food
  • Verb
• So how do we resolve this sentence?
Turing Test
• Can a computer simulate intelligence?
http://en.wikipedia.org/wiki/Turing_test
The Chinese Room
• John Searle's 1980 paper Minds, Brains, and Programs proposed an argument against the Turing Test known as the "Chinese room" thought experiment.
• Searle argued that software (such as ELIZA) could pass the Turing Test simply by manipulating symbols of which it had no understanding.
• Without understanding, such programs could not be described as "thinking" in the same sense people do.
• Loebner Prize – competition held since 1991 for the best attempt at passing the Turing Test
http://en.wikipedia.org/wiki/Turing_test
Loebner Prize
The prizes for each year include:
• $2,000 for the most human-seeming of all bots for that year (awarded every year)
• $25,000 for the first bot that judges cannot distinguish from a real human in a text-only Turing Test (awarded once only)
• $100,000 to the first bot that judges cannot distinguish from a real human in a Turing Test that includes deciphering and understanding text, visual, auditory (and tactile?) input.
http://en.wikipedia.org/wiki/Loebner_prize
http://www.loebner.net/Prizef/loebner-prize.html
Finite Automata arose in the 1950’s
• 1936: Turing’s model of algorithmic computation
• 1943: McCulloch-Pitts model of the neuron
• 1951, 1956: Kleene first introduced finite automata and regular expressions
• 1959: Rabin and Scott, nondeterministic finite automata
• 1968: Thompson was the first to compile regular expressions into an editor for text searching
Key Concepts #1 Formal Language
• A formal language is a set of (finite) strings over a finite alphabet.
• Key Concept #1: A model that can both recognize and generate all and only the strings of a formal language acts as a definition of the language.
• L(re) = L(M_nfa)
• Formal languages are not the same as natural languages.
• Linguists are generally more interested in generative grammars; computer scientists are more interested in recognizing.
Formal Languages
• Alphabet: Σ (finite set of symbols)
• Strings: s = c1c2 … cn (finite sequence of characters); length |s| = n
• Language: a language is a set of strings
• Example languages over Σ = {a, b, c}
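As a small illustration of these definitions, the Python 3 sketch below enumerates strings over Σ = {a, b, c} and builds one example language as a set. The helper strings_up_to and the choice of language (strings containing no 'c') are assumptions made for this example, since the slide does not list its example languages.

```python
from itertools import product

SIGMA = ("a", "b", "c")  # the alphabet from the slide

def strings_up_to(alphabet, max_len):
    """Yield every string over the alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

# Hypothetical example language: strings (of length <= 2) with no 'c'.
no_c = {s for s in strings_up_to(SIGMA, 2) if "c" not in s}

print(sorted(no_c, key=lambda s: (len(s), s)))
```

The full language is infinite, so the sketch only lists its members up to a length bound.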
Regular Expressions
Regular Expression Examples
Finite Automata to recognize a Language
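The slide's diagram is not recoverable, so the following Python 3 sketch shows one possible finite automaton rather than the one from the lecture. It assumes a hypothetical DFA over {a, b} that accepts strings with an even number of a's, and compares it with a regular expression for the same language, in the spirit of L(re) = L(M).

```python
import re
from itertools import product

# Transition table of a hypothetical DFA over {a, b}:
# "even" = even number of a's read so far, "odd" = odd number.
DELTA = {
    ("even", "a"): "odd",
    ("even", "b"): "even",
    ("odd", "a"): "even",
    ("odd", "b"): "odd",
}
START, ACCEPTING = "even", {"even"}

def accepts(string):
    """Run the DFA on the string and report whether it ends in an accepting state."""
    state = START
    for symbol in string:
        if (state, symbol) not in DELTA:
            return False  # symbol outside the alphabet
        state = DELTA[(state, symbol)]
    return state in ACCEPTING

# A regular expression intended to describe the same language.
EVEN_AS = re.compile(r"(b*ab*a)*b*")

# Check that recognizer and expression agree on all strings up to length 4.
for n in range(5):
    for chars in product("ab", repeat=n):
        s = "".join(chars)
        assert accepts(s) == bool(EVEN_AS.fullmatch(s))

print(accepts("abba"), accepts("ab"))
```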
CSCE 531 – Overview in one slide
% flex lang.l                   // produces lex.yy.c
% bison lang.y                  // produces lang.c
% gcc lex.yy.c lang.c -o parse
% parse input
Diagram: lang.l → FLEX → lex.yy.c (yylex()); lang.y → BISON → lang.c (yyparse()); both are compiled into the executable program, which reads the input source program.
Regular Expressions in Unix tools
• Ken Thompson: regular expressions in ed → ex → vi
  • Reg-expr → NFA, then simulate
  • Global pattern match command:
    g/Unix/s/Unix/UNIX/g
    g/re/print == grep
Grep family
• Global match Regular Expression and Print (GREP)
  • grep [uU]nix f1 f2 … fn
  • egrep pat files // efficient: NFA → DFA, then execute
  • fgrep pat files // fixed grep, for fixed strings
• Find, for searching directories (not really reg expr)
  • find dir -name pat // search for files with name matching pat
  • find dir -exec grep pat {} \; // search in files for the pattern pat
Editing scripts
Create a script of editing commands, then execute with
ex file1 < edScript
Example:
1,$s/[uU]nix/UNIX/g
1,$s/langauge/language/g
g/^$/d    // delete empty lines; ^ = start of line, $ = end
…
w
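For comparison, the same three edits can be sketched in Python 3 with the re module. The function name apply_edit_script and the sample text are assumptions for illustration; this is a rough equivalent, not a translation of ex semantics.

```python
import re

def apply_edit_script(text):
    """Apply the three edits from the example script to a block of text."""
    lines = text.split("\n")
    # 1,$s/[uU]nix/UNIX/g : replace every unix/Unix on every line
    lines = [re.sub(r"[uU]nix", "UNIX", line) for line in lines]
    # 1,$s/langauge/language/g : fix a common misspelling
    lines = [line.replace("langauge", "language") for line in lines]
    # g/^$/d : delete empty lines
    lines = [line for line in lines if line != ""]
    return "\n".join(lines)

# Hypothetical input, not taken from the slides.
before = "unix is fun\n\nthe C langauge grew up on Unix"
print(apply_edit_script(before))
```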
Other Unix regular expression based tools
• sed (stream editor)
• awk
• Perl – scripting language
• Python
• Ruby
• regcomp, regexec in C
Python String constants
http://docs.python.org/2/library/stdtypes.html
• string.ascii_letters
• string.ascii_lowercase
• string.ascii_uppercase
• string.digits – the string '0123456789'.
• string.hexdigits – the string '0123456789abcdefABCDEF'.
• string.letters – the specific value is updated when locale.setlocale() is called.
• string.lowercase
• string.octdigits – the string '01234567'.
• string.punctuation – string of ASCII characters which are considered punctuation
• string.printable
• string.uppercase
• string.whitespace
String Method Examples
s = "i think 771 is going great!"
print s.capitalize()

# center(width[, fillchar])
print ':' + s.center(44, '.') + ':'

# count(sub[, start[, end]])
print s.count("in")
print s.count("in", 13)
print s.count("in", 3)
print s.count("in", 13, 22)
print s.count("in", 13, 15)

# decode([encoding[, errors]])
# encode([encoding[, errors]])
# endswith(suffix[, start[, end]])
• expandtabs([tabsize])
• find(sub[, start[, end]])
• index(sub[, start[, end]]) – like find(), but raises ValueError when the substring is not found.
• isalnum()
• isalpha()
• isdigit()
• rpartition(sep)
• rsplit([sep[, maxsplit]])
• rstrip([chars])
• split([sep[, maxsplit]])
• splitlines([keepends])
• startswith(prefix[, start[, end]])
• strip([chars])
• swapcase()
• title()
• translate(table[, deletechars])
• upper()
• zfill(width)
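A few of the methods listed above, tried on the same sample string. This sketch is written for Python 3 (print as a function), whereas the slides use Python 2; the signatures shown on the slides, such as translate(table[, deletechars]), are the Python 2 ones and are not exercised here.

```python
s = "i think 771 is going great!"

print(s.find("in"))    # index of the first "in"
print(s.find("xyz"))   # find() returns -1 when the substring is absent

try:
    s.index("xyz")     # index() raises ValueError instead
except ValueError as err:
    print("index() raised ValueError:", err)

print(s.split())            # split on runs of whitespace
print(s.rsplit(" ", 1))     # split once, starting from the right
print(s.rpartition(" "))    # (head, separator, tail) around the last space
print("  padded\t\n".strip())
print("42".zfill(5))
print(s.title())
print(s.startswith("i think"), "771".isdigit(), "abc771".isalnum())
```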
Python re — Regular expressions
• http://docs.python.org/library/re.html
• re – Regular expression module
• Operators (special characters)
• Lookahead / lookbehind
• Search vs match
• re module contents
Python Regular Expressions
http://docs.python.org/2/library/re.html
Fundamental Re Operators in Python
RegExpr    Matches
c          the single character c
A | B      either re A or re B
AB         re A followed by re B
A*         0 or more repetitions of the re A
( A )      re A, i.e. the re inside the parentheses
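The table can be checked directly with re.fullmatch, which succeeds only if the whole string matches. The patterns and test strings below are illustrative choices for this Python 3 sketch, not examples from the lecture.

```python
import re

# Each entry: (pattern, a string that should match, a string that should not).
examples = [
    ("c",     "c",    "d"),    # single character
    ("a|b",   "b",    "ab"),   # alternation: A | B
    ("ab",    "ab",   "a"),    # concatenation: AB
    ("a*",    "aaa",  "aab"),  # star: zero or more repetitions
    ("(ab)*", "abab", "aba"),  # parentheses group a sub-expression
]

for pattern, good, bad in examples:
    assert re.fullmatch(pattern, good) is not None
    assert re.fullmatch(pattern, bad) is None
    print(f"{pattern!r:9} matches {good!r:7} but not {bad!r}")

# Zero repetitions also count, so a* matches the empty string.
assert re.fullmatch("a*", "") is not None
```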
Other Operators in Python
RegExpr    Matches
.          (Dot.) In the default mode, this matches any character except a newline. …
A+
A?
A{m}
A{m,n}
\c         quoted character
[chars]    character class
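The slide leaves several cells of this table blank. The Python 3 sketch below exercises each operator, with the standard meanings from the Python re documentation given in the comments; the patterns and strings are assumptions chosen for illustration.

```python
import re

# Each entry: (pattern, string, whether the whole string should match).
checks = [
    (r"ab+",     "abbb",  True),   # +     : one or more repetitions
    (r"ab+",     "a",     False),
    (r"colou?r", "color", True),   # ?     : zero or one repetition
    (r"a{3}",    "aaa",   True),   # {m}   : exactly m repetitions
    (r"a{3}",    "aa",    False),
    (r"a{2,4}",  "aaaa",  True),   # {m,n} : from m to n repetitions
    (r"a{2,4}",  "aaaaa", False),
    (r"a.c",     "abc",   True),   # .     : any character except a newline
    (r"a.c",     "a\nc",  False),
    (r"1\+1",    "1+1",   True),   # \c    : quoted (escaped) special character
    (r"[uU]nix", "Unix",  True),   # [...] : character class
]

results = [(bool(re.fullmatch(p, s)), expected) for p, s, expected in checks]
for (pattern, string, _), (got, _) in zip(checks, results):
    print(f"{pattern!r:12} on {string!r:9} -> {got}")
```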
Greedy Operators in Python
Non Greedy Operators in Python
RegExpr    Matches
A*?
A+?
A??
A{m,n}?
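The difference between the greedy and non-greedy forms is easiest to see side by side. In this Python 3 sketch the sample text is a made-up example; the greedy operators take as much text as they can, while the ? suffix makes them take as little as possible.

```python
import re

# Hypothetical sample text for illustration.
html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.*>", html)   # .* runs to the last '>' on the line
lazy = re.findall(r"<.*?>", html)    # .*? stops at the first '>' it can

print(greedy)
print(lazy)

# The same contrast for the other repetition operators.
print(re.match(r"a{2,4}", "aaaa").group())    # as many as allowed
print(re.match(r"a{2,4}?", "aaaa").group())   # as few as allowed
print(repr(re.match(r"a??", "a").group()))    # prefers zero repetitions
```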
Groups
The actual text that matches a re in parentheses is a group and can be referred to later.
Example: (?P<frst> [a-z]{3}) (?P=frst)
Meaning of special characters:
( A )          Matches re A, and indicates the start and end of a group
(?P<name>A)    Matches A and names the group “name”
(?: A)         A non-capturing version of regular parentheses
(?P=name)      Matches whatever text was matched by the earlier group named name.
\number        Matches the contents of the group of that number.
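A short Python 3 sketch of named groups, numbered backreferences, and non-capturing groups. The first pattern follows the slide's example but is written without the space inside the group (an assumption about the intended pattern); the sample sentences are made up.

```python
import re

# Three lowercase letters, a space, then the same three letters again.
repeat = re.compile(r"(?P<frst>[a-z]{3}) (?P=frst)")
m = repeat.search("she said bye bye to us")
print(m.group("frst"), "|", m.group(0))

# \1 refers back to whatever the first group matched.
doubled = re.search(r"(\w+) \1", "it is is fine")
print(doubled.group(1))

# (?:...) groups without capturing, so only (c) shows up in groups().
print(re.match(r"(?:ab)+(c)", "ababc").groups())
```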
Group related
Meaning of special characters:
(?# … )    A comment
(?= A)     lookahead assertion
(?! A)     negative lookahead assertion
(?<= A)    lookbehind assertion
(?<!...)
(?(id/name)yes-pattern|no-pattern)
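Lookahead and lookbehind assertions test the surrounding text without consuming it. The Python 3 sketch below tries each form from the table on a made-up string; the patterns are illustrative assumptions, not examples from the lecture.

```python
import re

# Hypothetical sample text for illustration.
text = "price: $42, tax: $7, count: 13"

# (?<=...) lookbehind: digits that directly follow a dollar sign.
dollars = re.findall(r"(?<=\$)\d+", text)

# (?<!...) negative lookbehind: digit runs not preceded by '$' or another digit.
plain = re.findall(r"(?<![\$\d])\d+", text)

# (?=...) lookahead: words directly followed by a colon (colon not included).
labels = re.findall(r"\w+(?=:)", text)

# (?!...) negative lookahead: whole words not followed by a colon.
values = re.findall(r"\b\w+\b(?!:)", text)

# (?(1)yes|no): require ')' only if group 1 matched an opening '('.
balanced = re.compile(r"(\()?\d+(?(1)\))")

print(dollars, plain, labels, values)
print(bool(balanced.fullmatch("(12)")), bool(balanced.fullmatch("(12")))
```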
Positional special characters
Meaning of special characters:
^    (Caret.) Matches the start of the string
$    Matches the end of the string or just before the newline at the end of the string
Positional special characters (continued)
\A    Matches only at the start of the string.
\b    Matches the empty string, but only at the beginning or end of a word.
\B
\d    matches any decimal digit --- \D any non-digit character
\s    matches any whitespace character, equivalent to [ \t\n\r\f\v] --- \S
\w    matches any alphanumeric character and the underscore --- \W
\Z    Matches only at the end of the string
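A Python 3 sketch that tries these escapes on one made-up line of text. It also shows the difference between $ and \Z when the string ends in a newline; the sample line and patterns are assumptions for illustration.

```python
import re

# Hypothetical sample line, ending in a newline.
line = "The cat scattered 3 cats at 10:45\n"

whole_word = re.findall(r"\bcat\b", line)   # "cat" as a word of its own
inside = re.search(r"\Bcat\B", line)        # "cat" strictly inside a word
digits = re.findall(r"\d+", line)
spaces = re.findall(r"\s", line)
words = re.findall(r"\w+", line)

print(whole_word, inside.start(), digits, len(spaces), words)

# $ also matches just before a trailing newline; \Z only at the very end.
print(bool(re.search(r"45$", line)), bool(re.search(r"45\Z", line)))
print(bool(re.match(r"\AThe", line)))
```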
re Module - Matching vs Searching
import re
re.match(pattern, line)
re.search(pattern, line)
>>> re.match("c", "abcdef")   # No match
>>> re.search("c", "abcdef")  # Match
<_sre.SRE_Match object at ...>
re.compile
re.compile(pattern[, flags])
prog = re.compile(pattern)
result = prog.match(string)
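A compiled pattern can be reused for several operations. The Python 3 sketch below uses a made-up pattern and input; the variable name word is an assumption for illustration.

```python
import re

# Compile once, with a flag, then reuse the pattern object.
word = re.compile(r"[a-z]+", re.IGNORECASE)

first = word.match("Hello world")            # match() anchors at the start
all_words = word.findall("Hello world 771")
no_match = word.match(" hello")              # leading space: no match at position 0
found = word.search(" hello")                # search() scans forward

print(first.group(), all_words, no_match, found.group())
```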
Python’s Raw String Format
What regular expression matches the two-character pattern “\\”?
• Re = “\\\\”
Sometimes it simplifies patterns to disable the ‘\’. The “raw” prefix r changes how Python interprets ‘\’ in the string literal, before the regular expression engine sees the pattern.
For instance, “\n” is a regular expression that matches one character, the newline;
r“\n” is a regular expression with two characters, ‘\’ and ‘n’.
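The points above can be verified in a few lines of Python 3. The variable name two_backslashes is an assumption for illustration; the comments count the characters that actually reach the regular expression engine.

```python
import re

# "\n" is one character (a newline); r"\n" is two: a backslash and an 'n'.
print(len("\n"), len(r"\n"))

# Both work as patterns for a newline: the regex engine itself understands \n.
print(bool(re.fullmatch("\n", "\n")), bool(re.fullmatch(r"\n", "\n")))

two_backslashes = "\\\\"   # a string holding two backslash characters
print(len(two_backslashes))

# As a regular expression, four backslashes match two literal backslashes.
print(bool(re.fullmatch(r"\\\\", two_backslashes)))          # raw: 4 typed
print(bool(re.fullmatch("\\\\\\\\", two_backslashes)))       # non-raw: 8 typed

# Without the raw prefix, "\\\\" is the pattern \\ and matches only one backslash.
print(bool(re.fullmatch("\\\\", two_backslashes)))
```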
Natural Language Toolkit
• http://nltk.org/
• interfaces to over 50 corpora and lexical resources such as WordNet
• suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
Installing NLTK
http://nltk.org/install.html
Windows 32-bit binary installation
1. Install Python: http://www.python.org/download/releases/2.7.3/
2. Install Numpy (optional): http://sourceforge.net/projects/numpy/files/NumPy/1.6.2/numpy-1.6.2-win32-superpack-python2.7.exe
3. Install NLTK: http://pypi.python.org/pypi/nltk
4. Install PyYAML: http://pyyaml.org/wiki/PyYAML
5. Test installation: Start>Python27, then type import nltk
Installing NLTK Data
http://nltk.org/nltk_data/
Run the Python interpreter and type the commands:
>>> import nltk
>>> nltk.download()
Eliza
1966: Weizenbaum – program that chatted, simulating a Rogerian psychologist
User: Men are all alike.
Eliza: IN WHAT WAY?
User: They are always bugging us about something.
Eliza: CAN YOU THINK OF A SPECIFIC EXAMPLE
…
http://en.wikipedia.org/wiki/Eliza
http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk/chat/eliza.py?r=8479
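The exchange above can be imitated with a handful of regular expression rules. The Python 3 sketch below is a toy under stated assumptions: the RULES table, the respond function, and the third rule are hypothetical and are not taken from Weizenbaum's program or from the NLTK eliza.py module.

```python
import re

# Hypothetical rules: (pattern, response template).
# {0} in a template is filled with the text captured by the first group.
RULES = [
    (re.compile(r".*\ball\b.*", re.IGNORECASE), "IN WHAT WAY?"),
    (re.compile(r".*\balways\b.*", re.IGNORECASE),
     "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
    (re.compile(r".*\bI am (.*)", re.IGNORECASE), "HOW LONG HAVE YOU BEEN {0}?"),
]

def respond(utterance):
    """Return the response of the first rule whose pattern matches."""
    cleaned = utterance.rstrip(".!?")
    for pattern, template in RULES:
        m = pattern.match(cleaned)
        if m:
            return template.format(*(g.upper() for g in m.groups()))
    return "PLEASE GO ON."

for line in ["Men are all alike.",
             "They are always bugging us about something.",
             "I am unhappy."]:
    print("User: ", line)
    print("Eliza:", respond(line))
```

The rules only match surface patterns, which is the point Searle's argument makes about such programs.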
Links and References
Eliza
• http://i5.nyu.edu/~mm64/x52.9265/january1966.html
• http://www-ai.ijs.si/eliza/eliza.html
• http://www.strout.net/info/coding/python/ai/therapist.py
Turing Test
• http://www.abelard.org/turpap/turpap.htm
IBM’s Watson
http://en.wikipedia.org/wiki/Watson_%28computer%29
Watson Architecture
http://en.wikipedia.org/wiki/Watson_%28computer%29
The Face of Watson
https://www.youtube.com/watch?v=WIKM732oEek
• Text to Speech