Lecture 1 Overview. Topics: Overview. Readings: Chapters 1, 2. January 14, 2013. CSCE 771 Natural Language Processing


Page 1: Lecture 1  Overview

Lecture 1 Overview

Topics: Overview

Readings: Chapters 1, 2

January 14, 2013

CSCE 771 Natural Language Processing

Page 2: Lecture 1  Overview

– 2 – CSCE 771 Spring 2013

Overview
• Pragmatic issues
• Course plans
    • Foundation for research
• Today
    • Challenge of 2001’s HAL
    • Areas of research
    • Examples of language processing

Page 3: Lecture 1  Overview

Slide from: Speech and Language Processing, Jurafsky and Martin

NLP: Why Should You Care?

Two trends
1. An enormous amount of knowledge is now available in machine-readable form as natural language text.
2. Conversational agents are becoming an important form of human-computer communication.

Much of human-human communication is now mediated by computers.

Page 4: Lecture 1  Overview

Commercial World
Lots of exciting stuff going on…

Powerset

Slide from: Speech and Language Processing Jurafsky and Martin

Page 5: Lecture 1  Overview

Commercial World
Lots of exciting stuff going on…

Page 6: Lecture 1  Overview


Google Translate

Slide from: Speech and Language Processing Jurafsky and Martin

Page 7: Lecture 1  Overview


Google Translate

Slide from: Speech and Language Processing Jurafsky and Martin

Page 8: Lecture 1  Overview


Web Q/A

Slide from: Speech and Language Processing Jurafsky and Martin

Page 9: Lecture 1  Overview

HAL 9000 of 2001: A Space Odyssey
A scene from Arthur Clarke and Stanley Kubrick’s 2001:

DAVE: Open the pod bay doors, HAL.
HAL: I’m sorry Dave, I’m afraid I can’t do that.

Notes on context:
• HAL is the main computer on the spaceship
• HAL is paranoid and decides to kill off the crew

Page 10: Lecture 1  Overview

Clarke a Little Too Optimistic
We don’t have a HAL today in 2009. How close are we?
• Computers replaced bank tellers (in many instances)
• But the NASA computers don’t talk yet
• Microsoft XP/Vista’s voice commands
• Adobe Reader reading PDF documents

But can they understand spoken commands?

Page 11: Lecture 1  Overview

Challenges in Developing HAL
So what are the major challenges in developing HAL?
• Speech recognition
• Natural language understanding
• Information retrieval
• Information extraction
• Inference
• Speech generation

Page 12: Lecture 1  Overview

Samples of Language Processing
Text processing (in Unix)
• wc – word count
• grep regexpr files – print lines in the files that match re
• find

More knowledgeable processing
• spelling checking/correcting
• grammar checking
• Information retrieval
    • Find all documents on decomposition by David Parnas

Page 13: Lecture 1  Overview

Even More Knowledgeable Processing

Information extraction
• Reading the “online” Wall Street Journal
    • What was the dividend paid by GM last year?
• USC Handbook
    • How many hours does it take to get a PhD in CSE?

Machine translation
• The spirit is willing but the body is weak.
• To Russian: Sprit охотно готово но тело слабо.
• Back to English: Vodka is good but the meat is rotten. (Rich 86)
• Babelfish - http://world.altavista.com/tr
• Back to English: Sprit is willingly prepared but body weakly.

Page 14: Lecture 1  Overview

Even Deeper Understanding
Email access over the phone
• Respond to commands: “list all emails from Bob”
• Read email message 8
• Text to speech

Assistants
• Agents reading the net and summarizing a topic

Page 15: Lecture 1  Overview

Subcategories of Knowledge in S&L
• Phonetics/phonology
• Morphology – shape and behavior of words in contexts
• Syntax – the legitimate sequences of words
• Semantics – the meanings of words, phrases, sentences and documents
• Pragmatics – the appropriate use of language – politeness, direct/indirectness
• Discourse conventions – correctly structuring conversations

Page 16: Lecture 1  Overview

Ambiguity: I made her duck.

1. ..
2. ..
3. ..
4. ..
5. ..

Page 17: Lecture 1  Overview

Word Ambiguity
Her – who is this?

Made
• Verb with meanings: 1) create 2) cook 3) force

Duck
• Noun: the waterfowl, the food
• Verb

So how do we resolve this sentence?

Page 18: Lecture 1  Overview

Turing Test
A computer simulating intelligence

http://en.wikipedia.org/wiki/Turing_test

Page 19: Lecture 1  Overview

The Chinese Room
John Searle's 1980 paper Minds, Brains, and Programs proposed an argument against the Turing Test known as the "Chinese room" thought experiment.

Searle argued that software (such as ELIZA) could pass the Turing Test simply by manipulating symbols of which it had no understanding.

Without understanding, such programs could not be described as "thinking" in the same sense people do.

Loebner Prize – a competition, held since 1991, for the best attempt at passing the Turing Test

http://en.wikipedia.org/wiki/Turing_test

Page 20: Lecture 1  Overview

Loebner Prize
The prizes for each year include:
• $2,000 for the most human-seeming of all bots for that year (awarded every year)
• $25,000 for the first bot that judges cannot distinguish from a real human in a text-only Turing Test (awarded once only)
• $100,000 to the first bot that judges cannot distinguish from a real human in a Turing Test that includes deciphering and understanding text, visual, and auditory (and tactile?) input

http://en.wikipedia.org/wiki/Loebner_prize

http://www.loebner.net/Prizef/loebner-prize.html

Page 21: Lecture 1  Overview

Finite Automata Arose in the 1950’s
• 1936: Turing’s model of algorithmic computation
• 1943: McCulloch-Pitts model of the neuron
• 1951, 1956: Kleene first introduced finite automata and regular expressions
• 1959: Rabin and Scott, nondeterministic finite automata
• 1968: Thompson, first to compile regular expressions into an editor for text searching

Page 22: Lecture 1  Overview

Key Concepts #1: Formal Language
A formal language is a set of (finite) strings over a finite alphabet.

Key Concept #1: A model that can both recognize and generate all and only the strings of a formal language acts as a definition of the language.

L(re) = L(M_nfa)

Formal languages are not the same as natural languages.

Linguists are generally more interested in generative grammars; computer scientists are more interested in recognizing.
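The equivalence L(re) = L(M_nfa) can be checked on a small case. The sketch below is only an illustration: the regular expression (a|b)*abb, the transition table, and the helper names are assumptions chosen for this example, not material from the lecture.

```python
import re
from itertools import product

# Hypothetical example: the regular expression (a|b)*abb and a hand-built NFA for it.
PATTERN = re.compile(r"(a|b)*abb\Z")

# NFA transitions: (state, symbol) -> set of next states.
# State 0 loops on a and b, and may also "guess" that the final abb starts here.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPT = 0, {3}

def nfa_accepts(s):
    """Simulate the NFA by tracking the set of states it could be in."""
    states = {START}
    for ch in s:
        states = set().union(*[NFA.get((q, ch), set()) for q in states])
    return bool(states & ACCEPT)

def regex_accepts(s):
    """Recognize the same language with the compiled regular expression."""
    return PATTERN.match(s) is not None

# Compare both recognizers on every string over {a, b} up to length 6.
all_strings = ["".join(p) for n in range(7) for p in product("ab", repeat=n)]
disagreements = [s for s in all_strings if nfa_accepts(s) != regex_accepts(s)]
print(disagreements)
```

An empty list of disagreements is evidence, not proof, that the two models define the same language; the test only covers short strings.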

Page 23: Lecture 1  Overview

Formal Languages
Alphabet: Σ (finite set of symbols)

Strings:
• s = c1c2 … cn (finite sequence of characters)
• Length | s | = n

Language: a language is a set of strings

Example languages over Σ = {a, b, c}
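As a rough sketch of these definitions, the code below enumerates strings over Σ = {a, b, c} and defines two languages by membership tests. The two example languages and all function names are hypothetical; they are not the examples shown in the lecture.

```python
from itertools import product

SIGMA = ("a", "b", "c")  # the alphabet from the slide

def strings_up_to(max_len, alphabet=SIGMA):
    """Yield every string over the alphabet with length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

# Two hypothetical example languages, each given by a membership test.
def in_no_b(s):
    """Strings that contain no b."""
    return "b" not in s

def in_starts_and_ends_with_a(s):
    """Strings that start and end with a."""
    return s.startswith("a") and s.endswith("a")

universe = list(strings_up_to(2))
no_b = [s for s in universe if in_no_b(s)]
a_to_a = [s for s in universe if in_starts_and_ends_with_a(s)]
print(len(universe))
print(no_b)
print(a_to_a)
```

Both languages are infinite; the lists only show their members of length at most 2.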

Page 24: Lecture 1  Overview

Regular Expressions

Page 25: Lecture 1  Overview


Regular Expression Examples

Page 26: Lecture 1  Overview


Finite Automata to recognize a Language

Page 27: Lecture 1  Overview

CSCE 531 – Overview in One Slide
% flex lang.l                     // lex.yy.c
% bison lang.y                    // lang.c
% gcc lex.yy.c lang.c -o parse
% parse input

[Diagram: lang.l → FLEX → lex.yy.c (yylex()); lang.y → BISON → lang.c (yyparse()); input: source program; output: executable program]

Page 28: Lecture 1  Overview

Regular Expressions in Unix Tools
Ken Thompson's regular expressions in ed, ex, and vi
• Reg-expr → NFA, then simulate
• Global pattern match command:
    g/Unix/s/Unix/UNIX/g
    g/re/print == grep

Page 29: Lecture 1  Overview

Grep Family
Global match Regular Expression and Print (GREP)
• grep [uU]nix f1 f2 … fn
• egrep pat files // efficient: NFA → DFA, then execute
• fgrep pat files // fixed grep for fixed strings

Find for searching directories (not really reg expr)
• find dir -name pat // search for files with name matching pat
• find dir -exec grep pat {} \; // search in files for the pattern pat
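The difference between grep and fgrep can be sketched in a few lines of Python. This is an illustrative approximation, not how the Unix tools are implemented; the function names and the sample lines are made up for the example.

```python
import re

def grep(pattern, lines):
    """Return the lines that contain a match for the regular expression."""
    prog = re.compile(pattern)
    return [line for line in lines if prog.search(line)]

def fgrep(fixed, lines):
    """Return the lines that contain the fixed string (no regex interpretation)."""
    return [line for line in lines if fixed in line]

# Made-up sample text standing in for the contents of a file.
sample = [
    "Unix was written at Bell Labs.",
    "Many unix tools use regular expressions.",
    "This line mentions neither.",
    "The pattern [uU]nix is a character class.",
]

print(grep(r"[uU]nix", sample))
print(fgrep("[uU]nix", sample))
```

The same argument selects different lines: grep treats [uU] as a character class, while fgrep looks for the seven characters literally.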

Page 30: Lecture 1  Overview

Editing Scripts
Create a script of editing commands, then execute with

ex file1 < edScript

Example:

1,$s/[uU]nix/UNIX/g
1,$s/langauge/language/g
g/^$/d      // delete empty lines; ^ = start of line, $ = end
…
w
q
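For comparison, the three commands of the example script can be approximated with Python's re module. This is a sketch under the assumption that the file fits in memory as one string; the function name and the sample text are hypothetical.

```python
import re

def apply_edit_script(text):
    """Apply the three edits from the example script to a block of text."""
    lines = text.split("\n")
    # 1,$s/[uU]nix/UNIX/g
    lines = [re.sub(r"[uU]nix", "UNIX", line) for line in lines]
    # 1,$s/langauge/language/g
    lines = [line.replace("langauge", "language") for line in lines]
    # g/^$/d : drop empty lines
    lines = [line for line in lines if line != ""]
    return "\n".join(lines)

before = "unix is old.\n\nThe C langauge came with Unix."
after = apply_edit_script(before)
print(after)
```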

Page 31: Lecture 1  Overview

Other Unix Regular Expression Based Tools
• sed (stream editor)
• awk
• Perl – scripting language
• Python
• Ruby
• regcomp, regexec in C

Page 32: Lecture 1  Overview

Python String Constants
http://docs.python.org/2/library/stdtypes.html
• string.ascii_letters
• string.ascii_lowercase
• string.ascii_uppercase
• string.digits – The string '0123456789'.
• string.hexdigits – The string '0123456789abcdefABCDEF'.
• string.letters – The specific value is updated when locale.setlocale() is called.
• string.lowercase
• string.octdigits – The string '01234567'.
• string.punctuation – String of ASCII characters which are considered punctuation
• string.printable
• string.uppercase
• string.whitespace
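A short sketch of how these constants might be used to classify characters. The classify function is a hypothetical helper; the example sticks to the ASCII constants because the locale-dependent string.letters, string.lowercase and string.uppercase exist only in Python 2.

```python
import string

def classify(ch):
    """Name the first constant from the string module that contains the character."""
    if ch in string.digits:
        return "digit"
    if ch in string.ascii_lowercase:
        return "lowercase"
    if ch in string.ascii_uppercase:
        return "uppercase"
    if ch in string.punctuation:
        return "punctuation"
    if ch in string.whitespace:
        return "whitespace"
    return "other"

kinds = [classify(ch) for ch in "Hi 7!"]
print(kinds)
print(string.ascii_letters == string.ascii_lowercase + string.ascii_uppercase)
```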

Page 33: Lecture 1  Overview

String Method Examples
s = "i think 771 is going great!"
print(s.capitalize())

# center(width[, fillchar])
print(':' + s.center(44, '.') + ':')

# count(sub[, start[, end]])
print(s.count("in"))
print(s.count("in", 13))
print(s.count("in", 3))
print(s.count("in", 13, 22))
print(s.count("in", 13, 15))

# decode([encoding[, errors]])
# encode([encoding[, errors]])
# endswith(suffix[, start[, end]])

Page 34: Lecture 1  Overview

expandtabs([tabsize])
find(sub[, start[, end]])
index(sub[, start[, end]]) – Like find(), but raise ValueError when the substring is not found.
isalnum()
isalpha()
isdigit()
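The difference between find() and index() is easiest to see side by side. The following is a small sketch; safe_index is a hypothetical wrapper added only for the illustration.

```python
s = "i think 771 is going great!"

def safe_index(text, sub):
    """Call index() but turn the ValueError into None."""
    try:
        return text.index(sub)
    except ValueError:
        return None

found = s.find("771")     # position of the first match
missing = s.find("772")   # find() signals failure with -1
print(found)
print(missing)
print(safe_index(s, "772"))
print("771".isdigit())
print("csce771".isalnum())
print("csce771".isalpha())
```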

Page 35: Lecture 1  Overview

rpartition(sep)
rsplit([sep[, maxsplit]])
rstrip([chars])
split([sep[, maxsplit]])
splitlines([keepends])
startswith(prefix[, start[, end]])
strip([chars])
swapcase()
title()
translate(table[, deletechars])
upper()
zfill(width)
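A few of these methods chained together, as one might when cleaning up a line of text. The input line is made up for the illustration.

```python
line = "  CSCE 771: Natural Language Processing  \n"

stripped = line.strip()                         # remove whitespace at both ends
course, sep, title = stripped.rpartition(": ")  # split at the last ": "
words = title.split()                           # split on runs of whitespace
last_split = title.rsplit(" ", 1)               # split once, starting from the right

print(course)
print(words)
print(last_split)
print(title.upper())
print("7".zfill(3))
print(title.startswith("Natural"))
```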

Page 36: Lecture 1  Overview

Python re – Regular Expressions

• http://docs.python.org/library/re.html
• re – Regular expression module
• Operators (special characters)
• Lookahead / lookbehind
• Search vs match
• re module contents

Page 37: Lecture 1  Overview

Python Regular Expressions
http://docs.python.org/2/library/re.html

Page 38: Lecture 1  Overview

Fundamental Re Operators in Python

RegExpr    Matches
c          the single character c
A|B        either re A or re B
AB         re A followed by re B
A*         0 or more repetitions of the re A
(A)        re A, i.e. the re inside the parentheses
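The operators in this table can be tried out directly. In the sketch below, matches_whole is a hypothetical helper that wraps the pattern so it must cover the entire string; the example patterns are illustrative choices.

```python
import re

def matches_whole(pattern, text):
    """True if the pattern matches the entire text, not just a prefix."""
    return re.match("(?:" + pattern + r")\Z", text) is not None

# Each row: (pattern, text, expected result)
examples = [
    ("c", "c", True),             # single character
    ("a|b", "b", True),           # alternation
    ("ab", "ab", True),           # concatenation
    ("ab", "ba", False),
    ("a*", "", True),             # zero repetitions are allowed
    ("a*", "aaaa", True),
    ("(a|b)*c", "abbac", True),   # parentheses group the alternation under *
    ("a|b*c", "abbac", False),    # without them, * applies to b only
]

results = [matches_whole(p, t) for p, t, _ in examples]
print(results)
```

The last two rows show why the parentheses matter: * binds more tightly than concatenation, which in turn binds more tightly than |.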

Page 39: Lecture 1  Overview

Other Operators in Python

RegExpr    Matches
.          (Dot.) In the default mode, this matches any character except a newline. …
A+         1 or more repetitions of A
A?         0 or 1 repetitions of A
A{m}       exactly m repetitions of A
A{m,n}     from m to n repetitions of A
\c         quoted character
[chars]    character class
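A quick way to get a feel for these operators is to test each one against a string it should and should not match. The sketch below uses a hypothetical helper, whole, that requires the pattern to cover the entire string; the test strings are arbitrary.

```python
import re

def whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.match("(?:" + pattern + r")\Z", text) is not None

print(whole(r"a.c", "abc"))          # . matches any character except a newline
print(whole(r"a.c", "a\nc"))         # so this one fails
print(whole(r"ab+", "a"))            # + needs at least one b
print(whole(r"ab?", "a"))            # ? allows zero or one b
print(whole(r"\d{3}", "771"))        # exactly three digits
print(whole(r"\d{2,4}", "12345"))    # five digits is outside the range 2..4
print(whole(r"\$", "$"))             # quoted character: a literal dollar sign
print(whole(r"[uU]nix", "Unix"))     # character class
```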

Page 40: Lecture 1  Overview


Greedy Operators in Python

Page 41: Lecture 1  Overview

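Non Greedy Operators in Python

RegExpr    Matches
A*?        0 or more repetitions of A, as few as possible
A+?        1 or more repetitions of A, as few as possible
A??        0 or 1 repetitions of A, preferring 0
A{m,n}?    from m to n repetitions of A, as few as possible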
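The contrast between greedy and non-greedy matching shows up clearly on markup-like text. The sample string below is made up for the illustration.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.search(r"<.*>", html).group(0)    # runs to the last '>'
lazy = re.search(r"<.*?>", html).group(0)     # stops at the first '>'
all_tags = re.findall(r"<.+?>", html)

digits_greedy = re.match(r"\d{2,4}", "12345").group(0)
digits_lazy = re.match(r"\d{2,4}?", "12345").group(0)

print(greedy)
print(lazy)
print(all_tags)
print(digits_greedy)
print(digits_lazy)
```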

Page 42: Lecture 1  Overview

Groups
The actual text that matches a re in parentheses is a group and can be referred to later.

Example: (?P<frst>[a-z]{3}) (?P=frst)

Special character    Meaning
(A)                  Matches re A, and indicates the start and end of a group
(?P<name>A)          Matches A and names the group “name”
(?:A)                A non-capturing version of regular parentheses
(?P=name)            Matches whatever text was matched by the earlier group named name
\number              Matches the contents of the group of that number
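The group constructs above can be exercised as follows. The first pattern is the example from the slide; the input sentences are assumptions made up for the illustration.

```python
import re

# Named group plus a backreference: the same three letters twice, separated by a space.
repeat = re.compile(r"(?P<frst>[a-z]{3}) (?P=frst)")

m = repeat.search("say abc abc again")
print(m.group(0))
print(m.group("frst"))
print(repeat.search("say abc abd again"))

# Numbered backreference: \1 must equal whatever group 1 matched.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
print(doubled.group(1))

# A non-capturing group does not get a group number.
nc = re.match(r"(?:ab)+(c)", "ababc")
print(nc.groups())
```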

Page 43: Lecture 1  Overview

Group Related

Special character                     Meaning
(?#…)                                 A comment
(?=A)                                 lookahead assertion
(?!A)                                 negative lookahead assertion
(?<=A)                                lookbehind assertion
(?<!A)                                negative lookbehind assertion
(?(id/name)yes-pattern|no-pattern)    matches yes-pattern if the group with the given id or name exists, otherwise no-pattern
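Lookahead and lookbehind assertions test the surrounding text without consuming it. The sketch below uses an invented sample string; the variable names are illustrative only.

```python
import re

text = "price: $40, cost: 30 USD, id 771"

# Lookbehind: digits that are preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Lookahead: digits that are followed by " USD".
before_usd = re.findall(r"\d+(?= USD)", text)

# Negative lookbehind and lookahead: whole numbers with neither marker.
plain = re.findall(r"\b(?<!\$)\d+\b(?! USD)", text)

# (?#...) is a comment and matches nothing.
commented = re.findall(r"\d+(?# one or more digits)", "a1b22")

print(after_dollar)
print(before_usd)
print(plain)
print(commented)
```

In each case the "$" or " USD" is checked but not included in the match, which is the point of an assertion.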

Page 44: Lecture 1  Overview

Positional Special Characters

Special character    Meaning
^                    (Caret.) Matches the start of the string
$                    Matches the end of the string or just before the newline at the end of the string

Page 45: Lecture 1  Overview

Positional Special Characters

\A    Matches only at the start of the string.
\b    Matches the empty string, but only at the beginning or end of a word.
\B    Matches the empty string, but only when it is not at the beginning or end of a word.
\d    Matches any decimal digit; \D matches any non-digit character.
\s    Matches any whitespace character, equivalent to [ \t\n\r\f\v]; \S matches any non-whitespace character.
\w    Matches any alphanumeric character and the underscore; \W matches any other character.
\Z    Matches only at the end of the string.
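The positional escapes are easiest to understand by comparing what each one finds in the same text. The sample strings below are made up, and re.MULTILINE is used in one call as an extra assumption to show ^ matching at the start of each line.

```python
import re

text = "cat concat cats\nthe cat"

whole_word = re.findall(r"\bcat\b", text)       # cat as a complete word
inside_word = re.findall(r"\Bcat", text)        # cat not at the start of a word
first_only = re.findall(r"\Athe|\Acat", text)   # \A anchors at the very start
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
tokens = re.findall(r"\w+|\S", "CSCE_771 meets at 3:30!")
digits = re.findall(r"\d+", "CSCE_771 meets at 3:30!")

print(whole_word)
print(inside_word)
print(first_only)
print(line_starts)
print(tokens)
print(digits)
```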

Page 46: Lecture 1  Overview

re Module – Matching vs Searching
import re

re.match(pattern, line)     # tries to match only at the start of line
re.search(pattern, line)    # scans line for the first position that matches

>>> re.match("c", "abcdef")     # No match
>>> re.search("c", "abcdef")    # Match
<_sre.SRE_Match object at ...>

Page 47: Lecture 1  Overview

re.compile
re.compile(pattern[, flags])

prog = re.compile(pattern)
result = prog.match(string)

Page 48: Lecture 1  Overview

Python’s Raw String Format
What regular expression matches a single literal backslash, written “\\” as a Python string?
• Re = “\\\\” (or r“\\” as a raw string)

Sometimes it simplifies patterns to disable the ‘\’. The raw prefix r turns off Python’s own backslash escapes in a string literal, so every backslash reaches the re module unchanged.

For instance, “\n” is a one-character string (the newline), while r“\n” is a two-character string: ‘\’ and ‘n’. As regular expressions, both match a newline.
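The effect of the raw prefix can be checked by looking at string lengths and at what each pattern matches. The path string below is an arbitrary example chosen for the illustration.

```python
import re

print(len("\n"))      # one character: the newline
print(len(r"\n"))     # two characters: backslash and n
print(len("\\\\"))    # two characters: two backslashes
print(len(r"\\"))     # the same two characters, written as a raw string

path = "C:\\temp"     # the string C:\temp, which contains one backslash

# Both patterns are the regular expression \\ and match one literal backslash.
plain_hits = re.findall("\\\\", path)
raw_hits = re.findall(r"\\", path)

# As regular expressions, "\n" and r"\n" both match a newline character.
newline_plain = re.search("\n", "a\nb") is not None
newline_raw = re.search(r"\n", "a\nb") is not None

print(plain_hits == raw_hits)
print(len(plain_hits))
print(newline_plain and newline_raw)
```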

Page 49: Lecture 1  Overview

Natural Language Toolkit
• http://nltk.org/
• interfaces to over 50 corpora and lexical resources such as WordNet
• suite of text processing libraries for
    • classification,
    • tokenization,
    • stemming,
    • tagging,
    • parsing, and
    • semantic reasoning.

Page 50: Lecture 1  Overview

Installing NLTK
http://nltk.org/install.html

Windows 32-bit binary installation
1. Install Python: http://www.python.org/download/releases/2.7.3/
2. Install Numpy (optional): http://sourceforge.net/projects/numpy/files/NumPy/1.6.2/numpy-1.6.2-win32-superpack-python2.7.exe
3. Install NLTK: http://pypi.python.org/pypi/nltk
4. Install PyYAML: http://pyyaml.org/wiki/PyYAML
5. Test installation: Start>Python27, then type import nltk

Page 51: Lecture 1  Overview

Installing NLTK Data
http://nltk.org/nltk_data/

Run the Python interpreter and type the commands:

>>> import nltk
>>> nltk.download()


Page 54: Lecture 1  Overview

Eliza
1966 Weizenbaum – program that chatted, simulating a Rogerian psychologist

User: Men are all alike.
Eliza: IN WHAT WAY?
User: They are always bugging us about something.
Eliza: CAN YOU THINK OF A SPECIFIC EXAMPLE
…

http://en.wikipedia.org/wiki/Eliza

http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk/chat/eliza.py?r=8479
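Programs in the style of Eliza are commonly built from regular-expression rules that map a pattern in the user's input to a canned response. The sketch below is a toy illustration with hypothetical rules written to reproduce the exchange above; it is not Weizenbaum's script and not the NLTK implementation.

```python
import re

# Hypothetical rules: (pattern, response template). Earlier rules take priority.
RULES = [
    (re.compile(r"\balike\b", re.IGNORECASE),
     "IN WHAT WAY?"),
    (re.compile(r"\balways\b", re.IGNORECASE),
     "CAN YOU THINK OF A SPECIFIC EXAMPLE?"),
    (re.compile(r"\bI am (.+?)[.!?]*$", re.IGNORECASE),
     "HOW LONG HAVE YOU BEEN {0}?"),
]
DEFAULT = "PLEASE GO ON."

def respond(utterance):
    """Return the response of the first rule whose pattern matches."""
    for pattern, template in RULES:
        m = pattern.search(utterance)
        if m:
            # Captured text, if any, is echoed back in upper case.
            return template.format(*[g.upper() for g in m.groups()])
    return DEFAULT

print(respond("Men are all alike."))
print(respond("They are always bugging us about something."))
print(respond("I am unhappy."))
print(respond("It rains a lot."))
```

The program has no model of meaning at all; it only rearranges matched text, which is the point Searle's argument turns on.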

Page 55: Lecture 1  Overview

Links and References
Eliza
• http://i5.nyu.edu/~mm64/x52.9265/january1966.html
• http://www-ai.ijs.si/eliza/eliza.html
• http://www.strout.net/info/coding/python/ai/therapist.py

Turing Test
• http://www.abelard.org/turpap/turpap.htm

Page 56: Lecture 1  Overview

IBM’s Watson

http://en.wikipedia.org/wiki/Watson_%28computer%29

Page 57: Lecture 1  Overview

Watson Architecture
http://en.wikipedia.org/wiki/Watson_%28computer%29

Page 58: Lecture 1  Overview

The Face of Watson
https://www.youtube.com/watch?v=WIKM732oEek

Text to Speech