Natural Language Processing
Introduction
Pieter Wellens, 2012-2013
These slides are partially based on the course materials from the ANLP course given at the School of Informatics, Edinburgh and the online coursera Stanford NLP course.
Saturday 16 February 13
Welcome
• Lecturer
• Pieter Wellens
• room: 10G725 (make appointment by e-mail or after class)
• Room: 10G725 (AI-Lab meeting room)
• Time: each Friday, 16-18h
• Examination:
• assignments: 50%
• traditional exam (written and oral): 50%
• Pointcarré: used for urgent communication
• let me know if you are not yet registered
• http://pointcarre.vub.ac.be/run.php?application=weblcms&go=course_viewer&course=9108
• Course website:
• contains slides, background reading, assignments, etc
• http://ai.vub.ac.be/courses/2012-2013/natural-language-processing
• Textbook:
• Daniel Jurafsky and James H. Martin, Speech and Language Processing, second edition (2008/2009), Pearson International Edition.
• Steven Bird, Ewan Klein, and Edward Loper, Natural Language Processing with Python, first edition 2009.
• http://nltk.org/book/
What is NLP: Applications
• Speech Recognition
• Machine Translation
• Information Retrieval
• Dialog Systems
• Question Answering
• Information Extraction
• Summarization
• Sentiment Analysis
• ...
Overview of this class
• Linguistic terminology
• Tokenization
• Normalization and stemming
• Sentence segmentation
Linguistic terminology
[Figure, built up over several slides: levels of analysis for the example "This is a simple sentence". Source: Philipp Koehn, ANLP Lecture 1, 19 September 2012]
• Words: This is a simple sentence
• Morphology: is = be, 3sg, present
• Part of speech: DT VBZ DT JJ NN
• Syntax: phrase-structure tree with the nodes S, NP, VP, NP
• Semantics: SENTENCE1 = string of words satisfying the grammatical rules of a language; SIMPLE1 = having few parts
• Discourse: the next sentence, "But it is an instructive one.", stands in a CONTRAST relation to the first
Linguistic terminology today
Words and Morphology: Tokenization
Word tokenization
• Demo time!
  shakes-large.txt | less
  tr -sc 'A-Za-z' '\n' < shakes-large.txt | less
  tr -sc 'A-Za-z' '\n' < shakes-large.txt | sort | less
  tr -sc 'A-Za-z' '\n' < shakes-large.txt | sort | uniq -c | less
  tr -sc 'A-Za-z' '\n' < shakes-large.txt | sort | uniq -c | sort -n -r | less
  tr 'A-Z' 'a-z' < shakes-large.txt | tr -sc 'a-z' '\n' | sort | uniq -c | sort -n -r | less
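The final pipeline of the demo (lowercase, split on non-letters, sort, count, sort by count) can be sketched in Python as well. This is only a rough equivalent for ASCII text; the function name `word_counts` and the sample string are made up for illustration.

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text, keep runs of a-z as words, and count them."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

# Hypothetical sample text; the demo itself reads shakes-large.txt.
sample = "To be, or not to be: that is the question."
for word, count in word_counts(sample).most_common(3):
    print(count, word)
```

To run it on the demo file, pass the file contents instead of the sample, for example `word_counts(open("shakes-large.txt").read())`.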
Word tokenization
• Definition: strings of letters separated by spaces
• But:
• punctuation: commas, periods, etc. (tokenization)
• hyphens: top-down
• clitics: Luc’s
• compounds: website, postbodetasbeschermhoes
• no spaces: 伦敦每日快报指出,两台记载黛安娜王妃一九九七年巴黎死亡车祸调查资料的手提电脑,被从前大都会警察总长的办公室里偷走. (Chinese and Japanese)
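Some of the problem cases above (punctuation, hyphens, clitics) can be handled with a single regular expression. The following is a minimal sketch, not a complete tokenizer: the pattern and the name `tokenize` are assumptions made for illustration, it only knows ASCII letters, and it does nothing useful for compounds or for text written without spaces.

```python
import re

# Alternatives are tried left to right, so the more specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"[A-Za-z]+(?:-[A-Za-z]+)+"   # hyphenated words such as top-down
    r"|['’]s\b"                   # the clitic 's, split off from its host
    r"|[A-Za-z]+"                 # plain words
    r"|\d+(?:[.,]\d+)*"           # numbers such as 3.14
    r"|[^\w\s]"                   # any other single punctuation mark
)

def tokenize(text):
    """Return the list of tokens found in text (ASCII letters only)."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Luc’s top-down design, obviously."))
```

Keeping top-down as one token and splitting Luc’s into two is a design choice, not a rule; other applications may want the opposite.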
Word tokenization: counting
• Most frequent words in the English Europarl corpus (www.statmt.org/europarl/)
• http://www.edict.biz/textanalyser/wordlists.htm
Any word:
Frequency in text    Token
1,929,379            the
1,297,736            ,
956,902              .
901,174              of
841,661              to
684,869              and
582,592              in
452,491              that
424,895              is
424,552              a
Nouns:
Frequency in text    Content word
129,851              European
110,072              Mr
98,073               commission
71,111               president
67,518               parliament
64,620               union
58,506               report
57,490               council
54,079               states
49,965               member
Source: Philipp Koehn, ANLP Lecture 1, 19 September 2012
• But: many words occur only once, and even more words never appear in the corpus
• Zipf’s law: f × r = k
• f = frequency of a word
• r = rank of a word (if sorted by frequency)
• k = a constant
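Zipf's law can be checked empirically by ranking words by frequency and looking at the product f × r. The sketch below assumes a simple letters-only tokenization; `zipf_table` is a hypothetical helper name. On a large corpus the last column should stay roughly constant if the law holds; on a toy text it only shows the mechanics.

```python
import re
from collections import Counter

def zipf_table(text, top=5):
    """Return (rank, word, frequency, frequency * rank) for the most frequent words."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    ranked = counts.most_common(top)
    return [(rank, word, freq, freq * rank)
            for rank, (word, freq) in enumerate(ranked, start=1)]

# Toy input, only to show the shape of the result.
for row in zipf_table("a a a a b b c"):
    print(row)
```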
• 10,000 sentences from the Europarl corpus: how many different words?
Language      Different words
English       16k
French        22k
Dutch         24k
Italian       25k
Portuguese    26k
Spanish       26k
Danish        29k
Swedish       30k
German        32k
Greek         33k
Finnish       55k
• Why the difference? Morphology.
Source: Philipp Koehn, ANLP Lecture 2, 21 September 2012
Word Tokenization
• Lemma: the canonical, dictionary form
• throw, threw, throws, throwing are forms of the same lexeme, with throw as lemma
• Stem: the part of a word that is common to all its inflected variants
• stem of delete and deletion is delet
• Wordform: the full inflected surface form
• cat and cats are different wordforms
• Fragments: I uh am teaching you some uh interest - interesting stuff (the broken-off "interest -")
• Filled pauses: I uh am teaching you some uh interest - interesting stuff (the two occurrences of "uh")
Word Tokenization: Types and tokens
• The shuttle bus dropped me at the wrong hotel in New York.
• Type: an element of the vocabulary (vocabulary == set of types)
• Token: an instance of a type in running text
• How many types and tokens?
• 12 tokens and 11 types (counting words only, with The and the as one type)
Corpus                             Tokens = N     Types = |V|
Switchboard phone conversations    2.4 million    20 thousand
Shakespeare                        884,000        31 thousand
Google N-grams                     1 trillion     13 million
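The count for the example sentence can be reproduced with a few lines of Python. This is a sketch under two assumptions that the slide does not spell out: punctuation is not counted, and case is folded so that The and the are the same type. The name `tokens_and_types` is hypothetical.

```python
import re

def tokens_and_types(text, fold_case=True):
    """Return (number of tokens, number of types), counting letter-only words."""
    words = re.findall(r"[A-Za-z]+", text)
    if fold_case:
        words = [w.lower() for w in words]
    return len(words), len(set(words))

sentence = "The shuttle bus dropped me at the wrong hotel in New York."
print(tokens_and_types(sentence))
```

With `fold_case=False` the two spellings The and the count as separate types, which shows how much the answer depends on the normalization chosen.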
Words and Morphology: Normalization and Stemming
Word Normalization: Morphemes
• Morphemes: smallest meaningful units that make up words
• Two types of morphemes
• stems: dog, fast, study
• affixes: +ed, un+, +s
• Four types of affixes
• suffix: cat+s (plural), small+er, great+ly, walk+ed
• prefix: un+seen, dis+entangled, re+phrase
• infix: op+ge+kropt (Dutch)
• circumfix: ge+vraag+d (Dutch)
• Creates lots of possibilities:
• nerd
• nerdify (14400 hits)
• nerdifier (2050 hits)
• nerdification (8860 hits)
• nerdificationism (117 hits)
• Difficulties:
• Letters of the lemma may be changed, doubled, or removed
• walk+ed, frame+d, transmit+ted
• tiny, tinier
• irregular forms
Word Normalization: Irregular forms
• The most frequent words tend to have irregular forms
• to be: am, are, is, been, was, ...
• to eat: eat, ate, eaten, ...
• to go: go, goes, went, ...
Word Normalization: Cross-linguistic variation
• Differences in the definite article (e.g. the):
• http://wals.info/feature/37A?tg_format=map&v1=c00d&v2=c99f&v3=cd00&v4=dfff&v5=cfff
• Past tense:
• http://wals.info/feature/66A?tg_format=map&v1=cff0&v2=cf60&v3=cd00&v4=cfff
• Case affixes:
• http://wals.info/feature/51A?s=20&z4=3000&z3=2999&z8=2998&z5=2997&z7=2996&z2=2995&z6=2994&z9=2993&z1=2992&tg_format=map&v1=c00d&v2=cd00&v3=c000&v4=c
Word Normalization: what is it?
• Information retrieval
• U.S.A -> USA
• window -> window, windows
• windows -> Windows, windows, window
• Windows -> Windows
• reducing all letters to lower case
• exceptions: General Motors, New York, ...
• can be important for sentiment analysis: US vs us
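A minimal sketch of such a normalization step is given below. It assumes a simple policy that the slide only hints at: periods are removed from acronyms, everything else is lowercased, and a small exception list keeps its case. The exception list, the regular expression, and the name `normalize` are all illustrative assumptions.

```python
import re

# Matches dotted acronyms such as U.S.A or U.S.A. (single letters separated by periods).
ACRONYM = re.compile(r"^(?:[A-Za-z]\.)+[A-Za-z]?\.?$")

# Hypothetical exception list: terms whose capitalization carries meaning.
KEEP_CASE = frozenset({"US", "General Motors", "New York"})

def normalize(term, keep_case=KEEP_CASE):
    """Map a term to its normalized form for indexing or matching."""
    if ACRONYM.match(term):
        return term.replace(".", "")   # U.S.A -> USA, case kept
    if term in keep_case:
        return term                    # US stays distinct from us
    return term.lower()                # Windows -> windows

for term in ["U.S.A", "Windows", "US", "us", "New York"]:
    print(term, "->", normalize(term))
```

Whether Windows should be folded to windows depends on the application; for retrieval it usually helps recall, while for tasks that care about names it can hurt.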
Lemmatization
• Reduce inflections or variant forms to base form
• am, are, is -> be
• car, cars, car’s, cars’ -> car
• Finding the dictionary headword form
• Important in machine translation
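In its simplest form, lemmatization is a dictionary lookup from wordform to headword. The sketch below uses a tiny hand-made table covering only the examples above; the table, the fallback behaviour, and the name `lemmatize` are assumptions for illustration, and a real lemmatizer would need a full lexicon and usually the part of speech as well.

```python
# Hypothetical mini-lexicon: wordform -> lemma.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
}

def lemmatize(wordform):
    """Look up the lemma; unknown wordforms are returned lowercased and unchanged."""
    key = wordform.lower().replace("’", "'")   # treat curly and straight apostrophes alike
    return LEMMAS.get(key, key)

for form in ["am", "are", "is", "cars", "car’s", "walk"]:
    print(form, "->", lemmatize(form))
```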
Stemming
• Reduce terms to their stems in information retrieval
• Chops affixes off word forms
• A language-dependent process
Porter’s algorithm: English stemmer
• Simple replacement algorithm, based on ordered replacement rules
Step 1a
  sses → ss          caresses → caress
  ies → i            ponies → poni
  ss → ss            caress → caress
  s → ø              cats → cat
Step 1b
  (*v*)ing → ø       walking → walk
                     sing → sing
  (*v*)ed → ø        plastered → plaster
  …
Step 2 (for long stems)
  ational → ate      relational → relate
  izer → ize         digitizer → digitize
  ator → ate         operator → operate
  …
Step 3 (for longer stems)
  al → ø             revival → reviv
  able → ø           adjustable → adjust
  ate → ø            activate → activ
  …
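The ordered-rule idea can be illustrated with the Step 1a and Step 1b rules shown on the slide. The sketch below is deliberately simplified and is not the full Porter stemmer: it only covers the listed rules, treats only a, e, i, o, u as vowels, and omits the clean-up rules that the real Step 1b applies after removing a suffix. The function names are hypothetical.

```python
import re

VOWEL = re.compile(r"[aeiou]")

def step_1a(word):
    """Apply the first matching plural rule, in the order given on the slide."""
    if word.endswith("sses"):
        return word[:-2]          # sses -> ss
    if word.endswith("ies"):
        return word[:-2]          # ies -> i
    if word.endswith("ss"):
        return word               # ss -> ss (blocks the next rule)
    if word.endswith("s"):
        return word[:-1]          # s -> (nothing)
    return word

def step_1b(word):
    """Strip -ing or -ed, but only if the remaining stem contains a vowel: (*v*)."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem if VOWEL.search(stem) else word
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
for w in ["walking", "sing", "plastered"]:
    print(w, "->", step_1b(w))
```

The order of the rules matters: because ss → ss is tried before s → ø, caress is left alone instead of being cut to cares.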
• M.F. Porter, 1980, An algorithm for suffix stripping, Program, 14(3) pp 130−137.
• You can download versions for many programming languages at: http://tartarus.org/martin/PorterStemmer/
• Demo!
Stemming: Language dependent
• Turkish:
• Uygarlastiramadiklarimizdanmissinizcasina
• ‘(behaving) as if you are among those whom we could not civilize’
• Uygar ‘civilized’ + las ‘become’ + tir ‘cause’ + ama ‘not able’ + dik ‘past’ + lar ‘plural’ + imiz ‘p1pl’ + dan ‘abl’ + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
Words and Morphology: Sentence segmentation
Sentence segmentation
• Segmenting out sentences from running text.
• !, ? are relatively unambiguous
• Period “.” is ambiguous
• Sentence boundary
• Abbreviations like Inc. or Dr.
• Numbers like .02 or 3.14
• How to solve the ambiguous “.”?
• Build a binary classifier that decides: EndOfSentence/NotEndOfSentence.
• possibilities: hand-written rules, regular expressions, or machine learning
• machine learning: SVM, neural network, Memory-based, decision tree, ...
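The hand-written-rules option can be sketched as follows. This is one possible rule set, not a recommended one: a mark ends a sentence only if it is followed by whitespace or the end of the text, and a period does not end a sentence if it belongs to a known abbreviation. The abbreviation list and the name `split_sentences` are assumptions made for illustration.

```python
import re

# Hypothetical abbreviation list; a real system would need a much longer one.
ABBREVIATIONS = {"inc.", "dr.", "mr.", "mrs."}

def split_sentences(text):
    """Split text at '.', '!' and '?' using two hand-written rules."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]", text):
        end = match.end()
        # Rule 1: a mark inside a token (3.14, .02) is not a boundary.
        if end < len(text) and not text[end].isspace():
            continue
        # Rule 2: a period that closes a known abbreviation is not a boundary.
        last_token = text[start:end].split()[-1].lower()
        if match.group() == "." and last_token in ABBREVIATIONS:
            continue
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = "Dr. Smith paid 3.14 dollars to Acme Inc. yesterday. Was it worth .02 cents? Yes!"
for sentence in split_sentences(text):
    print(sentence)
```

The rules fail when an abbreviation happens to end a sentence, which is one reason to move from fixed rules to a trained classifier.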
Sentence segmentation: decision trees
• More sophisticated features:
• Case of word with “.”: Upper, Lower, Cap, Number
• Case of word after “.”: Upper, Lower, Cap, Number
• Numeric features:
• Length of word with “.”
• Probability(word with “.” occurs at end-of-s)
• Probability(word after “.” occurs at beginning-of-s)
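The case and length features listed above could be computed along these lines before being handed to a decision tree or another classifier. This is a sketch with assumed details: the feature names are invented, the length excludes the trailing period, and the two probability features are left out because they require counts from a sentence-segmented corpus.

```python
def case_class(word):
    """Classify a word as Number, Upper, Cap or Lower, ignoring trailing periods."""
    core = word.rstrip(".")
    if core.replace(".", "").replace(",", "").isdigit():
        return "Number"
    if core.isupper():
        return "Upper"
    if core[:1].isupper():
        return "Cap"
    return "Lower"

def period_features(word_with_period, next_word):
    """Build the feature dictionary for one candidate sentence boundary."""
    return {
        "case_with_period": case_class(word_with_period),
        "case_after": case_class(next_word),
        "length": len(word_with_period.rstrip(".")),  # assumption: period not counted
    }

print(period_features("Dr.", "Smith"))
print(period_features("yesterday.", "Was"))
```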
Conclusion
• Tokenization, lemmatization, stemming and sentence segmentation are almost inevitable when working with raw text.
• They are not rocket science, but they are not trivial either, and they can be very tedious.
• They are often highly language dependent.
• Use machine learning when necessary, but regular expressions or iterative rules often suffice.
Assignments by next week!
• Install Python and NLTK, the Natural Language Toolkit
• Follow the instructions on: http://nltk.org/
• For Python, optionally use: http://epd-free.enthought.com/?Download=Download+EPD+Free+7.3-2
• Read the first chapter of the NLTK book and do the “Your turn” items.
• http://nltk.org/book/ch01.html
• Collect all your Python commands and the output for the “Your turn” questions in a document and send it to [email protected]
• Deadline: Friday 22 February at 12.00h midday.