Natural Language Processing

Introduction

Pieter Wellens, 2012-2013

These slides are partially based on the course materials of the ANLP course given at the School of Informatics, Edinburgh, and on the online Stanford NLP course on Coursera.

Saturday 16 February 13

Welcome

• Lecturer

• Pieter Wellens

• pieter@ai.vub.ac.be

• room: 10G725 (make appointment by e-mail or after class)

• Room: 10G725 (AI-Lab meeting room)

• Time: each Friday, 16-18h

Welcome

• Examination:

• assignments: 50%

• traditional exam (written and oral): 50%

• Pointcarré: used for urgent communication

• let me know if you are not yet registered

• http://pointcarre.vub.ac.be/run.php?application=weblcms&go=course_viewer&course=9108

Welcome

• Course website:

• contains slides, background reading, assignments, etc.

• http://ai.vub.ac.be/courses/2012-2013/natural-language-processing

Welcome

• Textbooks:

• Daniel Jurafsky and James H. Martin, Speech and Language Processing, second edition (2008/2009), Pearson International Edition.

• Steven Bird, Ewan Klein, and Edward Loper, Natural Language Processing with Python, first edition 2009.

• http://nltk.org/book/

What is NLP: Applications

• Speech Recognition

• Machine Translation

• Information Retrieval

• Dialog Systems

• Question Answering

• Information Extraction

• Summarization

• Sentiment Analysis

• ...

Overview of this class

• Linguistic terminology

• Tokenization

• Normalization and stemming

• Sentence segmentation

Linguistic terminology

Levels of analysis, built up on the example sentence “This is a simple sentence”:

• WORDS: This is a simple sentence

• MORPHOLOGY: is = be, 3sg, present

• PART OF SPEECH: DT VBZ DT JJ NN

• SYNTAX: parse tree with the nodes S, NP, VP and NP over the tagged words

• SEMANTICS: SENTENCE1 = string of words satisfying the grammatical rules of a language; SIMPLE1 = having few parts

• DISCOURSE: the following sentence, “But it is an instructive one.”, stands in a CONTRAST relation to the first

(Figures: Philipp Koehn, ANLP Lecture 1, 19 September 2012)

Linguistic terminology today

Words and Morphology: Tokenization

(Figure: the WORDS and MORPHOLOGY levels of “This is a simple sentence”; Philipp Koehn, ANLP Lecture 1, 19 September 2012)

Word tokenization

• Demo time!

shakes-large.txt | less

tr -sc 'A-Za-z' '\n' < shakes-large.txt | less

tr -sc 'A-Za-z' '\n' < shakes-large.txt | sort | less

tr -sc 'A-Za-z' '\n' < shakes-large.txt | sort | uniq -c | less

tr -sc 'A-Za-z' '\n' < shakes-large.txt | sort | uniq -c | sort -n -r | less

tr 'A-Z' 'a-z' < shakes-large.txt | tr -sc 'a-z' '\n' | sort | uniq -c | sort -n -r | less
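
The last pipeline of the demo (lowercase, split on non-letters, count, sort by frequency) can also be sketched in Python. This is a minimal sketch and not the demo shown in class: the function name `word_frequencies` and the sample text are assumptions, and only the standard library is used.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, keep runs of a-z as tokens, and count them."""
    # Rough equivalent of: tr 'A-Z' 'a-z' | tr -sc 'a-z' '\n'
    tokens = re.findall(r"[a-z]+", text.lower())
    # Rough equivalent of: sort | uniq -c | sort -n -r
    return Counter(tokens).most_common()

# Hypothetical sample text; in the demo the input is shakes-large.txt.
sample = "To be, or not to be: that is the question."
for word, count in word_frequencies(sample)[:3]:
    print(count, word)
```

To run it on a file, read the file into a string first and pass that string to `word_frequencies`.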

Word tokenization

• Definition: strings of letters separated by spaces

• But:

• punctuation: commas, periods, etc. (tokenization)

• hyphens: top-down

• clitics: Luc’s

• compounds: website, postbodetasbeschermhoes

• no spaces (Chinese and Japanese): 伦敦每日快报指出,两台记载黛安娜王妃一九九七年巴黎死亡车祸调查资料的手提电脑,被从前大都会警察总长的办公室里偷走.
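
Some of the problems listed above can be handled with a single regular expression. The sketch below is one possible design, not the tokenizer used in the course: the pattern `TOKEN` is an assumption that keeps hyphenated words and clitics attached to the word and turns every other non-space character into its own token. It does nothing useful for languages written without spaces.

```python
import re

# Assumed pattern: letters, optionally continued by "-", "'" or "’" plus more
# letters; any other non-whitespace character is a token of its own.
TOKEN = re.compile(r"[A-Za-z]+(?:[-'’][A-Za-z]+)*|[^\sA-Za-z]")

def tokenize(text):
    """Return the list of tokens matched by TOKEN, in order."""
    return TOKEN.findall(text)

print(tokenize("Luc's top-down website, finally."))
```

Whether “Luc's” should stay one token or be split into “Luc” and “'s” depends on the application; the pattern above makes the first choice.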

Word tokenization: counting

• Most frequent words in the English Europarl corpus (www.statmt.org/europarl/)

• http://www.edict.biz/textanalyser/wordlists.htm

Any word:

Frequency in text    Token
1,929,379            the
1,297,736            ,
956,902              .
901,174              of
841,661              to
684,869              and
582,592              in
452,491              that
424,895              is
424,552              a

Nouns:

Frequency in text    Content word
129,851              European
110,072              Mr
98,073               commission
71,111               president
67,518               parliament
64,620               union
58,506               report
57,490               council
54,079               states
49,965               member

(Table: Philipp Koehn, ANLP Lecture 1, 19 September 2012)

Word tokenization: counting

• But: many words occur only once, and even more words never appear in the corpus

• Zipf’s law: f × r = k

• f = frequency of a word

• r = rank of a word (if sorted by frequency)

• k = a constant
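
Zipf's law can be checked by hand on the Europarl counts from the frequency table. The sketch below multiplies each frequency by its rank; the function name `zipf_products` is an assumption. The law is only approximate, so the products should be of the same order of magnitude rather than identical.

```python
# Frequencies copied from the Europarl table of most frequent tokens, in rank order.
top_tokens = [
    ("the", 1929379), (",", 1297736), (".", 956902), ("of", 901174),
    ("to", 841661), ("and", 684869), ("in", 582592), ("that", 452491),
    ("is", 424895), ("a", 424552),
]

def zipf_products(ranked):
    """Return (rank, token, f * r) for a list sorted by descending frequency."""
    return [(rank, token, freq * rank)
            for rank, (token, freq) in enumerate(ranked, start=1)]

for rank, token, product in zipf_products(top_tokens):
    print(rank, token, product)
```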

Word tokenization: counting

• How many different words? 10,000 sentences from the Europarl corpus

Language      Different words
English       16k
French        22k
Dutch         24k
Italian       25k
Portuguese    26k
Spanish       26k
Danish        29k
Swedish       30k
German        32k
Greek         33k
Finnish       55k

• Why the difference? Morphology.

(Table: Philipp Koehn, ANLP Lecture 2, 21 September 2012)

Word Tokenization

• Lemma: the canonical, dictionary form

• throw, threw, throws, throwing are forms of the same lexeme, with throw as lemma

• Stem: the part of a word that is common to all its inflected variants

• stem of delete and deletion is delet

• Wordform: the full inflected surface form

• cat and cats are different wordforms

Word Tokenization

• Fragments: I uh am teaching you some uh interest - interesting stuff (the fragment is “interest -”)

• Filled pauses: I uh am teaching you some uh interest - interesting stuff (the filled pauses are “uh”)

Word Tokenization: Types and tokens

• The shuttle bus dropped me at the wrong hotel in New York.

• Type: an element of the vocabulary (vocabulary == set of types)

• Token: an instance of a type in running text

• How many types and tokens?

• 12 tokens and 11 types (“the” occurs twice when case is ignored; punctuation is not counted)

Corpus                             Tokens = N     Types = |V|
Switchboard phone conversations    2.4 million    20 thousand
Shakespeare                        884,000        31 thousand
Google N-grams                     1 trillion     13 million
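
The count for the shuttle-bus sentence can be reproduced in a few lines. This is a minimal sketch under two assumptions that match the answer on the slide: case is ignored and punctuation is not counted. The function name `count_tokens_and_types` is hypothetical.

```python
import re

def count_tokens_and_types(text):
    """Return (N, |V|): number of tokens and number of distinct types."""
    tokens = re.findall(r"[a-z]+", text.lower())  # case-folded, letters only
    return len(tokens), len(set(tokens))

sentence = "The shuttle bus dropped me at the wrong hotel in New York."
n_tokens, n_types = count_tokens_and_types(sentence)
print(n_tokens, n_types)  # 12 11
```

Without the call to `lower()`, “The” and “the” would count as two types and the sentence would have 12 types.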

Linguistic terminology today

Words and Morphology: Normalization and Stemming

(Figure: the WORDS and MORPHOLOGY levels of “This is a simple sentence”; Philipp Koehn, ANLP Lecture 1, 19 September 2012)

Word Normalization: Morphemes

• Morphemes: smallest meaningful units that make up words

• Two types of morphemes

• stems: dog, fast, study

• affixes: +ed, un+, +s

• Four types of affixes

• suffix: cat+s (plural), small+er, great+ly, walk+ed

• prefix: un+seen, dis+entangled, re+phrase

• infix: op+ge+kropt (Dutch)

• circumfix: ge+vraag+d (Dutch)

Word Normalization: Morphemes

• Creates lots of possibilities:

• nerd

• nerdify (14400 hits)

• nerdifier (2050 hits)

• nerdification (8860 hits)

• nerdificationism (117 hits)

Word Normalization: Morphemes

• Difficulties:

• Letters of the lemma may be changed, doubled or removed

• walk+ed, frame+d, transmit+ted

• tiny, tinier

• irregular forms

Word Normalization: Irregular forms

• The most frequent words have irregular forms

• to be: am, are, is, been, was, ...

• to eat: eat, ate, eaten, ...

• to go: go, goes, went, ...

Word Normalization: Cross-linguistic variation

• Differences in the definite article (e.g. the):

• http://wals.info/feature/37A?tg_format=map&v1=c00d&v2=c99f&v3=cd00&v4=dfff&v5=cfff

• Past tense:

• http://wals.info/feature/66A?tg_format=map&v1=cff0&v2=cf60&v3=cd00&v4=cfff

• Case affixes:

• http://wals.info/feature/51A?s=20&z4=3000&z3=2999&z8=2998&z5=2997&z7=2996&z2=2995&z6=2994&z9=2993&z1=2992&tg_format=map&v1=c00d&v2=cd00&v3=c000&v4=c

Word Normalization: what is it?

• Information retrieval

• U.S.A -> USA

• window -> window, windows

• windows -> Windows, windows, window

• Windows -> Windows

• reducing all letters to lower case

• exceptions: General Motors, New York, ...

• can be important for sentiment analysis: US vs us
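
The normalization steps above can be sketched as a small function. This is an illustration under stated assumptions, not a complete normalizer: the exception set `KEEP_CASE` is hypothetical and far too small for real use, and the acronym pattern would also rewrite forms such as “e.g.”.

```python
import re

# Hypothetical exception list: forms whose capitalisation carries meaning.
KEEP_CASE = {"US", "Windows"}

def normalize(token):
    """Strip periods from acronyms, then case-fold unless the token is protected."""
    # Acronyms written with periods, such as U.S.A or U.S.A., become USA.
    if re.fullmatch(r"(?:[A-Za-z]\.){2,}[A-Za-z]?", token):
        return token.replace(".", "").upper()
    if token in KEEP_CASE:
        return token
    return token.lower()

print([normalize(t) for t in ["U.S.A", "Windows", "Window", "US", "us"]])
```

Keeping “US” apart from “us” is the kind of exception the last bullet refers to; a plain `lower()` would merge the two.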

Lemmatization

• Reduce inflections or variant forms to base form

• am, are, is -> be

• car, cars, car’s, cars’ -> car

• Finding the dictionary headword form

• Important in machine translation
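
In its simplest form, lemmatization is a dictionary lookup. The sketch below hard-codes only the examples from this slide; the table `LEMMAS` and the function `lemmatize` are assumptions for illustration, and a real lemmatizer needs a full lexicon plus rules for unseen words.

```python
# Tiny hand-written lemma dictionary, covering only the examples on the slide.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
}

def lemmatize(wordform):
    """Look the word form up in LEMMAS; unknown forms are returned unchanged."""
    key = wordform.lower().replace("’", "'")  # treat typographic apostrophes alike
    return LEMMAS.get(key, key)

print([lemmatize(w) for w in ["am", "are", "is", "car", "cars", "car’s", "cars’"]])
```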

Stemming

• Reduces terms to their stems, mainly for information retrieval

• Works by chopping affixes off word forms

• Is a language-dependent process

Porter’s algorithm: English stemmer

• Simple replacement algorithm, based on ordered replacement rules

Step 1a
    sses → ss          caresses → caress
    ies → i            ponies → poni
    ss → ss            caress → caress
    s → ø              cats → cat

Step 1b
    (*v*)ing → ø       walking → walk
                       sing → sing
    (*v*)ed → ø        plastered → plaster
    …

Step 2 (for long stems)
    ational → ate      relational → relate
    izer → ize         digitizer → digitize
    ator → ate         operator → operate
    …

Step 3 (for longer stems)
    al → ø             revival → reviv
    able → ø           adjustable → adjust
    ate → ø            activate → activ
    …
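
The ordered rules of Step 1a and Step 1b can be sketched as follows. This covers only the rules listed on the slide and is not the full Porter stemmer: the complete Step 1b has further rules, and the vowel test here is simplified to the letters a, e, i, o, u (the original algorithm also treats “y” as a vowel in some positions). The function names are assumptions.

```python
import re

def step_1a(word):
    """Apply the first matching Step 1a rule; rules are tried in order."""
    for suffix, replacement in (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

def step_1b(word):
    """Strip -ing or -ed, but only if the remaining stem contains a vowel: (*v*)."""
    for suffix in ("ing", "ed"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and re.search(r"[aeiou]", stem):
            return stem
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
for w in ["walking", "sing", "plastered"]:
    print(w, "->", step_1b(w))
```

The order of the rules matters: testing “s” before “ss” would turn “caress” into “cares”.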

Porter’s algorithm: English stemmer

• M.F. Porter, 1980, An algorithm for suffix stripping, Program, 14(3), pp. 130-137.

• You can download versions for many programming languages at: http://tartarus.org/martin/PorterStemmer/

• Demo!

Stemming: Language dependent

• Turkish:

• Uygarlastiramadiklarimizdanmissinizcasina

• ‘(behaving) as if you are among those whom we could not civilize’

• Uygar ‘civilized’ + las ‘become’ + tir ‘cause’ + ama ‘not able’ + dik ‘past’ + lar ‘plural’ + imiz ‘p1pl’ + dan ‘abl’ + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
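
To make the segmentation concrete, the short sketch below glues the morphemes from the slide back together. It performs no morphological analysis; it only shows that a single stem carries ten suffixes here, which is why a short English-style suffix list does not carry over to Turkish. The list name `MORPHEMES` is an assumption.

```python
# Morpheme/gloss pairs copied from the slide (ASCII spelling, as on the slide).
MORPHEMES = [
    ("Uygar", "civilized"), ("las", "become"), ("tir", "cause"),
    ("ama", "not able"), ("dik", "past"), ("lar", "plural"),
    ("imiz", "p1pl"), ("dan", "abl"), ("mis", "past"),
    ("siniz", "2pl"), ("casina", "as if"),
]

# Concatenating the forms yields the full word.
word = "".join(form for form, _ in MORPHEMES)
print(word)
print(len(MORPHEMES) - 1, "suffixes on one stem")
```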

Linguistic terminology today

Words and Morphology: Sentence segmentation

(Figure: the WORDS and MORPHOLOGY levels of “This is a simple sentence”; Philipp Koehn, ANLP Lecture 1, 19 September 2012)

Sentence segmentation

• Segmenting out sentences from running text.

• !, ? are relatively unambiguous

• Period “.” is ambiguous

• Sentence boundary

• Abbreviations like Inc. or Dr.

• Numbers like .02 or 3.14

Sentence segmentation

• How to solve the ambiguous “.”?

• Build a binary classifier that decides: EndOfSentence/NotEndOfSentence.

• possibilities: hand-written rules, regular expressions, or machine learning

• machine learning: SVM, neural network, memory-based learning, decision tree, ...
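
The hand-written-rules option can be sketched as a tiny binary classifier. This is a minimal sketch under stated assumptions: the abbreviation set `ABBREVIATIONS` is hypothetical and incomplete, tokens are split on whitespace (so a number such as 3.14 never ends in a period), and an abbreviation that really does end a sentence is misclassified.

```python
# Hypothetical abbreviation list; a real system needs a much longer one.
ABBREVIATIONS = {"inc.", "dr.", "mr.", "mrs.", "etc."}

def is_end_of_sentence(token, next_token):
    """Decide EndOfSentence (True) or NotEndOfSentence (False) for one token."""
    if token.endswith(("!", "?")):
        return True
    if not token.endswith("."):
        return False
    if token.lower() in ABBREVIATIONS:
        return False
    # A period ends the sentence if nothing follows or a capitalised word follows.
    return next_token is None or next_token[:1].isupper()

def split_sentences(text):
    """Group whitespace-separated tokens into sentences using the rule above."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if is_end_of_sentence(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith paid 3.14 euros to Acme Inc. yesterday. He left!"))
```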

Sentence segmentation: decision trees

• More sophisticated features:

• Case of word with “.”: Upper, Lower, Cap, Number

• Case of word after “.”: Upper, Lower, Cap, Number

• Numeric features:

• Length of word with “.”

• Probability(word with “.” occurs at end-of-s)

• Probability(word after “.” occurs at beginning-of-s)
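
The case and length features listed above could be extracted along the following lines before being handed to a decision-tree learner. This is a sketch, not the feature extractor of any particular system: the function names, the extra class "Other", and the choice to measure length without the trailing period are assumptions. The two probability features are left out because they have to be estimated from a labelled corpus.

```python
def case_class(word):
    """Map a word to Upper, Lower, Cap or Number ("Other" if none applies)."""
    core = word.strip(".")
    if core.replace(".", "").replace(",", "").isdigit():
        return "Number"
    if core.isupper():
        return "Upper"
    if core.islower():
        return "Lower"
    if core[:1].isupper():
        return "Cap"
    return "Other"

def period_features(word, next_word):
    """Features for a candidate boundary: a word ending in '.' and its successor."""
    return {
        "case_word": case_class(word),
        "case_next": case_class(next_word),
        "length": len(word.rstrip(".")),  # length without the trailing period
    }

print(period_features("Dr.", "Smith"))
print(period_features("3.14.", "the"))
```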

Linguistic terminology

(Figure: the levels of analysis for “This is a simple sentence”: WORDS, MORPHOLOGY, PART OF SPEECH, SYNTAX, SEMANTICS, DISCOURSE; Philipp Koehn, ANLP Lecture 1, 19 September 2012)

Conclusion

• Tokenization, lemmatization, stemming and sentence segmentation are almost inevitable when working with raw text.

• They are not rocket science, but they are not trivial either, and they can be very tedious.

• They are often highly language-dependent.

• Use machine learning when necessary, but regular expressions or iterative rules often suffice.

Assignments by next week!

• Install Python and NLTK, the Natural Language Toolkit

• Follow instructions on: http://nltk.org/

• For Python optionally use: http://epd-free.enthought.com/?Download=Download+EPD+Free+7.3-2

• Read the first chapter of the NLTK book and do the “Your turn” items.

• http://nltk.org/book/ch01.html

• Collect all your Python commands and the output for the “Your turn” questions in a document and send it to pieter@ai.vub.ac.be

• Deadline: Friday 22 February at 12.00h midday.