
Natural Language Processing

Introduction

Pieter Wellens

2012-2013

These slides are partially based on the course materials from the ANLP course given at the School of Informatics, Edinburgh, and the online Coursera Stanford NLP course.

Saturday 16 February 13


Welcome

• Lecturer

• Pieter Wellens

[email protected]

• room: 10G725 (make appointment by e-mail or after class)

• Room: 10G725 (AI-Lab meeting room)

• Time: each Friday, 16-18h


Welcome

• Examination:

• assignments: 50%

• traditional exam (written and oral): 50%

• Pointcarré: used for urgent communication

• let me know if you are not yet registered

• http://pointcarre.vub.ac.be/run.php?application=weblcms&go=course_viewer&course=9108


Welcome

• Course website:

• contains slides, background reading, assignments, etc

• http://ai.vub.ac.be/courses/2012-2013/natural-language-processing


Welcome

• Textbook:

• Daniel Jurafsky and James H. Martin, Speech and Language Processing, second edition (2008/2009), Pearson International Edition.

• Steven Bird, Ewan Klein, and Edward Loper, Natural Language Processing with Python, first edition 2009.

• http://nltk.org/book/


What is NLP: Applications

• Speech Recognition

• Machine Translation

• Information Retrieval

• Dialog Systems

• Question Answering

• Information Extraction

• Summarization

• Sentiment Analysis

• ...


Overview of this class

• Linguistic terminology

• Tokenization

• Normalization and stemming

• Sentence segmentation


Linguistic terminology

[Figure, after Philipp Koehn, ANLP Lecture 1, 19 September 2012: the example "This is a simple sentence", annotated one level at a time.]

• Words: This is a simple sentence

• Morphology: is = be + 3sg + present

• Part of speech: DT VBZ DT JJ NN

• Syntax: [S [NP This] [VP is [NP a simple sentence]]]

• Semantics: SENTENCE1 = string of words satisfying the grammatical rules of a language; SIMPLE1 = having few parts

• Discourse: the following sentence, "But it is an instructive one.", stands in a CONTRAST relation to the first

Linguistic terminology today

[Figure, after Philipp Koehn, ANLP Lecture 1, 19 September 2012: the WORDS and MORPHOLOGY levels of "This is a simple sentence".]

Words and Morphology: Tokenization


Word tokenization

• Demo time!

cat shakes-large.txt | less

tr -sc 'A-Za-z' '\n' < shakes-large.txt | less

tr -sc 'A-Za-z' '\n' < shakes-large.txt | sort | less

tr -sc 'A-Za-z' '\n' < shakes-large.txt | sort | uniq -c | less

tr -sc 'A-Za-z' '\n' < shakes-large.txt | sort | uniq -c | sort -n -r | less

tr 'A-Z' 'a-z' < shakes-large.txt | tr -sc 'a-z' '\n' | sort | uniq -c | sort -n -r | less
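The last pipeline of the demo can be approximated in Python, the language used for the assignments. This is only a sketch: the function name word_frequencies is made up for illustration, and str.lower() also folds non-ASCII letters, which tr 'A-Z' 'a-z' does not.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lower-cased words, most frequent first."""
    # tr 'A-Z' 'a-z': fold everything to lower case
    lowered = text.lower()
    # tr -sc 'a-z' '\n': every maximal run of letters becomes one word
    words = re.findall(r"[a-z]+", lowered)
    # sort | uniq -c | sort -n -r: count and order by descending count
    return Counter(words).most_common()


sample = "To be, or not to be: that is the question."
for word, count in word_frequencies(sample)[:3]:
    print(count, word)
```

To run it on the demo corpus, read the file first, for example with open("shakes-large.txt").read(), assuming the file is in the working directory.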


Word tokenization

• Definition: strings of letters separated by spaces

• But:

• punctuation: commas, periods, etc. (tokenization)

• hyphens: top-down

• clitics: Luc’s

• compounds: website, postbodetasbeschermhoes

• no spaces (Chinese and Japanese): 伦敦每日快报指出,两台记载黛安娜王妃一九九七年巴黎死亡车祸调查资料的手提电脑,被从前大都会警察总长的办公室里偷走.
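Some of the problem cases on this slide can be handled with a single regular expression. The sketch below is one possible set of choices, not a standard tokenizer: it keeps hyphenated words together, splits clitics off, and treats every other symbol as its own token. The names TOKEN_PATTERN and tokenize are illustrative. It does nothing useful for compounds or for text written without spaces.

```python
import re

TOKEN_PATTERN = re.compile(r"""
    [A-Za-z]+(?:-[A-Za-z]+)+   # hyphenated words kept whole: top-down
  | [A-Za-z]+                  # plain words
  | ['’][A-Za-z]+              # clitics split off from the word before them
  | \d+(?:[.,]\d+)*            # numbers such as 3.14 or 1,929,379
  | [^\sA-Za-z\d]              # any other single symbol, e.g. punctuation
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Luc's top-down website costs 3.14, apparently."))
```

Whether Luc's should become one token or two depends on the application; the pattern makes that decision explicit in one place.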


Word tokenization: counting

• Most frequent words in the English Europarl corpus (www.statmt.org/europarl/)

• http://www.edict.biz/textanalyser/wordlists.htm

Any word:

  Frequency in text   Token
  1,929,379           the
  1,297,736           ,
  956,902             .
  901,174             of
  841,661             to
  684,869             and
  582,592             in
  452,491             that
  424,895             is
  424,552             a

Nouns:

  Frequency in text   Content word
  129,851             European
  110,072             Mr
  98,073              commission
  71,111              president
  67,518              parliament
  64,620              union
  58,506              report
  57,490              council
  54,079              states
  49,965              member

Source: Philipp Koehn, ANLP Lecture 1, 19 September 2012


Word tokenization: counting

• But: many words occur only once, and even more words never appear in the corpus

• Zipf’s law: f × r = k

• f = frequency of a word

• r = rank of a word (if sorted by frequency)

• k = a constant
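Zipf's law can be checked against the Europarl counts listed earlier. The sketch below multiplies each frequency by its rank; the helper name zipf_products is made up for illustration. On these ten tokens the product is not exactly constant, so the law is best read as a rough empirical regularity.

```python
# Top ten tokens of the English Europarl corpus, as listed in the word-count table.
EUROPARL_TOP10 = [
    ("the", 1929379), (",", 1297736), (".", 956902), ("of", 901174),
    ("to", 841661), ("and", 684869), ("in", 582592), ("that", 452491),
    ("is", 424895), ("a", 424552),
]


def zipf_products(ranked_counts):
    """Return (rank, token, frequency * rank) for counts sorted by descending frequency."""
    return [(rank, token, freq * rank)
            for rank, (token, freq) in enumerate(ranked_counts, start=1)]


for rank, token, product in zipf_products(EUROPARL_TOP10):
    print(rank, token, product)
```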


Word tokenization: counting

• How many different words? 10,000 sentences from the Europarl corpus

  Language     Different words
  English      16k
  French       22k
  Dutch        24k
  Italian      25k
  Portuguese   26k
  Spanish      26k
  Danish       29k
  Swedish      30k
  German       32k
  Greek        33k
  Finnish      55k

• Why the difference? Morphology.

Source: Philipp Koehn, ANLP Lecture 2, 21 September 2012


Word Tokenization

• Lemma: the canonical, dictionary form

• throw, threw, throws, throwing are forms of the same lexeme, with throw as lemma

• Stem: the part of a word that is common to all its inflected variants

• stem of delete and deletion is delet

• Wordform: the full inflected surface form

• cat and cats are different wordforms
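The difference between stem and lemma can be made concrete with a deliberately crude approximation: take the stem to be the longest prefix shared by all variants. This is not how real stemmers work; the function name common_prefix_stem is hypothetical and only illustrates the definition above.

```python
import os


def common_prefix_stem(wordforms):
    """Longest prefix shared by all wordforms (a crude stand-in for a stem)."""
    # os.path.commonprefix compares plain strings character by character.
    return os.path.commonprefix(list(wordforms))


print(common_prefix_stem(["delete", "deletion"]))
print(common_prefix_stem(["throw", "threw", "throws", "throwing"]))
```

For delete and deletion the shared part is delet, as on the slide. For the forms of throw it is only thr, which shows why the lemma (throw) has to come from a dictionary rather than from the surface forms.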


Word Tokenization

• Fragments (broken-off words, here "interest -"): I uh am teaching you some uh interest - interesting stuff

• Filled pauses (here "uh"): I uh am teaching you some uh interest - interesting stuff


Word Tokenization: Types and tokens

• The shuttle bus dropped me at the wrong hotel in New York.

• Type: an element of the vocabulary (vocabulary == set of types)

• Token: an instance of a type in running text

• How many types and tokens?

• 12 tokens and 11 types (The and the count as one type; the final period is not counted)

  Corpus                            Tokens = N     Types = |V|
  Switchboard phone conversations   2.4 million    20 thousand
  Shakespeare                       884,000        31 thousand
  Google N-grams                    1 trillion     13 million
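The count for the example sentence can be reproduced with a few lines of Python. The sketch assumes that tokens are runs of ASCII letters and that case is folded before counting types; the function name count_types_and_tokens is illustrative.

```python
import re


def count_types_and_tokens(text, fold_case=True):
    """Return (number of tokens, number of types) for text."""
    tokens = re.findall(r"[A-Za-z]+", text)   # punctuation is not counted
    if fold_case:
        tokens = [t.lower() for t in tokens]  # "The" and "the" become one type
    return len(tokens), len(set(tokens))


sentence = "The shuttle bus dropped me at the wrong hotel in New York."
print(count_types_and_tokens(sentence))                   # (12, 11)
print(count_types_and_tokens(sentence, fold_case=False))  # (12, 12)
```

Without case folding the sentence has 12 types, so the answer depends on the normalization chosen.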


Linguistic terminology today

[Figure, after Philipp Koehn, ANLP Lecture 1, 19 September 2012: the WORDS and MORPHOLOGY levels of "This is a simple sentence".]

Words and Morphology: Normalization and Stemming


Word Normalization: Morphemes

• Morphemes: smallest meaningful units that make up words

• Two types of morphemes

• stems: dog, fast, study

• affixes: +ed, un+, +s

• Four types of affixes

• suffix: cat+s (plural), small+er, great+ly, walk+ed

• prefix: un+seen, dis+entangled, re+phrase

• infix: op+ge+kropt (Dutch)

• circumfix: ge+vraag+d (Dutch)


Word Normalization: Morphemes

• Combining morphemes creates lots of possibilities:

• nerd

• nerdify (14400 hits)

• nerdifier (2050 hits)

• nerdification (8860 hits)

• nerdificationism (117 hits)


Word Normalization: Morphemes

• Difficulties:

• Consonants of the lemma may be changed or removed

• walk+ed, frame+d, transmit+ted

• tiny, tinier

• irregular forms


Word Normalization: Irregular forms

• The most frequent words have irregular forms

• to be: am, are, is, been, was, ...

• to eat: eat, ate, eaten, ...

• to go: go, goes, went, ...


Word Normalization: Cross-linguistic variation

• Differences in the definite article (e.g. the):

• http://wals.info/feature/37A?tg_format=map&v1=c00d&v2=c99f&v3=cd00&v4=dfff&v5=cfff

• Past tense:

• http://wals.info/feature/66A?tg_format=map&v1=cff0&v2=cf60&v3=cd00&v4=cfff

• Case affixes:

• http://wals.info/feature/51A?s=20&z4=3000&z3=2999&z8=2998&z5=2997&z7=2996&z2=2995&z6=2994&z9=2993&z1=2992&tg_format=map&v1=c00d&v2=cd00&v3=c000&v4=c


Word Normalization: what is it?

• Information retrieval

• U.S.A -> USA

• window -> window, windows

• windows -> Windows, windows, window

• Windows -> Windows

• reducing all letters to lower case

• exceptions: General Motors, New York, ...

• can be important for sentiment analysis: US vs us
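A minimal normalizer along these lines might look as follows. It is a sketch under stated assumptions: the exception list KEEP_CASE is hypothetical and far from complete, and keeping all-upper-case tokens unchanged is one possible way to preserve the US vs us distinction.

```python
import re

# Hypothetical exception list: tokens whose capitalisation carries meaning.
KEEP_CASE = {"Windows", "General", "Motors", "New", "York"}

# Letters separated by periods, e.g. U.S.A or U.S.A.
ACRONYM = re.compile(r"(?:[A-Za-z]\.)+[A-Za-z]?")


def normalize(token):
    """Strip periods from acronyms, then lower-case unless the case is meaningful."""
    if ACRONYM.fullmatch(token):
        token = token.replace(".", "")     # U.S.A -> USA
    if token.isupper() or token in KEEP_CASE:
        return token                       # US stays distinct from us
    return token.lower()


print([normalize(t) for t in ["U.S.A", "The", "Windows", "windows", "US", "us"]])
```

A side effect of the upper-case rule is that single capital letters such as A or I are also left unchanged; a real system would need a more careful policy.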


Lemmatization

• Reduce inflections or variant forms to base form

• am, are, is -> be

• car, cars, car’s, cars’ -> car

• Finding the dictionary headword form

• Important in machine translation
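In its simplest form, lemmatization is a dictionary lookup. The toy table below only contains the examples from this slide (written with straight apostrophes); a real lemmatizer needs a full lexicon and usually the part of speech as well. The names LEMMAS and lemmatize are illustrative.

```python
# Toy lemma table built from the slide's examples; an assumption, not a real lexicon.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
}


def lemmatize(wordform):
    """Return the dictionary headword, or the wordform itself if it is unknown."""
    return LEMMAS.get(wordform.lower(), wordform)


print([lemmatize(w) for w in ["am", "are", "is", "car", "cars", "car's", "cars'"]])
```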


Stemming

• Reduce terms to their stems in information retrieval

• It chops affixes off word forms

• It is a language-dependent process


Porter’s algorithm: English stemmer

• Simple replacement algorithm, based on ordered replacement rules

Step 1a
  sses → ss       caresses → caress
  ies → i         ponies → poni
  ss → ss         caress → caress
  s → ø           cats → cat

Step 1b
  (*v*)ing → ø    walking → walk
                  sing → sing
  (*v*)ed → ø     plastered → plaster
  …

Step 2 (for long stems)
  ational → ate   relational → relate
  izer → ize      digitizer → digitize
  ator → ate      operator → operate
  …

Step 3 (for longer stems)
  al → ø          revival → reviv
  able → ø        adjustable → adjust
  ate → ø         activate → activ
  …
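The ordered-rule idea can be sketched for steps 1a and 1b only. This is a simplified illustration, not the full Porter stemmer: it covers just the rules shown on the slide, treats only a, e, i, o, u as vowels, and skips the clean-up rules that the real step 1b applies after removing a suffix. The function names are made up.

```python
import re

VOWEL = re.compile(r"[aeiou]")  # simplification: the real algorithm also treats some y as vowels


def step_1a(word):
    """Apply the first matching rule, in the order given on the slide."""
    if word.endswith("sses"):
        return word[:-2]        # sses -> ss
    if word.endswith("ies"):
        return word[:-2]        # ies -> i
    if word.endswith("ss"):
        return word             # ss -> ss (blocks the next rule)
    if word.endswith("s"):
        return word[:-1]        # s -> nothing
    return word


def step_1b(word):
    """Strip -ing or -ed, but only if the remaining stem contains a vowel: (*v*)."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem if VOWEL.search(stem) else word
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
for w in ["walking", "sing", "plastered"]:
    print(w, "->", step_1b(w))
```

The rule order matters: because ss is tested before s, caress is left alone instead of losing its final s.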


Porter’s algorithm: English stemmer

• M.F. Porter, 1980, An algorithm for suffix stripping, Program, 14(3), pp. 130−137.

• You can download versions for many programming languages at: http://tartarus.org/martin/PorterStemmer/

• Demo!


Stemming: Language dependent

• Turkish:

• Uygarlastiramadiklarimizdanmissinizcasina

• ‘(behaving) as if you are among those whom we could not civilize’

• Uygar ‘civilized’ + las ‘become’ + tir ‘cause’ + ama ‘not able’ + dik ‘past’ + lar ‘plural’ + imiz ‘p1pl’ + dan ‘abl’ + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’


Linguistic terminology today

[Figure, after Philipp Koehn, ANLP Lecture 1, 19 September 2012: the WORDS and MORPHOLOGY levels of "This is a simple sentence".]

Words and Morphology: Sentence segmentation


Sentence segmentation

• Segmenting out sentences from running text.

• !, ? are relatively unambiguous

• Period “.” is ambiguous

• Sentence boundary

• Abbreviations like Inc. or Dr.

• Numbers like .02 or 3.14


Sentence segmentation

• How to solve the ambiguous “.”?

• Build a binary classifier that decides: EndOfSentence/NotEndOfSentence.

• possibilities: hand-written rules, regular expressions, or machine learning

• machine learning: SVM, neural network, memory-based, decision tree, ...
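The hand-written-rules option can be sketched as a small binary classifier over whitespace-separated tokens. The abbreviation list is an assumption and deliberately incomplete, and the function names are illustrative. Note the built-in weakness: an abbreviation that really does end a sentence is misclassified, which is one reason to consider machine learning.

```python
ABBREVIATIONS = {"inc.", "dr.", "mr.", "mrs.", "prof."}  # assumed, incomplete list


def is_end_of_sentence(word, next_word):
    """Return True for EndOfSentence, False for NotEndOfSentence."""
    if word.endswith(("!", "?")):
        return True                   # relatively unambiguous
    if not word.endswith("."):
        return False                  # tokens like .02 or 3.14 do not end in a period
    if word.lower() in ABBREVIATIONS:
        return False                  # Inc. or Dr.
    # A lower-case continuation suggests the sentence goes on.
    return next_word is None or not next_word[0].islower()


def split_sentences(text):
    """Group whitespace-separated tokens into sentences."""
    tokens = text.split()
    sentences, current = [], []
    for i, word in enumerate(tokens):
        current.append(word)
        next_word = tokens[i + 1] if i + 1 < len(tokens) else None
        if is_end_of_sentence(word, next_word):
            sentences.append(" ".join(current))
            current = []
    if current:                       # trailing tokens without a final boundary
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Smith paid 3.14 dollars. He left! Did Acme Inc. agree?"))
```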


Sentence segmentation: decision trees

• More sophisticated features:

• Case of word with “.”: Upper, Lower, Cap, Number

• Case of word after “.”: Upper, Lower, Cap, Number

• Numeric features:

• Length of word with “.”

• Probability(word with “.” occurs at end of sentence)

• Probability(word after “.” occurs at beginning of sentence)
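The features listed above could be extracted as follows before being handed to a decision tree or another classifier. This is a sketch with assumptions: the function and key names are made up, the length includes the period, and the two probabilities are expected to come from counts over a labelled corpus that the caller supplies as dictionaries.

```python
def case_class(word):
    """Map a word to Number, Upper, Lower, Cap, or Other."""
    core = word.strip(".")
    if core.replace(".", "").replace(",", "").isdigit():
        return "Number"
    if core.isupper():
        return "Upper"
    if core.islower():
        return "Lower"
    if core[:1].isupper():
        return "Cap"
    return "Other"


def period_features(word, next_word, p_end=None, p_start=None):
    """Features for the candidate boundary between word (ending in ".") and next_word."""
    p_end = p_end or {}        # assumed: P(word occurs at end of sentence), estimated elsewhere
    p_start = p_start or {}    # assumed: P(word occurs at beginning of sentence)
    return {
        "case_word": case_class(word),
        "case_next": case_class(next_word),
        "length": len(word),
        "p_end": p_end.get(word.lower(), 0.0),
        "p_start": p_start.get(next_word.lower(), 0.0),
    }


print(period_features("Dr.", "Smith", p_end={"dr.": 0.01}))
```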


Linguistic terminology

[Figure, after Philipp Koehn, ANLP Lecture 1, 19 September 2012: "This is a simple sentence" annotated at every level: WORDS, MORPHOLOGY, PART OF SPEECH, SYNTAX, SEMANTICS, DISCOURSE.]


Conclusion

• Tokenization, lemmatization, stemming and sentence segmentation are almost inevitable when working with raw text.

• They are not rocket science, but they are not trivial either, and they can be very tedious.

• They are often highly language dependent.

• Use machine learning when necessary, but regular expressions or iterative rules often suffice.


Assignments by next week!

• Install Python and NLTK, the Natural Language Toolkit

• Follow instructions on: http://nltk.org/

• For Python optionally use: http://epd-free.enthought.com/?Download=Download+EPD+Free+7.3-2


Assignments by next week!

• Read the first chapter of the NLTK book and do the “Your turn” items.

• http://nltk.org/book/ch01.html

• Collect all your Python commands and the output for the “Your turn” questions in a document and send it to [email protected]

• Deadline: Friday 22 February at 12.00h midday.
