19
Lecture 6b NLTK Tagging Topics Topics Taggers Readings: Chapter 5 Readings: Chapter 5 CSCE 771 Natural Language Processing

Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

Embed Size (px)

Citation preview

Page 1: Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

Lecture 6b NLTK Tagging

Lecture 6b NLTK Tagging

Topics Topics Taggers

Readings: Chapter 5Readings: Chapter 5

CSCE 771 Natural Language Processing

Page 2: Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

– 2 – CSCE 771 Spring 2013

NLTK taggingNLTK tagging

>>> text = nltk.word_tokenize("And now for something >>> text = nltk.word_tokenize("And now for something completely different") completely different")

>>> nltk.pos_tag(text) >>> nltk.pos_tag(text)

[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]'NN'), ('completely', 'RB'), ('different', 'JJ')]

Page 3: Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

– 3 – CSCE 771 Spring 2013

>>> text = nltk.word_tokenize("They refuse to permit us >>> text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit") to obtain the refuse permit")

>>> nltk.pos_tag(text) >>> nltk.pos_tag(text)

[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]'DT'), ('refuse', 'NN'), ('permit', 'NN')]

Page 4: Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

– 4 – CSCE 771 Spring 2013

>>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())>>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())

>>> text.similar('woman') >>> text.similar('woman')

Building word-context index... man time day year car moment world Building word-context index... man time day year car moment world family house country child boy state job way war girl place room family house country child boy state job way war girl place room word word

>>> text.similar('bought') >>> text.similar('bought')

made said put done seen had found left given heard brought got been made said put done seen had found left given heard brought got been was set told took in felt that was set told took in felt that

>>> text.similar('over')>>> text.similar('over')

in on to of and for with from at by that into as up out down through is in on to of and for with from at by that into as up out down through is all about all about

>>> text.similar('the') >>> text.similar('the')

a his this their its her an that our any all one these my in your no some a his this their its her an that our any all one these my in your no some other andother and

Page 5: Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

– 5 – CSCE 771 Spring 2013

Tagged CorporaTagged Corpora

By convention in NLTK, a tagged token is a tuple.By convention in NLTK, a tagged token is a tuple.

function str2tuple()function str2tuple()

>>> tagged_token = nltk.tag.str2tuple('fly/NN') >>> tagged_token = nltk.tag.str2tuple('fly/NN')

>>> tagged_token >>> tagged_token

('fly', 'NN') ('fly', 'NN')

>>> tagged_token[0] >>> tagged_token[0]

'fly' 'fly'

>>> tagged_token[1] >>> tagged_token[1]

'NN''NN'

Page 6: Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

– 6 – CSCE 771 Spring 2013

Specifying Tags with StringsSpecifying Tags with Strings

>>> sent = ''' >>> sent = '''

... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT ... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN ... other/AP topics/NNS ,/, number/NN of/IN ... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CCAMONG/IN them/PPO the/AT Atlanta/NP and/CC

... ...

... accepted/VBN practices/NNS which/WDT inure/VB ... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT ... interest/NN of/IN both/ABX to/IN the/AT best/JJT ... interest/NN of/IN both/ABX governments/NNS ''/'' ./. ... ''' governments/NNS ''/'' ./. ... '''

>>> [nltk.tag.str2tuple(t) for t in sent.split()] >>> [nltk.tag.str2tuple(t) for t in sent.split()]

[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', [('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]'VBD'), ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]

Page 7: Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

– 7 – CSCE 771 Spring 2013

Reading Tagged CorporaReading Tagged Corpora

>>> nltk.corpus.brown.tagged_words() >>> nltk.corpus.brown.tagged_words()

[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...] [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]

>>> nltk.corpus.brown.tagged_words(simplify_tags=True) >>> nltk.corpus.brown.tagged_words(simplify_tags=True)

[('The', 'DET'), ('Fulton', 'N'), ('County', 'N'), ...][('The', 'DET'), ('Fulton', 'N'), ('County', 'N'), ...]

Page 8: Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

– 8 – CSCE 771 Spring 2013

tagged_words() methodtagged_words() method

>>> print nltk.corpus.nps_chat.tagged_words() >>> print nltk.corpus.nps_chat.tagged_words()

[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...] [('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]

>>> nltk.corpus.conll2000.tagged_words() >>> nltk.corpus.conll2000.tagged_words()

[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...] [('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]

>>> nltk.corpus.treebank.tagged_words() >>> nltk.corpus.treebank.tagged_words()

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...][('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]

Page 9: Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

– 9 – CSCE 771 Spring 2013

>>> nltk.corpus.brown.tagged_words(simplify_tags=True) >>> nltk.corpus.brown.tagged_words(simplify_tags=True)

[('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...] [('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...]

>>> nltk.corpus.treebank.tagged_words(simplify_tags=True) >>> nltk.corpus.treebank.tagged_words(simplify_tags=True)

[('Pierre', 'NP'), ('Vinken', 'NP'), (',', ','), ...][('Pierre', 'NP'), ('Vinken', 'NP'), (',', ','), ...]

Page 10: Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

– 10 – CSCE 771 Spring 2013

readme() methodsreadme() methods

Page 11: Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

– 11 – CSCE 771 Spring 2013

Table 5.1: Simplified Part-of-Speech TagsetTable 5.1: Simplified Part-of-Speech Tagset

Tag Meaning Examples

ADJ adjective new, good, high, special, big, local

ADV adverb really, already, still, early, now

CNJ conjunction and, or, but, if, while, although

DET determiner the, a, some, most, every, no

EX existential there, there's

FW foreign word dolce, ersatz, esprit, quo, maitre

Page 12: Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

– 12 – CSCE 771 Spring 2013

MOD modal verb will, can, would, may, must, should

N noun year, home, costs, time, education

NP proper noun Alison, Africa, April, Washington

NUM number twenty-four, fourth, 1991, 14:24

PRO pronoun he, their, her, its, my, I, us

P preposition on, of, at, with, by, into, under

TO the word to to

UH interjection ah, bang, ha, whee, hmpf, oops

V verb is, has, get, do, make, see, run

VD past tense said, took, told, made, asked

VGpresent participle

making, going, playing, working

VN past participle given, taken, begun, sung

WH wh determiner who, which, when, what, where, how

Page 13: Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

– 13 – CSCE 771 Spring 2013

>>> from nltk.corpus import brown >>> from nltk.corpus import brown

>>> brown_news_tagged = >>> brown_news_tagged = brown.tagged_words(categories='news', brown.tagged_words(categories='news', simplify_tags=True) simplify_tags=True)

>>> tag_fd = nltk.FreqDist(tag for (word, tag) in >>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged) brown_news_tagged)

>>> tag_fd.keys() >>> tag_fd.keys()

['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', '.', 'CNJ', 'PRO', 'ADV', ['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', '.', 'CNJ', 'PRO', 'ADV', 'VD', ...]'VD', ...]

Page 14: Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

– 14 – CSCE 771 Spring 2013

NounsNouns

>>> word_tag_pairs = nltk.bigrams(brown_news_tagged) >>> word_tag_pairs = nltk.bigrams(brown_news_tagged)

>>> list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if >>> list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'N')) b[1] == 'N'))

['DET', 'ADJ', 'N', 'P', 'NP', 'NUM', 'V', 'PRO', 'CNJ', '.', ',', ['DET', 'ADJ', 'N', 'P', 'NP', 'NUM', 'V', 'PRO', 'CNJ', '.', ',', 'VG', 'VN', ...]'VG', 'VN', ...]

Page 15: Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

– 15 – CSCE 771 Spring 2013

VerbsVerbs

>>> wsj = >>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=Truenltk.corpus.treebank.tagged_words(simplify_tags=True) )

>>> word_tag_fd = nltk.FreqDist(wsj) >>> word_tag_fd = nltk.FreqDist(wsj)

>>> [word + "/" + tag for (word, tag) in word_tag_fd if >>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')] tag.startswith('V')]

['is/V', 'said/VD', 'was/VD', 'are/V', 'be/V', 'has/V', 'have/V', ['is/V', 'said/VD', 'was/VD', 'are/V', 'be/V', 'has/V', 'have/V', 'says/V', 'were/VD', 'had/VD', 'been/VN', "'s/V", 'do/V', 'says/V', 'were/VD', 'had/VD', 'been/VN', "'s/V", 'do/V', 'say/V', 'make/V', 'did/VD', 'rose/VD', 'does/V', 'say/V', 'make/V', 'did/VD', 'rose/VD', 'does/V', 'expected/VN', 'buy/V', 'take/V', 'get/V', 'sell/V', 'help/V', 'expected/VN', 'buy/V', 'take/V', 'get/V', 'sell/V', 'help/V', 'added/VD', 'including/VG', 'according/VG', 'made/VN', 'added/VD', 'including/VG', 'according/VG', 'made/VN', 'pay/V', ...]'pay/V', ...]

Page 16: Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

– 16 – CSCE 771 Spring 2013

>>> cfd1 = nltk.ConditionalFreqDist(wsj) >>> cfd1 = nltk.ConditionalFreqDist(wsj)

>>> cfd1['yield'].keys() >>> cfd1['yield'].keys()

['V', 'N'] ['V', 'N']

>>> cfd1['cut'].keys() >>> cfd1['cut'].keys()

['V', 'VD', 'N', 'VN']['V', 'VD', 'N', 'VN']

Page 17: Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

– 17 – CSCE 771 Spring 2013

>>> cfd2 = nltk.ConditionalFreqDist((tag, word) for >>> cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj) (word, tag) in wsj)

>>> cfd2['VN'].keys() >>> cfd2['VN'].keys()

['been', 'expected', 'made', 'compared', 'based', 'priced', ['been', 'expected', 'made', 'compared', 'based', 'priced', 'used', 'sold', 'named', 'designed', 'held', 'fined', 'used', 'sold', 'named', 'designed', 'held', 'fined', 'taken', 'paid', 'traded', 'said', ...]'taken', 'paid', 'traded', 'said', ...]

Page 18: Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

– 18 – CSCE 771 Spring 2013

>>> [w for w in cfd1.conditions() if 'VD' in cfd1[w] and 'VN' >>> [w for w in cfd1.conditions() if 'VD' in cfd1[w] and 'VN' in cfd1[w]] in cfd1[w]]

['Asked', 'accelerated', 'accepted', 'accused', 'acquired', ['Asked', 'accelerated', 'accepted', 'accused', 'acquired', 'added', 'adopted', ...] 'added', 'adopted', ...]

>>> idx1 = wsj.index(('kicked', 'VD')) >>> idx1 = wsj.index(('kicked', 'VD'))

>>> wsj[idx1-4:idx1+1] >>> wsj[idx1-4:idx1+1]

[('While', 'P'), ('program', 'N'), ('trades', 'N'), ('swiftly', [('While', 'P'), ('program', 'N'), ('trades', 'N'), ('swiftly', 'ADV'), ('kicked', 'VD')] 'ADV'), ('kicked', 'VD')]

>>> idx2 = wsj.index(('kicked', 'VN')) >>> idx2 = wsj.index(('kicked', 'VN'))

>>> wsj[idx2-4:idx2+1] >>> wsj[idx2-4:idx2+1]

[('head', 'N'), ('of', 'P'), ('state', 'N'), ('has', 'V'), ('kicked', [('head', 'N'), ('of', 'P'), ('state', 'N'), ('has', 'V'), ('kicked', 'VN')]'VN')]

Page 19: Lecture 6b NLTK Tagging Topics Taggers Readings: Chapter 5 CSCE 771 Natural Language Processing

– 19 – CSCE 771 Spring 2013

def findtags(tag_prefix, tagged_text): def findtags(tag_prefix, tagged_text):

cfd = nltk.ConditionalFreqDist((tag, word) for cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text if (word, tag) in tagged_text if tag.startswith(tag_prefix)) return dict((tag, tag.startswith(tag_prefix)) return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())cfd[tag].keys()[:5]) for tag in cfd.conditions())