
Page 1

Part-of-Speech Tagging

L545

Spring 2010

Page 2

POS Tagging Problem

Given a sentence W1…Wn and a tagset of lexical categories, find the most likely tag T1..Tn for each word in the sentence

Example

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN

People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

Note that many of the words may have unambiguous tags

- But enough words are either ambiguous or unknown that it’s a nontrivial task

Page 3

More details of the problem

How ambiguous?

- Most words in English have only one Brown Corpus tag

Unambiguous (1 tag): 35,340 word types
Ambiguous (2–7 tags): 4,100 word types = 11.5%

- 7 tags: 1 word type ("still")

- But many of the most common words are ambiguous

Over 40% of Brown corpus tokens are ambiguous

Obvious strategies may be suggested based on intuition:

to/TO race/VB the/DT race/NN will/MD race/NN

- This leads to hand-crafted rule-based POS tagging (J&M, 5.4)

Sentences can also contain unknown words for which tags have to be guessed: Secretariat/NNP

Page 4

POS tagging methods

First, we’ll look at how POS-annotated corpora are constructed

Then, we’ll narrow in on various POS tagging methods, which rely on POS-annotated corpora

- Supervised learning: use POS-annotated corpora as training material

HMMs, TBL, Maximum Entropy, MBL

- Unsupervised learning: use training material which is unannotated

HMMs, TBL

We’ll only provide an overview of the methods

- Many of the details will be left to L645

Page 5

Constructing POS-annotated corpora

By examining how POS-annotated corpora are created, we will understand the task even better

Basic steps

- 1. Tokenize the corpus

- 2. Design a tagset, along with guidelines

- 3. Apply the tagset to the text

Knowing what issues the corpus annotators had to deal with in hand-tagging tells us the kind of issues we’ll have to deal with in automatic tagging

Page 6

1. Tokenize the corpus

The corpus is first segmented into sentences

And then each sentence is tokenized into individual tokens (words)

Punctuation is usually split off

- Figuring out sentence-ending periods vs. abbreviatory periods is not always easy.

Can become tricky:

- Proper nouns and other multi-word units: one token or separate tokens? e.g., Jimmy James, in spite of

- Contractions are often split into separate words: can't → can n't (because these are two different POSs)

- Hyphenations sometimes split, sometimes kept together
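A minimal sketch of such tokenization decisions (an illustration, not an actual corpus tokenizer; the regular expression and the split of negative contractions into two tokens, e.g. can't → ca + n't in PTB style, are simplifying assumptions):

```python
import re

def tokenize(sentence):
    """Minimal tokenizer sketch: split off punctuation and negative contractions."""
    tokens = []
    for word in sentence.split():
        # Split off trailing punctuation (periods here are assumed to be
        # sentence-final, not abbreviatory -- a real tokenizer must decide).
        match = re.match(r'^(.*?)([.,;:!?"]*)$', word)
        core, punct = match.group(1), match.group(2)
        # Split the negative contraction into two tokens with different POSs.
        if core.lower().endswith("n't"):
            tokens.extend([core[:-3], core[-3:]])
        elif core:
            tokens.append(core)
        tokens.extend(punct)          # each punctuation mark becomes its own token
    return tokens

print(tokenize("He can't race tomorrow."))
# ['He', 'ca', "n't", 'race', 'tomorrow', '.']
```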

Page 7

2. Design the Tagset

It is not obvious what the parts of speech are in any language, so a tagset has to be designed.

Criteria:

- External: what do we need to capture the linguistic properties of the text?

- Internal: what distinctions can be made reliably in a text?

Some tagsets are compositional: each character in the tag has a particular meaning:

Vr3s-f

= verb-present-3rd-singular-<undefined_for_case>-female

What do you do with multi-token words, e.g. in terms of?

Page 8

Example English Part-of-Speech Tagsets

Brown corpus
- 87 tags
- Allows compound tags

"I'm" tagged as PPSS+BEM: PPSS for "non-3rd person nominative personal pronoun" and BEM for "am, 'm"

Others have derived their work from the Brown Corpus
- LOB Corpus: 135 tags
- Lancaster UCREL Group: 165 tags
- London-Lund Corpus: 197 tags
- BNC: 61 tags (C5)
- PTB: 45 tags

Other languages have developed other tagsets

Page 9

PTB Tagset (36 main tags + punctuation tags)

Page 10

Typical Problem Cases

Certain tagging distinctions are particularly problematic

For example, in the Penn Treebank (PTB), tagging systems do not consistently get the following tags correct:

- NN vs NNP vs JJ, e.g., Fantastic

somewhat ill-defined distinctions

- RP vs RB vs IN, e.g., off

pseudo-semantic distinctions

- VBD vs VBN vs JJ, e.g., hated

non-local distinctions

Page 11

3. Apply Tagset to Text

PTB Tagging Process:

1. Automatic tagging by rule-based and statistical POS taggers, error rates of 2-6%.

2. Human correction using a text editor

- Takes under a month for humans to learn this, and annotation speeds after a month exceed 3,000 words/hour

- Inter-annotator disagreement (4 annotators, eight 2000-word docs) was 7.2% for the tagging task and 4.1% for the correcting task

Benefit of post-editing: Manual tagging took about 2X as long as correcting, with about 2X the inter-annotator disagreement rate and an error rate that was about 50% higher

Page 12

POS Tagging Methods

Two basic ideas to build from:

- Establishing a simple baseline with unigrams

- Hand-coded rules

Machine learning techniques:

- Supervised learning techniques

- Unsupervised learning techniques

Page 13

A Simple Strategy for POS Tagging

Choose the most likely tag for each ambiguous word, independent of previous words

- i.e., assign each token the POS category it occurred as most often in the training set

- e.g., race – which POS is more likely in a corpus?

This strategy gives you 90% accuracy in controlled tests

- So, this "unigram baseline" must always be compared against
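A minimal sketch of this unigram baseline, assuming the training data is simply a list of (word, tag) pairs; the fallback tag NN for unknown words is an illustrative choice:

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_corpus):
    """tagged_corpus: iterable of (word, tag) pairs from an annotated corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # For each word type, remember the tag it occurred with most often.
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag_unigram(sentence, most_likely_tag, default_tag="NN"):
    # Unknown words fall back to a default tag (a common, simple choice).
    return [(w, most_likely_tag.get(w, default_tag)) for w in sentence]

training = [("the", "DT"), ("race", "NN"), ("to", "TO"), ("race", "VB"),
            ("race", "NN"), ("will", "MD")]
model = train_unigram_tagger(training)
print(tag_unigram(["the", "race"], model))   # [('the', 'DT'), ('race', 'NN')]
```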

Page 14

Example of the Simple Strategy

Which POS is more likely in a corpus (1,273,000 tokens)?

         NN    VB    Total
race    400   600    1000

P(NN|race) = P(race&NN) / P(race) by the definition of conditional probability

- P(race) = 1000/1,273,000 = .0008
- P(race&NN) = 400/1,273,000 = .0003
- P(race&VB) = 600/1,273,000 = .0005

And so we obtain:

- P(NN|race) = P(race&NN)/P(race) = .0003/.0008 = .375
- P(VB|race) = P(race&VB)/P(race) = .0005/.0008 = .625
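The same computation in code, using the counts from the table above (note that without the slide's intermediate rounding, the exact values come out as .4 and .6):

```python
# Relative-frequency estimates from the corpus counts above.
total_tokens = 1_273_000
count_race_nn, count_race_vb = 400, 600

p_race = (count_race_nn + count_race_vb) / total_tokens   # ~ .0008
p_race_nn = count_race_nn / total_tokens                  # ~ .0003
p_race_vb = count_race_vb / total_tokens                  # ~ .0005

print(round(p_race_nn / p_race, 3))   # P(NN|race) = 0.4
print(round(p_race_vb / p_race, 3))   # P(VB|race) = 0.6
```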

Page 15

Hand-coded rules

Two-stage system:

- Dictionary assigns all possible tags to a word

- Rules winnow down the list to a single tag

Sometimes, multiple tags are left, if it cannot be determined

Can also use some probabilistic information

These systems can be highly effective, but they of course take time to write the rules.

- We’ll see an example later of trying to automatically learn the rules (transformation-based learning)

Page 16

Hand-coded Rules: ENGCG System

Uses a 56,000-word lexicon which lists parts-of-speech for each word (using two-level morphology)

Uses up to 3,744 rules, or constraints, for POS disambiguation

ADV-that rule

Given input “that” (ADV/PRON/DET/COMP)

If (+1 A/ADV/QUANT) #next word is adj, adverb, or quantifier

(+2 SENT_LIM) #and following word is a sentence boundary

(NOT -1 SVOC/A) #and the previous word is not a verb like consider, which allows adjectives as object complements

Then eliminate non-ADV tags

Else eliminate ADV tag
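A rough sketch of how such a constraint could be implemented; the tag names, the toy verb list, and the helper structure are illustrative, not the actual ENGCG formalism:

```python
# Each token carries the set of candidate tags assigned by the lexicon.
# The constraint below mimics the ADV-that rule: keep only ADV if the next
# word can be an adjective/adverb/quantifier, the word after that is a
# sentence boundary, and the previous word is not an SVOC/A verb.
SVOC_A_VERBS = {"consider", "deem", "find"}   # illustrative list

def adv_that_rule(tokens, candidates, i):
    """tokens: list of words; candidates: list of tag sets; i: index of 'that'."""
    if tokens[i].lower() != "that" or "ADV" not in candidates[i]:
        return
    next_ok = i + 1 < len(tokens) and candidates[i + 1] & {"A", "ADV", "QUANT"}
    boundary = i + 2 < len(tokens) and "SENT_LIM" in candidates[i + 2]
    prev_ok = i == 0 or tokens[i - 1].lower() not in SVOC_A_VERBS
    if next_ok and boundary and prev_ok:
        candidates[i] = {"ADV"}                # eliminate non-ADV readings
    else:
        candidates[i].discard("ADV")           # eliminate the ADV reading

tokens = ["that", "much", "."]
cands = [{"ADV", "PRON", "DET", "COMP"}, {"QUANT", "ADV"}, {"SENT_LIM"}]
adv_that_rule(tokens, cands, 0)
print(cands[0])   # {'ADV'}
```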

Page 17

Machine Learning

Machines can learn from examples

- Learning can be supervised or unsupervised

Given training data, machines analyze the data, and learn rules which generalize to new examples

- Can be sub-symbolic (rule may be a mathematical function) e.g., neural nets

- Or it can be symbolic (rules are in a representation that is similar to representation used for hand-coded rules)

In general, machine learning approaches allow for more tuning to the needs of a corpus, and can be reused across corpora

Page 18

Ways of learning

Supervised learning:

- A machine learner learns the patterns found in an annotated corpus

Unsupervised learning:

- A machine learner learns the patterns found in an unannotated corpus

- Often uses another database of knowledge, e.g., a dictionary of possible tags

Techniques used in supervised learning can be adapted to unsupervised learning, as we will see.

Page 19

1. TBL: A Symbolic Learning Method

A method called error-driven Transformation-Based Learning (TBL) (Brill algorithm) can be used for symbolic learning

- The rules (actually, a sequence of rules) are learned from an annotated corpus

- Performs about as accurately as other statistical approaches

Can have better treatment of context compared to HMMs (as we’ll see)

- rules which use the next (or previous) POS

HMMs just use P(Ti| Ti-1) or P(Ti| Ti-2 Ti-1)

- rules which use the previous (next) word

HMMs just use P(Wi|Ti)

Page 20

Rule Templates

Brill’s method learns transformations which fit different templates

- Template: Change tag X to tag Y when previous word is W

Transformation: NN → VB when previous word = to

- Change tag X to tag Y when next tag is Z

NN → NNP when next tag = NNP

- Change tag X to tag Y when previous 1st, 2nd, or 3rd word is W

VBP → VB when one of previous 3 words = has

The learning process is guided by a small number of templates (e.g., 26) to learn specific rules from the corpus

Note how these rules sort of match linguistic intuition
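One way to picture such transformations in code is as small condition/action rules; the representation below is an illustrative sketch, not Brill's actual implementation:

```python
# A transformation: change from_tag to to_tag when the triggering context holds.
# The contexts here mirror two of the templates above (previous word, next tag).
def apply_transformation(tags, words, from_tag, to_tag, trigger, value):
    new_tags = list(tags)
    for i, tag in enumerate(tags):
        if tag != from_tag:
            continue
        if trigger == "PREVWORD" and i > 0 and words[i - 1] == value:
            new_tags[i] = to_tag
        elif trigger == "NEXTTAG" and i + 1 < len(tags) and tags[i + 1] == value:
            new_tags[i] = to_tag
    return new_tags

words = ["to", "race", "tomorrow"]
tags = ["TO", "NN", "NN"]
# "Change NN to VB when the previous word is 'to'"
print(apply_transformation(tags, words, "NN", "VB", "PREVWORD", "to"))
# ['TO', 'VB', 'NN']
```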

Page 21

Brill Algorithm (Overview)

Assume you are given a training corpus G (for gold standard)

First, create a tag-free version V of it … then do steps 1-4

Notes:

- As the algorithm proceeds, each successive rule covers fewer examples, but potentially more accurately

- Some later rules may change tags changed by earlier rules

1. Initial-state annotator: Label every word token in V with most likely tag for that word type from G.

2. Consider every possible transformational rule: select the one that leads to the most improvement in V using G to measure the error

3. Retag V based on this rule

4. Go back to 2, until there is no significant improvement in accuracy over previous iteration
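A compact sketch of this greedy learning loop, using a single simplified template (change X to Y when the previous tag is Z) and a toy gold corpus; the names and the stopping criterion are illustrative:

```python
from itertools import product

def count_errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def learn_transformations(gold, initial_tags, tagset, max_rules=10):
    """Greedy TBL loop: repeatedly pick the rule that most reduces errors."""
    current = list(initial_tags)
    rules = []
    for _ in range(max_rules):
        best = None
        # Candidate rules from one template: change X to Y when the previous tag is Z.
        for x, y, z in product(tagset, repeat=3):
            if x == y:
                continue
            proposed = [y if (t == x and i > 0 and current[i - 1] == z) else t
                        for i, t in enumerate(current)]
            gain = count_errors(current, gold) - count_errors(proposed, gold)
            if best is None or gain > best[0]:
                best = (gain, (x, y, z), proposed)
        if best is None or best[0] <= 0:    # stopping criterion: no improvement left
            break
        rules.append(best[1])               # record the rule ...
        current = best[2]                   # ... and retag the corpus with it
    return rules, current

gold    = ["TO", "VB", "NN"]                # "truth": to/TO race/VB tomorrow/NN
initial = ["TO", "NN", "NN"]                # most-likely-tag initial annotation
rules, tagged = learn_transformations(gold, initial, ["TO", "NN", "VB"])
print(rules)    # [('NN', 'VB', 'TO')]  i.e. change NN to VB when the previous tag is TO
print(tagged)   # ['TO', 'VB', 'NN']
```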

Page 22

Error-driven method

How does one learn the rules?

The TBL method is error-driven

- The rule which is learned on a given iteration is the one which reduces the error rate of the corpus the most, e.g.:

- Rule 1 fixes 50 errors but introduces 25 more → net decrease is 25

- Rule 2 fixes 45 errors but introduces 15 more → net decrease is 30

Choose rule 2 in this case

We set a stopping criterion, or threshold: once we stop reducing the error rate by a big enough margin, learning is stopped

Page 23

Brill Algorithm (More Detailed)

1. Label every word token with its most likely tag (based on lexical generation probabilities).

2. List the positions of tagging errors and their counts, by comparing with "truth" (T)

3. For each error position, consider each instantiation I of X, Y, and Z in the rule template.
- If Y=T, increment improvements[I], else increment errors[I].

4. Pick the I which results in the greatest error reduction, and add to output
- VB → NN PREV1OR2TAG DT improves on 98 errors, but produces 18 new errors, so net decrease of 80 errors

5. Apply that I to corpus

6. Go to 2, unless stopping criterion is reached

Most likely tag:

P(NN|race) = .98

P(VB|race) = .02

Is/VBZ expected/VBN to/TO race/NN tomorrow/NN

Rule template: Change a word from tag X to tag Y when previous tag is Z

Rule Instantiation for above example: NN → VB PREV1OR2TAG TO

Applying this rule yields:

Is/VBZ expected/VBN to/TO race/VB tomorrow/NN

Page 24

Example of Error Reduction

From Eric Brill (1995): Computational Linguistics, 21(4), p. 7

Page 25

Rule ordering

One rule is learned with every pass through the corpus.

- The set of final rules is what the final output is

- Unlike HMMs, such a representation allows a linguist to look through and make more sense of the rules

Thus, the rules are learned iteratively and must be applied in an iterative fashion.

- At one stage, it may make sense to change NN to VB after to

- But at a later stage, it may make sense to change VB back to NN in the same context, e.g., if the current word is school

Page 26

Example of Learned Rule Sequence

1. NN → VB PREVTAG TO
   - to/TO race/NN → VB

2. VBP → VB PREV1OR2OR3TAG MD
   - might/MD vanish/VBP → VB

3. NN → VB PREV1OR2TAG MD
   - might/MD not/RB reply/NN → VB

4. VB → NN PREV1OR2TAG DT
   - the/DT great/JJ feast/VB → NN

5. VBD → VBN PREV1OR2OR3TAG VBZ
   - He/PRP was/VBZ killed/VBD → VBN by/IN Chapman/NNP

Page 27

Handling Unknown Words

Can also use the Brill method to learn how to tag unknown words

Instead of using surrounding words and tags, use affix info, capitalization, etc.

- Guess NNP if capitalized, NN otherwise.

- Or use the tag that is most common for words ending in the same last 3 letters.

- etc.
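A sketch of such heuristics for unknown words; the suffix length, training data, and NNP/NN defaults are illustrative assumptions:

```python
from collections import Counter, defaultdict

def build_suffix_model(tagged_corpus, suffix_len=3):
    """Most common tag for each word-final 3-letter suffix in the training data."""
    suffix_tags = defaultdict(Counter)
    for word, tag in tagged_corpus:
        suffix_tags[word[-suffix_len:].lower()][tag] += 1
    return {s: c.most_common(1)[0][0] for s, c in suffix_tags.items()}

def guess_unknown(word, suffix_model):
    if word[0].isupper():
        return "NNP"                         # capitalized: guess proper noun
    tag = suffix_model.get(word[-3:].lower())
    return tag if tag else "NN"              # otherwise fall back to NN

training = [("running", "VBG"), ("jumping", "VBG"), ("quickly", "RB")]
model = build_suffix_model(training)
print(guess_unknown("Secretariat", model))   # NNP
print(guess_unknown("gliding", model))       # VBG (via the suffix 'ing')
```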

TBL has also been applied to some parsing tasks

Example Learned Rule Sequence for Unknown Words

Page 28

Insights on TBL

TBL takes a long time to train, but is relatively fast at tagging once the rules are learned

The rules in the sequence may be decomposed into non-interacting subsets, i.e., only focus on VB tagging (need to only look at rules which affect it)

In cases where the data is sparse, the initial guess needs to be weak enough to allow for learning

Rules become increasingly specific as you go down the sequence.

- However, the more specific rules generally don’t overfit because they cover just a few cases

Page 29

2. HMMs: A Probabilistic Approach

What you want to do is find the "best sequence" of POS tags T=T1…Tn for a sentence W=W1…Wn.
- (Here T1 is pos_tag(W1).)

Find a sequence of POS tags T that maximizes P(T|W)

Using Bayes' Rule, we can say:
P(T|W) = P(W|T) * P(T) / P(W)

We want to find the value of T which maximizes the RHS; the denominator can be discarded (it is the same for every T)

Find T which maximizes P(W|T) * P(T)

Example: He will race

Possible sequences:
- He/PRP will/MD race/NN
- He/PRP will/NN race/NN
- He/PRP will/MD race/VB
- He/PRP will/NN race/VB

W = W1 W2 W3 = He will race
T = T1 T2 T3

Choices:
- T = PRP MD NN
- T = PRP NN NN
- T = PRP MD VB
- T = PRP NN VB

Page 30

Ngram Models

POS problem formulation
- Given a sequence of words, find a sequence of categories that maximizes P(T1…Tn | W1…Wn)
- i.e., that maximizes P(W1…Wn | T1…Tn) * P(T1…Tn) (by Bayes' Rule)

Chain Rule of probability:

P(W|T) = ∏i=1..n P(Wi | W1…Wi-1, T1…Ti)
  prob. of this word based on previous words & tags

P(T) = ∏i=1..n P(Ti | W1…Wi, T1…Ti-1)
  prob. of this tag based on previous words & tags

But we don’t have sufficient data for this, and we would likely overfit the data, so we make some assumptions to simplify the problem …

Page 31

Independence Assumptions

Assume that the current event is based only on the previous n-1 events (for a bigram model, it's based only on the previous event)

P(T1…Tn) ≈ ∏i=1..n P(Ti | Ti-1)

- assumes that the event of a POS tag occurring is independent of the event of any other POS tag occurring, except for the immediately previous POS tag

From a linguistic standpoint, this seems an unreasonable assumption, due to long-distance dependencies

P(W1…Wn | T1…Tn) ≈ ∏i=1..n P(Wi | Ti)

- assumes that the event of a word appearing in a category is independent of the event of any surrounding word or tag, except for the tag at this position.

Page 32

Hidden Markov Models

Linguists know both these assumptions are incorrect!

- But, nevertheless, statistical approaches based on these assumptions work pretty well for part-of-speech tagging

In particular, with Hidden Markov Models (HMMs)

- Very widely used in both POS-tagging and speech recognition, among other problems

- A Markov model, or Markov chain, is just a weighted Finite State Automaton

Page 33

POS Tagging Based on Bigrams

Problem: Find T which maximizes P(W | T) * P(T)

- Here W=W1..Wn and T=T1..Tn

Using the bigram model, we get:

- Transition probabilities (prob. of transitioning from one state/tag to another):

P(T1…Tn) ≈ ∏i=1..n P(Ti | Ti-1)

- Emission probabilities (prob. of emitting a word at a given state):

P(W1…Wn | T1…Tn) ≈ ∏i=1..n P(Wi | Ti)

So, we want to find the value of T1..Tn which maximizes:

∏i=1..n P(Wi | Ti) * P(Ti | Ti-1)
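In code, the quantity to maximize is just this product over the sentence. A minimal scoring sketch, using the illustrative transition and emission tables from the slides that follow:

```python
# Score one candidate tag sequence under the bigram model:
#   product over i of P(word_i | tag_i) * P(tag_i | tag_{i-1})
transition = {("<s>", "PRP"): 1.0, ("PRP", "MD"): 0.8, ("PRP", "NN"): 0.2,
              ("MD", "NN"): 0.4, ("MD", "VB"): 0.6,
              ("NN", "NN"): 0.3, ("NN", "VB"): 0.7}
emission = {("he", "PRP"): 0.3, ("will", "MD"): 0.8, ("will", "NN"): 0.2,
            ("race", "NN"): 0.4, ("race", "VB"): 0.6}

def score(words, tags):
    prob, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        prob *= emission.get((w, t), 0.0) * transition.get((prev, t), 0.0)
        prev = t
    return prob

words = ["he", "will", "race"]
for tags in [("PRP", "MD", "NN"), ("PRP", "NN", "NN"),
             ("PRP", "MD", "VB"), ("PRP", "NN", "VB")]:
    print(tags, score(words, tags))   # PRP MD VB gets the highest score (~.069)
```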

Page 34

Using POS bigram probabilities: transitions

P(T1…Tn) ≈ ∏i=1..n P(Ti | Ti-1)

Example: He will race

Choices for T=T1..T3

- T = PRP MD NN
- T = PRP NN NN
- T = PRP MD VB
- T = PRP NN VB

POS bigram probs from training corpus can be used for P(T)

P(PRP-MD-NN) = 1 * .8 * .4 = .32

POS bigram probs P(C|R) (row R = previous tag, column C = current tag):

       MD    NN    VB    PRP
MD           .4    .6
NN           .3    .7
PRP   .8     .2
<s>                       1

[The slide also shows these transitions as a state diagram over PRP, MD, NN, and VB]

Page 35

HMMs

Nodes (or states): tag options

Edges: Each edge from a node has a (transition) probability to another state
- The sum of transition probabilities of all edges going out from a node is 1

The probability of a path is the product of the probabilities of the edges along the path.

- Why multiply? Because of the assumption of independence (Markov assumption).

- POS tagging is interested in finding the probability of a path

The probability of a string is the sum of the probabilities of all the paths for the string.

- Why add? Because we are considering the union of all interpretations.

- A string that isn’t accepted has probability zero.

Page 36

Hidden Markov Models

Why the word “hidden”?

Called “hidden” because one cannot tell what state the model is in for a given sequence of words.

- e.g., will could be generated from NN state, or from MD state, with different probabilities.

That is, looking at the words will not tell you directly what tag state you’re in.

Page 37

Factoring in lexical generation probabilities

From the training corpus, we need to find the tag sequence T1…Tn which maximizes

∏i=1..n P(Wi | Ti) * P(Ti | Ti-1)

So, we'll need to factor in the lexical generation (emission) probabilities, somehow:

Lexical generation probs:

       MD    NN    VB    PRP
he      0     0     0     1
will   .8    .2     0     0
race    0    .4    .6     0

[The slide shows the same transition diagram as before, with its states labeled A–F for use in the Viterbi example]

Page 38

Adding emission probabilities

[State diagram with emission-labeled arcs: he|PRP .3, will|MD .8, will|NN .2, race|NN .4, race|VB .6, and <s> → PRP with probability 1]

Lexical generation probs:

       MD    NN    VB    PRP
he      0     0     0    .3
will   .8    .2     0     0
race    0    .4    .6     0

POS bigram probs P(C|R):

       MD    NN    VB    PRP
MD           .4    .6
NN           .3    .7
PRP   .8     .2
<s>                       1

Page 39

Calculating Possibilities: The Slow Way

Here’s what we could do:

- Calculate all 4 paths independently

P(PRP MD NN | He will race)
P(PRP NN NN | He will race)
P(PRP MD VB | He will race)
P(PRP NN VB | He will race)

- P(PRP MD NN | He will race) = P(He|PRP) * P(MD|PRP) * P(will|MD) * …

- P(PRP MD VB | He will race) = P(He|PRP) * P(MD|PRP) * P(will|MD) * …

But wait: the first few probabilities are the same in both cases!

Page 40

Dynamic Programming

In order to find the most likely sequence of categories for a sequence of words, we don’t need to enumerate all possible sequences of categories.

Because of the Markov assumption, if you keep track of the most likely sequence found so far for each possible ending category, you can ignore all the other less likely sequences.

- i.e., multiple edges coming into a state, but only keep the value of the most likely path

- This is a use of dynamic programming

The algorithm to do this is called the Viterbi algorithm.

Page 41

The Viterbi algorithm

1. Assume we’re at state I in the HMM

• States H1 … Hm all come into I

2. Obtain

• the best probability of each previous state H1…Hm

• the transition probabilities: P(I|H1), … P(I|Hm)

• the emission probability for word w at I: P(w|I)

3. Multiply the probabilities for each new path:

• e.g., P(Hi,I) = Best(H1)*P(I|H1)*P(w|I)

4. One of these states (H1…Hm) will give the highest probability

• Only keep the highest probability when using I for the next state
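A compact sketch of this Viterbi computation, run on the same example HMM (illustrative probability tables, and no handling of words with no known tag):

```python
def viterbi(words, tagset, transition, emission, start="<s>"):
    """Return (probability, tag sequence) of the most likely path."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {start: (1.0, [])}
    for w in words:
        new_best = {}
        for tag in tagset:
            e = emission.get((w, tag), 0.0)
            if e == 0.0:
                continue                     # this tag cannot emit the word
            # Keep only the most likely path into this tag (dynamic programming).
            new_best[tag] = max(
                (p * transition.get((prev, tag), 0.0) * e, path + [tag])
                for prev, (p, path) in best.items())
        best = new_best
    return max(best.values())

transition = {("<s>", "PRP"): 1.0, ("PRP", "MD"): 0.8, ("PRP", "NN"): 0.2,
              ("MD", "NN"): 0.4, ("MD", "VB"): 0.6,
              ("NN", "NN"): 0.3, ("NN", "VB"): 0.7}
emission = {("he", "PRP"): 0.3, ("will", "MD"): 0.8, ("will", "NN"): 0.2,
            ("race", "NN"): 0.4, ("race", "VB"): 0.6}

prob, tags = viterbi(["he", "will", "race"], ["PRP", "MD", "NN", "VB"],
                     transition, emission)
print(tags, round(prob, 4))   # ['PRP', 'MD', 'VB'] 0.0691
```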

Page 42

Finding the best path through an HMM

Best(I) = Max over predecessor states H of [Best(H) * P(I|H)] * P(w|I)

Best(A) = 1

Best(B) = Best(A) * P(PRP|) * P(he|PRP) = 1*1*.3=.3

Best(C)=Best(B) * P(MD|PRP) * P(will|MD) = .3*.8*.8= .19

Best(D)=Best(B) * P(NN|PRP) * P(will|NN) = .3*.2*.2= .012

Best(E) = Max [Best(C)*P(NN|MD), Best(D)*P(NN|NN)] * P(race|NN) = .03

Best(F) = Max [Best(C)*P(VB|MD), Best(D)*P(VB|NN)] * P(race|VB)= .068

[Figure: the Viterbi algorithm applied to the trellis for "he will race", with nodes A = start, B = he/PRP, C = will/MD, D = will/NN, E = race/NN, F = race/VB, using the same lexical generation and POS bigram probabilities as on the previous slides]

Page 43

Smoothing

Lexical generation probabilities will lack observations for low-frequency and unknown words

Most systems do one of the following

- Use deleted interpolation – so a trigram uses bigram and unigram information

- Smooth the counts

E.g., add a small number to unseen data

- Use more data (but you’ll still need to smooth)

- Group items into classes, thus increasing class frequency

e.g., group words into ambiguity classes, based on their set of tags.
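For example, add-k smoothing of the emission counts might look like the sketch below; the value of k and the vocabulary size are illustrative choices, not what any particular tagger uses:

```python
from collections import Counter

def smoothed_emission(counts, tag_totals, vocab_size, k=0.5):
    """P(word|tag) with add-k smoothing, so unseen (word, tag) pairs get a
    small nonzero probability instead of zero."""
    def prob(word, tag):
        return (counts.get((word, tag), 0) + k) / (tag_totals[tag] + k * vocab_size)
    return prob

counts = Counter({("race", "NN"): 400, ("race", "VB"): 600})
tag_totals = {"NN": 50_000, "VB": 30_000}
p = smoothed_emission(counts, tag_totals, vocab_size=40_000)

print(p("race", "NN"))    # seen pair: close to the relative frequency
print(p("blorf", "NN"))   # unseen word: small but nonzero probability
```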

Page 44

3. Maximum Entropy

Maximum entropy: model what you know as best you can, remain uncertain about the rest

Set up some features, similar to Brill’s templates

When tagging, the features which are true in a particular context jointly determine the probability of a tag

- Make your probability model match the training data as well as it can, i.e., if an example matches features we’ve seen, we know exactly what to do

- But the probability model also maximizes the entropy = uncertainty … so, we are noncommittal to anything not seen in the training data

Page 45

Other techniques

Decision trees

Memory-Based Learning (Case-Based Reasoning)

Neural networks

Conditional random fields

A network of linear functions

Basically, whatever techniques are being used in machine learning are appropriate techniques to try with POS tagging

However, remember also that POS tagging is a sequence problem, and thus techniques developed for sequences should work better.

- Bidirectional dependency networks

- Guided learning (bidirectional labeling)

Page 46

Bidirectional dependency networks (Toutanova et al. 2003)

Instead of using left-to-right or right-to-left probabilities, use them in both directions

- P(t0|t-1) may ignore certain patterns

e.g. in will to fight, will is likely a noun because nouns generally precede to

However, P(TO|NN) (=1 because to is always TO) does not capture this, whereas P(t-1=NN|t0=TO) does

Since bidirectionality means that the network is cyclic, one has to be more careful about combining probabilities

Toutanova et al integrate maximum entropy modeling into this network (to obtain probabilities)

- With this, they use more robust features about the surrounding context

Page 47

Guided Learning (Shen et al 2007)

It is not clear in which order POS tagging decisions should be made

- In some cases, the previous tag(s) are useful; in others, it is the following ones.

- Consider Agatha found that book interesting

- To disambiguate that, it helps to first disambiguate book and interesting

Shen et al employ a flexible search strategy in disambiguation

- Handle the cases you have most confidence in first

- This minimizes the amount of ambiguity when dealing with the less confident cases (meaning that the search space is smaller)

Page 48

Unsupervised learning

Unsupervised learning:

- Use an unannotated corpus for training data

- Instead, will have to use another database of knowledge, such as a dictionary of possible tags

Unsupervised learning uses the same general techniques as supervised, but there are important differences

Advantage is that there is more unannotated data to learn from

- And annotated data isn’t always available

Page 49

Unsupervised Learning #1: TBL

With TBL, we want to learn rules from patterns, but how can we learn the rules if there's no annotated data?

Main idea: look at the distribution of unambiguous words to guide the disambiguation of ambiguous words

Example: the can, where can can be a noun, modal, or verb

Let's take unambiguous words from the dictionary and count their occurrences after the

- the elephant
- the guardian

Conclusion: immediately after the, nouns are more common than verbs or modals

Page 50

Unsupervised TBL

Initial state annotator
- Supervised: assign random tag to each word
- Unsupervised: for each word, list all tags in dictionary

The templates change accordingly …

Transformation template:
- Change the tag of a word from X to tag Y if the previous (next) tag (word) is Z, where X is a set of 2 or more tags
- Don't change any other tags

Page 51

Error Reduction in Unsupervised Method

Let a rule to change X to Y in context C be represented as Rule(X, Y, C), where X is a set of tags.

- Rule1: {VB, MD, NN} → NN PREVWORD the
- Rule2: {VB, MD, NN} → VB PREVWORD the

Idea:
- since annotated data isn't available, score rules so as to prefer those where Y appears much more frequently in the context C than all the other tags in X

frequency is measured by counting unambiguously tagged words

so, prefer {VB, MD, NN} → NN PREVWORD the

to {VB, MD, NN} → VB PREVWORD the

since dict-unambiguous nouns are more common in a corpus after the than dict-unambiguous verbs
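A sketch of this scoring idea, counting only unambiguously tagged words in the context; the toy corpus and dictionary are illustrative, and the margin-based score is a simplification of Brill's actual measure:

```python
from collections import Counter

# Toy dictionary: possible tags per word. Words with one entry are "unambiguous".
dictionary = {"the": {"DT"}, "elephant": {"NN"}, "guardian": {"NN"},
              "walk": {"VB", "NN"}, "can": {"NN", "MD", "VB"}}

corpus = "the elephant saw the guardian walk the can".split()

def score_rule(corpus, dictionary, tag_set, target_tag, prev_word):
    """Count unambiguous occurrences of target_tag vs. the other tags in tag_set
    immediately after prev_word; prefer rules with a large margin."""
    counts = Counter()
    for prev, word in zip(corpus, corpus[1:]):
        if prev != prev_word or len(dictionary.get(word, set())) != 1:
            continue                       # only count unambiguous words
        (tag,) = dictionary[word]
        if tag in tag_set:
            counts[tag] += 1
    others = max((counts[t] for t in tag_set if t != target_tag), default=0)
    return counts[target_tag] - others

# Rule1: {VB, MD, NN} -> NN PREVWORD the   vs.   Rule2: {VB, MD, NN} -> VB PREVWORD the
print(score_rule(corpus, dictionary, {"VB", "MD", "NN"}, "NN", "the"))  # 2
print(score_rule(corpus, dictionary, {"VB", "MD", "NN"}, "VB", "the"))  # -2
```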

Page 52

Unsupervised Learning #2: HMMs

We still want to use transition (P(ti|ti-1)) and emission (P(w|t)) probabilities, but how do we obtain them?

During training, the values of the states (i.e., the tags) are unknown, or hidden

Start by using a dictionary, to at least estimate the emission probability

Using the forward-backward algorithm, iteratively derive better probability estimates

- The best tag may change with each iteration

Stop when probability of the sequence of words has been (locally) maximized

Page 53

Forward-Backward Algorithm

The Forward-Backward (or Baum-Welch) Algorithm for POS tagging works roughly as follows:

- 1. Initialize all parameters (forward and backward probability estimates)

- 2. Calculate new probabilities

A. Re-estimate the probability for a given word in the sentence from the forward and backward probabilities

B. Use that to re-estimate the forward and backward parameters

- 3. Terminate when a local maximum for the string is reached

See L645 for more details …

Page 54

Summary: POS tagging

Part-of-speech tagging can use

- Hand-crafted rules based on inspecting a corpus

- Machine Learning-based approaches using rules derived automatically from a corpus

- Machine Learning-based approaches based on corpus statistics

When Machine Learning is used, it can be

- Supervised (using an annotated corpus)

- Unsupervised (using an unannotated corpus)