



PART OF SPEECH TAGGING

A Term Paper

Submitted To Ceng463 Introduction To Natural Language Processing Course

Of The Department Of Computer Engineering Of Middle East Technical University

by

Aslı Gülen 1128875

Esin Saka 1129121

December, 2001

Abstract

This paper presents general information about part-of-speech tagging and the concept of morphological disambiguation. After general definitions of the topics, a more detailed explanation is given for rule-based (constraint-based) part-of-speech tagging and morphological disambiguation. Finally, some rule-based part-of-speech tagging studies on the Turkish language are presented. In addition, a CD is attached to the paper. It contains documentation related to part-of-speech tagging and morphological disambiguation, such as our reports, relevant papers, and example code.


TABLE OF CONTENTS

Introduction
Text Tagging Examples
Approaches To Tagging And Morphological Disambiguation
   o Rule-Based (Constraint-Based) Approaches
        Historical Review
   o Statistical (Stochastic) Approaches
        Historical Review
Comparison Of Rule-Based Part Of Speech Tagging With Statistical Part Of Speech Tagging
POS Tagging With Brill's Algorithm
Abstract Tagging And Morphological Disambiguation Of Turkish Text
Tagging And Solving Morphological Disambiguation Of Turkish Text
   o Historical Overview
   o Methodology
        The Preprocessor
        Constraint Rules
        Evaluation
   o Conclusion
Acknowledgments
Bibliography


INTRODUCTION

Natural Language Processing is a research discipline related to artificial intelligence, linguistics, philosophy, and psychology. The aim of this discipline is to build systems capable of understanding and interpreting the computational mechanisms of natural languages. Research in natural language processing has been motivated by two main aims:

To lead to a better understanding of the structure and functions of human language

To support the construction of natural language interfaces and thus to facilitate communication between humans and computers

There are mainly four kinds of knowledge used in understanding natural language:

Morphological knowledge: description of the form of words

Syntactic knowledge: description of the ways in which words must be ordered to make structurally acceptable sentences

Semantic knowledge: description of the ways in which words are related to concepts

Pragmatic knowledge: description of the ways in which we see the world

Assigning a category to a given word is tagging. The purpose of part-of-speech tagging is to assign part-of-speech tags to words, reflecting their syntactic category. A part-of-speech (POS) tagger is a system that uses various sources of information to assign a possibly unique POS tag to each word. Automatic text tagging is an important step in discovering the linguistic structure of large text corpora. It is a major component in the higher-level analysis of text corpora. Its output can also be used in many natural language processing applications, such as speech synthesis, speech recognition, spelling correction, query answering, machine translation, searching large text databases, and information extraction.

In this term paper, part-of-speech tagging and morphological disambiguation, especially for Turkish text, are researched. The Turkish language does not have a finite tag set. For this reason, the term "morphological disambiguation" can be used in place of the term "part-of-speech tagging". Disambiguated texts can be beneficial for applications like:

Corpus analysis: for example, when gathering statistical information about a language by using a corpus.

Syntactic parsing: disambiguation of a text will decrease the ambiguity of its sentences.

Spelling correction: for instance, context information may be used to select the correct spelling.


Speech synthesis: for example, tags of the text will be useful for finding the correct pronunciation of a word.

Let us see the place of morphological disambiguation in an abstract context.

Figure-1: The place of morphological disambiguation in an abstract context (raw Turkish text → morphological analysis → morphological disambiguation → disambiguated Turkish text, whose output feeds applications such as tagged corpora, parsing, and text-to-speech).

TEXT TAGGING EXAMPLES

The way you tag a text depends on the tags you choose. For example, the only tags may be simple ones like verb, noun, adjective, etc.:

Ayla sıcak çikolatayı sever.
noun  adjective  noun  verb

Ayla loves hot chocolate.
noun  verb  adjective  noun

This is a simple example. Alternatively, more complex sentences and more complex classifications can be handled; these are more common and more useful.

Consider the following example:


İşten döner dönmez evimizin yakınında bulunan derin gölde yüzerek gevşemek en büyük zevkimdi.

(Relaxing by swimming in the deep lake near our house as soon as I return from work was my greatest pleasure.)

First, let us give the basic idea behind tagging this sentence:

The construct döner dönmez, formed by two tensed verbs, is actually a temporal adverb meaning "... as soon as ... return(s)"; hence these two lexical items can be coalesced into a single lexical item and tagged as a temporal adverb.

The second person singular possessive (2SG-POSS) interpretation of yakınında is not possible since this word forms a simple compound noun phrase with the previous lexical item and the third person singular possessive functions as the compound marker.

The word derin (deep) is the modifier of the simple compound noun derin göl (deep lake); hence the second choice can safely be selected. The verbal root in the third interpretation is very unlikely to be used in text, let alone in the second person imperative form. The fourth and fifth interpretations are not plausible, as adjectives derived from aorist verbal forms almost never take any further inflectional suffixes. The first interpretation (meaning your skin) may be a possible choice but can be discarded in the middle of a longer compound noun phrase.

The word en preceding an adjective indicates a superlative construction and hence the noun reading can be discarded.

However, there exists a semantic ambiguity for the lexical item bulunan. It has two adjectival readings, meaning something found and existing, respectively. Between these two readings, one cannot resolve the ambiguity without any idea about the discourse. Contextual information is not sufficient, and the ambiguity should be left pending for higher-level analysis.

So, after the morphological analysis, the output in Figure-2 is computed. In Figure-2, upper-case letters in the morphological break-downs represent specific classes of vowels, e.g., A stands for the low vowels e and a, H stands for the high vowels ı, i, u, and ü, and D stands for the consonants d and t. Although the final category is adjective, the use of possessive (and/or case, number) suffixes indicates nominal usage, as any adjective in Turkish can be used as a noun. The correct choices of tags are marked with +.

Word  Gloss  POS  (+ marks the correct choice)

işten
1. iş+Dan  N(iş)+ABL  N +

döner
1. döner  N(döner)  N
2. dön+Ar  V(dön)+AOR+3SG  V +
3. dön+Ar  V(dön)+VtoADJ(er)  ADJ

dönmez
1. dön+mA+z  V(dön)+NEG+AOR+3SG  V +
2. dön+mAz  V(dön)+VtoADJ(mez)  ADJ

evimizin
1. ev+HmHz+nHn  N(ev)+1PL-POSS+GEN  N +

yakınında
1. yakın+sH+nDA  ADJ(yakın)+3SG-POSS+LOC  N2 +
2. yakın+Hn+DA  ADJ(yakın)+2SG-POSS+LOC  N

bulunan
1. bul+Hn+yAn  V(bul)+PASS+VtoADJ(yan)  ADJ
2. bulun+yAn  V(bulun)+VtoADJ(yan)  ADJ +

derin
1. deri+Hn  N(deri)+2SG-POSS  N
2. derin  ADJ(derin)  ADJ +
3. der+yHn  V(der)+IMP+2PL  V
4. de+Ar+Hn  V(de)+VtoADJ(er)+2SG-POSS  N
5. de+Ar+nHn  V(de)+VtoADJ(er)+GEN  N

gölde
1. göl+DA  N(göl)+LOC  N +

yüzerek
1. yüz+yArAk  V(yüz)+VtoADV(yerek)  ADV +

gevşemek
1. gevşe+mAk  V(gevşe)+INF  V +

en
1. en  N(en)  N
2. en  ADV(en)  ADV +

büyük
1. büyük  ADJ(büyük)  ADJ +

zevkimdi
1. zevk+Hm+yDH  N(zevk)+1SG-POSS+NtoV()+PAST+3SG  V +

Figure-2: Morphological analyzer output of the example sentence.


If we gather the correct choices of tags (the ones with the + sign), we get Figure-3:

Gloss  POS
iş+Dan  N(iş)+ABL  N
dön+Ar  V(dön)+AOR+3SG  V
dön+mA+z  V(dön)+NEG+AOR+3SG  V
ev+HmHz+nHn  N(ev)+1PL-POSS+GEN  N
yakın+sH+nDA  ADJ(yakın)+3SG-POSS+LOC  N2
bulun+yAn  V(bulun)+VtoADJ(yan)  ADJ
derin  ADJ(derin)  ADJ
göl+DA  N(göl)+LOC  N
yüz+yArAk  V(yüz)+VtoADV(yerek)  ADV
gevşe+mAk  V(gevşe)+INF  V
en  ADV(en)  ADV
büyük  ADJ(büyük)  ADJ
zevk+Hm+yDH  N(zevk)+1SG-POSS+NtoV()+PAST+3SG  V

Figure-3: Tagged form of the second example sentence.

However, there are a number of choices for the tags of the lexical items in the sentence, as can be seen in Figure-2. Probably all except the one above give rise to ungrammatical sentence structures. The number of such undesired solutions indicates the level of ambiguity of the tagging process.

APPROACHES TO TAGGING AND MORPHOLOGICAL DISAMBIGUATION

There are many approaches to automated part-of-speech tagging. Different criteria give different methodologies for the classification of tagging and morphological disambiguation approaches.

First, let us give a brief introduction to the types of tagging schemes, according to whether or not they use pre-tagged corpora. The following diagram depicts the various approaches to automatic POS tagging. In reality, the picture is much more complicated, since many tagging systems use aspects of some or all of these approaches.


Figure-4: A classification of part-of-speech tagging methodologies.

According to this classification scheme, one of the main distinctions that can be made among POS taggers is the degree of automation of the training and tagging process. The terms commonly applied to this distinction are supervised vs. unsupervised. Supervised taggers typically rely on pre-tagged corpora to serve as the basis for creating any tools to be used throughout the tagging process, for example the tagger dictionary, the word/tag frequencies, the tag sequence probabilities, and/or the rule set. Unsupervised models, on the other hand, are those that do not require a pre-tagged corpus but instead use sophisticated computational methods to automatically induce word groupings (i.e., tag sets) and, based on those automatic groupings, either to calculate the probabilistic information needed by stochastic taggers or to induce the context rules needed by rule-based systems. Each of these approaches has pros and cons.

The primary argument for using a fully automated approach to POS tagging is that it is extremely portable. It is known that automatic POS taggers tend to perform best when both trained and tested on the same genre of text. The unfortunate reality is that pre-tagged corpora are not readily available for the many languages and genres that one might wish to tag. Full automation of the tagging process addresses the need to accurately tag previously untagged genres and languages, in light of the fact that hand tagging of training data is a costly and time-consuming process. There are, however, drawbacks to fully automating the POS tagging process. The word clusterings that tend to result from these methods are very coarse, i.e., one loses the fine distinctions found in the carefully designed tag sets used in the supervised methods.

The following table outlines the differences between these two approaches.

SUPERVISED | UNSUPERVISED

Selection of tagset / tagged corpus | Induction of tagset using untagged training data

Creation of dictionaries using tagged corpus | Induction of dictionary using training data

Calculation of disambiguation tools; may include: | Induction of disambiguation tools; may include:
  word frequencies | word frequencies
  affix frequencies | affix frequencies
  tag sequence probabilities | tag sequence probabilities
  "formulaic" expressions |

Tagging of test data using dictionary information | Tagging of test data using induced dictionaries

Disambiguation using statistical, hybrid or rule-based approaches | Disambiguation using statistical, hybrid or rule-based approaches

Calculation of tagger accuracy | Calculation of tagger accuracy


As can be seen from Figure-4, there are two major approaches used for POS taggers and morphological disambiguators:

Rule-based (Constraint-based) approaches
Statistical (Stochastic) approaches

In constraint-based approaches, a large number of hand-crafted linguistic constraints are used. With these constraints, impossible tags and impossible morphological analyses for a given word in a text are eliminated. In stochastic approaches, on the other hand, a large corpus is used to obtain statistical information. Using one part of the corpus, a training phase is performed in order to obtain a statistical model, which will then be used to tag untagged texts and to perform morphological analysis. The remainder of the corpus is used to test the statistical model.

Early approaches to part-of-speech tagging were rule-based ones. After the 1980s, statistical methods became more popular. In the 1990s, Brill introduced a method to induce the constraints from tagged corpora, which is called transformation-based error-driven learning. Nowadays, all of these approaches are used together to get better results.

RULE-BASED (CONSTRAINT-BASED) APPROACHES

Typical rule-based approaches use contextual information to assign tags to unknown or ambiguous words. These rules are often known as context frame rules. As an example, a context frame rule might say something like: "if an ambiguous/unknown word X is preceded by a determiner and followed by a noun, tag it as an adjective":

det - X - n = X/adj

In addition to contextual information, many taggers use morphological information to aid in the disambiguation process. One such rule might be: "if an ambiguous/unknown word ends in -ing and is preceded by a verb, label it a verb" (depending on your theory of grammar, of course).
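To make these rule types concrete, the following small Python sketch applies one context-frame rule (det - X - n = X/adj) and one morphological rule (-ing after a verb) over a toy lexicon. The tag names, lexicon, and rules here are invented for illustration and do not come from any particular tagger.

    # A toy sketch of rule-based disambiguation with one context-frame rule
    # ("det - X - n  =>  X/adj") and one morphological rule ("a word ending
    # in -ing preceded by a verb is a verb"). Everything here is illustrative.

    LEXICON = {
        "the": ["det"],
        "old": ["adj", "n", "v"],   # ambiguous word
        "book": ["n"],
        "was": ["v"],
    }

    def tag(words):
        tags = [LEXICON.get(w, ["unknown"]) for w in words]
        for i, w in enumerate(words):
            if len(tags[i]) == 1 and tags[i] != ["unknown"]:
                continue                          # already unambiguous
            prev = tags[i - 1] if i > 0 else None
            nxt = tags[i + 1] if i + 1 < len(words) else None
            if prev == ["det"] and nxt == ["n"]:
                tags[i] = ["adj"]                 # det - X - n  =>  X/adj
            elif w.endswith("ing") and prev == ["v"]:
                tags[i] = ["v"]                   # verb - X...ing  =>  X/v
        return tags

    print(tag(["the", "old", "book"]))   # [['det'], ['adj'], ['n']]
    print(tag(["was", "running"]))       # [['v'], ['v']]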

Some systems go beyond using contextual and morphological information by including rules pertaining to such factors as capitalization and punctuation. Information of this type is of greater or lesser value depending on the language being tagged. In German for example, information about capitalization proves extremely useful in the tagging of unknown nouns.

Rule-based taggers most commonly require supervised training, but recently there has been a great deal of interest in the automatic induction of rules. One approach to automatic rule induction is to run an untagged text through a tagger and see how it performs. A human then goes through the output of this first phase and corrects any erroneously tagged words. The properly tagged text is then submitted to the tagger, which learns correction rules by comparing the two sets of data. Several iterations of this process are sometimes necessary.


Historical Review

The earliest approach is due to Klein and Simmons. Their primary goal was to avoid the labor of constructing a very large dictionary. Their algorithm uses a set of 30 POS categories. It first looks each word up in dictionaries, then checks for suffixes and special characters as clues. Finally, the context frame tests are applied. These work on scopes bounded by unambiguous words. However, Klein and Simmons impose an explicit limit of three ambiguous words in a row. For each such span of ambiguous words, the pair of unambiguous categories bounding it is mapped into a list. The list includes all known sequences of tags occurring between the particular bounding tags; all such sequences of the correct length become candidates. The program then matches the candidate sequences against the ambiguities remaining from earlier steps of the algorithm. When only one sequence is possible, disambiguation is successful. This algorithm correctly and unambiguously tags about 90% of the words in several pages of the Golden Book Encyclopedia.

The next important tagger, TAGGIT, was developed by Greene and Rubin in 1971. The tag set used is very similar, but somewhat larger, at about 86 tags. The dictionary used is derived from the tagged Brown Corpus, rather than from the untagged version. TAGGIT divides the task of category assignment into initial (potentially ambiguous) tagging and disambiguation. Tagging is carried out as follows: first, the program consults an exception dictionary of about 3,000 words. Among other items, this contains all known closed-class words. It then handles various special cases, such as words with an initial "$", contractions, special symbols, and capitalized words. A word's ending is then checked against a suffix list of about 450 strings derived from the Brown Corpus. If TAGGIT has not assigned some tag(s) after these several steps, the word is tagged as a noun, a verb, and an adjective, so that the disambiguation routine has something to work with. This tagger correctly tags approximately 77% of the million words in the Brown Corpus (the rest is completed by human post-editors).

A very successful constraint-based approach for morphological disambiguation, known as Constraint Grammar, was developed in Finland, from 1989 to 1992, by four researchers: Fred Karlsson, Arto Anttila, Juha Heikkila and Atro Voutilainen. In this framework, the problem of parsing was broken into seven sub-problems or modules; four of them are related to morphological disambiguation, the rest are used for parsing the running text. One of the most important steps of Constraint Grammar was context-dependent morphological disambiguation, where ambiguity is resolved using some context-dependent constraints. For this purpose they wrote a grammar, which contains a set of constraints based on descriptive grammars and studies of various corpora. Each constraint is a quadruple consisting of domain, operator, target and context condition(s).

Among those rule-based part-of-speech taggers, the one built by Brill has the advantage of learning tagging rules automatically. As it is the method we are mainly interested in, detailed information about this approach will be given in a separate section later in this paper.


STATISTICAL (STOCHASTIC) APPROACHES

The term 'stochastic tagger' can refer to any number of different approaches to the problem of POS tagging. Any model that somehow incorporates frequency or probability, i.e., statistics, may properly be labeled stochastic.

The simplest stochastic taggers disambiguate words based solely on the probability that a word occurs with a particular tag. In other words, the tag encountered most frequently in the training set is the one assigned to an ambiguous instance of that word. The problem with this approach is that while it may yield a valid tag for a given word, it can also yield inadmissible sequences of tags.
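The following minimal Python sketch illustrates such a most-frequent-tag tagger over an invented toy training set; it also shows the weakness just mentioned, since a tag that is valid for a word in isolation can still produce an inadmissible tag sequence.

    from collections import Counter, defaultdict

    # A toy sketch of the simplest stochastic tagger: each word receives the
    # tag it occurred with most frequently in training. The tiny "training
    # corpus" below is invented for illustration.

    training = [("the", "det"), ("smile", "n"), ("can", "md"),
                ("can", "n"), ("can", "md"), ("will", "md")]

    counts = defaultdict(Counter)
    for word, tag in training:
        counts[word][tag] += 1

    def unigram_tag(word):
        if word not in counts:
            return "unknown"
        return counts[word].most_common(1)[0][0]

    print([unigram_tag(w) for w in ["the", "can", "smile"]])
    # ['det', 'md', 'n'] -- each tag is valid for its word in isolation,
    # yet "det md" is an inadmissible sequence: "the can" needs a noun.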

An alternative to the word frequency approach is to calculate the probability of a given sequence of tags occurring. This is sometimes referred to as the n-gram approach, referring to the fact that the best tag for a given word is determined by the probability that it occurs with the n previous tags. The most common algorithm for implementing an n-gram approach is the Viterbi Algorithm, a search algorithm that avoids the polynomial expansion of a breadth-first search by "trimming" the search tree at each level using the best N maximum likelihood estimates (where N represents the number of tags of the following word). If n is one (the 1-gram approach), this is just the word frequency approach. If n is two, the approach has a special name: the bigram approach. For example, if you consider the frequency of "the smile", where "the" is a determiner and "smile" is a common noun, this is the bigram approach. Similarly, if you consider the frequency of n ordered words, it is the n-gram approach.
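As an illustration of the bigram case, here is a minimal Viterbi sketch in Python. The transition and emission probabilities are invented toys; a real tagger would estimate them from a tagged corpus, and practical implementations use proper smoothing rather than the crude floor used here.

    import math

    # A toy sketch of bigram (2-gram) Viterbi tagging over invented tables.

    TAGS = ["det", "n", "v", "md"]

    TRANS = {  # P(tag | previous tag); "<s>" marks the sentence start
        ("<s>", "det"): 0.6, ("<s>", "n"): 0.2,
        ("det", "n"): 0.8, ("det", "md"): 0.01, ("det", "v"): 0.05,
        ("n", "md"): 0.3, ("n", "v"): 0.4, ("n", "n"): 0.2,
        ("md", "v"): 0.7, ("md", "md"): 0.05,
    }

    EMIT = {  # P(word | tag)
        ("the", "det"): 0.7,
        ("can", "n"): 0.01, ("can", "v"): 0.005, ("can", "md"): 0.3,
        ("will", "md"): 0.3, ("will", "n"): 0.01,
    }

    def viterbi(words):
        # best[tag] = (log-prob of the best path ending in tag, that path)
        best = {"<s>": (0.0, [])}
        for w in words:
            new_best = {}
            for tag in TAGS:
                emit = EMIT.get((w, tag), 1e-8)   # crude floor for unseen pairs
                score, path = max(
                    (lp + math.log(TRANS.get((prev, tag), 1e-8)) + math.log(emit),
                     path)
                    for prev, (lp, path) in best.items()
                )
                new_best[tag] = (score, path + [tag])
            best = new_best
        return max(best.values())[1]

    print(viterbi(["the", "can", "will"]))  # ['det', 'n', 'md']

Note how the tag-sequence probabilities rescue can: in isolation its most frequent tag is modal, but after a determiner the noun reading wins.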

The next level of complexity that can be introduced into a stochastic tagger is the one that combines the previous two approaches, using both tag sequence probabilities and word frequency measurements.

Historical Review

In 1983, Marshall described the Lancaster-Oslo-Bergen (LOB) Corpus tagging algorithm, later named CLAWS. It is similar to the TAGGIT program. The tag set used is very similar, but somewhat larger, at about 130 tags. The dictionary used is derived from the tagged Brown Corpus, rather than from the untagged version. The main innovation of CLAWS is the use of a matrix of collocation probabilities, indicating the relative likelihood of co-occurrence of all ordered pairs of tags. This matrix can be mechanically derived from any pre-tagged corpus. CLAWS used a large portion of the Brown Corpus, with 200,000 words. CLAWS has been applied to the entire LOB Corpus with an accuracy of between 96% and 97%.

There are several advantages of this general approach over rule-based ones. First, spans of unlimited length can be handled. Second, a precise mathematical definition is possible for the fundamental idea of CLAWS. However, CLAWS is time- and storage-inefficient in the extreme.


Later, in 1988, DeRose attempted to solve the inefficiency problem of CLAWS and proposed a new algorithm called VOLSUNGA. The algorithm depends on an empirically-derived transitional probability matrix similar to that of CLAWS, and has a similar definition of optimal path. The tag set contains 97 tags. The optimal path is defined to be the one whose component collocations multiply out to the highest probability. The more complex definition applied by CLAWS, using the sum of all the paths at each node of the network, is not used. By this change, VOLSUNGA overcomes the complexity problem. Application of the algorithm to the Brown Corpus resulted in 96% accuracy.

A form of Markov model has also been widely used in statistical approaches. In this model, it is assumed that a word depends probabilistically on just its part-of-speech category, which in turn depends solely on the categories of the preceding two words. Two types of training have been used with this model. The first makes use of a tagged training corpus. The second method of training does not require a tagged training corpus; in this situation the Baum-Welch algorithm can be used. Under this regime, the model is called a Hidden Markov Model (HMM), as state transitions (i.e., part-of-speech categories) are assumed to be unobservable. Hidden Markov Model taggers and visible Markov Model taggers may be implemented using the Viterbi algorithm, and are among the most efficient of the tagging methods discussed here. Since this method is not the main interest of this paper, the algorithms mentioned above will not be discussed here. For detailed information, see the progress report and related papers on the CD attached to this term paper.

COMPARISON OF RULE-BASED PART OF SPEECH TAGGING WITH STATISTICAL PART OF SPEECH TAGGING

Until the simple rule-based part-of-speech tagger by Brill (1992), statistical techniques were more successful than rule-based methods in the area of automatic part-of-speech tagging, but their storage, improvement, and adaptation costs were higher. In 1992, Brill described a rule-based tagger whose success and efficiency were good enough to be taken seriously in the tagging area. It had many advantages over statistical taggers. Some may be listed as [1]:

A vast reduction in the stored information required

The perspicuity of a small set of meaningful rules, as opposed to the large statistical tables needed for stochastic taggers

Ease of finding and implementing improvements to the tagger

Better portability from one tag set or corpus genre to another

After Brill’s tagger, rule-based (constraint-based) approaches improved. Nowadays, approaches combining both ideas are used commonly.


POS TAGGING WITH BRILL’S ALGORITHM

This tagger was an important step for natural language processing. Nearly all of the rule-based taggers developed after 1992 reference Brill's tagger.

The tagger is a supervised model, which starts with a small structurally annotated corpus and a larger unannotated corpus, and uses these corpora to learn an ordered list of transformations that can be used to accurately annotate fresh text. It uses the Brown Corpus for testing. It works by automatically recognizing and remedying its weaknesses, thereby incrementally improving its performance.

In 1992, Brill applied transformation-based error-driven learning to part-of-speech tagging and obtained performance comparable to that of stochastic taggers. In this work, the tagger is trained with the following process: first, the text is tagged with an initial annotator, where each word is assigned the most likely tag, estimated by examining a large corpus, without regard to context. The initial tagger has two non-textual procedures to improve performance:

Words that are not in the training corpus and are capitalized tend to be proper nouns, and tagging mistakes on them are fixed accordingly.

Words that are not in the training corpus may be tagged by using their final three letters.

Once the text is passed through the annotator, it is then compared to the correct version, i.e., its manually tagged counterpart, and transformations that can be applied to the output of the initial-state annotator to make it better resemble the truth can then be learned.

During this process, one must specify the following: (1) the initial state annotator, (2) the space of transformations the learner is allowed to examine, and (3) the scoring function for comparing the corpus to the truth.

In the first version, there were transformation templates of the following example forms:

Change tag a to tag b when:

1. The preceding (following) word is tagged z.
2. The preceding (following) word is tagged z and the word two before (after) is tagged w.

where a, b, z, and w are variables over the set of parts-of-speech. To learn a transformation, the learner applies every possible transformation, counts the number of tagging errors after each transformation is applied, and chooses the transformation resulting in the greatest error reduction. Learning stops when no transformation can be found whose application reduces errors beyond some pre-specified threshold. Once an ordered list of transformations is learned, new text can be tagged by first applying the initial annotator to it and then applying each of the learned transformations, in order [1].
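The following Python sketch illustrates this learning loop for a single template ("change tag a to tag b when the preceding tag is z"). The corpus, tags, and stopping rule are invented toys, not Brill's actual data or thresholds.

    # A toy sketch of transformation-based error-driven learning with one
    # template: "change tag a to tag b when the preceding tag is z".

    def apply_rule(tags, rule):
        a, b, z = rule
        return [b if i > 0 and t == a and tags[i - 1] == z else t
                for i, t in enumerate(tags)]

    def errors(guess, truth):
        return sum(g != t for g, t in zip(guess, truth))

    def learn_transformations(guess, truth, max_rules=10):
        rules = []
        while len(rules) < max_rules:
            best_rule, best_gain = None, 0
            # Instantiate the template from every remaining tagging error.
            candidates = {(guess[i], truth[i], guess[i - 1])
                          for i in range(1, len(guess)) if guess[i] != truth[i]}
            for rule in candidates:
                gain = errors(guess, truth) - errors(apply_rule(guess, rule), truth)
                if gain > best_gain:
                    best_rule, best_gain = rule, gain
            if best_rule is None:        # no rule reduces errors: stop
                break
            rules.append(best_rule)
            guess = apply_rule(guess, best_rule)
        return rules

    # Initial most-likely-tag output vs. its manually tagged counterpart.
    guess = ["det", "n", "n", "det", "n", "v"]
    truth = ["det", "n", "v", "det", "n", "v"]
    print(learn_transformations(guess, truth))
    # [('n', 'v', 'n')] -- "change n to v when the preceding tag is n"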

Later, in 1994, Brill extended this learning paradigm to capture relationships between words by adding contextual transformations that could make reference to words as well as part-of-speech tags [4].

The next step is applying patch templates to the training corpus and determining rules according to the errors. Patch templates are of the form:

If a word is tagged a and its context is in C, then change that tag to b; or

If a word is tagged a and it has lexical property P, then change that tag to b; or

If a word is tagged a and a word in region R has lexical property P, then change that tag to b.

Some examples for patch (contextual rule) templates can be listed as:

Change tag a to tag b when:

1. The preceding (following) word is w.
2. The current word is w and the preceding (following) word is x.
3. The current word is w and the preceding (following) word is tagged z.

where w and x are variables over all words in the training corpus, and z is a variable over all parts-of-speech.

In 1995, Brill improved this algorithm so that it no longer requires a manually annotated training corpus. Instead, all that is needed is the set of allowable part-of-speech tags for each token, and the initial-state annotator tags each token in the corpus with a list of all its allowable tags.

The main idea can be explained best with the following example. Given the sentence:

The can will be crushed.

using an unannotated corpus, it can be discovered that among the unambiguous tokens (i.e., those that have only one possible tag) appearing after the in the corpus, nouns are much more common than verbs or modals. From this, the following rule could be learned:

“Change the tag of a word from modal or noun or verb to noun if the previous word is the.”
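A minimal sketch of how such evidence can be gathered from an unannotated corpus: among tokens whose tag set is unambiguous, count which tags follow the. The corpus encoding (word, set of allowable tags) is invented for illustration.

    from collections import Counter

    # A toy sketch: count the tags of unambiguous tokens following "the".

    corpus = [
        ("the", {"det"}), ("dog", {"n"}), ("can", {"md", "n", "v"}),
        ("the", {"det"}), ("car", {"n"}),
        ("the", {"det"}), ("house", {"n"}),
    ]

    after_the = Counter()
    for (w1, _), (_, tags2) in zip(corpus, corpus[1:]):
        if w1 == "the" and len(tags2) == 1:      # unambiguous successors only
            after_the[next(iter(tags2))] += 1

    print(after_the)   # Counter({'n': 3}): nouns dominate after "the",
                       # supporting the rule quoted above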

Unlike supervised learning, in this approach the main aim is not to change the tag of a token, but to reduce the ambiguity by choosing a tag for the words in a particular context. Another difference arises in calculating the scoring function: unambiguous words are used in the scoring of this approach. In each learning iteration, the learner searches for the transformation that maximizes this function. Learning stops when no positive-scoring transformations can be found.

This tagger has remarkable performance. After training the tagger with a corpus of size 600K, it produces 219 rules and achieves 96.9% accuracy in the first scheme. Moreover, after the extension, the number of rules increases to 267 and the accuracy increases to 97.2%.

Brill's rule-based method is an applicable method, but it is not as fast as statistical ones, and it may perform unnecessary operations repeatedly. Deterministic POS tagging with finite-state transducers decreased the complexity from RCn to n, where R is the number of contextual rules, C is the required tokens of context, and n is the number of input words [7]. This method relies on two central notions: the notion of a finite-state transducer and the notion of a subsequential transducer.

ABSTRACT TAGGING AND MORPHOLOGICAL DISAMBIGUATION OF TURKISH TEXT

Turkish is an agglutinative language with word structures formed by productive affixations of derivational and inflectional suffixes to root words. Extensive use of suffixes results in ambiguous lexical interpretations in many cases; almost 80% of lexical items have more than one interpretation [9]. In this section, the sources of morphosyntactic ambiguity in Turkish are explored.

Many words have ambiguous readings even though they have the same morphological break-down. These ambiguities are due to the different POS of their roots. For example, the word yana has three different readings:

yana  Gloss  POS  English
1. yan+yA  V(yan)+OPT+3SG  V  let it burn
2. yan+yA  N(yan)+3SG+DAT  N  to this side
3. yana  POSTP(yana)  POSTP

The first and second readings have the same root and are derived with the same suffix, but since the root word yan has two different readings, one verbal and one nominal, the morphological analyzer produces ambiguous output for the same break-down. Moreover, yana has a third, postpositional reading without any affixation.

In Turkish there are many root words which are prefixes of other root words. An example follows.

Of the two root words uymak and uyumak, uy is a prefix of uyu, and when the morphological analyzer is fed the word uyuyor, it outputs the following:

uyuyor  Gloss  POS  English
1. uy+Hyor  V(uy)+PR-CONT+3SG  V  it suits
2. uyu+Hyor  V(uyu)+PR-CONT+3SG  V  s/he is sleeping

Nominal lexical items with nominative, locative, or genitive case have verbal/predicative interpretations. For example, the word evde is the locative case of the root word ev, and the morphological analyzer produces the following output for it:

evde  Gloss  POS  English
1. ev+DA  N(ev)+3SG+LOC  N  at home
2. ev+DA  N(ev)+3SG+LOC+NtoV()+PR-CONT  V  (smt) is at home

There are morphological structure ambiguities due to the interplay between morphemes and phonetic change rules. Following is the output of the morphological analyzer for the word evin:

evin  Gloss  POS  English
1. ev+Hn  N(ev)+3SG+2SG-POSS+NOM  N  your house
2. ev+nHn  N(ev)+3SG+GEN  N  of the house

Since the suffixes have to harmonize in certain respects with the word they are affixed to, the consonant "n" is deleted in the surface realization of the second reading of evin, causing it to have the same lexical form as the first reading.

Within a word category, e.g., verbs, some of the roots have specific features which are not common to all. For example, certain reflexive verbs may also have passive readings, as in the following sentences:

Çamaşırlar dün yıkandı.

Ali dün yıkandı.

Following is the morphological break-down of yıkandı:

yıkandı  Gloss  POS  English
1. yıka+Hn+DH  V(yıka)+PASS+PAST+3SG  V  got washed
2. yıka+n+DH  V(yıka)+REFLEX+PAST+3SG  V  s/he had a bath

From the same verbal root yıka, two different break-downs are produced. The passive reading of yıkandı is used in the first sentence, and the reflexive reading is used in the second sentence.

Some lexicalized word formations can also be re-derived from the original root, and this is another source of ambiguity. The word mutlu has two parses with the same meaning but different morphological break-downs.

16

mutlu  Gloss  POS  English
1. mut+lH  N(mut)+NtoADJ(li)+3SG+NOM  ADJ  happy
2. mutlu  ADJ(mutlu)+3SG+NOM  ADJ  happy

mutlu has a lexicalized adjectival reading, where it is considered a root form, as seen in the second reading. However, the same surface form can also be derived from the nominal root word mut, meaning happiness, with the suffix +li, and this form also has the same meaning.

Plural forms may display an additional ambiguity due to the drop of a second plural marker. Consider the example word evleri:

evleri  Gloss  POS  English
1. ev+lAr+sH  N(ev)+3PL+3SG-POSS  N  his/her houses
2. ev+lArH  N(ev)+3SG+3PL-POSS  N  their house
3. ev+lArH  N(ev)+3PL+3PL-POSS  N  their houses
4. ev+lAr+yH  N(ev)+3PL+ACC  N  houses (accusative)

In the first and second readings, there is only one level of plurality: either the owner or the ownee is plural. However, the third reading contains a hidden suffix, where both of them are plural. Since it is not possible to detect which one is plural from the surface form, three ambiguous readings are generated.

Considering all these cases, it is apparent that the higher-level analysis of Turkish prose text will suffer from this considerable amount of ambiguity. On the other hand, the available local context might be sufficient to resolve some of these ambiguities. For example, if we can trace the sentential positions of nominal forms in a given sentence, their predicative readings might be discarded; within a noun phrase, it is obvious that they cannot be predicative.

TAGGING AND SOLVING MORPHOLOGICAL DISAMBIGUATION OF TURKISH TEXT

Historical Overview

Since most of the tagging studies in the world have been performed for English, and the structure of Turkish is different from that of English, it is not possible to apply the available methods directly to Turkish. It is therefore necessary to carry out research specific to languages like Turkish and Finnish.

In Turkey, a group of scientists worked on Turkish morphological ambiguity and part-of-speech tagging in the 1990s [2, 5, 6, 8, 9, 10]. Both stochastic and constraint-based approaches were applied.


Since we are especially interested in rule-based approaches, no detailed information will be given for other approaches. However, some basic papers are presented on the CD at the end of the paper.

Methodology

The morphological disambiguation of Turkish text examined in this paper is based on constraints. The tokens on which the disambiguation will be performed are determined using a preprocessing module.

The Preprocessor

Early studies on automatic text tagging for Turkish had shown that some preprocessing of the raw text is necessary before the words are analyzed by a morphological analyzer. This preprocessing module includes:

Tokenization, in which raw text is split into its tokens, which are not necessarily separated by blank characters or punctuation marks;

Morphological Analyzer, which processes the tokens obtained from the tokenization module;

Lexical and Non-lexical Collocation Recognizer, in which lexical and non-lexical collocations are recognized and packaged together;

Unknown Word Processor, in which the tokens that are still marked as unknown after the lexical and non-lexical collocation recognizer are parsed;

Format Conversion, in which each parse of a token is converted into a hierarchical feature structure;

Projection, in which each feature structure is projected on a subset of its features to be used in the training.


Figure-5: The structure of the preprocessor (raw text → tokenization → morphological analyzer → lexical and non-lexical collocation recognizer → unknown word processor → format conversion/projection; the preprocessor output feeds the learning module and the morphological disambiguation module, with the learning module producing the learned rules used by the disambiguation module).


Constraint Rules

The system uses rules of the sort:

if LC and RC then choose PARSE

or

if LC and RC then delete PARSE

where LC and RC are feature constraints on the unambiguous left and right contexts of a given token, and PARSE is a feature constraint on the parse(s) that is (are) chosen (or deleted) in that context if it is (they are) subsumed by that constraint.
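Before turning to the actual rule sets, the following Python sketch shows one way such choose/delete rules could operate over tokens that carry multiple candidate parses. Feature structures are flattened to plain dictionaries, and the subsumption test and example rule are simplified illustrations, not the system's actual formalism.

    # A toy sketch of "if LC and RC then choose/delete PARSE" over tokens
    # carrying multiple parses.

    def subsumes(constraint, parse):
        # A constraint subsumes a parse if every feature it mentions matches.
        return all(parse.get(k) == v for k, v in constraint.items())

    def apply_rule(tokens, lc, rc, action, target):
        for i in range(1, len(tokens) - 1):
            left, mid, right = tokens[i - 1], tokens[i], tokens[i + 1]
            if len(left) != 1 or len(right) != 1:
                continue                  # contexts must be unambiguous
            if not (subsumes(lc, left[0]) and subsumes(rc, right[0])):
                continue
            if action == "choose":
                kept = [p for p in mid if subsumes(target, p)]
            else:                         # "delete"
                kept = [p for p in mid if not subsumes(target, p)]
            if kept:                      # never remove the last parse
                tokens[i] = kept
        return tokens

    # Each token is a list of candidate parses (dicts of features).
    sent = [
        [{"cat": "ADJ"}],                 # unambiguous left context
        [{"cat": "N"}, {"cat": "V"}],     # ambiguous token
        [{"cat": "V"}],                   # unambiguous right context
    ]
    # "Between an adjective and a verb, choose the nominal reading."
    print(apply_rule(sent, {"cat": "ADJ"}, {"cat": "V"}, "choose", {"cat": "N"}))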

This system uses two handcrafted sets of rules:

1. It uses an initial set of hand-crafted choose rules to speed up the learning process by creating disambiguated contexts over which statistics can be collected. These rules are independent of the corpus that is to be tagged, and they are linguistically motivated. They enforce some very common feature patterns, especially where word order is rather strict, as in NPs or PPs. Another important feature of these rules is that they are applied even if the contexts are also ambiguous, as the constraints are tight. That is, if each token in a sequence of, say, three ambiguous tokens has a parse matching one of the context constraints (in the proper order), then all of them are simultaneously disambiguated.

2. It also uses a set of hand-crafted heuristic delete rules to get rid of any very low probability parses. For instance, in Turkish, postpositions have rather strict contextual constraints, and if there are tokens remaining with multiple parses, one of which is a postposition reading, that reading is to be deleted.

Given a training corpus, with tokens annotated with their possible parses, first the hand-crafted rules are applied. Learning then proceeds as a number of iterations over the training corpus. The following schema is an adaptation of Brill's formulation (a small illustrative sketch follows the list):

1. Generate a table, called in context, of all possible unambiguous contexts that contain a token with an unambiguous (projected) parse, along with a count of how many times this parse occurs unambiguously in exactly the same context in the corpus.

2. Generate a table, called count, of all unambiguous parses in the corpus along with a count of how many times this parse occurs in the corpus.

3. Start going over the corpus token by token, generating contexts.

4. For each unambiguous context encountered, with parses P1, ..., Pk, and for each parse Pi, generate a candidate rule of the sort

if LC and RC then choose Pi

5. Every such candidate rule is then scored.

6. All candidate rules generated during one pass over the corpus are grouped by context specificity, and in each group the rules are ordered by descending score.

7. The selected rules are then applied in the matching contexts and ambiguity in those contexts is reduced.

8. If the threshold for the most specific context falls below a given lower limit, the learning process is terminated.
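The following Python sketch illustrates steps 1, 2, 4, and 5 on a toy corpus: it builds the in context and count tables from unambiguous tokens and then scores candidate choose rules. The corpus encoding and the scoring function are simplified assumptions for illustration; the cited work uses more elaborate statistics and context specificities.

    from collections import Counter

    # A toy sketch of the in context / count tables and candidate scoring.

    corpus = [  # each token is the list of its candidate (projected) parses
        [["DET"], ["ADJ"], ["N"]],
        [["DET"], ["ADJ", "N"], ["N"]],   # middle token is ambiguous
        [["DET"], ["ADJ"], ["N"]],
    ]

    in_context = Counter()  # (left, parse, right) -> unambiguous occurrences
    count = Counter()       # parse -> unambiguous occurrences

    for sent in corpus:
        for i, tok in enumerate(sent):
            if len(tok) != 1:
                continue
            count[tok[0]] += 1
            if 0 < i < len(sent) - 1 and len(sent[i-1]) == 1 == len(sent[i+1]):
                in_context[(sent[i-1][0], tok[0], sent[i+1][0])] += 1

    for sent in corpus:
        for i in range(1, len(sent) - 1):
            tok, left, right = sent[i], sent[i-1], sent[i+1]
            if len(tok) > 1 and len(left) == 1 == len(right):
                for parse in tok:
                    # Toy score: how often this parse was seen unambiguously
                    # in exactly this context, relative to its total count.
                    score = in_context[(left[0], parse, right[0])] / max(count[parse], 1)
                    print(f"if LC={left[0]} and RC={right[0]} then choose {parse}"
                          f"  (score {score:.2f})")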

The combination of these hand-crafted, statistical, and learned information sources is reported to yield a precision of 93 to 94% and an ambiguity of 1.02 to 1.03 parses per token on test texts, which is rather satisfactory.

Evaluation

The resulting disambiguated text is evaluated by some metrics. These metrics are:

Ambiguity = Number of Parses / Number of Tokens

Recall = Number of Tokens Correctly Disambiguated / Number of Tokens

Precision = Number of Tokens Correctly Disambiguated / Number of Parses Remaining

In the ideal case, when every token is correctly and uniquely disambiguated, recall and precision will both be 1.0. If tokens are not uniquely disambiguated, recall will again be 1.0, but precision will be smaller. So the aim is to decrease ambiguity while getting recall and precision as close as possible to 1.0.
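Here is a minimal sketch of these three metrics, assuming each token carries the list of parses remaining after disambiguation together with its correct parse (a token counts as correctly disambiguated when the correct parse is among those remaining).

    # A toy sketch of the ambiguity, recall, and precision metrics above.

    def evaluate(tokens):
        n_tokens = len(tokens)
        n_parses = sum(len(parses) for parses, _ in tokens)
        n_correct = sum(1 for parses, gold in tokens if gold in parses)
        return {
            "ambiguity": n_parses / n_tokens,
            "recall": n_correct / n_tokens,
            "precision": n_correct / n_parses,
        }

    # Two tokens uniquely and correctly disambiguated; one token still
    # holds two parses that include the correct one.
    toy = [(["N"], "N"), (["V"], "V"), (["N", "ADJ"], "N")]
    print(evaluate(toy))
    # {'ambiguity': 1.33..., 'recall': 1.0, 'precision': 0.75}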

Conclusion

The results are satisfactory. They indicate that by combining these hand-crafted, statistical, and learned information sources, a recall of 96 to 97%, with a corresponding precision of 93 to 94% and an ambiguity of 1.02 to 1.03 parses per token, is attained on test texts. However, the impact of the learned rules is not significant, as the hand-crafted rules do most of the easy work in the initial stages [2].

The results are also reasonable when the same experiments are performed on two unseen texts on completely different topics. The recall reached is 93-95%, with a corresponding precision of 90-91% and an ambiguity of 1.03 to 1.04 parses per token [2].

Since recall and precision are both near 1.0, and when compared with studies worldwide, the results are acceptable.


ACKNOWLEDGMENTS

We would like to thank Dilek Hakkani-Tür and Gökhan Tür for providing us their relevant research papers and their ideas; Bilge Say, Ayşenur Birtürk and Çağlar İskender for providing us related information and their feedback.

BIBLIOGRAPHY

[1] Brill, Eric. 1992. A simple rule-based part of speech tagger. In Third Conference on Applied Natural Language Processing.

[2] Tür, Gökhan. 1996. Using Multiple Sources of Information for Constraint-Based Morphological Disambiguation. Master's thesis, Bilkent University, Department of Computer Engineering and Information Science.

[3] Brill, Eric. 1994. Some advances in rule-based part of speech tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, Washington.

[4] Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543-566.

[5] Oflazer, Kemal; Tür, Gökhan. 1997. Morphological Disambiguation by Voting Constraints. In Proceedings of ACL'97/EACL'97, the 35th Annual Meeting of the Association for Computational Linguistics, Madrid, Spain, July 7-12.

[6] Oflazer, Kemal; Tür, Gökhan. 1996. Unsupervised Learning in Constraint-Based Morphological Disambiguation. In Proceedings of EMNLP'96, Conference on Empirical Methods in NLP, Pennsylvania, May 17-18.

[7] Roche, Emmanuel; Schabes, Yves. 1995. Deterministic Part of Speech Tagging With Finite State Transducers.

[8] Hakkani-Tür, Dilek; Oflazer, Kemal; Tür, Gökhan. 2000. Statistical Morphological Disambiguation for Agglutinative Languages. In Proceedings of COLING-2000, the 18th International Conference on Computational Linguistics, August.

[9] Oflazer, Kemal; Kuruöz, İlker. 1994. Tagging and morphological disambiguation of Turkish text. In Proceedings of the 4th Applied Natural Language Processing Conference, pages 144-149. ACL, October.

[10] Tür, Gökhan; Hakkani-Tür, Dilek; Oflazer, Kemal. Name Tagging Using Lexical, Contextual, and Morphological Information.

[11] Jorge, Alipio; Lopez, Alneu de Andrade. Iterative Part of Speech Tagging.


[12] Cutting, Doug; Kupiec, Julian; Pedersen, Jan; Sibun, Penelope. A Practical Part of Speech Tagger.

[13] Ratnaparkhi, Adwait. A Maximum Entropy Model For Part of Speech Tagging.

[14] Guilder, Linda. 1995. Automated Part of Speech Tagging: A Brief Overview.
