[IEEE 2010 International Conference on Recent Trends in Information, Telecommunication and Computing (ITC) - Kerala, India (2010.03.12-2010.03.13)] 2010 International Conference on

SVM Based Part of Speech Tagger for Malayalam

Antony P.J, Santhanu P Mohan, Soman K.P

CEN, Amrita University, Coimbatore, India [email protected]

{[email protected], kp_soman}@cb.amrita.edu

Abstract—This paper presents the building of a part-of-speech (POS) tagger for the Malayalam language using a Support Vector Machine (SVM). POS taggers play an important role in natural language applications such as speech recognition, natural language parsing, information retrieval and information extraction. A supervised machine learning approach to POS tagging requires a large annotated training corpus to tag properly. At the initial stage of POS tagging for Malayalam, the model was trained with a very limited annotated corpus; we then tried to maximize performance with a substantially larger annotated corpus. The objective of this project was to identify the ambiguities in Malayalam lexical items and to develop an efficient and accurate POS tagger. We developed our own tagset for training and testing the POS-tagger generators. The present tagset consists of 29 tags. A corpus of one hundred and eighty thousand words was used for training and testing the accuracy of the tagger generators. We found that the results were more efficient and accurate than those of earlier methods for Malayalam POS tagging.

Keywords— NLP, POS Tagger, Malayalam, Support Vector Machine, Classification

I. INTRODUCTION

Part-of-speech tagging (POS tagging) has a crucial role in different fields of natural language processing (NLP) [1], including machine translation. It is defined as the process of assigning to each word in a sentence a label that indicates the status of that word within some system of categorizing the words of that language according to their morphological and/or syntactic properties. The tagged data can be used in a rule-based machine translation system to improve its accuracy. Taggers [2] can be characterized as rule-based or stochastic. Rule-based taggers use hand-written rules to resolve tag ambiguity. Stochastic taggers are either HMM based, choosing the tag sequence that maximizes the product of word likelihood and tag sequence probability, or cue-based, using decision trees or maximum entropy models to combine probabilistic features. A supervised POS tagging approach requires a large annotated training corpus to tag properly. Here we use a machine learning approach with SVM [7] to tackle the POS tagging problem.

The objective of this project was to identify the ambiguities in Malayalam lexical items and to develop a tagset appropriate for Malayalam. Finally, we succeeded in building an efficient and accurate POS tagger.

II. COMPLEXITY IN MALAYALAM POS TAGGING

In Dravidian languages, and in Malayalam in particular, nouns and verbs are inflected. Nouns are inflected for number and case. Verbs are inflected for tense and can be adjectivalized and adverbialized. As a result, we often need to rely on syntactic function or context to decide whether a particular word is a noun, adjective, adverb or postposition. This leads to the complexity of Malayalam POS tagging.

A noun may be categorized as common, proper or compound. Similarly, a verb may be finite, infinitive or gerund. Other parts of speech are also divided into their own subcategories.

For example, the Malayalam [8] word ‘kaali’ takes different parts of speech in the following sentences:

Avan kaali tozhuthil janichu

Pocket kaali aayi

In the first sentence, the word ‘kaali’ is a noun whereas in the second sentence it is a verb. This is not rare in natural languages, and a large percentage of word-forms are ambiguous. Also, the parts of speech are not just the noun, pronoun, verb and adverb. There are clearly many more categories and sub-categories.

III. PROPOSED TAGSET FOR MALAYALAM

The proposed tagset for the Malayalam language has 29 tags: 5 tags for nouns, 1 tag for pronouns, 7 tags for verbs, 3 for punctuation, 2 for numbers, and 1 each for adjective, adverb, conjunction, echo, reduplication, intensifier, postposition, emphasis, determiner, complementizer and question word. Table I describes the tags in the proposed tagset, with an example of each.

TABLE I TAGSET FOR MALAYALAM

978-0-7695-3975-1/10 $25.00 © 2010 IEEE

DOI 10.1109/ITC.2010.86

IV. SUPPORT VECTOR MACHINE

SVM is a machine learning algorithm for binary classification that has been successfully applied to a number of practical problems, including NLP. In its basic form, an SVM learns a linear hyperplane that separates the set of positive examples from the set of negative examples with maximal margin. This learning bias has proved to have good properties in terms of generalization bounds for the induced classifiers. The linear separator is defined by two elements: a weight vector w and a bias b, which stands for the distance of the hyperplane to the origin. The classification rule of an SVM is:

$$\operatorname{sign}(f(x, w, b)), \qquad f(x, w, b) = w \cdot x + b$$

where x is the example to be classified. In the linearly separable case, learning the maximal margin hyperplane (w, b) can be stated as a convex quadratic optimization problem with a unique solution: minimize $\|w\|$, subject to the constraints (one for each training example):

$$y_i (w \cdot x_i + b) \geq 1$$
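As a minimal sketch of the classification rule and the margin constraints above, the following Python fragment evaluates f(x, w, b) = w · x + b, applies the sign rule, and checks y_i (w · x_i + b) ≥ 1 on a toy data set. The weight vector, bias and two-dimensional points are hypothetical values chosen for illustration; they are not taken from the trained tagger, and no optimization is performed here.

```python
from typing import List, Sequence, Tuple


def decision_value(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Compute f(x, w, b) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(x: Sequence[float], w: Sequence[float], b: float) -> int:
    """Apply sign(f(x, w, b)); a value of exactly 0 is mapped to +1 by convention."""
    return 1 if decision_value(x, w, b) >= 0 else -1


def satisfies_margin(
    examples: List[Tuple[Sequence[float], int]], w: Sequence[float], b: float
) -> bool:
    """Check y_i * (w . x_i + b) >= 1 for every training example (x_i, y_i)."""
    return all(y * decision_value(x, w, b) >= 1 for x, y in examples)


# Hypothetical separator and toy training set (assumptions for illustration).
w = [1.0, 1.0]
b = -3.0
train = [([3.0, 1.0], 1), ([2.0, 2.0], 1), ([1.0, 1.0], -1), ([0.0, 2.0], -1)]

print(satisfies_margin(train, w, b))
print(classify([4.0, 4.0], w, b), classify([0.0, 0.0], w, b))
```

In this toy case every training point lies exactly on the margin, so each constraint holds with equality.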

V. PROPOSED ARCHITECTURE

An algorithm which shows the basic structure of the proposed architecture for POS tagging is defined below.

A. Proposed Algorithm for POS tagging

1 Take input text.
2 Tokenize the input text (Pre-editing).
3 Manual Tagging.
4 Train the corpus.
5 Tagging using SVM.
5.1 Search for the tokens in the lexicon.
5.2 If found, give the appropriate TAG from the lexicon.
5.3 If not found, TAG it with SVM probabilities.
6 Get the tagged output text.
7 Insert those new words in the lexicon.
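Steps 5 to 7 of the algorithm above can be sketched in Python as a lexicon lookup with a fallback for unseen tokens. This is an illustrative sketch under simplifying assumptions: the lexicon maps each word to a single tag (the real lexicon stores several candidate tags per word), the fallback function `always_noun` is a hypothetical stand-in for the SVM classifier, and the tag assignments in the toy lexicon are for illustration only.

```python
from typing import Callable, Dict, List, Tuple


def tag_tokens(
    tokens: List[str],
    lexicon: Dict[str, str],
    svm_fallback: Callable[[List[str], int], str],
) -> List[Tuple[str, str]]:
    """Tag each token from the lexicon, falling back to a classifier if unseen."""
    tagged = []
    for i, token in enumerate(tokens):
        if token in lexicon:
            # Steps 5.1 and 5.2: the token is found, so take its tag from the lexicon.
            tag = lexicon[token]
        else:
            # Step 5.3: the token is unknown, so ask the (stand-in) classifier.
            tag = svm_fallback(tokens, i)
            # Step 7: add the new word to the lexicon. In the paper this happens
            # after manual correction; that check is omitted in this sketch.
            lexicon[token] = tag
        tagged.append((token, tag))
    return tagged


# Hypothetical toy lexicon and fallback (assumptions for illustration).
lexicon = {"kaali": "NN", "janichu": "VF", ".": "DOT"}
always_noun = lambda tokens, i: "NN"

output = tag_tokens(["kaali", "tozhuthil", "janichu", "."], lexicon, always_noun)
print(output)
```

Because the function updates the lexicon in place, a second pass over the same text would resolve every token by lookup alone.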

B. Proposed Architecture for POS tagging

Figure 1 shows the proposed architecture for POS tagging. The architecture consists of different modules based on their functionalities. The functionality of each module is explained briefly as follows.

1) Tokenize: Untagged sentences are downloaded from Malayalam newspapers and commercial websites. We changed the input text into a column format suitable for SVMTool [7], using a blank space as the column separator. The corpus data was tokenized because the input to SVMTool must be in the form of tokens.
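The pre-editing step can be illustrated with a short Python sketch that splits raw text into tokens and writes them in a column layout with a blank space as the column separator. The one-token-per-line layout, the punctuation set, and the function names `tokenize` and `to_columns` are assumptions made for this example; the paper does not give its tokenization rules. The pattern avoids `\w` so that words containing combining vowel signs, as in Malayalam script, are not split.

```python
import re
from typing import List, Optional

# Runs of non-space, non-punctuation characters form word tokens;
# each listed punctuation mark becomes its own token.
_TOKEN_RE = re.compile(r"[^\s.,!?;:]+|[.,!?;:]")


def tokenize(text: str) -> List[str]:
    """Split raw text into word and punctuation tokens."""
    return _TOKEN_RE.findall(text)


def to_columns(tokens: List[str], tags: Optional[List[str]] = None) -> str:
    """Render tokens one per line; with tags, use 'token tag' separated by a space."""
    if tags is None:
        return "\n".join(tokens)
    if len(tags) != len(tokens):
        raise ValueError("tokens and tags must have the same length")
    return "\n".join(f"{token} {tag}" for token, tag in zip(tokens, tags))


tokens = tokenize("Avan kaali tozhuthil janichu.")
print(to_columns(tokens))
```

The same `to_columns` helper covers both the untagged input and the manually tagged training data, which differ only in the presence of the second column.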

Figure 1: Architecture for POS tagging (blocks: Input (Untagged Text), Tokenize, Train data manually, SVM, Lexicon (Dictionary), Features, Merged Model, Output)

2) Manual Tagging: The tokenizing module produces a corpus of untagged tokens. The corpus is then tagged manually using the proposed AMRITA tagset. Initially, around 20,000 words are tagged manually. The following is a sample of the training data:

. <NNPC> <NNPC>

<NN> <ADV> <VF>

. <DOT>

3) Corpus Training: The tagged corpus is trained using SVM (SVMTlearn, a component of SVMTool) [7]. Training with SVM produces a dictionary with a merged model and its feature set (the lexicon). This lexicon contains the tagged Malayalam words together with the probability parameters of their groups of tags.

4) Tagging using SVM: The next step is tagging with the trained SVM model created in the previous step. The remaining pre-edited corpus is given to the SVM (SVMTagger, a component of SVMTool) [7] for tagging, step by step. A word that is not yet available in the lexicon is tagged using the learned probability parameters. After tagging, the output is checked manually and incorrect tags are corrected. Two different tagging schemes, Greedy and Sentence-level, may be used. In the Greedy scheme, each tagging decision is made based on a reduced context. The Sentence-level scheme is based on dynamic programming (the Viterbi algorithm): the function to maximize is the global sentence sum of SVM tagging scores. The tagging direction can be “left-to-right”, “right-to-left”, or a combination of both. Varying the tagging direction changes the results, and combining both directions yields a significant improvement. For every token, each direction assigns a tag with a certain score, and the highest-scoring tag is selected. The proposed POS tagger has a tagged Malayalam corpus of 180,000 words.
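The combination of tagging directions described above can be sketched as follows: each direction proposes a (tag, score) pair per token, and the higher-scoring proposal is kept. This is only an illustration of the selection rule stated in the text, not SVMTagger's implementation; the function name `combine_directions` and all scores are hypothetical.

```python
from typing import List, Tuple

Proposal = Tuple[str, float]  # (tag, score) proposed for one token


def combine_directions(
    left_to_right: List[Proposal], right_to_left: List[Proposal]
) -> List[str]:
    """For each token, keep the tag whose direction gave the higher score."""
    if len(left_to_right) != len(right_to_left):
        raise ValueError("both directions must cover the same tokens")
    chosen = []
    for lr, rl in zip(left_to_right, right_to_left):
        # max() returns the first argument on ties, so left-to-right wins a tie.
        best = max(lr, rl, key=lambda proposal: proposal[1])
        chosen.append(best[0])
    return chosen


# Hypothetical per-token proposals for a two-token sentence.
lr_pass = [("NN", 0.4), ("VF", 0.9)]
rl_pass = [("ADV", 0.7), ("NN", 0.2)]

print(combine_directions(lr_pass, rl_pass))
```

The rule is applied independently per token, so the combined output can mix tags from both passes within one sentence.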

5) SVMTeval: SVMTeval is used to evaluate the performance of the POS tagger system in terms of accuracy. A morphological dictionary is automatically generated at training time. Based on this dictionary results may be presented for different sets of words such as known words vs. unknown words, ambiguous words vs. unambiguous words. A different view of these same results can also be seen from the class of ambiguity perspective i.e. words sharing the same kind of ambiguity may be considered together. Words sharing the same degree of disambiguation complexity, determined by the size of their ambiguity classes, can be grouped.
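The kind of breakdown described above can be sketched in Python by computing accuracy separately for known vs. unknown words and for ambiguous vs. unambiguous words. This is not SVMTeval itself: the function `accuracy_by_group`, the toy lexicon (word mapped to the set of tags seen in training) and the gold and predicted tags are assumptions for illustration.

```python
from typing import Dict, List, Set, Tuple

TaggedToken = Tuple[str, str]  # (token, tag)


def accuracy_by_group(
    gold: List[TaggedToken],
    predicted: List[TaggedToken],
    lexicon: Dict[str, Set[str]],
) -> Dict[str, float]:
    """Return accuracy per group: overall, known, unknown, ambiguous, unambiguous."""
    counts: Dict[str, List[int]] = {}  # group -> [correct, total]

    def record(group: str, is_correct: bool) -> None:
        entry = counts.setdefault(group, [0, 0])
        entry[0] += int(is_correct)
        entry[1] += 1

    for (token, gold_tag), (_, predicted_tag) in zip(gold, predicted):
        is_correct = gold_tag == predicted_tag
        record("overall", is_correct)
        if token in lexicon:
            record("known", is_correct)
            # A known word is ambiguous if training saw it with more than one tag.
            group = "ambiguous" if len(lexicon[token]) > 1 else "unambiguous"
            record(group, is_correct)
        else:
            record("unknown", is_correct)
    return {group: correct / total for group, (correct, total) in counts.items()}


# Hypothetical toy data.
lexicon = {"kaali": {"NN", "VF"}, "janichu": {"VF"}}
gold = [("kaali", "NN"), ("janichu", "VF"), ("tozhuthil", "NN")]
predicted = [("kaali", "VF"), ("janichu", "VF"), ("tozhuthil", "NN")]

results = accuracy_by_group(gold, predicted, lexicon)
print(results)
```

Grouping by the size of the ambiguity class, as the text mentions, would follow the same pattern with `len(lexicon[token])` as the group key.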

VI. EVALUATION AND RESULT

At the beginning, the SVM model has only a limited lexicon, and the system shows low accuracy when POS tagging is performed with it. The size of the training data is then increased by tagging more pre-edited data with the current SVM model and manually correcting the incorrect tags. This process is repeated with more data to increase the size of the lexicon. The resulting system has high accuracy. As the lexicon grew to 180,000 words, the accuracy of the proposed system rose to 94%. Table III shows lexicon size versus accuracy; accuracy increased with the number of words in the lexicon.

TABLE III RESULT OF POS TAGGER

No. of Words in Lexicon    POS Tagger Accuracy
20,000                     63%
100,000                    86%
180,000                    94%

VII. CONCLUSION AND FUTURE WORK

Part-of-speech tagging, the assignment of parts of speech to words in a given context of use, is a basic technique in many systems that handle natural languages. Tags play an important role in natural language applications such as speech recognition, natural language parsing, information retrieval and information extraction. To address the tagging problems of the Malayalam language, we proposed a new machine learning approach to POS tagging. This paper describes a method for supervised training of a part-of-speech tagger using Support Vector Machines on a large corpus of annotated transcriptions for Malayalam. We identified the ambiguities in Malayalam lexical items and developed a tagset appropriate for Malayalam. Finally, we built an efficient and accurate POS tagger model for the Malayalam language. We hope it will be useful in natural language applications such as bilingual machine translation and in many other areas.

REFERENCES

[1] D. Jurafsky and J. H. Martin, Speech and Language Processing, Pearson Education, 2000.

[2] http://en.wikipedia.org/wiki/Part-of-speech_tagging.

[3] Ashfaq Rahman, An Implementation of Brill’s Transformational Tagger using NLTK, CS Department, UPenn (University of Pennsylvania).

[4] H. Schmid, M. Baroni, E. Zanchetta, A. Stein, Introduction to Corpus Resources, Annotation and Access: Tree Tagger, Universities of Stuttgart, Trento and Bologna (Forlì), 2006.

[5] Paul Rayson and Roger Garside, The CLAWS Web Tagger, University of Lancaster, 2007.

[6] http://www2.lingsoft.fi/cgi-bin/engtwol.

[7] Jesús Giménez and Lluís Màrquez, SVMTool: Technical Manual v1.3, August 2006.

[8] A. R. Rajarajavarma, Keralapanineeyam Malayalam Grammar Book, D.C Books, Kottayam.

[9] Pushpak Bhattacharyya, Veena Dixit, Sachin Burange, Marathi POS Tagger, IIT Bombay, 1996.

[10] Dhanalakshmi V, Anand Kumar M, Loganathan R, K.P Soman, Rajendran S, Tamil Part-of-Speech Tagger based on SVMTool, in Proceedings of the COLIPS International Conference on Asian Language Processing 2008 (IALP), Chiang Mai, Thailand, 2008.
