8/7/2019 Ma Language Format Final
Tamil Morphological Analyser
Vijay Sundar Ram R, Menaka S and Sobha Lalitha Devi
AU-KBC Research Centre, MIT Campus of Anna University, Chennai-44
{sundar,menaka,[email protected]}
Abstract. In morphologically rich languages, a word carries more grammatical information because of rich
suffixation. Morphological analysis is the process of segmenting a given word into its component morphemes and
assigning the correct morphosyntactic information. In this paper, we discuss a method for developing a
morphological analyser for Tamil, a morphologically rich Dravidian language. It is designed using the
paradigm-based approach and Finite State Automata, which work efficiently in recursive tasks and consider only
the current state when making a transition. We test the morphological analyser on online web data, and it
performs with a correctness of 91.70%.
1. Introduction. Morphological analysis of a word is the process of segmenting the word into component
morphemes and assigning the correct morphosyntactic information. For a given word, a morphological analyser
(MA) returns its root word and word class, along with other grammatical information that depends on the
word class. An MA returns all possible parses for a given word, without considering the context. An MA is
essential for languages with rich inflectional and derivational morphology, such as the Dravidian languages
(Tamil, Telugu, Malayalam and Kannada), Finno-Ugric languages (Finnish, Estonian, Hungarian), Turkish, and
Indo-Aryan languages (Hindi, Bengali, Marathi, Gujarati). In languages such as French and English, where few
affixes attach to the root word, lemmatization, the process of getting the root word (lemma), serves the
purpose of an MA. An MA is a vital tool in NLP applications. In morphologically rich languages, where multiple
affixes attach to a word, the finer grammatical information that helps in building efficient NLP applications
can be obtained only from an MA. An MA is required in most applications, such as information extraction,
question answering, machine translation and even information retrieval, to get the correct root word.
There have been several approaches to MA. The two-level morphology approach of Kimmo Koskenniemi is one of
the earliest; he tested this formalism on Finnish (Koskenniemi 1983). In this two-level representation, the
surface level describes word forms as they occur in written text, and the lexical level encodes lexical units
such as stems and suffixes. The two-level rules define a mapping between the two levels and are represented
as Finite State Automata. This approach is used for both recognizing and generating word forms. The formalism
has also been used for other languages such as Arabic, Dutch, English, French, German, Italian, Japanese,
Portuguese, Swedish and Turkish (Schulze 1994). A rule-based, heuristic analyser for Finnish nominal and verb
forms was developed by Jäppinen (Jappinen 1983). A word-grammar based morphological analyser for agglutinative
languages, using the two-level formalism and a unification-based formalism, was introduced by Agirre and
colleagues (Itziar 2000), who worked on Basque, a highly agglutinative language. An Arabic finite-state
transducer for morphological analysis was built by Beesley with the Xerox Finite State Transducer (XFST), by
extensively reworking the lexicon and rules of a Kimmo-style system (Beesley 1996). Similarly, using XFST,
Wintner built an MA for Hebrew (Wintner 2005) and Karine built a Persian MA (Karine 2004). Kemal Oflazer
developed a Finite State Machine (FSM) based Turkish MA. For Swahili, Robert Elwell identified the features
present in a word using syllables and surface-level clues (Elwell 2008). A weighted finite-state approach was
used to handle Finnish compound words. A Finite State Automata based MA was developed for Tamil (). For
Bengali, an unsupervised methodology was used to develop an MA (Sajib Dasgupta, 2007), and the two-level
morphology approach was used to handle Bengali compound words. Rule-based MAs were developed for Sanskrit
(Girish Nath Jha 2007) and Oriya (Mohanty 2004).
In this paper, we present a methodology for the morphological analysis of Tamil, a morphologically rich
language, using Finite State Automata (FSA) and the paradigm approach. The remainder of the paper is
organized as follows. Section 2 gives a short description of Tamil morphology, explaining the inflections and
derivations of nouns and verbs. Section 3 describes the orthographic changes that occur in Tamil during
affixation. Section 4 explains our approach to building the Tamil morphological analyser. Section 5 discusses
the experiments done to evaluate the MA, and Section 6 concludes the paper.
2. Tamil morphology. Tamil belongs to the South Dravidian family of languages. It is a verb-final language
with a relatively free word order. It is an inflectional language, and agglutination is another of its
features.
Tamil morphology is characterized as agglutinative or concatenative, i.e., words are formed by successively
adding suffixes to the root word in series. When suffixes attach to the root, several morphophonemic changes
take place. The order in which suffixes attach to a root determines the morphosyntax of the language, and
the various changes that take place when a suffix attaches are called the morphophonemics.
The lexical categories in Tamil and their morphological processes are discussed below.
2.1. Nouns. Nouns form an important lexical category in the language, and they take inflectional as well as
derivational suffixes.
Suffixation to a noun is not arbitrary: the suffixes attach in a particular order. The number suffix attaches
to the noun root, followed by the case suffix. Postpositions follow case. These, in turn, are followed by the
clitics.
The number suffix is null for singular and kaL for plural. In a few cases, the plural suffix is mAr.
After the number suffix, the stem takes the case suffix. Computationally, Tamil has 8 cases. Lehmann
(Lehmann 1989) also classifies the Tamil case system in a similar manner.
INSERT Table 1. HERE
The next suffix in the series may be the disjunction clitic (o:), the coordination clitic (um) or the emphatic
clitic (e:). After the addition of the above suffixes, the emphatic suffix (taan) is added. This may be followed
by the fifth suffix, which can be interrogative (a:) or supposition (a:m).
The Morphosyntax of Noun inflections may be summarized as
root + {number} + {case} + {DISJ/COOR/EMPH} + {PSP} + {EMP} + {INT/SUPP}
The following examples illustrate the inflections of a noun.
INSERT Table 2 HERE.
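The noun template above can be read as an ordered sequence of optional slots. The following is a minimal sketch of such an order check, not the analyser's actual rule file. The function name, the tag PL and the list of case tags (taken from the glosses used in this paper) are assumptions made for illustration.

```python
# Ordered, optional slots of the noun template:
# root + {number} + {case} + {DISJ/COOR/EMPH} + {PSP} + {EMP} + {INT/SUPP}
NOUN_SLOTS = [
    {"PL"},                                             # {number}
    {"ACC", "DAT", "GEN", "LOC", "INS", "SOC", "ABL"},  # {case} (assumed tag set)
    {"DISJ", "COOR", "EMPH"},
    {"PSP"},
    {"EMP"},
    {"INT", "SUPP"},
]


def follows_noun_template(tags):
    """Return True if the suffix tags occur in template order, each slot at most once."""
    slot = 0
    for tag in tags:
        # Skip over optional slots until one accepts the current tag.
        while slot < len(NOUN_SLOTS) and tag not in NOUN_SLOTS[slot]:
            slot += 1
        if slot == len(NOUN_SLOTS):
            return False  # tag is unknown or appears out of order
        slot += 1  # this slot is now used up
    return True


print(follows_noun_template(["PL", "ACC", "COOR"]))  # plural, then case, then clitic
print(follows_noun_template(["ACC", "PL"]))          # case before number
```

Because every slot is optional, a bare root (an empty tag list) is accepted, while a repeated or misplaced tag is rejected.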
Productive suffixes of nouns.
1. The suffixes -ka:ran/-ka:ri denote 'man/woman', as in pa:l-ka:ran (milk-man)/pa:l-ka:ri (milk-woman).
2. The suffixes -an/-i attach to a noun denoting a quality to derive the noun for the man/woman with the quality.
Eg. kuruTu + an -> kuruTan
'blindness' + MAS -> 'blind person(MAS)'
kuruTu + i -> kuruTi
'blindness' + FEM -> 'blind person(FEM)'
3. The suffix -tanam attaches to a noun to show the habit.
Eg. piTiva:Tam + tanam -> piTiva:Tattanam
'stubbornness' + SUFF -> 'habit of stubbornness'
aTimai + tanam -> aTimaittanam
'slave' + SUFF -> 'slavery'
4. The negative suffix -inmai can be added to several nouns to add the meaning of -lessness.
Eg. tu:kkam + inmai -> tu:kkaminmai
'sleep' + SUFF -> 'sleeplessness'
payam + inmai -> payaminmai
'fear' + SUFF -> 'fearlessness'
The derivations that are possible from noun roots are discussed below.
Derivation of verbs from nouns.
1. Certain verbs like aLi/aTi/cey/koTu/paNNu add to a noun to form the corresponding verb. These are quite productive, especially in the case of loan-words.
Eg. ka:pi + aTi -> ka:piyaTi
'copy' + 'beat' -> 'copy'
va:y + aTi -> va:yaTi
'mouth' + 'beat' -> 'chatter'
veLLai + aTi -> veLLaiyaTi
'white' + 'beat' -> 'whitewash'
ca:vi + koTu -> ca:vikoTu
'key' + 'give' -> 'wind up'
kaTan + koTu -> kaTankoTu
'loan' + 'give' -> 'lend'
Derivation of adjectives from nouns
1. The Dravidian suffix for adjective formation is a.
Eg. azaku + a -> azakiya
'beauty' + ADJ -> 'beautiful'
2. The extremely productive suffix a:na (a frozen form reduced from the past tense relative participle a:kiya of the verb a:ku) attaches to almost any noun denoting quality.
Eg. azaku + a:na -> azaka:na
'beauty' + ADJ -> 'beautiful'
ve:kam + a:na -> ve:kama:na
'speed' + ADJ -> 'fast'
3. The bound postposition uLLa/uTaiya attaches to several nouns to produce adjectives.
Eg. azaku + uLLa -> azakuLLa
'beauty' + ADJ -> 'beautiful'
4. The suffix -a:vatu/-a:m is added to cardinal numerals to form the corresponding ordinal adjective.
Eg. iraNTu + a:vatu -> iraNTa:vatu
'two' + ADJ -> 'second'
iraNTu + a:m -> iraNTa:m
'two' + ADJ -> 'second'
5. Place names ending in a: shorten the last vowel to form the corresponding adjective.
Eg. intiya: + tu:tarakam -> intiya tu:tarakam
'India' + 'Embassy' -> 'Indian Embassy'
6. The negative suffix -aRRa adds to nouns to produce an adjective.
Eg. oLi + aRRa -> oLiyaRRa
'light' + 'without' -> 'without light'
Derivation of adverbs from nouns
1. The suffix a:ka is as productive in deriving adverbs from nouns as a:na is in deriving adjectives from nouns. Sometimes a:ka gets reduced to a:y.
Eg. azaku + a:ka -> azaka:ka
'beauty' + ADV -> 'beautifully'
azaku + a:y -> azaka:y
'beauty' + ADV -> 'beautifully'
2. The negative suffixes -inRi/-anRi/-aRRu add to the root noun to produce an adverb.
Eg. paNam + anRi -> paNamanRi
'money' + 'apart' -> 'apart from money'
pa:tuka:ppu + inRi -> pa:tuka:ppinRi
'safety' + 'without' -> 'without safety'
mati + aRRu -> matiyaRRu
'intelligence' + 'without' -> 'without intelligence'
3. The bound postposition e:Rpa adds to the noun in dative case to give an adverb.
Eg. vitikku + e:Rpa -> vitikke:Rpa
'fate-DAT' + 'according' -> 'according to fate'
Compound Nouns. Compound noun formation is productive in Tamil. Compounds may be formed by several
strategies, as explained by Rajendran (Rajendran 2004). The examples below illustrate the abundance of
compound nouns in Tamil.
Eg.
vattam + me:jai -> vatta me:jai
'round' + 'table' -> 'round table'
kuzantai + paruvam -> kuzantaipparuvam
'child' + 'period' -> 'childhood'
marapu + aNu -> marapaNu
'tradition' + 'atom' -> 'gene'
coTTu + ni:r + pa:canam -> coTTu ni:rp pa:canam
'drop' + 'water' + 'irrigation' -> 'drip irrigation'
aTi + vayiRu -> aTivayiRu
'below' + 'stomach' -> 'abdomen'
ka:l + kaTTu -> ka:lkaTTu
'leg' + 'knot' -> 'marriage'
Pronouns. Pronouns in Tamil are a closed set of words. They have person, number and gender (PNG)
information in them.
The following table summarizes the pronouns in Tamil
INSERT Table 3. HERE
Pronouns are a subclass of nouns and hence behave like nouns. The above pronouns inflect by taking the case
suffix and the other suffixes that a noun takes. Since number is inherent in the pronoun, it does not take an
explicit number suffix. The only exception is ivai, which sometimes takes a redundant plural suffix kaL.
2.2. Verbs. Verb forms can be broadly classified into two types.
1. Finite verbs
2. Non-finite verbs
Finite verbs. The verb root takes the tense suffix first, followed by a fused PNG suffix. This can be followed by
any of the clitics.
The Tense can be Past/Present/Future if it is in the affirmative. The negative form does not take tense.
The PNG Suffixes may be as in the table below.
INSERT Table 4. HERE
The Morphosyntax of finite verbs may be summarized as
root + Tense + PNG + {DISJ/COOR/EMPH/EMP/INT/SUPP}
root + INF + NEGVERB + {DISJ/COOR/EMPH/EMP/INT/SUPP }
The following examples illustrate the above morphosyntactic rule.
INSERT Table 5. HERE
The verb root, after taking the tense suffix, may take the relative participle marker a instead of the PNG
suffix to produce the relative participles.
INSERT Table 6. HERE
The above inflections can be summarized as
root + Tense/NEG + RP
Only derivations can happen at this point. One of the pronominal endings avan (3SM) / avaL (3SF) /
avar (3SH) / atu (3SN) / avai (3PN) / avarkaL (3PE) may agglutinate to the relative participle, thus forming
a noun. The word now behaves as a noun and takes noun inflections.
Eg.
paTi + tt + a + avan + o:Tu -> paTittavano:Tu
read + PST + RP + 3SM + SOC -> with the one(MAS) who read
Non-finite verbs. The verb root may directly take one of the following suffixes: Infinitive, Verbal
Participle, Conditional, Concessive, Hortative, Optative. These forms may also have a negative suffix attaching
to the root before taking on these suffixes. Some of these forms take an auxiliary verb like iru/viTu to
produce the negative form.
Eg.
pa:Tu + a -> pa:Ta
'sing' + INF
pa:Tu + a: + a -> pa:Ta:ta
'sing' + NEG + INF
pa:Tu + i -> pa:Ti
'sing' + VBP
pa:Tu + a: + u -> pa:Ta:tu
'sing' + NEG + VBP
pa:Tu + a:l -> pa:Tina:l
'sing' + COND
pa:Tu + a: + viTu + a:l -> pa:Ta:viTTa:l
'sing' + NEG + AUXV + COND
pa:Tu + a:lum -> pa:Tina:lum
'sing' + CONC
pa:Tu + a: + viTu + a:lum -> pa:Ta:viTTa:lum
'sing' + NEG + AUXV + CONC
pa:Tu + ala:m -> pa:Tala:m
'sing' + HORT
pa:Tu + a: + iru + ala:m -> pa:Ta:tirukkala:m
'sing' + NEG + AUXV + HORT
pa:Tu + aTTum -> pa:TaTTum
'sing' + OPT
pa:Tu + a: + iru + aTTum -> pa:Ta:tirukkaTTum
'sing' + NEG + AUXV + OPT
The Morphosyntax of Non-finite Verbs may be summarized as
root + {NEG} + INF/VBP/COND/CONC/HORT/OPT + {DISJ/COOR/EMPH} + {EMP} + {INT/SUPP}
Derivation of nouns from verbs
1. From the relative participle (RP) forms, nouns can be derived by the pronominalisation process, i.e., one of the pronominal suffixes avan, avaL, avar, avarkaL, atu, avai attaches to the RP form to produce a noun.
This is very productive.
2. The suffix -kai is added to some verbs to produce nouns.
Eg. cey + kai -> ceykai
'do' + SUFF -> 'act'
3. The suffix -tal is added to several verbs to form the corresponding noun.
Eg. paRa + tal -> paRattal
'fly' + SUFF -> 'flying'
makiz + tal -> makiztal
'enjoy' + SUFF -> 'enjoying'
vaLar + tal -> vaLartal
'grow' + SUFF -> 'growing'
4. The suffix -pu is added to some verbs to form the corresponding noun.
Eg. vaLar + pu -> vaLarppu
'grow' + SUFF -> 'bringing up'
ninai + pu -> ninaippu
'think' + SUFF -> 'thought'
5. One of the suffix sequences -v-atu/-p-atu/-pp-atu is added to any verb to denote the action denoted by the verb.
Eg. cey + v + atu -> ceyvatu
'do' + FUT + SUFF -> 'doing'
Derivation of adjectives from verbs
1. Suffixes like -ataRka:na / -avaRRukka:na / ataRkuriya / atarke:RRa / takka are actually frozen forms of agglutinating words that can add to a verb root to form an adjective.
For instance, if we consider ataRka:na,
it is atu + ku + a:na -> ataRka:na
'that' + DAT + ADJ ->'for that'
Now this can attach to a verb
cey + ataRka:na -> ceyvataRka:na
'do' + 'for that' -> 'for doing'
Derivation of adverbs from verbs
1. Suffixes like a:Rpo:la / ava:Ru / a:ka / ma:Ru / a:mal and certain postpositions like anRi add to verbs to form adverbs.
Eg. paTi + tt + a:Rpo:la -> paTitta:Rpo:la
'read' + PST + ADV -> 'as though ... was reading'
paTi + ma:Ru -> paTikkuma:Ru
'read' + ADV -> 'to read'
paTi + a:mal -> paTikka:mal
'read' + NEG -> 'without reading'
2. enRu and ena are synonymous complementizers which add to reduplicating onomatopoeic forms to form adverbs.
Eg. kalakala + enRu -> kalakalavenRu
ONOM + ADV -> 'happily'
kalakala + ena -> kalakalavena
ONOM + ADV -> 'happily'
2.3. Inflections of other categories. In Tamil, the other POS categories do not inflect, but they take the clitics that nouns and verbs take.
Hence for any other category apart from the ones discussed above, the morphosyntax is
root + {DISJ/COOR/EMPH} + {EMP} + {INT/SUPP}
Agglutination. Agglutination is a feature of the Tamil language. Due to the highly agglutinating nature of this
language and the morphophonemic variations that take place at the point of agglutination, it is very difficult to
mark the word boundaries.
Eg. arapi + katal + in + araci -> arapikkatalinaraci
'Arabian' + 'sea' + GEN + 'queen' -> 'Queen of the Arabian Sea'
paTi + ttu + koL + NTu + iru + t + a + avan + ai -> paTittukkoNTirutavanai
'read' + VBP + AUXV + VBP + AUXV + PST + RP + PRON-3SM + ACC -> 'the one(MAS) who was
reading(OBJ)'
3. Orthographic Rules in Tamil. "The ways in which the morphemes of a given language are variously
represented by phonemic shapes can be regarded as a kind of code. This code is the orthographic system of the
language." (Hockett 1958:135). It is also known as Internal Sandhi.
The orthographic rules in Tamil given below were arrived at from the works of Pope (Pope 1979) and tamiz
(Subramanian and Gnanasundaram 2001).
1. When the root word ends in a vowel and the attaching suffix begins with any vowel, the glide v or y is added depending on the following rules.
INSERT Table 7 HERE.
2. When the root word ends in one of the long close vowels (i:/u:) and the attaching suffix/word begins with one of the stop consonants k/c/T/t/p/R, the stop consonant doubles at the end of the root.
Eg. i: + kaL -> i:kkaL
'fly' + PL -> 'flies'
3. When the root word is of two syllables, with a short first syllable, and ends in u, and the attaching suffix/word begins with one of the stop consonants k/c/T/t/p/R, the stop consonant doubles at the end of the root. In all other cases where the root word ends in u and the attaching suffix/word begins with one of the stop consonants k/c/T/t/p/R, there is no change.
Eg. maTu + kaL -> maTukkaL
'hillock' + PL -> 'hillocks'
4. When the root word ends in TTu/ttu and the suffix starts with k/c/T/t/p/R, there is no change.
Eg. pa:TTu + kaL -> pa:TTukaL
'song' + PL -> 'songs'
5. When the root word ends in the labial nasal m, and the attaching suffix/word begins with one of the stop consonants k/c/T/t/p/R, the m is replaced by the homorganic nasal of the stop consonant.
Eg. maram + kaL -> marakaL
'tree' + PL -> 'trees'
6. When the root word ends in the labial nasal m, and the attaching suffix begins with a vowel, the m is replaced by the oblique suffix tt.
Eg. maram + ai -> marattai
'tree' + ACC -> 'tree(OBJ)'
7. When the root word has a short single syllable and ends with the nasal n, and the attaching suffix/word begins with a vowel, the n doubles.
Eg. pon + iliruntu -> ponniliruntu
'gold' + ABL -> 'from gold'
8. When the root ends in the velar nasal, and the attaching suffix starts with a vowel, the homorganic stop consonant is added in between.
Eg. manmo:hanci + a:l -> manmo:hancika:l
'Manmohan Singh'+INS -> 'by Manmohan Singh'
9. When the root word has a short single syllable ending in a glide (y/v), and the attaching suffix starts with a vowel, the glide doubles. Native words do not end in v.
Eg. poy + ai -> poyyai
'lie' + ACC -> 'lie(OBJ)'
lav + a:l -> lavva:l
'love' + INS -> 'by love'
10. When the root word has a short single syllable and ends with a lateral (l/L), or if the root word is another word with such a word at the end, and the attaching suffix/word begins with one of the stop consonants k/c/T/t/p/R, the lateral may be replaced with the homorganic stop consonant.
Eg. kal + kaL -> kaRkaL
'stone' + PL -> 'stones'
cekal + kaL -> cekaRkaL
'brick' + PL -> 'bricks'
poruL + kaL -> poruTkaL
'thing' + PL -> 'things'
11. When the root word has a short single syllable and ends with a lateral (l/L), and the attaching suffix starts with a vowel, the lateral doubles.
Eg. kaL + il -> kaLLil
'toddy' + LOC -> 'in toddy'
pal + il -> pallil
'tooth' + LOC -> 'in the tooth'
12. When the root ends in a stop consonant k/c/T/t/p/R/j, and the attaching suffix starts with a vowel, the consonant doubles.
Eg. maik + ai -> maikkai
'mike' + ACC -> 'mike(obj)'
Tip + il -> Tippil
'tip' + LOC -> 'in the tip'
jeT + il -> jeTTil
'jet' + LOC -> 'in the jet'
ko:c + o:tu -> ko:cco:tu
'coach' + SOC -> 'with the coach'
pert + ukku -> perttukku
'berth' + DAT -> 'to the berth'
haj + ukku -> hajjukku
'Haj' + DAT -> 'to Haj'
But when the stop consonant is p, and it is preceded by the modifier H to denote the labiodental
fricative, there is no doubling.
Eg. vulHp + in -> vulHpin
'wolf' + GEN -> 'wolf's'
When the root is a loan-word that ends in a stop consonant which is voiced in the source language and is
preceded by a long vowel, and the attaching suffix starts with a vowel, there is no change.
Eg. la:lpa:k + il -> la:lpa:kil
'Lalbagh' + LOC -> 'in Lalbagh'
vik + ai -> vikkai
'wig' + ACC -> 'wig(OBJ)'
13. When the root ends in a stop (k/c/T/t/p/R) preceded by its homorganic nasal, and the attaching suffix starts with a vowel, there is no change.
Eg. vik + il -> vikil
'wing' + LOC -> 'in the wing'
pec + il -> pecil
'bench' + LOC -> 'on the bench'
pat + a:l -> pata:l
'bandh' + INS -> 'due to bandh'
14. In cases where the root ends in a sibilant (s/sh) preceded by a short vowel, the sibilant doubles.
Eg. push + in -> pushshin
'Bush' + GEN -> 'Bush's'
pas + il -> passil
'bus' + LOC -> 'on the bus'
In other cases where the ending sibilant is not preceded by a short vowel, the suffix can attach without
any change. Sometimes, we observe that the s is replaced with c.
Eg. pars + il -> parcil /parsil
'purse' + LOC -> 'in the purse'
When the attaching suffix starts with a consonant, there may be no change or the s may change to cu.
Eg. ke:s + kaL -> ke:skaL/ke:cukaL
'case' + PL -> 'cases'
Some of the rules above behave differently depending on whether the final consonant of a loan-word is voiced
or voiceless in the source language. It is not directly possible to encode this information in the rules, so
the default rule for the particular consonant ending is applied.
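A few of the rules above (2, 3, 4, 6, 7, 9 and 11) can be sketched as a joining function. This is an illustrative simplification and not the implementation used in the analyser: syllables are approximated by counting vowels in the transliteration, a short syllable is taken to be a single vowel without the length mark, and the names `join`, `vowels` and `is_short_monosyllable` are hypothetical. Any combination not covered by these rules falls through to plain concatenation.

```python
import re

# Vowels of the transliteration: diphthongs, or a single vowel with optional length mark ':'.
VOWEL = re.compile(r"ai|au|[aeiou]:?")
STOPS = ("k", "c", "T", "t", "p", "R")


def vowels(word):
    """Return the vowel nuclei of a transliterated word, in order."""
    return VOWEL.findall(word)


def is_short_monosyllable(root):
    """True if the root has exactly one vowel and that vowel is short."""
    v = vowels(root)
    return len(v) == 1 and len(v[0]) == 1


def join(root, suffix):
    """Attach a non-empty suffix to a root, applying a subset of the sandhi rules."""
    if suffix[0] in "aeiou":
        if root.endswith("m"):                         # rule 6: m -> tt
            return root[:-1] + "tt" + suffix
        if root[-1] in "nyvlL" and is_short_monosyllable(root):
            return root + root[-1] + suffix            # rules 7, 9, 11: final consonant doubles
    elif suffix.startswith(STOPS):
        if root.endswith(("i:", "u:")):                # rule 2: stop doubles
            return root + suffix[0] + suffix
        if root.endswith(("TTu", "ttu")):              # rule 4: no change
            return root + suffix
        v = vowels(root)
        if root.endswith("u") and len(v) == 2 and len(v[0]) == 1:
            return root + suffix[0] + suffix           # rule 3: stop doubles
    return root + suffix                               # default: plain concatenation


for root, suffix in [("maram", "ai"), ("pon", "iliruntu"), ("i:", "kaL"), ("pa:TTu", "kaL")]:
    print(root, "+", suffix, "->", join(root, suffix))
```

The rule order matters: the TTu/ttu exception is tested before the general two-syllable rule so that rule 4 takes precedence over rule 3.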
4. Our Approach. In this approach, we build an FSA using all possible suffixes, categorize the root word lexicon
based on the paradigm approach to optimize the number of orthographic rules, and use morphosyntax rules to get the
correct analysis for the given word. An FSA is used because the analysis of the word is done suffix by suffix.
FSA are a proven technology for efficient and speedy processing.
When applying the formalism of two-level morphology to morphologically rich languages, there are
some well-known limitations, such as
1, developing finite-state transducers that encode very complex two-level rules is not easy.
2, morphological categories are not directly encoded as a part of the lexical form.
3, lexical representation tends to be arbitrary.
4, the various diacritical features inserted into the lexical strings to ensure proper analysis make the
Kimmo-style system awkward or impractical for generation (Beesley 1996).
In our approach, the complex affixations are easily handled by the FSA, and the required
orthographic changes are handled in every state.
Our MA consists of three major components
1, Finite State Automata, modeled using all possible suffixes (allomorphs).
2, lexicon, categorized based on the paradigm approach.
3, morphosyntax rules, for filtering the correct parse of the word.
4.1. Finite State Automata (FSA). An FSA is a model of behavior composed of a finite number of states and
transitions between these states. It is an abstract device used for recognizing simple syntactic structures or
patterns. An automaton is normally depicted by a directed graph, called a State Diagram, and it can also be
represented in tabular form as a State Table. An FSA as a string processing device accepts strings as input and
decides whether the structure is correct, that is, it either accepts or rejects the string. From a mathematical
perspective it is regarded as a function mapping a set of strings to the set {Accept, Reject}. Based on their
transitions, FSA are classified as non-deterministic FSA (NDFSA) and deterministic FSA (DFSA).
The requirements of a DFSA are
1, there are no transitions on the empty string ε (no null transitions).
2, no state has two outgoing transitions on the same symbol.
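The two requirements can be checked mechanically on a transition table. The sketch below is illustrative only: it assumes transitions are given as (source state, symbol, target state) triples and that the empty string stands for ε; the function name is hypothetical.

```python
EPSILON = ""  # assumed encoding of the null (ε) symbol


def is_deterministic(transitions):
    """Check the two DFSA requirements on (source, symbol, target) triples."""
    targets = {}
    for src, symbol, dst in transitions:
        if symbol == EPSILON:
            return False  # requirement 1: no null transitions
        key = (src, symbol)
        if targets.get(key, dst) != dst:
            return False  # requirement 2: two different targets for one symbol
        targets[key] = dst
    return True


print(is_deterministic([(0, "kaL", 1), (0, "atu", 0)]))
print(is_deterministic([(0, "kaL", 1), (0, "kaL", 0)]))
```

A transition that is merely listed twice with the same target is not treated as a violation, since it does not give the state two distinct moves on the same symbol.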
Modeling of Suffix based FSA. The FSA is modeled using all possible suffixes, i.e., all the allomorphs, where
allomorphs are the different morphs through which one morpheme is manifested in different environments. Eg:
th, nth, in, i are the allomorphs of the past tense marker.
Here the FSA is built by considering the suffixes from right to left, i.e., moving from the end of the
word towards the root word. Our implementation differs from other finite-state MAs in that the suffixes are the
symbols that trigger the transitions. After determinisation, the FSA reduces to two states. The first suffixes,
those affixed immediately to the root word, trigger the transition from state 0 to state 1. The other suffixes,
which are affixed to the first suffix, form a self-loop at state 0. A sample State Table is shown in Table 8 and
a sample State Diagram is shown in Figure 1.
INSERT Table 8. HERE
INSERT Figure 1. HERE
The word is parsed in the FSA suffix by suffix, from the last suffix to the first. Whenever
a transition is triggered by a suffix, that suffix is stripped from the word and the required orthographic
corrections are made.
Orthographic Rules in FSA. Orthographic rules are the spelling rules used to model the changes that occur in a
word, usually when morphemes are combined (Jurafsky 2000). The characters that are deleted from the root
word or from the preceding suffix when a suffix (allomorph) is affixed are stored after the suffix in the state
table. An example is given below.
0 0 atu a
Consider the word makanuTaiyatu, which has two suffixes, uTaiya and atu. When the word
is parsed in the FSA, the last suffix atu is identified first. It triggers a transition to the same state; the
suffix is stripped from the current word and the orthographic correction character a is added. The remaining
word makanuTaiya is then parsed further.
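One stripping step can be sketched as follows. The reading of a state-table row as source state, target state, suffix, correction characters and an optional root category is inferred from the sample rows in this section; the names `Transition`, `parse_row` and `strip_suffix` are hypothetical, and the sketch is not the authors' code.

```python
from typing import NamedTuple, Optional


class Transition(NamedTuple):
    src: int                          # state before the suffix is consumed
    dst: int                          # state after the suffix is consumed
    suffix: str                       # suffix (allomorph) that triggers the transition
    restore: str                      # characters to add back after stripping
    root_class: Optional[str] = None  # root category, only on rows leading to the end state


def parse_row(row):
    """Parse a whitespace-separated state-table row such as '0 0 atu a'."""
    fields = row.split()
    restore = fields[3] if len(fields) > 3 else ""
    root_class = fields[4] if len(fields) > 4 else None
    return Transition(int(fields[0]), int(fields[1]), fields[2], restore, root_class)


def strip_suffix(word, transition):
    """Strip the suffix and apply the orthographic correction; None if it does not apply."""
    if len(word) <= len(transition.suffix) or not word.endswith(transition.suffix):
        return None
    return word[: -len(transition.suffix)] + transition.restore


row = parse_row("0 0 atu a")
print(strip_suffix("makanuTaiyatu", row))  # strips atu, then restores the deleted a
```

Returning `None` when the suffix does not match lets a caller try the next row of the table without raising an error.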
Root Information in FSA. In the state table, for the suffixes that are affixed directly to the root word
(those leading to the end state), the category of the root is added after the orthographic correction
characters. A sample from the state table is given below.
0 1 kaL m N13
Consider the word marakaL, where kaL is the suffix added to the root word. When this word is parsed
through the FSA, the suffix kaL triggers the transition from state 0 to state 1; the suffix is stripped from
the current word and the orthographic correction is made. The remaining word maram is looked up in the
indicated category of the root word lexicon. If it matches an entry there, this parse is considered a valid
parse for the input word.
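Putting the self-loop rows and the root rows together, the right-to-left analysis can be sketched as a small recursive search. This is a toy illustration under stated assumptions, not the actual system: the first two rows are the samples from this section, while the row for uTaiya, the category label N1 and the tiny lexicon are hypothetical.

```python
# (source state, target state, suffix, correction characters, root category)
TABLE = [
    (0, 0, "atu", "a", None),    # sample row from the text
    (0, 1, "kaL", "m", "N13"),   # sample row from the text
    (0, 1, "uTaiya", "", "N1"),  # hypothetical row, for illustration only
]

# Hypothetical root lexicon, grouped by paradigm category.
LEXICON = {"N13": {"maram", "paTam", "varam"}, "N1": {"makan"}}


def analyse(word, state=0, suffixes=()):
    """Return every (root, category, suffixes) parse found by stripping suffixes from the end."""
    parses = []
    for src, dst, suffix, restore, root_class in TABLE:
        if src != state or len(word) <= len(suffix) or not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)] + restore  # strip the suffix, apply the correction
        found = (suffix,) + suffixes           # keep suffixes in left-to-right order
        if dst == 1:
            # End state: the remaining stem must be a root of the stated category.
            if stem in LEXICON.get(root_class, ()):
                parses.append((stem, root_class, found))
        else:
            parses.extend(analyse(stem, dst, found))
    return parses


print(analyse("marakaL"))
print(analyse("makanuTaiyatu"))
```

Because every matching row is explored, the function returns all parses the table allows, which matches the requirement that an MA report every possible analysis without using context.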
4.2. Lexicon paradigm based approach. In the paradigm approach, we divide the root words into groups such that
every word in a group undergoes similar orthographic changes (sandhi changes) when a suffix is
added to it.
Consider the words paTam and varam. When these two words are inflected with the plural marker kaL, the
last character m is deleted in both words and kaL is added, forming paTakaL and varakaL. As
these two words show the same orthographic changes, they are grouped under the same paradigm.
In our task, we have categorized nouns into 36 paradigms and verbs into 34 paradigms. The lexicon has
44055 root words.
Apart from the root word lexicon, a suffix list with the suffixes and their corresponding syntactic
information is used, as the MA has to assign the correct morphosyntactic information to the component morphemes.
4.3. Morphosyntax Rules. These are a set of rules that specify which classes of morphemes can follow other
classes of morphemes inside a word. For example, the plural marker can occur only immediately after the noun
root word, and it can be followed by a case marker or a clitic. This set of rules filters out the correct
parses of the word from the output of the FSA. We have 286 such rules.
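Rules of this kind can be pictured as a set of permitted adjacent class pairs, used to discard candidate parses from the FSA. The handful of pairs and the class labels below are hypothetical stand-ins for the 286 rules, chosen only to mirror the plural-marker example; the function names are likewise illustrative.

```python
# Permitted (class, following class) pairs; a hypothetical subset for illustration.
ALLOWED = {
    ("N", "PL"),
    ("N", "CASE"),
    ("N", "CLITIC"),
    ("PL", "CASE"),
    ("PL", "CLITIC"),
    ("CASE", "CLITIC"),
}


def is_valid(classes):
    """True if every adjacent pair of morpheme classes is permitted."""
    return all(pair in ALLOWED for pair in zip(classes, classes[1:]))


def filter_parses(parses):
    """Keep only the candidate parses that satisfy the morphosyntax rules."""
    return [parse for parse in parses if is_valid(parse)]


candidates = [["N", "PL", "CASE"], ["N", "CASE", "PL"]]
print(filter_parses(candidates))
```

The second candidate is dropped because a plural marker after a case marker is not among the permitted pairs.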
Handling of Compound Words. In morphologically rich and productive languages like Tamil, compound words
occur frequently. In a compound word, only the last component word is inflected. We have handled this
as follows.
Step 1: Parse the suffixes from the last suffix to the first suffix in the word, and check for the root
word in the given category in the FSA.
Step 2: If the root word is not matched, go to step 3.
Step 3: Split the root word based on syllables and check the parts against the root dictionary.
Step 4: Once a word is matched, split the remaining part of the word similarly and compare it with
the root dictionary.
Step 5: If the complete root word is matched as a sequence of root words in the dictionary, these multiple
words are given as the root, together with the suffix information, as the analysis.
Step 6: If the complete root is not matched even after splitting into multiple words, the analysis is given
as unknown word.
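Steps 3 to 6 amount to segmenting the leftover root into dictionary words. The sketch below is a simplified illustration: it tries split points at every character position rather than at syllable boundaries, it ignores sandhi changes between the components, and the function name and the small dictionary are hypothetical.

```python
def split_compound(root, dictionary):
    """Split a root into dictionary words; return [] if no complete split exists."""
    # Try the longest first component first, so an undivided root wins if it is known.
    for end in range(len(root), 0, -1):
        head = root[:end]
        if head not in dictionary:
            continue
        if end == len(root):
            return [head]
        rest = split_compound(root[end:], dictionary)
        if rest:
            return [head] + rest
    return []  # corresponds to the "unknown word" outcome of step 6


ROOTS = {"maN", "me:Tu", "iravu", "pakal"}  # hypothetical dictionary entries
print(split_compound("maNme:Tu", ROOTS))
print(split_compound("iravupakal", ROOTS))
```

An empty result signals that the root could not be covered completely by dictionary words, in which case the analysis would be reported as an unknown word.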
The other form is the inflected verb agglutinated with a pronoun, which can itself be inflected, such as
vatavan -> va: + t + a + avan
come+root past RP pronoun
Here the relative participle verb vata is agglutinated with avan, a pronoun. We have handled this by
having a separate rule in the morphosyntax rule file.
Another form is the agglutination of an inflected verb with the verb illai (negation): illai agglutinates with
the infinitive verb, forming one word, such as
varavillai -> va: + a + illai
Come+root inf negative verb
This is handled in the same way as the previous example, by adding a separate rule.
5. Evaluation. We have evaluated the system with two sets of web data: the first set consists of words
collected from the general domain and the second of words collected from the tourism domain. The details of
the evaluation are shown in Table 9.
INSERT Table 9. HERE
The tourism documents have more compound words and more agglutination of words. In this domain,
there are also more named entities, such as person names, place names and area-specific words. The sentences
commonly end with a:kum, a copula verb. This verb is agglutinated to the preceding noun phrase, such as
u:ra:kum -> u:r + a:kum
place + copula verb
Similarly there are more compound nouns, such as
maNme:TukaLuTaiya -> maN+me:Tu + kaL + uTaiya
sand dune pl genitive
Compound root suffix
Consider the word maNme:TukaLuTaiya, in which kaL and uTaiya are the two suffixes. After removing the
suffixes, the remainder is maNme:Tu, which does not match the root word dictionary. The word is split and
compared with the root word list, and maN and me:Tu are found to be the two root words forming maNme:Tu. Similarly
iravupakalai -> iravu+pakal + ai
night+day accusative
Compound root Suffix
teyvacceyalil -> teyvam+ceyal + il
Compound root Suffix
Periodic updating of the root word lexicon will help in improving the performance of the system.
6. Conclusion. This paper describes the design and development of a Tamil morphological analyser using Finite
State Automata and the paradigm approach. The complex suffixation is effectively handled using the FSA. The
system performs at an average precision of 91.70%.
References
Beesley, Kenneth R. 1996. Arabic Finite-State Morphological Analysis and generation. Proceedings of the 16th
International Conference on Computational Linguistics, Vol.1.Copenhagen, Denmark. 89-94.
Elwell, Robert., Jason Baldridge. 2008. Using Syllables as Features in Morpheme Tagging in Swahili.
Proceedings of the Fifth Midwest Computational Linguistics Colloquium, East Lansing.
Itziar Aduriz, Eneko Agirre, Izaskun Aldezabal, Inaki Alegria, Xabier Arregi, Jose Maria Arriola, Xabier
Artola, Koldo Gojenola Galletebeitia, Montse Maritxalar, Kepa Sarasola, Miriam Urkia. 2000. A word-grammar
based morphological analyzer for agglutinative languages. In Proceedings of COLING'2000. 1-7.
Jäppinen, H., Lehtola, A., Nelimarkka, E. and Ylilammi, M. 1983. Morphological Analysis of Finnish: A
Heuristic Approach. Report B26, Helsinki University of Technology, Digital Systems Laboratory, Helsinki,
Finland.
Jurafsky, Daniel and James H. Martin. 2000. Speech and Language processing. Prentice Hall.
Hockett, Charles F. 1958. A course in modern linguistics. New York: Macmillan.
Girish Nath Jha, Muktanand Agarwal, Subash, Sudhir K Mishra, Diwakar Mishra, Manji Bhadra, Surjit K
Singh. 2007. Inflectional Morphology for Sanskrit. In Proceedings of First International Symposium on
Sanskrit Computational Linguistics. 46-77.
Koskenniemi, Kimmo. 1983. Two-Level Morphology: A General Computational Model for Word-Form
Recognition and Production. Publication No. 11. Helsinki: Department of General Linguistics, University of
Helsinki.
Lehmann, T. 1989. A Grammar of Modern Tamil, Pondicherry: Pondicherry Institute of Linguistics and Culture.
Megerdoomian, Karine. 2004. Finite-State morphological analysis of Persian. In Proceedings of the Workshop
on Computational Approaches to Arabic Script-based Languages. Coling 2004, University of Geneva,
Switzerland.
Mohanty, S., Santi, P.K., Adhikary, K.P.D. 2004. Analysis and Design of Oriya Morphological Analyser: Some
Tests with OriNet. In Proceeding of symposium on Indian Morphology, phonology and Language Engineering,
IIT Kharagpur.
Pope, G. U. 1904. A handbook of the Tamil language. 7th ed. New Delhi, First published Oxford. Asian
Educational Services, 1979.
Sajib Dasgupta, Vincent Ng. 2007. Unsupervised morphological parsing of Bengali. Language Resources and
Evaluation 40:3-4, pp 311-330
Sajib Dasgupta, Dewan Shahriar Hossain Pavel, Asif Iqbal Sarkar, Naira Khan and Mumit Khan., 2005.
Morphological Analysis of Inflecting Compound Words in Bangla, Proc. 8th International Conference on
Computer & Information Technology (ICCIT), Islamic University of Technology (IUT), Dhaka, Bangladesh.
Schulze, B. M. et al. 1994. DECIDE Designing and Evaluating Extraction Tools for Collocations in Dictionaries
and Corpora. MLAP Project 93- 19.
Viswanathan, S., Ramesh Kumar, S., Kumara Shanmugam, B., Arulmozi, S. and Vijay Shanker, K. (2003). A
Tamil Morphological Analyser, Proceedings of the International Conference On Natural language processing
ICON 2003, Central Institute of Indian Languages, Mysore, India, pp. 31-39.
Yona, S. and Wintner, S. 2005. A finite-state morphological grammar of Hebrew. In Proceedings of the ACL-
2005 Workshop on Computational Approaches to Semitic Languages, Ann Arbor.
Table 1. Tamil Case System.
Case          Case Suffix
Nominative    (none)
Accusative    ai
Dative        ku
Instrumental  a:l
Sociative     o:Tu/uTan
Locative      il/iTam
Ablative      iliruntu
Genitive      in/atu/uTaiya
Table 2. Inflections of a noun.
Root          Number  Case          Postposition  Clitic     Word
paiyan 'boy'  SG      NOM           -             -          paiyan 'boy'
paiyan 'boy'  SG      ai ACC        -             -          paiyanai 'boy(OBJ)'
paiyan 'boy'  SG      ku DAT        -             -          paiyanukku 'to the boy'
paiyan 'boy'  SG      a:l INS       -             -          paiyana:l 'by the boy'
paiyan 'boy'  SG      o:Tu SOC      -             -          paiyano:Tu 'with the boy'
pai 'bag'     SG      il LOC        -             -          paiyil 'in the bag'
pai 'bag'     SG      iliruntu ABL  -             -          paiyiliruntu 'from the bag'
pai 'bag'     SG      in GEN        -             -          paiyin 'bag's'
paiyan 'boy'  SG      ai ACC        -             e: EMPH    paiyanaiye: 'the boy(OBJ) himself'
paiyan 'boy'  SG      ai ACC        -             a: INT     paiyanaiya: 'the boy(OBJ)?'
paiyan 'boy'  SG      ukku DAT      -             ta:n EMPH  paiyanukkuta:n 'it is for the boy'
paiyan 'boy'  SG      ukku DAT      ku:Ta PSP     ta:n EMPH  paiyanukkukku:Tata:n 'it is also for the boy'
paiyan 'boy'  kaL PL  NOM           -             -          paiyankaL 'boys'
paiyan 'boy'  kaL PL  ai ACC        -             -          paiyankaLai 'boys(OBJ)'
paiyan 'boy'  kaL PL  ku DAT        -             -          paiyankaLukku 'to the boys'
paiyan 'boy'  kaL PL  a:l INS       -             -          paiyankaLa:l 'by the boys'
paiyan 'boy'  kaL PL  o:Tu SOC      -             -          paiyankaLo:Tu 'with the boys'
pai 'bag'     kaL PL  il LOC        -             -          paikaLil 'in the bags'
pai 'bag'     kaL PL  iliruntu ABL  -             -          paikaLiliruntu 'from the bags'
pai 'bag'     kaL PL  in GEN        -             -          paikaLin 'of the bags'
paiyan 'boy'  kaL PL  ai ACC        -             e: EMPH    paiyankaLaiye: 'the boys(OBJ) themselves'
paiyan 'boy'  kaL PL  ai ACC        -             a: INT     paiyankaLaiya: 'the boys(OBJ)?'
paiyan 'boy'  kaL PL  ukku DAT      -             ta:n EMPH  paiyankaLukkuta:n 'it is for the boys'
paiyan 'boy'  kaL PL  ai ACC        viTa PSP      a: INT     paiyankaLaiviTava: 'than boys(OBJ)?'
Table 3. Pronouns in Tamil.
               Non-neuter                                               Neuter
               Singular    Plural                  Honorific      Singular  Plural
First Person   a:n 'I'     a:m 'We' (inclusive)    -              a:n 'I'   a:m 'We' (inclusive)
                           a:kaL 'We' (exclusive)                           a:kaL 'We' (exclusive)
Second Person  i: 'You'    i:kaL/i:vir 'You'       i:kaL 'You'    i: 'You'  i:kaL 'You'
Third Person   avan 'He'   avarkaL 'They'          avar 'He/She'  atu 'It'  avai 'Those'
               avaL 'She'
Table 4. PNG in Tamil
Person  Number              Gender              PNG Suffix
First   Singular            Masculine/Feminine  -e:n
First   Plural              Masculine/Feminine  -o:m
Second  Singular            Masculine/Feminine  -a:y
Second  Plural              Masculine/Feminine  -i:rkaL
Second  Singular Honorific  Masculine/Feminine  -i:r
Third   Singular            Masculine           -a:n
Third   Singular            Feminine            -a:L
Third   Plural              Masculine/Feminine  -a:rkaL
Third   Singular Honorific  Masculine/Feminine  -a:r
Third   Singular            Neuter              -atu
Third   Plural              Neuter              -ana
Table 5. Inflections of verbs.
Root         Tense/Inf+NEG          PNG      Clitics  Example
paTi 'read'  tt PST                 a:n 3SM  -        paTitta:n '(He) read'
paTi 'read'  kkiR PRE               a:L 3SF  -        paTikkiRa:L '(She) is reading'
paTi 'read'  um FUT                 3SN      -        paTikkum '(It) will read'
paTi 'read'  tt PST                 a:n 3SM  a: INT   paTitta:na:? 'Did (he) read?'
paTi 'read'  a + illai INF+NEGVERB  -        a: INT   paTikkavillaiya:? 'Did not read?'
Table 6. Relative participle formation
Root  Tense  Relative Participle marker  Form
paTi  tt     a                           paTitta
paTi  kkiR   a                           paTikkiRa
paTi  um     -                           paTikkum
paTi  a:     a                           paTikka:ta
Table 7. Glides that a word ending in a vowel takes.
Ending Vowel          Glide  Example
Mid Open short a      v      Native root words do not end in a
Mid Open long a:      v      ci:ta: + ai -> ci:ta:vai; 'Sita' + ACC -> 'Sita(obj)'
Front Close short i   y      puli + uTan -> puliyuTan; 'tiger' + SOC -> 'with the tiger'
Front Close long i:   y      ti: + a:l -> ti:ya:l; 'fire' + INS -> 'due to fire'
Back Close short u    v      e:cu + ai -> e:cuvai; 'Jesus' + ACC -> 'Jesus(obj)'
Back Close long u:    v      pu: + in -> pu:vin; 'flower' + GEN -> 'flower's'
Front Mid short e     y      Native root words do not end in e. puNe + il -> puNeyil; 'Pune' + LOC -> 'in Pune'
Front Mid long e:     y      Native root words do not end in e:. me: + il -> me:yil; 'May' + LOC -> 'in May'
Back Mid short o      v      Native root words do not end in o
Back Mid long o:      v      Native root words do not end in o:. a:TTo: + in -> a:TTo:vin; 'auto' + GEN -> 'auto's'
Diphthong ai          y      a:cai + a:l -> a:caiya:l; 'desire' + INS -> 'due to desire'
Diphthong au          v      Native root words do not end in au. laknau + il -> laknauvil; 'Lucknow' + LOC -> 'in Lucknow'
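The glide rules of Table 7 can be expressed as a small lookup keyed on the final vowel of the stem. The sketch below is an illustration under stated assumptions rather than the analyser's own code: the names GLIDE_BY_FINAL_VOWEL, VOWEL_INITIALS and attach are hypothetical, the glide is inserted only before a vowel-initial suffix, and other sandhi processes are ignored.

```python
# Final vowel -> glide, transcribed from Table 7.
# Longer symbols come first so that "a:" is not read as "a" and "ai" is not read as "i".
GLIDE_BY_FINAL_VOWEL = [
    ("ai", "y"), ("au", "v"),
    ("a:", "v"), ("i:", "y"), ("u:", "v"), ("e:", "y"), ("o:", "v"),
    ("a", "v"), ("i", "y"), ("u", "v"), ("e", "y"), ("o", "v"),
]
VOWEL_INITIALS = ("a", "i", "u", "e", "o")


def attach(stem, suffix):
    """Join stem and suffix, inserting y or v when a vowel meets a vowel."""
    if suffix.startswith(VOWEL_INITIALS):
        for vowel, glide in GLIDE_BY_FINAL_VOWEL:
            if stem.endswith(vowel):
                return stem + glide + suffix
    # Consonant-final stem or consonant-initial suffix: plain concatenation.
    return stem + suffix


for stem, suffix in [("ci:ta:", "ai"), ("puli", "uTan"), ("pu:", "in"), ("a:cai", "a:l")]:
    print(stem, "+", suffix, "->", attach(stem, suffix))
```

An analyser works in the opposite direction, removing the glide when it separates a suffix from a vowel-final stem, but the same table drives both directions.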
Table 8. Sample of the State Table
Current State  Next State  Symbol
0              0           ai
0              0           utaiya
0              1           kal
0              1           ai
0              1           utaiya
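A state table of this shape can be loaded into a transition map keyed on the current state and the symbol, so that each step depends only on the current state. The sketch below is hypothetical: build_transitions and reachable_states are illustrative names, and the direction in which the analyser consumes suffixes and the choice of accepting states are not specified by the sample, so only the lookup itself is shown. Because the sample lists two next states for some symbols, the map stores a set of next states.

```python
from collections import defaultdict

# Rows of Table 8 as (current state, next state, symbol).
STATE_TABLE = [
    (0, 0, "ai"),
    (0, 0, "utaiya"),
    (0, 1, "kal"),
    (0, 1, "ai"),
    (0, 1, "utaiya"),
]


def build_transitions(rows):
    """Map (current state, symbol) to the set of possible next states."""
    transitions = defaultdict(set)
    for current, nxt, symbol in rows:
        transitions[(current, symbol)].add(nxt)
    return transitions


def reachable_states(transitions, symbols, start=0):
    """Follow the symbols in order and return every state that can be reached."""
    states = {start}
    for symbol in symbols:
        states = {n for s in states for n in transitions.get((s, symbol), ())}
        if not states:
            break  # no transition defined: the sequence is rejected
    return states


TRANSITIONS = build_transitions(STATE_TABLE)
print(reachable_states(TRANSITIONS, ["ai"]))
print(reachable_states(TRANSITIONS, ["kal"]))
```

An empty result means the symbol sequence has no path through the table, which is how a missing state table entry would surface as an analysis failure.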
Figure 1. Sample State Diagram
Table 9. Evaluation of Morphological analyser
Types                                                            General Domain  Tourism Domain
Total number of words                                            50,000          50,000
Analysed words                                                   46,620          45,085
Error due to missing morphosyntax rules and state table entries  223             344
Error due to agglutination                                       485             531
Error due to missing root word                                   1,345           1,987
Input error                                                      1,327           2,053
Correctness of analysis                                          93.24%          90.17%
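The correctness figures in Table 9 follow directly from the counts: correctness is the number of analysed words divided by the total, and the four error classes account for the remaining words. The short check below uses only the numbers in the table; the names RESULTS and correctness are illustrative.

```python
# Counts copied from Table 9; errors are listed in the table's row order.
RESULTS = {
    "General Domain": {"total": 50_000, "analysed": 46_620, "errors": [223, 485, 1_345, 1_327]},
    "Tourism Domain": {"total": 50_000, "analysed": 45_085, "errors": [344, 531, 1_987, 2_053]},
}


def correctness(row):
    """Percentage of input words that received an analysis."""
    return 100.0 * row["analysed"] / row["total"]


for domain, row in RESULTS.items():
    # Analysed words plus all error classes should add up to the total.
    assert row["analysed"] + sum(row["errors"]) == row["total"]
    print(f"{domain}: {correctness(row):.2f}%")

average = sum(correctness(row) for row in RESULTS.values()) / len(RESULTS)
print(f"Average: {average:.3f}%")
```

The two domain scores are 93.24% and 90.17%, and their mean, 91.705%, corresponds to the 91.70% reported for the system overall.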
Table 10. Linguistic abbreviations.
Abbreviation Full Form
3PE 3rd person Plural Epicene
3PN 3rd person Plural Neuter
3SF 3rd person Singular Feminine
3SH 3rd person Singular Honorific
3SM 3rd person Singular Masculine
3SN 3rd person Singular Neuter
ABL Ablative
ACC Accusative
ADJ Adjective Suffix
ADV Adverb Suffix
AUXV Auxiliary Verb
CAUS Causative
COND Conditional
CONC Concessive
COOR Coordination Clitic
DAT Dative
DISJ Disjunction Clitic
EMPH Emphatic Clitic
EMP Emphatic Suffix
FEM Feminine
FUT Future Tense
GEN Genitive
HORT Hortative
INF Infinitive
INS Instrumental
INT Interrogative
LOC Locative
MAS Masculine
NEG Negative Suffix
NEGVERB Negative Verb
OBJ Object
ONOM Onomatopoeic form
OPT Optative
PL Plural
PRON-3SM Pronominal - 3rd person Singular Masculine
PRE Present Tense
PSP Postposition
PST Past Tense
RP Relative Participle
SOC Sociative
SUFF Suffix
SUPP Supposition marker
i According to Schiffman (1994), the usual treatment of Tamil case (Arden 1942) is one where there are seven cases: the nominative (first case), accusative (second case), instrumental (third), dative (fourth), ablative (fifth), genitive (sixth), and locative (seventh). The vocative is sometimes given a place in the case system as an eighth case, although vocative forms do not participate in usual morphophonemic alternations, nor do they govern the use of any postpositions.