
[IEEE 2010 International Conference on Recent Trends in Information, Telecommunication and Computing (ITC) - Kerala, India (2010.03.12-2010.03.13)] 2010 International Conference on

Integer Linear Programming Approach to Dependency Parsing for Malayalam

Aparnna T, Raji P G, Soman K P

Centre for Excellence in Computational Engineering and Networking, Amrita Vishwa Vidyapeetham, Coimbatore, India.

{aparnu, rajiipg1985}@gmail.com

Abstract-This paper shows that a dependency parser for the Malayalam language can be built using an integer linear programming approach. We describe a process for developing the parser by incorporating Paninian grammar concepts for Malayalam.

I. INTRODUCTION

As computational linguistics becomes more popular, it is important to have a parser for Malayalam language processing. Parsing is one of the complex tasks in natural language processing, and the most difficult challenge is to derive a formal specification of the grammar. Malayalam has a very extensive grammatical structure. Dependency-grammar-driven parsing is better suited to such languages because it gives more importance to the dependency relations that connect the words. The parser we describe uses a grammar-driven approach to show the dependency relations in a given sentence. First we construct a framework that works out the constraints among the local word groups of a Malayalam sentence; these constraints are then translated into an integer programming problem. The solution to that problem gives the parse structure of the sentence. The parser described here can also be used for other Indian languages that follow the same word order.

II. PANINIAN FRAMEWORK

Even after ages, Panineeyam is still considered the basic grammar text for Malayalam. We use Paninian grammar concepts to develop our parser. Paninian grammar is a dependency grammar built on the notion of karaka relations, which are syntactico-semantic relations between the verb and the other related constituents in a sentence.

A. Karaka Relations

“Namavum Kriyayum Thammilulla Yojana Karakam”

That is, the relation of nouns to the verb in a sentence is called karaka. Vibhakthi (case) markers denote the karaka relations.

In Malayalam all nouns are declined. Seven cases (vibhakthikal) are traditionally distinguished; the list below gives these seven and also the vocative:

nirddESika or nominative,

prathigraahika or accusative,

samyOjika or sociative (also translated as "social"),

uddESika or dative,

prayOjika or instrumental,

sambandhika or possessive,

aadhaarika or locative,

and sambOdhana or vocative.

As an example, consider the samyOjika vibhakthi in Malayalam, which is marked by the suffix -OT(u). Suppose we want to translate the following sentence from English to Malayalam:

"I asked Raman whether the exam was tomorrow."

In Malayalam, the translation is

“Pareeksha nale ano ennu njan RamanOT(u) chodichu”

Here the suffix -OT(u) attached to Raman marks the samyOjika case.

We now look at why these vibhakthis are important in the Paninian grammar framework.

B. Paninian Model

We discuss the Paninian model with the help of an example.

Example 1: Raman pashuvine adikkunnu

(Raman beats the cow)

As part of the Paninian framework, a mapping is specified between karaka relations and vibhakthi (which collectively covers case endings, post-positional markers, etc.). In computational Paninian grammar this mapping is given by a structure called the basic karaka chart. It specifies which karakas are mandatory or optional and which vibhakthis they take. The basic karaka chart given in Figure 1 correctly derives the sentence in Example 1.

2010 International Conference on Recent Trends in Information, Telecommunication and Computing

978-0-7695-3975-1/10 $25.00 © 2010 IEEE

DOI 10.1109/ITC.2010.97


Karaka    Vibhakthi    Presence
Karta     φ            Mandatory
Karma     aE or φ      Mandatory
Karana    aL or φ      Optional

Figure 1: Basic karaka chart for ‘adikkuka’ (beat)
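The chart in Figure 1 can be held in a small data structure. The following is a minimal sketch, not the authors' implementation: the names `KarakaEntry`, `ADIKKUKA_CHART` and `karakas_allowed` are hypothetical, and the marker strings simply copy the entries of Figure 1.

```python
from dataclasses import dataclass

PHI = "φ"  # the null (zero) vibhakthi marker used in Figure 1


@dataclass(frozen=True)
class KarakaEntry:
    """One row of a basic karaka chart (hypothetical representation)."""
    karaka: str
    vibhakthis: frozenset  # markers that can realise this karaka
    mandatory: bool


# Rows copied from Figure 1 for the verb 'adikkuka' (beat).
ADIKKUKA_CHART = (
    KarakaEntry("karta", frozenset({PHI}), True),
    KarakaEntry("karma", frozenset({"aE", PHI}), True),
    KarakaEntry("karana", frozenset({"aL", PHI}), False),
)


def karakas_allowed(chart, vibhakthi):
    """Return the karakas in the chart that accept the given vibhakthi."""
    return [entry.karaka for entry in chart if vibhakthi in entry.vibhakthis]


print(karakas_allowed(ADIKKUKA_CHART, "aE"))
print(karakas_allowed(ADIKKUKA_CHART, PHI))
```

A noun group marked with aE can only fill the karma slot, while an unmarked noun group is compatible with every row of this chart; the later constraints are what resolve that choice.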

Now consider another sentence containing the verb adikkuka (beat).

Example 2: Raman kai kondu pashuvine adikkunnu

(Raman beats the cow with his hand)

Its word groups are marked, and adikkuka (beat) has the same karaka chart as in Figure 1. Its constraint graph is shown in Figure 2. The nodes of the graph are the word groups, and there is an arc labeled by a karaka from a verb group to a noun group if the noun group satisfies the karaka restriction in the karaka chart of the verb group. The verb groups are called demand groups because they make demands about their karakas, and the noun groups are called source groups because they satisfy those demands.
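The construction of the constraint graph can be sketched in a few lines. This is an illustration under stated assumptions: the markers attached to the noun groups are our own simplification (in particular, the postposition kondu is treated as equivalent to the instrumental marker aL), and the function name `constraint_graph` is hypothetical.

```python
PHI = "φ"  # null vibhakthi marker

# Karaka chart of Figure 1: karaka -> vibhakthis it accepts.
CHART = {"karta": {PHI}, "karma": {"aE", PHI}, "karana": {"aL", PHI}}

# Source (noun) groups of Example 2 with an ASSUMED marker for each group.
SOURCES = [("Raman", PHI), ("kai kondu", "aL"), ("pashuvine", "aE")]
DEMAND = "adikkunnu"  # the demand (verb) group


def constraint_graph(demand, chart, sources):
    """Add an arc (demand, karaka, source) whenever the source's marker
    satisfies the karaka restriction in the demand group's chart."""
    return [
        (demand, karaka, word)
        for karaka, markers in chart.items()
        for word, marker in sources
        if marker in markers
    ]


ARCS = constraint_graph(DEMAND, CHART, SOURCES)
for arc in ARCS:
    print(arc)
```

Under these assumptions the unmarked group Raman is compatible with all three karakas, so the graph has five arcs; the parse conditions are what later narrow them down.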

A parse is a sub-graph of the constraint graph that contains all the nodes of the constraint graph and satisfies the following conditions [1]:

i. For each mandatory karaka in the karaka chart of each demand group, there should be exactly one outgoing edge labeled by that karaka from the demand group.

ii. For each desirable or optional karaka in the karaka chart of each demand group, there should be at most one outgoing edge labeled by that karaka from the demand group.

iii. There should be exactly one incoming arc into each source group.

If several sub-graphs of a constraint graph satisfy these conditions, there are multiple parses for the given Malayalam sentence and the sentence is ambiguous. If no sub-graph satisfies them, the sentence does not have a parse and is probably ill-formed.
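Conditions i to iii can be checked directly by enumerating sub-graphs of a small constraint graph. The sketch below is a brute-force illustration only, and the arc list is an assumption: it is the graph obtained for Example 2 when Raman is unmarked, kai kondu is treated as instrumental and pashuvine as accusative. Since there is a single demand group, an arc is written as (karaka, source group).

```python
from itertools import combinations

MANDATORY = ("karta", "karma")
OPTIONAL = ("karana",)
SOURCES = ("Raman", "kai kondu", "pashuvine")

# ASSUMED constraint graph for Example 2, arcs as (karaka, source group).
ARCS = [
    ("karta", "Raman"),
    ("karma", "Raman"),
    ("karma", "pashuvine"),
    ("karana", "Raman"),
    ("karana", "kai kondu"),
]


def is_parse(subgraph):
    """Check conditions i, ii and iii on a set of arcs."""
    def outgoing(karaka):
        return sum(1 for k, _ in subgraph if k == karaka)

    def incoming(source):
        return sum(1 for _, s in subgraph if s == source)

    return (
        all(outgoing(k) == 1 for k in MANDATORY)      # condition i
        and all(outgoing(k) <= 1 for k in OPTIONAL)   # condition ii
        and all(incoming(s) == 1 for s in SOURCES)    # condition iii
    )


def parses(arcs):
    """Enumerate every subset of arcs and keep those that are parses."""
    return [
        set(sub)
        for n in range(len(arcs) + 1)
        for sub in combinations(arcs, n)
        if is_parse(sub)
    ]


PARSES = parses(ARCS)
print(len(PARSES), PARSES)
```

An empty result would indicate an ill-formed sentence and more than one result an ambiguous one; for this assumed graph exactly one sub-graph survives.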

Nodes: Raman (Raman), kai kondu (with his hand), pashuvine (the cow), adikkunnu (beats)

Arc labels: Karta Karaka, Karana Karaka, Karma Karaka

Figure 2: Constraint graph for Example 2

III. FORMATION OF INTEGER PROGRAMMING CONSTRAINTS FROM THE KARAKA RELATIONS IN MALAYALAM [1]

A parse for a Malayalam sentence can be obtained from the constraint graph using integer programming. A constraint graph is converted into an integer programming problem by introducing a variable $z_{i,k,j}$ for each arc from node i to node j labeled by karaka k in the constraint graph, so that every arc has a variable. The variables take the value 0 or 1. A parse is an assignment of 1 to those variables whose arcs are in the parse sub-graph and 0 to those whose arcs are not. The equality and inequality constraints of the integer programming problem are obtained from conditions i, ii and iii listed earlier, respectively, as follows:

1. For each demand group i, for each of its mandatory karakas k, the following equality must hold:

$$Q_{i,k}: \quad \sum_{j} z_{i,k,j} = 1$$

Note that $Q_{i,k}$ stands for the equation formed for a given demand word i and karaka k. Thus there will be as many equations as there are combinations of i and k.

2. For each demand group i, for each of its optional or desirable karakas k, the following inequality must hold:

$$O_{i,k}: \quad \sum_{j} z_{i,k,j} \le 1$$

3. For each source group j, the following equality must hold:

$$R_{j}: \quad \sum_{i,k} z_{i,k,j} = 1$$

Thus there will be as many equations as there are source words. The cost function to be minimized is the sum of all the variables.
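To make the formulation concrete, the sketch below writes the Q, O and R constraints as rows of 0/1 coefficients over the arc variables and then searches all 0/1 assignments for the one of minimum cost. It is an illustration under assumptions: the arc list is our assumed constraint graph for Example 2, and exhaustive enumeration stands in for the integer programming package, which is only practical for very small graphs.

```python
from itertools import product

# ASSUMED arcs (demand i, karaka k, source j); one 0/1 variable z per arc.
ARCS = [
    ("adikkunnu", "karta", "Raman"),
    ("adikkunnu", "karma", "Raman"),
    ("adikkunnu", "karma", "pashuvine"),
    ("adikkunnu", "karana", "Raman"),
    ("adikkunnu", "karana", "kai kondu"),
]
MANDATORY = [("adikkunnu", "karta"), ("adikkunnu", "karma")]
OPTIONAL = [("adikkunnu", "karana")]
SOURCES = ["Raman", "kai kondu", "pashuvine"]


def row(selects):
    """Coefficient row with 1 for every arc picked by the predicate."""
    return [1 if selects(arc) else 0 for arc in ARCS]


# Q_{i,k} and R_j are equalities (= 1); O_{i,k} are inequalities (<= 1).
EQ_ROWS = [row(lambda a, ik=ik: a[:2] == ik) for ik in MANDATORY]
EQ_ROWS += [row(lambda a, j=j: a[2] == j) for j in SOURCES]
LE_ROWS = [row(lambda a, ik=ik: a[:2] == ik) for ik in OPTIONAL]


def dot(coeffs, z):
    return sum(c * v for c, v in zip(coeffs, z))


def solve():
    """Return a feasible 0/1 vector minimizing the sum of variables, or None."""
    feasible = [
        z
        for z in product((0, 1), repeat=len(ARCS))
        if all(dot(r, z) == 1 for r in EQ_ROWS)
        and all(dot(r, z) <= 1 for r in LE_ROWS)
    ]
    return min(feasible, key=sum, default=None)


Z = solve()
PARSE = [arc for arc, value in zip(ARCS, Z)] if Z is None else [
    arc for arc, value in zip(ARCS, Z) if value
]
print(Z, PARSE)
```

The same coefficient rows could be handed to any integer programming solver; only the `solve` function would change.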

However, there may be more than one way to form local word groups for some words in Malayalam. This results in several possible local word (LW) groupings for a Malayalam sentence. For example, for a sentence with four words (d1 to d4), if d1 and d2 may or may not form a local word group and, similarly, d2 and d3 may or may not form one, we get the following alternative LW groupings:

1. (d1 d2) d3 d4

2. d1 d2 d3 d4

3. d1 (d2 d3) d4
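The alternative groupings listed above can be generated mechanically. The following sketch assumes that a local word group is formed only from two adjacent words and that a word cannot belong to two groups at once; the function name `lw_groupings` and its index-based input are hypothetical.

```python
from itertools import combinations


def lw_groupings(words, optional_pairs):
    """Enumerate LW groupings.

    optional_pairs is a sorted list of indices i such that words[i] and
    words[i + 1] may (or may not) form a local word group.
    """
    results = []
    for n in range(len(optional_pairs) + 1):
        for chosen in combinations(optional_pairs, n):
            # Two chosen merges conflict when they share a word.
            if any(b - a == 1 for a, b in zip(chosen, chosen[1:])):
                continue
            groups, i = [], 0
            while i < len(words):
                if i in chosen:
                    groups.append((words[i], words[i + 1]))
                    i += 2
                else:
                    groups.append((words[i],))
                    i += 1
            results.append(groups)
    return results


GROUPINGS = lw_groupings(["d1", "d2", "d3", "d4"], [0, 1])
for grouping in GROUPINGS:
    print(grouping)
```

For the four-word example this yields the three groupings in the list, because merging both (d1 d2) and (d2 d3) would use d2 twice and is skipped.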

There can also be more than one karaka chart for a verb in Malayalam. These cases can be handled in two ways. The first is to use the equations given above for each LW grouping and for each karaka chart of the verb. This means that the integer programming package is called separately, once for each combination of LW grouping and karaka charts.

In the second approach we formulate the constraint equations so that there are variables for selecting an LW grouping and for selecting a karaka chart of a verb with multiple karaka charts. The following variables are introduced for this purpose:

• $r_l$ represents the l-th possible LW grouping for a sentence in Malayalam.

• $s_{l,i,p}$ represents the p-th karaka chart of the i-th word group, which belongs to the l-th LW grouping.

• $z_{l,i,j,p,k}$ represents an arc from the i-th word group to the j-th word group of the l-th LW grouping with karaka label k of the p-th karaka chart.

At any time, only the arcs corresponding to one LW grouping should be considered. This is ensured by the following equation:

$$\sum_{l} r_{l} = 1$$

For an LW grouping l there should be only one karaka chart for each of the demand groups i in it:

$$\sum_{p} s_{l,i,p} - r_{l} = 0$$

The above equation says that if $r_l$ is 1 for some l (i.e. that LW grouping is selected), the sigma term must also be 1 (i.e. one karaka chart is selected, so that we get a parse for that LW grouping). Otherwise, if $r_l$ is 0, the sigma term is also 0. To ensure that each mandatory karaka labels a unique arc:

$$\sum_{j} z_{l,i,j,p,k} - s_{l,i,p} = 0$$

This equation forces every variable that does not represent a karaka of the selected karaka chart to be zero. For the optional arcs the following inequality should be satisfied:

$$\sum_{j} z_{l,i,j,p,k} - s_{l,i,p} \le 0$$

To ensure that each source group j has exactly one incoming arc, we have:

$$\sum_{i,p,k} z_{l,i,j,p,k} - r_{l} = 0$$

The above equation also forces all the variables that do not belong to the selected LW grouping to be zero.
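The combined formulation can be tried out on a toy instance. Everything in the sketch below is a hypothetical illustration: the word groups v, n1 and n2, the two LW groupings, the karaka charts and the arcs are invented to exercise the constraints, and exhaustive search over 0/1 assignments replaces the integer programming package.

```python
from itertools import product

# HYPOTHETICAL instance: grouping l -> demand groups with their karaka
# charts, source groups, and arcs written as (i, j, p, k).
INSTANCE = {
    0: {  # n1 and n2 are separate word groups; verb v has two charts
        "demands": {"v": [
            {"mandatory": ["karta", "karma"], "optional": []},
            {"mandatory": ["karta"], "optional": ["karana"]},
        ]},
        "sources": ["n1", "n2"],
        "arcs": [("v", "n1", 0, "karta"), ("v", "n2", 0, "karma"),
                 ("v", "n1", 1, "karta"), ("v", "n2", 1, "karana")],
    },
    1: {  # n1 and n2 form a single local word group
        "demands": {"v": [{"mandatory": ["karta"], "optional": []}]},
        "sources": ["n1 n2"],
        "arcs": [("v", "n1 n2", 0, "karta")],
    },
}


def variables(inst):
    """List the names of all r, s and z variables of the instance."""
    names = []
    for l, g in inst.items():
        names.append(("r", l))
        for i, charts in g["demands"].items():
            names += [("s", l, i, p) for p in range(len(charts))]
        names += [("z", l) + arc for arc in g["arcs"]]
    return names


def feasible(inst, x):
    """Check every constraint of the combined formulation on assignment x."""
    if sum(x[("r", l)] for l in inst) != 1:            # one LW grouping
        return False
    for l, g in inst.items():
        r = x[("r", l)]
        z = {arc: x[("z", l) + arc] for arc in g["arcs"]}
        for i, charts in g["demands"].items():
            if sum(x[("s", l, i, p)] for p in range(len(charts))) != r:
                return False                            # one chart per demand
            for p, chart in enumerate(charts):
                s = x[("s", l, i, p)]

                def out(k):
                    return sum(v for (ii, _, pp, kk), v in z.items()
                               if (ii, pp, kk) == (i, p, k))

                if any(out(k) != s for k in chart["mandatory"]):
                    return False
                if any(out(k) > s for k in chart["optional"]):
                    return False
        for j in g["sources"]:                          # one incoming arc
            if sum(v for (_, jj, _, _), v in z.items() if jj == j) != r:
                return False
    return True


def solutions(inst):
    names = variables(inst)
    for bits in product((0, 1), repeat=len(names)):
        x = dict(zip(names, bits))
        if feasible(inst, x):
            yield x


SOLUTIONS = list(solutions(INSTANCE))
for x in SOLUTIONS:
    print([name for name, value in x.items() if value])
```

Each feasible assignment selects one LW grouping, one karaka chart per demand group and the arcs of one parse, so a single problem covers every combination that the first approach would solve separately.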

IV. CONCLUSION

In this paper we have shown how computational Paninian grammar can be used for parsing the Malayalam language. The parser is based on translating grammatical constraints into integer programming constraints. A solution of these constraints produces a parse that can be used for various natural language processing (NLP) tasks. The dependency parser described here is in the implementation stage, and it has been found to work for Malayalam sentences that follow free word order.

V. REFERENCES

[1] Akshar Bharati, Rajeev Sangal, T. Papi Reddy, "A constraint based parser using integer programming", Language Technologies Research Centre, International Institute of Information Technology, Hyderabad 500019, Andhra Pradesh, India.

[2] Sebastian Riedel, James Clarke, "Incremental integer linear programming for non-projective dependency parsing", School of Informatics, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, UK.

[3] Akshar Bharati and Rajeev Sangal, "A karaka based approach to parsing of Indian languages", in COLING90: Proc. of Int. Conf. on Computational Linguistics, NY, August 1990.

[4] Akshar Bharati and Rajeev Sangal, "Parsing of free word order languages using the Paninian framework", in ACL93: Proc. of Annual Meeting of Association for Computational Linguistics, Association for Computational Linguistics, NY, 1993.

[5] A. R. Rajaraja Varma (2005), "Kerala Panineeyam", Kottayam: DC Books, 301-302. ISBN 81-713-0672-1.

[6] V. Ramkumar, "Sampoornna Malayala Vyakaranam", SISO Books, Trivandrum, India.
