

Knowledge-Based Systems 109 (2016) 147–159


A generalized framework for anaphora resolution in Indian languages

Utpal Kumar Sikdar, Asif Ekbal ∗, Sriparna Saha

Department of Computer Science and Engineering, Indian Institute of Technology Patna, India

Article info

Article history:

Received 16 May 2015

Revised 27 June 2016

Accepted 28 June 2016

Available online 6 July 2016

Keywords:

Multiobjective optimization (MOO)

Single objective optimization (SOO)

Conditional random field (CRF)

Support vector machine (SVM)

Abstract

In this paper, we propose a joint model of feature selection and ensemble learning for anaphora resolution in resource-poor environments such as the Indian languages. The proposed approach is based on multi-objective differential evolution (DE) that optimises five coreference resolution scorers, namely Muc, Bcub, Ceafm, Ceafe and Blanc. The main goal is to determine the best combination of different mention classifiers and the most relevant set of features for anaphora resolution. The proposed method is evaluated on three leading Indian languages, namely Hindi, Bengali and Tamil. Experiments on the benchmark datasets of the ICON-2011 Shared Task on Anaphora Resolution in Indian Languages show that our proposed approach attains a good level of accuracy, often better than that of the state-of-the-art systems. It achieves F-measure values of 71.89%, 59.61%, 52.55%, 34.45% and 72.52% for Muc, Bcub, Ceafm, Ceafe and Blanc, respectively, for Bengali. For Hindi we obtain F-measure values of 33.27%, 63.06%, 49.59%, 49.06% and 55.45% for the Muc, Bcub, Ceafm, Ceafe and Blanc metrics, respectively. In order to further show the efficacy of our proposed algorithm, we evaluate it on Tamil, a language that belongs to a different family; this yields F-measure values of 31.79%, 64.67%, 46.81%, 45.29% and 52.80% for the Muc, Bcub, Ceafm, Ceafe and Blanc metrics, respectively. Experiments on Dutch show F-measure values of 17.67%, 74.43%, 58.08%, 59.21% and 55.58% for the Muc, Bcub, Ceafm, Ceafe and Blanc metrics, respectively.

© 2016 Elsevier B.V. All rights reserved.

1. Introduction

Anaphora or coreference resolution is the process of identifying noun phrases that denote the same real-world entities [1,21,30,46]. Many crucial application areas of Natural Language Processing (NLP), for example Information Extraction [11], Text Summarization [36], Question Answering [16], Text Retrieval [30] and Machine Translation [14,18,23,24], require coreference resolution to be performed. There has been a significant amount of work in this area, but most of it focuses on languages such as English [5,26,27], owing to the availability of lexical resources and large annotated corpora such as ACE [44] and OntoNotes [45]. We use a state-of-the-art English coreference resolution system, BART [42], for our task, so proper adaptation had to be carried out for the resource-scarce Indian languages.

India is a multilingual country with great cultural and linguistic diversity. There are 22 officially spoken languages in India. However, there has not been a significant amount of work on anaphora resolution involving Indian languages, because of their resource-constrained nature, i.e., annotated corpora and other lexical resources are not readily available in the required measure. The literature shows that the existing works on anaphora resolution involving Indian languages are few in number, and they cover only a few of the languages, such as Bengali, Hindi and Tamil [2,33,34,39]. Based on these works, however, it is difficult to get a comprehensive view of the research on anaphora resolution for Indian languages, because each of them was developed using self-generated datasets and the evaluation setups are not the same. Therefore, it is not fair to compare the algorithms reported in these works.

The first benchmark setup for anaphora resolution involving Indian languages was established in the ICON-2011 NLP Tools Contest on Anaphora Resolution.¹ Out of the six participating teams, four addressed the issues of anaphora resolution in Bengali, and one each for Hindi and Tamil. Apart from these, some other works have been reported in [9,31–33] for anaphora resolution, especially for Bengali and Hindi. A system for anaphora resolution in Bengali is reported in [32], where various models for mention detection were developed and their impacts on anaphora resolution were reported. In another work, Senapati and Garain [31] have shown how an off-the-shelf anaphora resolution system can be effectively used for Bengali.

∗ Corresponding author. Fax: +91 612 2277383. E-mail addresses: [email protected] (U. Kumar Sikdar), [email protected], [email protected] (A. Ekbal), [email protected] (S. Saha).
¹ http://ltrc.iiit.ac.in/icon2011/contests.html

A more recent study on anaphora resolution in Bengali can be found in [33]. In recent times a generic framework for anaphora resolution in Indian languages has been reported in [9].

A literature survey shows that a large collection of multiobjective evolutionary approaches have been developed in recent years to solve several real-life problems. In [38], a multiobjective symbiotic organism search based approach is developed for solving the time-cost-labor utilization trade-off problem. Similarly, in [48] a performance comparison is conducted between generational and steady-state asynchronous multi-objective evolutionary algorithms for computationally intensive problems. In [10], a multiobjective evolutionary approach is developed for named entity recognition (NER) in the chemical domain. That paper proposes a joint model of feature selection and parameter optimization of the classifier, based on multiobjective genetic algorithms. The algorithm that we propose here is also a joint model, but it performs ensemble learning for mention detection and feature selection for coreference resolution. Unlike [10], we do not make use of a genetic algorithm, and instead use differential evolution to build our models. Genetic algorithms and differential evolution are two different optimization strategies, and the two problems, NER solved in [10] and coreference resolution solved here, are completely different.

In anaphora resolution there does not exist any globally accepted metric for measuring performance, and each of the Muc, Bcub, Ceaf and Blanc metrics exhibits significantly different behavior. The most popular coreference resolution scorer is Muc [43], which is defined based on the number of common links between the reference (true) and the system prediction (response). The Muc scorer ignores single-mention entities, as these do not contain any link. An error encountered because of linking two large groups could do more damage than one link connecting two small groups. Bagga and Baldwin [4] present their B-cubed evaluation to deal with these issues. While the B-cubed metric fixes some of the shortcomings of the Muc scorer, it has problems of its own. The Bcub scorer is based on the entities containing the mentions, but when precision or recall is calculated by comparing entities, an entity may be considered more than once. For this problem, Luo introduced a new metric called the Constrained Entity-Alignment F-Measure (Ceaf) [20], based on the aligned entities in the key (true) and response (system output). In [28], the authors proposed another scorer called BiLateral Assessment of Noun-phrase Coreference (Blanc), which takes into account the number of coreference and non-coreference links. Because of these different views, systems optimized according to one metric often tend to perform poorly with respect to the others, and therefore comparing performance among different systems introduces difficulties. Hence, we decide to optimize all the well-known metrics simultaneously.

In anaphora resolution an early attempt at explicit optimization was proposed in [22], where no significant performance improvement was observed over the baseline constructed with all the available features. A systematic effort of manual feature selection on the benchmark datasets was carried out in [40], where a total of 600 features were evaluated. The very first attempt at automatic optimization of anaphora resolution was carried out in [15], where the usability of evolutionary genetic algorithms (GAs) is investigated. The authors suggested that such a technique may yield significant performance improvements on the Muc-6/7 datasets. The concept of multi-objective optimization (MOO) for feature selection in anaphora resolution has been addressed in [29] for the English language, where a GA was used as the optimization technique. A recent study on feature selection in anaphora resolution for Bengali can be found in [33]. In contrast to [29], we have not made use of a GA as the optimization technique, and we performed feature selection for a non-English language. The algorithm proposed here is based on differential evolution (DE), which is not similar to a GA. The method proposed in [33] is based on single objective optimization (SOO). While SOO concentrates on optimizing only one function, MOO simultaneously optimizes more than one objective function. The output of MOO is a set of solutions on the Pareto optimal front, each of which is equally important from the algorithmic point of view. Hence, one interesting aspect of our algorithm is that, depending upon the need, a user can pick any solution. In contrast to all these previous works, we propose here a method that performs feature selection and ensemble learning jointly. The algorithm can simultaneously yield the best ensemble model for mention detection and the most relevant features for anaphora resolution. In none of the previous approaches were these two problems modeled in this way. Other existing works such as [31,32] do not address the problems of feature selection and/or ensemble learning.

The key contributions of this work are four-fold, viz. (i) building a generic framework for anaphora resolution in a less-resourced environment such as that of the Indian languages; (ii) adapting an existing state-of-the-art English co-reference resolution system for one European language (i.e., Dutch) which has completely different orthography and characteristics; (iii) a joint model for ensemble learning and feature selection, especially for this kind of application; and (iv) the use of an evolutionary algorithm such as DE to simultaneously optimize the coreference metrics Muc, Bcub, Ceafm, Ceafe and Blanc using the concepts of multiobjective optimization.

The rest of the paper is structured as follows. Section 2 elaborately discusses our proposed methods, including the technique for mention detection, the technique for feature selection, and the features used for anaphora resolution. In Section 3 we report detailed evaluation results along with the necessary analysis. Finally, in Section 4, we conclude the paper.

2. Proposed method

In this section, we present a joint model for feature selection and ensemble construction. The performance of anaphora resolution greatly depends on the mentions (also called markables). Mentions are basically the noun phrases that form the core parts of anaphoric relations, so a good mention detector is useful for developing an accurate model for anaphora resolution. Here we build several models for mention detection using heuristics and machine learning (cf. Section 2.1). As machine learners we use Conditional Random Fields (CRF) [17] and Support Vector Machines (SVM) [41]. All these mention detection models are combined to further increase performance, and the noun phrases extracted from the combined model are used as the candidate markables for anaphora resolution. The anaphora resolver is trained with a set of features that were implemented without using many domain-dependent resources and/or tools. We propose a framework based on multiobjective optimization (MOO) that determines the best ensemble of mention detectors and the best feature combination for anaphora resolution by optimizing several evaluation scorers; in particular we optimize Muc, Bcub, Ceafm, Ceafe and Blanc. We generate N mention detection models (N = 10) and implement M features (M = 18) for anaphora resolution. The proposed approach is general in nature, and therefore applicable to many less-resourced languages. An overall architecture of the proposed model is shown in Fig. 1.

2.1. Models for mention detection

Mention or markable denotes the terms or phrases that participate in anaphoric relations. The accuracy of mention detection plays an important role in the overall performance of anaphora resolution. We develop the following 10 models for mention detection.

Fig. 1. Anaphora resolution system.

(1) Model-1: This model is developed using the supervised classifier CRF. The classifier is trained using the following features: local context; fixed-length prefix and suffix strings of up to four characters; part-of-speech (PoS) information of the current token; named entity (NE) information (Muc categories such as person, location and organization names) of the current token; the noun phrase preceding a pronoun; morphological constructs (lemma and number information); and several binary-valued features. Most of these features are extracted from the Indian language shallow parser.² The binary-valued features check whether the current token corresponds to the first word of the sentence, a pronoun, or a definite or demonstrative noun. In addition, for Bengali, we prepare a list of frequently occurring suffixes that appear with person names (e.g., -bAbu³, -der, -dI, -rA etc.) and pronouns (e.g., -tI, -ke, -der etc.), and define a feature that fires if the current word contains any of these suffixes. For Hindi, we prepare a list of pronouns, and define a feature that is set to 1 if the current token matches any of the entries of the pronoun list.

(2) Model-2: This model was developed using SVM [41]. For training we use the same set of features as for the first model.

(3) Model-3: We observed that not all pronouns take part in anaphoric relations. We build an SVM-based classifier to distinguish between anaphoric and non-anaphoric pronouns. The classifier was trained using the following set of features: first word of the sentence, suffixes and prefixes of up to three characters, root word, a predefined pronoun list extracted from the training data, and PoS information of the current token.

(4) Model-4: This is similar to the third model, except that we use CRF instead of SVM.

(5) Model-5: This is a stacked model, where Model-1 is used as the base classifier and Model-2 as the meta-classifier. The output obtained from the first classifier is used as an additional feature for the second classifier.

(6) Model-6: This is also a stacked model, where we use the output of Model-2 as an additional feature for Model-1.

(7) Model-7: This model was constructed by considering all the noun phrases as mentions.

(8) Model-8: This rule-based model was built by including pronouns and preceding noun phrases as mentions.

(9) Model-9: In this model, we consider pronouns and all their preceding NEs as possible candidates for the markables.

(10) Model-10: This model was constructed by considering all the pronouns and NEs as markables.

² http://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow_parser.php
³ Bengali glosses are written in ITRANS notation.

2.2. Brief description of the BART system architecture

We use BART [42] as our underlying platform for anaphora resolution. It provides state-of-the-art approaches, including syntax-based and semantic features. The flexibility of BART lies in its very modular design, which provides effective separation between several tasks, including engineering new features that exploit different sources of knowledge, and improving the way that anaphora resolution is mapped to a machine learning problem. BART has five main components: preprocessing pipeline, mention factory, feature extraction module, learning and encoder/decoder. Another component, the language plugin, helps to adapt different languages to the BART architecture. Preprocessing consists of marking up noun chunks and named entities (NEs). It also adds information such as part-of-speech (PoS) tags and merges this information into markables. These markables are the starting points of the mentions used by the coreference resolution, and they are finally converted into an XML file. The mention factory extracts mentions and finds out basic properties like number, gender etc. from the XML files. The feature extraction module extracts features for each anaphor-antecedent pair. We experiment with WEKA's [47] implementation of the C4.5 decision tree learning algorithm [25]. Training (encoder) instances are created following [35]. Each pair of adjacent coreferent markables denotes a positive training instance. A negative instance is created by pairing the anaphor with any markable occurring between the anaphor and the antecedent. During testing (decoder), each markable is paired with any preceding markable from right to left, and coreference chains are created by best-first clustering. Each mention is compared with all of its previous mentions with a probability greater than a fixed threshold value, and is clustered with the one of highest probability. If none has a probability greater than the threshold, the mention becomes a new cluster. The different steps of the BART architecture are elaborately shown in Fig. 2.

Fig. 2. Different steps of the BART architecture.

During testing, we perform a closest-first clustering of the instances deemed coreferent by the classifier. Each text is processed from left to right: each markable is paired with any preceding markable, from right to left, until a pair labeled as coreferent is output or the beginning of the document is reached. In this step, the coreference chains are created by best-first clustering. Each mention is compared with all of its previous mentions with a probability greater than a fixed threshold value, and is clustered with the one of highest probability. If none has a probability greater than the threshold, the mention becomes a new cluster.
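The decoding routine just described is compact enough to sketch in code. The following is a minimal illustration, assuming an ordered list of mentions and some pairwise scorer coref_probability (both names are assumptions of this sketch, not BART's actual API):

```python
# Illustrative sketch of the decoding step described above: each mention is
# compared with all of its preceding mentions, and is attached to the cluster
# of the antecedent with the highest coreference probability above a threshold.

def cluster_mentions(mentions, coref_probability, threshold=0.5):
    """Group mentions into coreference chains by best-first clustering.

    `mentions` is a document-ordered list of markables; `coref_probability`
    is any pairwise scorer returning the probability that the later mention
    corefers with the earlier one. Both are assumptions of this sketch.
    """
    clusters = []      # each cluster is a list of mention indices
    cluster_of = {}    # mention index -> cluster id

    for j in range(len(mentions)):
        best_prob, best_antecedent = threshold, None
        for i in range(j):  # compare with every preceding mention
            p = coref_probability(mentions[i], mentions[j])
            if p > best_prob:  # keep the antecedent with highest probability
                best_prob, best_antecedent = p, i
        if best_antecedent is None:
            cluster_of[j] = len(clusters)  # no antecedent above threshold:
            clusters.append([j])           # the mention starts a new cluster
        else:
            cid = cluster_of[best_antecedent]
            cluster_of[j] = cid
            clusters[cid].append(j)
    return clusters
```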

2.3. Features for anaphora resolution

We view coreference resolution as a binary classification problem. Following similar proposals for English [22], we use the learning framework proposed in [35] as a baseline. Each classification instance consists of two markables, i.e., an anaphor and its potential antecedent. Instances are modeled as feature vectors and used to train a binary classifier. The classifier has to decide, given the features, whether the anaphor and the candidate are coreferent or not. We use BART [42] as our underlying platform for anaphora resolution. Given a potential antecedent RE_i and an anaphor RE_j, we compute the following set of features, implemented after being motivated by our prior work [33].

(1) String match (SM): This feature compares the surface forms, and takes the value true if the candidate anaphor (RE_j) and antecedent (RE_i) have the same surface string forms, otherwise false.

(2) Sentence distance (SD): This feature denotes the distance between the anaphor and the antecedent. Its value is a non-negative integer that captures the distance in terms of the number of sentences between an anaphor and its antecedent: it takes the value 0 if both are in the same sentence, 1 if their sentence distance is 1, and so on.

(3) Markable distance (MD): This non-negative integer feature captures the distance between the two markables in terms of the number of mentions, measured as the total number of markables occurring between the anaphor and antecedent pair.

(4) First person pronoun (FPP): This feature is defined based on direct and indirect speech. For a given anaphor-antecedent pair (RE_j, RE_i), the feature is set to true if RE_j is a first person pronoun found within a quotation and RE_i is a mention immediately preceding it within the same quote. If RE_i is outside the quote and appears either in the same sentence or in any of the preceding three sentences and is not the first person, then the feature is also set to true. The feature behaves in a similar way if the pair (RE_j, RE_i) appears outside the quotation. Let us consider the following example:

papi biswas ballen, "amAke sAbdhAn kare bhAlo kAj korechen. ami hoito purono abhyAs moto bole ditAm."

Here, in the anaphor-antecedent pair ("ami", "amAke"), the anaphor ami and the antecedent amAke are both inside the quotation. In another pair ("amAke", "papi biswas"), "papi biswas" is outside the quotation. For each of these cases, the feature value is set to true.

(5) Second person pronoun (SPP): This feature is defined for a pair (RE_j, RE_i) that appears in the same quote. If RE_i is not the first person and RE_j corresponds to a second person, then this feature is set to true. The feature also fires if RE_j is inside the quotation but RE_i is outside and ends with the suffix "ke". Consider the following example:

miss. biswas bollen, "AmAr sAthe lukochurI kelbAr chestA korben na, mr. sankar. kon bApyArtAr kothA bolchi tA apni besh bhujte pArchen. bishwA suddhA lok je bApyArtA niye fisfis kore kothA bolche tA aponi kichhu jAnen nA."

Here, both mentions of the anaphor-antecedent pair ("apni", "mr. sankar") are within a quotation; the anaphor apni is the second person and the antecedent mr. sankar is the third person. Each mention in the other pair ("apni", "apni") is the second person and both lie inside the quotation. In all these cases the feature value is set to true.

Consider another example:

se pichon theke harike bollo, "tui bol."

In the mention pair ("tui", "harike"), the anaphor tui is within the quotation, but the antecedent harike is outside the quotation and also ends with the suffix string "ke". The feature value is also set to true in this case.

(6) Third person pronoun (TPP): If both mentions in the pair (RE_j, RE_i) denote third person pronouns and are outside the quotation, then the feature fires. Consider the following example:

sarat chandrar ekti ne.Di kuttA chilo. tabuo tini bileti kuttAr moto jAtno korten.

In this example, for the anaphor-antecedent pair ("tini", "sarat chandrar"), both mentions are third persons, and so the feature is set to true.

(7) Reflexive pronoun (RP): For a given pair (RE_j, RE_i), this feature checks whether RE_j is a reflexive pronoun and fires accordingly: if an antecedent is immediately followed by a reflexive pronoun, then the feature is true, otherwise false. Let us take the following example:

tumi nijeo AtatAYer hate nihato habe.

In this example, for the anaphor-antecedent pair ("nijeo", "tumi"), the anaphor nijeo is a reflexive pronoun and the antecedent tumi immediately precedes that anaphor. In this case, the feature value is set to true.

(8) Number agreement (NA): This feature checks whether the anaphor and antecedent pair agree in number information. It is extracted from the Indian language shallow parser.⁴

Page 5: A generalized framework for anaphora resolution in Indian …sriparna/papers/general.pdf · 2018. 10. 1. · tion technique. A recent study for feature selection in anaphora resolution

For the example given above (Point 7), the antecedent tumi and the anaphor nijeo are singular, and the pair feature is set to true.

(9) Semantic class feature (SCF): If the semantic types of RE_j and RE_i are the same, then the value of this feature is set to true, otherwise false. The semantic types denote the Muc NE categories. In the example above (Point 7), both the antecedent tumi and the anaphor nijeo are of the person-name category, and the pair feature returns true.

(10) Alias feature (AF): This checks whether RE_j is an alias of RE_i or not, and the feature value is set accordingly. This feature is very effective when abbreviations are used instead of the full forms of markables.

(11) Appositive feature (APF): If RE_j is in apposition to RE_i, then the value of this feature is set to true, otherwise false. Let us take the following example:

mr. sankar, tumi bAri jabe.

Here, the anaphor tumi and the antecedent mr. sankar are separated by a comma and both refer to the same person, hence the feature is set to true.

(12) String kernel (SK): String kernel similarity is used to estimate the similarity between two strings based on a string subsequence kernel.

(13) Mention type (MT): Following [35], we encode the mention types (name, nominal or pronoun) of the anaphor and the antecedent. In addition, we check whether the anaphor RE_j is a definite pronoun, a demonstrative pronoun or merely a pronoun. We also check whether each of the entities in the mention pair denotes a proper name.

(14) CorefChain (CC): This is the size of the coreference chain computed for the antecedent found so far. The feature is computed dynamically, and it gives a boost to central entities, making them likely candidates for coreference.

(15) NonPronStrMatch (NPSM): The feature returns true if the antecedent and anaphor are both not pronouns and match each other.

(16) PronStrMatch (PSM): The feature returns true if the antecedent and anaphor are both pronouns and match each other.

(17) PrprNameStrMatch (PNSM): This checks whether the candidate pairs denote proper nouns. The feature value is set to true if they are both proper nouns, and false otherwise.

(18) Gender (GEN): In Hindi, gender information provides useful evidence for anaphora resolution. We extract this feature from the Indian language shallow parser as mentioned earlier. If both markables of the pair agree in gender information, then the value is set to true, otherwise false. In the case of Bengali, for the person NE type, we manually prepare two lists that correspond to male (e.g., 'kumar', 'chandra' etc.) and female ('devi', 'kumari' etc.) names. If the markable pair agrees in this information, then the feature is set to true, otherwise false.

⁴ http://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow_parser.php
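As an illustration of the pairwise setup, the sketch below computes three of the simpler features above (SM, SD and MD) for a candidate anaphor-antecedent pair. The Mention structure is an assumption made for this sketch, not BART's internal representation:

```python
from dataclasses import dataclass

@dataclass
class Mention:
    surface: str      # surface string of the markable
    sentence_id: int  # index of the sentence containing the markable
    mention_id: int   # position in the document-wide markable sequence

def string_match(re_i: Mention, re_j: Mention) -> bool:
    # SM: true if antecedent and anaphor have identical surface forms
    return re_i.surface == re_j.surface

def sentence_distance(re_i: Mention, re_j: Mention) -> int:
    # SD: 0 if both are in the same sentence, 1 if one sentence apart, etc.
    return abs(re_j.sentence_id - re_i.sentence_id)

def markable_distance(re_i: Mention, re_j: Mention) -> int:
    # MD: number of markables occurring between the antecedent and the anaphor
    return abs(re_j.mention_id - re_i.mention_id) - 1
```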

2.4. Joint model for ensemble learning and feature selection

Our proposed multi-objective DE based algorithm produces a set of solutions, each representing weights for combining the models of mention detection and the most relevant set of features for anaphora resolution.

2.4.1. Overview of multiobjective differential evolution
Differential Evolution (DE) [37] is a popular evolutionary optimization technique that performs a parallel direct search in complex, large and multi-modal landscapes, and provides near-optimal solutions. Parameters in the search space are encoded in the form of chromosomes. Each chromosome is denoted by a D-dimensional parameter vector X_{i,G} = [x_{1,i,G}, x_{2,i,G}, ..., x_{D,i,G}], i = 1, 2, ..., NP, where NP is the number of solutions in the population. For the multiobjective version, more than one objective or fitness function is associated with each chromosome. The algorithm generates a new parameter vector by adding the weighted difference of two vectors to a third one; this operation is called mutation. The parameters of the mutated vector are mixed with the parameters of another predetermined vector, the target vector, to yield a new vector known as the trial vector; this process of parameter mixing is often referred to as crossover. The selection operation refers to the process of selecting the effective solutions: the trial vectors are merged into the current population, which is then ranked based on the concepts of domination and non-domination. In the next generation we select NP chromosomes from the ranked solutions using the crowding distance sorting algorithm. The process of selection, crossover and mutation continues for a fixed number of generations or until a termination condition is satisfied. The crowding distance d_i of each point i in the non-dominated front I [8] is computed as follows:

• For i = 1, ..., |I|, initialize d_i = 0.
• For each objective function f_k, k = 1, ..., K, do the following:
  • Sort the set I according to f_k in ascending order.
  • Set d_1 = d_{|I|} = ∞.
  • For j = 2 to (|I| − 1), set d_j = d_j + (f_k(j+1) − f_k(j−1)).
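The bulleted procedure translates almost directly into code. The sketch below is a plain transcription of it, where each member of the front is a vector of K objective values (here, the five F-measures):

```python
def crowding_distance(front):
    """`front` is a list of K-dimensional objective vectors, one per solution."""
    n = len(front)
    d = [0.0] * n                                  # initialize every d_i = 0
    if n < 3:
        return [float("inf")] * n                  # only boundary points exist
    for k in range(len(front[0])):                 # for each objective f_k
        order = sorted(range(n), key=lambda i: front[i][k])  # ascending sort
        d[order[0]] = d[order[-1]] = float("inf")  # d_1 = d_|I| = infinity
        for pos in range(1, n - 1):                # j = 2 .. |I| - 1
            d[order[pos]] += front[order[pos + 1]][k] - front[order[pos - 1]][k]
    return d
```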

2.4.2. Problem formulation
Suppose there are N classifiers (denoting mention detection models) and M features. Weights for combining the N classifiers are denoted by W_1, ..., W_N, where A = {W_i : i = 1, ..., N}. Feature values correspond to B = {F_i : i = 1, ..., M}. The problem of ensemble construction and feature selection can then be stated as follows: compute the appropriate weight for each classifier so as to combine the outputs of the various mention detectors, and determine the subset of features B′ ⊆ B, such that an anaphora resolver trained on this subset of features using the combined model for mention detection optimizes some metrics. In our proposed MOO based differential evolution (DE) setting, we optimize five objective functions, namely the F-measure values corresponding to the Muc, Bcub, Ceafm, Ceafe and Blanc scorers. All these metrics represent significantly different behaviors. Our multi-objective DE based joint model selects the best weights for combining the outputs of the classifiers for mention detection, and chooses the subset of features (best features) that maximizes the five metrics simultaneously. Details of the procedure for joint modelling are described below.

2.4.3. Problem encoding and population initialization
Problems are encoded as real-valued strings (also called chromosomes) of length D = N + M, where N is the number of mention classifiers and M is the number of available features used to construct the anaphora resolution system. A collection of such D-length chromosomes is called a population. The size of the population (total number of chromosomes) is denoted by NP. All NP chromosomes are randomly initialized with real values between 0 and 1. A fitness function is associated with each chromosome, which encodes the N classifiers and the M features. A higher weight for a classifier denotes more confidence in markable (mention) selection from the respective model. If the value of a feature bit is ≥ 0.5, then the respective feature is used for training the anaphora resolver; otherwise the feature is not used. An example of a chromosome with four classifiers and five features is shown in Fig. 3. Here, the values of the first four bits denote the weights by which the classifiers have to be combined.

Fig. 3. Chromosome representation.

The anaphora resolver is trained with only three of the features, as only these bit positions have values ≥ 0.5.
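As a concrete illustration of this encoding, the sketch below initialises one chromosome and decodes it into its two parts. The constants mirror N = 10 and M = 18 from our setting; the function names are illustrative:

```python
import random

N_CLASSIFIERS, M_FEATURES = 10, 18  # N mention detection models, M features

def random_chromosome():
    # all N + M genes are initialised uniformly at random in [0, 1]
    return [random.random() for _ in range(N_CLASSIFIERS + M_FEATURES)]

def decode(chromosome):
    # first N genes: weights for combining the mention classifiers
    weights = chromosome[:N_CLASSIFIERS]
    # remaining M genes: a feature is selected iff its gene value is >= 0.5
    selected = [i for i, gene in enumerate(chromosome[N_CLASSIFIERS:])
                if gene >= 0.5]
    return weights, selected

weights, selected_features = decode(random_chromosome())
```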

2.4.4. Fitness computation
The fitness computation is described below. For the ensemble, we perform the following steps:

(1) Suppose N is the total number of classifiers, and let the classifiers' F-measure values on the development set be denoted by F_n, n = 1, ..., N.

(2) For each token of the development set, we have K classes predicted by the N classifiers (each output corresponds to a particular classifier). The final output of the ensemble is determined using the weighted vote of these N classifiers. The weight for a particular class predicted by the n-th classifier is based on F_n (i.e., the F-measure value of the n-th classifier). The final weight of a particular token t for a particular class c_k is:

g(c_k) = Σ F_n × Q(n, k), over all n = 1, ..., N with op(t, n) = c_k

Here, Q(n, k) corresponds to the entry of the chromosome that represents the n-th classifier and the k-th class, and op(t, n) denotes the output class predicted by the n-th classifier for the t-th token. The final prediction for a token is the class that receives the maximum combined weight. Let us consider the following example:

Example: Consider the chromosome in Fig. 3. Suppose the three classes are 'B-mention' (class 1), 'I-mention' (class 2) and 'O' (class 3), and the F-measure values of the four classifiers are 0.82, 0.93, 0.78 and 0.85, respectively. Let the four classifiers produce the following outputs for a token 'Rabindranath': Classifier 1: 'B-mention'; Classifier 2: 'B-mention'; Classifier 3: 'O'; and Classifier 4: 'I-mention'. Then g('B-mention') = 0.82 × 0.23 + 0.93 × 0.67 = 0.81; g('O') = 0.78 × 0.32 = 0.25; and g('I-mention') = 0.85 × 0.45 = 0.38. Thus, the final output selected for this particular token is 'B-mention', as g('B-mention') > g('I-mention') > g('O').

(3) In this way we combine the outputs of the multiple classifiers for all the tokens.
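A minimal sketch of this weighted vote, reproducing the worked example above; representing Q(n, k) as one dictionary per classifier is an assumption made for brevity:

```python
from collections import defaultdict

def combine_outputs(predictions, f_measures, q):
    """predictions[n]: class predicted by classifier n for the current token;
    f_measures[n]: F-measure F_n of classifier n on the development set;
    q[n][k]: chromosome entry for classifier n and class k."""
    score = defaultdict(float)
    for n, cls in enumerate(predictions):
        score[cls] += f_measures[n] * q[n][cls]  # g(c_k) += F_n * Q(n, k)
    return max(score, key=score.get)  # class with maximum combined weight

# The 'Rabindranath' example above: g('B-mention') = 0.82*0.23 + 0.93*0.67
f = [0.82, 0.93, 0.78, 0.85]
preds = ['B-mention', 'B-mention', 'O', 'I-mention']
q = [{'B-mention': 0.23}, {'B-mention': 0.67},
     {'O': 0.32}, {'I-mention': 0.45}]
assert combine_outputs(preds, f, q) == 'B-mention'  # 0.81 > 0.38 > 0.25
```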

The output of the combined model is then subjected to anaphora resolution using the relevant feature set present in that particular chromosome. Below we list the steps for feature selection:

(1) Suppose there are M available features denoted by F_1, ..., F_M, where B = {F_i : i = 1, ..., M} corresponds to the feature values.

(2) Suppose P features are selected from the M available features depending upon the bit values of the chromosome: if the value of a bit is ≥ 0.5, then the feature is present, otherwise absent. The example chromosome in Fig. 3 uses three of five features for the classifier's training (i.e., M = 5 and P = 3).

(3) The anaphora resolver is trained only with these P features, and evaluated on the development data.

(4) Compute the F-measure values for all five objective functions, namely Muc, Bcub, Ceafm, Ceafe and Blanc.

(5) The objective functions corresponding to this particular chromosome are f_1 = Muc; f_2 = Bcub; f_3 = Ceafm; f_4 = Ceafe; f_5 = Blanc.

(6) These five objective functions are simultaneously optimized using the search capability of multiobjective DE.

2.4.5. Mutation
In multiobjective DE, for each target vector X_{i,G}, i = 1, 2, 3, ..., NP, a mutant vector is generated according to

V_{i,G+1} = x_{r1,G} + F × (x_{r2,G} − x_{r3,G}),   (1)

where r1, r2, r3 are mutually different random indices belonging to {1, 2, ..., NP}, G is the generation number and F > 0. The indices r1, r2 and r3 are chosen so that they also differ from the current running index i, which requires NP to be at least four. The process of mutation is depicted in Fig. 4.
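A sketch of Eq. (1) in code, assuming each chromosome is a plain list of floats; F is the mutation factor (set to 0.7 in Section 3):

```python
import random

def mutate(population, i, f_scale=0.7):
    """Build the mutant vector V_{i,G+1} = x_r1 + F * (x_r2 - x_r3)."""
    # r1, r2, r3 are mutually different and also differ from the current
    # index i, which is why the population must contain at least 4 vectors
    candidates = [r for r in range(len(population)) if r != i]
    r1, r2, r3 = random.sample(candidates, 3)
    x1, x2, x3 = population[r1], population[r2], population[r3]
    return [a + f_scale * (b - c) for a, b, c in zip(x1, x2, x3)]
```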

2.4.6. Crossover or recombination
Crossover or recombination represents the parameter mixing of the target vector X_{i,G} and the mutant vector V_{i,G+1}. This exchange of information is performed in order to generate a better offspring that represents a promising solution; the diversity of the mutant vector can thus be increased. To perform this operation, a trial vector is formed as follows:

U_{i,G+1} = (u_{1,i,G+1}, u_{2,i,G+1}, ..., u_{D,i,G+1}),   (2)

where

u_{j,i,G+1} = v_{j,i,G+1} if (r_j ≤ CR) or j = i_r,   (3)
u_{j,i,G+1} = x_{j,i,G} if (r_j > CR) and j ≠ i_r,   (4)

for j = 1, 2, ..., D. In Eq. (3), r_j is a uniform random number in [0, 1] drawn for the j-th evaluation. The crossover process is depicted in Fig. 5.
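Eqs. (2)-(4) describe the standard binomial crossover of DE; a minimal sketch:

```python
import random

def crossover(target, mutant, cr=0.5):
    """Build the trial vector U_{i,G+1} from the target and mutant vectors."""
    d = len(target)
    i_r = random.randrange(d)  # one position always taken from the mutant,
                               # so the trial differs from the target
    return [mutant[j] if (random.random() <= cr or j == i_r) else target[j]
            for j in range(d)]
```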

2.4.7. Selection
To select the best NP solutions for the next generation G + 1, the trial population is merged into the current population, yielding 2 × NP chromosomes. These solutions are sorted based on the concepts of domination and non-domination relations in the objective function space. As an example, dominated and non-dominated relations are shown in Fig. 7; in this figure the non-dominated (i.e., rank-1) solutions are represented on the Pareto-optimal surface. Thereafter the ranked solutions are added to the population of the next generation until the number of solutions equals NP. If the number of solutions would exceed NP, the crowding distance sorting algorithm is applied: it chooses solutions from the beginning of the next sorted rank and keeps adding them until NP is reached. This process ultimately determines the best NP chromosomes to be included in the next population. The selection procedure is shown in Fig. 6.


Fig. 4. Mutation process.

Fig. 5. Crossover process.

Fig. 6. Selection process.

For example, suppose the population size NP = 5, so the number of target vectors (chromosomes) is 5, and the fitness values of the target vectors are <65.47, 57.52, 46.95, 30.25, 68.81> for vector-1; <65.52, 55.48, 48.88, 30.64, 69.62> for vector-2; <63.89, 54.59, 48.25, 31.34, 68.87> for vector-3; <66.56, 56.82, 47.41, 32.26, 67.88> for vector-4; and <61.01, 53.24, 46.25, 26.70, 68.76> for vector-5. After mutation and crossover, 5 trial vectors are generated. Suppose the fitness values of the trial vectors are <70.14, 50.67, 29.43, 22.57, 47.29> for vector-1; <64.49, 54.82, 48.18, 32.44, 68.00> for vector-2; <62.60, 55.76, 46.68, 30.92, 68.00> for vector-3; <66.71, 55.60, 44.72, 28.89, 65.00> for vector-4; and <63.73, 54.51, 47.38, 31.47, 67.49> for vector-5.

Fig. 7. Representation of dominated and non-dominated solutions.

Fig. 8. The entire DE process with termination condition.

After merging the set of target vectors with the set of trial vectors, the total number of vectors is 10. For the next generation, 5 vectors have to be selected from these 10. First, the 10 vectors are ranked based on the dominance and non-dominance concepts. The ranked solutions are <65.47, 57.52, 46.95, 30.25, 68.81> and <70.14, 50.67, 29.43, 22.57, 47.29> for rank-1 (2 solutions); <65.52, 55.48, 48.88, 30.64, 69.62>, <63.89, 54.59, 48.25, 31.34, 68.87>, <66.56, 56.82, 47.41, 32.26, 67.88>, <64.49, 54.82, 48.18, 32.44, 68.00>, <62.60, 55.76, 46.68, 30.92, 68.00> and <66.71, 55.60, 44.72, 28.89, 65.00> for rank-2 (6 solutions); and <61.01, 53.24, 46.25, 26.70, 68.76> and <63.73, 54.51, 47.38, 31.47, 67.49> for rank-3 (2 solutions).

All the solutions of the first rank are propagated to the next generation. Hence, out of the six solutions available in the second rank we have to select three (5 − 2 = 3) to be included in the next generation. We apply the crowding distance sorting algorithm (described above) to the second rank and select the three most promising solutions. The three solutions finally selected for the next generation are <65.52, 55.48, 48.88, 30.64, 69.62>, <66.71, 55.60, 44.72, 28.89, 65.00> and <63.89, 54.59, 48.25, 31.34, 68.87>. This is the process by which we select the solutions for the next generation.
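The selection step just illustrated can be sketched as follows: the merged population of target and trial vectors is partitioned into non-dominated fronts, whole fronts are copied into the next generation while they fit, and the last front that does not fit is thinned by crowding distance (the crowding_distance routine sketched in Section 2.4.1 can be passed in). Solutions are represented by their objective vectors, all of which are maximised:

```python
def dominates(a, b):
    # a dominates b if it is no worse on every objective (all objectives are
    # F-measures, hence maximised) and strictly better on at least one
    return all(x >= y for x, y in zip(a, b)) and \
           any(x > y for x, y in zip(a, b))

def non_dominated_fronts(solutions):
    """Partition objective vectors into fronts: rank-1, rank-2, ..."""
    fronts, remaining = [], list(solutions)
    while remaining:
        front = [s for s in remaining
                 if not any(dominates(o, s) for o in remaining if o is not s)]
        fronts.append(front)
        remaining = [s for s in remaining if s not in front]
    return fronts

def select_next_generation(targets, trials, np_size, crowding_distance):
    chosen = []
    for front in non_dominated_fronts(targets + trials):
        if len(chosen) + len(front) <= np_size:
            chosen.extend(front)  # the whole front fits
        else:
            # thin the last front: keep the solutions with the largest
            # crowding distance, i.e. those in the least crowded regions
            d = crowding_distance(front)
            order = sorted(range(len(front)), key=lambda i: d[i], reverse=True)
            chosen.extend(front[i] for i in order[:np_size - len(chosen)])
            break
    return chosen
```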

2.4.8. Termination condition
The processes of mutation, crossover, fitness computation and selection are executed for G_Max generations. Finally we obtain a set of non-dominated solutions on the final Pareto optimal front, each of which represents a (near-)optimal solution. The entire process of DE along with the termination condition is described in Fig. 8.

2.4.9. Selecting the best solution
The MOO based approach yields a set of solutions. None of these is strictly better than the others, and therefore all are equally important from the algorithmic point of view. Here we determine the final solution based on the F-measure values of the individual scorers: for each of the five objective functions, namely Muc, Bcub, Ceafm, Ceafe and Blanc, we select the particular solution that yields the highest F-measure value (for the respective metric) among all the ranked solutions.

For example, suppose we obtain 6 solutions in rank-1: <70.72, 51.11, 31.27, 25.75, 50.26> for solution-1; <64.47, 58.98, 51.88, 32.57, 72.36> for solution-2; <69.05, 59.25, 50.71, 31.79, 64.61> for solution-3; <68.58, 58.38, 52.55, 33.56, 65.30> for solution-4; <66.14, 59.28, 50.77, 34.03, 61.58> for solution-5; and <70.08, 50.81, 41.73, 23.54, 70.52> for solution-6. Each of these 6 solutions is equally important (based on the concept of non-domination), and a user can select any solution based on their preferences or requirements. For a hypothetical case, suppose a user needs a solution with good performance for both the Muc and Blanc scorers: the user would choose the 6th solution, as its average of the Muc and Blanc scores is better than those of the others. Suppose another user needs a solution that is good with respect to the Muc scorer; obviously, the user should select the first solution (highest Muc = 70.72), as this is the most effective one. In our case, the highest performance obtained for a particular scorer is finally reported.
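A minimal sketch of this per-metric selection rule, using the six hypothetical rank-1 solutions listed above:

```python
METRICS = ['Muc', 'Bcub', 'Ceafm', 'Ceafe', 'Blanc']

def best_per_metric(rank1_solutions):
    """For each metric, return (best F-measure, solution achieving it);
    each solution is a 5-tuple of F-measures ordered as in METRICS."""
    return {m: max((sol[k], sol) for sol in rank1_solutions)
            for k, m in enumerate(METRICS)}

sols = [(70.72, 51.11, 31.27, 25.75, 50.26),
        (64.47, 58.98, 51.88, 32.57, 72.36),
        (69.05, 59.25, 50.71, 31.79, 64.61),
        (68.58, 58.38, 52.55, 33.56, 65.30),
        (66.14, 59.28, 50.77, 34.03, 61.58),
        (70.08, 50.81, 41.73, 23.54, 70.52)]
assert best_per_metric(sols)['Muc'][0] == 70.72  # solution-1 is best on Muc
```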


Table 1
Statistics of the datasets. Here '#sen': number of sentences and '#tok': number of tokens.

Dataset        Bengali           Hindi             Tamil             Dutch
               #sen    #tok      #sen    #tok      #sen    #tok      #sen    #tok
Training       881     10,504    974     21,067    5,536   53,194    2,544   46,894
Development    598     5,785     427     9,370     3,085   30,954    496     9,164
Test           572     6,985     –       –         –       –         –       –

3

t

d

n

t

t

t

m

g

s

f

A

t

n

t

m

r

i

s

i

s

o

a

T

s

w

i

t

W

d

t

m

fi

i

S

a

t

t

p

a

a

i

n

m

r

s

b

S

N

P

S

S

N

M

S

S

T

P

G

p

s

o

f

h

f

T

f

F

m

B

F

f

p

7

a

s

t

t

i

o

o

m

w

t

s

t

r

t

j

f

[

a

3

. Experiments and discussions

Our experiments are based on the datasets provided in

he ICON-2011 NLP Tool Contest. For training and development

atasets, annotations were provided by the organizers. But no an-

otation was provided for the test data. In line with the anno-

ations of training and development datasets, we manually anno-

ated the test dataset for Bengali. For Hindi and Tamil, we report

he results on the development set as no gold annotations were

ade available in public. However all these three languages (Ben-

ali, Hindi and Tamil) are spoken in India, they don’t have the

ame characteristics. While Bengali and Hindi belong to Indo-Aryan

amily, Tamil is a language that belongs to the Dravidian family.

lso, there are many phenomenon that are not same for all these

hree languages. For example, in Bengali gender information does

ot have any influence, but Hindi has. Tamil is agglutinative in na-

ure unlike Bengali or Hindi. Please note that for Bengali we opti-

ize our algorithm on the development data, and finally report the

esults on the test data. For Hindi and Tamil, a part of the train-

ng set is used for optimizing our model, and finally evaluation re-

ults are reported on the development data. Based on these exper-

ments we set the following parameter values to DE: population

ize ( NP ) = 40, number of generations ( G Max ) = 50, CR (probability

f crossover) = 0.5 and F (mutation factor) = 0.7

Statistics of the datasets in terms of the number of sentences

nd the number of tokens present in each set are provided in

able 1 . The datasets are of mixed domains, covering tourism, short

tory, news article and sports as domains.

In order to further show the efficacy of our proposed approach

e evaluate it for another language, namely Dutch, one of the lead-

ng European languages. This language has completely different or-

hography and characteristics compared to the Indian languages.

e use the datasets of CoNLL-2012 Shared Task 5 . Statistics of the

atasets are reported in Table 1 .

We use standard CoNLL scorer 6 for the evaluation. In order

o compare with our proposed method we construct a baseline

odel using a loose re-implementation of a subset of features de-

ned in [35] . These include number agreement, alias, string match-

ng, semantic class agreement, sentence distance and appositive (c.f.

ection 2.3 ). Thereafter, for training of anaphora resolver, we use

ll the features as mentioned in Section 2.3 , and then the op-

imized feature sets (obtained from multi-objective DE in isola-

ion). Detailed evaluation results of these three models are re-

orted in Tables 2–4 for Bengali, Hindi and Tamil, respectively. We

lso evaluate the model by considering only the gold mentions. We

lso carry out experiments on Dutch, results of which are reported

n Table 5 , to establish that our proposed approach is generic in

ature. Comparisons show that we obtain better accuracies for the

odel trained with all the features over the respective baselines.

For multi-objective DE, we consider the solutions of the first

ank, and finally select the best one following the technique as de-

cribed in Section 2.4.9 . The (near)-optimal features as determined

y the proposed algorithm are shown below:

5 http://conll.cemantix.org/2012/task-description.html 6 http://conll.cemantix.org/2012/software.html

e

p

Bengali (4 solutions): GEN − F P P − T P P − RP − SM − SCF −D − SK − CC − NP SM, GEN − F P P − SP P − T P P − SCF − MT − SK −P SM − P NSM, F P P − T P P − RP − AF − SCF − MT − AP F − SD − SK − SM − P NSM and RF − AF − MT − SD − SK

Hindi (5 solutions): GEN − NA − T P P − RP − AF − SCF − MT −D − SK − NP SM − P N SM, N A − T P P − RP − AF − SM − SCF − AP F −K − CC, F P P − SP P − T P P − AF − SM − AP F − MD − SD − SK −P SM − P NSM, GEN − NA − SP P − T P P − RP − AF − SM − AP F −D − SK and NA − RP − SM − MT − AP F − SK − NP SM.

Tamil (5 solutions): LRM − NA − T P P − RP − SCF − MT − AP F −D − MD, LRM − NA − SP P − AF − SM − AP F − MD, LRM − RP − AF −CF − MD − GEN, SK − SCF − SD − MD − GEN and LRM − F P P − P P − SCF − MD

Dutch (3 solutions): MD − SP P − T P P − SCF − AP F − P SM − NSM − GEN, SD − MD − T P P − SCF − AP F − P SM − P NSM −EN and MD − SP P − SCF − AP F − P SM − P NSM

Results of the experiments for all these feature combinations are presented in Tables 7–10 for Bengali, Hindi, Tamil and Dutch, respectively. The results reported in these tables are generated from the solutions on the first rank of the final Pareto front, and the final score for each evaluation metric corresponds to the highest value obtained in any of those solutions. The best performance achieved by the four solutions reported in Table 7 corresponds to 71.89%, 59.61%, 52.55%, 34.45% and 72.52% for Muc, Bcub, Ceafm, Ceafe and Blanc, respectively, for Bengali. For Hindi the best performance corresponds to F-measures of 33.27%, 63.06%, 49.59%, 49.06% and 55.45% for Muc, Bcub, Ceafm, Ceafe and Blanc, respectively. For Tamil the obtained F-measure values are 31.79%, 64.67%, 46.81%, 45.29% and 52.80% for the Muc, Bcub, Ceafm, Ceafe and Blanc scorers, respectively. Experiments on Dutch demonstrate F-measure values of 17.67%, 74.43%, 58.08%, 59.21% and 55.58% for the Muc, Bcub, Ceafm, Ceafe and Blanc scorers, respectively. A closer analysis of these results shows the effectiveness of the proposed approach. It yields consistent performance improvements in all settings over the system that uses a single mention classifier with an anaphora resolver trained either on all the features (cf. the 'All features' columns) or on the optimized feature set (cf. the 'Selected features' columns). The optimized feature set denotes the set of relevant features determined by the multi-objective DE based feature selection technique when it is executed in isolation (i.e., not jointly with ensemble construction for mention detection). In two further experiments we also observe that our approach attains better accuracy than the system that uses the combined model for mention detection but trains the anaphora resolver either on all the available features or on the optimized feature set. We can therefore argue that our proposed joint model of feature selection and ensemble learning works effectively for all the languages. We have also conducted an ANOVA [3] analysis to establish that the performance improvements obtained by our proposed approach are statistically significant.
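To make the reading of the result tables concrete, the following sketch (ours, not part of the original pipeline) recovers the reported Bengali scores from the four first-rank solutions of Table 7; note that the per-metric maxima generally come from different solutions.

# First-rank solutions for Bengali (Table 7); each row holds the scores
# in the order Muc, Bcub, Ceafm, Ceafe, Blanc.
METRICS = ("Muc", "Bcub", "Ceafm", "Ceafe", "Blanc")
solutions = [
    (71.89, 50.05, 29.59, 23.83, 46.06),
    (67.93, 59.61, 51.57, 32.09, 72.52),
    (68.99, 58.89, 52.55, 32.21, 70.94),
    (66.92, 55.49, 49.22, 34.45, 69.04),
]

# The final reported score per metric is the best value attained by any
# first-rank solution.
best = {m: max(row[i] for row in solutions) for i, m in enumerate(METRICS)}
print(best)  # Muc 71.89, Bcub 59.61, Ceafm 52.55, Ceafe 34.45, Blanc 72.52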

3.1. Comparisons with SOO based joint model and isolated models of ensemble construction and feature selection

In order to compare with the MOO based technique, we implement a joint model using the concept of single objective DE (henceforth denoted as Model-1).


Table 2
Evaluation results for Bengali: different models and the optimal feature set.

System        Baseline                           All features                       Selected features
              Muc    Bcub   Ceafm  Ceafe  Blanc  Muc    Bcub   Ceafm  Ceafe  Blanc  Muc    Bcub   Ceafm  Ceafe  Blanc
Model_1       54.31  40.30  32.05  25.96  57.63  67.68  54.57  41.97  28.83  59.70  71.04  58.69  51.96  34.34  72.30
Model_2       53.29  39.74  31.72  25.59  57.02  66.61  54.83  43.35  28.17  59.36  70.68  58.23  51.39  33.74  71.65
Model_3       52.79  40.71  31.88  25.45  57.18  66.05  55.93  42.33  26.88  59.56  69.77  58.90  50.38  32.50  71.93
Model_4       53.50  41.34  32.25  25.93  57.57  65.90  54.49  41.17  26.27  61.21  69.87  59.21  51.47  32.60  72.26
Model_5       53.29  39.74  31.72  25.59  57.02  66.61  54.83  43.35  28.17  59.36  70.68  58.23  51.39  33.74  71.65
Model_6       54.31  40.30  32.05  25.96  57.63  67.68  54.57  41.97  28.83  59.70  71.04  58.69  51.96  34.34  72.30
Model_7       34.98  57.38  31.74  25.97  56.13  28.84  36.54  19.40  11.65  48.85  37.66  56.31  37.57  26.39  63.51
Model_8       28.54  43.58  24.43  20.08  51.98  34.27  47.44  23.90  15.62  52.24  38.37  52.25  32.58  22.27  59.23
Model_9       37.34  43.56  26.70  21.33  53.25  46.39  49.60  29.26  17.94  54.38  50.50  53.78  38.59  23.49  63.76
Model_10      50.41  49.93  32.58  26.45  58.32  55.54  48.65  34.87  20.33  57.08  61.65  60.68  47.41  28.77  71.50
Gold mention  60.60  40.30  35.08  29.44  58.71  75.55  54.01  49.76  40.01  65.89  80.57  61.59  59.44  47.89  74.19
Our approach  71.89  59.61  52.55  34.45  72.52  71.89  59.61  52.55  34.45  72.52  71.89  59.61  52.55  34.45  72.52

Table 3
Evaluation results for Hindi: different models and the optimal feature set.

System        Baseline                           All features                       Selected features
              Muc    Bcub   Ceafm  Ceafe  Blanc  Muc    Bcub   Ceafm  Ceafe  Blanc  Muc    Bcub   Ceafm  Ceafe  Blanc
Model_1       10.82  58.70  46.00  42.91  51.16  28.43  60.08  45.31  44.67  46.45  31.79  59.39  49.03  47.89  52.85
Model_2       11.11  59.08  45.41  42.45  51.02  29.02  60.18  43.12  42.25  44.87  31.53  62.36  48.73  47.78  53.42
Model_3       10.46  58.81  45.41  42.55  50.79  30.28  60.39  42.76  41.87  44.37  32.76  62.56  47.79  47.04  52.14
Model_4        9.57  58.19  44.88  42.09  50.37  28.77  60.12  44.37  43.63  45.00  32.06  61.35  48.07  47.27  51.60
Model_5       11.11  59.08  45.41  42.45  51.02  29.02  60.18  43.12  42.25  44.87  31.53  62.36  48.73  47.78  53.42
Model_6       10.82  58.70  46.00  42.91  51.16  28.43  60.08  45.31  44.67  46.45  31.79  59.39  49.03  47.89  52.85
Model_7        5.02  59.52  39.01  35.80  48.26  15.70  11.93   6.89   3.96   8.46  16.45  37.29  23.13  19.20  47.40
Model_8       12.24  62.13  46.64  43.43  52.99  24.76  52.54  32.25  32.17  34.15  22.11  62.04  47.96  45.65  55.27
Model_9       11.45  61.27  45.55  42.85  51.96  26.85  55.96  37.18  36.20  38.01  26.08  62.26  45.95  43.59  54.15
Model_10       7.18  58.54  36.90  34.82  45.44  24.02  36.04  19.61  16.59  23.88  23.57  49.56  31.42  27.56  48.47
Gold mention  12.78  55.20  44.80  43.67  51.27  57.70  58.98  35.48  35.66  37.85  55.70  63.22  47.00  46.72  55.38
Our approach  33.27  63.06  49.59  49.06  55.45  33.27  63.06  49.59  49.06  55.45  33.27  63.06  49.59  49.06  55.45

Table 4
Evaluation results for Tamil: different models and the optimal feature set.

System        Baseline                           All features                       Selected features
              Muc    Bcub   Ceafm  Ceafe  Blanc  Muc    Bcub   Ceafm  Ceafe  Blanc  Muc    Bcub   Ceafm  Ceafe  Blanc
Model_1        7.25  59.54  46.30  44.47  51.85  30.91  50.95  32.06  37.73  40.36  30.67  60.70  46.23  43.88  52.69
Model_2        6.80  59.74  46.33  44.52  51.65  28.87  50.37  31.22  37.08  39.79  28.74  60.80  46.34  43.94  52.30
Model_3        6.74  59.74  46.08  44.27  51.56  30.00  49.43  29.49  34.75  38.94  29.91  60.89  46.18  43.74  52.27
Model_4        7.09  59.58  46.00  44.27  51.72  29.54  49.39  30.31  36.12  39.01  29.84  60.49  46.15  43.69  52.54
Model_5        7.01  59.74  46.39  44.85  51.72  30.63  48.29  29.67  35.49  38.20  30.61  60.80  46.11  43.94  52.56
Model_6        7.01  59.58  46.38  44.92  51.71  31.42  48.65  30.24  36.14  38.57  31.14  60.59  46.09  43.88  52.55
Model_7        6.50  58.86  43.12  41.53  46.25   4.50  26.87  25.76  25.94  31.12   6.50  57.28  37.34  36.22  43.78
Model_8        6.41  59.92  43.25  42.02  51.05  22.46  24.15  14.33  16.85  20.68  22.07  62.92  43.57  39.96  52.28
Model_9        6.41  59.92  43.25  42.02  51.05  24.11  23.80  14.27  17.18  19.90  24.02  62.78  44.06  40.56  52.28
Model_10       7.06  59.38  43.54  41.18  51.23   5.43  18.06  11.23  44.63  51.50   7.08  62.66  41.64  36.15  51.15
Gold mention   8.71  56.31  45.69  45.63  52.50  70.21  17.16  10.93  10.84  12.36  71.57  57.18  46.40  47.90  53.67
Our approach  31.79  64.67  46.81  45.29  52.80  31.79  64.67  46.81  45.29  52.80  31.79  64.67  46.81  45.29  52.80

Table 5
Evaluation results for Dutch: different models and the optimal feature set.

System        Baseline                           All features                       Selected features
              Muc    Bcub   Ceafm  Ceafe  Blanc  Muc    Bcub   Ceafm  Ceafe  Blanc  Muc    Bcub   Ceafm  Ceafe  Blanc
Model_1        9.94  69.14  43.16  46.17  49.13   9.55  67.31  40.87  44.92  47.43  17.46  73.55  58.03  59.00  55.48
Model_2       10.03  69.84  44.10  46.76  49.45   9.38  67.80  41.63  45.53  48.04  17.35  73.61  58.08  59.01  55.39
Model_3       10.02  69.75  44.06  46.78  49.41   9.57  67.78  41.69  45.59  48.03  17.29  73.55  58.08  59.09  55.40
Model_4       10.11  69.34  43.51  46.56  49.08   9.59  67.71  41.58  45.61  47.74  17.26  73.38  58.02  59.01  55.37
Model_5        9.37  68.71  42.38  45.37  49.00   8.78  66.57  39.93  44.06  47.17  17.14  73.62  57.92  58.91  55.27
Model_6        9.41  68.47  42.05  44.97  49.01   8.64  66.36  39.58  43.59  47.13  17.34  73.74  57.83  58.70  55.54
Model_7        6.18  61.53  34.49  36.69  48.79   5.28  56.61  29.70  33.59  44.39  15.50  73.75  56.57  56.59  55.39
Model_8        2.88  69.54  43.12  45.32  49.75   2.90  64.59  36.01  40.98  46.07   4.32  73.44  52.77  53.54  51.05
Model_9        3.96  70.71  46.08  48.07  50.18   4.05  66.15  38.42  43.61  46.63   5.01  73.25  52.40  53.19  51.24
Model_10      10.06  70.13  45.54  47.21  50.91   9.24  65.29  38.63  42.99  47.58  14.22  74.24  54.63  54.94  54.51
Gold mention  10.10  68.81  43.18  46.41  48.86  10.31  66.92  41.32  45.59  47.10  17.91  73.24  58.47  59.75  55.10
Our approach  17.67  74.43  58.08  59.21  55.58  17.67  74.43  58.08  59.21  55.58  17.67  74.43  58.08  59.21  55.58


Table 6
Solutions of the first rank on the final Pareto optimal set for Bengali.

No.  Muc    Bcub   Ceafm  Ceafe  Blanc
1    69.59  59.22  50.78  31.82  70.74
2    69.58  58.30  50.66  32.30  69.89
3    71.89  50.05  29.59  23.83  46.06
4    69.79  59.28  51.23  32.30  70.52
5    67.93  59.61  51.57  32.09  72.52
6    69.18  58.64  51.56  32.68  71.35
7    68.99  58.89  52.55  32.21  70.94
8    70.18  52.10  33.92  26.05  53.71
9    70.72  51.11  31.27  25.75  50.26
10   66.92  55.49  49.22  34.45  69.04
11   69.24  58.75  52.08  32.39  72.00

Table 7
Evaluation results of the optimized model for Bengali.

No.  Muc    Bcub   Ceafm  Ceafe  Blanc
1    71.89  50.05  29.59  23.83  46.06
2    67.93  59.61  51.57  32.09  72.52
3    68.99  58.89  52.55  32.21  70.94
4    66.92  55.49  49.22  34.45  69.04

Table 8
Evaluation results of the optimized model for Hindi.

No.  Muc    Bcub   Ceafm  Ceafe  Blanc
1    33.27  54.68  45.36  47.14  44.00
2    19.30  63.06  38.36  35.54  45.37
3    16.10  58.89  49.59  48.67  55.00
4    17.35  57.81  49.26  49.06  52.08
5    14.69  59.65  48.91  47.60  55.45

Table 9
Evaluation results of the optimized model for Tamil.

No.  Muc    Bcub   Ceafm  Ceafe  Blanc
1    31.79  44.33  27.93  34.53  35.20
2     6.21  64.67  43.14  38.97  51.97
3     8.20  60.86  46.81  44.53  52.62
4    11.41  59.18  45.42  45.29  51.87
5     8.47  62.50  46.49  43.58  52.80

Table 10
Evaluation results of the optimized model for Dutch.

No.  Muc    Bcub   Ceafm  Ceafe  Blanc
1    17.67  73.61  58.08  59.01  55.58
2    15.53  74.43  56.51  56.76  55.28
3    17.14  73.41  57.93  59.21  54.92


7 This is written in ITRANS notation.

In the case of single objective DE we optimize each scorer separately. Hence we execute SOO five times, once for each scorer (Muc, Bcub, Ceafm, Ceafe and Blanc), and the results are reported in Table 11. In further experiments we study the effects of the isolated model, i.e., when either the ensemble or the feature selection module is developed separately. First, the multi-objective DE based ensemble construction technique is used to combine the various models of mention detection. The mentions extracted from this module are passed as markables to anaphora resolution. The multi-objective DE based feature selection technique is then utilized to identify the most relevant set of features for anaphora resolution. This model is denoted as Model-2. Our proposed approach is denoted as Model-3 in the subsequent discussion. Please note that multi-objective DE is executed twice in the isolated case: first for mention detection and then for anaphora resolution. In our proposed joint model, multi-objective DE is executed only once; hence the complexity involved in our model is clearly lower.

The performance of the model in which ensemble construction and feature selection are performed separately (Model-2) sometimes exceeds that of the SOO based joint model (Model-1). The performance of the isolated models is reported in Table 11. Comparisons between the SOO based joint model, the framework where ensemble construction and feature selection work in isolation, and the MOO based joint model illustrate the following:

For Bengali, our proposed joint model (i.e., Model-3) performs better than both the single objective DE based joint model (i.e., Model-1) and the separately executed model (i.e., Model-2). For Hindi, the proposed approach shows better performance than the other two models. For Tamil, our proposed multi-objective joint model achieves the best results among the three models for all scorers. For some scorers the single objective joint model performs better than the MOO based isolated model, while for other scorers the MOO based isolated model exhibits better performance than the SOO based model. For Dutch, our proposed approach again achieves the best results among all the models. In order to show that the MOO based joint model indeed produces multiple solutions, we present the final Pareto optimal front in Table 6. This table shows that there are multiple trade-off solutions which are non-dominated (in the objective function space) with respect to each other. This trade-off reflects the conflicting nature of the five objective functions; depending upon the user's needs, an appropriate solution can be picked. With a SOO based approach, any one of these solutions may be obtained using a weighted sum formulation, where the five objective functions are assigned weights and then combined; however, it has been proved in [7] that the weighted sum approach cannot capture the whole Pareto optimal front. Thus, the MOO based approach indeed provides a more flexible platform.
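For illustration, the minimal dominance filter below (our sketch, not the implementation used in the experiments) makes precise what "non-dominated" means for the five maximised scorers; the four vectors are rows 3, 5, 7 and 10 of Table 6, i.e., exactly the solutions carried into Table 7.

def dominates(a, b):
    # a dominates b if it is no worse on every scorer and strictly
    # better on at least one (all five scorers are maximised).
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def first_rank(solutions):
    # Keep only the vectors that no other vector dominates.
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]

front = first_rank([
    (71.89, 50.05, 29.59, 23.83, 46.06),   # row 3 of Table 6
    (67.93, 59.61, 51.57, 32.09, 72.52),   # row 5
    (68.99, 58.89, 52.55, 32.21, 70.94),   # row 7
    (66.92, 55.49, 49.22, 34.45, 69.04),   # row 10
])
print(len(front))  # 4 -> all four are mutually non-dominated trade-offs

No fixed weighted sum of the five scorers would rank all four of these vectors first simultaneously, which is why a single SOO run can recover at most one of them.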

3.2. Comparisons with other existing systems

Comparisons with the systems reported in the ICON-2011 shared task show that the performance achieved by our proposed model is the best. In particular, for Bengali we obtain much higher accuracy on the Muc scorer. The performance obtained for the Blanc scorer is also on par with the state-of-the-art methods, and often better by a few points than the existing works. However, the results for the other three scorers need further attention. The relatively lower performance on these three metrics may be attributed to the fact that we re-annotated the training, development and test datasets to include longer coreference chains. For example, the two coreference pairs (SachIn7 - Se) and (SachIn - tAr) are merged into a single coreference chain (SachIn - Se - SachIn - tAr), whereas in the original datasets of the ICON-11 shared task these were treated as two separate instances. This is one possible explanation of why a link-based metric such as Muc exhibits better performance while the others suffer.
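The merging step is essentially a transitive closure over coreference pairs. The union-find sketch below is ours (mention strings stand in for token spans; in the corpus the two SachIn occurrences are distinct tokens) and shows how the two annotated pairs collapse into one chain.

parent = {}

def find(x):
    # Root of x's set, with path halving for near-constant amortized cost.
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

# The two annotated pairs from the example above (ITRANS notation).
for a, b in [("SachIn", "Se"), ("SachIn", "tAr")]:
    union(a, b)

chains = {}
for mention in parent:
    chains.setdefault(find(mention), []).append(mention)
print(list(chains.values()))  # [['SachIn', 'Se', 'tAr']] -> one merged chain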

The method proposed in [31] was developed on the benchmark setup of ICON-2011. The authors developed three models and obtained average F-measure values of 66.6%, 68.9% and 77.1%, respectively. It should, however, be noted that along with the ICON-11 datasets they also used four additional documents containing 4923 tokens each; hence the performance reported here cannot be directly compared with the method proposed in [31]. In [9], a generic framework for anaphora resolution in the Indian languages was reported. They used the ICON-2011 shared task datasets, but no evaluation figures were reported in the published version of the paper.


Table 11
Comparison between Model-1, Model-2 and Model-3 for Bengali, Hindi, Tamil and Dutch.

Language  Model-1                            Model-2                            Model-3
          Muc    Bcub   Ceafm  Ceafe  Blanc  Muc    Bcub   Ceafm  Ceafe  Blanc  Muc    Bcub   Ceafm  Ceafe  Blanc
Bengali   68.53  58.12  51.14  33.09  71.04  70.78  58.32  51.41  33.25  71.46  71.89  59.61  52.55  34.45  72.52
Hindi     31.60  61.34  47.56  47.45  52.91  32.05  61.73  48.27  47.57  52.36  33.27  63.06  49.59  49.06  55.45
Tamil     32.37  60.32  45.48  43.37  51.14  30.79  59.63  45.23  44.24  50.83  31.79  64.67  46.81  45.29  52.80
Dutch     16.17  73.89  57.98  58.61  55.11  17.46  73.64  58.08  58.92  55.57  17.67  74.43  58.08  59.21  55.58


A hybrid approach to anaphora resolution has been reported in [6], where dependency structures were used for anaphora resolution. Their experiments were carried out on different datasets, which are not publicly available, and therefore a direct comparison cannot be made. Comparisons with our earlier works [32,33] also show that joint modeling is more effective.

In [13], a named entity recognition (NER) system was developed for the fiction domain. Fiction is a complex domain for locating named entities (NEs), since its characters may be drawn from a diverse spectrum ranging from living things (animals, plants and persons) to non-living things (vehicles, furniture). This NER system was evaluated on a corpus of fables and fairy tales. It was also shown that the application of anaphora resolution does not play a significant role in the proposed NER system: the system was applied both to the original corpus and to the corpus after anaphora resolution, and although there are some differences in recall and precision, the F-measure values remain unaltered. As mentioned in [13], the proposed system was designed for detecting NEs in the fiction domain, and for this reason the authors were not able to compare it with other existing NER systems. Hence we are also unable to compare directly, as our corpora and languages are not suited to the system proposed in [13].

In [12], a dialogue system is developed. A cognitively-inspired representational model is proposed to address the question of how to enable dialogue systems to capture the meaning of spontaneously produced linguistic inputs without explicit syntactic expectations. The model integrates insights from behavioural and neuroimaging studies of working memory operations and of language-impaired patients (i.e., Broca's aphasics). As that paper deals with a dialogue system, a comparison with our approach would not be feasible.

In [19], a spoken dialogue system is developed. The paper proposes a technique to improve the performance of spoken dialogue systems that considers not only knowledge about the semantic frames used by such systems to understand spoken language, but also knowledge about the words in the system's application domain that are used to fill frame slots. As this work deals with spoken dialogue, it would likewise not be feasible to compare our proposed system with it.

3.3. Complexity

The complexity of the algorithm mainly depends on the size of the training data, the number of available features and the number of mention detection models. Let D, NP and G_max denote the length of the chromosome, the number of chromosomes in a population and the maximum number of generations, respectively. The complexity can then be expressed as O(D × NP × G_max). For joint ensemble and feature selection in Bengali, the average learning time per chromosome is about 21 s; hence the total time required is 21 × 40 × 50 s, i.e., roughly 11.7 h (with a population of 50 chromosomes). Once the optimal weights and features have been selected by multi-objective DE, 80 s are required for testing. For Hindi, the learning time is 32 × 40 × 50 s (about 17.8 h), at a rate of 32 s per chromosome, and the testing time for all five solutions is 155 s. Experiments were carried out in a Linux environment on a machine with 8 GB of memory, an Intel(R) Core(TM) i7-4510U CPU @ 2.00 GHz and a 4 MB cache.
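The arithmetic above is easy to reproduce. In the sketch below (ours), the 40 in the reported products is read as the number of generations, which the text leaves implicit; only the population size of 50 is stated explicitly.

# Wall-clock cost of the DE search: one fitness evaluation per chromosome
# per generation, so the total time is roughly t_eval * NP * G_max.
def de_search_cost_s(t_eval_s, pop_size, generations):
    return t_eval_s * pop_size * generations

# Reported per-chromosome learning times: 21 s (Bengali), 32 s (Hindi);
# a population of 50 chromosomes and (assumed) 40 generations.
for lang, t_eval in [("Bengali", 21), ("Hindi", 32)]:
    total = de_search_cost_s(t_eval, 50, 40)
    print(f"{lang}: {total} s (~{total / 3600:.1f} h)")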

3.4. Challenges of NLP in Indian languages

India is a multilingual country with a very rich cultural history. Indian languages are derived from most of the existing language families. Although Indian languages have a very old literary tradition, technological developments are of recent origin and these languages are not resource-rich. The greatest bottleneck in developing NLP systems for Indian languages is the lack of resources and tools, such as annotated corpora, PoS taggers, morphological analyzers and named entity taggers, in the required measure, and developing an accurate anaphora resolution system is limited by all of these constraints. Indian languages are also morphologically very rich, and the same word often has different meanings, which makes it difficult to determine the actual sense of a word. For example, 'kAshI' is a Bengali word which may correspond to three different concepts: a person name, a location name or a disease name. In some languages gender information has a great influence, while in others it has no effect: in Hindi the verb form changes depending on gender, but in Bengali it does not. Pronouns may also appear in different forms; for example, 'Ami' is a pronoun in Bengali that can be written in at least 10 different forms. The same word may likewise appear in different surface forms, and a word that appears as part of a markable in one place may not be part of one elsewhere. This ambiguity affects the performance of the mention detection model which, in turn, affects the overall performance of anaphora resolution. The three Indian languages that we deal with in this work have dissimilar characteristics. Such peculiarities make anaphora resolution a more complex problem to solve, and all these challenges make it harder to come up with a robust set of features for the task.

4. Conclusion

In this paper we proposed a joint model of ensemble learning and feature selection for anaphora resolution. The algorithm determines the best weights for combining the decisions of various mention classifiers and, at the same time, finds the most relevant set of features for anaphora resolution. The system is optimized with a multi-objective version of DE, with the five well-known coreference resolution evaluation metrics as the objectives. The proposed model has been evaluated on the less-resourced Indian languages Bengali, Hindi and Tamil. Using the same configuration, we also applied our model to a completely new language, Dutch, and achieved good accuracies for all the scorers.

Evaluation on a benchmark setup shows that the performance achieved by our proposed model is encouraging, and often better compared to the state-of-the-art systems. In the current setting we used only the decision tree as the machine learning algorithm.


Experiments with other machine learning algorithms, such as maximum entropy and support vector machines, would be another direction for future work. We would also like to concentrate on porting the system to other domains (e.g., biomedical texts). In this work we evaluated on one non-Indian language to show the efficacy of our proposed approach; in the future we would like to carry out experiments on other international languages.

References

[1] H. Adel, H. Schütze, Using mined coreference chains as a resource for a semantic task, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1447–1452.
[2] S. Agarwal, M. Srivastava, P. Agarwal, R. Sanyal, Anaphora resolution in Hindi documents, in: Proceedings of Natural Language Processing and Knowledge Engineering (IEEE NLP-KE), Beijing, China, 2007, pp. 452–458.
[3] T.W. Anderson, S.L. Sclove, Introduction to the Statistical Analysis of Data, 1978.
[4] A. Bagga, B. Baldwin, Algorithms for scoring coreference chains, in: The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference, 1998, pp. 563–566.
[5] P. Chaimongkol, A. Aizawa, Y. Tateisi, Corpus for coreference resolution on scientific papers, in: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), European Language Resources Association (ELRA), Reykjavik, Iceland, 2014.
[6] P. Dakwale, V. Mujadia, D.M. Sharma, A hybrid approach for anaphora resolution in Hindi, in: Proceedings of the Sixth International Joint Conference on Natural Language Processing, Asian Federation of Natural Language Processing, Nagoya, Japan, 2013, pp. 977–981.
[7] K. Deb, Multi-Objective Optimization Using Evolutionary Algorithms, John Wiley & Sons, Inc., New York, NY, USA, 2001.
[8] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput. 6 (2) (2002) 181–197.
[9] S.L. Devi, V.S. Ram, P.R. Rao, A generic anaphora resolution engine for Indian languages, in: Proceedings of COLING-14, 2014, pp. 1824–1833.
[10] A. Ekbal, S. Saha, Joint model for feature selection and parameter optimization coupled with classifier ensemble in chemical mention recognition, Knowl.-Based Syst. 85 (2015) 37–51.
[11] M. García, P. Gamallo, Entity-centric coreference resolution of person entities for open information extraction, Procesamiento del Lenguaje Natural 53 (2014) 25–32.
[12] M. Gnjatovic, V. Delic, Cognitively-inspired representational approach to meaning in machine dialogue, Knowl.-Based Syst. 71 (2014) 25–33.
[13] H.-N. Goh, L.-K. Soon, S.-C. Haw, Automatic dominant character identification in fables based on verb analysis - empirical study on the impact of anaphora resolution, Knowl.-Based Syst. 54 (2013) 147–162.
[14] C. Hardmeier, Discourse in Statistical Machine Translation: A Survey and a Case Study, PhD thesis, 2012.
[15] V. Hoste, Optimization Issues in Machine Learning of Coreference Resolution, PhD thesis, Antwerp University, 2005.
[16] A. Iftene, M.A. Moruz, E. Ignat, Using anaphora resolution in a question answering system for machine reading evaluation, in: Working Notes for the CLEF 2013 Conference, Valencia, Spain, 2013, pp. 23–26.
[17] J.D. Lafferty, A. McCallum, F.C.N. Pereira, Conditional random fields: probabilistic models for segmenting and labeling sequence data, in: ICML, 2001, pp. 282–289.
[18] S. Loáiciga, E. Wehrli, Rule-based pronominal anaphora treatment for machine translation, in: Proceedings of the Second Workshop on Discourse in Machine Translation, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 86–93.
[19] R. López-Cózar, Using knowledge on word-islands to improve the performance of spoken dialogue systems, Knowl.-Based Syst. 88 (2015) 223–243.
[20] X. Luo, On coreference resolution performance metrics, in: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, Association for Computational Linguistics, Stroudsburg, PA, USA, 2005, pp. 25–32.
[21] C. Ma, J.R. Doppa, J.W. Orr, P. Mannem, X. Fern, T. Dietterich, P. Tadepalli, Prune-and-score: learning for greedy coreference resolution, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 2115–2126.
[22] V. Ng, C. Cardie, Improving machine learning approaches to coreference resolution, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 104–111.
[23] M. Novák, Utilization of anaphora in machine translation, in: WDS'11 Proceedings of Contributed Papers, Part I, Matfyzpress, Praha, Czechia, 2011, pp. 155–160.
[24] M. Novák, D. Oele, G. van Noord, Comparison of coreference resolvers for deep syntax translation, in: Proceedings of the Second Workshop on Discourse in Machine Translation, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 17–23.
[25] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
[26] A. Rahman, V. Ng, Coreference resolution with world knowledge, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, Association for Computational Linguistics, Stroudsburg, PA, USA, 2011, pp. 814–824.
[27] M. Recasens, E. Hovy, A deeper look into features for coreference resolution, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 29–42.
[28] M. Recasens, E.H. Hovy, BLANC: implementing the Rand index for coreference evaluation, Natural Lang. Eng. 17 (4) (2011) 485–510.
[29] S. Saha, A. Ekbal, O. Uryupina, M. Poesio, Single and multi-objective optimization for feature selection in anaphora resolution, in: Proceedings of the Fifth International Joint Conference on Natural Language Processing (IJCNLP 2011), 2011, pp. 93–101.
[30] H. Schmolz, Anaphora Resolution and Text Retrieval, Empirische Linguistik / Empirical Linguistics 3, De Gruyter Mouton, Berlin, 2015.
[31] A. Senapati, U. Garain, GuiTAR-based pronominal anaphora resolution in Bengali, in: ACL (2), The Association for Computer Linguistics, 2013, pp. 126–130.
[32] U. Sikdar, A. Ekbal, S. Saha, O. Uryupina, M. Poesio, Adapting a state-of-the-art anaphora resolution system for resource-poor language, in: Proceedings of the Sixth International Joint Conference on Natural Language Processing, Asian Federation of Natural Language Processing, 2013, pp. 815–821.
[33] U.K. Sikdar, A. Ekbal, S. Saha, O. Uryupina, M. Poesio, Differential evolution-based feature selection technique for anaphora resolution, Soft Comput. (2014) 1–13.
[34] L. Sobha, B.N. Patnaik, Vasisth: an anaphora resolution system for Indian languages, in: Proceedings of Artificial and Computational Intelligence for Decision, Control and Automation in Engineering and Industrial Applications (ACIDCA), Monastir, Tunisia, 2000.
[35] W.M. Soon, H.T. Ng, D.C.Y. Lim, A machine learning approach to coreference resolution of noun phrases, Comput. Ling. 27 (4) (2001) 521–544.
[36] J. Steinberger, M. Poesio, M.A. Kabadjov, K. Ježek, Two uses of anaphora resolution in summarization, Information Processing and Management: an International Journal, 2007, pp. 1663–1680.
[37] R. Storn, K. Price, Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces, J. Global Optim. 11 (4) (1997) 341–359.
[38] D.-H. Tran, M.-Y. Cheng, D. Prayogo, A novel multiple objective symbiotic organisms search (MOSOS) for time-cost-labor utilization tradeoff problem, Knowl.-Based Syst. 94 (2016) 132–145.
[39] B. Uppalapu, D.M. Sharma, Pronoun resolution for Hindi, in: Proceedings of DAARC, 2009.
[40] O. Uryupina, Knowledge Acquisition for Coreference Resolution, PhD thesis, University of the Saarland, 2007.
[41] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag New York, Inc., New York, NY, USA, 1995.
[42] Y. Versley, S.P. Ponzetto, M. Poesio, V. Eidelman, A. Jern, J. Smith, X. Yang, A. Moschitti, BART: a modular toolkit for coreference resolution, in: HLT-Demonstrations '08: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, 2008, pp. 9–12.
[43] M. Vilain, J. Burger, J. Aberdeen, D. Connolly, L. Hirschman, A model-theoretic coreference scoring scheme, in: Proceedings of the 6th Conference on Message Understanding, MUC6 '95, Association for Computational Linguistics, Stroudsburg, PA, USA, 1995, pp. 45–52.
[44] C. Walker, S. Strassel, J. Medero, K. Maeda, ACE 2005 Multilingual Training Corpus, Linguistic Data Consortium, Philadelphia, 2006.
[45] R. Weischedel, S. Pradhan, L. Ramshaw, M. Palmer, N. Xue, M. Marcus, A. Taylor, C. Greenberg, E. Hovy, R. Belvin, A. Houston, OntoNotes Release 2.0, LDC2008T04, Linguistic Data Consortium, Philadelphia, PA, 2008.
[46] S. Wiseman, A.M. Rush, S.M. Shieber, J. Weston, Learning anaphoricity and antecedent ranking features for coreference resolution, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26–31, 2015, Beijing, China, Volume 1: Long Papers, 2015, pp. 1416–1426.
[47] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann Series in Data Management Systems), Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.
[48] A.-C. Zavoianu, E. Lughofer, W. Koppelstätter, G. Weidenholzer, W. Amrhein, E. Klement, Performance comparison of generational and steady-state asynchronous multi-objective evolutionary algorithms for computationally-intensive problems, Knowl.-Based Syst. 87 (2015) 47–60.