12
adfa, p. 1, 2011. © Springer-Verlag Berlin Heidelberg 2011 A Dialogue System for Telugu, a Resource-Poor Language M Ch Sravanthi K Prathyusha Radhika Mamidi IIIT-Hyderabad IIIT-Hyderabad IIIT-Hyderabad Mullapudi.sravanthi prathyusha.k radhika.mamidi @research.iiit.ac.in @research.iiit.ac.in @iiit.ac.in Abstract. A dialogue system is a computer system which is designed to con- verse with human beings in natural language (NL). A lot of work has been done to develop dialogue systems in regional languages. This paper presents an ap- proach to build a dialogue system for resource poor languages. The approach comprises of two parts namely Data Management and Query Processing. Data Management deals with storing the data in a particular format which helps in easy and quick retrieval of requested information. Query Processing deals with producing a relevant system response for a user query. Our model can handle code-mixed queries which are very common in Indian languages and also han- dles context which is a major challenge in dialogue systems. It also handles spelling mistakes and a few grammatical errors. The model is domain and lan- guage independent. As there is no automated evaluation tool available for dia- logue systems we went for human evaluation of our system, which was devel- oped for Telugu language over ‘Tourist places of Hyderabad’ domain. 5 people evaluated our system and the results are reported in the paper. 1 Introduction A dialogue system is a computer program that communicates with a human in a natu- ral way. Many efforts are being done to make the conversations seem natural. Despite a lot of progress in computational linguistics and language processing techniques we do not see much usage of dialogue systems in real time. Some reasons for this may be the lack of domain expertise, linguistic experts and computational tools. Our approach to build a dialogue system is quick and does not require many language processing tools. Our approach can be described in two parts namely Data Management and Que- ry Processing. Data Management: This component deals with categorization, text segmentation and storage of the data in a hierarchical manner which helps in fast retrieval of the output. Query Processing: This takes a natural language query from a user as input, pro- cess it to extract the keywords and update the context if necessary. Based on the extracted keywords and the context it either retrieves an answer from the database

A Dialogue System for Telugu, a Resource-Poor Language

Embed Size (px)

Citation preview

adfa, p. 1, 2011.

© Springer-Verlag Berlin Heidelberg 2011

A Dialogue System for Telugu, a Resource-Poor

Language

M Ch Sravanthi K Prathyusha Radhika Mamidi

IIIT-Hyderabad IIIT-Hyderabad IIIT-Hyderabad

Mullapudi.sravanthi prathyusha.k radhika.mamidi

@research.iiit.ac.in @research.iiit.ac.in @iiit.ac.in

Abstract. A dialogue system is a computer system which is designed to con-

verse with human beings in natural language (NL). A lot of work has been done

to develop dialogue systems in regional languages. This paper presents an ap-

proach to build a dialogue system for resource poor languages. The approach

comprises of two parts namely Data Management and Query Processing. Data

Management deals with storing the data in a particular format which helps in

easy and quick retrieval of requested information. Query Processing deals with

producing a relevant system response for a user query. Our model can handle

code-mixed queries which are very common in Indian languages and also han-

dles context which is a major challenge in dialogue systems. It also handles

spelling mistakes and a few grammatical errors. The model is domain and lan-

guage independent. As there is no automated evaluation tool available for dia-

logue systems we went for human evaluation of our system, which was devel-

oped for Telugu language over ‘Tourist places of Hyderabad’ domain. 5 people

evaluated our system and the results are reported in the paper.

1 Introduction

A dialogue system is a computer program that communicates with a human in a natu-

ral way. Many efforts are being done to make the conversations seem natural. Despite

a lot of progress in computational linguistics and language processing techniques we

do not see much usage of dialogue systems in real time. Some reasons for this may be

the lack of domain expertise, linguistic experts and computational tools. Our approach

to build a dialogue system is quick and does not require many language processing

tools. Our approach can be described in two parts namely Data Management and Que-

ry Processing.

─ Data Management: This component deals with categorization, text segmentation

and storage of the data in a hierarchical manner which helps in fast retrieval of the

output.

─ Query Processing: This takes a natural language query from a user as input, pro-

cess it to extract the keywords and update the context if necessary. Based on the

extracted keywords and the context it either retrieves an answer from the database

or asks the user an interactive question. In this way, it maintains the conversation

with the user all the time.

This model can be implemented with minimal or no use of any language processing

tool. Our approach is language independent and can be extended to any domain. It can

also handle context and code-mixed queries to make the conversations look more

natural.

Dialogue Example:

U1: hExarAbAxulo*

mAnyumeVMts emi unnAyi?

(What are the monuments in Hyderabad?)

S1: cArminAr, golkoVMda, makkA masjix. mIku vItilo exi kAvAli?

(Charminar, Golconda, Mecca Masjid. What among these do you need?)

U2: cArminAr eVkkada uMxi?

(Where is Charminar?)

S2: cArminAr yoVkka cirunAmA - cArminAr, heVxarAbAx, weVlaMgANa.

(The address of Charminar is -Charminar, Hyderabad, Telangana.)

U2: eVppudu opeVn uMtuMxi?

(When is it open?)

S2: cArminar yoVkka samayaM anni rojulu 9:00 am - 5:30 pm

(Charminar open timings -All days 9:00 am - 5:30 pm)

U3: eMtrI PIju eMwa?

(What is the entry fee?)

S3: cArminAr yoVkka PIju - iMdiyans - 20, vixeSIyulu - 150

(The fee for Charminar - Indians - Rs.20, Foreigners -Rs.150)

For example, the above conversation shows a user-system interaction from ‘Tourism’

domain for Telugu language. User responses are represented as U1, U2, etc. and sys-

tem responses are represented as S1, S2, etc. User and system response together make

a discourse unit i.e. U1 and S1 is a discourse unit. We can observe that context is also

handled up to many discourse units i.e. to answer U3, U4 we need context infor-

mation from U2.

U3 is a code-mixed query as it contains ‘opeVn’ (open), an English word. We can see

that U3 has been successfully processed and understood by the system. This shows

that code-mixed queries are also handled by our system.

* Words are in wx format (sanskrit.inria.fr/DATA/wx.html). All the examples given in the

paper are from Telugu language.

2 Related Work:

There has been a lot of progress in the field of dialogue systems in last few years. In

general dialogue systems are classified into three types. (a) Finite State (or graph)

based systems, (b) Frame based systems, (c) Agent based systems.

(a)Finite state based systems: In this type of systems, conversation occurs according

to the predefined states or steps. This is simple to construct but doesn’t allow user to

ask questions and take initiative. [3] Proposed a method using weighted finite state

transducer for dialogue management.

(b)Frame based systems: These systems have a set of templates which are filled

based on the user responses. These templates are used to perform other tasks. [2] Pro-

posed an approach to build natural language interface to databases (NLIDB) using

semantic frames based on Computational Paninian Grammar. Context information in

NLIDB is handled by [1]. In this paper different types of user-system interactions

were identified and context was handled for one specific type of interaction. A dia-

logue based question answering system [6] which extracts keywords from user query

to identify a query frame has been developed for Railway information in Telugu.

(c)Agent based systems: These systems allow more natural flow of communication

between user and system than the other systems. The conversations can be viewed as

interaction between two agents, each of which is capable of reasoning about its own

actions. [4] Developed an agent based dialogue system called Smart Personal Assis-

tant for email management. This has been further extended for calendar task domain

in [5]. Our model can be categorized as an agent based system.

3 Our Approach

Fig.1 describes the flow chart of the internal working of our model.

Fig. 1. System Architecture

The major components in our method are:

─ Data Organization(Database) ─ Query Processing

Knowledge Base Query Analyzer

Advanced Filter

Context Handler

Dialogue Manager

3.1 Data Organization

Every domain has an innate hierarchy in it. Study of the possible queries in a domain

gives an insight about the hierarchical organization of the data in that domain. For

example, consider ‘Tourist places of Hyderabad’ domain in Fig. 2.

Fig. 2. Data organization of ‘Tourist places of Hyderabad’ domain

When we store data in this manner it becomes easy to add information and extend the

domain. We can see the extension of ‘Tourist places of Hyderabad’ domain to ‘Hy-

derabad Tourism’ domain in Fig. 3.

Fig. 3. Data organization of ‘Tourism of Hyderabad’ domain

In this hierarchical tree structure the leaf nodes are at level-0 and the level increases

from a leaf node to the root node. The data at level-n (where n is number of levels) is

segmented recursively until level-1. Then each segment at level-i (i=1... n) is given a

level-(i-1) tag.

In physical memory all the layers above level-1 are stored as directories, level-1 as

files and level-0 as tags in a file. The text in a file is stored in the form of segments

and each segment is given a level-0 tag (address, open timings, entry fee etc.). The

labels of all the files and directories along with the information in the files contribute

to the data set.

3.2 Query Processing

The entire process from taking a user query to generating a system response is termed

as ‘Query Processing’. The different components of the ‘Query Processing’ module

are described in the subsequent sections.

3.2.1 Knowledge Base

Knowledge base contains a domain dependent ontology like list of synonyms and

code-mixed words. This helps in handling code-mixed and wrongly spelt words in the

queries. This module is used by the Query Analyzer to replace the synonyms, code-

mixed words etc., in a query with corresponding level-i (0…n) tags. If any language

has knowledge resources like WordNet, dbpedia etc., they can be used to build the

knowledge base. This has to be done manually.

3.2.2 Query Analyzer

The NL query given by the user is converted into wx query which is given as input to

the Query Analyzer. The wx query is then tokenized and given as input to morpholog-

ical analyzer and parts of speech (POS) tagger[11]. From the morphological analyz-

er’s output, extract the root words of all the tokens in the query and replace these

tokens with the corresponding root words. In this modified query the synonyms, code-

mixed words etc., are replaced with corresponding level-i tags as discussed in

Knowledge Base. Languages that do not have a morphological analyzer can build a

simple stemmer which applies ‘minimum edit distance algorithm’ to find the root

word of the given token by maintaining a root dictionary and then replaces the token

with the root word [7]. Here, POS tagger is used only to identify the question words

in the query. Languages with no POS tagger can have a list of question words.

Example:

User Query: golkoVMda addreVssu emiti, e tEMlo cUdavaccu?

Golconda address what what time can visit

(What is the address of Golconda, what time to visit?)

POS-tagger: golkoVMda addreVssu emiti/WQ e/WQ tEMlo cUda-vaccu

Root word replacement: golkoVMda addreVssu emiti/WQ e/WQ tEM cUdu

Query Analyzer’s Output: golkoVMda cirunAmA emiti/WQ e/WQ samayaM cUdu

From the above output, we can see that English words like ‘addreVssu’ (address) and

‘tEM’ (time) are mapped to corresponding Telugu words i.e. ‘cirunAmA’ (address)

and ‘samayaM’ (time) respectively by using knowledge base. If there is no corre-

sponding Telugu word the English word remains as it is.

3.2.3 Advanced Filter

From the above modified query, question words and level-i words are extracted. From

these words this module extracts only the words which play a role in the answer re-

trieval by applying heuristics like level-i words nearest to the question words, level-0

words etc. If these heuristics are not satisfied, all the level-i words are considered for

further processing. These final sets of words are the keywords.

A user query can contain more than one question. In such cases, keywords belonging

to a particular question are grouped together. All the groups of a query are collective-

ly called a ‘query set’.

Example:

U: nenu golkoVMda xaggara unnAnu, cArminAr PIju eVMwa iMkA cArminAr

makkAmasjix eVkkada unnAyi ?

(I am near Golconda, what is the fee for Charminar and what is the address of

Charminar and Mecca Masjid?)

Query analyzer output: nenu golkoVMda xaggara uMdu cArminAr PIju

eVMwa/WQ iMkA cArminAr makkAmasjix cirunAmA uMdu

Extracted words: golkoVMda, cArminAr, PIju, eVMwa, cArminAr, makkAmasjix,

cirunAmA

Keywords: cArminAr, PIju, eVMwa, cArminAr, makkAmasjix, cirunAmA

Query set: [cArminAr, PIju, eVMwa], [cArminAr, makkAmasjix, cirunAmA]

In ‘U’, ‘I am near Golconda’ is unnecessary information. Therefore even if ‘Golcon-

da’ is a word belonging to level-i, it will not be considered as a keyword.

3.2.4 Context Handler

Handling context is very important in any conversation to capture user’s intention.

This is a major challenge in present day dialogue systems. The query set given by

Advanced Filter is used to update the context. If we identify any level-i (i=1, 2...n)

word in the query set then there is a shift in the context. If there are no level-n(n<i)

words in the query set we borrow level-n words relevant to level-i from the previous

query set and add them to form the new query set. If the query set contains level-0

words and no words from level-i (i=1, 2...n), then we borrow level-i words from the

previous query set.

Example:

U1: cArminAr eVkkada uMxi?

(Where is Charminar?)

S1: cArminAr yoVkka cirunAmA - cArminAr, heVxarAbAx, weVlaMgANa.

(The address of Charminar is -Charminar, Hyderabad, Telangana.)

U2: eVppudu opeVn uMtuMxi?

(When is it open?)

S2: cArminar yoVkka samayaM anni rojulu 9:00 am - 5:30 pm

(Charminar open timings -All days 9:00 am - 5:30 pm)

U3: eMtrI PIju eMwa?

(What is the entry fee?)

S3: cArminAr yoVkka PIju - iMdiyans - 20, vixeSIyulu - 150

(The fee for Charminar - Indians - Rs.20, Foreigners -Rs.150)

U4: golkoVMda eVkkada uMxi?

(Where is Golconda?)

S4: golkoVMda yoVkka cirunAmA - ibrahIM bAg, heVxarAbAx,

weVlaMgANa 500008.

(The address of Golconda is Ibrahim Bagh, Hyderabad, Telangana 500008.)

In this example, to answer U2 and U3 we need contextual information (cArminAr

[Charminar] ) from U1. Context doesn’t change for U2 and U3 as there is no level-

i(i=1,2..n) word in them. We can observe the context switch from U3 to U4 i.e. switch

from ‘cArminAr’ (Charminar) to ‘golkoVMda’ (Golconda) due to the occurrence of

Golconda (a level-1 word in ‘Tourism’ domain) in U4.

3.2.5 Dialogue Manager

In any dialogue system, Dialogue Manager (DM) is the major component. It coordi-

nates the activity of several subcomponents in a dialogue system and controls the flow

of dialogue by giving relevant responses to user queries. The dialogue manager takes

the query set and the context information as input. If the user query is ambiguous or

no keywords are identified then the dialogue manager poses the user an interactive

question from the set of canned questions. Otherwise it retrieves a relevant answer

from the database.

Example:

U1: nenu golkoVMda cUdAli

(I have to visit Golconda)

S1: mIku memu e vidamugA sAyapadagalamu?

(How can we help you?)

U2: axi eVkkada uMxi?

(Where is it?)

S2: golkoVMda yoVkka cirunAmA - ibrahIM bAg, heVxarAbAx,

weVlaMgANa 500008.

(The address of Golconda is Ibrahim Bagh, Hyderabad, Telangana 500008.)

In U1, the information provided is insufficient. Therefore the dialogue manager posed

an interactive question to the user. Based on the query set and contextual information

of U2, the Dialogue manager retrieves a relevant answer from the database.

4 Detailed Execution

User query: hExarAbaxlo makkAmasjix eVppudu opeVn uMtuMxi?

(When is Mecca Masjid open in Hyderabad?)

POS-tagger: hExarAbaxlo makkAmasjix eVppudu/WQ opeVn uMtuMxi

Replace with root word: hExarAbax makkAmasjix eVppudu/WQ opeVn uMdu

Replace with synonym: hExarAbax makkAmasjix eVppudu/WQ samayaM uMdu

Keywords: makkAmasjix, eVppudu, samayaM

Query set: [makkAmasjix, eVppudu, samayaM]

Answer: makkAmasjix yoVkka samayaM – annirojulu 4:00 am - 9:30 pm

( Mecca Masjid open timings - All days 4:00 am -9:30 pm).

5 Evaluation

Automatic evaluation is not available for dialogue systems. Human evaluation is only

possible. There were 5 evaluators who used and evaluated the system based on the

metrics given in Table 1. The mother tongue of all the evaluators is Telugu. Table 1

shows the criteria for the evaluation and the average of the rating given by human

evaluators on the scale of 1-5 where 1 means poor and 5 means excellent.

Table 1. Human evaluation of our system

Metric average rating

Speed How fast are the responses? 4

Timeout Does the system hang? 5

Recognition Does the system understand your intention? 3.5

Reliability Did you find all the information you were

looking for?

4

Relevance Are the responses appropriate? 4

Usability Is the system easy to use? 4

Complexity Does the system handle complex sentences? 3

Performance Overall performance of the system 3.5

6 Error Analysis

Many efforts are being done to make the conversations seem natural i.e. closer to

human conversation. For this, we need to handle many discourse issues such as

Anaphora, ellipsis etc., and other issues like grammatical and spelling errors. Our

system can handle these issues to some extent but as the complexity of the sentence

increases, the performance of the system degrades. Some such issues are discussed in

this section.

Anaphora Resolution: [8]

U1: cArminAr eVkkada uMxi ?

(Where is Charminar?)

S1: cArminAr - cirunAmA - cArminAr, hExarAbAx, weVlaMgANA

(The address of Charminar is - Charminar, Hyderabad, Telangana.)

U2: xAni PIju eMwa, golkoVMda eVkkada uMxi

(what is its fee and where is Golconda? )

S2: golkoVMda yoVkka cirunAmA - ibrahIM bAg, heVxarAbAx,

weVlaMgANa 500008.

(The address of Golconda is Ibrahim Bagh, Hyderabad, Telangana 500008.)

golkoVMda yoVkka PIju - iMdiyans - 5, vixeSIyulu - 100

(The fee for Golconda - Indians - Rs.5, Foreigners -Rs.100)

Here, in U2, ‘xAni’ (its) may refer to ‘cArminAr’ (Charminar) or ‘golkoVMda’

(Golconda). Though the pronoun ‘xAni’ should map to ‘cArminAr’, it is mapped to

‘golkoVMda’ as golkoVMda’ is nearer to the pronoun (‘its’).

U1: cArminAr eVkkada uMxi ?

(Where is Charminar?)

S1: cArminAr - cirunAmA - cArminAr, hExarAbAx, weVlaMgANA

(The address of Charminar is - Charminar, Hyderabad, Telangana.)

U2: xAni PIju eMwa ?

(What is the entry fee?)

S3: cArminAr yoVkka PIju - iMdiyans - 20, vixeSIyulu - 150

(The fee for Charminar - Indians - Rs.20, Foreigners -Rs.150)

Here, in U2, as there is no ambiguity in co-reference, ‘xAni’ (its) can only refer to

‘cArminAr’ (Charminar) in U1. The system successfully analyzed and gave the cor-

rect response.

Ellipsis: [10]

U1: cArminAr cirunAmA eMti, mari PIju ?

(What is the address of Charminar, (what about its) fees?)

S1: cArminAr - cirunAmA - cArminAr, hExarAbAx, weVlaMgANA

(The address of Charminar is - Charminar, Hyderabad, Telangana.)

cArminAr yoVkka PIju - iMdiyans - 20, vixeSIyulu - 150

(The fee for Charminar - Indians - Rs.20, Foreigners -Rs.150)

In this example, ‘what about its’ in U1 is ellipsis. The system properly interprets the

above query. In general if the sentence becomes complex the system may not be able

to handle the ellipses like in the below example.

U1: cArminAr cirunAmA eMti, mari PIju, golkoVMda eVppudu opeVn

uMtaxi, mari cArminAr

(What is the address of Charminar, (what about its) fees, When is Golconda open,

(what about) Charminar)

S1: golkoVMda - samayaM - somavAraM wappa anni rojulu

5:00pm – 9:00 pm, AxivAraM 7:00pm - 8:00pm kAMwi praxarSana

(except on Monday Golconda is open on all other days from 5:00pm to 9:00pm

and on Sundays there is alazer light show from 7:00pm to 8:00pm)

golkoVMda - PIju - iMdiyans - 5, vixeSIyulu – 100

(The fee for Golconda - Indians - Rs.5, Foreigners -Rs.100)

golkoVMda yoVkka cirunAmA - ibrahIM bAg, heVxarAbAx,

weVlaMgANa 500008.

(The address of Golconda is Ibrahim Bagh, Hyderabad, Telangana 500008.)

cArminAr - cirunAmA - cArminAr, hExarAbAx, weVlaMgANA

(The address of Charminar is - Charminar, Hyderabad, Telangana.)

The output of the system has to be address, entry fee and open timings of Charminar

along with the open timings of Golconda, but we can see that the system wrongly

interpreted and gave output for open timings, entry fee and address of Golconda along

with address of Charminar.

Sandhi: [9]

Sandhi is a common phenomenon in agglutinative languages. For example, consider

the below dialogue.

U: cArminAreVkkaduMxi ?

(Where is Charminar?)

S: mIku memu e vidamugA sAyapadagalamu?

(How can I help you?)

In U1, we can see that a sentence is expressed as single word in Telugu language

which cannot be analyzed by NLP applications. To handle these cases, there is a need

for sandhi splitter which splits ‘cArminAreVkkaduMxi ’ to ‘cArminAr ’ (Charminar),

‘eVkkada’ (where) and ‘uMxi’ (present).

7 Conclusion

We have shown a new and quick approach to build a dialogue system. It can be readi-

ly adapted to other languages. In general, only language specific parts like Database

and Knowledge base have to be replaced for this purpose. Our model is portable to

any domain. It requires a stemmer and a set of question words which can be easily

developed. This brings us one step closer to build dialogue systems for resource poor

languages. Our system also maintains conversation by posing questions to the user.

In future, we intend to build a multi-lingual and multi-domain dialogue system by

improving our current model which should be able to handle pragmatics and dis-

course. We also intend to handle sandhi, ellipses and anaphora resolution to make the

conversations seem more natural. This system can also be integrated with speech

input and output modules.

Acknowledgements. This work is supported by Information Technology Research

Academy (ITRA), Government of India under, ITRA-Mobile grant

ITRA/15(62)/Mobile/VAMD/01

References

1. Akula, A. R., Sangal, R., and Mamidi, R. A novel approach towards incorporating context

processing capabilities in nlidb system.

2. Gupta, A., Akula, A., Malladi, D., Kukkadapu, P., Ainavolu, V., and Sangal, R. (2012). A

novel approach towards building a portable nlidb system using the computational paninian

grammar framework In Asian Language Processing (IALP), 2012 International Confer-

ence on, pages 93–96. IEEE.

3. Hori, C., Ohtake, K., Misu, T., Kashioka, H., and Nakamura, S. (2009). Weighted finite

state transducer based statistical dialog management. In Automatic Speech Recognition

Understanding, 2009. ASRU 2009. IEEE Workshop on, pages 490–495.

4. Nguyen, A. and Wobcke, W. (2005). An agent-based approach to dialogue management in

personal assistants. In Proceedings of the 10th International Conference on Intelligent Us-

er Interfaces, IUI ’05, pages 137–144, New York, NY, USA. ACM.

5. Nguyen, A. and Wobcke, W. (2006). Extensibility and reuse in an agent-based dialogue

model. In Web Intelligence and Intelligent Agent Technology Workshops, 2006. WI-IAT

2006 Workshops. 2006 IEEE/WIC/ACM International Conference on, pages 367–371.

IEEE.

6. Reddy, R. R. N. and Bandyopadhyay, S. (2006). Dialogue based question answering sys-

tem in telugu. In Proceedings of the Workshop on Multilingual Question Answering,

MLQA ’06, pages 53–60, Stroudsburg, PA, USA. Association for Computational Linguis-

tics.

7. Srirampur, S., Chandibhamar, R., and Mamidi, R. (2014). Statistical morph analyzer

(sma++) for indian languages. COLING 2014, page 103.

8. Ruslan Mitkov. 1999. Anaphora resolution: The state of the art. Technical report. Univer-

sity of Wolverhampton, Wolverhampton.

9. Sandhi splitter and analyzer for Sanskrit(with special reference to aC sandhi), by Sachin

kumar, thesis submitted to JNU special centre for Sanskrit,2007. http://sanskrit.jnu.ac.in/

rstudents/mphil/sachin.pdf

10. Dalrymple, Mary, Stuart M. Shieber, and Fernando C. N. Pereira. Ellipsis and higher-order

unification. Technical report, Computation and Language E-Print Archive. 1991

11. Brants T, TnT–A statistical part-of-speech tagger. In: Proceedings of the sixth applied nat-

ural language processing conference (ANLP-2000). p. 224–31.