
The Development of a Thai Spoken Dialogue System

Chai Wutiwiwatchai, Non-member

ABSTRACT

This article summarizes the development of the first Thai spoken dialogue system (SDS), a Thai Interactive Hotel Reservation Agent (TIRA). In the development, a multi-stage technique for spoken language understanding (SLU) was proposed, combining a word-spotting technique for concept extraction with a pattern-classification technique for goal identification. To improve system performance, a novel approach called logical n-gram modeling was developed for concept extraction in order to enhance SLU robustness. Furthermore, dialogue contextual information was used to aid understanding, and an error-detection mechanism was constructed to prevent unreliable interpretation outputs from the SLU. All of these algorithms were incorporated into the TIRA system and evaluated by real users.

Keywords: Thai spoken dialogue system, Spoken language understanding

1. INTRODUCTION

A spoken dialogue system (SDS) is a complex system that combines technologies from several areas such as automatic speech recognition, spoken language understanding, dialogue and database management, and text-to-speech synthesis. Many research sites have developed spoken dialogue systems in various domains and languages. Well-known systems include MIT's weather information system [1] and AT&T's operator assistance [2]. The difficulty of creating such a system depends strongly on its capability to handle mixed-initiative dialogues, that is, the degree to which the system maintains an active role in the conversation. Other important factors are naturalness and the amount of content the system can recognize.

At present, no research site has reported a Thai spoken dialogue system. Thai writing is an alphabetic system with extra symbols indicating the five tonal levels of a syllable. There is typically no space between words and no sentence-boundary marker. Although Thai has a phonetic spelling system [3], predicting the pronunciation from the orthography is not a trivial problem, due to the weak relationship between letters and phones.

Manuscript received on June 12, 2006; revised on August 20, 2006.

The author is with the Human Language Technology Laboratory, National Electronics and Computer Technology Center (NECTEC), 112 Pahonyothin Rd., Klong-luang, Pathumthani, Thailand 12120; E-mail: [email protected]

The shortage of efficient phonological and morphological analyzers is a major factor hindering the development of Thai speech recognition. Initiating a spoken dialogue system for Thai is therefore a challenging task, as most of the sub-components have to be developed specifically for Thai.

We propose a pioneering spoken dialogue system for the Thai language. The system, the Thai Interactive Hotel Reservation Agent (TIRA), provides customers with voice-enabled interactive hotel information retrieval and reservation. Customers can talk in a natural, spontaneous speaking style. We also aim to create a mixed-initiative dialogue system, where customers can either answer system-directed questions or interrupt the system by asking for anything they need at any stage of the conversation.

There have been three versions of TIRA. The first version consisted of several components constructed mainly from rules. To improve accuracy, a multi-stage SLU was developed and trained on a large body of text collected via a webpage simulating the desired dialogues. This novel SLU module was integrated into the second version of TIRA. In the third version, a new algorithm, logical n-gram modeling, was introduced to improve SLU robustness, and several ways of incorporating dialogue contextual information were tested. An error-detection mechanism based on confidence scoring was also incorporated.

In this article, we summarize the development of the TIRA system as follows. The next section describes the architecture of the TIRA system. Sections 3 and 4 explain in detail the highlighted components of TIRA versions 2 and 3, respectively. Evaluations that demonstrate the development trend are given in Sect. 5, followed by the discussion and conclusion in Sect. 6.

2. TIRA SYSTEM

All three versions of TIRA share the same basic structure, as shown in Fig. 1.

• User interface (UI) - Users talk to the system through a push-to-talk interface. The system displays three necessary key values (the number of guests and the check-in and check-out dates) on a small screen, and displays the response text on a main screen. Synthesized speech is played to the users while the response text is displayed. Users can interrupt the system at any time.


Fig.1: The Diagram of TIRA System

• Automatic speech recognition (ASR) - A tied-state triphone acoustic model was created with HTK. A word-class n-gram language model was constructed using the CMU-SLM toolkit. The speech decoder is JULIUS.

• Spoken language understanding (SLU) - The idea behind our SLU module is similar to case grammar. Concepts (semantic slots) appearing in a sentence, such as "the number of people" and "check-in date", indicate the goal of the user utterance, such as "giving prerequisite keys for room reservation". The SLU therefore consists of concept-extraction and goal-identification sub-modules, each constructed from rules in the first version of TIRA.

• Dialogue management (DM) - The capability to handle mixed-initiative turns is added to a rule-based DM. According to the control rules, the system executes internal functions depending on the incoming semantic input, the dialogue history, and dialogue-state variables, and produces a set of semantic responses. Typical internal functions include "consistency verification", in which input keys are compared against the keys perceived in previous turns, and "database retrieval", where room information is retrieved from a database.

• Text generation (TG) and text-to-speech synthesis (TTS) - Given a semantic response, a textual response is generated using template-based text generation; a sketch of this idea is given below. The response speech is then synthesized from the textual response using a Thai text-to-speech synthesizer provided by NECTEC, Thailand [4].
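To illustrate the template-based generation step, the following is a minimal Python sketch. The template strings, slot names, and response types are hypothetical; the actual TIRA templates are Thai-language and are not described in detail in this article.

```python
# Minimal sketch of template-based text generation (hypothetical templates and slots;
# the real TIRA templates are Thai and domain-specific).

TEMPLATES = {
    # semantic response type -> textual template with named slots
    "confirm_reservation": "You have booked {rooms} room(s) for {guests} guest(s), "
                           "checking in on {checkin} and out on {checkout}.",
    "ask_checkin_date":    "On which date would you like to check in?",
}

def generate_text(semantic_response: dict) -> str:
    """Pick a template by response type and fill its slots with key values."""
    template = TEMPLATES[semantic_response["type"]]
    return template.format(**semantic_response.get("slots", {}))

if __name__ == "__main__":
    response = {"type": "confirm_reservation",
                "slots": {"rooms": 1, "guests": 2,
                          "checkin": "5 June", "checkout": "7 June"}}
    print(generate_text(response))
```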

Figure 2 shows a typical dialogue transcribed from a real conversation between a user and the TIRA system. Note the open question "What else may I help you?", which is challenging for developing a natural spoken dialogue system.

Fig.2: A Typical Dialogue of the TIRA System

3. TIRA VERSION 2

TIRA version 1 (TIRA-1) was constructed mainly from handcrafted rules. Coding rules for several components is not a trivial task, as it requires specialists or linguists who know the behavior of spoken dialogues in the desired domain well. The handcrafted rules themselves often conflict with each other and hence require an efficient way to select the best one. A common approach to this problem is to instead implement a statistical model that can be learned from a well-annotated corpus. In the TIRA-2 system, a novel learnable multi-stage SLU approach was developed to solve this problem. Furthermore, the approach is well suited to spoken language, where sentence grammar is often loose, especially in Thai.

3.1 Understanding Thai Utterances

Following Panupong [5], a Thai spoken utterance is analyzed as shown in Fig. 3. The major meaning of the sentence is expressed in the basic part, which consists of several types of basic parts such as the subject, action, and object parts. The sentence can be augmented by various kinds of supplementary parts. All of these parts share the characteristics below.

• Word ordering is crucial within each part; incorrect ordering can cause confusion.

• The positions of parts within a sentence are usually unrestricted. Thais perceive the same meaning from the various ways a sentence can be formed from its parts.

• Often, a sentence part is split by another part, as shown in Fig. 3.

3.2 Data-driven SLU

In trainable, or data-driven, SLU, two different practices, based on robust semantic parsing and topic classification respectively, have been widely investigated.


The first practice aims to tag the words (or groups of words) in the utterance with semantic labels, which are later converted to a certain format of semantic representation. A common representation is a frame of semantic slots (or concepts), optionally coupled with their values. To generate such a semantic frame, the words in the utterance are usually aligned to a semantic tree by a parsing algorithm such as a probabilistic context-free grammar or a recursive transition network, whose nodes represent the semantic symbols of the words and whose arcs carry transition probabilities. This approach has been successfully incorporated in several spoken dialogue systems [6, 7]. A drawback of this method, however, is the requirement of a large, fully annotated corpus, i.e. a corpus with semantic tags on every word, to ensure training reliability.

In the second practice, an understanding module is used to classify an input utterance into one of a set of predefined user goals (assuming an utterance carries one goal) directly from the words it contains. In this case, the semantic representation is a semantic symbol representing the goal. A well-known application of this method is call routing [2], where an incoming call is classified into a type of request. An advantage of this approach is that training sentences need to be tagged only with their goals, one per utterance. However, a further process is required if more detailed information must be obtained from the utterance.

Another SLU technique is based on word spotting, where only specific keywords are extracted from a word string. Keywords are often information items important for communication, such as place names and numbers. Word-spotting systems are usually implemented with simple rules, finite-state machines, or n-gram models [8]. This means that a word-spotting model can be constructed either from handcrafted rules or by learning from a tagged corpus.

For the Thai hotel reservation domain, we need an SLU system that can capture both global information (user goals) and detailed information (concepts). Since no large annotated treebank corpus is available, an SLU model that combines the advantages of the topic classification and word spotting approaches described above would be optimal. Some systems have investigated similar ideas of combination, i.e. constructing a multi-pass system to extract concepts or named entities (NE) together with a goal or dialogue act, but with different implementations, as shown in Table 1. Due to the nature of Thai explained in Sect. 3.1, such existing techniques remain problematic, and another multi-pass SLU technique needs to be explored.

3.3 Multi-stage SLU

Based on the problems described above, we invented a new learnable SLU consisting of three sub-modules, as shown in Fig. 4; a simplified code sketch of the full pipeline follows the sub-module descriptions.

• Concept extraction - This sub-module aims to extract a set of concepts from an input utterance. A concept is actually the basic or supplementary part described in the previous sub-section. Table 2 gives examples of the concepts observed in two utterances. We implemented the concept-extraction component using weighted finite-state transducers (WFSTs). Each WFST represents the substrings (word sequences) that may express the concept. This approach is sometimes called a regular grammar model, denoted in short as 'Reg'.

Table 1: Combination of Algorithms in Multi-pass SLU

Fig.3: Analysis of a Thai Spoken Sentence

• Goal identification - Having extracted the concepts, the goal of the utterance can be identified. The goal in our case can be considered a derivative of the dialogue act coupled with additional information. As the examples in Table 2 show, the goal "request facility" means a request (dialogue act) for some facilities (additional information). Our previous work [9] showed that the goal-identification task could be achieved efficiently by a simple multi-layer perceptron type of artificial neural network (ANN).

• Concept-value recognition - Some key concepts contain values, such as "the number of people" and "facility". The aim of this sub-module is to find the value inside the word sequence of a concept. While an input sentence is being parsed by the concept WFSTs in the concept-extraction sub-module, important keywords indicating concept values are also marked. The marked keywords are then converted to concept values using simple rules.

Fig.4: The Structure of Multi-stage SLU
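To make the data flow of Fig. 4 concrete, here is a highly simplified Python sketch of the three stages. TIRA actually uses WFSTs for concept extraction and a trained multi-layer perceptron for goal identification; the regular-expression matcher, toy concept patterns, and rule-based goal stub below are illustrative stand-ins only.

```python
import re

# Stand-in for the concept WFSTs: each concept is matched by a regular pattern
# over the word-segmented utterance. Patterns and concept names are illustrative.
CONCEPT_PATTERNS = {
    "num_people":   re.compile(r"(\d+) (?:people|guests)"),
    "checkin_date": re.compile(r"check in on ([A-Za-z]+ \d+)"),
    "facility":     re.compile(r"(swimming pool|fitness|internet)"),
}

def extract_concepts(utterance: str) -> dict:
    """Stages 1 and 3: extract concepts and, where marked, their values."""
    found = {}
    for concept, pattern in CONCEPT_PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            found[concept] = match.group(1)   # keyword marked as the concept value
    return found

def identify_goal(concepts: dict) -> str:
    """Stage 2: map the set of extracted concepts to a goal label.
    In TIRA this is a trained ANN over a concept vector; a hand-written rule
    stands in for the classifier here."""
    if "facility" in concepts:
        return "request_facility"
    if {"num_people", "checkin_date"} & concepts.keys():
        return "give_prerequisite_keys"
    return "unknown"

if __name__ == "__main__":
    utt = "I want to check in on June 5 for 2 people"
    concepts = extract_concepts(utt)
    print(concepts, "->", identify_goal(concepts))
```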

Table 2: Examples of Goals, Concepts, and Concept-Values

The concept WFSTs and the ANN goal identifier can be trained from a partially annotated text corpus. To collect a large corpus, a webpage simulating the expected dialogues was created, and native Thai speakers were asked to answer the questions by typing. In this way, we obtained more than 10,000 sentences (dialogue turns) from over 200 Thai natives within one month. The sentences were filtered, verbalized, word-segmented, annotated, and used to train the ASR language model as well as the SLU module.

4. TIRA VERSION 3

To improve the overall performance of the TIRA system, three major issues were considered. The first was to enhance the robustness of the concept-extraction component in the SLU module. The second was how to incorporate knowledge of the dialogue state into the SLU module, and the last was how to prevent errors made by the speech-understanding part. These three issues were addressed and implemented in the TIRA-3 system, as explained in detail below.

Fig.5: Algorithm of Logical N-gram Modeling

4.1 Logical N-gram Modeling

The concept WFST (the Reg model), trained on an annotated corpus, accepts only substrings seen in that corpus. To make it more robust to unseen spoken sentences, a statistical model is preferable [10-12]. It can be noted that the probabilistic models proposed in several works utilize n-gram probabilities with different designs of semantic units. Since our way of defining concepts differs from other systems, a specific design of semantic units is needed.

The concept-extraction process can be viewed as a sequence-labeling task, where a label sequence, L, is determined given a word string, W. Each label indicates which concept the word lies in. Finding the most probable sequence is equivalent to maximizing the joint probability P(W, L), which can be simplified using n-gram modeling (denoted as the Ngram model).
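The article does not spell out the exact factorization; as a sketch, assuming joint word-label units and a bigram history, the Ngram model would take a form such as:

```latex
\hat{L} = \arg\max_{L} P(W, L)
        \approx \arg\max_{L} \prod_{i=1}^{|W|} P(w_i, l_i \mid w_{i-1}, l_{i-1})
```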

The n-gram model can assign a likelihood score to any input utterance; however, it cannot distinguish between valid and invalid grammatical structures. Thus, another probabilistic approach that improves on the n-gram model was proposed, namely logical n-gram modeling (LNgram). The algorithm is shown in Fig. 5.

The LNgram model, motivated mainly by [13], combines the statistical and structural models in two-pass processing. First, the conventional n-gram model is used to generate the M-best hypotheses of label sequences given an input word string. The likelihood score of each hypothesis is then boosted if its word-and-label syntax is permitted by the regular grammar model. By rescoring the M-best list using the modified scores, the syntactically valid sequence with the highest n-gram probability is reordered to the top.


Even if no label sequence is permitted by the regular grammar, the hybrid model is still able to output the best sequence based on the original n-gram scores. This idea can be implemented efficiently in the WFST framework.
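A minimal sketch of the two-pass LNgram idea follows. The n-gram scorer, the grammar-validity check, and the boosting scheme (adding a fixed bonus to the log score of grammar-valid hypotheses) are assumptions made for illustration; the actual implementation composes WFSTs rather than manipulating Python lists.

```python
from typing import Callable, List, Tuple

# Hypothesis = (label_sequence, log_score_from_ngram_model)
Hypothesis = Tuple[List[str], float]

def lngram_rescore(mbest: List[Hypothesis],
                   is_grammatical: Callable[[List[str]], bool],
                   bonus: float = 10.0) -> List[Hypothesis]:
    """Second pass of logical n-gram modeling (illustrative scheme):
    boost the score of every hypothesis whose word-and-label syntax is
    accepted by the regular grammar, then re-sort. If no hypothesis is
    grammatical, the original n-gram ranking is preserved."""
    rescored = [(labels, score + (bonus if is_grammatical(labels) else 0.0))
                for labels, score in mbest]
    return sorted(rescored, key=lambda h: h[1], reverse=True)

if __name__ == "__main__":
    # Toy M-best list over concept labels and a toy grammar check.
    mbest = [(["NUM_PEOPLE", "CHECKIN"], -12.0),
             (["CHECKIN", "CHECKIN"], -11.5)]   # higher n-gram score, invalid syntax
    valid = lambda labels: labels.count("CHECKIN") <= 1
    print(lngram_rescore(mbest, valid)[0])      # the valid hypothesis wins
```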

4.2 Incorporating System Beliefs

The state of the dialogue can help considerably in the understanding process. Dialogue-contextual information (the system belief) was incorporated into the SLU component of the TIRA system in two ways. The first was to develop a dialogue-state-dependent semantic model in the concept-extraction module. This was done simply by training separate Ngram models for each dialogue state and using interpolation or maximization over all models during parsing; a sketch of this combination is given below.
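As an illustration of this first approach, here is a sketch of combining dialogue-state-dependent models by linear interpolation or maximization. The model interface, state names, and weights are hypothetical; in TIRA the per-state models are the Ngram concept models of Sect. 4.1.

```python
from typing import Dict, List, Protocol

class ConceptNgramModel(Protocol):
    """Interface assumed for a dialogue-state-dependent Ngram concept model."""
    def prob(self, words: List[str], labels: List[str]) -> float: ...

def combined_prob(words: List[str],
                  labels: List[str],
                  state_models: Dict[str, ConceptNgramModel],
                  state_weights: Dict[str, float],
                  mode: str = "interpolate") -> float:
    """Score P(W, L) with dialogue-state-dependent models: either linearly
    interpolate the per-state probabilities using the current state weights,
    or take the maximum over all state-dependent models."""
    per_state = {s: m.prob(words, labels) for s, m in state_models.items()}
    if mode == "max":
        return max(per_state.values())
    return sum(state_weights.get(s, 0.0) * p for s, p in per_state.items())

if __name__ == "__main__":
    class UniformModel:
        def __init__(self, p: float): self.p = p
        def prob(self, words, labels): return self.p

    models = {"ask_dates": UniformModel(0.2), "ask_facility": UniformModel(0.6)}
    weights = {"ask_dates": 0.8, "ask_facility": 0.2}   # belief over dialogue states
    print(combined_prob(["two", "people"], ["NUM_PEOPLE"], models, weights))
```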

The second way was to rescore the N-best goal hypotheses produced by the ANN goal identifier using system-belief scores. The belief score can be computed as a probability conditioned on the latest system prompt and the dialogue history, such as the previous user turn. Rescoring can be performed simply by linear interpolation or, in our proposed method, by a non-linear estimator such as an ANN or a support vector machine (SVM). We found the non-linear estimator advantageous when the linear-interpolation weights could not be reliably estimated.
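A sketch of the simpler linear-interpolation variant of goal rescoring is shown below; the weight value and goal names are assumptions. The non-linear variant would replace the weighted sum with a trained ANN or SVM taking both scores as inputs (and, in Sect. 4.3, the confidence features as well).

```python
from typing import Dict, List, Tuple

def rescore_goals(nbest: List[Tuple[str, float]],
                  belief: Dict[str, float],
                  weight: float = 0.7) -> List[Tuple[str, float]]:
    """Linearly interpolate the ANN goal-identifier score with a system-belief
    score P(goal | last system prompt, dialogue history) and re-rank."""
    rescored = [(goal, weight * ann_score + (1.0 - weight) * belief.get(goal, 0.0))
                for goal, ann_score in nbest]
    return sorted(rescored, key=lambda g: g[1], reverse=True)

if __name__ == "__main__":
    nbest = [("request_facility", 0.55), ("give_prerequisite_keys", 0.45)]
    # After a prompt such as "How many guests?", the belief favors the second goal.
    belief = {"give_prerequisite_keys": 0.9, "request_facility": 0.05}
    print(rescore_goals(nbest, belief))
```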

4.3 Understanding-error Detection

The final issue addressed in the TIRA system was the incorporation of an understanding-error detection mechanism. Given a goal identified by the ANN goal identifier, the mechanism computes a confidence score based on several potential features, such as the best ANN output value, the difference between the first and second ANN output values, the number of ANN outputs whose values exceed a threshold, and the average probability from the LNgram parser.

An accept/reject classifier was used to determine whether the SLU output was reliable, based on the confidence scores. The non-linear estimator used for N-best goal rescoring, mentioned in the previous sub-section, could serve at the same time as the accept/reject classifier simply by extending the number of estimator inputs to cover the confidence features. This process did not improve SLU accuracy, but it prevented harmful errors made by the SLU.
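The following sketch computes the kinds of confidence features listed above and applies a simple threshold-based accept/reject decision. The feature set follows the text, but the threshold rule and its values are illustrative stand-ins for the trained non-linear classifier actually used.

```python
from typing import List

def confidence_features(ann_outputs: List[float],
                        lngram_avg_prob: float,
                        activation_threshold: float = 0.5) -> List[float]:
    """Features described in Sect. 4.3: best ANN output, margin between the
    top two outputs, number of outputs above a threshold, and the average
    LNgram parse probability."""
    top = sorted(ann_outputs, reverse=True)
    best = top[0]
    margin = top[0] - top[1] if len(top) > 1 else top[0]
    n_active = sum(1 for o in ann_outputs if o > activation_threshold)
    return [best, margin, float(n_active), lngram_avg_prob]

def accept(features: List[float], min_best: float = 0.6, min_margin: float = 0.2) -> bool:
    """Illustrative accept/reject rule; TIRA uses a trained non-linear
    classifier (the same estimator as in goal rescoring) over these features."""
    best, margin, _, _ = features
    return best >= min_best and margin >= min_margin

if __name__ == "__main__":
    feats = confidence_features([0.82, 0.10, 0.05], lngram_avg_prob=0.4)
    print(feats, "->", "accept" if accept(feats) else "reject")
```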

5. EXPERIMENTAL RESULTS

Table 3 briefly summarizes the important characteristics of each version. Figure 6 presents the improvement in ASR performance in terms of word error rate (WER), test-set perplexity (PP), and out-of-vocabulary rate (OOV). The major reason for the WER improvement was the larger text corpus used for language modeling. Since there was no change in language modeling from TIRA-2 to TIRA-3, the WER results of the two versions were almost equal. It was observed that in this hotel reservation domain, the vocabulary size of 850 words in the latter two versions covered almost all the words expressed by the users.

Table 3: Characteristics of 3 Versions of TIRA

Figure 7 shows the trend of SLU improvement by comparing concept F-measures (ConF), goal accuracies (GAcc), concept-value accuracies (CAcc), and out-of-goal rates (OOG), i.e. goals not trained in the SLU. Figure 7(a) presents results when the test utterances were manually transcribed, whereas Fig. 7(b) is for automatically speech-recognized test utterances. Note that in the case of TIRA-3, goal accuracies were computed after goal rescoring and before error detection. The large improvement in goal accuracy came from a significant reduction of out-of-goals. The improvement trend of the concept F-measure was similar to that of the concept-value accuracy. However, a small increase in concept F-measure could greatly improve goal accuracy; this might be due to the recovery of dominant concepts. TIRA-3 incorporated the error-detection mechanism, which resulted in 15.6% correct rejection and 5.4% false rejection. This reduced the actual goal accuracy from 83.3% to 78.8%, with the benefit of 15.6% error rejection. The correctly rejected turns were not only those misclassified by the goal identifier, but also those containing serious speech-recognition or concept-extraction errors.

Fig.6: Improvement of ASR Performance

Fig.7: Improvement of SLU Performance

Finally, Fig. 8 illustrates the improvement in dialogue success rate, separated into naïve and non-naïve users. The improvement in dialogue success in TIRA-2 came from the large gains in goal and concept-value accuracy, due to the wider coverage of concepts and goals in the new data-driven SLU. The success rate increased further in TIRA-3, not only because of the higher goal and concept-value accuracies achieved by the logical n-gram model, but also because of the error-detection mechanism. As expected, naïve users achieved much lower success rates than non-naïve users.

6. CONCLUSION

A pioneering Thai spoken dialogue system for the hotel reservation domain was created under limited resources of fundamental tools and corpora. The work focused mainly on speech understanding, as it was found to strongly affect the overall performance, and Thai-specific treatment could improve it. There were two main breakthroughs. The first was the invention of a learnable multi-stage SLU which, trained on a large set of typed-in sentences, achieved a significant increase in SLU accuracy. The other was the introduction of logical n-gram modeling, which greatly increased SLU robustness.

There are many issues left for research, especially for Thai. There has been very little effort to incorporate Thai-specific knowledge into the ASR. Furthermore, passing either the 1-best or N-best hypotheses from the ASR to the SLU is only a simple method; a tighter coupling approach should be explored. The hotel reservation domain is a complicated task that requires more extensive research and development to achieve a practical system.

Fig.8: Improvement of Dialogue Success

References

[1] Zue, V., Seneff, S., Glass, J., Polifroni, J., Pao, C., Hazen, T., Hetherington, L., "Jupiter: A telephone-based conversational interface for weather information", IEEE Trans. Speech and Audio Processing, vol. 8, no. 1, pp. 85-96, 2000.

[2] Gorin, A., Riccardi, G., Wright, J., "How may I help you?", Speech Communication, vol. 23, pp. 113-127, 1997.

[3] Luksaneeyanawin, S., "Speech computing and speech technology in Thailand", Proc. SNLP 1993, pp. 276-321, 1993.

[4] Mittrapiyanurak, P., Hansakunbuntheung, C., Tesprasit, V., Sornlertlamvanich, V., "Issues in Thai text-to-speech synthesis: the NECTEC approach", Proc. of NECTEC Annual Conference, Bangkok, pp. 483-495, 2000.

[5] Panupong, V., The Structure of Thai, Master's thesis, Faculty of Arts, Chulalongkorn University, 1982. (in Thai)

[6] Seneff, S., "TINA: A natural language system for spoken language applications", Computational Linguistics, vol. 18, no. 1, pp. 61-86, 1992.

[7] Potamianos, A., Kwang, H., Kuo, J., "Statistical recursive finite state machine parsing for speech understanding", Proc. ICSLP 2000, vol. 3, pp. 510-513, 2000.

[8] Kono, Y., Yano, T., Sasajima, M., "BTH: An efficient parsing algorithm for word-spotting", Proc. ICSLP 1998, pp. 2067-2070, 1998.

[9] Wutiwiwatchai, C., Furui, S., "Combination of finite state automata and neural network for spoken language understanding", Proc. EUROSPEECH 2002, pp. 2761-2764, 2002.


[10] Miller, S., Bobrow, R., Ingria, R., Schwartz, R., "Hidden understanding models of natural language", Proc. ACL 1994, pp. 25-32, 1994.

[11] Minker, W., Bennacef, S., Gauvain, J. L., "A stochastic case frame approach for natural language understanding", Proc. ICSLP 1996, pp. 1013-1016, 1996.

[12] He, Y., Young, S., "A data-driven spoken language understanding system", Proc. ASRU 2003.

[13] Béchet, F., Gorin, A., Wright, J., and Tur, D. H., "Named entity extraction from spontaneous speech in How May I Help You", Proc. ICSLP 2002, pp. 597-600, 2002.

Chai Wutiwiwatchai received his B.Eng. (first-class honors, gold medal) and M.Eng. degrees in electrical engineering from Thammasat University and Chulalongkorn University, Thailand, in 1994 and 1997, respectively. In 2001, he received a scholarship from the Japanese government to pursue a Ph.D. at the Furui Laboratory, Tokyo Institute of Technology, Japan, and graduated in 2004. He has been conducting research

at the National Electronics and Computer Technology Center (NECTEC), Thailand, since 1997 and now serves as the Chief of the Human Language Technology Laboratory. His research interests include speech and speaker recognition, natural language processing, and human-machine interaction.