12
S. Abiteboul, A.-M. Vercoustre (Eds.): ECDL ‘99, LNCS 1696, pp. 311–322, 1999 © Springer-Verlag Berlin Heidelberg 1999 Term Similarity-Based Query Expansion for Cross- Language Information Retrieval Mirna Adriani and C. J. van Rijsbergen Department of Computing Science University of Glasgow Glasgow G12 8RZ, Scotland [mirna, keith]@dcs.gla.ac.uk Abstract. We propose a query expansion technique which is based on a statistical similarity measure among terms to improve the effectiveness of the dictionary-based cross-language information retrieval (CLIR) method. We employ a term similarity-based sense disambiguation technique proposed in our earlier work to enhance the accuracy of the dictionary-based query translation method. The query expansion technique is then applied to the translation of queries to further improve their retrieval performance. We demonstrate the effectiveness of the two techniques combined using queries in three languages, namely, German, Spanish, and Indonesian, to retrieve English documents from a standard TREC (Text Retrieval Conference) collection. The results of our experiments indicate that the term similarity-based techniques work better when there are more phrases in the queries. In addition, our results also re-emphasize other researchers’ finding that phrase recognition and translation are critical to CLIR’s effectiveness. 1 Introduction The increasing availability of digital documents from all around the world accessible through the Internet has allowed Internet users to access text materials written in foreign languages. Today, there are many Internet search engines that make searching for documents easier using a query containing a set of keywords. Most of these search engines can locate documents in various languages as long as the query is written in the same language as the target documents, referred to as monolingual searching, or the documents contain at least one of the keywords, e.g., proper names. It is a challenge when the sought-after documents are written in a language that is foreign to the user, and none of the query words are contained in the documents. This challenge has contributed to the increasing popularity of Cross-Language Information Retrieval (CLIR) among researchers in the Information Retrieval (IR) community in recent years. Research in CLIR explores techniques for retrieving documents in one language in response to queries in a different language. The most obvious approach to CLIR is by either translating the queries into the language of the target documents or translating

[Lecture Notes in Computer Science] Research and Advanced Technology for Digital Libraries Volume 1696 || Term Similarity-Based Query Expansion for Cross-Language Information Retrieval

Embed Size (px)

Citation preview

Page 1: [Lecture Notes in Computer Science] Research and Advanced Technology for Digital Libraries Volume 1696 || Term Similarity-Based Query Expansion for Cross-Language Information Retrieval

S. Abiteboul, A.-M. Vercoustre (Eds.): ECDL ‘99, LNCS 1696, pp. 311–322, 1999© Springer-Verlag Berlin Heidelberg 1999

Term Similarity-Based Query Expansion for Cross-Language Information Retrieval

Mirna Adriani and C. J. van Rijsbergen

Department of Computing ScienceUniversity of Glasgow

Glasgow G12 8RZ, Scotland[mirna, keith]@dcs.gla.ac.uk

Abstract. We propose a query expansion technique which is based on astatistical similarity measure among terms to improve the effectiveness of thedictionary-based cross-language information retrieval (CLIR) method. Weemploy a term similarity-based sense disambiguation technique proposed in ourearlier work to enhance the accuracy of the dictionary-based query translationmethod. The query expansion technique is then applied to the translation ofqueries to further improve their retrieval performance. We demonstrate theeffectiveness of the two techniques combined using queries in three languages,namely, German, Spanish, and Indonesian, to retrieve English documents from astandard TREC (Text Retrieval Conference) collection. The results of ourexperiments indicate that the term similarity-based techniques work better whenthere are more phrases in the queries. In addition, our results also re-emphasizeother researchers’ finding that phrase recognition and translation are critical toCLIR’s effectiveness.

1 Introduction

The increasing availability of digital documents from all around the world accessiblethrough the Internet has allowed Internet users to access text materials written inforeign languages. Today, there are many Internet search engines that make searchingfor documents easier using a query containing a set of keywords. Most of these searchengines can locate documents in various languages as long as the query is written inthe same language as the target documents, referred to as monolingual searching, or thedocuments contain at least one of the keywords, e.g., proper names. It is a challengewhen the sought-after documents are written in a language that is foreign to the user,and none of the query words are contained in the documents. This challenge hascontributed to the increasing popularity of Cross-Language Information Retrieval(CLIR) among researchers in the Information Retrieval (IR) community in recentyears.

Research in CLIR explores techniques for retrieving documents in one language inresponse to queries in a different language. The most obvious approach to CLIR is byeither translating the queries into the language of the target documents or translating

Page 2: [Lecture Notes in Computer Science] Research and Advanced Technology for Digital Libraries Volume 1696 || Term Similarity-Based Query Expansion for Cross-Language Information Retrieval

312 M. Adriani and C.J. van Rijsbergen

the documents into the language of the queries. Translating documents is obviously avery expensive task since there are typically so many documents in a collection andeach of which may be very long, and above all, there is not yet a reliable naturallanguage translation system that can be used to automate the process. For thesereasons, many researchers in this field, the authors included, opted to take the querytranslation approach.

There is a number of ways to perform query translations, namely, by 1) employingmachine translation techniques [12], 2) using parallel corpora [19, 20], or 3) usingbilingual dictionaries [1, 3, 7, 10]. The first two approaches are very labour intensive.Manual hand-coding of linguistic, semantic and pragmatic knowledge is required for amachine translation engine to produce good translations. This can be quiteoverwhelming when the domain of coverage is wide. Likewise, a great deal of work isalso required for building parallel collections, i.e., for manually translating each of thedocuments in the collection into its equivalence in another language.

The third approach, dictionary based translation, has recently become very practicalwith the increasing availability of machine-readable bilingual dictionaries. In addition,the topic coverage of this technique is less limited than that of parallel corpus since adictionary typically contains a wider variety of terms than a sample corpus. However,the effectiveness of this approach depends on its ability to select the right sense of aword from many possible senses provided by the dictionary. In our earlier work [2], weproposed a technique for selecting the best sense of a query term from all possiblesenses given by a dictionary based on statistical term similarity among the senses. Inthis work, we employ the same technique but enhanced with a query expansiontechnique which is also based on the term similarity measure. Unlike the queryexpansion technique used in our earlier work, the current technique selects terms, to beadded to the queries, based on the collective similarity between each candidate termand all of the existing terms in the query.

We hypothesize that there are differences in the pattern of term distribution amongdifferent languages, which may correlate with the effectiveness of our techniques inthose languages. For this reason, we have conducted a series of experiments using aquery set transcribed in three languages, namely, German, Spanish, and Indonesian toretrieve documents in English. We then analyzed the retrieval effectiveness of ourtechniques for each of the languages and investigated the cause of variations.

In Section 2, we present a brief survey of relevant work done by other researchers.Section 3 describes our term similarity based query expansion technique, and providesa review of our sense disambiguation technique. Section 4 discusses the experimentsthat we conducted to measure the effectiveness of our techniques and their results.Finally, Section 5 concludes this paper with a summary.

Page 3: [Lecture Notes in Computer Science] Research and Advanced Technology for Digital Libraries Volume 1696 || Term Similarity-Based Query Expansion for Cross-Language Information Retrieval

Term Similarity-Based Query Expansion for Cross-Language Information Retrieval 313

2 Related Work

Some of the earliest work in CLIR was done by Salton [17] and Pevzner [13] who usedthesauri to index and retrieve documents written in multiple languages. Their resultsshowed that the effectiveness of cross-language retrieval was almost the same as thatof monolingual retrieval. Since then, research in CLIR has grown to cover a widervariety of languages and techniques.

A technique for translating queries indirectly using parallel corpora has beenproposed by Sheridan & Ballerini [19, 20]. A parallel corpus is a collection of the sameor equivalent set of documents written in two (or more) languages. These authorsemployed a term similarity thesaurus built by comparing the occurrences of terms ineach pair of equivalent documents written in Italian and German. Their Germanqueries were substituted with Italian queries constructed from the terms found in theItalian documents parallel to the relevant German documents [19].

Query translation using bilingual dictionaries has been much studied by researchersin the field [1, 3, 6, 7, 10]. As mentioned in the previous section, dictionary basedtranslation is prone to errors due to the high possibility of selecting the wrong sense(meaning) of a term among the senses provided by the dictionary. As a result, queriestranslated using this method typically perform worse than the equivalent monolingualqueries - referred to here as monolingual retrieval performance. This is called theambiguity problem in CLIR. Another problem associated with the dictionary-basedmethod is the problem in translating compound-noun phrases in a query. The problemstems from separately translating words belonging to such a phrase, word-for-word,resulting in a sequence of words with incompatible meanings.

A technique that can be used to alleviate the impact of the above problems is byidentifying phrases in the query and translating them using a phrase dictionary. Such atechnique has been shown to improve CLIR performance. Hull & Grefenstette [10]demonstrated that the retrieval performance of queries produced using manual phrasetranslation was significantly better than that of queries produced by simple (word-for-word) dictionary-based translation. Davis & Ogden [8] showed that phrase translationusing a phrase dictionary built by extracting phrases from parallel sentences in Frenchand English documents improved the performance of their dictionary-based CLIR.Work in term-sense disambiguation has been done by Ballesteros & Croft [4] whodemonstrated the effectiveness of a phrase recognition and translation technique whichis based on the co-occurrence of terms in the target document collection. A differentapproach is proposed by Pirkola [14] whose technique reduces the effect of theambiguity problem by structuring the queries and translating them using a general anda domain-specific dictionary.

To further mitigate the negative effect of mistranslated query terms, manyresearchers have employed query expansion techniques. Query expansion is a well-known method in IR for improving retrieval performance. Basically, it adds newterms, selected using a certain technique, to the query such that the query becomesmore precise where the added terms clarify the meaning of the original query terms,and its recall is improved as terms associated with the original query terms are added.

Page 4: [Lecture Notes in Computer Science] Research and Advanced Technology for Digital Libraries Volume 1696 || Term Similarity-Based Query Expansion for Cross-Language Information Retrieval

314 M. Adriani and C.J. van Rijsbergen

In monolingual IR, Sparck Jones [21] proposed a query expansion technique whichadds terms obtained from term clusters built based on co-occurrences of terms in thedocument collection. Qiu & Frei [15] have done similar work, but using termsimilarity, instead of term co-occurrence. In CLIR, Ballesteros & Croft [4], andCarbonell et al. [6] employed pseudo relevance feedback techniques to obtain terms forthe query expansion. The pseudo relevance feedback techniques assume that the toprank documents initially retrieved using the queries are relevant. Terms appearing inthese relevant documents are then added to the queries. Ballesteros & Croft [3]proposed pre-translation, post-translation and a combination of post and pre-translationquery expansion techniques based on term co-occurrence. They found that post-translation query expansion, i.e., query expansion on the translated queries, and thecombination-translation query expansion, i.e., query expansion on both the original andthe translated queries, are effective in improving CLIR performance.

3 Algorithms

3.1 Term Disambiguation

In our earlier work [2] we proposed a term-sense disambiguation technique forselecting the best translation sense of a word from all possible senses given by abilingual dictionary. Basically, given a set of original query terms, we select for eachof the terms the best sense such that the resulting set of selected senses contains sensesthat are closely related- or statistically similar- with one another. This is done using anapproximate or pseudo-optimal algorithm. Given a set of n original query terms {t1, t2,…, tn}, a set of translation terms, T, is obtained using the following algorithm:

1. For each ti (i=1 to n), retrieve a set of senses Si from the dictionary.2. For each set Si (i=1 to n), do steps 2.1, 2.2 and 2.3. 2.1 For each sense tj’ (j=1 to |Si|) in Si, do step 2.1.1

2.1.1 For each set Sk (k=1 to n & k <> i), get the maximum similarity, Mj,k,between tj’and the senses in Sk..

2.2 Compute the score of sense tj’as the sum of Mj,k (k=1 to n & k <> i). 2.3 Select the sense in Si with the highest score, and add the selected sense into the

set T.3. End

Query terms that are not found in the dictionary are included in the translation set Tas-is. This is typically the case for proper names, technical terms, and acronyms.

We obtain the degree of similarity or association-relation between terms using aterm association measure, called Dice similarity coefficient [16], which is commonly

Page 5: [Lecture Notes in Computer Science] Research and Advanced Technology for Digital Libraries Volume 1696 || Term Similarity-Based Query Expansion for Cross-Language Information Retrieval

Term Similarity-Based Query Expansion for Cross-Language Information Retrieval 315

used in document or term clustering. The term association measure, simxy, betweenterm x and y is computed as follows:

n

n

n

simxy = 2 ∑ (w’ xi ⋅ w’ yi ) / ( ∑ wxi

2 + ∑wyi

2 ) i=1

i=1

i=1

wherewxi = the weight of term x in the document iwyi = the weight of term y in document iw’xi = wxi if term y also occurs in document i, or 0 otherwisew’yi = wyi if term x also occurs in document i, or 0 otherwisen = the number of documents in the collection.

The weight wxi of term x in document i is computed using the standard tf*idf termweighting formula [18].

3.2 Query Expansion

As described previously in Section 2, query expansion has been known to be effectivein improving the retrieval performance of translation queries. In this work, we pursuethe same direction but using a query expansion technique different from those used byprevious researchers. Our query expansion technique adds to a given query termswhich are highly similar, in terms of statistical distribution, to all of the terms in thequery. We measure such a quality of a term, referred to as the term proximity score, bycomputing the sum of the similarity scores between the term and each term in thequery. First, we compute the term proximity score of every term in the collection withrespect to each term in the query. Given a query q containing m terms, the proximityscore, pxq, of term x with respect to query q is computed as follows:

m m

pxq = idfx ∑ (simxj ⋅ idfj) / ∑ idfj

j=1

j=1

wheresimxj = the similarity measure between term x and query term jidfx = the inverse document frequency (idf) of term xidfj = the idf of query term jm = the number of terms in query q.

idfX is computed as log (n/dfx), where dfx is the number of documents containingterm x in the collection. Next, terms whose proximity scores are above a thresholdvalue are then selected and added into the query as expansion terms. The optimal valueof the threshold is collection dependent, and is established through experiments.

In practice, only terms whose similarity scores, simxj, are non-zero are included inthe above computation. To minimize redundant computations in both the sense

Page 6: [Lecture Notes in Computer Science] Research and Advanced Technology for Digital Libraries Volume 1696 || Term Similarity-Based Query Expansion for Cross-Language Information Retrieval

316 M. Adriani and C.J. van Rijsbergen

disambiguation and query expansion techniques, we use a term-similarity matrix as alookup table. This matrix is built using the target corpus. Making use of the targetcorpus, instead of a separate training collection, for this purpose is very practical sincestatistical parameters, such as term frequency (tf) and df, are easily obtainable from thetext retrieval system used to index the documents.

4 Experiments

To measure the effectiveness of the sense disambiguation and query expansiontechniques we conducted a series of experiments using an English corpus. The corpuscontains 748 MB of the TREC (Text Retrieval Conference [8]) Associated Press (AP)collections. We used the German and Spanish versions of the TREC’s 24 queries forcross-language topics to retrieve relevant documents in the English corpus. In addition,we also used an Indonesian version of the queries which were constructed by manuallytranslating the queries. The translation was done by one of the authors whose nativelanguage is Indonesian. The example below shows the different language versions ofone of the queries.

English: Reasons for controversy surrounding Waldheim's World War II actions.German: Grund der Polemik um die Tatigkeit Kurt Waldheims im Zweiten Weltkrieg.Spanish: Razones de la controversia que rodea las acciones de Waldheim durante la

Segunda Guerra Mundial.Indonesian: Alasan munculnya kontroversi tentang tindakan-tindakan Waldheim pada

Perang Dunia II.

Next, we used a subset of the TREC AP collections, i.e. the AP89 collection, tobuild an English term-similarity matrix. We used only a subset of the collectionbecause computing the term-similarity matrix for the entire corpus is too costly interms of computation time and memory space requirement.

In the experiments, we used three bilingual dictionaries, namely, the EcholsIndonesian-English dictionary, the Collins Spanish-English dictionary, and theLangenscheidt German-English dictionary to translate the Indonesian, the Spanish, andthe German queries, respectively, into English. The query translation process wasperformed manually. First, phrases or sequences of words in the queries that can befound in the dictionary were replaced with their translations according to thedictionary. Next, each of the remaining query terms was then substituted with anytranslation senses found in the dictionary for that term. Query terms that were notfound in the dictionary were left unchanged. Next, we then applied the sensedisambiguation technique to select the best sense word among the possible senses foreach query term. Finally, we applied the query expansion technique to the resultingqueries.

We measured the effectiveness of our techniques in terms of average retrievalprecision which was computed using the standard 11 recall-point measurement for

Page 7: [Lecture Notes in Computer Science] Research and Advanced Technology for Digital Libraries Volume 1696 || Term Similarity-Based Query Expansion for Cross-Language Information Retrieval

Term Similarity-Based Query Expansion for Cross-Language Information Retrieval 317

TREC. Our CLIR method uses an off-the-shelf IR system for indexing and retrievingthe documents. In this work we use the INQUERY system [4]. In the experiments, wecompared the retrieval effectiveness of four CLIR methods, namely, (1) monolingualmethod which uses the original English TREC queries, (2) simple translation methodwhich uses queries produced by including all possible translation senses in thedictionary for each query term, (3) automatic disambiguation method which employsthe sense disambiguation technique, and (4) expanded method which employs thequery expansion techniques.

4.1 Result

As can be expected, the average retrieval precision of the queries produced using thesimple translation method for the three languages were much worse than that of theequivalent monolingual method. As shown in Table 1, the average retrieval precisionof the English translation queries (queries translated into English) of the German, theSpanish, and the Indonesian queries are, respectively, 65.77%, 78.47%, and 57.41%below that of using the monolingual method. The increases in the number of queryterms as the result of the simple translation method contributed to the poorperformance. Each of the original English TREC queries, on the average, contains5.46 terms (excluding the stop-words). The average resulting English query obtainedfrom translating the German, the Spanish, and the Indonesian queries contains 35.21terms, 37.25 terms, and 17.75 terms, respectively. The average number of terms perquery in the German, the Spanish, and the Indonesian queries are 5.67 terms, 6.33terms, and 5.79 terms, respectively. Applying the sense disambiguation technique, theaverage retrieval precision improved by 48.20%, 40.68%, and 27.96% for the German,the Spanish, and the Indonesian queries, respectively.

We then applied the query expansion technique to improve the retrievaleffectiveness of translation queries. For each of the query sets, we first performed aseries of preliminary experiments to establish the best threshold values that maximizethe queries’ retrieval effectiveness.

Table 1. Average retrieval precision (and performance drop compared tomonolingual) of the English queries translated from the German, the Spanish,and the Indonesian queries for the simple translation and the automaticdisambiguation methods.

Language Baseline Simple translation

Automaticdisambiguation

Monolingual (English)

0.2804 - -

Germanto English

- 0.0960 (-65.77%)

0.2312(-17.57%)

Spanish to English

- 0.0604 (-78.47%)

0.1745(-37.79%)

Indonesian to English

- 0.1194(-57.41%)

0.1978 (-29.45%)

Page 8: [Lecture Notes in Computer Science] Research and Advanced Technology for Digital Libraries Volume 1696 || Term Similarity-Based Query Expansion for Cross-Language Information Retrieval

318 M. Adriani and C.J. van Rijsbergen

Applying the query expansion technique to the queries produced using the simpletranslation method resulted in significant performance improvements. As can be seenin Table 2, the improvements were 2.58%, 12.14%, and 15.43% for the German, theSpanish, and the Indonesian query sets, respectively.

Table 2. Average retrieval precision (and performance drop compared tomonolingual) of the English queries translated from the German, the Spanish,and the Indonesian queries for the simple translation and the expanded simpletranslation methods.

Language Baseline Threshold Simpletranslation

Expandedsimpletranslation

Monolingual(English)

0.2804 - - -

Germanto English

- 0.7 0.0960(-65.77%)

0.1032(-63.19%)

Spanish to English

- 0.3 0.0604(-78.47%)

0.0944(-66.33%)

Indonesian to English

- 1.0 0.1194(-57.41%)

0.1627(-41.98%)

Applying the query expansion technique in combination with the sensedisambiguation technique resulted in slight performance improvements. As can be seenin Table 3, the improvements were 0.28%, 4.16%, and 5.81% for the German, theSpanish, and the Indonesian query sets, respectively.

Table 3. Average retrieval precision (and performance drop compared tomonolingual) of the English queries translated from the German, the Spanish,and the Indonesian queries for the automatic disambiguation and the expandedautomatic disambiguation methods.

Language Baseline Threshold Automaticdisambiguation

Expandedautomatic

disambiguationMonolingual

(English)0.2804 - - -

Germanto English

- 1.0 0.2312 (-17.57%)

0.2319(-17.29%)

Spanishto English

- 0.9 0.1745 (-37.79%)

0.1861 (-33.63%)

Indonesian to English

- 2.0 0.1978 (-29.45%)

0.2141 (-23.64%)

Page 9: [Lecture Notes in Computer Science] Research and Advanced Technology for Digital Libraries Volume 1696 || Term Similarity-Based Query Expansion for Cross-Language Information Retrieval

Term Similarity-Based Query Expansion for Cross-Language Information Retrieval 319

4.2 Analyses

The results of our experiments demonstrate that the term-similarity based sensedisambiguation does improve the retrieval performance of dictionary based CLIRperformance. However, the degrees of improvement are not similar for all the querysets. The best result was obtained for the German queries (48.20%), followed by theSpanish queries (40.68%), and, lastly, the Indonesian queries (27.96%). Note that wecompare the degrees of improvement instead of the average retrieval precision values.Since we used a different bilingual dictionary for each language query-set we cannotcompare the average retrieval precision of one query set with another.

Our further investigation revealed that significant retrieval performanceimprovements were achieved by queries containing phrases which were translatablenon-ambiguously into English using the bilingual dictionaries. Table 4 shows thenumber of phrases in the original queries for each query set, and the number of suchphrases that can be translated using the bilingual dictionaries.

Table 4. The number of phrases in the original query-sets and the numberof phrases that can be translated using the dictionaries.

Germanqueries

Spanishqueries

Indonesianqueries

Original phrases 36 36 38Translation phrases 10 10 5

From these observations, it can be said that a larger performance improvement usingthe sense-disambiguation technique can be achieved if there are more cases wherewords in the original queries’ language (German, Spanish, or Indonesian) can betranslated deterministically into the target language (English) using the bilingualdictionaries. It is also worth mentioning that, in our earlier work [2], we discoveredthat, within a language’s query-set, there is a correlation between the number ofphrases in a query and the percentage of retrieval improvement resulted from applyingthe technique.

As for the query expansion technique, we observed that significant performanceimprovements were obtained when the technique discovered terms mutually associatedwith the existing query terms. Unfortunately, in many cases, the technique also addednon relevant terms into the queries which resulted in retrieval performance drops. Suchperformance drops were significant in the case of queries with relatively small numberof terms, in particular, queries which had gone through the sense disambiguationprocess.

We also observed that there is a correlation between the number of phrases in theoriginal queries and the technique’s effectiveness in improving retrieval performance.For every query in each of the query sets, we compared the number of phrases in theoriginal query and the performance improvement resulted from using the query

Page 10: [Lecture Notes in Computer Science] Research and Advanced Technology for Digital Libraries Volume 1696 || Term Similarity-Based Query Expansion for Cross-Language Information Retrieval

320 M. Adriani and C.J. van Rijsbergen

expansion technique. Using the Spearman’s correlation coefficient method [11]- a non-parametric statistical tool- we obtained correlation coefficients of 0.366, 0.473, and0.238 for the German, the Spanish, and the Indonesian queries, respectively. Thecorrelation between the number of phrases and the performance improvement aresignificant for the German and the Spanish query sets as their coefficients are wellabove the critical value for 24 observation data (queries) at α = 0.5, namely, 0.343.Note that for the German queries, we count each compound word as a phrase since ittranslates into a noun phrase in English. Our tentative explanation to this phenomenonis that, as members of a phrase tend to co-occur, so do their translations, and so, thequery expansion technique has a better chance of finding the phrase members’ correcttranslations that were not identified by the translation process.

Overall, our finding reemphasize the point that has been made by other researcherssuch as Ballesteros & Croft [4], i.e. that phrase recognition and translation are criticalto the effectiveness of cross-language query translation.

5 Summary

The availability of machine-readable bilingual dictionaries has made dictionary-basedapproaches to CLIR cost effective. The result of our study suggests that the two majorresearch issues in CLIR, namely, term ambiguity and phrase recognition andtranslation [3, 4, 10], are also the main sources of problem in dictionary-based querytranslation techniques. We have demonstrated that using statistical term similaritymeasures to enhance the dictionary-based query-translation CLIR method, particularlyin term disambiguation and query expansion, can significantly improve retrievaleffectiveness.

Our work with term-similarity based sense disambiguation and query expansion hasprovided us with a vehicle for investigating the effect of various language-specific termcharacteristics, including term distribution parameters, on the retrieval performance ofCLIR. This study is part of an on-going research in an effort to build a model forCLIR. Our short-term objective in this research is to identify the collective statisticalproperty or properties among terms which correlates with the term ambiguity andphrase translation problems.

References

1. Adriani, Mirna and Croft, W. Bruce. The Effectiveness of a Dictionary-BasedTechnique for Indonesian-English Cross-Language Text Retrieval. CIIR TechnicalReport IR-170, University of Massachusetts, Amherst, 1997.

2. Adriani, Mirna. Using Statistical Term Similarity for Sense Disambiguation inCross-Language Information Retrieval. To appear in Information Retrieval.

3. Ballesteros, L., and Croft, W. Bruce. Resolving Ambiguity for Cross-languageRetrieval. In Proceedings of the 21st International ACM SIGIR Conference onResearch and Development in Information Retrieval, pages 64-71, 1998.

Page 11: [Lecture Notes in Computer Science] Research and Advanced Technology for Digital Libraries Volume 1696 || Term Similarity-Based Query Expansion for Cross-Language Information Retrieval

Term Similarity-Based Query Expansion for Cross-Language Information Retrieval 321

4. Ballesteros, L., and Croft, W. Bruce. Phrasal Translation and Query ExpansionTechniques for Cross Language Information Retrieval. In Proceedings of the 20th

International ACM SIGIR Conference on Research and Development inInformation Retrieval, pages 84-91, 1997.

5. Callan, J. P., Croft, W.B., Harding, S.M.. The Inquery Retrieval System. InProceedings of Third International Conference on Database and Expert SystemsApplications, 1992.

6. Carbonell, J., Yang, Y., Frederking, R., Brown, R.D., Geng,Y., and Lee, D.Translingual Information Retrieval: A Comparative Evaluation. In Proceedings ofFifteenth International Joint Conference on Artificial Intelligence (IJCAI), 1997.

7. Davis, M. and Dunning, T. E. A TREC Evaluation of Query-Translation Methodsfor Multi-Lingual Text Retrieval. In NIST Special Publication: The 4th TextRetrieval Conference (TREC-4), D.K. Harman, ed. Gaithersburg, MD: NIST,1995.

8. Davis, Mark W. and Ogden, William C. Free Resources and Advanced Alignmentfor Cross-Language Text Retrieval. In NIST Special Publication: The 6th TextRetrieval Conference (TREC-6), D.K. Harman, ed. Gaithersburg, MD: NIST,1997.

9. Harman, Donna. Overview of the Sixth Text Retrieval Conference. In Proceedingof the 6th Text Retrieval Conference (TREC-6), 1997.

10. Hull, D. A., and Grefenstette, G. Querying Across Languages: A dictionary-basedapproach to Multilingual Information Retrieval. In Proceedings of the 19th AnnualInternational ACM SIGIR Conference on Research and Development inInformation Retrieval, pages 49-57, 1996.

11. Mendenhall, William, Scheaffer, Richard L., and Wackerly, Dennis D.Mathematical Statistics with Applications, Third ed., Boston: Duxbury Press,1986.

12. Oard, Douglas W. and Hackett, Paul. Document Translation for Cross-LanguageText Retrieval at the University of Maryland. In Proceeding of the Sixth TextRetrieval Conference (TREC-6), 1997.

13. Pevzner, B. Comparative Evaluation of the Operation of the Russian and Englishvariants of the Pusto-Nepusto-2 System. Automatic Documentation andMathematical Linguistic, 6:71-74, 1972.

14. Pirkola, A. The Effects of Query Structure and Dictionary setups in Dictionary-Based Cross-language Information Retrieval. In Proceedings of the 21st

International ACM SIGIR Conference on Research and Development inInformation Retrieval, pages 55-63, 1998.

15. Qiu, Y. and Frei, H. P. Concept Based Query Expansion. In Proceedings of ACMSIGIR International Conference on Research and Development in InformationRetrieval, pages 160-169, 1993.

16. van Rijsbergen, C. J. Information Retrieval, Second ed., London: Butterworths,1979.

17. Salton, G. Automatic Processing of Foreign Language Documents. Journal of theAmerican Society for Information Science, 21: 187-194, 1970.

18. Salton, Gerard, and McGill, Michael J. Introduction to Modern InformationRetrieval, New York: McGraw-Hill, 1983.

Page 12: [Lecture Notes in Computer Science] Research and Advanced Technology for Digital Libraries Volume 1696 || Term Similarity-Based Query Expansion for Cross-Language Information Retrieval

322 M. Adriani and C.J. van Rijsbergen

19. Sheridan, P., and Ballerini, J. P. Experiments in Multilingual InformationRetrieval using the SPIDER System. In Proceedings of the 19th AnnualInternational ACM SIGIR Conference on Research and Development inInformation Retrieval, Zürich, Switzerland, August 1996.

20. Sheridan, P., Braschler, M., and Schauble, P. Cross-Language InformationRetrieval in a Multilingual Legal Domain. In Research and Advanced Technologyfor Digital Libraries, First European Conference, ECDL’97, Pisa, Italy, September1997.

21. Spark Jones, K. Automatic Keyword Classifications for Information Retrieval.London: Butterworth, 1971.