C. Nikolaou, C. Stephanidis (Eds.): ECDL’98, LNCS 1513, pp. 183-197, 1998. © Springer-Verlag Berlin Heidelberg 1998

Multilingual Information Retrieval Based on Document Alignment Techniques

Martin Braschler 1*, Peter Schäuble 2

1 Eurospider Information Technology AG, Schaffhauserstr. 18, CH-8006 Zürich, Switzerland
2 Swiss Federal Institute of Technology (ETH), CH-8092 Zürich, Switzerland

Abstract. A multilingual information retrieval method is presented where the user formulates the query in his/her preferred language to retrieve relevant information from a multilingual document collection. This multilingual retrieval method involves mono- and cross-language searches as well as merging their results. We adopt a corpus-based approach where documents of different languages are associated if they cover a similar story. The resulting comparable corpus enables two novel techniques we have developed. First, it enables Cross-Language Information Retrieval (CLIR) which does not lack vocabulary coverage, as we observed in the case of approaches that are based on automatic Machine Translation (MT). Second, aligned documents of this corpus facilitate merging the results of mono- and cross-language searches. Using the TREC CLIR data, excellent results are obtained. In addition, our evaluation of the document alignments gives us new insights about the usefulness of comparable corpora.

1 Introduction

We present a multilingual information retrieval method that allows the user to formulate the query in his/her preferred language, in order to retrieve relevant documents in any of the languages contained in a multilingual document collection. This is an extension to the classical Cross-Language Information Retrieval (CLIR) problem, where the user can retrieve documents in a language different from the one used for query formulation, but only one language at a time (see e.g. [11]). The multilingual information retrieval problem we tackle is therefore a generalization of CLIR.

* Part of this work has been carried out during the author's time at the National Institute of Standards and Technology (NIST)



There is a growing need for this general type of multilingual information retrieval, e.g. in multilingual countries, organizations and enterprises. In the future, multilingual information retrieval will also play an important role in the World-Wide Web, whose yearly growth is around 50% for English documents, and more than 90% for non-English documents.

Our approach to multilingual information retrieval is corpus-based, using so-called document alignments. The alignment process associates documents that cover similar stories. This leads to a unidirectional mapping between texts in different collections. The following is an example of such a pair of aligned documents, taken from the French-German SDA alignments (see below). Shown are the titles of the two texts. As can be seen, they clearly cover the same event.

Table 1. Example of an alignment pair

Condor-Maschine bei Izmir abgestürzt: Mutmasslich 16 Tote.
(Condor plane crashed near Izmir: probably 16 dead)
Un avion ouest-allemand s'écrase près d'Izmir: 16 morts.
(A Western German plane crashes near Izmir: 16 dead)

The individual collections in different languages and the mapping given by the alignments together form a multilingual comparable corpus. A related resource would be a parallel corpus. In such a parallel corpus, the paired documents are not only similar, but high-quality manual translations. One of the benefits of the presented method is the excellent availability of comparable corpora as opposed to rare and expensive parallel corpora.

Using a derived comparable corpus, the system accomplishes multilingual IR by doing pseudo (local) feedback (see [14]) on sets of aligned documents. We show how this process can easily be combined with a dictionary-based approach to the translation problem. We present some excellent results obtained on the collections used for the TREC-6 CLIR track (see Appendix A) for the two language pairs English-German and French-German. The English-German run is within a few percentage points of the best runs reported for TREC, including those using full machine translation, while the French-German run clearly outperforms the other runs reported for this language combination.

We further use the alignments to extend the classical CLIR problem to include the merging of mono- and cross-language retrieval results, presenting the user with one multilingual result list. In this case, the alignments help overcome the problem of different RSV scales. This more general problem will also be investigated in the CLIR track for the upcoming TREC-7 conference.

Related work on alignment has been going on in the field of computational linguistics for a number of years. However, most of this work relies on parallel corpora and aligns at sentence or even word level. Usually, the amount of data processed is also smaller by several orders of magnitude. For an example of sentence alignment, see e.g. [3].

Other corpus-based approaches to cross-language IR have been proposed in the past, including the use of similarity thesauri (see e.g. [10] and [12]) and LSI (see [5]).


Our approach is based on document/document similarities, whereas the similarity thesaurus approach and LSI approach are based on term/term similarities, with the similarity thesaurus working in a dual space, and LSI using a low-dimensional vector space obtained by singular value decomposition. On the other hand, the similarity to the Ballesteros/Croft approach [1] is only superficial. While they perform feedback before or after translation, using only a dictionary to translate the query, in our case the feedback takes place on a set of aligned documents, effectively producing the translation itself. A comprehensive overview of cross-language IR approaches can be found in [6].

The remainder of the paper is structured as follows: section 2 discusses the approach for computing alignments. Section 3 discusses methods for evaluating the alignments and section 4 shows the application of alignments in a CLIR system. Section 5 closes the paper by giving a summary and an outlook.

2 Document Alignment

2.1 Using Indicators for Alignment

Alignments are produced by using so-called "indicators" to find similarities between pairs of documents from the collections involved. Indications of such a similarity include:

• The documents share common proper nouns (the spelling of names in similar languages is often quite stable).
• The documents share common numbers (numbers are largely language independent).
• If the documents have compatible classifiers assigned, these can be used.
• The same story is usually published on similar dates by news agencies. Thus, dates can be used as indicators.
• A lexicon can be used to translate terms between the languages. Words shared by both documents are then an indication of similarity.

Only the last class of indicators (lexicon-translated terms) needs a linguistic data resource (the lexicon).

The basic concept underlying the alignment process is to use texts from the first collection as queries and run them against the documents from the other collection, thus retrieving their most similar counterparts. These pairs of similar documents then give a mapping between the two collections.

The following is a list of the main differences between straightforward retrieval and the strategies used in this paper:

1. Elimination of terms based on frequency rather than stopword lists
2. Extracting indicators from the query
3. Thresholding/Query length normalization
4. Date normalization
5. Use of sliding date windows

2.2 Producing the Alignments

Alignments were produced for both English AP to German SDA texts (AP-SDA alignment) and French SDA to German SDA texts. While SDA texts in German and French are not direct translations, the "coverage overlap", i.e. the portion of topics shared between the two collections, is much higher than between AP and SDA, making them easier to align. Additionally, SDA texts in both languages have manually assigned classifiers that are compatible. This greatly simplifies alignment. Because the two collections are similar enough, it is possible to align them using no linguistic resource.

For aligning English AP and German SDA, we use medium-frequency terms as indicators, eliminating both terms with very low and very high frequency. This is similar to using a list of "stopwords" for retrieval, with two main differences: first, not just words from the high-frequency end of the spectrum but also those from the opposite end get eliminated, and second, the list of eliminated words is much larger. The reason for this is that high-frequency terms (occurring in 10% of all documents) don’t help to discriminate between similar and non-similar documents. Low-frequency terms (occurring in only one or two documents) are too likely to give purely random associations. Similar observations have been documented repeatedly in the literature (e.g. [2], [8]).
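The frequency-based elimination described above can be illustrated with a minimal sketch. The function name and the cut-off parameters `min_df` and `max_df_ratio` are assumptions chosen to mirror the figures quoted in the text (terms in only one or two documents, terms in 10% of all documents); the exact thresholds used for the AP-SDA alignment are not stated here.

```python
from collections import Counter


def select_indicator_terms(docs, min_df=3, max_df_ratio=0.10):
    """Return the medium-frequency terms to keep as indicators.

    docs: list of token lists, one list per document.
    min_df, max_df_ratio: illustrative cut-offs (assumptions, not the
    paper's exact values).
    """
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    max_df = max_df_ratio * len(docs)
    # Keep terms that are neither too rare nor too frequent.
    return {term for term, n in df.items() if min_df <= n < max_df}
```

In contrast to a fixed stopword list, the set of eliminated terms is derived from the collection itself, so no language-specific resource is needed for this step.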

The processed AP documents are converted into queries and "transferred" into the target language (i.e. German). The necessary effort for this "translation" step varies greatly depending on how similar the collections are. AP and SDA are too far apart to align them based solely on proper nouns and numbers, so use of a lexicon of some sort is necessary to bridge the language gap. We used a wordlist (very simple form of dictionary) to avoid the need of acquiring a costly dictionary. Such a wordlist was assembled from various free sources on the Internet. All those sources provided very simplistic lists of translations, without extra information as to part of speech, frequency of different translations, etc. The lists also contain a fair number of misspellings and questionable translations. The combined list with English - German translations is quite big. It contains 141,240 entries (85,931 unique "head" entries), most of them of the form "word - word", but some of them phrases and even full sentences.

Using a "noisy" wordlist instead of a high-quality machine-readable dictionary is less of a restriction than might be expected. The collections used for the experiments contain thousands of documents. Doing sophisticated syntactic and semantic analysis on them, such as part-of-speech tagging and word-sense disambiguation, would be very time consuming. Oard and Hackett [7] report that machine-translating the entire SDA and NZZ collections into English took two months of computing time running six computers (most of them SPARC 20 workstations) in parallel. We believe that the alignment process is much more tolerant to translation errors than direct query or document translation for information retrieval, so that this kind of effort can be avoided.

In case of multiple translations, a word was replaced by all possible alternatives. This is because the list contains no extra information and no context analysis takes place. Words missing from the list were omitted, unless they began with a capital letter. Words starting with capital letters were treated as potential names and were transferred "as-is" into the resulting query.
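The transfer rules just described can be sketched as follows. The function name, the dictionary-based wordlist representation and the lower-casing of lookup keys are assumptions made for illustration; the sample entries are invented and do not come from the actual wordlist.

```python
def transfer_query(tokens, wordlist):
    """Transfer source-language tokens into the target language.

    wordlist: dict mapping a lower-cased source word to a list of
    translations (hypothetical representation of the wordlist).
    """
    out = []
    for tok in tokens:
        translations = wordlist.get(tok.lower())
        if translations:
            out.extend(translations)  # keep every alternative, no disambiguation
        elif tok[:1].isupper():
            out.append(tok)  # potential name: copied as-is
        # any other unknown word is omitted
    return out
```

For example, with the entries `{"plane": ["Flugzeug", "Ebene"], "crashed": ["abgestürzt"]}`, the tokens `["Plane", "crashed", "near", "Izmir"]` are transferred to `["Flugzeug", "Ebene", "abgestürzt", "Izmir"]`: both alternatives for "plane" are kept, "near" is dropped, and "Izmir" passes through as a potential name.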

For running the queries against the other collection, two further major modifications were needed: thresholding (including query length normalization) and normalizing by date distance.

News agencies tend to publish stories about the same events on or near the same date. The likelihood of a good match is higher if the dates of the documents of a potential pair are close to each other. Date normalization boosts the similarity score of such documents by dividing the raw alignment score by the logarithm of the distance in days between the two texts.
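A minimal sketch of such a date normalization is given below. The text does not say how a distance of zero or one day is handled, where the plain logarithm would be undefined or zero, so the `offset` added inside the logarithm is purely an assumption of this sketch, as is the function name.

```python
import math
from datetime import date


def date_normalized_score(raw_score, query_date, doc_date, offset=2.0):
    """Divide a raw alignment score by the log of the date distance.

    offset is an assumed smoothing constant that keeps the logarithm
    positive for same-day pairs; it is not taken from the paper.
    """
    days = abs((query_date - doc_date).days)
    return raw_score / math.log(days + offset)
```

With this form, a same-day pair keeps a comparatively high score, while the score of a pair lying weeks apart is reduced, which corresponds to the pull towards the diagonal discussed in section 2.3.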

Thresholding is needed for making a decision based on the retrieval score as to whether the two texts should form a pair. Because AP and SDA are quite different, there is no good counterpart for a lot of the documents, and no pair should be produced. To make scores of individual pairs more comparable, the score is normalized with the query length to give a final score. With our weighting scheme not guaranteeing an upper bound on the score, this is however still not an ideal solution, and thresholding remains an issue that requires further work. Possible approaches to thresholding include setting a single fixed threshold which is applied to all queries, or using the observation that the ratio of documents that have good counterparts remains roughly constant from day to day and filtering out the best portion of pairs that fulfill this criterion.
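The two thresholding options mentioned above can be sketched as follows. All names, the tuple layout of a pair and the use of simple division for the length normalization are assumptions of this sketch; the text only states that the score is normalized with the query length.

```python
def length_normalized(score, query_length):
    """Normalize a retrieval score by the query length (assumed: division)."""
    return score / query_length


def fixed_threshold(pairs, threshold):
    """Keep pairs whose normalized score reaches one global threshold.

    pairs: list of (query_id, doc_id, normalized_score) tuples.
    """
    return [p for p in pairs if p[2] >= threshold]


def keep_best_fraction(pairs, fraction):
    """Keep the best-scoring fraction of the pairs produced for one day."""
    k = round(len(pairs) * fraction)
    return sorted(pairs, key=lambda p: p[2], reverse=True)[:k]
```

The second variant avoids choosing an absolute score value, which is attractive when the weighting scheme gives no upper bound on the score, but it relies on the proportion of documents with good counterparts being stable from day to day.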

2.3 Visualization of Alignments

Alignments, and the impact of date normalization and thresholding on them, can be visualized as follows (see Figure 1): Pairs of aligned documents are plotted as dots in a plane spanned by two time axes. The x-axis represents all AP documents from 1988, sorted by date. The y-axis stands for the SDA documents of the same year, also in date order. The graph in the upper left corner shows alignments without either date normalization or thresholding. As would be expected, a lot of dots fall into the diagonal, as the dates of the documents making up the corresponding pair are close. However, a fair number of dots are scattered all over the plot. When a threshold is applied that filters out roughly two thirds of the pairs, the diagonal remains well represented. However, the dots outside the diagonal are still quite evenly distributed, even though the graph is sparser (upper right graph). The lower left graph shows the situation if date normalization is switched on. This graph is without thresholding. The dots are now drawn much closer to the diagonal. If date normalization and thresholding are combined, a thick diagonal emerges, and the dots get sparser the farther they are from the diagonal (lower right graph). This is the desired effect.

Fig. 1. Effects of date normalization and thresholding

The preceding observations lead to a possible optimization: as most pairs produced lie in a fairly narrow band around the diagonal, the alignment can be computed using a sliding "date window", which restricts the document search space drastically. Instead of searching for a match in the whole 3 years of SDA for every AP document, just documents inside a date window a few days wide are considered. This speeds things up dramatically, by up to two orders of magnitude. Searches take place on a series of overlapping small subcollections as the date window shifts over the collection. Experiments indicated that a window of 15 days is a good choice. If the collections are more similar, much narrower date windows are possible. Note that some desirable matches (background stories, periodic events) may be missed using this strategy. Graphically, this corresponds to eliminating all dots outside of a small band around the diagonal from the lower right graph in Figure 1.
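The restriction of the search space by a date window can be illustrated with a short sketch. It assumes that the window is centred on the date of the query document, so that a 15-day window covers seven days on either side; the text does not state how the window is positioned, and the function name and data layout are likewise hypothetical.

```python
from datetime import date, timedelta


def window_candidates(query_date, docs_by_date, window_days=15):
    """Return target documents dated inside a window around query_date.

    docs_by_date: dict mapping datetime.date to a list of document ids.
    The window is assumed to be centred on query_date.
    """
    half = window_days // 2
    candidates = []
    for offset in range(-half, half + 1):
        day = query_date + timedelta(days=offset)
        candidates.extend(docs_by_date.get(day, []))
    return candidates
```

Only the returned candidates need to be scored against the query document. With `window_days=1` the function returns same-day documents only.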

The alignment pairs produced are not reversible. Several English AP documents may be aligned to the same SDA text and not all SDA documents are members of a pair. This is a consequence of different amounts of coverage of the same events by the two news agencies, and therefore hard (and probably not desirable) to avoid. If alignments in the opposite direction are needed, the roles of the AP and SDA documents must be swapped, and the pairs recomputed.

Aligning the whole AP and SDA takes roughly a week using a single SPARCstation 5. Small changes, like eliminating debugging information, could easily reduce this time significantly. The alignment process is also parallelizable, with different computers working on different parts of the collections. And, most importantly, if date windows are used, old alignments could be kept without recalculation when the collections have new documents added. Aligning a new day’s worth of news stories would only take a few minutes this way.

2.4 Aligning More Similar Collections

Aligning the French and German SDA texts gives insights into possible simplifications when the two collections are more similar than English AP and German SDA. The SDA texts also have manually assigned classifiers that are language-independent. The classification is rather rough (some 300+ different classifiers, mostly just country codes), but nevertheless helpful.

As no wordlists like those used for the AP-SDA alignment could be found for German-French, the SDA-SDA alignment works without any lexicon. This would not have been possible, had the two collections been more different. This demonstrates, however, that it is possible to produce alignments without the use of any linguistic data resource, provided the collections are similar enough.

Beyond using the classifiers, the SDA-SDA alignment relies only on matching proper nouns (i.e. names) and numbers as indicators. This way a document matches a query if it shares at least one classifier and any proper noun/number. Identification of proper nouns works by looking at capitalization of words.
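The matching criterion for the SDA-SDA alignment can be sketched as follows. The tokenization, the document representation and the classifier value used in the usage note are assumptions; only the rule itself (at least one shared classifier and at least one shared proper noun or number, with proper nouns detected by capitalization) is taken from the text.

```python
import re


def extract_indicators(text):
    """Collect numbers and capitalized words as crude indicators."""
    tokens = re.findall(r"\w+", text)
    return {t for t in tokens if t[0].isupper() or t.isdigit()}


def is_candidate_match(query_doc, target_doc):
    """Check the lexicon-free matching rule for two documents.

    Each document is a dict with 'classifiers' (a set of codes) and
    'text' (a string); this layout is hypothetical.
    """
    shared_classifiers = query_doc["classifiers"] & target_doc["classifiers"]
    shared_indicators = (
        extract_indicators(query_doc["text"])
        & extract_indicators(target_doc["text"])
    )
    return bool(shared_classifiers) and bool(shared_indicators)
```

Applied to the two titles of Table 1, with both documents carrying the same (invented) classifier, the shared indicators are "Izmir" and "16", so the pair qualifies as a candidate. Note that capitalization is a rough heuristic: sentence-initial words and all German nouns are capitalized as well.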

The higher similarity of the two collections allows choosing a narrower date window than for SDA-AP alignment; experiments gave the best results for a window of only one day (i.e. same-day matches only). Thresholding didn’t give consistent improvements for SDA-SDA, probably because as many good as bad pairs were filtered. This demonstrates again the need for a better thresholding strategy.


3 Evaluation of Alignments

3.1 General-Purpose vs. Application-Specific Evaluation

Evaluating document-level alignments can have fundamentally different goals. On one hand, one can assess how many of the pairs produced consist of documents that are really related to each other (i.e. "good alignments"). This is a general point of view - the quality of the alignments is judged without any particular application in mind. An alternative is to evaluate the performance of a specific application that uses the alignments. In our case, this means using them for cross-language information retrieval. Concentrating on one application has the drawback of losing generality; however, the danger of "overtuning" the alignment process to theoretical goals that have no merit in practice is avoided. In this paper, both ideas are investigated. Quality assessment independent of a specific application will be discussed in the following, whereas an evaluation of the alignments for use in CLIR can be found in section 4.

3.2 Judging the Quality of Alignment Pairs

To evaluate the quality of the alignments independent of a specific application, a sample of the pairs was judged for "quality". This is similar to doing relevance assessments to evaluate retrieval runs. There exist however the following major differences:

• It is not immediately clear how to judge the "quality" of a pair. Defining the alignment task as "finding the most similar counterpart in the other collection" means that a human judge would have to read the entire collection to make sure no more similar text exists. This is clearly impractical. When the criterion is relaxed to "find a good (i.e. similar) counterpart in the other collection", looking at the two documents that form the pair is sufficient.
• It is also unclear how to quantify how good a match is. When doing relevance assessments for documents returned by a retrieval system, there is a comparatively short query against which a human judge can compare the documents. For assessing alignments, this "query" is a whole document, and much less focused. It seems harder to make a "yes/no" decision. As a consequence, a five-class scale was used (see Table 2).
• The human judge has to read through two documents per judgment instead of one as for relevance judgments, with the query changing for every pair. There is also no ranking available to form the sample, so the sample must be quite large in order to get reliable statistics. This means more work for the judge.


Table 2. Classes used for judgment of alignment pairs

Class 1  Same story          Two documents cover exactly the same story/event (e.g. the results of the same candidate in the same primary for the US presidential election)
Class 2  Related story       Two documents cover two related events (e.g. two different primaries for the US presidential election)
Class 3  Shared aspect       Two documents address various topics, but at least one of them is shared (e.g. update on US politics of the day, one item is about the upcoming presidential election)
Class 4  Common terminology  Two documents are not really related, but a significant amount of the terminology is shared (e.g. one document on the US, the other on the French presidential election)
Class 5  Unrelated           Two documents have no apparent relation (e.g. one document about the US presidential election, the other about vacation traffic in Germany)

A 1% random sample of the final (i.e. after thresholding) pairs of the AP-SDA alignment was judged with respect to these five classes. The results were as follows:

Table 3. Results from evaluation of a 1% random sample

1% sample: 852 out of 85125 pairs
Class 1:  327  38.38%
Class 2:  174  20.42%
Class 3:   16   1.88%
Class 4:  123  14.44%
Class 5:  212  24.88%

Which classes are considered to be good alignments will likely vary with the intended application. When used for extraction of linguistic information, like term associations or translations, a high quality of alignment is probably needed, so that maybe only pairs from classes 1 and 2 would be acceptable. The proportion of "success" of the alignment process would then be 58.8 ± 3.3% (95% confidence interval). This seems a bit low, so that AP and SDA are probably too dissimilar for such use. For application in a CLIR system, pairs from classes 1 through 4 are likely to help for extracting good terms. The proportion of success is then 75.1 ± 2.9%, and some very good results are obtained when the alignments are used for retrieval (details on this follow in the next section).
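The two success proportions and their intervals can be recomputed from the counts in Table 3. The sketch below assumes the usual normal approximation for a binomial proportion with z = 1.96; the text does not say which interval formula was used, but this one reproduces the reported figures.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Return a proportion and the half-width of its normal-approximation CI."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, half_width


# Judged pairs per class, taken from Table 3.
counts = {1: 327, 2: 174, 3: 16, 4: 123, 5: 212}
n = sum(counts.values())

p12, hw12 = proportion_ci(counts[1] + counts[2], n)
p14, hw14 = proportion_ci(sum(counts[c] for c in (1, 2, 3, 4)), n)

print(f"classes 1-2: {100 * p12:.1f} ± {100 * hw12:.1f}%")  # 58.8 ± 3.3%
print(f"classes 1-4: {100 * p14:.1f} ± {100 * hw14:.1f}%")  # 75.1 ± 2.9%
```

The counts are 501 of 852 pairs for classes 1 and 2, and 640 of 852 pairs for classes 1 through 4.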


4 Applications of Alignments for IR

4.1 Cross-Language Retrieval Using Document Alignments

We now describe how we use the alignments to build a system for multilingual information retrieval. Such a system is comparatively simple and very inexpensive, as very few linguistic resources were used for aligning the documents. It is also fairly independent of specific language combinations, as long as suitable collections for aligning can be found.

A surprisingly simple strategy was used that can be derived as follows: In case the two collections to be searched were parallel, i.e., real translations of each other, it would be possible just to search the collection that corresponds to the language of the query, and return a result list produced by replacing every found document by its counterpart in the other collection. This is of course not a very interesting case, as the collection to be searched is seldom available in translated form. We therefore replace the requirement for a parallel corpus with one for a comparable corpus. This requirement can be met in a lot of interesting real-world applications. One such example is the alignment of English AP and German SDA, allowing one to search the SDA texts in English.

First, the user’s query is run against the source collection, thus obtaining a ranked list. Instead of replacing the found documents by their translations, the document-level mapping produced by the alignment process is used to replace them with their most similar counterparts from the other collection, if available. This produces a new result list containing documents in the target language. Because a lot of documents are not part of an alignment pair, however, they would never be retrieved using this strategy.

This problem was approached by using a pseudo relevance feedback (in this case local feedback) (see e.g. [4], [14]) loop after the replacement step (but before a search in the target language takes place). A certain number of the highest ranked documents are assumed relevant and terms are extracted from these documents that are thought to represent them well. These terms form a query used for a new search. Because the documents are already in the target language, so is the query produced. This is fundamentally different from the approach in Ballesteros/Croft [1], where feedback is used before or after the translation. In our approach, feedback takes place on a set of aligned documents and is used to produce the translation itself. Unlike in usual applications of relevance feedback, the new terms cannot be combined with the original query because their languages don’t match. Only terms coming from the feedback process form the new query.
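The replacement step followed by local feedback can be sketched as below. The function name, the data structures and the parameters `top_n` and `num_terms` are hypothetical, and term selection by raw frequency is only a stand-in: the text does not specify how feedback terms are chosen or weighted.

```python
from collections import Counter


def cross_language_query(source_ranking, alignments, target_docs,
                         top_n=10, num_terms=20):
    """Build a target-language query from aligned counterparts.

    source_ranking: source document ids, best first (monolingual search).
    alignments: dict mapping a source document id to a target document id.
    target_docs: dict mapping a target document id to its list of tokens.
    """
    # Replacement step: map ranked source documents to their counterparts,
    # skipping documents that are not part of an alignment pair.
    aligned = [alignments[d] for d in source_ranking if d in alignments]

    # Local feedback: assume the highest-ranked counterparts are relevant.
    term_counts = Counter()
    for doc_id in aligned[:top_n]:
        term_counts.update(target_docs[doc_id])

    # Only feedback terms form the new query; the original query is dropped.
    return [term for term, _ in term_counts.most_common(num_terms)]
```

The returned terms are then run as an ordinary monolingual query against the target collection, so documents without any alignment pair can be retrieved as well.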

This simple strategy works surprisingly well for certain queries. It fails however if the initial query doesn’t retrieve any relevant documents. This can be amended by combining the strategy with a dictionary translation of the query. The same wordlist used for producing the alignments can be used for a crude word-by-word translation of the query. Such translations normally have problems, even if the dictionary is of high quality. Translations are often very ambiguous, and including many extraneous wrong translations hurts retrieval performance a lot. Another problem is missing entries from the dictionary, because the query term is inflected or because even the base form itself is absent.

Therefore, simple word-by-word replacement with all possible translations usually doesn’t perform well and is less appropriate than for computing alignments. Additional efforts are needed, like using part-of-speech, sense disambiguation, lexically correct word normalization etc. It is however easily possible to combine such a simple dictionary translation with the relevance feedback process described above. Instead of deleting the original query before feedback, it can be replaced with such a "pseudo-translation". Because the terms get reweighted in the feedback process, even the problem of assigning weights to the two sources of information (dictionary vs. alignments) is automatically taken care of. The alignments as additional source of information help lessen the mentioned translation problems.

In case the collection to be searched is not aligned, two independent aligned collections can be used to produce the query. The aligned collections are then only used for the transfer of the query into the target language, whereas the search takes place on the third collection.

4.2 Results on the TREC-6 CLIR Collection

Results for the strategies just described on the TREC-6 CLIR collection are presented in the following: Figure 2 shows a comparison of using alignments alone, using a dictionary pseudo-translation and then using both methods combined, i.e. doing initial retrieval using a dictionary translation, and then improving this translation using the alignments, as outlined above. The collection being searched is a combination of both German SDA and NZZ, and therefore a superset of the one that was aligned to English AP or French SDA. This makes the results directly comparable to the ones reported by participants of the TREC-6 CLIR task. The full topic statements were used for all runs, and the evaluation used relevance assessments for 21 queries. Some caution is appropriate with regard to the scope of the conclusions because this was the first year with a CLIR task at the TREC conference, and the size of the query set was rather small.

The left graph compares English-German CLIR using the alignments, the wordlist, or the combination of both. The combination gives by far the best result, improving on the dictionary-only run by 62% in terms of average precision. The combined run achieves not quite 60% of the monolingual baseline. This is a substantial drop, but because the baseline is very high (it outperforms the German monolingual runs reported for TREC-6), the result is still within 5% of the best TREC-6 English-German runs. This is an excellent result, as those runs used full machine translation of the documents or the queries.

The graph on the right compares the monolingual baseline with cross-language runs coming from French and English. The French-German run produces slightly better results than the English-German run. This is remarkable because of the much simpler alignment process. It shows that when very similar collections are available for alignments, a system without any lexicon-like data resource can be built. This French-German run outperforms all of the few TREC-6 runs reported for this language combination by a wide margin.

Fig. 2. Recall/precision graphs showing results of cross-language retrieval using document-level alignments
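The comparisons above are stated in terms of average precision. For readers unfamiliar with the measure, the following sketch computes non-interpolated average precision for a single query and the relative improvement between two runs. It assumes the variant commonly reported in TREC-style evaluations; the exact evaluation settings are not stated in this section, and the function names are illustrative.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Non-interpolated average precision for one query.

    Precision is measured at the rank of each relevant document that was
    retrieved; the sum is divided by the total number of relevant
    documents, so relevant documents that are never retrieved count as 0.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Average the per-query values; `runs` is a list of
    (ranked_doc_ids, relevant_ids) tuples, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def relative_improvement(new_score, base_score):
    """Relative change of new_score over base_score (0.5 means +50%)."""
    return (new_score - base_score) / base_score
```

With only 21 queries, a single query with an unusually high or low value can shift the mean noticeably, which is one reason for the caution expressed above.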

4.3 Merging the Results of Mono- and Cross-Language Searches

Merging the results of a mono-language search and of one or several cross-language search(es) is a non-trivial problem for two reasons. First, the Retrieval Status Values $\mathrm{RSV}_i(q, d_j)$ of the different search methods $i$ are on different scales. Second, the numbers of relevant documents in the different languages are unknown. It is possible that the same number of relevant documents exists in each language, but it is also possible that all relevant documents are in a single language.

To cope with the merging problem, we suggest linear transformations of the Retrieval Status Values. For instance, assume that the document $d_j$ has been retrieved by method $i$ because it is in language $i$; then its retrieval status value $\mathrm{RSV}_i(q, d_j)$ is mapped to a common scale $\mathrm{RSV}(q, d_j)$ in the following way:

$$\mathrm{RSV}(q, d_j) := a_i + b_i \cdot \mathrm{RSV}_i(q, d_j) \tag{1}$$

The parameters $a_i$ and $b_i$ are determined by means of aligned documents and a least-squares fit which minimizes the sum of the squared errors over the aligned pairs. For instance, assume that $d_j$ and $d_k$ were aligned because $d_j$ covers a story in language $h$ and $d_k$ covers the same or a similar story in language $i$. These two documents obtained the scores $\mathrm{RSV}_h(q, d_j)$ and $\mathrm{RSV}_i(q, d_k)$. Because they were aligned, they should be mapped to similar scores,


$$a_h + b_h \cdot \mathrm{RSV}_h(q, d_j) \approx a_i + b_i \cdot \mathrm{RSV}_i(q, d_k) \tag{2}$$

or in other words: the square of the difference

$$\Delta_{jk}^2 := \bigl(a_h + b_h \cdot \mathrm{RSV}_h(q, d_j) - a_i - b_i \cdot \mathrm{RSV}_i(q, d_k)\bigr)^2 \tag{3}$$

should be minimized, which is achieved by a least-squares fit. The advantage of this approach is that not only relevant but also irrelevant pairs of aligned documents are used for merging. Of course, non-aligned documents can also be mapped to the common scale using the mappings that were determined by means of the aligned pairs. Results of future experiments will show what percentage of documents has to be aligned in order to merge the search results in this way.
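The fitting and merging steps can be sketched as follows. One point needs an explicit assumption: if all four parameters in the squared difference are left free, the minimum is reached trivially by setting every parameter to zero. The sketch therefore fixes one language $h$ as the reference scale ($a_h = 0$, $b_h = 1$) and fits only $a_i$ and $b_i$ by ordinary least squares. This choice, the data layout and the function names are illustrative and not taken from the experiments.

```python
def fit_linear_map(aligned_scores):
    """Fit (a_i, b_i) so that a_i + b_i * rsv_i approximates rsv_h.

    aligned_scores: list of (rsv_h, rsv_i) tuples, one per aligned pair
    (d_j, d_k), where rsv_h is the score on the reference scale.
    Uses the closed-form solution of simple linear regression.
    """
    n = len(aligned_scores)
    if n < 2:
        raise ValueError("need at least two aligned pairs")
    mean_h = sum(h for h, _ in aligned_scores) / n
    mean_i = sum(i for _, i in aligned_scores) / n
    s_ii = sum((i - mean_i) ** 2 for _, i in aligned_scores)
    if s_ii == 0:
        raise ValueError("scores in language i are all identical")
    s_ih = sum((i - mean_i) * (h - mean_h) for h, i in aligned_scores)
    b = s_ih / s_ii
    a = mean_h - b * mean_i
    return a, b


def merge_results(result_lists, params):
    """Map every score to the common scale and sort into one list.

    result_lists: {language: [(doc_id, rsv), ...]}
    params:       {language: (a, b)}; the reference language uses (0, 1)
    Non-aligned documents are mapped with the same parameters.
    """
    merged = []
    for lang, results in result_lists.items():
        a, b = params[lang]
        for doc_id, rsv in results:
            merged.append((doc_id, lang, a + b * rsv))
    merged.sort(key=lambda entry: entry[2], reverse=True)
    return merged


# Toy usage: German is the reference scale, French scores are rescaled.
pairs = [(7.0, 10.0), (12.0, 20.0), (17.0, 30.0)]  # (rsv_de, rsv_fr)
a_fr, b_fr = fit_linear_map(pairs)
merged = merge_results(
    {"de": [("g1", 15.0), ("g2", 8.0)], "fr": [("f1", 30.0), ("f2", 4.0)]},
    {"de": (0.0, 1.0), "fr": (a_fr, b_fr)},
)
```

Because the fit uses all aligned pairs that received scores, not just relevant ones, it needs no relevance judgments; its quality depends on how many aligned pairs appear in the result lists.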

5 Summary and Outlook

We present a method for computing a document-level mapping between texts from different collections written in different languages and propose techniques for evaluating such a document-level alignment. The alignments are then used for building a cross-language information retrieval system, and the results of this system using the TREC-6 CLIR data are given. We also show how to use the alignments to extend the classical CLIR problem to a scenario where mono- and cross-language result lists are merged.

The alignment process is very modest in terms of the resources used; it needs neither expensive hardware nor costly high-quality linguistic resources. It should also be easy to adapt it for a dynamic environment where documents are constantly added to the collections. In such a case, thanks to the use of date windows, the alignments could be extended without the need to discard old pairs.
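How date windows would allow incremental extension can be sketched as follows. The sketch is hypothetical: the document representation, the Jaccard overlap used as similarity, the window size and the threshold are all placeholders and do not reproduce the alignment procedure or parameters used in this work.

```python
from datetime import date, timedelta


def jaccard(terms_a, terms_b):
    """Placeholder similarity: overlap of the two term sets."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def extend_alignments(alignments, new_docs, target_docs,
                      window_days=2, min_sim=0.2):
    """Add pairs for newly arrived documents only.

    alignments:  {source_id: target_id}, existing pairs are left untouched
    new_docs:    newly added source documents
    target_docs: documents of the other collection
    Each document is a dict with keys "id", "date" (datetime.date)
    and "terms" (list of strings).
    """
    window = timedelta(days=window_days)
    for doc in new_docs:
        # Only documents dated within the window are candidates, so old
        # parts of the collections never need to be revisited.
        candidates = [t for t in target_docs
                      if abs(t["date"] - doc["date"]) <= window]
        best = max(candidates,
                   key=lambda t: jaccard(doc["terms"], t["terms"]),
                   default=None)
        if best is not None and jaccard(doc["terms"], best["terms"]) >= min_sim:
            alignments[doc["id"]] = best["id"]
    return alignments
```

Since a new document is compared only against documents with nearby dates, the cost of an update depends on the window size rather than on the full collection size.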

Use of the alignments for CLIR gives excellent results, proving their value for real-world applications. Applications for alignments other than CLIR, such as automatic dictionary extraction, thesaurus generation and others, are possible for the future. The question of how well the findings apply to a range of different collections remains open; however, the fact that AP and SDA are quite dissimilar gives hope that a lot of data can be aligned. The methods shown should also be fairly easily adaptable to other language pairs, as long as some conditions can be met (e.g. similar collections or wordlist available).

There is much room for improvement in the alignment process. Indications are that it is crucial to extract the right pieces of the documents, such as by filtering out terms with high and low frequency. This idea could be carried further by using the similarity of the best matching passage between two documents, or by using a summarization step and then comparing the summaries. First experiments with fixed-length passages showed a prohibitive increase in computational complexity, though.
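Filtering out terms with high and low frequency can be illustrated with a short sketch. The thresholds `min_df` and `max_df_ratio` are hypothetical values chosen for the example, not the settings used in the alignment experiments.

```python
from collections import Counter


def filter_terms(docs, min_df=2, max_df_ratio=0.5):
    """Drop very rare and very common terms from every document.

    docs: {doc_id: [term, ...]}
    A term is kept if it occurs in at least `min_df` documents and in at
    most `max_df_ratio` of all documents.
    """
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))  # count each term once per document
    n_docs = len(docs)
    keep = {term for term, df in doc_freq.items()
            if df >= min_df and df / n_docs <= max_df_ratio}
    return {doc_id: [t for t in terms if t in keep]
            for doc_id, terms in docs.items()}


docs = {
    "d1": ["the", "zürich", "xyzzy"],
    "d2": ["the", "zürich"],
    "d3": ["the", "bern"],
    "d4": ["the", "bern"],
}
filtered = filter_terms(docs)
```

In the toy data, "the" is removed because it occurs in every document and "xyzzy" because it occurs in only one, leaving the mid-frequency terms that are more likely to characterize a story.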


Perhaps the most interesting issue is the integration of more or better linguistic resources. The alignment process should take advantage of such resources while still remaining usable if they are unavailable. With more linguistic tools hopefully becoming affordable, there should be considerable potential in such enhancements.

6 Acknowledgments

Thanks go to the whole NLPIR group at NIST for help, especially Donna Harman and Paul Over for numerous valuable discussions of the experiments presented. Work leading to some aspects of these experiments began earlier at the information retrieval group at ETH; thanks go to Páraic Sheridan.

References

1. Ballesteros, L., Croft, B. W.: Phrasal Translation and Query Expansion Techniques for Cross-Language Information Retrieval. In: Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1997) pages 84-91

2. Crestani, F., van Rijsbergen, C. J.: Information Retrieval by Logical Imaging. In: Journal of Documentation (1994)

3. Gale, W. A., Church, K. W.: A Program for Aligning Sentences in Bilingual Corpora. In: Computational Linguistics, 19(1) (1993) 75-102. Special Issue on Using Large Corpora I.

4. Harman, D. K.: Relevance Feedback and Other Query Modification Techniques. In: Frakes, W. B., Baeza-Yates, R.: Information Retrieval, Data Structures & Algorithms, Prentice-Hall (1992) pages 241-261

5. Landauer, T. K., Littman, M. L.: Fully Automatic Cross-Language Document Retrieval using Latent Semantic Indexing. In: Proceedings of the Sixth Annual Conference of the UW Centre for the New Oxford English Dictionary and Text Research (1990) pages 31-38

6. Oard, D. W.: Cross-Language Text Retrieval Research in the USA. Presented at: 3rd ERCIM DELOS Workshop, Zurich, Switzerland (1997). Available from http://www.clis.umd.edu/dlrg/filter/papers/delos.ps.

7. Oard, D. W., Hackett, P.: Document Translation for Cross-Language Text Retrieval at the University of Maryland. To be published in: Proceedings of the Sixth Text Retrieval Conference (TREC-6) (to appear). Available from http://trec.nist.gov/pubs/trec6/papers/umd.ps.

8. Qiu, Y.: Automatic Query Expansion Based on A Similarity Thesaurus. PhD Thesis, Swiss Federal Institute of Technology (ETH), Zurich, Switzerland (1995)

9. See http://www.sda-ats.ch

10. Schäuble, P.: Multimedia Information Retrieval. Kluwer Academic Publishers (1997)

11. Schäuble, P., Sheridan, P.: Cross-Language Information Retrieval (CLIR) Track Overview. To be published in: Proceedings of the Sixth Text Retrieval Conference (TREC-6) (to appear)

12. Sheridan, P., Ballerini, J.-P.: Experiments in Multilingual Information Retrieval using the SPIDER system. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1996) pages 58-65


13. Voorhees, E. M., Harman, D. K.: Overview of the Sixth Text Retrieval Conference (TREC-6). To be published in: Proceedings of the Sixth Text Retrieval Conference (TREC-6) (to appear)

14. Xu, J., Croft, B. W.: Query Expansion Using Local and Global Document Analysis. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1996) pages 4-11

Appendix

Used Resources

All alignment experiments were carried out by modifying components of PRISE, a public domain information retrieval system developed at the National Institute of Standards and Technology (NIST). The experiments used the text collections from the cross-language task of the TREC-6 conference ([11], [13]). These collections (see Table 4) consist of documents in English, German and French. Although both the French and some of the German texts are from SDA, they are not direct translations of each other. They are written independently by different people and produced by different editorial offices [9]. While AP and SDA are both news wires, NZZ is a newspaper and differs both in style and in the dates covered. As it is important that the collections used for alignment are as similar as possible, the NZZ texts did not seem to be as well suited for alignment as the rest of the collections. They were therefore excluded from the alignment process. They were, however, used for the CLIR experiments.

Table 4. Details for the document collections used

| Collection | Language | Dates Covered | Size (MB) | # Docs |
|---|---|---|---|---|
| AP (Associated Press) | English | mid-Feb 1988 to end of 1990 | 741 | 242,918 |
| SDA (Schweizerische Depeschenagentur) | German | Jan 1988 to end of 1990 | 332 | 185,099 |
| NZZ (Neue Zürcher Zeitung) | German | Jan 1994 to end of 1994 | 194 | 66,741 |
| SDA (Schweizerische Depeschenagentur) | French | Jan 1988 to end of 1990 | 252 | 141,656 |