A light way to collect comparable corpora from the Web

Ahmet Aker*, Evangelos Kanoulas†, Robert Gaizauskas*

*Department of Computer Science, University of Sheffield, Sheffield, UK
†Google Research, Zurich, Switzerland

Email: [email protected], [email protected], [email protected]

Abstract
Statistical Machine Translation (SMT) relies on the availability of rich parallel corpora. However, in the case of under-resourced languages, parallel corpora are not readily available. To overcome this problem, previous work has recognized the potential of using comparable corpora as training data. The process of obtaining such data usually involves (1) downloading a separate list of documents for each language, (2) matching the documents between two languages, usually by comparing the document contents, and finally (3) extracting useful data for SMT from the matched document pairs. This process requires a large amount of time and resources, since a huge volume of documents needs to be downloaded to increase the chances of finding good document pairs. In this work we aim to reduce the amount of time and resources spent on tasks 1 and 2. Instead of obtaining full documents, we first obtain just titles along with some meta-data such as time and date of publication. Titles can be obtained through Web search and RSS News feed collections, so that downloading the full documents is not needed. We show experimentally that comparing titles approximates the comparison of documents based on their full contents.

Keywords: Comparable Corpora, News Search, Statistical Machine Translation

1. Introduction

Statistical Machine Translation (SMT) relies on the availability of rich parallel resources. However, parallel resources are often not readily available for under-resourced languages or specific narrow domains. This leads to under-performing machine translation systems. To overcome the low availability of parallel resources, the machine translation community has recognized the potential of using comparable resources as training data (Rapp, 1999; Munteanu and Marcu, 2002; Sharoff et al., 2006; Munteanu and Marcu, 2006; Kumano et al., 2007; Barzilay and McKeown, 2001; Kauchak and Barzilay, 2006; Callison-Burch et al., 2006; Nakov, 2008; Zhao et al., 2008; Marton et al., 2009).

A critical first problem with such an approach is actually identifying and gathering corpora with the potential for improving SMT systems. Attempts at gathering comparable corpora from the Web have been made (Braschler, 1998; Resnik, 1999; Huang et al., 2010; Talvensaari et al., 2008). The process of obtaining such corpora involves (1) downloading a separate set of documents for each language, (2) matching documents between the two languages by comparing document contents, and finally (3) extracting useful units for SMT from the matched document pairs by applying approaches such as that described in Munteanu and Marcu (2006). In this work we focus on steps 1 and 2. For step 1, past studies (see Section 2) commonly use a cross-language information retrieval (CLIR) approach or some rules of thumb, where source and target language document collections are gathered by monolingual crawling. Step 2 is performed using weak translation methods and search to pair target language documents with those in the source collection. These steps are resource intensive and time consuming, requiring huge amounts of disk space and fast computing machines.

In this work we aim to reduce the amount of time and resources required by steps 1 and 2. We collect comparable corpora from the Web by focusing only on news articles. When collecting comparable corpora we rely only on the titles of the documents and do not use the article contents. Using only the title for comparison clearly saves time and computation, since we obtain the titles through Google News Search and RSS News feed collections, so that downloading the full document contents prior to determining comparability is not needed. In this paper we show experimentally that titles can be used to successfully match documents instead of using the full document contents. The full content of the matched pairs can then be downloaded.

Some news stories run for some time and the initial report gets developed in follow-up articles. Many follow-up articles use a title similar to the earlier one(s), although the content of the follow-up news starts diverging from the original report. For this reason we also combine the title with the publishing date and time, which are also available through Google News Search and RSS News feeds, to investigate their contribution to the quality of the document pairs.

Through our study we build a framework for obtaining comparable corpora for various language pairs, such as English-German, -Greek, -Croatian, -Estonian, -Latvian, -Romanian, -Lithuanian and -Slovenian. In this paper we focus on English-German and English-Greek, the latter being an under-resourced language. We evaluate the collected corpora by asking human assessors to assess the level of their comparability.

2. Related Work

Constructing comparable corpora has been investigated in earlier studies. Braschler (1998) uses existing news document collections in English-German and English-French and investigates different ways to align the most potentially useful pairs across monolingual collections, using proper nouns, numbers, date similarity and content-bearing words. Munteanu et al. (2004) and Munteanu and Marcu (2005) use dictionaries trained on initial parallel data to create comparable corpora. For every source language document, the top five dictionary-based translations of every word are used to create a query to search all documents in the target language. This search is limited to articles published within 5 days of the source text, and only the top 20 ranked articles are returned and paired with the source text. More recently, Huang et al. (2010) describe methods for obtaining comparable corpora for English-Chinese documents using Cross-Lingual Information Retrieval (CLIR) techniques. Monolingual documents are crawled from manually selected websites and later paired using CLIR. Another CLIR approach is described by Talvensaari et al. (2008). The authors obtain comparable corpora for English, Spanish and German. They manually select a set of topic words and then use them in a monolingual crawler. After the monolingual collections are downloaded, CLIR is used to pair the documents across languages based on word co-occurrences in the source and target documents. Pouliquen et al. (2004) focus on clustering news articles. The authors first cluster news articles published on the same day monolingually, using content-bearing words and named entities. Then they map the news articles within the same cluster to a multilingual thesaurus from which a list of concept terms is extracted (each concept term has unique translations in 22 different languages). The thesaurus concept terms are then used to link news clusters written in different languages. Similar to Pouliquen et al., the work by Montalvo et al. (2006) also makes use of named entities to pair documents.

In summary, in all cases the full document content is required to make judgements about the comparability level of two documents. However, to do this a huge number of documents in different languages must first be downloaded from the web and preprocessed. Downloading documents and preprocessing them is time and resource intensive. In addition, only a small portion of these downloaded documents will be included in the comparable corpora. The vast majority of the documents will be discarded.

We aim to overcome this problem and create comparable corpora by using just document or article titles. Obtaining such titles is less resource and time consuming. We focus on news articles and thus make use of the fact that titles of news articles are a good indicator of the content of the document. In a text summarization framework, Edmundson (1969) scored sentences which contain words from the title higher than sentences which do not include these terms. The motivation behind this was that writers re-use words from the title in the subsequent sentences when they write their articles. Lopez et al. (2011) analyzed 300 titles of news articles and showed that 66% of the title words occur in the articles. We therefore use titles as representatives of document contents and use them to judge the comparability level of document pairs.

3. Collecting News Titles

In order to collect comparable corpora we first collect news article titles through Google News Search and RSS News feeds. We only download current news articles and do not search for articles in news archives or on the entire Web. Searching in a bigger space causes more noise in the pairing process (Section 3.1) than when the focus is only on current news.
This title collection should be of high recall, i.e. contain as many potentially useful titles as possible. Thus, during this initial process of collecting our “working material” we only care about recall and ignore precision (the proportion of the collected titles that are actually comparable with each other). After collecting these general titles, we apply different heuristics to pair them. To collect the title corpora we adopt the following process:

1. We first collect initial corpora of titles from news articles monolingually using Google News. For each language we iteratively download titles of news in different topic categories, such as economics, world, politics, etc. We set the iteration time to 15 minutes. Apart from the title, for each search result we also have information about the date and time of publication, the URL of the actual article and another URL that is used by Google News to show all related articles about the same topic in a cluster – we refer to this URL as the cluster URL. We refer to the titles obtained in this first step as the initial corpora (Figure 1).

2. We make use of the Google News clustering of news articles that are found to be similar to each other: for each title in our initial corpora we collect the titles of articles that are clustered with it. More precisely, we follow the cluster URL and download the first 30 articles from the cluster. We refer to these corpora as news corpora 1. Clearly, by following these two steps one can collect as many titles as one wants, spanning a period of time. In our case this period was a week. In this way we always download the current news; this means that if our method runs for, e.g., one week, the first downloaded news article will be one week old.

3. We then use the titles from the initial corpora and news corpora 1 as queries and perform a monolingual Google News search. We extract the titles from the search results and these constitute news corpora 2. When performing this search we restrict the date of the search to a maximum of one week from the moment the search is performed. Furthermore, we collect news corpora 2 in parallel with the initial and news 1 corpora. As shown in Figure 1, we run these processes for a week.

Figure 1: Crawl Steps.

4. Next, we further expand the collection of article titles to include news corpora 3. For this, we take the article titles from the initial corpora, news corpora 1 and news corpora 2 for the English collection only. We parse them for named entities such as person, location and organization names.1 For each named entity type we do the following: we translate the entities into the language in which the search will be performed (using Google Translate) and perform a Google News Search using the translated entity as a query. The search is restricted to a maximum of one week prior to the publication date of the article.

5. Finally, Google News does not support all languages equally. Languages such as German or Greek are well supported by Google News, i.e. articles of different news agencies are preprocessed and listed by Google News. However, this is not the case for languages such as Latvian, Lithuanian, Estonian, etc. Due to this fact there is a data scarcity problem for those languages. To overcome this problem we also manually identify a good number of RSS News feeds for each language, from which we extract similar information as from the Google News Search (a minimal sketch of this feed-based collection follows the list).
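As an illustration of how light this collection step is, the following is a minimal sketch of harvesting titles, publication timestamps and article URLs from an RSS feed with the Python feedparser library. The feed URL, output format and polling strategy are placeholder assumptions, not details taken from the paper.

```python
import csv
import time

import feedparser  # third-party RSS/Atom parser

# Placeholder feed URL; in practice a list of feeds is identified per language.
FEED_URL = "http://example.com/news/rss"

def collect_titles(feed_url, out_path):
    """Store title, publication date/time and article URL for every feed entry.
    Only this light metadata is kept; article bodies are not downloaded."""
    feed = feedparser.parse(feed_url)
    with open(out_path, "a", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        for entry in feed.entries:
            title = entry.get("title", "").strip()
            link = entry.get("link", "")
            published = entry.get("published_parsed")  # time.struct_time or None
            stamp = time.strftime("%Y-%m-%d %H:%M", published) if published else ""
            if title:
                writer.writerow([title, stamp, link])

if __name__ == "__main__":
    # Poll the feed periodically; the paper uses a 15-minute interval for Google News.
    collect_titles(FEED_URL, "titles.en.tsv")
```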

3.1. Document Alignment

In the alignment phase the goal is to match the article titles from the collected corpora and to download the actual article contents only for the matching pairs, in order to obtain a comparable corpus.
Matching news by title similarity (TS) is performed by computing the cosine similarity across the titles' term frequency vectors. Thus, each title pair is scored between 0 and 1. Before computing the cosine measure we also ensure that both titles (after removing stop words) have at least 5 content words each. We have experimentally observed that a news title with at least 5 content words best represents the actual document content. We translate the foreign title into English using Google Translate. We also combine TS with the following heuristics to investigate their impact on the quality of the produced pairs (a sketch of these scoring functions is given after the list below):

1 For named entity parsing we use OpenNLP tools: http://incubator.apache.org/opennlp/

• HS: Each article title pair is scored by 1/(h+1), where h is the time difference in hours, with h ∈ [0, ..., 23]. Articles published within the same hour get a score of 1. If the time difference is greater than 23 hours then HS is set to 0.

• DS: We score each article title pair by the publishing date difference between the two articles (1/(d+1), where d is the date difference, with d ∈ [0, ..., 7]). Articles published on the same date get a score of 1. We set DS to 0 when the publishing date difference is greater than 7 days.

• TLD: We score each article title pair by 1/(w+1), where w is the difference in content word count (starting from 0). Article titles with the same length get a score of 1.
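For concreteness, here is a minimal sketch of the scores defined above, assuming the foreign title has already been translated into English (the paper uses Google Translate for this step). The tokenizer and stop-word list are simplified placeholders, not the actual preprocessing used by the authors.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "on", "at", "to", "and", "for"}  # placeholder list

def content_words(title):
    """Lower-cased tokens of a title with stop words removed."""
    return [w for w in re.findall(r"\w+", title.lower()) if w not in STOP_WORDS]

def ts(title_en, title_foreign_translated):
    """TS: cosine similarity over the term-frequency vectors of the two titles.
    Returns None when either title has fewer than 5 content words, since such
    pairs are discarded before scoring."""
    tokens_a = content_words(title_en)
    tokens_b = content_words(title_foreign_translated)
    if len(tokens_a) < 5 or len(tokens_b) < 5:
        return None
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def hs(hours_apart):
    """HS: 1/(h+1) for a time difference of h in [0, 23] hours, otherwise 0."""
    return 1.0 / (hours_apart + 1) if 0 <= hours_apart <= 23 else 0.0

def ds(days_apart):
    """DS: 1/(d+1) for a publishing date difference of d in [0, 7] days, otherwise 0."""
    return 1.0 / (days_apart + 1) if 0 <= days_apart <= 7 else 0.0

def tld(word_count_a, word_count_b):
    """TLD: 1/(w+1), where w is the difference in content-word count."""
    return 1.0 / (abs(word_count_a - word_count_b) + 1)
```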

4. Evaluation

We create different combinations of the heuristics and evaluate the quality of the results. We use a linear combination of the heuristics with equal weights (a short sketch of this combination is given after Tables 1 and 2 below). Each combination produces a ranked list of article title pairs. The following list summarizes the different heuristic combinations:

• TS: Title cosine similarity.

• TS HS: Title cosine similarity and time difference.

• TS DS: Title cosine similarity and date difference.

• TS TLD: Title cosine similarity and title length difference.

• TS TLD HS: Title cosine similarity, title length difference and time difference.


• TS TLD DS: Title cosine similarity, title length difference and date difference.

Table 1: Ranking correlation between the different heuristic combinations for the English-German pairs.

          TS     TS HS   TS DS   TS TLD   TS TLD HS   TS TLD DS
  TS      –      1       0.94    1        0.99        0.73
  CS      0.23   0.19    0.15    0.09     0.16        0.17

Table 2: Ranking correlation between the different heuristic combinations for the English-Greek pairs.

          TS     TS HS   TS DS   TS TLD   TS TLD HS   TS TLD DS
  TS      –      1       0.82    1        0.95        0.78
  CS      0.11   0.21    0.14    0.18     0.25        0.25
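A minimal sketch of this equal-weight combination, assuming the per-pair heuristic scores (computed as in the earlier sketch) are already available in plain dictionaries; the names and data layout are illustrative only.

```python
def combined_score(pair_scores, components):
    """Equal-weight linear combination of the selected heuristic scores."""
    return sum(pair_scores[c] for c in components) / len(components)

def rank_pairs(candidates, components):
    """Rank candidate title pairs by one heuristic combination, e.g. ["TS", "TLD", "HS"].
    `candidates` maps a pair identifier to its dictionary of component scores."""
    return sorted(candidates,
                  key=lambda pair_id: combined_score(candidates[pair_id], components),
                  reverse=True)

# The six ranked lists of Section 4 correspond to these combinations:
COMBINATIONS = {
    "TS": ["TS"],
    "TS HS": ["TS", "HS"],
    "TS DS": ["TS", "DS"],
    "TS TLD": ["TS", "TLD"],
    "TS TLD HS": ["TS", "TLD", "HS"],
    "TS TLD DS": ["TS", "TLD", "DS"],
}
```

Since the weights are equal, ranking by the average is equivalent to ranking by the sum of the component scores.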

We perform a ranking comparison between the different ranked lists of title pairs, and a human assessment of the aligned articles.

4.1. Evaluation: Ranking Order

We compare the quality of the pairs produced by the different heuristic combinations with the ones obtained when the article content is used. To compute the content similarity, we first consider the union of the top 1K pairs of titles ranked by each of the six aforementioned methods. The maximum number of pairs in the union is 6K. Following the corresponding URLs, we download the content (text) of the article pairs and compute the cosine similarity over term frequency vectors of the entire articles. We use an HTML parser2 to extract text from the HTML documents. Before comparing the article contents, each foreign article text is translated into English using Google Translate. The comparison of article texts produces another ranked list of article pairs, which we refer to as the CS list.
We compared the rankings produced by the different similarity heuristics using Kendall's τ. Kendall's τ values close to 1 reflect rankings very similar to each other, while values close to 0 reflect independent rankings. The results are shown in Tables 1 and 2. As one can observe, for both German and Greek the results in the first row show that the rankings produced by the different heuristic combinations correlate very highly with the original title similarity. Thus, date, time and title length do not dramatically change the matching process. On the other hand, the correlation between CS and the other heuristic combinations is rather low, as shown in the second row of both Tables 1 and 2. Thus, using the title (along with other meta-data) does not produce the same matches as using the entire article. The next step is to investigate how humans judge the different rankings produced in the two cases (title similarity and meta-data versus content similarity).
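The ranking comparison itself can be computed with SciPy's implementation of Kendall's τ. Below is a minimal sketch, assuming each ranking is simply a list of title-pair identifiers ordered from best to worst; the function name and data layout are illustrative, not taken from the original system.

```python
from scipy.stats import kendalltau

def rank_correlation(ranking_a, ranking_b):
    """Kendall's tau between two ranked lists over the same set of title pairs.
    Pairs appearing in only one of the lists are ignored."""
    common = set(ranking_a) & set(ranking_b)
    pos_a = {pair: rank for rank, pair in enumerate(ranking_a)}
    pos_b = {pair: rank for rank, pair in enumerate(ranking_b)}
    items = list(common)
    tau, _p_value = kendalltau([pos_a[p] for p in items], [pos_b[p] for p in items])
    return tau
```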

4.2. Evaluation: Human Judgment

In the human evaluation we asked assessors to judge the comparability of each aligned document pair. We use the five comparability classes proposed by Braschler (1998): same story, related story, shared aspect, common terminology and unrelated, to judge each document pair manually.

2 Boilerpipe – http://code.google.com/p/boilerpipe/ – is used to extract the textual content from the URL.

Table 3: English-German document pair evaluation results. Results of both assessors are taken together. The numbers are percentage values.

               same story   related story   shared aspect   common terminology   unrelated
  TS           74           24              2               0                    0
  TS HS        88           12              0               0                    0
  TS DS        76           18              6               0                    0
  TS TLD       74           24              2               0                    0
  TS TLD HS    86           12              2               0                    0
  TS TLD DS    72           22              6               0                    0
  CS           75           21              4               0                    0

Table 4: English-Greek document pair evaluation results. Results of four assessors are taken together. The numbers are percentage values.

               same story   related story   shared aspect   common terminology   unrelated
  TS           50           12              24              7                    7
  TS HS        56           15              20              5                    4
  TS DS        62           8               30              0                    0
  TS TLD       50           8               25              11                   6
  TS TLD HS    70           8               20              2                    0
  TS TLD DS    42           18              32              8                    0
  CS           29           19              32              6                    14

We hypothesize that if two news articles are about the “same story” then it is more likely that they contain useful fragments for SMT than if they are “unrelated”. The document contents were shown to the assessors side by side. The design of the assessment implementation is shown in Figure 2.
We employed a “pooling” approach similar to the one used in TREC3 and ImageCLEF4, and constructed a depth-30 pool by considering the union of the top 30 document pairs coming from each of the approaches under consideration: TS, TS HS, TS DS, TS TLD, TS TLD HS, TS TLD DS and CS (a small sketch of this pooling step is given below).
The document pairs in the pools for the two languages were shown to two native German and eight native Greek speakers respectively. All judges were also fluent in English. For German, each participant judged all the pairs in the German pool. In the case of the Greek experiment, each quarter of the pool was shown to two different assessors.
From the results shown in Tables 3 and 4 we can see that the documents aligned with the title and meta-data information are mainly judged as being “same story” and “related story”. For English-German the best performance is achieved when the title similarity is combined with the publishing time (TS HS). In the case of the English-Greek pairs we can see that HS also plays an important role.
The reason for the positive impact of HS may be that it reflects the way news events emerge. Two news articles published very close to each other in time are likely to report the same news event in the same way. However, over time a news event develops and changes, so any new report about it will differ from the first reports. Although the new reports are also about the same general event, their contents differ from the first reports and become reports of related stories or reports which share only some aspects with the first ones.

3 http://trec.nist.gov
4 http://www.imageclef.org
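A minimal sketch of the depth-30 pooling step described above, assuming each method's output is a ranked list of document-pair identifiers (e.g. URL pairs); the function and variable names are illustrative.

```python
def build_pool(ranked_lists, depth=30):
    """Union of the top-`depth` document pairs from every method's ranked list.
    Duplicate pairs are merged so that each pair is judged only once."""
    pool = set()
    for method, ranked_pairs in ranked_lists.items():
        pool.update(ranked_pairs[:depth])
    return pool

# ranked_lists is assumed to map method names
# ("TS", "TS HS", ..., "CS") to their ranked lists of article-pair URLs.
```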


Figure 2: Evaluation Tool.

This is supported by the results shown in Tables 3 and 4, where we see that any combination of heuristics without HS has a higher “shared aspect” proportion than the combinations with HS. The heuristic DS is also meant to capture news articles about the same story. However, since DS uses day-level differences in scoring, it can only achieve performance similar to HS for stories which do not develop very quickly.
For English-Greek we get the best results when TS and HS are combined with TLD (TS TLD HS) – note that adding TLD to TS HS in English-German leads to results almost as good as those obtained with TS HS only. In general, the heuristic TLD also plays a role in the title method. It ensures that titles with no length difference are scored higher than ones which vary a lot in length. We computed the average title length for each language.5 The English titles contain on average 6.8 content words, the German titles 6.5 and the Greek titles 5.8. These figures show that the English and Greek titles differ from each other more than the English and German ones. We think that this may explain why TLD has more impact on the English-Greek results than on the English-German ones.
In the ranking results shown in Tables 1 and 2 we see that there is no correlation between the ranked list of article pairs produced by CS and the article rankings of the other heuristics. However, the results shown in Tables 3 and 4 give a different picture. In the case of the English-German pairs the title similarity heuristics perform as well as or better than the CS measure, while for the English-Greek pairs the title similarity heuristics perform significantly better than the CS method. However, note that this comparison is not entirely fair, since CS is tested on data pre-selected using the other heuristics. A non-biased selection of data could lead to different CS performance. We plan to address this in our future work. Finally, we also think that the poor performance of the CS method for English-Greek is due at least in part to the performance of the machine translation system. For German the machine translation system is much better than for Greek, which is an under-resourced language, and this difference may well influence the results significantly.

5 Titles which have fewer than five content words are not taken into consideration.

5. Conclusion

In this work we described a framework for collecting comparable corpora from the web. To construct comparable corpora we start with news titles written in different languages, pair the titles and download only the corresponding article contents if the titles are comparable. To measure the comparability of two titles we investigated different heuristics. We showed that the best heuristics are TS, HS and TLD when used in combination.
Our technique is a promising and resource-light way of collecting comparable corpora likely to be of use for SMT, though further work needs to be done to confirm this. For future work we plan to extract fragments such as parallel sentences or phrases from our comparable corpora and investigate their impact on machine translation quality. Currently, we constrain titles to have at least five content words. By doing this we discard 50% of the Greek and 23% of the German articles and prevent them from being paired. To increase the recall of our method we aim to reduce these numbers by investigating further ideas for pairing.

Acknowledgements
This work has been funded by the ACCURAT project, European Union Seventh Framework Programme (FP7/2007-2013), grant agreement no 248347, and by the European Union Seventh Framework Programme with grant agreement FP7-PEOPLE-2009-IIF-254562.

6. References

R. Barzilay and K. R. McKeown. 2001. Extracting paraphrases from a parallel corpus. In ACL '01: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pages 50–57, Morristown, NJ, USA. Association for Computational Linguistics.

P.S. Braschler. 1998. Multilingual Information Retrieval Based on Document Alignment Techniques. In Research and Advanced Technology for Digital Libraries: Second European Conference, ECDL'98, Heraklion, Crete, Greece, September 21-23, 1998: Proceedings, page 183. Springer Verlag.

C. Callison-Burch, P. Koehn, and M. Osborne. 2006. Improved statistical machine translation using paraphrases. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 17–24, Morristown, NJ, USA. Association for Computational Linguistics.

H. P. Edmundson. 1969. New Methods in Automatic Extracting. Journal of the Association for Computing Machinery, 16:264–285.

D. Huang, L. Zhao, L. Li, and H. Yu. 2010. Mining large-scale comparable corpora from Chinese-English news collections. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 472–480. Association for Computational Linguistics.

D. Kauchak and R. Barzilay. 2006. Paraphrasing for automatic evaluation. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 455–462, Morristown, NJ, USA. Association for Computational Linguistics.

T. Kumano, H. Tanaka, and T. Tokunaga. 2007. Extracting phrasal alignments from comparable corpora by using joint probability SMT model. In Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-07), pages 95–103.

C. Lopez, V. Prince, and M. Roche. 2011. Automatic Titling of Articles Using Position and Statistical Information. RANLP 2011.

Y. Marton, C. Callison-Burch, and P. Resnik. 2009. Improved statistical machine translation using monolingually-derived paraphrases. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 381–390. Association for Computational Linguistics.

S. Montalvo, R. Martinez, A. Casillas, and V. Fresno. 2006. Multilingual document clustering: an heuristic approach based on cognate named entities. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 1145–1152. Association for Computational Linguistics.

D. S. Munteanu and D. Marcu. 2002. Processing comparable corpora with bilingual suffix trees. In EMNLP '02: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pages 289–295, Morristown, NJ, USA. Association for Computational Linguistics.

D. S. Munteanu and D. Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4):477–504.

D. S. Munteanu and D. Marcu. 2006. Extracting parallel sub-sentential fragments from non-parallel corpora. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 81–88, Morristown, NJ, USA. Association for Computational Linguistics.

D. S. Munteanu, A. Fraser, and D. Marcu. 2004. Improved machine translation performance via parallel sentence extraction from comparable corpora. In HLT-NAACL, pages 265–272.

P. Nakov. 2008. Paraphrasing verbs for noun compound interpretation. In Proceedings of the Workshop on Multiword Expressions, LREC-2008.

B. Pouliquen, R. Steinberger, C. Ignat, E. Käsper, and I. Temnikova. 2004. Multilingual and cross-lingual news topic tracking. In Proceedings of the 20th International Conference on Computational Linguistics, pages 959–es. Association for Computational Linguistics.

R. Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 519–526. Association for Computational Linguistics.

P. Resnik. 1999. Mining the web for bilingual text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 527–534. Association for Computational Linguistics.

S. Sharoff, B. Babych, and A. Hartley. 2006. Using comparable corpora to solve problems difficult for human translators. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 739–746, Morristown, NJ, USA. Association for Computational Linguistics.

T. Talvensaari, A. Pirkola, K. Järvelin, M. Juhola, and J. Laurikkala. 2008. Focused web crawling in the acquisition of comparable corpora. Information Retrieval, 11(5):427–445.

S. Zhao, C. Niu, M. Zhou, T. Liu, and S. Li. 2008. Combining multiple resources to improve SMT-based paraphrasing model. In Proceedings of ACL-08: HLT, pages 1021–1029, Columbus, Ohio, June. Association for Computational Linguistics.
