Automatic Construction of English-Vietnamese Parallel Corpus through Web Mining

Van B. Dang, Bao-Quoc Ho
Faculty of Information Technology, University of Natural Sciences, Ho Chi Minh City, Vietnam
dbvan@fit.hcmuns.edu.vn, hbquoc@fit.hcmuns.edu.vn

Abstract-Parallel corpora have become an essential resource for multilingual natural language processing, and a large amount of parallel text is available on the internet these days. In this paper, we propose a simple but reliable method to construct an English-Vietnamese parallel corpus through web mining. Our system can automatically download and detect parallel web pages on a given domain to construct a parallel corpus that is well aligned at paragraph level and consists of completely clean texts. The proposed technique can easily be applied to other language pairs. Experiments have been carried out and show promising results.

Keywords-parallel corpus; web mining; information retrieval

I. INTRODUCTION

Large-scale parallel corpora have become an essential resource for multilingual natural language processing. They play an important role in Cross-Language Information Retrieval [2, 7, 8], provide the principal training data for statistical translation models [3, 4], and serve as resources for automatic lexical acquisition and enrichment [5, 6] as well as for grammar induction [1]. However, they are not available in sufficient quantities and tend to be accessible only in specialized forms such as United Nations proceedings [10] and localized versions of software manuals [11]. Even for the major languages, the available parallel corpora are not only small but also unbalanced [9]. To our knowledge, no free-of-charge English-Vietnamese parallel corpus of considerable size exists so far. Such a corpus is therefore a critical resource for processing texts in this language pair.

The Internet is certainly a good place to search for parallel texts. Nevertheless, collecting them is not a trivial task, because the sheer size of the network makes the process very labor intensive. Researchers have therefore designed several systems to automate this construction process. PTMiner [14] first searches for hosts containing parallel texts and then crawls each host. Candidate pairs are found based on several external and internal features; for example, a URL like ".../english/.../file.html" is more likely to be an English page and to match ".../chinese/.../file.html". The outstanding feature of PTMiner is its ability to effectively reject false pairs before downloading them. Though Chen and Nie [14] reported a good result, an English-Chinese corpus with 90% precision, this approach can hardly be applied in our case because most of the sites containing English-Vietnamese parallel texts encode little or none of the information described in [14].

BITS [13] provides a different approach. All pages from a specified domain are crawled exhaustively. Their language is determined by a language detector, and all possible combinations of these pages (a full cross product) have to be examined to find matches. In this proposal, a bilingual dictionary is used to perform matching at word level between parallel documents. This approach is easy to understand yet very time-consuming.

STRAND [15, 9] has a similar approach to PTMiner [14], except that it handles the case where URL matching requires multiple substitutions. Structural filtering with a tuning parameter optimized by machine learning allows it to avoid examining all possible combinations as BITS [13] does. In [9], Resnik also proposes a content-based matching method as in [13], but similarity is measured in a different way. Experiments have shown very promising results. However, there are very few good-quality English-Vietnamese parallel texts, and the most promising source is news websites such as VOA News [16]. On some news sites, such as BBC [17], the parallel texts are written in different styles and can hardly be regarded as a parallel corpus. Consequently, the structural filtering in [9] cannot be applied, since all pages from a news website share the same structure.

In this paper, we propose a content-based matching approach for English-Vietnamese parallel texts collected using web mining, mainly from news sites, to construct an English-Vietnamese parallel corpus. Firstly, we perform a host crawl on the specified domain: all pages of the desired language pair are downloaded from that domain. Our system tries to extract the language of a page from its URL if possible; otherwise, a simple language detector is required. Secondly, we define some rules to quickly reject false combinations and create the set of candidate pairs. Finally, the similarity of each candidate pair is calculated using an English-Vietnamese dictionary to determine whether it is a match. The output of our system is an English-Vietnamese parallel corpus that is well aligned at paragraph level and consists of completely clean texts. The system architecture is illustrated in Figure 1, and a sample pair of English-Vietnamese parallel documents from the output corpus can be found in Figure 4.

1-4244-0695-1/07/$25.00 ©2007 IEEE.

[IEEE 2007 IEEE International Conference on Research, Innovation and Vision for the Future - Hanoi, Vietnam (2007.03.5-2007.03.9)] 2007 IEEE International Conference on Research, Innovation


Our approach is straightforward, fully automatic and easy to port to other language pairs.

[Figure 1: flow diagram of the mining system. Domain, host crawling (separating by language), create candidate pairs by rules, content-based matching, English-Vietnamese parallel corpus.]
Fig. 1. System architecture.

II. HOST CRAWLING

Starting from a given URL on a specified host, our system crawls the host for all pages that are supposed to be in either English or Vietnamese. For easy maintenance, webmasters usually keep parallel pages in different directories named after the language. For example, ".../English/...", ".../en/..." and ".../en_file.html" are more likely to be in English, and the same phenomenon is observed for Vietnamese pages. Therefore, we create a list of patterns so that the system can reject, without downloading them, pages in languages other than English and Vietnamese. This list can easily be modified for our system to work on other language pairs. We did not re-implement the host crawler but used GNU Wget [18] instead, since it supports full-featured recursion and, above all, is designed to work with configurable parameters, including our list of patterns. The host crawling process is visualized in Figure 2.

[Figure 2: the crawler separates the pages of the domain into Vietnamese pages, English pages and rejected pages.]
Fig. 2. The host crawling process. Downloaded pages are separated by their languages.

According to our observation, language is the only useful information that can be extracted from the URLs. Finding candidate pairs (as stated in [14]), however, can only be done by looking at the content. For instance, http://www.voanews.com/english/archive/2006-09/2006-09-09-voa14.cfm and http://www.voanews.com/vietnamese/archive/2006-09/2006-09-09-voa12.cfm are actually a pair.

III. FINDING LIST OF CANDIDATE PAIRS

As mentioned above, whether two pages are a pair or not can only be known after content-based similarity has been measured. Nevertheless, calculating the similarity of all possible combinations between the two sets of downloaded web pages, as proposed in [13], is too exhaustive, whereas rules may be applied to quickly filter out the apparently false combinations.

The STRAND system [9, 15] aligns parallel documents according to their HTML structures. In practice, there are several pairs that have quite different HTML structures [14], and there are also a large number of pages that do not match any other page at all yet share the same HTML structure. The latter happens particularly on news sites; VOA News [16] is a typical example.

Chen and Nie [14] propose a method to recognize candidate pairs by comparing path and filename similarity. This approach does not perform well on English-Vietnamese parallel texts either, because most of the sites we are aware of have URLs containing no information except for their language (some URLs do not even encode this information).

The two criteria that our experiments recommend are as follows.

A. Created date

Naturally, parallel pages show created dates close to each other, since people tend to create the translated version of a page right after it is created. Consequently, this distance will be rather short, especially for news sites, as news always needs to be up to date. Depending on the main language of the site, we set a threshold on the created date for pages in the other language. For example, VOA News [16] is mainly in English, and Vietnamese news is one of its global features. As a result, translated Vietnamese pages are created after their original English versions, and a distance of one day after can be a suitable threshold for Vietnamese pages. It is worth noting that news sites are the most promising source of English-Vietnamese parallel texts we are aware of so far, so this is a principal criterion in our system.

B. Sentence length

Using text length as a criterion to filter false combinations appears to be an effective approach [14]. However, a word with $n$ chunks in language $L_1$ (e.g., English) may be translated into different words or phrases (different meanings in different contexts) with $m$ chunks in language $L_2$ (e.g., Vietnamese). A chunk is a sequence of non-space characters in the text; for instance, "Information Retrieval" has two chunks: "Information" and "Retrieval".
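To make the two criteria concrete, the following is a minimal sketch, not the authors' implementation. It counts chunks by splitting on whitespace, computes the ratio $m/n$ for one English-Vietnamese pair of text spans, and checks the one-day created-date window. The function names are hypothetical. The default bounds 1 and 1.8 are the values the paper reports for the ratio; treating both bounds as inclusive, and applying the ratio to a single pair of spans rather than aggregating over a page, are assumptions of this sketch.

```python
from datetime import date, timedelta


def count_chunks(text: str) -> int:
    """Count chunks, i.e. maximal runs of non-space characters."""
    return len(text.split())


def chunk_ratio(english: str, vietnamese: str) -> float:
    """Return m/n: Vietnamese chunk count over English chunk count."""
    n = count_chunks(english)
    if n == 0:
        raise ValueError("English text contains no chunks")
    return count_chunks(vietnamese) / n


def within_date_window(en_created: date, vi_created: date, max_days: int = 1) -> bool:
    """True if the Vietnamese page was created 0..max_days days after the English one."""
    return timedelta(0) <= vi_created - en_created <= timedelta(days=max_days)


def is_candidate(en_text: str, vi_text: str, en_created: date, vi_created: date,
                 alpha: float = 1.0, beta: float = 1.8) -> bool:
    """Apply both criteria; inclusive bounds are an assumption of this sketch."""
    return (within_date_window(en_created, vi_created)
            and alpha <= chunk_ratio(en_text, vi_text) <= beta)


print(count_chunks("Information Retrieval"))  # 2
```

Checking the date first keeps the cheaper test in front, so the text of a page only needs to be inspected for pairs that survive the date window.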


Our experiments have shown that text length is a less effective feature for English-Vietnamese parallel texts. On the other hand, the ratio $m/n$ is rather stable. Hence, we employ the sentence length instead (the number of chunks of a sentence), and $\alpha = 1$, $\beta = 1.8$ have proved to be reliable lower and upper bounds of the ratio $m/n$, respectively. Details are given in the Experiment section. Before the sentence length filter is applied, we first remove all HTML tags and extract only the content section of the page.

After filtering, we have a limited list of Vietnamese pages to scan for each English page: only those pages that satisfy the ratio filter and have a created date no later than one day after that of the English version.

IV. CONTENT-BASED MATCHING

Two documents that are actually parallel will contain some token pairs that are translations of each other. Such token pairs are known as translational token pairs [13]. For instance, in "China marks 30th anniversary of Mao Zedong's Death" and "Trung Quốc âm thầm lặng lẽ tưởng niệm Mao Trạch Đông", "China" and "Trung Quốc", "anniversary" and "tưởng niệm", and "Mao" and "Mao" are translational token pairs.

A simple method to calculate the similarity of a pair of documents is to scan them for translational token pairs and then use the number of translation pairs found as the value of the similarity. However, a pair in which the positions of the two tokens are too far from each other is rarely a correct translation pair. The approach in [13] uses a distance threshold to reject such false pairs. It performs an exhaustive search over the full cross product of the two lists of tokens extracted from each pair of documents, and the process is rather time consuming. We prefer an approach that does not need to search over all possible translation pairs. Our experiments have shown that there is a reliable threshold $\theta_d$ for deciding whether a pair of documents is a translation pair or not. Thus, for each English document, we only need to scan Vietnamese documents until one pair whose similarity exceeds $\theta_d$ is found. The similarity of a pair of documents $(A, B)$ is defined as:

$$\mathrm{sim}(A, B) = \frac{2N}{\sum_{A,B} \text{number of tokens}}$$

where $N$ is the number of token pairs found between $A$ and $B$.

The distance threshold in [13] is similar to techniques used in the alignment problem. At this moment, our aim is to retrieve as many parallel documents as possible rather than to align them, since more precise alignment can be done later on. As a result, we use another solution to filter out distant translation pairs. Careful examination shows that most of the good-quality parallel texts, especially from VOA News [16], are almost paragraph aligned. The difference in position between a good pair of paragraphs varies from -1 to 1 (see Figure 3), and such a good pair usually contains more than 3 translational token pairs (details are given in the Experiment section). Therefore, we calculate the number of translation pairs $N$ between two documents from the numbers of translation pairs $n_k$ between paragraphs (the similarity between two paragraphs). Each paragraph $p_{e,k}$ in an English document is compared to its 3 neighbor paragraphs $p_{v,k-1}$, $p_{v,k}$ and $p_{v,k+1}$, and the one with the maximum number of translation pairs $n_k$ forms, together with $p_{e,k}$, a translation pair of paragraphs. The algorithm can be found in Table 1. Only those pairs with $n_k > \theta_p$ ($\theta_p = 3$) are considered correct pairs, and their $n_k$ is added to $N$:

$$N = \sum_{k} n_k, \quad n_k > \theta_p$$

TABLE 1. ALGORITHM OF FINDING TRANSLATION PARAGRAPH PAIRS

    For each paragraph p_{e,k} in English document d_e
        Tokenize p_{e,k} with Porter stemming algorithm
        S_max = 0
        For each of 3 paragraphs p_{v,j}, j in {k-1, k, k+1}, in Vietnamese document d_v
            S_j = Sim(p_{e,k}, p_{v,j})
            If (S_j > S_max)
                S_max = S_j
            End If
        End For
    End For

Our system uses an English-Vietnamese dictionary of about 10,000 entries of stemmed words. Translational token pairs are found by first stemming the English words that are not in SMART's English stoplist with the Porter algorithm [12] and then looking them up in the dictionary for all possible Vietnamese words, for better recall.

V. EXPERIMENT

In most of the experiments we have studied, researchers evaluated their systems with thousands of parallel web pages. However, we are aware of very few English-Vietnamese parallel websites with good translation quality. Among them are VOA News [16] and BBC [17]. On BBC, however, the news in the different versions is written in different styles. For example, http://news.bbc.co.uk/2/hi/middle_east/6071258.stm and http://www.bbc.co.uk/vietnamese/worldnews/story/2006/10/061021_bushgenerals.shtml report the same news but in completely different ways, and even their details are not the same. It is very difficult to say whether a translation is good or not, since this is a matter of human judgment. One may say that the pages at the two URLs above are a translational document pair because they report the same news. Therefore, in this context, we specify that a translational document pair is a pair of English-Vietnamese documents describing the same thing with similar facts and in the same (or a slightly different) order. As a result, we test our system mainly on the VOA News site.
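The paragraph-level matching of Section IV can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: English paragraphs are assumed to be already tokenized, stop-word filtered and stemmed; the dictionary maps an English stem to a set of Vietnamese words or phrases; a translation counts as found when the phrase occurs in the Vietnamese paragraph on chunk boundaries; and the denominator of the similarity counts English tokens plus Vietnamese chunks. The names `token_pairs` and `document_similarity` are hypothetical.

```python
from typing import Dict, List, Set

Dictionary = Dict[str, Set[str]]  # English stem -> candidate Vietnamese words or phrases


def token_pairs(en_tokens: List[str], vi_text: str, dictionary: Dictionary) -> int:
    """Count English tokens that have at least one translation in the Vietnamese text."""
    padded = f" {' '.join(vi_text.lower().split())} "
    count = 0
    for token in en_tokens:
        if any(f" {cand} " in padded for cand in dictionary.get(token, ())):
            count += 1
    return count


def document_similarity(en_paras: List[List[str]], vi_paras: List[str],
                        dictionary: Dictionary, theta_p: int = 3) -> float:
    """sim(A, B) = 2N / (total tokens), with N summed over paragraphs whose n_k > theta_p."""
    total_pairs = 0
    for k, en_tokens in enumerate(en_paras):
        best = 0
        # Compare paragraph k with its three nearest Vietnamese neighbours.
        for j in (k - 1, k, k + 1):
            if 0 <= j < len(vi_paras):
                best = max(best, token_pairs(en_tokens, vi_paras[j], dictionary))
        if best > theta_p:
            total_pairs += best
    n_tokens = sum(len(p) for p in en_paras) + sum(len(p.split()) for p in vi_paras)
    return 2 * total_pairs / n_tokens if n_tokens else 0.0


toy_dictionary = {"china": {"trung quoc"}, "anniversary": {"tuong niem"}, "mao": {"mao"}}
en = [["china", "marks", "anniversary", "mao"]]
vi = ["trung quoc am tham lang le tuong niem mao trach dong"]
print(token_pairs(en[0], vi[0], toy_dictionary))  # 3
```

A document pair would then be accepted when `document_similarity(...)` exceeds the document threshold $\theta_d$. The toy example uses unaccented Vietnamese only to keep the string comparison simple.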


It is unnecessary to use a language identifier for VOA News because the language can be extracted from the URLs. The host crawler gives us two separate sets of documents in English and Vietnamese. We downloaded approximately 45,000 pages in English and 25,000 pages in Vietnamese, dated from 2003 to 2006, from the VOA archive (non-news pages such as index or category pages are not considered). It is therefore impossible to scan the full cross product of all possible combinations, and filtering by created date helps to limit this list. After being filtered by created date, the pairs go through the HTML tag removal process and are then filtered by sentence length ratio. We tested the sentence length ratio filter on 829 English pages and 437 Vietnamese pages posted in March 2006, with $\alpha = 1$ and $\beta = 1.8$ as lower and upper bounds of the ratio, respectively; the average filter rate is approximately 74%. We ran our system without sentence length ratio filtering on the set of rejected pairs and found no parallel documents. In addition, 100 of these rejected pairs were examined by hand and all proved to be correctly rejected. However, false rejections appeared when the values of $\alpha$ and $\beta$ were changed. There are about 7045 possible pairs to be filtered. With $\beta = 1.5$, the rejection rate rises to 82%, yet the system also filters out 4 actual translation pairs, so the error rate increases. A similar phenomenon has been observed for $\alpha$. More details can be found in Table 2.

TABLE 2. SENTENCE LENGTH REJECTION STATISTICS.

    alpha   beta   Rejection Rate   False Rejection   Error Rate
    1       1.2    92%              9                 9×10^-4
    1       1.5    82%              4                 5×10^-4
    1       1.8    74%              0                 0
    1       2.5    65%              0                 0
    1       3      62%              0                 0
    0.5     2      41%              0                 0
    1.1     2      77%              4                 5×10^-4

We also tested text length filtering as proposed in [14] on the same set. The text length ratio between a pair of English-Vietnamese documents is less effective for filtering, since the best filtering rate without errors that we observed is 70%. Therefore, we decided to use $\alpha = 1$, $\beta = 1.8$ (slightly different values such as $\alpha = 1.05$, $\beta = 1.85$ give the same result) as thresholds for the sentence length ratio in our system.

Like any retrieval system, a parallel-page mining system can be evaluated through precision and recall. It is not possible for us to measure recall on the entire VOA News site, so we need to create a smaller set for measuring recall. The test set is constructed from 100 translation pairs of pages, identified by human evaluation, and 300 unmatched pairs of pages from the same website. From these 100 pairs, 25 were chosen arbitrarily, along with 25 unmatched pairs, as the examination set. For the evaluation of content-based matching, we first ran the algorithm without the threshold $\theta_p$, which means that any number of translational token pairs $n_k$ found between two paragraphs contributes to the total number of translation pairs $N$ between the two documents. Then $\theta_d = 0.112$ was manually selected, since it gives the best result on the examination set, with 70% precision and 80% recall. Further examination showed the cause of this low precision: a pair of truly matched paragraphs is more likely to have more than three translational token pairs, whereas pairs of paragraphs with two or three translation pairs are more likely to be false matches, and most of the errors occur in these cases. Adding any number of translational token pairs found between paragraphs to $N$ brings the similarity nearer to $\theta_d$, which results in many false matches and thus low precision. When the threshold $\theta_p = 3$ was used and $\theta_d$ was changed to 0.104, this pair $(\theta_p, \theta_d)$ gave a perfect performance of 100% for both precision and recall on the examination set. This setting was then applied to the test set, and the result is still quite promising, with 98% precision and 96% recall. We also evaluated our system with different values of $(\theta_p, \theta_d)$; the results are given in Table 3 and Table 4.

TABLE 3. SYSTEM PERFORMANCE AT $\theta_d = 0.104$. $\theta_p = 3$ IS SUPPOSED TO BE A GOOD CHOICE.

    theta_p   Precision   Recall
    1         68%         95%
    2         93%         96%
    3         98%         96%
    4         98%         94%

TABLE 4. SYSTEM PERFORMANCE AT $\theta_p = 3$. $\theta_d = 0.104$ YIELDS THE BEST RESULT.

    theta_d   Found   Correct   Precision   Recall
    0.095     130     68        52%         68%
    0.1       104     80        77%         80%
    0.104     98      96        98%         96%
    0.11      91      87        96%         87%

As mentioned, parallel documents are almost aligned at paragraph level, with a maximum distance in paragraph position of one (as shown in Figure 3). We also experimented with a value of two for this distance (examining $p_k$ of $d_e$ against $p_{k-2}$, $p_{k-1}$, $p_k$, $p_{k+1}$, $p_{k+2}$ of $d_v$) and observed similar results, but the whole process lasted much longer. As a result, we keep examining each paragraph in English documents against its three nearest neighbors in Vietnamese documents.

After all parameters had been set, we ran our system on the VOA News site. We downloaded about 45,000 pages in English and 25,000 pages in Vietnamese, dated from 2003 to 2006, from the VOA archive. The search for English-Vietnamese parallel documents lasted about 3 days on 4 Pentium 4 2.0 GHz computers. A precision of 91% has been observed and a collection of 1060 English-Vietnamese parallel documents has been retrieved.
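As a small worked check, the precision and recall columns of Table 4 can be recomputed from the "Found" and "Correct" counts, taking the 100 translation pairs of the test set as the number of relevant pairs. The helper name `precision_recall` is hypothetical; the counts are those printed in Table 4.

```python
from typing import Tuple


def precision_recall(found: int, correct: int, relevant: int) -> Tuple[float, float]:
    """Precision = correct / found, recall = correct / relevant."""
    precision = correct / found if found else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall


# theta_d -> (found, correct), as reported in Table 4
TABLE_4 = {0.095: (130, 68), 0.1: (104, 80), 0.104: (98, 96), 0.11: (91, 87)}

for theta_d, (found, correct) in TABLE_4.items():
    p, r = precision_recall(found, correct, relevant=100)
    print(f"theta_d={theta_d}: precision={p:.0%}, recall={r:.0%}")
```

Rounded to whole percentages, the computed values agree with the table, for example 96/98 gives 98% precision and 96/100 gives 96% recall at a threshold of 0.104.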


English document:
[P1] US TO EASE BAN ON LIQUIDS, GELS ON AIRPLANES
[P2] TSA SECURITY OFFICIAL HOLDS BAG OF LIQUIDS AND GELS NOW PERMISSIBLE IN CABIN
[P3] THE UNITED STATES IS EASING ITS BAN AGAINST CARRYING LIQUIDS AND GELS ONTO PLANES AT U.S. AIRPORTS
[P4] A TOP OFFICIAL OF THE TRANSPORTATION SECURITY ADMINISTRATION, KIP HAWLEY, SAYS THAT STARTING TUESDAY, TRAVELERS WILL BE ALLOWED TO TAKE SMALL AMOUNTS - 35 GRAMS OR LESS - OF LIQUIDS, AEROSOLS AND GELS INTO THE PASSENGER CABIN
[P5] HE SAID THE ITEMS MUST BE PLACED IN A ONE-LITER-SIZED, CLEAR, ZIP-TOP BAG. HE SAID PASSENGERS WILL ALSO BE ALLOWED TO CARRY ON BEVERAGES PURCHASED IN SECURE BOARDING AREAS.

Vietnamese document:
[P1] MỸ NỚI LỎNG LỆNH CẤM MANG CHẤT LỎNG VÀ CÁC LOẠI KEM LÊN MÁY BAY
[P2] HOA KỲ ĐANG NỚI LỎNG LỆNH CẤM MANG CHẤT LỎNG VÀ CÁC LOẠI KEM Ở THỂ ĐẶC HƠN LÊN PHI CƠ TẠI CÁC PHI TRƯỜNG MỸ
[P3] MỘT GIỚI CHỨC HÀNG ĐẦU CỦA CƠ QUAN AN NINH GIAO THÔNG, ÔNG KIP HAWLEY, NÓI RẰNG KHỞI SỰ TỪ NGÀY THỨ BA, HÀNH KHÁCH ĐÁP PHI CƠ SẼ ĐƯỢC MANG THEO LÊN PHI CƠ MỘT LƯỢNG ÍT CHẤT LỎNG, HAY CÁC LOẠI KEM KHÔNG QUÁ 35 GRAM
[P4] ÔNG HAWLEY NÓI NHỮNG MÓN NÀY PHẢI ĐƯỢC ĐỰNG TRONG MỘT TÚI TRONG, KÉO KÍN, VÀ CHỈ CHỨA MỘT LÍT. ÔNG NÓI THÊM RẰNG HÀNH KHÁCH CŨNG SẼ ĐƯỢC PHÉP MANG THEO NHỮNG THỨC UỐNG ĐÃ MUA TẠI CÁC KHU AN TOÀN, NHƯ PHÒNG ĐỢI CỦA HÀNH KHÁCH TRƯỚC KHI LÊN PHI CƠ

Fig. 3. A sample paragraph matching by examining the 3 nearest neighbors. Paragraphs P1 and P2 of the English document match P1 and P2 of the Vietnamese document, respectively. Paragraphs P4 and P5 of the English document, however, match P3 and P4 of the Vietnamese document.

Ma and Liberman [13] reported 99.1% precision and 97.1% recall on a hand-picked set of 600 documents. Due to the different characteristics of each pair of languages, it is very difficult to compare parallel corpus mining systems. For the evaluation, we replaced our content-based matching algorithm with theirs and tested the modified system on English-Vietnamese parallel documents. We observed a similar value of precision, but it took about 7 days to complete. This is to be expected, because their algorithm needs to examine more possible combinations than ours, which stops immediately when the first Vietnamese document whose similarity to the English document being examined exceeds $\theta_d$ is found. PTMiner [14] takes advantage of the similarity of the URLs and STRAND [15] uses HTML structure filtering; since these features rarely exist in the English-Vietnamese parallel pages we have found, we cannot compare our system with these two.

VI. CONCLUSION AND FUTURE WORK

Parallel corpora are considered a crucial resource these days because they have so many applications. Language pairs such as English-French and English-Chinese have rich sources of parallel websites available on the internet, and much work has been done on them with good results. However, this is not the case for English-Vietnamese parallel corpora.

We provide a simple but reliable method to construct an English-Vietnamese parallel corpus based on web mining and content-based matching. GNU Wget [18] was used to download pages in English and Vietnamese from the VOA News site [16]. All possible combinations of pages in these two sets are then filtered by their created date and sentence length ratio. HTML tags are removed before the sentence length ratio filtering but after the created date filtering. For each pair of documents, each paragraph of the English document is compared to its 3 nearest neighbor paragraphs in the Vietnamese document to search for the best matched paragraph. A bilingual English-Vietnamese dictionary with approximately 10,000 entries is used to find translational token pairs between paragraphs. The number of token pairs found between truly matched paragraphs is added to the total number of translational token pairs between the corresponding pair of documents. This number is then thresholded to determine whether the documents are a match or not. Experiments have shown very promising results. Our approach is easy to port to other language pairs by changing the bilingual dictionary and the filtering rules, since these are language dependent.

The process of constructing an English-Vietnamese parallel corpus based on web mining does not consider what domain a document belongs to, whereas the parallel corpus used in a translation system is highly domain specific. In order to solve this problem, we must integrate a domain detection function into our system so that it can extract the portion of the corpus for a given domain.

In the process of removing the HTML tags of a web page, our system keeps just the main text of the page, which results in clean texts and a well-aligned corpus. Nevertheless, the HTML structure may be completely different on different websites. Our current technique is not yet robust enough to work on every website. We need to employ a more intelligent extraction method for it to be robust.


The current system only searches for parallel pages that are good translations of each other. However, this source is rather restricted for English-Vietnamese pairs. We should use other equations to represent the similarity of parallel documents with slightly worse translation quality in order to enrich our English-Vietnamese corpus.

[P1] ISRAELI TROOPS KILL 5 MORE PALESTINIANS IN GAZA
[P2] AN ISRAELI HELICOPTER STRIKE HAS KILLED TWO PALESTINIAN TEENAGERS IN THE NORTHERN GAZA STRIP, AS THE MILITARY CONTINUES A MAJOR OFFENSIVE TO TRY TO STOP MILITANTS FROM FIRING ROCKETS INTO NEARBY JEWISH SETTLEMENTS
[P3] RESIDENTS OF THE JABALYA REFUGEE CAMP SAY ONE OF THE TEENS WAS A MILITANT.
[P4] ISRAEL'S MILITARY SAYS IT FIRED ON A GROUP OF GUNMEN WHO WERE TRYING TO PLANT A BOMB
[P5] MEANWHILE, A PALESTINIAN BOY DIED FRIDAY FROM INJURIES SUSTAINED WHEN AN ISRAELI TANK FIRED ON THE REFUGEE CAMP LAST WEEK. A 10-YEAR-OLD GIRL WAS KILLED BY ISRAELI GUNFIRE IN THE SAME AREA TODAY
[P6] IN A SEPARATE INCIDENT, OFFICIALS SAY PALESTINIAN MILITANTS SHOT AND KILLED A PALESTINIAN WORKING ON A FARM IN A JEWISH SETTLEMENT IN SOUTHERN GAZA
[P7] MORE THAN 50 PALESTINIANS AND THREE ISRAELIS HAVE BEEN KILLED SINCE THE GAZA OFFENSIVE BEGAN LAST WEEK

[P1] MÁY BAY TRỰC THĂNG ISRAEL BẮN CHẾT 2 THIẾU NIÊN PALESTINE TẠI DẢI GAZA
[P2] MỘT MÁY BAY TRỰC THĂNG CỦA ISRAEL ĐÃ BẮN CHẾT 2 THIẾU NIÊN PALESTINE TẠI MIỀN BẮC DẢI GAZA KHI QUÂN ĐỘI TIẾP TỤC CUỘC HÀNH QUÂN LỚN ĐỂ NGĂN CHẶN CÁC PHẦN TỬ TRANH ĐẤU BẮN ROCKET VÀO CÁC KHU ĐỊNH CƯ DO THÁI
[P3] CƯ DÂN TẠI TRẠI TỊ NẠN JABALYA NÓI RẰNG MỘT TRONG 2 THIẾU NIÊN VỪA KỂ LÀ MỘT PHẦN TỬ TRANH ĐẤU.
[P4] QUÂN ĐỘI ISRAEL NÓI RẰNG HỌ BẮN VÀO MỘT NHÓM PHẦN TỬ VÕ TRANG ĐANG TÌM CÁCH GÀI BOM
[P5] TRONG KHI ĐÓ MỘT BÉ TRAI PALESTINE TỪ TRẦN NGÀY HÔM NAY VÌ VẾT THƯƠNG DO MỘT XE TĂNG ISRAEL BẮN VÀO TRẠI TỊ NẠN HỒI TUẦN TRƯỚC
[P6] HÔM NAY, MỘT BÉ GÁI BỊ THIỆT MẠNG NGAY VÌ TRÚNG ĐẠN CỦA ISRAEL TRONG CÙNG KHU VỰC NÀY
[P7] TRONG MỘT DIỄN BIẾN KHÁC CÁC GIỚI CHỨC NÓI RẰNG CÁC PHẦN TỬ TRANH ĐẤU PALESTINE ĐÃ BẮN CHẾT MỘT NGƯỜI PALESTINE LÀM VIỆC TẠI MỘT NÔNG TRẠI TRONG MỘT KHU ĐỊNH CƯ DO THÁI TẠI MIỀN NAM DẢI GAZA
[P8] HƠN 30 NGƯỜI PALESTINE VÀ 3 NGƯỜI ISRAEL ĐÃ THIỆT MẠNG KỂ TỪ KHI CUỘC HÀNH QUÂN CỦA ISRAEL BẮT ĐẦU HỒI TUẦN TRƯỚC

Fig. 4. A sample English-Vietnamese parallel document in the output corpus

REFERENCES

[1] Kuhn, Jonas. "Experiments in parallel-text based grammar induction," in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 2004, pp. 470-477.
[2] Martin Volk, Spela Vintar, Paul Buitelaar. "Ontologies in Cross-Language Information Retrieval," Wissensmanagement 2003: 43-50.
[3] Melamed, I. Dan. "Models of translational equivalence among words," Computational Linguistics, 26(2):221-249, June, 2000.
[4] Rong Jin, Joyce Y. Chai. "Study of cross lingual information retrieval using on-line translation systems," Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, 2005.
[5] Gale, William A., Kenneth W. Church. "Identifying word correspondences in parallel texts," Fourth DARPA Workshop on Speech and Natural Language, Asilomar, California, Feb 1991.
[6] Dominic Widdows, Beate Dorow, Chiu-Ki Chan. "Using Parallel Corpora to enrich Multilingual Lexical Resources," Third International Conference on Language Resources and Evaluation (LREC 3), Las Palmas, May 2002, pages 240-245.
[7] J.-Y. Nie, M. Simard, P. Isabelle, and R. Durand. "Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web," Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 74-81, 1999.
[8] Hiroshi Masuichi, Raymond Flournoy, Stefan Kaufmann, and Stanley Peters. "Query Translation Method for Cross Language Information Retrieval," in Proceedings of the Workshop on Machine Translation for Cross Language Information Retrieval, MT Summit VII, pp. 30-34, Singapore, September 1999.
[9] P. Resnik and N. A. Smith. "The Web as a Parallel Corpus," Computational Linguistics, 2003, 29(3):349-380.
[10] http://www.dc.upenn.edu
[11] Resnik Philip and I. Dan Melamed. "Semi-automatic acquisition of domain-specific translation lexicons," Fifth Conference on Applied Natural Language Processing, Washington, D.C., 1997.
[12] C.J. van Rijsbergen, S.E. Robertson and M.F. Porter. "New models in probabilistic information retrieval," London: British Library, 1980. (British Library Research and Development Report, no. 5587).
[13] Ma Xiaoyi, Mark Liberman. "BITS: A method for bilingual text search over the web," Machine Translation Summit VII, September, 1999.
[14] J. Chen, J.Y. Nie. "Automatic construction of parallel English-Chinese corpus for cross-language information retrieval," Proc. ANLP, pp. 21-28, Seattle, 2000.
[15] Resnik Philip. "Parallel strands: A preliminary investigation into mining the Web for bilingual text," in Proceedings of the Third Conference of the Association for Machine Translation in the Americas, AMTA-98, in Lecture Notes in Artificial Intelligence, 1529, Langhorne, PA, October 28-31.
[16] www.voanews.com
[17] www.bbc.co.uk
[18] http://www.gnu.org/software/wget/
