Hindawi Publishing Corporation, The Scientific World Journal, Volume 2014, Article ID 438106, 13 pages. http://dx.doi.org/10.1155/2014/438106


Research Article

A Relationship: Word Alignment, Phrase Table, and Translation Quality

Liang Tian, Derek F. Wong, Lidia S. Chao, and Francisco Oliveira

Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory, Department of Computer and Information Science, University of Macau, Macau

Correspondence should be addressed to Liang Tian; [email protected]

Received 30 August 2013; Accepted 10 March 2014; Published 16 April 2014

Academic Editors: J. Shu and F. Yu

Copyright © 2014 Liang Tian et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In the last years, researchers conducted several studies to evaluate machine translation quality based on the relationship between word alignments and the phrase table. However, existing methods usually employ ad hoc heuristics without theoretical support. So far, there is no discussion from the aspect of providing a formula to describe the relationship among word alignments, phrase table, and machine translation performance. In this paper, on one hand, we focus on formulating such a relationship for estimating the size of the extracted phrase pairs given one or more word alignment points. On the other hand, a corpus-motivated pruning technique is proposed to prune the default large phrase table. Experiments prove that the deduced formula is feasible: it not only can be used to predict the size of the phrase table but also can be a valuable reference for investigating the relationship between the translation performance and phrase tables based on different links of word alignment. The corpus-motivated pruning results show that nearly 98% of phrases can be reduced without any significant loss in translation quality.

1. Introduction

One of the best performing translation systems in Statistical Machine Translation (SMT) nowadays is the phrase-based model, which takes continuous word sequences as translation units [1, 2].

The fundamental data structure in phrase-based models is a table of phrase pairs with associated scores, which may come from probability distributions. Until now, several methods to extract phrase pairs from a parallel corpus have been proposed, such as using a probabilistic model [3], pattern mining methods [4], matrix factorization [5], heuristic-based methods [6–9], MBR-based methods [10], and model-based methods [11]. However, most commonly, this table is acquired from word alignments [12–15], which exhaustively enumerates all phrases up to a certain length consistent with the alignment [16].

When talking about the relationship between machine translation and word alignment or phrase table, researchers have sought better translation performance through at least two independent lines of research. On the one hand, different processes of interfering with word alignments were studied for better translation results. Some researchers have shown that translation quality depends on word alignment quality [7, 17]. Another group of researchers holds the opposite point of view: significant decreases in AER (alignment error rate) will not always result in significant increases in translation quality [8, 18, 19]. Recently, researchers in [20] made a systematic study and pointed out that alignments can improve the translation performance depending on the SMT systems and the type of corpus used.

On the other hand, phrase table pruning techniques were applied for better machine performance without losing the overall quality. Threshold values are considered to reduce the phrase table size, usually related to absolute scores of phrase pairs in the phrase table or relative scores among phrase pairs sharing the same source phrase. Authors in [21, 22] proposed significance testing in order to select only those phrase pairs which are the more-than-random occurring ones in the training corpus. References [23, 24] exploited the same idea for hierarchical SMT. References [25, 26] considered usage statistics of phrase pairs from translating sample data, based on how often a phrase pair was considered during decoding and how often it was used in the best translation. In [27, 28], the researchers adapted a triangulation approach to prune the phrase table, which requires additional bilingual corpora. Authors in [10, 29, 30] modified the phrase extraction methods in order to reduce the phrase table size. In [31, 32], they introduced an entropy-based pruning criterion to prune the generated phrase tables.

Both lines of research tried to reduce spurious results (errors or redundancies) to investigate the effect on the final translation performance based on different empirical statistics. In other words, many existing methods employ ad hoc heuristics without theoretical proof. In this paper, we study the relationship among word alignments, phrase table, and translation quality from a mathematical perspective. The resulting equations indicate that it is necessary to improve the quality of word alignment even if the alignment does not explicitly connect to the translation quality; in particular, it has been argued that lowering the quality of the alignment affects the translation only slightly. In this paper, two extreme cases, the best and the worst quality of word alignments, are used to illustrate the situation. Experimental results show that the quality of word alignment strongly affects the translation performance, measured by BLEU [33].

The contributions of this paper mainly focus on two aspects: (1) formulating the relationships among word alignment, phrase table, and machine translation; the equations indicate that the better the quality of the word alignment, the fewer the phrases will be and the better the translation performance, both on phrase level and at finer granularity levels; (2) a corpus-motivated pruning technique to prune the default large phrase table, which can remove nearly 85% of all the phrase pairs without hurting the performance of the translation.

The rest of the paper is organized as follows. Section 2 reviews the related works and background that will be used in this paper. Section 3 describes the relationship among the word alignment, phrase table, and machine translation from a mathematical perspective, and a corpus-motivated pruning method is introduced there. Section 4 describes the conducted experiments and presents the obtained results. The paper concludes with a summary and discussion of the results.

2. Related Works and Background

Our work is based on the dominant method of obtaining a phrase table from a word alignment trained with the EM (Expectation-Maximization) algorithm. To extract the phrases from the word alignment, the EM algorithm is used to train on the bilingual corpus for several iterations, and then phrase pairs that are consistent with this word alignment are extracted. Most current phrase-based SMT systems rely on the GIZA++ implementation of the IBM Models [34] to produce word alignments, running the algorithm in both directions, source to target and target to source. Various heuristics can then be applied to obtain a symmetrized alignment from those two. Most of them, such as grow-diag-final-and [8], start from the intersection of the two word alignments and enrich it with alignment points from the union.

Suppose a phrase pair is represented by $(\bar{f}, \bar{e})$, consistent with an alignment $a$, if all words $f_1, \ldots, f_j, \ldots, f_J$ in $\bar{f}$ that have alignment points in $a$ have these with words $e_1, \ldots, e_i, \ldots, e_I$ in $\bar{e}$, and vice versa. A more formal definition of consistency with the word alignment can be described as follows [35]:

\[
(\bar{f}, \bar{e}) \text{ consistent with } a \iff
\begin{aligned}
&\forall e_i \in \bar{e}: (e_i, f_j) \in a \Rightarrow f_j \in \bar{f} \\
&\forall f_j \in \bar{f}: (e_i, f_j) \in a \Rightarrow e_i \in \bar{e} \\
&\exists e_i \in \bar{e},\ f_j \in \bar{f}: (e_i, f_j) \in a
\end{aligned}
\tag{1}
\]

This algorithm indicates that all the phrases will be extracted if the biphrases $(\bar{f}, \bar{e})$, along with their alignment $a'$, satisfy the following two conditions [36]:

(i) $\bar{e}$ and $\bar{f}$ are consecutive word subsequences in the target sentence $e$ and source sentence $f$, respectively, and neither of them is longer than $k$ words;

(ii) $a'$, the alignment between the words of $\bar{f}$ and $\bar{e}$ induced by $a$, contains at least one link.
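As a concrete illustration, the consistency test in (1) together with conditions (i) and (ii) can be sketched as a brute-force enumeration over phrase boxes (a simplified sketch, not the authors' implementation; word positions and alignment links `(i, j)` are 0-based, and `k` is the phrase length limit):

```python
def extract_phrases(e, f, alignment, k=7):
    """Enumerate phrase pairs consistent with the alignment (Eq. 1):
    every link touching the box must lie inside it, the box must cover
    at least one link, and neither side exceeds k words."""
    pairs = []
    for i1 in range(len(e)):
        for i2 in range(i1, min(i1 + k, len(e))):
            for j1 in range(len(f)):
                for j2 in range(j1, min(j1 + k, len(f))):
                    # For each link, record whether it falls in the
                    # target span and in the source span.
                    in_box = [(i1 <= i <= i2, j1 <= j <= j2)
                              for i, j in alignment]
                    covered = any(a and b for a, b in in_box)
                    consistent = all(a == b for a, b in in_box)
                    if covered and consistent:
                        pairs.append((" ".join(e[i1:i2 + 1]),
                                      " ".join(f[j1:j2 + 1])))
    return pairs
```

For a fully, monotonically aligned sentence pair of target length I, this enumeration yields exactly I(I + 1)/2 phrase pairs, in line with the counting argument of Section 3.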

For the relationship between phrase table and machine translation, most researchers discuss the effects of different sizes of the phrase table, which concerns phrase table pruning techniques. There are mainly three pruning methods: statistical-based, significance testing, and entropy-based methods. In more detail, there are absolute and relative pruning methods among the statistical-based methods. The absolute pruning methods rely only on the statistics of a single phrase pair $(\bar{f}, \bar{e})$. They may prune all translations of a source phrase that fall below a threshold according to the count $N(\bar{f}, \bar{e})$ or the probability $p(\bar{e} \mid \bar{f})$ of the phrase pair $(\bar{f}, \bar{e})$, as shown in (2); hence they are independent of other phrases in the phrase table.

\[
N(\bar{f}, \bar{e}) < \tau_c, \qquad p(\bar{e} \mid \bar{f}) < \tau_p
\tag{2}
\]

However, the relative pruning methods consider the full set of target phrases for a specific source phrase $\bar{f}$. Given a pruning threshold $\tau_t$, a phrase pair $(\bar{f}, \bar{e})$ is discarded if

\[
p(\bar{e} \mid \bar{f}) < \tau_t \cdot \max_{e'} p(e' \mid \bar{f})
\tag{3}
\]
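The absolute thresholds in (2) and the relative threshold in (3) can be combined in a single pass over the table (a sketch; the dictionary layout and threshold names are illustrative, not from the paper):

```python
def prune_phrase_table(table, tau_c=1, tau_p=0.0, tau_t=0.0):
    """Keep a pair only if its count and probability clear the absolute
    thresholds (Eq. 2) and its probability is within a factor tau_t of
    the best translation for the same source phrase (Eq. 3).
    `table` maps each source phrase to {target phrase: (count, prob)}."""
    pruned = {}
    for src, targets in table.items():
        best = max(prob for _, prob in targets.values())
        kept = {tgt: (n, p) for tgt, (n, p) in targets.items()
                if n >= tau_c and p >= tau_p and p >= tau_t * best}
        if kept:
            pruned[src] = kept
    return pruned
```

Raising `tau_t` toward 1 keeps only translations close to the best-scoring one, which is exactly the relative criterion of (3).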

Both the significance testing and entropy-based methods try to find a threshold $\tau$ that reduces the size of the generated phrase table.

For the entropy-based approach in [32], the conditional Kullback-Leibler divergence [37] is used to measure the pruned model $p'(\bar{e} \mid \bar{f})$, which is to be as similar as possible to the original model $p(\bar{e} \mid \bar{f})$. A pruning threshold $\tau_E$ can then be chosen, and phrase pairs whose contribution to the relative entropy lies below that threshold are pruned if

\[
p(\bar{e} \mid \bar{f}) \left[\log p(\bar{e} \mid \bar{f}) - \log p'(\bar{e} \mid \bar{f})\right] < \tau_E
\tag{4}
\]

The significance testing pruning approach depends heavily on a $P$-value. The value is calculated from two-by-two contingency tables $CT(\bar{f}, \bar{e})$, as shown in Table 1, which can be used to represent the relationship between $\bar{f}$ and $\bar{e}$, where $N(\bar{f}, \bar{e})$ is defined as the number of parallel sentences that contain one or more occurrences of $\bar{f}$ on the source side and $\bar{e}$ on the target side, $N(\bar{f})$ is the number of parallel sentences that contain one or more occurrences of $\bar{f}$ on the source side, $N(\bar{e})$ is the number of parallel sentences that contain one or more occurrences of $\bar{e}$ on the target side, and $N$ is the number of parallel sentences.

Table 1: Contingency table.

                $\bar{e}$                            $\bar{e}'$
$\bar{f}$       $N(\bar{f},\bar{e})$                 $N(\bar{f}) - N(\bar{f},\bar{e})$                      $N(\bar{f})$
$\bar{f}'$      $N(\bar{e}) - N(\bar{f},\bar{e})$    $N - N(\bar{f}) - N(\bar{e}) + N(\bar{f},\bar{e})$     $N - N(\bar{f})$
                $N(\bar{e})$                         $N - N(\bar{e})$                                       $N$

Following Fisher's exact test, the probability of a contingency table with joint count $k$ can be calculated via the hypergeometric distribution:

\[
p_h(k) = \frac{\binom{N(\bar{f})}{k} \binom{N - N(\bar{f})}{N(\bar{e}) - k}}{\binom{N}{N(\bar{e})}}
\tag{5}
\]

Then the $P$-value can be calculated by summing the probabilities of tables that have a larger $N(\bar{f}, \bar{e})$:

\[
P\text{-value}(CT(\bar{f}, \bar{e})) = \sum_{k = N(\bar{f}, \bar{e})}^{\min(N(\bar{f}), N(\bar{e}))} p_h(k)
\tag{6}
\]

A phrase pair will be discarded if the $P$-value is greater than $\tau_F$, as shown in (7):

\[
P\text{-value}(CT(\bar{f}, \bar{e})) = \sum_{k = N(\bar{f}, \bar{e})}^{\min(N(\bar{f}), N(\bar{e}))} p_h(k) > \tau_F
\tag{7}
\]
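Equations (5)–(7) can be implemented directly with exact binomial coefficients (a sketch; the function names are ours, and `math.comb` requires Python 3.8+):

```python
from math import comb

def hypergeom_p(k, n_f, n_e, n):
    """Probability of a contingency table with joint count k (Eq. 5)."""
    return comb(n_f, k) * comb(n - n_f, n_e - k) / comb(n, n_e)

def p_value(n_fe, n_f, n_e, n):
    """One-tailed Fisher P-value: sum over tables whose joint count is
    at least the observed N(f, e) (Eq. 6)."""
    return sum(hypergeom_p(k, n_f, n_e, n)
               for k in range(n_fe, min(n_f, n_e) + 1))

def keep_pair(n_fe, n_f, n_e, n, tau_f):
    """A phrase pair is discarded when its P-value exceeds tau_f (Eq. 7)."""
    return p_value(n_fe, n_f, n_e, n) <= tau_f
```

A pair that co-occurs in all five sentence pairs in which either side appears, out of 10,000 sentence pairs, gets a vanishingly small $P$-value and is kept at any reasonable threshold.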

Generally speaking, the goal of the entropy-based method is to remove redundant phrases, whereas the other approaches try to remove low-quality or unreliable phrases.

The knowledge introduced above will be used in our discussion of the relationships among word alignment, phrase table, and machine translation. More details are given in Section 3.

3. Word Alignment & Phrase Table & Machine Translation

In this section, we show from a theoretical aspect that the quality of the underlying word alignments has a strong influence on the performance of both the phrase table and the phrase-based SMT system.

3.1. Word Alignment & Phrase Table. As introduced in the previous section, there are some direct relationships between phrase table and word alignments. In this part, the relationship between the word alignments and the extracted phrases is formulated to estimate the size of the resulting phrase table. Two different word alignment cases are considered. First, the phrases are extracted from the full word alignment links of parallel sentences, which constitute the best word alignments. Second, phrases based on a single link of word alignment are investigated, which can be treated as the worst alignment information.

3.1.1. Phrases Extraction from Full Word Alignment. Suppose a source sentence $f_1^J = f_1 f_2 \cdots f_J$ will be translated into a target sentence $e_1^I = e_1 e_2 \cdots e_I$, and only one target word is allowed to align to one or multiple source words, where $I$ and $J$ are the lengths of the target and source sentences, respectively, and $i$ and $j$ are the positions of words in the target and source sentences. If each target word $e_i$ has at least one alignment link to a word in the source text, named a full word alignment, the fewest phrase pairs can be extracted.

For the first case, considering the alignment points in Figure 1, each word $e_i$ in the target sentence has at least one alignment point and there is no crossing word alignment, which means words are monotonically aligned. Starting with the first word $e_1$ in the target sentence, the phrase pairs with possible extents can be extracted, for example, $(e_1 \mid f_1)$, $(e_1 e_2 \mid f_1 f_2 f_3)$, $(e_1 e_2 \cdots e_i \mid f_1 f_2 f_3 \cdots f_j)$, $(e_1 e_2 \cdots e_i \cdots e_I \mid f_1 f_2 f_3 \cdots f_j \cdots f_J)$, and so forth, where a total of $I$ phrase pairs can be obtained. Starting with the second word $e_2$ in the target sentence, $I - 1$ phrase pairs can be extracted, for example, $(e_2 \mid f_2 f_3)$, $(e_2 \cdots e_i \mid f_2 f_3 \cdots f_j)$, $(e_2 \cdots e_i \cdots e_I \mid f_2 f_3 \cdots f_j \cdots f_J)$, and so on. Thus, the total number of phrase pairs extracted from the fully aligned sentences can be defined as

\[
I + (I - 1) + \cdots + 1 = \frac{I(I+1)}{2}
\tag{8}
\]

Actually, the previous alignment type cannot deal with word alignments that contain cross-alignments, as shown in Figure 2, where the fourth and the fifth Chinese words are cross-aligned with words in the source sentence. According to (8), there should be 21 possible phrase pairs. However, only 18 phrases are generated by the conventional phrase extraction algorithm.

The general form of the above alignments is illustrated in Figure 3, where the cross-alignment takes place between the neighboring words $e_{i_{\text{begin}}}$ and $e_{i_{\text{end}}}$, that is, $i_{\text{end}} = i_{\text{begin}} + 1$. Any phrase pairs that fall between the bounds of cross-aligned words cannot be extracted in line with (8); that is, $(i_{\text{begin}} - 1) + (I - i_{\text{end}})$ phrases should be excluded:

\[
N_{\text{ca}} = (i_{\text{begin}} - 1) + (I - i_{\text{end}})
\tag{9}
\]


Figure 1: Full word alignments.

Figure 2: An example of nonmonotonic alignments, where alignment links cross between parallel sentences (the English side reads "I am able to do it well.").

So, if we denote the reduced phrases by $N_{\text{ca}}$ (number of cross-alignments), a more practical equation can be revised as

\[
N_{\min} = \frac{I(I+1)}{2} - N_{\text{ca}}
\tag{10}
\]

If there is more than one cross-alignment, as illustrated in Figure 3, the total number of extracted phrases can be generalized as

\[
N_{\min} = \frac{I(I+1)}{2} - \sum_{k=1}^{m} \left[(i^k_{\text{begin}} - 1) + (I - i^k_{\text{end}})\right]
\tag{11}
\]

The reduced phrase pairs are determined by the cross-alignment points $i_{\text{begin}}$ and $i_{\text{end}}$: $i^k_{\text{begin}}$ represents the start position and $i^k_{\text{end}}$ the last position of the $k$th pair of words (bounded by the crossing links), and $m$ is the total number of cross-alignments.
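Equation (11) reduces to a short sum over the list of crossing pairs (a sketch; positions are 1-based, as in the text, and the function name is ours):

```python
def n_min(I, crossings):
    """Phrase pairs from a fully, monotonically aligned target of
    length I (Eq. 8), minus the pairs lost to each neighboring-word
    crossing (Eq. 11). `crossings` lists (i_begin, i_end) pairs with
    i_end = i_begin + 1."""
    lost = sum((i_begin - 1) + (I - i_end) for i_begin, i_end in crossings)
    return I * (I + 1) // 2 - lost
```

With no crossings, a six-word target yields 6 · 7 / 2 = 21 pairs; a single crossing between positions 4 and 5 removes (4 − 1) + (6 − 5) = 4 of them.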

Figure 3 illustrates the situation in which there is no word between the crossing dependencies. In reality, however, there may be words between the crossed words, as illustrated in Figure 4.

For the third case, the words between $e_{i_{\text{begin}}}$ and $e_{i_{\text{end}}}$ may have a similar alignment situation to that illustrated in (9). Thus, the following equation can be derived to calculate $N_r$, the exact number of phrase pairs that cannot be generated:

\[
N_r = \sum_{k=1}^{m} \sum_{\substack{\alpha = 0 \\ \beta = 0}}^{i^k_{\text{end}} - i^k_{\text{begin}} - 1} \left[(i^k_{\text{begin}} + \alpha - 1) + (I - i^k_{\text{end}} + \beta)\right]
\tag{12}
\]

where $\alpha$ indicates the shifted word position after the first crossed word $i^k_{\text{begin}}$ and $\beta$ indicates the word position before the last crossed word $i^k_{\text{end}}$; $\alpha$ and $\beta$ will not be greater than the number of words between the first and the last crossed words.

Finally, the number of phrase pairs that can be extracted from the full word alignments, measured on the target side, can be described by

\[
N_{\min\text{-}t} = \frac{I(I+1)}{2} - N_r
\tag{13}
\]

Figure 3: Alignments with a cross link of neighboring words.

Figure 4: Alignments with a cross link involving distant words.

This equation shows that the size of the phrase table is decided by the target length of the given sentence if the word alignment is the full word alignment, that is, the best and ideal quality word alignment. It has been proven that the bidirectional training approach for word alignment can improve the quality of word alignment [8]. If the alignment is deduced from the other direction (the source side), the extracted phrases from the source side can similarly be represented by

\[
N_{\min\text{-}s} = \frac{J(J+1)}{2} - N_r
\tag{14}
\]

Equations (13) and (14) show that the number of phrases is heavily affected by the lengths of the source and target sentences. Given the best word alignment, if the length of the source sentence equals that of the target side ($I = J$), the number of phrases will be minimal ($N_{\min\text{-}s}$ or $N_{\min\text{-}t}$). If they are not equal, the number of extracted phrases will be larger than this minimal value ($N_{\min\text{-}s}$ and $N_{\min\text{-}t}$) after the intersection and union operations during the alignment process.

3.1.2. Phrases Extraction from Worst Alignment. Suppose only one word $f_j$ in the source sentence is aligned to a target word $e_i$, as shown in Figure 5. Note that unaligned words near the word $f_j$ may lead to multiple possibilities: each preceding and following word next to it can form part of a phrase, and so can the neighboring words of $e_i$ in the target sentence. In other words, if we know the different extraction situations in each source and target sentence, then the total count of phrase pairs can be derived as the product of the two numbers. Algorithm 1 describes the different phrase extraction scenarios according to different positions in the source sentence.

Obviously, there are in total $i$ words with which a phrase can start ($e_1, e_2, e_3, \ldots, e_i$). For a target sentence consisting of an odd or even number of words, the number of phrase pairs $f(i)$ at different positions is

\[
f(i) = i(I - i + 1) = i(I - i) + i
\tag{15}
\]

The number of phrases $g(j)$ on the source side can be calculated in a similar way:

\[
g(j) = j(J - j + 1) = j(J - j) + j
\tag{16}
\]

Figure 5: Sentences with a single alignment link.

Based on (15) and (16), the total number of phrase pairs $N_{\max}$ is given by

\[
N_{\max} = \left[i(I - i) + i\right]\left[j(J - j) + j\right]
\tag{17}
\]

where $0 \le i \le I - 1$ and $0 \le j \le J - 1$. Obviously, (15) and (16) follow a symmetric distribution, and the maximal value is achieved at the middle position. Given a sentence consisting of $L$ words, a formal description of the middle position in the sentence can be presented by

\[
\text{word}_{\text{mid}} = \frac{1 + L_{\text{odd}}}{2}, \qquad \frac{L_{\text{even}}}{2} \ \text{or} \ \left(\frac{L_{\text{even}}}{2} + 1\right)
\tag{18}
\]

We can see that (17) attains its maximal value at the middle position of a given sentence; that is, the nearer to the middle word, the more phrase pairs can be produced. Take Figure 2 as an example: there are 18 phrases based on the full word alignments, while there will be 240 phrases when only the fifth word (do) in the source sentence is aligned to the third word (做, zuo) in the target side. From the viewpoint of the phrase table size, the value calculated from (17) is much larger than that from (13). This shows that the better the quality of the word alignment, the fewer the phrases will be.
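The single-link count in (17) can be checked against the Figure 2 example (a sketch; we assume I = 6 and J = 8 tokens, the token counts that reproduce the figure of 240; positions are 1-based):

```python
def n_max(i, j, I, J):
    """Phrase pairs extractable from the single alignment link
    e_i <-> f_j (Eq. 17)."""
    return (i * (I - i) + i) * (j * (J - j) + j)
```

`n_max(3, 5, 6, 8)` evaluates to 12 · 20 = 240, and sweeping `i` or `j` confirms that the maximum sits at the middle positions given by (18).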

Note that there will be some shared phrases extracted from the full word alignment and from only a single word alignment, which explains why some translation ability remains even with only one alignment point under the existing phrase extraction algorithm. For instance, the phrase pairs $(e_1 e_2 \cdots e_i \mid f_1 f_2 f_3 \cdots f_j)$, $(e_2 \cdots e_i \mid f_2 f_3 \cdots f_j)$, and $(e_i \mid f_j)$ are the shared ones; when given a phrase such as $f_1 f_2 f_3 \cdots f_j$, the phrase tables generated from both alignment types can possibly translate it correctly to $e_1 e_2 \cdots e_i$.

3.2. Phrase Table & Machine Translation. Nowadays, the key step of the process of phrase-based machine translation (SMT) involves inferring a large table of phrase pairs that are translations of each other from a large corpus of aligned sentences. The set of all phrase pairs, together with estimates of conditional probabilities and other useful features, is called the Phrase Table (PT), which is applied during the decoding process [29].

Nevertheless, phrases in the PT are very dependent on the system that uses them. Some phrases might be present in the PT but never used in translations, because they are, for example, either ranked too low or erroneous translations. In other words, there are too many redundant phrases in the PT. This situation has led many researchers to propose different techniques to reduce the size of the phrase table, including the methods introduced in Section 2. These works can be treated as research on the relationship between phrase table and machine translation.

Table 2: Pruned phrase table and translation quality.

Pruning method       Languages   Pruning rates    BLEU (Δ)
Count/significance   fr-en       64.6%~85.9%      −0.1
                     es-en       67.9%~88.1%
                     de-en       62.7%~84.1%
Entropy              fr-en       85%~95%          −1.0
                     es-en       85%~95%
                     de-en       85%~95%
Bilingual-segment    fr-en       —                −1.2
                     es-en       98%
                     de-en       94%

All the works presented in Section 2 show that translation results can be achieved from a much smaller set of phrase pairs, with a several-fold increase in translation speed and little or no loss in translation accuracy. As reported by [32], count-based and significance-based pruning techniques can yield large savings, between 70% and 90%, while the entropy-based pruning approach can consistently reduce the entries by 85% to 95% of the phrase table. In [29], they even pointed out that the pruning rate can reach 98% using a bilingual segmentation method with finite state transducers. More details about the pruning rates and the related translation quality (measured by BLEU [33]) are shown in Table 2, where the BLEU column records the maximal decrease (−) from the 100% size of the phrase table to the maximal pruning rate of the phrase table.

The pruning rates presented above are obtained from empirical or experimental results. We can also estimate the pruning rate from a theoretical aspect. In practice, we notice that the phrases extracted from the middle word alignment point include most of the phrases extracted from the full word alignment. All the phrases excluding the ones extracted from the full word alignment can be removed. In other words, a rough pruning rate can be calculated as the ratio of phrases extracted from the best alignment ($N_{\min}$) to those from the worst alignment ($N_{\max}$):

\[
R_{\text{pruned}} = 1 - \frac{N_{\min\text{-}t}}{N_{\max}}
\tag{19}
\]

If we want to quickly estimate what percentage of phrases can be pruned in a given corpus, the average statistics of the corpus can be used, as in (20). This equation shows that the pruning rate is decided not only by the sentence lengths in a parallel corpus but also by the alignment position. Consider

\[
R_{\text{pruned}} = 1 - \frac{I_{\text{average}}(I_{\text{average}} + 1)/2}{\left[i(I_{\text{average}} - i) + i\right]\left[j(J_{\text{average}} - j) + j\right]}
\tag{20}
\]

In reality the numerator trends to bigger and the denom-inator will be smaller so the pruning rate will be smaller

6 The Scientific World Journal

Start from 1198901

119873phrases = 119868 minus 119894 + 1Start from 119890

2

119873phrases = 119868 minus 119894 + 1Similarly forall119894 isin (119894 = 119894 119868) in the target side

119873phrases = 119868 minus 119894 + 1

Algorithm 1 Phrase extraction based on the only one alignment point 119890119894harr 119891119895

Table 3 Training data statistics WMT 2006

Language pairs Sentences Ave source len Ave target lenfr-en 688031 2267 2007es-en 730740 2152 2083de-en 751088 2031 2137

than the calculation value When the alignment position isin the middle position (middle alignment) there will getthe maximal pruning rates as shown in (21) Based on theequation we can estimate the pruning rate range based ondifferent word alignment position in European corpus [46]Table 3 shows the statistics of the WMT 2006 and Table 4compares different pruning rates reported in researchersrsquopaper [22 32] and the calculation ranges based on (20) Fromwhichwe can see that the calculation result is very close to theexperiment results presented in [32] The authors concludedtheir findings that ldquothe novel entropy-based pruning oftenachieves the same Bleu score with only half the number ofphrasesrdquo A very bold assumption is thatmost phrases (at leasthalf) can be removed from the original extracted phrasesConsider

\[
R_{\max} = 1 - \frac{I_{\text{avg}}(I_{\text{avg}} + 1)/2}{\left[i_{\text{mid}}(I_{\text{avg}} - i_{\text{mid}}) + i_{\text{mid}}\right]\left[j_{\text{mid}}(J_{\text{avg}} - j_{\text{mid}}) + j_{\text{mid}}\right]}
\tag{21}
\]
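As a quick numeric check of (20) and (21), the rate can be evaluated at the two extreme alignment positions (a sketch with our own function name; the lengths and positions are illustrative):

```python
def pruned_rate(i_avg, j_avg, i, j):
    """Estimated pruning rate from (20): one minus the ratio of the
    full-alignment table size (numerator) to the single-link table
    size (denominator)."""
    n_min = i_avg * (i_avg + 1) / 2                        # full-alignment size
    n_max = (i * (i_avg - i) + i) * (j * (j_avg - j) + j)  # single-link size
    return 1 - n_min / n_max

# For 20-word sentences, a middle alignment (i = j = 10) gives the
# maximum of (21), while an edge alignment (i = j = 1) gives the low end:
print(round(pruned_rate(20, 20, 10, 10), 3))  # 0.983
print(round(pruned_rate(20, 20, 1, 1), 3))    # 0.475
```

These two endpoints bracket the kind of ranges reported in Table 4.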

We have seen that the phrase table pruning rate is strongly affected by the lengths and word positions of the source and target sentences. As reflected in the pruning equation (20), the phrase table might be pruned more accurately based on features (e.g., phrase length, phrase co-occurrence) of the bilingual and monolingual corpora. The researchers in [27] reduced the phrase table by significance testing of phrase-pair co-occurrence in a bilingual corpus, while He et al. [38, 39] pruned phrase pairs by source-side phrase frequency, using key phrases extracted from a monolingual corpus. We want to combine the advantages of the two approaches: the key phrases can be used to check whether the extracted phrases are actually used in practice, and the significance testing can be used to measure the quality of the extracted phrase pairs. Our method considers features of both the source and target phrases.

Table 4: Comparison between theoretical estimation results and other pruning techniques' results.

Language pairs   j_mid   i_mid   R_pruned (%)
fr-en            12.34   11.04   53.5∼98.6
es-en            11.26   11.42   49.3∼98.4
de-en            11.16   11.19   44.9∼98.3

The basic metric for phrases that are often used in practice is the frequency with which a phrase appears in a monolingual corpus: the more frequently a phrase appears, the more likely it is to be used. However, longer phrase pairs would be removed this way, because such phrases rarely appear in a monolingual corpus.

The C-value, a measure from automatic term recognition, can be used to solve this problem, as described in [40]. The method not only fully considers frequency information, such as the frequencies of a phrase and of a subphrase appearing inside longer phrases, but is also an efficient algorithm.

In this algorithm (shown as Algorithm 2), 4 factors (L, F, S, N) are used to determine whether a phrase p is a key phrase [38]:

(i) L(p): the length of p;
(ii) F(p): the frequency with which p appears in a corpus;
(iii) S(p): the frequency with which p appears as a substring of other, longer phrases;
(iv) N(p): the number of phrases that contain p as a substring.

Given a monolingual corpus, key phrases can be extracted efficiently according to Algorithm 2. First (line 1), all possible phrases are extracted as candidate key phrases (e.g., the phrase table extracted by Moses [41]). Second (lines 3 to 7), the C-value of each candidate is computed according to whether the phrase appears by itself (line 4) or as a substring of longer phrases (line 6). The C-value is in direct proportion to the phrase length (L) and occurrences (F, S) but in inverse proportion to the number of phrases that contain the phrase as a substring (N); this overcomes the limitations of the plain frequency measure. A phrase is regarded as a key phrase if its C-value is greater than a threshold ε. Finally (lines 11 to 14), S(q) and N(q) are updated for each substring q.

The algorithm was originally used for pruning the rule table extracted by hierarchical phrase-based SMT [42].


Input: Monolingual Corpus
(1)  Extract candidate phrases
(2)  for all phrases p, in length-descending order, do
(3)    if N(p) = 0 then
(4)      C-value = (L(p) − 1) × F(p)
(5)    else
(6)      C-value = (L(p) − 1) × (F(p) − S(p)/N(p))
(7)    end if
(8)    if C-value ≥ ε then
(9)      add p to KP
(10)   end if
(11)   for all substrings q of p do
(12)     S(q) = S(q) + F(p) − S(p)
(13)     N(q) = N(q) + 1
(14)   end for
(15) end for
Output: Key Phrase Table KP

Algorithm 2: Key phrase extraction from a monolingual corpus.
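A compact sketch of Algorithm 2 (the data structures and names are ours; `freq` maps a phrase, stored as a tuple of words, to its corpus frequency F(p)):

```python
from collections import defaultdict

def extract_key_phrases(freq, eps):
    """C-value key-phrase filtering as in Algorithm 2: scan candidates in
    length-descending order, score each, then charge its frequency to all
    of its proper contiguous substrings."""
    S = defaultdict(int)  # S(q): frequency of q inside longer phrases
    N = defaultdict(int)  # N(q): number of longer phrases containing q
    kp = []
    for p in sorted(freq, key=len, reverse=True):
        if N[p] == 0:
            c_value = (len(p) - 1) * freq[p]
        else:
            c_value = (len(p) - 1) * (freq[p] - S[p] / N[p])
        if c_value >= eps:
            kp.append(p)
        for q in {p[s:s + l] for l in range(1, len(p))
                             for s in range(len(p) - l + 1)}:
            S[q] += freq[p] - S[p]
            N[q] += 1
    return kp

freq = {("machine", "translation", "system"): 2,
        ("machine", "translation"): 5,
        ("translation",): 8}
print(extract_key_phrases(freq, eps=3))
# the 3-gram scores (3-1)*2 = 4; the 2-gram scores (2-1)*(5 - 2/1) = 3;
# the unigram scores 0, so only the first two pass the threshold
```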

Input: Bilingual Corpus and KP Table from Algorithm 2
(1) for all phrases p′ in KP do
(2)   if −log(P-value(p′)) ≥ τ_F then
(3)     add p′ to P_PT
(4)   end if
(5) end for
Output: Pruned Phrase Table P_PT

Algorithm 3: Algorithm for key phrase pruning.

However, the idea can also be applied perfectly well to phrase-based SMT [1, 41]. After this filtering step, a filtered phrase table, called KP, is generated.

The phrases in the table KP are all phrases that may be used in practice. However, there is still some noise in the KP table; for example, a source phrase occurring in the source sentences and a target phrase occurring in the target sentences may not be well matched, that is, the phrase translation is not good enough. To handle this case, it is convenient to construct a two-by-two contingency table that tabulates the sentence pairs in which the two types of matches occur and do not occur [21, 22, 43]. The calculated result is also called a P-value, as introduced in Section 2. A bilingual-corpus constraint can then be added to filter the noisy phrases, as presented in Algorithm 3. After this step, a phrase table with a large pruning rate (called P_PT) is generated.
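The P-value of one phrase pair can be computed from its 2×2 contingency table with Fisher's exact test, as in the significance-pruning work cited above. A sketch (the function names are ours; log-gamma keeps the tiny probabilities numerically stable):

```python
from math import exp, lgamma, log

def log_choose(n, k):
    """log of the binomial coefficient C(n, k), via log-gamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def neg_log_p_value(c_st, c_s, c_t, n):
    """Right-tailed Fisher's exact test for a phrase pair: c_st sentence
    pairs contain both phrases, c_s / c_t contain the source / target
    phrase, out of n sentence pairs in the corpus."""
    p = sum(exp(log_choose(c_s, k) + log_choose(n - c_s, c_t - k)
                - log_choose(n, c_t))
            for k in range(c_st, min(c_s, c_t) + 1))
    return -log(p)

# Both phrases co-occur in all 3 of their sentences, corpus of 10 pairs:
# p = C(3,3) * C(7,0) / C(10,3) = 1/120, so -ln p = ln 120
print(round(neg_log_p_value(3, 3, 3, 10), 3))  # 4.787
```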

3.3. Word Alignment & Machine Translation. Based on the discussion in Sections 3.1 and 3.2, we can see that the size of the phrase table is heavily affected by the quality of word alignment: the better the word alignment, the smaller the phrase table. Besides this, machine translation performance is not directly decided by the size of the phrase table; at least 50% of the phrases can be removed without hurting the performance of machine translation. Experiments show that the quality of phrase translation plays a more important role in the final translation quality in practice. It is therefore necessary to improve the quality of word alignment under the existing word-alignment-based phrase extraction algorithm.

Now it is a little easier to talk about the relationship between word alignment and machine translation. The translation of a sentence is directly derived from the phrase table, while the phrase table is extracted from word alignment. In other words, the quality of word alignment affects the performance of machine translation through the phrase table; that is to say, word alignment quality should affect translation performance a lot. However, the works listed in [8, 18, 19] show that improved word alignment quality does not always significantly affect the translation result.

We do not agree with this conclusion. First, either the training or testing corpus used in their experiments is limited to a small size, or the improvement in word alignment quality is not significant (an AER decrease between 0.1 and 0.2). Second, we think it is the existing phrase extraction algorithm that leads to the inconclusive result. We pointed out at the end of Section 3.1 that, even when the worst alignments are used, the extracted phrase table still has some ability to translate some input phrases. This is because of the shared phrases extracted from the only one


[Figure 6: Moses' main modules and models. A bilingual corpus feeds GIZA++ (the aligner) and phrase extraction to build the phrase table; a monolingual corpus feeds SRILM/IRSTLM (the LM toolkits) to build the language model; the Moses decoder uses both to translate source text into target text.]

alignment link. Even a single word alignment link can translate some input phrases or sentences, not to mention more word alignment links.

We will discuss the situation with different word alignments in the next section, where machine translation under the best and the worst word alignment will be fully surveyed. After that, we can see that there are at least three advantages to improving the quality of word alignment, as follows.

(i) Better alignment will extract fewer phrase pairs and keep the phrase table at a manageable size.

(ii) Better alignment will reduce the decoding time when searching for the most probable translation in the phrase table.

(iii) Better alignment will produce better word- or phrase-level translations.

4. Experiments and Discussions

In order to evaluate the conclusions drawn in the previous section, experiments are carried out using the Moses toolkit [41], which provides a complete package of the modules required for training and for generating translations.

4.1. Experiment Settings. Like other SMT approaches, the Moses toolkit requires two models (as shown in Figure 6): the Language Model (LM) and the Translation Model (TM). In Moses, the LM is usually created by the external toolkits SRILM [44] or IRSTLM [45]; in our experiments, the IRSTLM toolkit is used for the large training corpus. Before training, tokenization for the languages of the European corpus [46] uses the scripts integrated in Moses, and segmentation for Chinese uses the Stanford Chinese segmenter [47] with the CTB (Chinese Treebank) standard [48, 49]. All language models are trained as five-gram models.

The phrases are extracted from the results generated by GIZA++ [8]. Word alignment was trained with ten iterations of IBM Model 1 and Model 4 [34] and six iterations of the HMM alignment model [50].

To extract all possible phrases so as to verify (13) and (17), the maximal phrase length is set to 30. This can extract most

Table 5: Statistics of the first 10,000 sentences of the WMT 2006 corpus.

Languages   Ave. source len.   Ave. en len.
fr-en       22.89              20.15
es-en       21.71              20.94
de-en       20.55              21.42

Table 6: Statistics of the test data (3064 sentences).

Languages   Ave. source len.   Ave. en len.
fr-en       32.95              27.81
es-en       29.94              27.81
de-en       26.92              27.81

Table 7: Statistics of CWMT 2013 + UM-Corpus.

Languages      Tokens        Average length   Vocabularies
English        152,161,233   19.37            1,655,080
Character_CE   229,110,265   29.16            397,442
CTB_CE         123,917,395   15.78            1,331,505

phrases in terms of word alignments and lets us focus on the relations among word alignment, phrase table, and machine translation.

In order to better reveal the relationship of alignment and phrase table to translation quality, the leftmost, middle, and rightmost word alignments are extracted from the original full alignment for the worst-alignment situation (i.e., only one alignment link). Here the leftmost, middle, and rightmost word alignments denote the alignment link attached to the first, middle, and last word on the source language side.

4.2. Corpus Information. Three language pairs from the WMT 2006 shared translation task are chosen as training data: French-English (fr-en), German-English (de-en), and Spanish-English (es-en). These three language pairs are used in the experiments that verify the equations deduced in Section 3 and the relationship between word alignment and machine translation. Table 3 shows the statistics of the original corpus [46]. However, the number of phrases extracted from only one link is too large for our server, so the first 100,000 sentences of the European corpus (statistics shown in Table 5) are chosen to verify the equations relating word alignment and phrase table. Table 6 presents the statistics of the 3064 test sentences from WMT 2006, which share the same English side.

To test our phrase table pruning algorithm, a large Chinese-English parallel corpus is brought in. The corpus consists of two parts: 3,289,497 sentences from CWMT 2013 [51] and 4,157,556 sentences from UM-Corpus [52, 53]. After removing repeated and mismatched sentences from the combined parts, 7,445,190 sentences are left; the statistics of the combined parallel corpus are presented in Table 7. Note that the statistics of the Chinese sentences are counted both at the character level (each Chinese character is treated as one token; Character_CE in Table 7) and


Table 8: Statistics of English & Chinese corpora.

Languages   Tokens          Average length   Vocabularies
English     804,584,844     24.64            2,357,374
Chinese     3,007,196,322   33.43            1,533,343

Table 9: Statistics for EC-Corpus testing data.

Languages      Tokens    Average length
English        118,802   23.76
Character_CE   153,705   30.74
CTB_CE         118,499   23.70

at the word level (segmented by the Stanford CTB segmenter; presented as CTB_CE).

The UM-Corpus also contains English, Chinese, and Portuguese monolingual corpora; in this experiment, the English and Chinese corpora are used. There are 35,938,697 English sentences from the news domain and more than 70 million Chinese sentences from Chinese portal webpages. The statistics of the two corpora are shown in Table 8.

The 5000 sentences of test data (named UM-Testing) are designed according to the domains of the UM-Corpus: 1500 sentences for news, 500 for laws, 500 for novels, 600 for theses, 700 for education, 600 for science, and 600 for subtitles. The statistics of the testing data are shown in Table 9.

4.3. Equation of Word Alignments versus Phrases. The calculation of the number of phrases requires the full word alignment links; full word alignment, in other words, is the best quality of word alignment. It would obviously be very time-consuming to manually edit the word alignments of large sentence sets. It has been observed in the past that the generative models used for statistical word alignment create alignments of increasing quality as they are exposed to more data [19]. The intuition behind this is simple: as more co-occurrences of source and target words are observed, the word alignments become better. Following this intuition, if we wish to increase the quality of a word alignment, we give the alignment process access to extra data, which is used only during the alignment process and then removed. If we wish to decrease the quality of a word alignment, we divide the parallel corpus into pieces, align the pieces independently of one another, and finally concatenate the results. In our experiments, we first use the entire WMT 2006 corpus as training data and then choose the first 100,000 alignments as the full word alignments.

The leftmost, middle, and rightmost word alignment points based on the full word alignment are also obtained from the results generated by GIZA++.

The minimal and maximal sizes of the phrase tables extracted from the full and middle word alignments, and the differential rates, are shown in Table 10. The differential rate is calculated by

\[
R_{\text{diff-min}} = \frac{\left|N_{\min} - N_{\text{moses-min}}\right|}{N_{\min}},
\tag{22}
\]
\[
R_{\text{diff-max}} = \frac{\left|N_{\max} - N_{\text{moses-max}}\right|}{N_{\max}},
\tag{23}
\]

where N_moses-min and N_moses-max denote the minimal and maximal numbers of phrases extracted by Moses.

From Table 10 we can see that the actual number of phrases is close to the calculated result. Given the small differential rates, (13) and (17) are applicable to practical experiments. The differences may come from two sources: (1) the assumed full word alignment produced by GIZA++ may contain too much noise; (2) equations (13) and (17) do not consider repetitions of the extracted phrases.
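Checking (22) against the fr-en row of Table 10 (a one-line sketch with our own function name):

```python
def diff_rate(n_calc, n_moses):
    """Differential rate of (22)/(23): relative gap between the calculated
    table size and the number of phrases Moses actually extracts."""
    return abs(n_calc - n_moses) / n_calc

# fr-en row of Table 10
print(round(diff_rate(9080307, 9136283), 3))        # 0.006
print(round(diff_rate(1544021509, 1541902790), 3))  # 0.001
```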

4.4. Phrase Table versus Machine Translation. We have shown that there is no direct relationship between the size of the phrase table and machine translation performance. As reflected in (17) and Table 4, at least half of the phrases can be removed without hurting the performance of machine translation too much. In this part, we use the large Chinese-English corpus to test our pruning algorithm based on monolingual and bilingual corpora.

According to our experience, different thresholds generate phrase tables of different sizes and result in different translation performance. Therefore, one of the key steps is choosing the pruning thresholds in Algorithms 2 and 3. First, a baseline threshold is calculated from the logarithm of the number of sentences in the given corpus (N), that is, log(N). For the C-value threshold, the baseline is about 25 (log(3.6 × 10^7)). For the P-value threshold, the negative logarithm is chosen: because the probabilities involved are incredibly tiny, we work instead with the negatives of the natural logs of the probabilities. For example, instead of selecting phrase pairs with a P-value less than exp(−15), we select phrase pairs with a negative-log-P-value greater than 15. The baseline threshold for τ_F is 45 (log(7.4 × 10^6)).

Second, thresholds above and below the baseline are tried. The effect of the different thresholds on the phrase table size and BLEU scores is shown in Table 11.

From Table 11 we can see that nearly 98% of the phrases can be discarded using the monolingual and bilingual corpora while the translation quality does not become much worse (a decrease of only 2.35 BLEU). Experiments show that reducing the phrase table by about 85% generates the best translation quality, which not only reproduces results similar to those of other researchers [3, 21, 32, 38] but also indicates that most phrases are not needed at all. The best translation is achieved with about 15% to 30% of the original phrase table. That is to say, only a small fraction of phrases are essential, while the majority of the extracted phrases are spurious; in some sense they are redundant and should be discarded.


Table 10: Phrase counts: comparison between Moses extraction and the calculated results (phrase length K = 30).

Languages   N_moses-min   N_min        R_diff-min   N_moses-max     N_max           R_diff-max
fr-en       9,136,283     9,080,307    0.006        1,541,902,790   1,544,021,509   0.001
es-en       10,958,551    10,895,266   0.006        1,401,977,185   1,489,422,170   0.058
de-en       9,219,921     8,929,095    0.033        1,606,176,665   1,686,043,482   0.047

Table 11: Effect of the C-value and P-value thresholds on phrase table size and BLEU scores (UM-Corpus + CWMT 2013).

C-value ε   −log(P-value) τ_F   Phrase table (%)   BLEU (%)
0           0                   100                28.13
10          20                  83.6               28.09
25          45                  52.5               27.98
100         60                  29.7               28.11
200         100                 14.3               28.27
400         200                 2.02               25.78

Table 12: Translation performance (BLEU, 3064 sentences) on different word alignment types.

Languages   A_full   A_lmost   A_sure   A_mid   A_rmost
fr-en       24.80    4.48      22.26    9.96    2.48
es-en       30.76    3.12      28.24    11.28   2.32
de-en       21.66    2.30      19.28    9.76    2.94

4.5. Word Alignment versus Translation Quality. In order to investigate the effect of different word alignment types and phrase table sizes on translation performance, we selected five types of word alignments to generate phrase tables and then measured translation quality with the BLEU metric: the full word alignment points (A_full), the leftmost word alignment (A_lmost), the sure alignment points (A_sure), the middle word alignment point (A_mid), and the rightmost word alignment point (A_rmost).

Clearly, A_full is the best-quality word alignment, followed by A_sure, which contains only the 1-to-1 alignments. A_lmost, A_mid, and A_rmost are much worse, as each contains only one alignment link. As mentioned previously, all these alignment points are obtained from the results generated by GIZA++. The experiment is carried out on the WMT 2006 corpus. The number of phrases extracted for each alignment type and the corresponding BLEU scores are presented in Figure 7 and Table 12, respectively.

The histograms show that there is no direct proportional relationship between translation performance and the number of extracted phrases. The numbers in Figure 7 show that the German-English pair generates the most phrases, yet its translation quality is not the best; conversely, although the fewest phrases are produced for the Spanish-English pair, it obtains the best translation performance. As explained by (17), more phrases usually indicate worse word alignment quality, or, put differently, too many unaligned words in the produced alignment. The results in Figure 7 and Table 12 show that (1) full word alignment contributes most to translation performance, and there is not too much decrease when only the sure alignments are provided; (2) the first- and last-word alignments do not produce high translation quality; (3) although the middle-position alignment extracts the most phrases, its translation quality is not the best, yet among the single-link alignments the middle one contributes most to the final translation.

In fact, the translation from only-one-link word alignment should be even higher; after all, the default full word alignments, trained on a large corpus without manual editing, contain much noise.

5. Conclusions

In this paper, some investigations are made into the relationship among word alignment, phrase table, and machine translation. After deducing a formula for estimating the size of the phrase table, a maximal phrase pruning ratio can be calculated based on the average sentence lengths of the given corpus and the word alignment positions. Finally, translation performance based on different word alignment types is compared.

Equations (13) and (17) provide a simple way to predict the size of the phrase table in advance from the generated word alignment points. This is useful for evaluating how much hardware is required to train the model when the parallel corpus is huge.

Experimental results show that the effect on translation performance is significant if the AER decreases or increases too much. There is no direct proportional relationship between translation performance and the number of extracted phrases. In other words, most of the phrases are not needed at all; the phrases extracted from the full word alignment alone are enough. It can be seen from our corpus-motivated pruning method that only 15–30% of the phrases are needed for the basic translations, and most phrases in the phrase table are noise to the final translation.

Under the existing word-alignment-based phrase extraction algorithm, better word alignment is very helpful to the final translation. There are at least three advantages to a better-quality word alignment, even without improving the existing phrase extraction algorithm, as follows.

(i) Better alignment will extract fewer phrase pairs and keep the phrase table at a manageable size.


[Figure 7: Number of phrases extracted from the different word alignment points (A_lmost, A_sure, A_mid, A_full, A_rmost) and the corresponding BLEU scores, for fr-en (a, d), es-en (b, e), and de-en (c, f). The phrase counts range from about 9.1 × 10^6 (A_full, fr-en) up to about 1.6 × 10^9 (A_mid, de-en); the BLEU values are those of Table 12.]

(ii) Better alignment will reduce the decoding time when searching for the most probable translation in the phrase table.

(iii) Better alignment will produce better word- or phrase-level translations.

The next work we want to focus on is integrating the corpus-motivated pruning technique into the translation models; that is, phrases will be filtered during training rather than after the phrase table is generated.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank all reviewers for their very careful reading and helpful suggestions. The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for funding their research under Reference nos. MYRG076 (Y1-L2)-FST13-WF and MYRG070 (Y1-L2)-FST12-CS.

References

[1] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol. 1, pp. 48–54, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[2] R. Zens, F. J. Och, and H. Ney, Phrase-Based Statistical Machine Translation, Springer, Berlin, Germany, 2002.

[3] J. H. Shin, Y. S. Han, and K. S. Choi, "Bilingual knowledge acquisition from Korean-English parallel corpus using alignment method: Korean-English alignment at word and phrase level," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 230–235, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.

[4] K. Yamamoto, T. Kudo, Y. Tsuboi, and Y. Matsumoto, "Learning sequence-to-sequence correspondences from parallel corpora via sequential pattern mining," in Proceedings of the HLT-NAACL Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, vol. 3, pp. 73–80, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[5] C. Goutte, K. Yamada, and E. Gaussier, "Aligning words using matrix factorisation," in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2004.

[6] F. J. Och, C. Tillmann, H. Ney, and L. F. Informatik, Improved Alignment Models for Statistical Machine Translation, University of Maryland, College Park, Md, USA, 1999.

[7] F. J. Och and H. Ney, "A comparison of alignment models for statistical machine translation," in Proceedings of the 18th Conference on Computational Linguistics, vol. 2, pp. 1086–1090, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2000.

[8] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.

[9] F. J. Och and H. Ney, "The alignment template approach to statistical machine translation," Computational Linguistics, vol. 30, no. 4, pp. 417–449, 2004.

[10] N. Duan, M. Li, and M. Zhou, "Improving phrase extraction via MBR phrase scoring and pruning," in Proceedings of the MT Summit XIII, pp. 189–197, 2011.

[11] G. Neubig, T. Watanabe, E. Sumita, S. Mori, and T. Kawahara, "An unsupervised model for joint phrase alignment and extraction," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 632–641, Stroudsburg, Pa, USA, June 2011.

[12] H. Setiawan, H. Li, and M. Zhang, "Learning phrase translation using level of detail approach," in The Tenth Machine Translation Summit (MT-Summit X), 2005.

[13] C. Tillmann, "A projection extension algorithm for statistical machine translation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[14] B. Zhao and S. Vogel, "A generalized alignment-free phrase extraction," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 141–144, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.

[15] Y. Zhang and S. Vogel, "Competitive grouping in integrated phrase segmentation and alignment model," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 159–162, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.

[16] A. Lopez, "Statistical machine translation," ACM Computing Surveys, vol. 40, no. 3, article 8, 2008.

[17] K. Ganchev, J. V. Graça, and B. Taskar, "Better alignments = better translations?" in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 986–993, June 2008.

[18] D. Vilar, M. Popovic, and H. Ney, "AER: do we need to 'improve' our alignments?" in Proceedings of the International Workshop on Spoken Language Translation, pp. 205–212, 2006.

[19] A. Fraser and D. Marcu, "Measuring word alignment quality for statistical machine translation," Computational Linguistics, vol. 33, no. 3, pp. 293–303, 2007.

[20] P. Lambert, S. Petitrenaud, Y. Ma, and A. Way, "What types of word alignment improve statistical machine translation?" Machine Translation, vol. 26, no. 4, pp. 289–323, 2012.

[21] J. H. Johnson, J. Martin, G. Foster, and R. Kuhn, "Improving translation quality by discarding most of the phrasetable," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 967–975, June 2007.

[22] J. H. Johnson, Conditional Significance Pruning: Discarding More of Huge Phrase Tables, 2012.

[23] M. Yang and J. Zheng, "Toward smaller, faster, and better hierarchical phrase-based SMT," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 237–240, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.

[24] N. Tomeh, N. Cancedda, and M. Dymetman, "Complexity-based phrase-table filtering for statistical machine translation," in Proceedings of the MT Summit XII, 2009.

[25] M. Eck, S. Vogel, and A. Waibel, "Translation model pruning via usage statistics for statistical machine translation," in Proceedings of the Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '07), pp. 21–24, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.

[26] M. Eck, S. Vogel, and A. Waibel, "Estimating phrase pair relevance for machine translation pruning," in Proceedings of the MT Summit XI, pp. 159–165, 2007.

[27] Y. Chen, A. Eisele, and M. Kay, "Improving statistical machine translation efficiency by triangulation," in Proceedings of the Sixth International Conference on Language Resources and Evaluation, 2008.

[28] Y. Chen, M. Kay, and A. Eisele, "Intersecting multilingual data for faster and better statistical translations," in Proceedings of the Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '09), pp. 128–136, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2009.

[29] G. Sanchis-Trilles, D. Ortiz-Martínez, J. González-Rubio, J. González, and F. Casacuberta, "Bilingual segmentation for phrasetable pruning in statistical machine translation," in Proceedings of the 15th International Conference of the European Association for Machine Translation (EAMT '11), pp. 257–264, May 2011.

[30] N. Tomeh, M. Turchi, G. Wisinewski, A. Allauzen, and Y. François, "How good are your phrases? Assessing phrase quality with single class classification," in Proceedings of the International Workshop on Spoken Language Translation, pp. 261–268, 2011.

[31] W. Ling, J. Graça, I. Trancoso, and A. Black, "Entropy-based pruning for phrase-based machine translation," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 962–971, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[32] R. Zens, D. Stanton, and P. Xu, "A systematic comparison of phrase table pruning techniques," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 972–983, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[33] K Papineni S Roukos T Ward andW Zhu ldquoBLEU a methodfor automatic evaluation of machine translationrdquo in Proceedingsof the 40th Annual Meeting of the Association for ComputationalLinguistics (ACL rsquo01) pp 311ndash318 2001

[34] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Computational Linguistics, vol. 19, no. 2, pp. 263–311, 1993.

[35] P. Koehn, Statistical Machine Translation, Cambridge University Press, New York, NY, USA, 1st edition, 2010.

[36] E. Galbrun, "Phrase table pruning for statistical machine translation," Department of Computer Science Series of Publications C, Report C-2009-22, University of Helsinki, 2009.

The Scientific World Journal 13

[37] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, USA, 2006.

[38] Z. He, Y. Meng, Y. Lu, H. Yu, and Q. Liu, "Reducing SMT rule table with monolingual key phrase," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 121–124, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.

[39] Z. He, Y. Meng, and H. Yu, "Discarding monotone composed rule for hierarchical phrase-based statistical machine translation," in Proceedings of the 3rd International Universal Communication Symposium (IUCS '09), pp. 25–29, New York, NY, USA, December 2009.

[40] K. T. Frantzi and S. Ananiadou, "Extracting nested collocations," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 41–46, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.

[41] P. Koehn, H. Hoang, A. Birch et al., "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177–180, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.

[42] D. Chiang, "A hierarchical phrase-based model for statistical machine translation," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 263–270, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2005.

[43] W. A. Gale and K. W. Church, "Identifying word correspondences in parallel texts," in Proceedings of the Workshop on Speech and Natural Language, pp. 152–157, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1991.

[44] A. Stolcke, "SRILM: an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing, pp. 901–904, 2002.

[45] M. Federico, N. Bertoldi, and M. Cettolo, "IRSTLM: an open source toolkit for handling large scale language models," in Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH '08), pp. 1618–1621, September 2008.

[46] P. Koehn, "Europarl: a parallel corpus for statistical machine translation," in Proceedings of the MT Summit X, 2005.

[47] S. Green and J. DeNero, "A class-based agreement model for generating accurately inflected translations," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, vol. 1, pp. 146–155, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[48] N. Xue, F. D. Chiou, and M. Palmer, "Building a large-scale annotated Chinese corpus," in Proceedings of the 19th International Conference on Computational Linguistics, vol. 1, pp. 1–8, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2002.

[49] N. Xue, F. Xia, F.-D. Chiou, and M. Palmer, "The Penn Chinese TreeBank: phrase structure annotation of a large corpus," Natural Language Engineering, vol. 11, no. 2, pp. 207–238, 2005.

[50] S. Vogel, H. Ney, and C. Tillmann, "HMM-based word alignment in statistical translation," in Proceedings of the 16th International Conference on Computational Linguistics, pp. 836–841, 1996.

[51] CWMT 2013, http://www.liip.cn/cwmt2013.

[52] L. Tian, F. Wong, and S. Chao, "An improvement of translation quality with adding key-words in parallel corpus," in Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC '10), pp. 1273–1278, Qingdao, China, July 2010.

[53] L. Tian, D. Wong, L. Chao et al., "UM-Corpus: a large English-Chinese parallel corpus for statistical machine translation," in Proceedings of the 9th International Conference on Language Resources and Evaluation, 2014.



was considered during decoding and how often it was used in the best translation. In [27, 28], the researchers adapted a triangulation approach to prune the phrase table, which requires additional bilingual corpora. Authors in [10, 29, 30] modified the phrase extraction methods in order to reduce the phrase table size. In [31, 32], an entropy-based pruning criterion was introduced to prune the generated phrase tables.

Both lines of study tried to reduce spurious results (errors or redundancies) and to investigate the effect on the final translation performance based on different empirical statistics. In other words, many existing methods employ ad hoc heuristics without theoretical proof. In this paper, we instead study the relationship among word alignments, phrase table, and translation quality from a mathematical perspective. The resulting equation indicates that it is necessary to improve the quality of word alignment, even when the alignment does not explicitly connect to the translation quality and lowering the alignment quality may appear to affect the translation only slightly. Two extreme cases, the best and the worst quality of word alignments, are used to illustrate the situation. Experimental results show that the quality of word alignment affects the translation performance considerably, as measured by BLEU [33].

The contributions of this paper mainly focus on two aspects: (1) formulating the relationships among word alignment, phrase table, and machine translation; the equations indicate that the better the quality of the word alignment, the fewer the phrases will be and the better the translation performance, both at the phrase level and at finer granularity levels; (2) corpus-motivated pruning techniques to prune the default large phrase table, which can remove nearly 85% of all the phrase pairs without hurting the performance of the translation.

The rest of the paper is organized as follows. Section 2 reviews the related works and background used in this paper. Section 3 details the relationship among word alignment, phrase table, and machine translation from a mathematical perspective and introduces a corpus-motivated pruning method. Section 4 describes the conducted experiments and presents the obtained results. The paper concludes with a summary and a discussion of the results.

2. Related Works and Background

Our work is based on the dominant method of obtaining a phrase table from word alignments trained with the EM (Expectation Maximization) algorithm. To extract the phrases from the word alignment, the EM algorithm is used to train on the bilingual corpus for several iterations, and then phrase pairs that are consistent with this word alignment are extracted. Most of the current phrase-based SMT systems rely on the GIZA++ implementation of the IBM Models [34] to produce word alignments, running the algorithm in both directions, source to target and target to source. Various heuristics can then be applied to obtain a symmetrized alignment from those two. Most of them, such as grow-diag-final-and [8], start from the intersection of the two word alignments and enrich it with alignment points from the union.

A phrase pair $(\bar{f}, \bar{e})$ is consistent with an alignment $a$ if all words $f_1, \ldots, f_j, \ldots, f_J$ in $\bar{f}$ that have alignment points in $a$ have these with words $e_1, \ldots, e_i, \ldots, e_I$ in $\bar{e}$, and vice versa. A more formal definition of consistency with the word alignment can be given as follows [35]:

$$(\bar{e}, \bar{f}) \text{ consistent with } a \Leftrightarrow
\begin{cases}
\forall e_i \in \bar{e}: (e_i, f_j) \in a \Rightarrow f_j \in \bar{f} \\
\forall f_j \in \bar{f}: (e_i, f_j) \in a \Rightarrow e_i \in \bar{e} \\
\exists e_i \in \bar{e}, f_j \in \bar{f}: (e_i, f_j) \in a
\end{cases} \quad (1)$$

This indicates that a phrase pair $(\bar{f}, \bar{e})$, along with its alignment $a'$, will be extracted if it satisfies the following two conditions [36]:

(i) $\bar{e}$ and $\bar{f}$ are consecutive word subsequences in the target sentence $e$ and source sentence $f$, respectively, and neither of them is longer than $k$ words;

(ii) $a'$, the alignment between the words of $\bar{f}$ and $\bar{e}$ induced by $a$, contains at least one link.

For the relationship between phrase table and machine translation, most researchers discuss the effects of different sizes of phrase table, which concerns phrase table pruning techniques. There are mainly three kinds of pruning methods: statistical-based, significance testing, and entropy-based methods. In more detail, the statistical-based method comprises absolute and relative pruning. The absolute pruning methods rely only on the statistics of a single phrase pair $(\bar{f}, \bar{e})$. They may prune all translations of a source phrase that fall below a threshold according to the count $N(\bar{f}, \bar{e})$ or the probability $p(\bar{e} \mid \bar{f})$ of the phrase pair $(\bar{f}, \bar{e})$, as shown in (2). Hence, they are independent of other phrases in the phrase table:

$$N(\bar{f}, \bar{e}) < \tau_c, \qquad p(\bar{e} \mid \bar{f}) < \tau_p. \quad (2)$$

However, the relative pruning methods consider the full set of target phrases for a specific source phrase $\bar{f}$. Given a pruning threshold $\tau_t$, a phrase pair $(\bar{f}, \bar{e})$ is discarded if

$$p(\bar{e} \mid \bar{f}) < \tau_t \max_{\bar{e}} p(\bar{e} \mid \bar{f}). \quad (3)$$
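A minimal sketch of how the absolute thresholds of (2) and the relative threshold of (3) act on a toy phrase table (the phrases, counts, probabilities, and threshold values below are invented for illustration):

```python
# Toy phrase table: source phrase -> {target phrase: (count, p(e|f))}.
table = {
    "la maison": {"the house": (120, 0.80),
                  "house":     (25, 0.17),
                  "the home":  (4, 0.03)},
}

def prune(table, tau_c=5, tau_p=0.05, tau_t=0.10):
    """Keep a pair only if it survives the absolute thresholds of (2)
    (count >= tau_c, probability >= tau_p) and the relative threshold
    of (3) (probability >= tau_t times the best candidate)."""
    out = {}
    for src, cands in table.items():
        best = max(p for _, p in cands.values())
        kept = {tgt: (n, p) for tgt, (n, p) in cands.items()
                if n >= tau_c and p >= tau_p and p >= tau_t * best}
        if kept:
            out[src] = kept
    return out

print(sorted(prune(table)["la maison"]))  # ['house', 'the house']
```

Here "the home" is dropped both by the count threshold and by the probability thresholds, while the two reliable candidates survive.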

Both the significance testing and the entropy-based methods try to find a threshold $\tau$ to reduce the size of the generated phrase table.

For the entropy-based approach in [32], they used the conditional Kullback-Leibler divergence [37] to measure the pruned model $p'(\bar{e} \mid \bar{f})$, which is to be as similar as possible to the original model $p(\bar{e} \mid \bar{f})$. Then a pruning threshold $\tau_E$ can


Table 1: Contingency table.

           | $\bar{e}$                           | $\bar{e}'$                                           | Total
$\bar{f}$  | $N(\bar{f}, \bar{e})$               | $N(\bar{f}) - N(\bar{f}, \bar{e})$                   | $N(\bar{f})$
$\bar{f}'$ | $N(\bar{e}) - N(\bar{f}, \bar{e})$  | $N - N(\bar{f}) - N(\bar{e}) + N(\bar{f}, \bar{e})$  | $N - N(\bar{f})$
Total      | $N(\bar{e})$                        | $N - N(\bar{e})$                                     | $N$

be chosen, and phrase pairs with a contribution to the relative entropy below that threshold are pruned if

$$p(\bar{e} \mid \bar{f}) \left[ \log p(\bar{e} \mid \bar{f}) - \log p'(\bar{e} \mid \bar{f}) \right] < \tau_E. \quad (4)$$

The significance testing pruning approach depends heavily on a $P$-value. The value is calculated from two-by-two contingency tables $CT(\bar{f}, \bar{e})$, as shown in Table 1, which represent the relationship between $\bar{f}$ and $\bar{e}$, where $N(\bar{f}, \bar{e})$ is defined as the number of parallel sentences that contain one or more occurrences of $\bar{f}$ on the source side and $\bar{e}$ on the target side, $N(\bar{f})$ is the number of parallel sentences that contain one or more occurrences of $\bar{f}$ on the source side, $N(\bar{e})$ is the number of parallel sentences that contain one or more occurrences of $\bar{e}$ on the target side, and $N$ is the number of parallel sentences.

Following Fisher's exact test, the probability of the contingency table can be calculated via the hypergeometric distribution:

$$p_h(k) = \frac{\binom{N(\bar{f})}{k} \binom{N - N(\bar{f})}{N(\bar{e}) - k}}{\binom{N}{N(\bar{e})}}. \quad (5)$$

The $P$-value can then be calculated by summing the probabilities of the tables that have a larger $N(\bar{f}, \bar{e})$:

$$P\text{-value}(CT(\bar{f}, \bar{e})) = \sum_{k=N(\bar{f}, \bar{e})}^{\min(N(\bar{f}), N(\bar{e}))} p_h(k). \quad (6)$$

A phrase pair will be discarded if the $P$-value is greater than $\tau_F$, as shown in (7):

$$P\text{-value}(CT(\bar{f}, \bar{e})) = \sum_{k=N(\bar{f}, \bar{e})}^{\min(N(\bar{f}), N(\bar{e}))} p_h(k) > \tau_F. \quad (7)$$
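Equations (5) and (6) can be computed directly with exact binomial coefficients; a small sketch (function and variable names are ours):

```python
from math import comb

def p_value(n_fe, n_f, n_e, n):
    """P-value of a phrase pair per (5) and (6): sum the hypergeometric
    probabilities of co-occurrence counts k at least as large as the
    observed N(f, e)."""
    denom = comb(n, n_e)
    return sum(comb(n_f, k) * comb(n - n_f, n_e - k)
               for k in range(n_fe, min(n_f, n_e) + 1)) / denom

# A pair co-occurring in all 20 of its sentences is far more
# significant (smaller P-value) than one co-occurring just once.
strong = p_value(n_fe=20, n_f=20, n_e=20, n=10000)
weak = p_value(n_fe=1, n_f=20, n_e=20, n=10000)
print(strong < weak)  # True
```

Per (7), a pair would then be discarded when its $P$-value exceeds $\tau_F$.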

Generally speaking, the goal of the entropy-based method is to remove redundant phrases, whereas the other approaches try to remove low-quality or unreliable phrases.

The knowledge introduced above will be used in our discussion of the relationships among word alignment, phrase table, and machine translation. More details are given in Section 3.

3. Word Alignment & Phrase Table & Machine Translation

In this section, we show from a theoretical aspect that the quality of the underlying word alignments has a strong influence on the performance of both the phrase table and the phrase-based SMT system.

3.1. Word Alignment & Phrase Table. As introduced in the previous section, there are some direct relationships between phrase table and word alignments. In this part, the relationship between the word alignments and the extracted phrases is formulated to estimate the size of the target phrase table. Two different word alignment cases are considered. First, the phrases are extracted from the full word alignment links of parallel sentences, which constitute the best word alignments. Second, phrases based on a single link of word alignment are investigated, which can be treated as the worst alignment information.

3.1.1. Phrase Extraction from Full Word Alignment. Suppose a source sentence $f_1^J = f_1 f_2 \cdots f_J$ will be translated into a target sentence $e_1^I = e_1 e_2 \cdots e_I$, and only one target word is allowed to align to one or multiple source words, where $I$ and $J$ are the lengths of the target and source sentences, respectively, and $i$ and $j$ are the positions of words in the target and source sentences.

If each target word $e_i$ has at least one alignment link to a word in the source text, named full word alignment, the fewest phrase pairs can be extracted.

For the first case, consider the alignment points in Figure 1: each word $e_i$ in the target sentence has at least one alignment point, and there is no crossing word alignment, which means the words are monotonically aligned. Starting with the first word $e_1$ in the target sentence, the phrase pairs of every possible extent can be extracted, for example, $(e_1 \mid f_1)$, $(e_1 e_2 \mid f_1 f_2 f_3)$, $(e_1 e_2 \cdots e_i \mid f_1 f_2 f_3 \cdots f_j)$, $(e_1 e_2 \cdots e_i \cdots e_I \mid f_1 f_2 f_3 \cdots f_j \cdots f_J)$, and so forth, so that a total of $I$ phrase pairs is obtained. Starting with the second word $e_2$ in the target sentence, $I - 1$ phrase pairs can be extracted, for example, $(e_2 \mid f_2 f_3)$, $(e_2 \cdots e_i \mid f_2 f_3 \cdots f_j)$, $(e_2 \cdots e_i \cdots e_I \mid f_2 f_3 \cdots f_j \cdots f_J)$, and so on. Thus, the total number of phrase pairs extracted from the fully aligned sentences can be defined as

$$I + (I - 1) + \cdots + 1 = \frac{I(I + 1)}{2}. \quad (8)$$

Actually, the previous alignment type cannot deal with word alignments that contain cross-alignments, as shown in Figure 2, where the fourth and fifth Chinese words are cross-aligned with words in the source sentence. According to (8), there should be 21 possible phrase pairs. However, only 18 phrases are generated by the conventional phrase extraction algorithm.

The general form of the above alignments is illustrated in Figure 3, where the cross-alignment takes place between the neighboring words $e_{i_{\mathrm{begin}}}$ and $e_{i_{\mathrm{end}}}$, that is, $i_{\mathrm{end}} = i_{\mathrm{begin}} + 1$. Any phrase pair that falls between the bounds of the cross-aligned words cannot be extracted in line with (8). That is, $(i_{\mathrm{begin}} - 1) + (I - i_{\mathrm{end}})$ phrases should be excluded:

$$N_{\mathrm{ca}} = (i_{\mathrm{begin}} - 1) + (I - i_{\mathrm{end}}). \quad (9)$$


Figure 1: Full word alignments.

Figure 2: An example of nonmonotonic alignments, where alignment links are crossing between parallel sentences.

So, if we denote the reduced phrases by $N_{\mathrm{ca}}$ (number of cross-alignments), a more practical equation can be written as

$$N_{\min} = \frac{I(I + 1)}{2} - N_{\mathrm{ca}}. \quad (10)$$

If there is more than one cross-alignment, as illustrated in Figure 3, the total number of extracted phrases can be generalized as

$$N_{\min} = \frac{I(I + 1)}{2} - \sum_{k=1}^{m} \left[ (i_{\mathrm{begin}}^{k} - 1) + (I - i_{\mathrm{end}}^{k}) \right]. \quad (11)$$

The reduced phrase pairs are determined by the cross-alignment points $i_{\mathrm{begin}}$ and $i_{\mathrm{end}}$: $i_{\mathrm{begin}}^{k}$ represents the start position and $i_{\mathrm{end}}^{k}$ the last position of the $k$th pair of words (bounded by the crossing links), and $m$ is the total number of cross-alignments.

Figure 3 illustrates the situation in which there is no word between the crossing dependencies. However, in reality there may be words in between the crossed words, as illustrated in Figure 4.

For the third case, the words between $e_{i_{\mathrm{begin}}}$ and $e_{i_{\mathrm{end}}}$ may have a similar alignment situation to that illustrated in (9). Thus, the following equation can be derived to calculate $N_r$, the exact number of phrase pairs that cannot be generated:

$$N_r = \sum_{k=1}^{m} \sum_{\alpha=0,\,\beta=0}^{i_{\mathrm{end}}^{k} - i_{\mathrm{begin}}^{k} - 1} \left[ (i_{\mathrm{begin}}^{k} + \alpha - 1) + (I - i_{\mathrm{end}}^{k} + \beta) \right], \quad (12)$$

where $\alpha$ indicates the shifted word position after the first crossed word $i_{\mathrm{begin}}^{k}$, and $\beta$ indicates the word position before the end crossed word $i_{\mathrm{end}}^{k}$; $\alpha$ and $\beta$ will not be greater than the number of words between the first and the end crossed word.

Finally, the number of phrase pairs that can be extracted from the full word alignments, measured on the target side, can be described by

$$N_{\min\text{-}t} = \frac{I(I + 1)}{2} - N_r. \quad (13)$$

Figure 3: Alignments with a cross link of neighboring words.

Figure 4: Alignments with a cross link involving distant words.

The equation shows that the size of the phrase table is decided by the target length of the given sentence if the word alignment is the full word alignment, that is, the best and ideal quality word alignment. It has been proven that the bidirectional training approach for word alignment can improve the quality of word alignment [8]. If the alignment is deduced from the other direction (source side), the extracted phrases from the source side can similarly be represented by

$$N_{\min\text{-}s} = \frac{J(J + 1)}{2} - N_r. \quad (14)$$

Equations (13) and (14) show that the number of phrases is heavily affected by the lengths of the source and target sentences. Given the best word alignment, if the length of the source sentence equals that of the target side ($I = J$), the number of phrases will be the minimal value ($N_{\min\text{-}s}$ or $N_{\min\text{-}t}$). If not equal, the number of extracted phrases will be larger than this minimal value after the intersection and union operations during the alignment process.

3.1.2. Phrase Extraction from the Worst Alignment. Suppose only one word $f_j$ in the source sentence is aligned to a target word $e_i$, as shown in Figure 5. Note that the unaligned words near the word $f_j$ lead to multiple possibilities: each preceding and following word next to it can form a part of the phrase, and so can the neighboring words of $e_i$ in the target sentence. In other words, if we know the different extraction situations in each source and target sentence, then the total count of phrase pairs can be derived as the product of the two numbers. Algorithm 1 describes the different phrase extraction scenarios according to different positions in the source sentence.

Obviously, there are in total $i$ words with which a phrase can start ($e_1, e_2, e_3, \ldots, e_i$). For a target sentence consisting of $I$ (odd or even) words, the number of phrase pairs $f(i)$ for the different positions is

$$f(i) = i(I - i + 1) = i(I - i) + i. \quad (15)$$

The number of phrases $g(j)$ on the source side can be calculated in a similar way:

$$g(j) = j(J - j + 1) = j(J - j) + j. \quad (16)$$


Figure 5: Sentences with a single alignment link.

Based on (15) and (16), the total number of phrase pairs $N_{\max}$ is given by

$$N_{\max} = [i(I - i) + i] \cdot [j(J - j) + j], \quad (17)$$

where $0 \le i \le I - 1$ and $0 \le j \le J - 1$.

Obviously, (15) and (16) follow a symmetric distribution, and the maximal value is achieved at the middle position. Given a sentence consisting of $L$ words, a formal description of the middle position in the sentence is

$$\mathrm{word}_{\mathrm{mid}} = \frac{1 + L_{\mathrm{odd}}}{2}, \qquad \frac{L_{\mathrm{even}}}{2} \ \text{or} \ \left( \frac{L_{\mathrm{even}}}{2} + 1 \right). \quad (18)$$

We can see that (17) attains its maximal value at the middle position of a given sentence; that is, the nearer the alignment point is to the middle word, the more phrase pairs can be produced. Take Figure 2 as an example: there are 18 phrases based on the full word alignments, while there will be 240 phrases when only the fifth word (do) in the source sentence aligns to the third word (做 zuo) in the target side. From the viewpoint of the size of the phrase table, the value calculated from (17) is much larger than that from (13). This proves that the better the quality of the word alignment, the fewer phrases there will be.
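The count of 240 can be reproduced by enumerating the spans that contain the single link, assuming the Figure 2 example has J = 8 source tokens and I = 6 target tokens with punctuation counted (our reading of the figure):

```python
def n_max(i, I, j, J):
    """Pairs extractable from the single link e_i <-> f_j: any target
    span covering position i times any source span covering j."""
    spans_t = sum(1 for s in range(1, i + 1) for t in range(i, I + 1))
    spans_s = sum(1 for s in range(1, j + 1) for t in range(j, J + 1))
    assert spans_t == i * (I - i) + i      # eq. (15)
    assert spans_s == j * (J - j) + j      # eq. (16)
    return spans_t * spans_s               # eq. (17)

print(n_max(i=3, I=6, j=5, J=8))  # 240
```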

Note that some shared phrases are extracted both from the full word alignment and from only a single word alignment, which explains why some translation ability remains even with only one alignment point under the existing phrase extraction algorithm. For instance, the phrase pairs $(e_1 e_2 \cdots e_i \mid f_1 f_2 f_3 \cdots f_j)$, $(e_2 \cdots e_i \mid f_2 f_3 \cdots f_j)$, and $(e_i \mid f_j)$ are the shared ones; when given a phrase such as $f_1 f_2 f_3 \cdots f_j$, the phrase tables generated from both alignment types can possibly translate it correctly to $e_1 e_2 \cdots e_i$.

3.2. Phrase Table & Machine Translation. Nowadays, the key step of the process of phrase-based machine translation (SMT) involves inferring, from a large corpus of aligned sentences, a large table of phrase pairs that are translations of each other. The set of all phrase pairs, together with estimates of conditional probabilities and other useful features, is called the Phrase Table (PT), which is applied during the decoding process [29].

Nevertheless, the phrases in the PT are very dependent on the system that uses them. Some phrases might be present in the PT but never be used in translations, for example because they are either ranked too low or erroneous translations. In other words, there are too many redundant phrases in the PT. This situation has led many researchers to propose different techniques to reduce the size of the phrase table. This includes

Table 2: Pruned phrase table and translation quality.

Pruning methods    | Languages | Pruning rates (%) | BLEU (Δ)
Count/significance | fr-en     | 64.6~85.9         | −0.1
                   | es-en     | 67.9~88.1         |
                   | de-en     | 62.7~84.1         |
Entropy            | fr-en     | 85~95             | −1.0
                   | es-en     | 85~95             |
                   | de-en     | 85~95             |
Bilingual-segment  | fr-en     | —                 | −1.2
                   | es-en     | 98                |
                   | de-en     | 94                |

the methods introduced in Section 2. These works can be treated as research on the relationship between phrase table and machine translation.

All the works presented in Section 2 show that the same translation results can be achieved from smaller sets of phrase pairs, with a severalfold increase in translation speed and little or no loss in translation accuracy. As reported by [32], count-based and significance-based pruning techniques can result in large savings, between 70% and 90%, while the entropy-based pruning approach can consistently remove between 85% and 95% of the entries of the phrase table. In [29], they even pointed out that the pruning rate can reach 98% using a bilingual segmentation method with finite state transducers. More details about the pruning rates and the related translation qualities (measured by BLEU [33]) are shown in Table 2, where the BLEU column records the maximal decrease (minus) from the 100% size of the phrase table to the maximal pruning rate of the phrase table.

The pruning rates presented above are counted from empirical or experimental results. We can also estimate the pruning rate from a theoretical aspect. In practice, we notice that the phrases extracted from the middle word alignment point include most of the phrases that are extracted from the full word alignment. All the phrases excluding the ones extracted from the full word alignment can be removed. In other words, a rough pruning rate can be calculated from the ratio of the phrases extracted from the best alignment ($N_{\min}$) and the worst alignment ($N_{\max}$):

$$R_{\mathrm{pruned}} = 1 - \frac{N_{\min\text{-}t}}{N_{\max}}. \quad (19)$$

If we want to quickly estimate what percentage of phrases can be pruned in a given corpus, the average statistics of the given corpus can be used, as in (20). This equation shows that the pruning rate is not only decided by the sentence lengths in a parallel corpus but also affected by the alignment position:

$$R_{\mathrm{pruned}} = 1 - \frac{I_{\mathrm{average}}(I_{\mathrm{average}} + 1)/2}{[i(I_{\mathrm{average}} - i) + i] \cdot [j(J_{\mathrm{average}} - j) + j]}. \quad (20)$$

In reality, the numerator tends to be bigger and the denominator smaller, so the pruning rate will be smaller


Start from $e_1$: $N_{\mathrm{phrases}} = I - i + 1$.
Start from $e_2$: $N_{\mathrm{phrases}} = I - i + 1$.
Similarly, for every start word up to $e_i$ on the target side: $N_{\mathrm{phrases}} = I - i + 1$.

Algorithm 1: Phrase extraction based on the only one alignment point $e_i \leftrightarrow f_j$.

Table 3: Training data statistics, WMT 2006.

Language pairs | Sentences | Ave. source len. | Ave. target len.
fr-en          | 688,031   | 22.67            | 20.07
es-en          | 730,740   | 21.52            | 20.83
de-en          | 751,088   | 20.31            | 21.37

than the calculated value. When the alignment position is the middle position (middle alignment), the maximal pruning rate is obtained, as shown in (21). Based on the equation, we can estimate the pruning rate range for different word alignment positions in the Europarl corpus [46]. Table 3 shows the statistics of the WMT 2006 data, and Table 4 compares the different pruning rates reported in the researchers' papers [22, 32] with the ranges calculated from (20), from which we can see that the calculated result is very close to the experimental results presented in [32]. The authors concluded from their findings that "the novel entropy-based pruning often achieves the same BLEU score with only half the number of phrases." A very bold assumption is that most phrases (at least half) can be removed from the original extracted phrases:

$$R_{\max} = 1 - \frac{I_{\mathrm{average}}(I_{\mathrm{average}} + 1)/2}{[i_{\mathrm{mid}}(I_{\mathrm{average}} - i_{\mathrm{mid}}) + i_{\mathrm{mid}}] \cdot [j_{\mathrm{mid}}(J_{\mathrm{average}} - j_{\mathrm{mid}}) + j_{\mathrm{mid}}]}. \quad (21)$$

We have seen that the phrase table pruning rate is strongly affected by the lengths and word positions in the source and target sentences. As reflected in the pruning equation (20), the phrase table might be more accurately pruned based on some features (e.g., phrase length, phrase cooccurrence) of the bilingual and monolingual corpus. The researchers in [27] reduced the phrase table based on significance testing of phrase pair cooccurrence in a bilingual corpus, while He et al. [38, 39] pruned the phrase pairs in terms of phrase frequency on the source side, using key phrases extracted from a monolingual corpus. We want to make full use of the advantages of the two approaches: the key phrases can be used to check whether the extracted phrases are often used in practice, and the significance testing can be used to measure the quality of the extracted phrase pairs. Our method considers the features of both the source and target phrases.

Table 4: Comparison between theoretical estimation results and other pruning techniques' results.

Language pairs | $j_{\mathrm{mid}}$ | $i_{\mathrm{mid}}$ | $R_{\mathrm{pruned}}$ (%)
fr-en          | 12.34              | 11.04              | 53.5~98.6
es-en          | 11.26              | 11.42              | 49.3~98.4
de-en          | 11.16              | 11.19              | 44.9~98.3
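The ranges in Table 4 can be reproduced from the corpus averages of Table 3 by evaluating (20) at the worst link position (taking the link at the first word, which gives the smallest single-link table) and at the middle positions of (21); a sketch using the paper's reported mid positions (function name is ours):

```python
def pruning_range(I_avg, J_avg, i_mid, j_mid):
    """Estimated removable share per (20): the best-alignment phrase
    count I(I+1)/2 over the single-link count at a given position.
    The link at the first word gives the lower bound; the middle
    position of (21) gives the upper bound."""
    n_min = I_avg * (I_avg + 1) / 2
    low = 1 - n_min / ((1 * (I_avg - 1) + 1) * (1 * (J_avg - 1) + 1))
    high = 1 - n_min / ((i_mid * (I_avg - i_mid) + i_mid)
                        * (j_mid * (J_avg - j_mid) + j_mid))
    return 100 * low, 100 * high

# fr-en averages from Table 3, mid positions from Table 4.
low, high = pruning_range(I_avg=20.07, J_avg=22.67, i_mid=11.04, j_mid=12.34)
print(round(low, 1), round(high, 1))  # 53.5 98.6
```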

The basic metric for phrases that are often used in practice is the frequency with which a phrase appears in a monolingual corpus. The more frequently a phrase appears in a corpus, the greater the possibility that the phrase may be used. However, longer phrase pairs would be removed in this way, because such phrases tend to appear rarely in a monolingual corpus.

The $C$-value is a measurement for automatic term recognition that can be used to solve this problem, as described in [40]. The method not only fully considers frequency information, such as the frequencies of a phrase and of a subphrase appearing in longer phrases, but is also an efficient algorithm.

In this algorithm (as shown in Algorithm 2), 4 factors ($L$, $F$, $S$, $N$) can be used to determine whether a phrase $p$ is a key phrase [38]:

(i) $L(p)$: the length of $p$;
(ii) $F(p)$: the frequency with which $p$ appears in a corpus;
(iii) $S(p)$: the frequency with which $p$ appears as a substring in other longer phrases;
(iv) $N(p)$: the number of phrases that contain $p$ as a substring.

Given a monolingual corpus, key phrases can be extracted efficiently according to Algorithm 2. Firstly (line 1), all possible phrases are extracted as candidates for key phrases (e.g., the phrase table extracted by Moses [41]). Secondly (lines 3 to 7), the C-value of each candidate phrase is computed according to whether the phrase appears by itself (line 4) or as a substring of other longer phrases (line 6). The C-value is in direct proportion to the phrase length (L) and its occurrences (F, S), but in inverse proportion to the number of phrases that contain the phrase as a substring (N). This overcomes the limitations of the plain frequency measurement. A phrase is regarded as a key phrase if its C-value is greater than a threshold ε. Finally (lines 11 to 14), S(q) and N(q) are updated for each substring q.

The algorithm was originally used for pruning the rule table extracted in hierarchical phrase-based SMT [42].

The Scientific World Journal 7

Input: Monolingual Corpus
(1) Extract candidate phrases
(2) for all phrases p in length-descending order do
(3)     if N(p) = 0 then
(4)         C-value = (L(p) − 1) × F(p)
(5)     else
(6)         C-value = (L(p) − 1) × (F(p) − S(p)/N(p))
(7)     end if
(8)     if C-value ≥ ε then
(9)         add p to KP
(10)    end if
(11)    for all substrings q of p do
(12)        S(q) = S(q) + F(p) − S(p)
(13)        N(q) = N(q) + 1
(14)    end for
(15) end for
Output: Key Phrase Table KP

Algorithm 2: Key phrase extraction from a monolingual corpus.
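As a concrete illustration, Algorithm 2 can be implemented in a few lines of Python. This is a sketch under the assumption that candidate phrases are whitespace-tokenized strings whose corpus frequencies F(p) were counted beforehand; the function name and data layout are ours, not the paper's:

```python
from collections import defaultdict

def extract_key_phrases(freq, epsilon):
    """C-value key phrase extraction (Algorithm 2).

    freq: dict mapping a candidate phrase (space-separated words)
          to its corpus frequency F(p).
    Returns the key phrase set KP.
    """
    S = defaultdict(int)   # S(p): frequency of p inside longer phrases
    N = defaultdict(int)   # N(p): number of phrases containing p
    KP = set()
    # process candidates in length-descending order (line 2)
    for p in sorted(freq, key=lambda x: -len(x.split())):
        L = len(p.split())
        if N[p] == 0:                                   # lines 3-4
            c_value = (L - 1) * freq[p]
        else:                                           # lines 5-6
            c_value = (L - 1) * (freq[p] - S[p] / N[p])
        if c_value >= epsilon:                          # lines 8-10
            KP.add(p)
        # update S(q) and N(q) for every proper sub-phrase q (lines 11-14)
        words, s_p = p.split(), S[p]
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                q = " ".join(words[i:j])
                if q != p:
                    S[q] += freq[p] - s_p
                    N[q] += 1
    return KP
```

For example, with `freq = {"a b c": 5, "a b": 10, "c": 7}` and ε = 5, the single-word candidate "c" is rejected (its C-value is 0 because L − 1 = 0), while "a b c" and "a b" survive.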

Input: Bilingual Corpus and KP Table from Algorithm 2
(1) for all phrases p′ in KP do
(2)     if −log(P-value(p′)) ≥ τ_F then
(3)         add p′ to P_PT
(4)     end if
(5) end for
Output: Pruned Phrase Table P_PT

Algorithm 3: Algorithm for key phrase pruning.
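Algorithm 3 itself is a simple filter. A Python sketch (our notation; the −log(P-value) significance scores are assumed precomputed, and a pair is retained when its score clears the threshold τ_F, following the selection rule described in Section 4.4):

```python
def prune_phrase_table(kp_table, neg_log_p, tau_f=45.0):
    """Keep a key phrase pair only if its cooccurrence is statistically
    significant, i.e. -log(P-value) >= tau_f (Algorithm 3 sketch).

    kp_table:  set of phrase pairs from Algorithm 2
    neg_log_p: dict mapping each phrase pair to its -log(P-value)
    """
    return {p for p in kp_table if neg_log_p[p] >= tau_f}
```

Only pairs with very small P-values (highly significant cooccurrence, hence large negative logs) survive the filter.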

However, the idea can also be applied perfectly to phrase-based SMT [1, 41]. After this filtering step, a filtered phrase table called KP will be generated.

The phrases in the phrase table KP are all phrases possibly used in practice. However, there is still some noise in the KP table; for example, the source phrase occurring in the source sentences and the target phrase occurring in the target sentences may not be well matched, that is, the phrase translation is not good enough. To overcome this, it is convenient to construct a two-by-two contingency table that tabulates the sentence pairs in which the two types of matches occur and do not occur [21, 22, 43]. The calculation result is also called the P-value, as introduced in Section 2. A bilingual corpus constraint can thus be added to filter the noisy phrases, as presented in Algorithm 3. After this step, a phrase table with a large pruning rate (called P_PT) will be generated.
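The P-value of a phrase pair can be computed with Fisher's exact test on that 2×2 contingency table, as in [21, 43]. A self-contained sketch (our helper names; c_st, c_s, and c_t are the joint, source, and target sentence counts, and n the number of sentence pairs):

```python
from math import exp, lgamma, log

def log_choose(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def neg_log_p_value(c_st, c_s, c_t, n):
    """One-tailed Fisher's exact test on the 2x2 contingency table of a
    phrase pair: probability of observing c_st or more joint occurrences
    given the marginal counts c_s and c_t in n sentence pairs.
    Returns -log(P), computed term by term in log space."""
    p = 0.0
    for k in range(max(c_st, c_s + c_t - n), min(c_s, c_t) + 1):
        # hypergeometric term P(K = k)
        p += exp(log_choose(c_s, k) + log_choose(n - c_s, c_t - k)
                 - log_choose(n, c_t))
    return -log(p)
```

For instance, a pair seen jointly in 2 of 10 sentence pairs, with each side occurring exactly twice, gets P = 1/45 and hence −log P = log 45 ≈ 3.8.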

3.3. Word Alignment & Machine Translation. Based on the discussion in Sections 3.1 and 3.2, we can find that the size of the phrase table is heavily affected by the quality of word alignment: the better the word alignment, the smaller the phrase table. Besides this, machine translation performance is not directly decided by the size of the phrase table; at least half of the phrases can be removed without hurting the performance of machine translation. Experiments show that the quality of phrase translation plays a more important role in the final translation quality in practice. Therefore, it is necessary to improve the quality of word alignment under the existing phrase extraction algorithm from word alignment.

Now it is a little easier to talk about the relationship between word alignment and machine translation. The translation of a sentence is directly derived from the phrase table, while the phrase table is extracted from the word alignment. In other words, the quality of word alignment affects the performance of machine translation through the phrase table; that is to say, the word alignment quality should affect the translation performance considerably. However, the works listed in [8, 18, 19] show that improved word alignment quality does not always have a significant effect on the translation result.

We do not agree with this conclusion. Firstly, the training or testing corpora used in their experiments are limited in size, or the improvement in word alignment quality is not significant (an AER decrease between 0.1 and 0.2). Secondly, we think it is the existing phrase extraction algorithm that leads to the inconclusive results. We pointed out at the end of Section 3.1 that even when the worst alignments are used, the extracted phrase table still has some ability to translate certain input phrases, because of the shared phrases extracted from the single alignment link. If even one word alignment link can translate some input phrases or sentences, more word alignment links certainly can.

Figure 6: Moses' main modules and models: a bilingual corpus feeds GIZA++ (the aligner) and phrase extraction to produce the phrase table; a monolingual corpus feeds SRILM/IRSTLM (LM toolkits) to produce the language model; the Moses decoder combines both to translate source text into target text.

We will discuss the situations with different word alignments in the next section, where the machine translation performance obtained from the best and the worst word alignments will be fully surveyed. After that, we can see that there are at least three advantages to improving the quality of word alignment, as follows.

(i) Better alignment will extract fewer phrase pairs and keep the phrase table at a manageable size.

(ii) Better alignment will reduce the decoding time when searching for the most probable translation in the phrase table.

(iii) Better alignment will produce better quality word- or phrase-level translations.

4. Experiments and Discussions

In order to evaluate the conclusions drawn in the previous section, some experiments are carried out using the Moses toolkit [41], which provides a complete package of the modules required for training and generating translations of texts.

4.1. Experiment Settings. Like other SMT approaches, the Moses toolkit requires two models (as shown in Figure 6): the Language Model (LM) and the Translation Model (TM). In Moses, the LM is usually created by the external toolkits SRILM [44] or IRSTLM [45]; in our experiment, the IRSTLM toolkit is used for the large training corpus. Before training, the tokenization for the languages in the European corpus [46] uses the integrated scripts in Moses, and the segmentation for Chinese adopts the Stanford Chinese segmenter [47] with the CTB (Chinese Tree Bank) standard [48, 49]. The language model is trained as a five-gram model.

The phrases are extracted from the alignment results generated by GIZA++ [8]. The word alignment was trained with ten iterations of IBM Model 1 and Model 4 [34] and six iterations of the HMM alignment model [50].
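For reference, the consistency criterion used when extracting phrase pairs from a word alignment [16] can be sketched as follows. This is a minimal illustration (it omits Moses' additional extension over unaligned boundary words, and the function name and span representation are ours):

```python
def extract_phrase_spans(alignment, src_len, max_len=30):
    """Enumerate phrase-pair spans (i1, i2, j1, j2) consistent with the
    word alignment: every link touching the source span must land inside
    the target span, and vice versa."""
    spans = set()
    for i1 in range(src_len):
        for i2 in range(i1, min(i1 + max_len, src_len)):
            # target positions linked from the source span [i1, i2]
            tgt = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt:
                continue
            j1, j2 = min(tgt), max(tgt)
            if j2 - j1 + 1 > max_len:
                continue
            # consistency: no link may connect the target span [j1, j2]
            # to a source word outside [i1, i2]
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                spans.add((i1, i2, j1, j2))
    return spans
```

With a single alignment link, the worst case discussed in Section 3, every source span covering that link still yields a phrase pair, which is why even a one-link alignment retains some ability to translate.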

To extract all the possible phrases to verify (13) and (17), the maximal phrase length is limited to 30. This can extract most phrases in terms of word alignments and lets us focus on the relations among word alignment, phrase table, and machine translation.

Table 5: Statistics of the first 100,000 WMT 2006 corpus sentences.

Languages   Ave. source len   Ave. en len
fr-en       22.89             20.15
es-en       21.71             20.94
de-en       20.55             21.42

Table 6: Statistics of test data (3,064 sentences).

Languages   Ave. source len   Ave. en len
fr-en       32.95             27.81
es-en       29.94             27.81
de-en       26.92             27.81

Table 7: Statistics of CWMT 2013 + UM-Corpus.

Languages     Tokens        Average length   Vocabularies
English       152,161,233   19.37            1,655,080
CharacterCE   229,110,265   29.16            397,442
CTBCE         123,917,395   15.78            1,331,505

In order to better reveal the relationships of alignment and phrase table to translation quality, the leftmost, middle, and rightmost word alignments are extracted from the original full alignment for the worst-alignment situation (i.e., only one alignment link). Here, the leftmost, middle, and rightmost word alignments indicate the word alignment link attached to the first, middle, and last word on the source language side.
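The single-link degradation just described can be sketched as follows (an illustrative helper mirroring A_lmost, A_mid, and A_rmost, not the paper's code; links are (source index, target index) pairs):

```python
def single_link_alignment(links, keep="mid"):
    """Degrade a full word alignment to a single link: the leftmost,
    middle, or rightmost alignment point by source-side position."""
    ordered = sorted(links)  # sort by (source index, target index)
    index = {"lmost": 0, "mid": len(ordered) // 2, "rmost": -1}[keep]
    return {ordered[index]}
```

Phrase tables are then extracted from these one-link alignments exactly as from the full alignment.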

4.2. Corpus Information. Three language pairs from the WMT 2006 shared translation task are chosen as the training data: French-English (fr-en), German-English (de-en), and Spanish-English (es-en). These three language pairs are used to carry out the experiments for proving the equations deduced in Section 3 and the relationship between word alignment and machine translation. Table 3 shows the statistics of the original corpus [46]. However, the number of phrases extracted from the only one link is too large to be handled by our server, so the first 100,000 European corpus sentences (statistics shown in Table 5) are chosen to prove the validity of the equations between word alignment and phrase table. Table 6 presents the statistics of the 3,064 test sentences from WMT 2006, which share the same English side across language pairs.

To test our algorithm for pruning the phrase table, a large Chinese-English parallel corpus is brought in. The large corpus consists of two parts: 3,289,497 sentences from CWMT 2013 [51] and 4,157,556 sentences from UM-Corpus [52, 53]. After removing repeated and mismatched sentences from the combined parts, 7,445,190 sentences remain; the statistics of the combined parallel corpus are presented in Table 7. Note that the statistics of the Chinese sentences are counted both at the character level (each Chinese character is treated as one token; CharacterCE in Table 7) and at the word level (segmented by the Stanford CTB segmenter; CTBCE in Table 7).

Table 8: Statistics of English & Chinese corpora.

Languages   Tokens          Average length   Vocabularies
English     804,584,844     24.64            2,357,374
Chinese     3,007,196,322   33.43            1,533,343

Table 9: Statistics for EC-Corpus testing data.

Languages     Tokens    Average length
English       118,802   23.76
CharacterCE   153,705   30.74
CTBCE         118,499   23.70

The UM-Corpus also contains English, Chinese, and Portuguese monolingual corpora; in this experiment, the English and Chinese corpora are used. There are 35,938,697 English sentences from the news domain and more than 70 million Chinese sentences from China portal webpages. The statistics of the two corpora are shown in Table 8.

The 5,000 sentences of test data (named UM-Testing) are designed according to the domains in the UM-Corpus: 1,500 sentences for news, 500 for laws, 500 for novels, 600 for theses, 700 for education, 600 for science, and 600 for subtitles. The statistics of the testing data are shown in Table 9.

4.3. Equation of Word Alignments versus Phrases. The calculation of the number of phrases requires the full word alignment links. Full word alignment, in other words, is the best quality of word alignment, and it is obviously very time-consuming to manually edit the word alignments of a large number of sentences. It has already been observed that the generative models used for statistical word alignment create alignments of increasing quality as they are exposed to more data [19]. The intuition behind this is simple: as more cooccurrences of source and target words are observed, the word alignments become better. Following this intuition, if we wish to increase the quality of a word alignment, we give the alignment process access to extra data, which is used only during the alignment process and then removed. If we wish to decrease the quality of a word alignment, we divide the parallel corpus into pieces, align the pieces independently of one another, and finally concatenate the results. In our experiments, we first use the entire WMT 2006 corpus as training data, and then the first 100,000 alignments are chosen as the full word alignments.

Also, the leftmost, middle, and rightmost word alignment points based on the full word alignment are obtained from the results generated by GIZA++.

The minimal and maximal sizes of the phrase tables extracted from the full and the middle word alignments, together with the differential rates, are shown in Table 10. The differential rate is calculated by

R_diff-min = |N_min − N_moses-min| / N_min, (22)

R_diff-max = |N_max − N_moses-max| / N_max, (23)

where N_moses-min and N_moses-max indicate the minimal and maximal numbers of phrases extracted by Moses.

From Table 10, we can see that the actual number of phrases is close to the calculated result. Given the small differential rates, (13) and (17) are applicable to practical experiments. The difference may come from two sources: (1) the full word alignment produced by GIZA++ may contain too much noise; (2) equations (13) and (17) do not consider repetitions of the extracted phrases.

4.4. Phrase Table versus Machine Translation. We have shown that there is no direct relationship between the size of the phrase table and machine translation performance. As reflected in (17) and Table 4, at least half of the phrases can be removed without hurting the performance of machine translation too much. In this part, we want to use the large Chinese-English corpus to test our pruning algorithm based on the monolingual and bilingual corpora.

According to our experience, different thresholds generate phrase tables of different sizes and result in different translation performance. Therefore, one of the key steps is to choose the pruning thresholds in Algorithms 2 and 3. Firstly, a baseline threshold is calculated from the logarithm of the number of sentences in the given corpus (N), that is, log(N). For the C-value threshold, the baseline is about 25 (log(3.6 × 10^7)). For the P-value threshold, the negative logarithm is used: because the probabilities involved are incredibly tiny, we work instead with the negative natural logs of the probabilities. For example, instead of selecting phrase pairs with a P-value less than exp(−15), we select phrase pairs with a negative-log-P-value greater than 15. The baseline threshold for τ_F is 45 (log(7.4 × 10^6)). Secondly, the thresholds are varied above and below the baseline. The effect of the different thresholds on the phrase table and BLEU scores is shown in Table 11.
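The reason for working in negative-log space is numeric: the P-values involved underflow double precision long before they reach interesting significance levels. A quick illustration (the score value below is hypothetical):

```python
import math

# P-values this small cannot be represented as floats at all,
# so thresholding on the raw P-value would treat them all as 0.0:
assert math.exp(-800) == 0.0

# Instead, "P-value < exp(-tau)" is tested as "-log(P-value) > tau",
# on scores computed directly in log space.
neg_log_p = 17.2   # hypothetical -log(P-value) for one phrase pair
tau = 15.0
keep = neg_log_p > tau   # equivalent to P < exp(-15)
```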

From Table 11, we can see that nearly 98% of the phrases can be discarded using the monolingual and bilingual corpora, while the translation quality does not degrade too much (a decrease of only 2.35 BLEU points). The experiments show that reducing the phrase table by about 85% can generate the best translation quality, which not only confirms similar results by other researchers [3, 21, 32, 38] but also indicates that most phrases are not needed at all. The best translation can be achieved with about 15% to 30% of the original phrase table. That is to say, only a small fraction of the phrases are essential, while the majority of the extracted phrases are spurious; in some sense they are redundant and should be discarded.


Table 10: Phrase counts: comparison between Moses extraction and calculation results (phrase length K = 30).

Languages   N_moses-min   N_min        R_diff-min   N_moses-max     N_max           R_diff-max
fr-en       9,136,283     9,080,307    0.006        1,541,902,790   1,544,021,509   0.001
es-en       10,958,551    10,895,266   0.006        1,401,977,185   1,489,422,170   0.058
de-en       9,219,921     8,929,095    0.033        1,606,176,665   1,686,043,482   0.047
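As a sanity check, the differential rates in Table 10 follow directly from (22) and (23); a small script (illustrative, not from the paper's code):

```python
def diff_rate(n_calc, n_moses):
    """Differential rate from equations (22)/(23): |N - N_moses| / N,
    where N is the calculated count and N_moses the Moses count."""
    return abs(n_calc - n_moses) / n_calc

# fr-en row of Table 10
r_min = diff_rate(9080307, 9136283)        # rounds to 0.006
r_max = diff_rate(1544021509, 1541902790)  # rounds to 0.001
```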

Table 11: C-value and P-value threshold effects on the phrase table size and BLEU scores (UM-Corpus + CWMT 2013).

C-value ε   −log(P-value) τ_F   Phrase table (%)   BLEU (%)
0           0                   100                28.13
10          20                  83.6               28.09
25          45                  52.5               27.98
100         60                  29.7               28.11
200         100                 14.3               28.27
400         200                 2.02               25.78

Table 12: Translation performance (BLEU) on different word alignment types (3,064 test sentences).

Languages   A_full   A_lmost   A_sure   A_mid   A_rmost
fr-en       24.80    4.48      22.26    9.96    2.48
es-en       30.76    3.12      28.24    11.28   2.32
de-en       21.66    2.30      19.28    9.76    2.94

4.5. Word Alignment versus Translation Quality. In order to investigate the effect of different word alignment types and phrase table sizes on translation performance, we selected five different types of word alignment to generate the phrase table and then measured the translation quality by the BLEU metric: the full word alignment points (A_full), the leftmost word alignment (A_lmost), the sure alignment points (A_sure), the middle word alignment point (A_mid), and the rightmost word alignment point (A_rmost).

Clearly, A_full is the best quality of word alignment, followed by A_sure, which contains only the 1-to-1 alignments. A_lmost, A_mid, and A_rmost are much worse, as each consists of only one alignment link. As mentioned previously, all these alignment points are obtained from the results generated by GIZA++. The experiment is carried out on the WMT 2006 corpus. Based on the different alignment types, the number of phrases extracted and the corresponding translation BLEU scores are presented in Figure 7 and Table 12, respectively.

The histograms show that there is no direct proportional relationship between translation performance and the number of extracted phrases. The numbers in Figure 7 show that the German-English language pair generates the most phrases, while its translation quality is not the best; conversely, although the fewest phrases are produced from the Spanish-English language pair, the best translation performance is obtained. As explained by (17), more phrases usually indicate worse word alignment quality, or, put differently, too many unaligned words in the produced alignment results. The results in Figure 7 and Table 12 show that (1) the full word alignment contributes most to the translation performance, and there is not too much decrease when only the sure alignments are provided; (2) the first and last word alignments do not produce high translation quality; (3) although the middle-position word alignment extracts the most phrases, its translation quality is not the best, though the middle word alignment seems to contribute more to the final translation than the leftmost or rightmost ones.

In fact, the translation quality from the one-link word alignments should be even higher; after all, the default full word alignments, trained from a large corpus without manual editing, contain considerable noise.

5. Conclusions

In this paper, some investigations are made into the relationship among word alignment, phrase table, and machine translation. After deducing a formula for estimating the size of the phrase table, a maximal phrase pruning ratio can be calculated based on the average sentence length of the given corpus and the word alignment positions. Finally, translation performance based on different word alignment types is compared.

Equations (13) and (17) provide a simple way to predict the size of the phrase table in advance, based on the generated word alignment points. This is beneficial for evaluating how much hardware resource is required to train the model when the parallel corpus is huge.

Experimental results show that the effect on translation performance is significant if the AER decreases or increases too much. There is no direct proportional relationship between translation performance and the number of extracted phrases. In other words, most of the phrases are not needed at all; the phrases extracted from the full word alignment alone are enough. It can be seen from our corpus-motivated pruning method that only 15-30% of the phrases are needed for the basic translations, while most phrases in the phrase table are noise for the final translation.

Based on the existing phrase extraction algorithm from word alignment, better word alignment will be very helpful to the final translation. There are at least three advantages of better word alignment quality, even without improving the existing phrase extraction algorithm, as follows.

(i) Better alignment will extract fewer phrase pairs and keep the phrase table at a manageable size.

(ii) Better alignment will reduce the decoding time when searching for the most probable translation in the phrase table.

(iii) Better alignment will produce better quality word- or phrase-level translations.

Figure 7: Number of phrases extracted from the different word alignment types (A_lmost, A_sure, A_mid, A_full, A_rmost) and the corresponding BLEU scores for the fr-en, es-en, and de-en language pairs.

Our next work will focus on adding the corpus-motivated pruning technique into the translation model itself; that is, the phrases will be filtered during training, not after the phrase table is generated.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank all reviewers for their very careful reading and helpful suggestions. The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for funding support for their research, under Reference nos. MYRG076 (Y1-L2)-FST13-WF and MYRG070 (Y1-L2)-FST12-CS.

References

[1] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol. 1, pp. 48-54, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[2] R. Zens, F. J. Och, and H. Ney, Phrase-Based Statistical Machine Translation, Springer, Berlin, Germany, 2002.

[3] J. H. Shin, Y. S. Han, and K. S. Choi, "Bilingual knowledge acquisition from Korean-English parallel corpus using alignment method: Korean-English alignment at word and phrase level," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 230-235, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.

[4] K. Yamamoto, T. Kudo, Y. Tsuboi, and Y. Matsumoto, "Learning sequence-to-sequence correspondences from parallel corpora via sequential pattern mining," in Proceedings of the HLT-NAACL Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, vol. 3, pp. 73-80, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[5] C. Goutte, K. Yamada, and E. Gaussier, "Aligning words using matrix factorisation," in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2004.

[6] F. J. Och, C. Tillmann, and H. Ney, Improved Alignment Models for Statistical Machine Translation, University of Maryland, College Park, Md, USA, 1999.

[7] F. J. Och and H. Ney, "A comparison of alignment models for statistical machine translation," in Proceedings of the 18th Conference on Computational Linguistics, vol. 2, pp. 1086-1090, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2000.

[8] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19-51, 2003.

[9] F. J. Och and H. Ney, "The alignment template approach to statistical machine translation," Computational Linguistics, vol. 30, no. 4, pp. 417-449, 2004.

[10] N. Duan, M. Li, and M. Zhou, "Improving phrase extraction via MBR phrase scoring and pruning," in Proceedings of the MT Summit XIII, pp. 189-197, 2011.

[11] G. Neubig, T. Watanabe, E. Sumita, S. Mori, and T. Kawahara, "An unsupervised model for joint phrase alignment and extraction," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 632-641, Stroudsburg, Pa, USA, June 2011.

[12] H. Setiawan, H. Li, and M. Zhang, "Learning phrase translation using level of detail approach," in The Tenth Machine Translation Summit (MT-Summit X), 2005.

[13] C. Tillmann, "A projection extension algorithm for statistical machine translation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[14] B. Zhao and S. Vogel, "A generalized alignment-free phrase extraction," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 141-144, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.

[15] Y. Zhang and S. Vogel, "Competitive grouping in integrated phrase segmentation and alignment model," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 159-162, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.

[16] A. Lopez, "Statistical machine translation," ACM Computing Surveys, vol. 40, no. 3, article 8, 2008.

[17] K. Ganchev, J. V. Graça, and B. Taskar, "Better alignments = better translations?" in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 986-993, June 2008.

[18] D. Vilar, M. Popovic, and H. Ney, "AER: do we need to 'improve' our alignments?" in Proceedings of the International Workshop on Spoken Language Translation, pp. 205-212, 2006.

[19] A. Fraser and D. Marcu, "Measuring word alignment quality for statistical machine translation," Computational Linguistics, vol. 33, no. 3, pp. 293-303, 2007.

[20] P. Lambert, S. Petitrenaud, Y. Ma, and A. Way, "What types of word alignment improve statistical machine translation?" Machine Translation, vol. 26, no. 4, pp. 289-323, 2012.

[21] J. H. Johnson, J. Martin, G. Foster, and R. Kuhn, "Improving translation quality by discarding most of the phrasetable," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 967-975, June 2007.

[22] J. H. Johnson, Conditional Significance Pruning: Discarding More of Huge Phrase Tables, 2012.

[23] M. Yang and J. Zheng, "Toward smaller, faster, and better hierarchical phrase-based SMT," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 237-240, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.

[24] N. Tomeh, N. Cancedda, and M. Dymetman, "Complexity-based phrase-table filtering for statistical machine translation," in Proceedings of the MT Summit XII, 2009.

[25] M. Eck, S. Vogel, and A. Waibel, "Translation model pruning via usage statistics for statistical machine translation," in Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '07), pp. 21-24, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.

[26] M. Eck, S. Vogel, and A. Waibel, "Estimating phrase pair relevance for machine translation pruning," in Proceedings of the MT Summit XI, pp. 159-165, 2007.

[27] Y. Chen, A. Eisele, and M. Kay, "Improving statistical machine translation efficiency by triangulation," in Proceedings of the Sixth International Conference on Language Resources and Evaluation, 2008.

[28] Y. Chen, M. Kay, and A. Eisele, "Intersecting multilingual data for faster and better statistical translations," in Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '09), pp. 128-136, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2009.

[29] G. Sanchis-Trilles, D. Ortiz-Martínez, J. González-Rubio, J. González, and F. Casacuberta, "Bilingual segmentation for phrasetable pruning in statistical machine translation," in Proceedings of the 15th International Conference of the European Association for Machine Translation (EAMT '11), pp. 257-264, May 2011.

[30] N. Tomeh, M. Turchi, G. Wisniewski, A. Allauzen, and F. Yvon, "How good are your phrases? Assessing phrase quality with single class classification," in Proceedings of the International Workshop on Spoken Language Translation, pp. 261-268, 2011.

[31] W. Ling, J. Graça, I. Trancoso, and A. Black, "Entropy-based pruning for phrase-based machine translation," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 962-971, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[32] R. Zens, D. Stanton, and P. Xu, "A systematic comparison of phrase table pruning techniques," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 972-983, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[33] K. Papineni, S. Roukos, T. Ward, and W. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pp. 311-318, 2002.

[34] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Computational Linguistics, vol. 19, no. 2, pp. 263-311, 1993.

[35] P. Koehn, Statistical Machine Translation, Cambridge University Press, New York, NY, USA, 1st edition, 2010.

[36] E Galbrun ldquoPhrase table pruning for statistical machine trans-lationrdquo Department of Computer Science Series of PublicationsC C-2009-22 University of Helsinki 2007

The Scientific World Journal 13

[37] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, USA, 2006.

[38] Z. He, Y. Meng, Y. Lu, H. Yu, and Q. Liu, "Reducing SMT rule table with monolingual key phrase," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 121-124, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.

[39] Z. He, Y. Meng, and H. Yu, "Discarding monotone composed rule for hierarchical phrase-based statistical machine translation," in Proceedings of the 3rd International Universal Communication Symposium (IUCS '09), pp. 25-29, New York, NY, USA, December 2009.

[40] K. T. Frantzi and S. Ananiadou, "Extracting nested collocations," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 41-46, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.

[41] P. Koehn, H. Hoang, A. Birch et al., "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177-180, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.

[42] D. Chiang, "A hierarchical phrase-based model for statistical machine translation," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 263-270, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2005.

[43] W. A. Gale and K. W. Church, "Identifying word correspondences in parallel texts," in Proceedings of the Workshop on Speech and Natural Language, pp. 152-157, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1991.

[44] A. Stolcke, "SRILM: an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing, pp. 901-904, 2002.

[45] M. Federico, N. Bertoldi, and M. Cettolo, "IRSTLM: an open source toolkit for handling large scale language models," in Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH '08), pp. 1618-1621, September 2008.

[46] P. Koehn, "Europarl: a parallel corpus for statistical machine translation," in Proceedings of the MT Summit X, 2005.

[47] S. Green and J. DeNero, "A class-based agreement model for generating accurately inflected translations," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, vol. 1, pp. 146-155, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[48] N. Xue, F. D. Chiou, and M. Palmer, "Building a large-scale annotated Chinese corpus," in Proceedings of the 19th International Conference on Computational Linguistics, vol. 1, pp. 1-8, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2002.

[49] N. Xue, F. Xia, F.-D. Chiou, and M. Palmer, "The Penn Chinese TreeBank: phrase structure annotation of a large corpus," Natural Language Engineering, vol. 11, no. 2, pp. 207-238, 2005.

[50] S. Vogel, H. Ney, and C. Tillmann, "HMM-based word alignment in statistical translation," in Proceedings of the 16th International Conference on Computational Linguistics, pp. 836-841, 1996.

[51] CWMT, 2013, httpwwwliipcncwmt2013.

[52] L. Tian, F. Wong, and S. Chao, "An improvement of translation quality with adding key-words in parallel corpus," in Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC '10), pp. 1273-1278, Qingdao, China, July 2010.

[53] L. Tian, D. Wong, L. Chao et al., "UM-Corpus: a large English-Chinese parallel corpus for statistical machine translation," in Proceedings of the 9th International Conference on Language Resources and Evaluation, 2014.



Table 1: Contingency table.

          e                    e'                               Total
f         N(f, e)              N(f) - N(f, e)                   N(f)
f'        N(e) - N(f, e)       N - N(f) - N(e) + N(f, e)        N - N(f)
Total     N(e)                 N - N(e)                         N

be chosen, and phrase pairs with a contribution to the relative entropy below that threshold are pruned if

p(e | f) [\log p(e | f) - \log p'(e | f)] < \tau_E.  (4)
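The pruning criterion in (4) can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the `phrase_table` mapping each pair (f, e) to its probabilities p(e|f) and p'(e|f) is hypothetical, and p' is assumed to be precomputed under the pruned model (e.g. by composing shorter phrases).

```python
import math

def entropy_prune(phrase_table, tau_e):
    """Keep only phrase pairs whose contribution p(e|f)[log p(e|f) - log p'(e|f)]
    to the relative entropy between the full and pruned models reaches tau_e;
    pairs below the threshold are pruned, as in (4)."""
    kept = {}
    for (f, e), (p, p_prime) in phrase_table.items():
        contribution = p * (math.log(p) - math.log(p_prime))
        if contribution >= tau_e:  # below tau_e -> redundant, pruned
            kept[(f, e)] = (p, p_prime)
    return kept
```

A pair whose probability is nearly unchanged under the pruned model (p close to p') contributes almost nothing to the relative entropy and is discarded.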

The significance testing pruning approach depends heavily on a P-value. The value is calculated from two-by-two contingency tables CT(f, e), as shown in Table 1, which represent the relationship between f and e, where N(f, e) is defined as the number of parallel sentences that contain one or more occurrences of f on the source side and e on the target side, N(f) is the number of parallel sentences that contain one or more occurrences of f on the source side, N(e) is the number of parallel sentences that contain one or more occurrences of e on the target side, and N is the total number of parallel sentences.

Following Fisher's exact test, the probability of the contingency table via the hypergeometric distribution can be calculated:

p_h(k) = \binom{N(f)}{k} \binom{N - N(f)}{N(e) - k} / \binom{N}{N(e)}.  (5)

Then the P-value can be calculated by summing the probabilities for tables that have a larger N(f, e):

P-value(CT(f, e)) = \sum_{k=N(f,e)}^{\min(N(f), N(e))} p_h(k).  (6)

A phrase pair will be discarded if the P-value is greater than \tau_F, as shown in (7):

P-value(CT(f, e)) = \sum_{k=N(f,e)}^{\min(N(f), N(e))} p_h(k) > \tau_F.  (7)
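Equations (5)-(7) can be computed directly with exact binomial coefficients. The sketch below uses `math.comb` from the Python standard library for clarity; a production implementation would work in log space to avoid the enormous intermediate integers.

```python
from math import comb

def p_value(n_fe, n_f, n_e, n):
    """Fisher's exact test P-value for the contingency table CT(f, e):
    sum of the hypergeometric probabilities p_h(k), equation (5), for k from
    the observed co-occurrence count N(f, e) up to min(N(f), N(e))."""
    total = comb(n, n_e)  # denominator of each hypergeometric term
    p = 0.0
    for k in range(n_fe, min(n_f, n_e) + 1):
        p += comb(n_f, k) * comb(n - n_f, n_e - k) / total
    return p
```

For example, a pair that co-occurs in 8 of the 10 sentences containing f in a 1000-sentence corpus gets a vanishingly small P-value and would survive any reasonable \tau_F, while summing from k = 0 recovers the full hypergeometric distribution (P-value 1).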

Generally speaking, the goal of the entropy-based method is to remove redundant phrases, whereas the other approaches try to remove low-quality or unreliable phrases.

The knowledge introduced above will be used in our discussion of the relationships among word alignment, phrase table, and machine translation. More details are given in Section 3.

3. Word Alignment & Phrase Table & Machine Translation

In this section, we show from a theoretical aspect that the quality of the underlying word alignments has a strong influence on the performance of both the phrase table and the phrase-based SMT system.

3.1. Word Alignment & Phrase Table. As introduced in the previous section, there are direct relationships between the phrase table and word alignments. In this part, the relationship between word alignments and extracted phrases is formulated to estimate the size of the resulting phrase table. Two different word alignment cases are considered. Firstly, phrases are extracted from the full word alignment links of parallel sentences, which constitute the best word alignments. Secondly, phrases extracted from a single word alignment link are investigated, which can be treated as the worst alignment information.

3.1.1. Phrases Extraction from Full Word Alignment. Suppose a source sentence f_1^J = f_1 f_2 \cdots f_J will be translated into a target sentence e_1^I = e_1 e_2 \cdots e_I, and only one target word is allowed to align to one or multiple source words, where I and J are the lengths of the target and source sentences, respectively, and i and j are the positions of words in the target and source sentences.

If each target word e_i has at least one alignment link to a word in the source text (named full word alignment), the fewest phrase pairs can be extracted.

For the first case, considering the alignment points in Figure 1, each word e_i in the target sentence has at least one alignment point and there is no crossing word alignment, which means the words are monotonically aligned. Starting with the first word e_1 in the target sentence, phrase pairs of every possible extent can be extracted, for example, (e_1 | f_1), (e_1 e_2 | f_1 f_2 f_3), (e_1 e_2 \cdots e_i | f_1 f_2 f_3 \cdots f_j), (e_1 e_2 \cdots e_i \cdots e_I | f_1 f_2 f_3 \cdots f_j \cdots f_J), and so forth, so that a total of I phrase pairs is obtained. Starting with the second word e_2 in the target sentence, I - 1 phrase pairs can be extracted, for example, (e_2 | f_2 f_3), (e_2 \cdots e_i | f_2 f_3 \cdots f_j), (e_2 \cdots e_i \cdots e_I | f_2 f_3 \cdots f_j \cdots f_J), and so on. Thus, the total number of phrase pairs extracted from fully aligned sentences can be defined as

I + (I - 1) + \cdots + 1 = I(I + 1)/2.  (8)

Actually, the previous alignment type cannot deal with word alignments that contain cross-alignments, as shown in Figure 2, where the fourth and fifth Chinese words are cross-aligned with words in the source sentence. According to (8), there should be 21 possible phrase pairs; however, only 18 phrases are generated by the conventional phrase extraction algorithm.

The general form of the above alignments is illustrated in Figure 3, where the cross-alignment takes place between the neighboring words e_{i_begin} and e_{i_end}, that is, i_end = i_begin + 1. Any phrase pairs that fall between the bounds of the cross-aligned words cannot be extracted in line with (8). That is, (i_begin - 1) + (I - i_end) phrases should be excluded:

N_ca = (i_begin - 1) + (I - i_end).  (9)


Figure 1: Full word alignments.

Figure 2: An example of nonmonotonic alignments ("I am able to do it well"), where alignment links are crossing between parallel sentences.

So, if we define the reduced phrases as N_ca (number of cross-alignments), a more practical equation can be written as

N_min = I(I + 1)/2 - N_ca.  (10)
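The counts in (8)-(10) can be checked by brute force with the standard consistency criterion for phrase extraction (every link touching the target span must fall inside the source span and vice versa). This is a sketch with hypothetical toy 5-word sentence pairs; links are 1-indexed (target, source) tuples.

```python
def extract_phrase_pairs(alignment, I, J):
    """All phrase pairs ((i1, i2), (j1, j2)) consistent with the alignment:
    at least one link inside the box, and no link with exactly one endpoint
    inside (the standard phrase extraction criterion)."""
    pairs = []
    for i1 in range(1, I + 1):
        for i2 in range(i1, I + 1):
            for j1 in range(1, J + 1):
                for j2 in range(j1, J + 1):
                    links_in = [(i, j) for (i, j) in alignment
                                if i1 <= i <= i2 and j1 <= j <= j2]
                    if not links_in:
                        continue
                    if all(i1 <= i <= i2 and j1 <= j <= j2
                           for (i, j) in alignment
                           if i1 <= i <= i2 or j1 <= j <= j2):
                        pairs.append(((i1, i2), (j1, j2)))
    return pairs

# Monotone full alignment of a 5-word pair: I(I+1)/2 = 15 phrase pairs.
monotone = [(k, k) for k in range(1, 6)]
# Crossing the links of e3/e4 (i_begin = 3, i_end = 4) removes
# N_ca = (i_begin - 1) + (I - i_end) = 3 pairs, leaving 12, as in (10).
crossed = [(1, 1), (2, 2), (3, 4), (4, 3), (5, 5)]
```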

If there is more than one cross-alignment, as illustrated in Figure 3, the total number of extracted phrases can be generalized as

N_min = I(I + 1)/2 - \sum_{k=1}^{m} [(i^k_begin - 1) + (I - i^k_end)].  (11)

The reduced phrase pairs are determined by the cross-alignment points i_begin and i_end; i^k_begin represents the start position and i^k_end the last position of the k-th pair of words (bounded by the crossing links), and m is the total number of cross-alignments.

Figure 3 illustrates the situation in which there are no words between the crossing dependencies. In reality, however, there may be words in between the crossed words, as illustrated in Figure 4.

For the third case, the words between e_{i_begin} and e_{i_end} may have a similar alignment situation to that illustrated in (9). Thus, the following equation can be derived to calculate N_r, the exact number of phrase pairs that cannot be generated. Consider

N_r = \sum_{k=1}^{m} \sum_{\alpha=0, \beta=0}^{i^k_end - i^k_begin - 1} [(i^k_begin + \alpha - 1) + (I - i^k_end + \beta)],  (12)

where \alpha indicates the shifted word position after the first cross word i^k_begin and \beta indicates the word position before the end cross word i^k_end; \alpha and \beta will not be greater than the number of words between the first and the end cross words.

Finally, the number of phrase pairs that can be extracted from the full word alignments, measured on the target side, can be described by

N_min-t = I(I + 1)/2 - N_r.  (13)

Figure 3: Alignments with a cross link of neighboring words.

Figure 4: Alignments with a cross link involving distant words.

The equation shows that the size of the phrase table is decided by the target-side length of the given sentence if the word alignment is the full word alignment, that is, the best and ideal quality word alignment. It has been proven that the bidirectional training approach for word alignment can improve the quality of word alignment [8]. If the alignment is deduced from the other direction (source side), the phrases extracted from the source side can similarly be represented by

N_min-s = J(J + 1)/2 - N_r.  (14)

Equations (13) and (14) show that the number of phrases is heavily affected by the lengths of the source and target sentences. Given the best word alignment, if the length of the source sentence equals that of the target side (I = J), the number of phrases will be minimal (N_min-s or N_min-t). If they are not equal, the number of extracted phrases will be larger than this minimal value (N_min-s and N_min-t) after the intersection and union operations during the alignment process.

3.1.2. Phrases Extraction from Worst Alignment. Suppose only one word f_j in the source sentence is aligned to a target word e_i, as shown in Figure 5. Note that the unaligned words near the word f_j lead to multiple possibilities: each preceding and following word next to it can form part of a phrase, and so can the neighboring words of e_i in the target sentence. In other words, if we know the different extraction situations in each source and target sentence, the total count of phrase pairs can be derived as the product of the two numbers. Algorithm 1 describes the different phrase extraction scenarios according to the different positions in the source sentence.

Obviously, there are in total i words that a phrase can start with (e_1, e_2, e_3, ..., e_i). Now, in a target sentence consisting of an odd or even number of words, the number of phrase pairs f(i) for the different positions is

f(i) = i(I - i + 1) = i(I - i) + i.  (15)

The phrases g(j) on the source side can be counted in a similar way:

g(j) = j(J - j + 1) = j(J - j) + j.  (16)
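The count in (15) is just the number of target spans that cover position i, so a one-line enumeration (a sketch with hypothetical positions) confirms the closed form:

```python
def spans_through(i, I):
    """Number of spans [a, b] with 1 <= a <= i <= b <= I, i.e. candidate
    target phrases containing the single aligned word e_i (equation (15))."""
    return sum(1 for a in range(1, i + 1) for b in range(i, I + 1))
```

For i = 3 in a 6-word sentence this gives 3 * (6 - 3) + 3 = 12 spans, matching i(I - i + 1).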


Figure 5: Sentences with a single alignment link.

Based on (15) and (16), the total number of phrase pairs N_max is given by

N_max = [i(I - i) + i][j(J - j) + j],  (17)

where 0 \le i \le I - 1 and 0 \le j \le J - 1.

Obviously, (15) and (16) follow a symmetric distribution, and the maximal value is achieved at the middle position. Given a sentence consisting of L words, a formal description of the middle position in the sentence is

word_mid = (1 + L_odd)/2, or L_even/2 or (L_even/2 + 1).  (18)

We can see that (17) attains its maximal value at the middle position of a given sentence; that is, the nearer the alignment point is to the middle word, the more phrase pairs can be produced. Take Figure 2 as an example: there are 18 phrases based on the full word alignments, while there will be 240 phrases when only the fifth word (do) in the source sentence is aligned to the third word (做, zuo) in the target side. From the viewpoint of phrase table size, the value calculated from (17) is much larger than that from (13). This proves that the better the quality of the word alignment, the fewer phrases there will be.
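Reading the Figure 2 pair as an 8-token English source and a 6-token Chinese target (an assumption inferred from the figure), the worst-case count in (17) reproduces the 240 figure:

```python
def n_max(i, j, I, J):
    """Worst-case phrase pair count (17) for a single alignment link
    e_i <-> f_j in sentences of target length I and source length J."""
    return (i * (I - i) + i) * (j * (J - j) + j)

# Single link: 5th source word "do" <-> 3rd target word, with I = 6, J = 8:
# (3*3 + 3) * (5*3 + 5) = 12 * 20 = 240, versus I(I+1)/2 = 21 for a full
# monotone alignment of the same pair.
```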

Note that some shared phrases are extracted from both the full word alignment and the single word alignment, which explains why some translation ability remains even with only one alignment point under the existing phrase extraction algorithm. For instance, the phrase pairs (e_1 e_2 \cdots e_i | f_1 f_2 f_3 \cdots f_j), (e_2 \cdots e_i | f_2 f_3 \cdots f_j), and (e_i | f_j) are the shared ones: given a phrase such as f_1 f_2 f_3 \cdots f_j, the phrase tables generated from both alignment types can possibly translate it correctly to e_1 e_2 \cdots e_i.

3.2. Phrase Table & Machine Translation. Nowadays, the key step of the phrase-based statistical machine translation (SMT) process involves inferring a large table of phrase pairs that are translations of each other from a large corpus of aligned sentences. The set of all phrase pairs, together with estimates of conditional probabilities and other useful features, is called the Phrase Table (PT), which is applied during the decoding process [29].

Nevertheless, the phrases in the PT are very dependent on the system that uses them. Some phrases might be present in the PT but never be used in translations because, for example, they are either ranked too low or are erroneous translations. In other words, there are too many redundant phrases in the PT. This situation has led many researchers to propose different techniques to reduce the size of the phrase table. This includes

Table 2: Pruned phrase table and translation quality.

Pruning method       Languages   Pruning rate (%)   BLEU (Δ)
Count/significance   fr-en       64.6~85.9          -0.1
                     es-en       67.9~88.1
                     de-en       62.7~84.1
Entropy              fr-en       85~95              -1.0
                     es-en       85~95
                     de-en       85~95
Bilingual-segment    fr-en       —                  -1.2
                     es-en       98
                     de-en       94

the methods introduced in Section 2. These works can be treated as research on the relationship between the phrase table and machine translation.

All the works presented in Section 2 show that the same translation results can be achieved from a smaller set of phrase pairs, with a several-fold increase in translation speed and little or no loss in translation accuracy. As reported in [32], count-based and significance-based pruning techniques can result in large savings, between 70% and 90%, while the entropy-based pruning approach can consistently remove between 85% and 95% of the phrase table entries. In [29], they even pointed out that the pruning rate can reach 98% using a bilingual segmentation method with finite state transducers. More details about the pruning rates and the related translation qualities (measured by BLEU [33]) are shown in Table 2, where the BLEU column records the maximal decrease (minus) from the 100% size of the phrase table to the maximal pruning rate of the phrase table.

The pruning rates presented above are obtained from empirical or experimental results. We can also estimate the pruning rate from a theoretical aspect. In practice, we notice that the phrases extracted from the middle word alignment point include most of the phrases that are extracted from the full word alignment. All the phrases excluding the ones extracted from the full word alignment can be removed. In other words, a rough pruning rate can be calculated from the ratio of phrases extracted from the best alignment (N_min) and the worst alignment (N_max):

R_pruned = 1 - N_min-t / N_max.  (19)

If we want to quickly estimate what percentage of phrases can be pruned in a given corpus, the average statistics of the given corpus can be used, as in (20). This equation shows that the pruning rate is not only decided by the sentence lengths in a parallel corpus but is also affected by the alignment position. Consider

R_pruned = 1 - [I_average(I_average + 1)/2] / ([i(I_average - i) + i][j(J_average - j) + j]).  (20)

In reality, the numerator tends to be bigger and the denominator smaller, so the actual pruning rate will be smaller


Start from e_1: N_phrases = I - i + 1
Start from e_2: N_phrases = I - i + 1
Similarly, for every starting word e_k, k \in (1, ..., i), on the target side: N_phrases = I - i + 1

Algorithm 1: Phrase extraction based on the only alignment point e_i \leftrightarrow f_j.

Table 3: Training data statistics, WMT 2006.

Language pairs   Sentences   Ave. source len   Ave. target len
fr-en            688031      22.67             20.07
es-en            730740      21.52             20.83
de-en            751088      20.31             21.37

than the calculated value. When the alignment position is the middle position (middle alignment), the maximal pruning rate is obtained, as shown in (21). Based on this equation, we can estimate the pruning rate range for different word alignment positions in the Europarl corpus [46]. Table 3 shows the statistics of WMT 2006, and Table 4 compares the pruning rates reported in the researchers' papers [22, 32] with the calculated ranges based on (20), from which we can see that the calculated result is very close to the experimental results presented in [32]. The authors concluded their findings thus: "the novel entropy-based pruning often achieves the same BLEU score with only half the number of phrases". A very bold assumption is that most phrases (at least half) can be removed from the original extracted phrases. Consider

R_max = 1 - [I_average(I_average + 1)/2] / ([i_mid(I_average - i_mid) + i_mid][j_mid(J_average - j_mid) + j_mid]).  (21)
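Equations (20) and (21) can be checked numerically against the fr-en statistics in Tables 3 and 4 (average lengths I_average = 20.07 and J_average = 22.67, middle positions i_mid = 11.04 and j_mid = 12.34 taken from those tables):

```python
def pruning_rate(I_avg, J_avg, i, j):
    """Estimated pruning rate (20): the share of the worst-case table
    [i(I-i)+i][j(J-j)+j] not needed for the best-case table I(I+1)/2."""
    n_min = I_avg * (I_avg + 1) / 2
    n_max = (i * (I_avg - i) + i) * (j * (J_avg - j) + j)
    return 1 - n_min / n_max

# Edge alignment point (i = j = 1) gives the lower bound of the range;
# the middle point gives R_max from (21). For fr-en this reproduces the
# 53.5%~98.6% range reported in Table 4.
low = pruning_rate(20.07, 22.67, 1, 1)
high = pruning_rate(20.07, 22.67, 11.04, 12.34)
```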

We have seen that the phrase table pruning rate is strongly affected by the lengths and word positions in the source and target sentences. As reflected in the pruning equation (20), the phrase table might be more accurately pruned based on certain features (e.g., phrase length, phrase cooccurrence) in the bilingual and monolingual corpora. The researchers in [27] reduced the phrase table based on significance testing of phrase pair cooccurrence in a bilingual corpus, while He et al. [38, 39] pruned the phrase pairs in terms of phrase frequency on the source side using key phrases extracted from a monolingual corpus. We want to make full use of the advantages of both approaches: the key phrases can be used to check whether the extracted phrases are often used in practice, and the significance testing can be used to measure the quality of the extracted phrase pairs. Our method considers the features of both the source and target phrases.

Table 4: Comparison between theoretical estimation results and other pruning techniques' results.

Language pairs   j_mid   i_mid   R_pruned (%)
fr-en            12.34   11.04   53.5~98.6
es-en            11.26   11.42   49.3~98.4
de-en            11.16   11.19   44.9~98.3

The basic metric for phrases that are often used in practice is the frequency with which a phrase appears in a monolingual corpus: the more frequently a phrase appears in a corpus, the greater the possibility that the phrase may be used. However, longer phrase pairs would be removed in this way, because such phrases tend to appear rarely in a monolingual corpus.

The C-value is a measure for automatic term recognition which can be used to solve this problem, as described in [40]. The method not only fully considers frequency information, such as the frequencies of a phrase and of a subphrase appearing in longer phrases, but is also an efficient algorithm.

In this algorithm (as shown in Algorithm 2), 4 factors (L, F, S, N) are used to determine whether a phrase p is a key phrase [38]:

(i) L(p): the length of p;
(ii) F(p): the frequency with which p appears in a corpus;
(iii) S(p): the frequency with which p appears as a substring in other longer phrases;
(iv) N(p): the number of phrases that contain p as a substring.

Given a monolingual corpus, key phrases can be extracted efficiently according to Algorithm 2. Firstly (line 1), all possible phrases are extracted as candidates for key phrases (e.g., the phrase table extracted by Moses [41]). Secondly (lines 3 to 7), for each candidate phrase, the C-value is computed according to whether the phrase appears by itself (line 4) or as a substring of other long phrases (line 6). The C-value is in direct proportion to the phrase length (L) and the occurrences (F, S), but in inverse proportion to the number of phrases that contain the phrase as a substring (N). This overcomes the limitations of the plain frequency measure. A phrase is regarded as a key phrase if its C-value is greater than a threshold \epsilon. Finally (lines 11 to 14), S(q) and N(q) are updated for each substring q.

The algorithm was originally used for pruning the rule table extracted by hierarchical phrase-based SMT [42].


Input: Monolingual Corpus
(1) Extract candidate phrases
(2) for all phrases p in length-descending order do
(3)   if N(p) = 0 then
(4)     C-value = (L(p) - 1) \times F(p)
(5)   else
(6)     C-value = (L(p) - 1) \times (F(p) - S(p)/N(p))
(7)   end if
(8)   if C-value \ge \epsilon then
(9)     add p to KP
(10)  end if
(11)  for all substrings q of p do
(12)    S(q) = S(q) + F(p) - S(p)
(13)    N(q) = N(q) + 1
(14)  end for
(15) end for
Output: Key Phrase Table KP

Algorithm 2: Key phrase extraction from a monolingual corpus.
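Algorithm 2 translates almost line for line into Python. This is a minimal sketch, assuming candidate phrases are token tuples and a hypothetical `freq` mapping supplies F(p); a real phrase table would be streamed rather than held in memory.

```python
from collections import defaultdict

def extract_key_phrases(candidates, freq, epsilon):
    """C-value key-phrase extraction (Algorithm 2): scan phrases in
    length-descending order, score each against a threshold epsilon, and
    propagate S(q) and N(q) to every proper substring q."""
    S = defaultdict(int)  # frequency of q as a substring of longer phrases
    N = defaultdict(int)  # number of longer phrases containing q
    key_phrases = []
    for p in sorted(candidates, key=len, reverse=True):
        L, F = len(p), freq[p]
        if N[p] == 0:
            c_value = (L - 1) * F          # line 4
        else:
            c_value = (L - 1) * (F - S[p] / N[p])  # line 6
        if c_value >= epsilon:
            key_phrases.append(p)
        # lines 11-14: update the statistics of every proper substring q
        for length in range(1, L):
            for start in range(L - length + 1):
                q = p[start:start + length]
                S[q] += F - S[p]
                N[q] += 1
    return key_phrases
```

On a toy input, ("machine", "translation") survives because its C-value discounts only the occurrences already counted inside the longer key phrase containing it, while the single word ("translation",) is rejected since L - 1 = 0.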

Input: Bilingual Corpus and KP Table from Algorithm 2
(1) for all phrases p' in KP do
(2)   if -\log(P-value(p')) \le \tau_F then
(3)     add p' to P_PT
(4)   end if
(5) end for
Output: Pruned Phrase Table P_PT

Algorithm 3: Algorithm for key phrase pruning.

However, the idea can also be applied to phrase-based SMT [1, 41] perfectly well. After this filtering step, a filtered phrase table called KP is generated.

The phrases in the phrase table KP are all phrases possibly used in practice. However, there is still some noise in the KP table; for example, the source phrase occurring in the source sentences and the target phrase occurring in the target sentences may not be well matched, that is, the phrase translation is not good enough. To handle this case, it is convenient to construct a two-by-two contingency table that tabulates the sentence pairs where the two types of matches occur and do not occur [21, 22, 43]. The result of the calculation is the P-value introduced in Section 2. A bilingual corpus constraint can thus be added to filter the noisy phrases, as presented in Algorithm 3. After this step, a phrase table with a large pruning rate (called P_PT) is generated.

3.3. Word Alignment & Machine Translation. Based on the discussion in Sections 3.1 and 3.2, we find that the size of the phrase table is heavily affected by the quality of the word alignment: the better the word alignment, the smaller the phrase table. Besides this, machine translation performance is not directly decided by the size of the phrase table; at least 50% of phrases can be removed without hurting the performance of the machine translation. Experiments show that the quality of phrase translation plays a more important role in the final translation quality in practice. Therefore, it is necessary to improve the quality of word alignment under the existing phrase extraction algorithm based on word alignment.

Now it is a little easier to talk about the relationship between word alignment and machine translation. The translation of a sentence is directly derived from the phrase table, while the phrase table is extracted from the word alignment. In other words, the quality of word alignment affects the performance of machine translation through the phrase table; that is to say, the word alignment quality should affect the translation performance considerably. However, the works listed in [8, 18, 19] show that improved word alignment quality does not always significantly affect the translation result.

We do not agree with this conclusion. Firstly, either the training or testing corpora used in their experiments were limited to a small size, or the improvement in word alignment quality was not significant (AER decrease between 0.1 and 0.2). Secondly, we think it is the existing phrase extraction algorithm that leads to the inconclusive result. We pointed out in the last part of Section 3.1 that even if the worst alignments are used, the extracted phrase table still has some ability to translate some input phrases. This is because of the shared phrases extracted from the only one


Figure 6: Moses' main modules and models (the GIZA++ aligner and phrase extraction producing the phrase table from the bilingual corpus, the SRILM/IRSTLM toolkits building the language model from the monolingual corpus, and the Moses decoder translating source text into target text).

alignment link. Even only one word alignment link can translate some input phrases or sentences, not to mention more word alignment links.

We will discuss the situation with different word alignments in the next section, where the effect on machine translation of the best and the worst word alignments is fully surveyed. After that, we can see that there are at least three advantages to improving the quality of word alignment, as follows.

(i) Better alignment will extract fewer phrase pairs and keep the phrase table at a manageable size.

(ii) Better alignment will reduce the decoding time when searching for the most probable translation in the phrase table.

(iii) Better alignment will produce better quality word- or phrase-level translations.

4. Experiments and Discussions

In order to evaluate the conclusions drawn in the previous section, experiments are carried out using the Moses toolkit [41], which provides a complete package of the required modules for training and for generating translations of texts.

4.1. Experiment Settings. Like other SMT approaches, the Moses toolkit requires two models (as shown in Figure 6): the Language Model (LM) and the Translation Model (TM). In Moses, the LM is usually created by the external toolkits SRILM [44] or IRSTLM [45]; in our experiment, the IRSTLM toolkit is used for the large training corpus. Before training, all tokenization for the languages in the Europarl corpus [46] uses the integrated scripts in Moses, and the segmentation of Chinese uses the Stanford Chinese segmenter [47] with the CTB (Chinese Treebank) standard [48, 49]. The language models are trained as five-gram models.

The phrases are extracted from the results generated by GIZA++ [8]. The word alignment was trained with ten iterations of IBM Model 1 and Model 4 [34] and six iterations of the HMM alignment model [50].

To extract all the possible phrases needed to verify (13) and (17), the maximal phrase length is set to 30. This can extract most

Table 5: Statistics of the first 10000 WMT 2006 corpus sentences.

Languages   Ave. source len   Ave. en len
fr-en       22.89             20.15
es-en       21.71             20.94
de-en       20.55             21.42

Table 6: Statistics of test data (3064 sentences).

Languages   Ave. source len   Ave. en len
fr-en       32.95             27.81
es-en       29.94             27.81
de-en       26.92             27.81

Table 7: Statistics of CWMT 2013 + UM-Corpus.

Languages      Tokens       Average length   Vocabularies
English        152161233    19.37            1655080
Character_CE   229110265    29.16            397442
CTB_CE         123917395    15.78            1331505

phrases in terms of word alignments and lets us focus on the relations among word alignment, phrase table, and machine translation.

In order to better reveal the relationships of alignment and phrase table to translation quality, the leftmost, middle, and rightmost word alignments are extracted from the original full alignment for the worst alignment situation (i.e., only one alignment link). Here, the leftmost, middle, and rightmost word alignments indicate the word alignment link aligned to the first, middle, and last word on the source language side, respectively.

4.2. Corpus Information. Three language pairs from the WMT 2006 shared translation task are chosen as the training data: French-English (fr-en), German-English (de-en), and Spanish-English (es-en). These three language pairs are used to carry out the experiments for proving the equations deduced in Section 3 and the relationship between word alignment and machine translation. Table 3 shows the statistics of the original corpus [46]. However, the number of phrases extracted from only one link is too large to be handled by our server, so the first 100,000 European corpus sentences (statistics shown in Table 5) are chosen to verify the equations relating word alignment to the phrase table. Table 6 presents the statistics of the 3,064 test sentences from WMT 2006, which share the same English side.

To test our algorithm for pruning the phrase table, a large Chinese-English parallel corpus is brought in. The large corpus consists of two parts: 3,289,497 sentences from CWMT 2013 [51] and 4,157,556 sentences from the UM-Corpus [52, 53]. After removing repeated and mismatched sentences from the combined parts, 7,445,190 sentences are left; the statistics of the combined parallel corpus are presented in Table 7. Note that the statistics of the Chinese sentences are counted both at the character level (each Chinese character is treated as one token, CharacterCE in Table 7) and

The Scientific World Journal 9

Table 8: Statistics of English & Chinese corpora.

Languages   Tokens          Average length   Vocabularies
English     804,584,844     24.64            2,357,374
Chinese     3,007,196,322   33.43            1,533,343

Table 9: Statistics for EC-Corpus testing data.

Languages     Tokens    Average length
English       118,802   23.76
CharacterCE   153,705   30.74
CTBCE         118,499   23.70

at the word level (segmented by the Stanford CTB segmenter, denoted CTBCE).

The UM-Corpus also contains English, Chinese, and Portuguese monolingual corpora. In this experiment, the English and Chinese corpora are used. There are 35,938,697 English sentences from the news domain and more than 70 million Chinese sentences from China portal webpages. The statistics of the two corpora are shown in Table 8.

The 5,000 sentences of test data (named UM-Testing) are designed according to the domains in the UM-Corpus: 1,500 sentences for news, 500 for laws, 500 for novels, 600 for theses, 700 for education, 600 for science, and 600 for subtitles. The statistics of the testing data are shown in Table 9.

4.3. Equation of Word Alignments versus Phrases. The calculation of the number of phrases requires the full word alignment links. Full word alignment, in other words, is word alignment of the best quality. It is obviously very time-consuming to manually edit the alignments of a large set of sentences. It has been observed in the past that the generative models used for statistical word alignment create alignments of increasing quality as they are exposed to more data [19]. The intuition behind this is simple: as more cooccurrences of source and target words are observed, the word alignments become better. Following this intuition, if we wish to increase the quality of a word alignment, we allow the alignment process access to extra data, which is used only during the alignment process and then removed. If we wish to decrease the quality of a word alignment, we divide the parallel corpus into pieces, align the pieces independently of one another, and finally concatenate the results together. In our experiments, we first use the entire WMT 2006 corpus as training data, and then the first 100,000 alignments are chosen as the full word alignments.
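The corpus-splitting procedure for deliberately degrading alignment quality can be sketched as follows (a minimal illustration; the piece count k and the toy corpus are our assumptions, not values from the paper):

```python
def split_for_alignment(pairs, k):
    """Split a parallel corpus into k contiguous pieces so that each
    piece can be word-aligned independently; aligning smaller pieces
    exposes fewer cooccurrences and so degrades alignment quality."""
    size = (len(pairs) + k - 1) // k  # ceiling division
    return [pairs[i:i + size] for i in range(0, len(pairs), size)]

# Toy corpus of 10 hypothetical sentence pairs, split into 3 pieces.
corpus = [("src-%d" % i, "tgt-%d" % i) for i in range(10)]
pieces = split_for_alignment(corpus, 3)
print([len(p) for p in pieces])  # [4, 4, 2]
# Each piece would be aligned separately, and the resulting
# alignments concatenated afterwards.
```

Conversely, increasing quality simply means aligning the concatenation of the corpus with the extra data and keeping only the alignments of the original part.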

Also, the leftmost, middle, and rightmost word alignment points based on the full word alignment are obtained from the results generated by GIZA++.

The minimal and maximal sizes of the phrase table extracted from the full and middle word alignments, together with the differential rates, are shown in Table 10. The differential rate is calculated by

$$R_{\text{diff-min}} = \frac{\left|N_{\min} - N_{\text{moses-min}}\right|}{N_{\min}}, \quad (22)$$

$$R_{\text{diff-max}} = \frac{\left|N_{\max} - N_{\text{moses-max}}\right|}{N_{\max}}, \quad (23)$$

where N_moses-min and N_moses-max indicate the minimal and maximal numbers of phrases extracted by Moses.

From Table 10, we can see that the actual number of phrases is close to the calculated result. Given the small differential rates, (13) and (17) are applicable to practical experiments. The differences may come from two reasons: (1) the full word alignment produced by GIZA++ may contain too much noise; (2) equations (13) and (17) do not consider repetitions of the extracted phrases.
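As a sanity check, the differential rates of (22) and (23) can be recomputed directly from the raw phrase counts; the following sketch hard-codes the fr-en counts from Table 10 and is illustrative only:

```python
def diff_rate(n_calc: int, n_moses: int) -> float:
    """Differential rate of (22)/(23): |N_calc - N_moses| / N_calc."""
    return abs(n_calc - n_moses) / n_calc

# fr-en counts taken from Table 10.
n_min, n_moses_min = 9_080_307, 9_136_283
n_max, n_moses_max = 1_544_021_509, 1_541_902_790

print(round(diff_rate(n_min, n_moses_min), 3))  # 0.006
print(round(diff_rate(n_max, n_moses_max), 3))  # 0.001
```

Both values match the fr-en row of Table 10.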

4.4. Phrase Table versus Machine Translation. We have shown that there is no direct relationship between the size of the phrase table and the machine translation performance. As reflected in (17) and Table 4, at least half of the phrases can be removed without hurting the performance of machine translation too much. In this part, we use the large Chinese-English corpus to test our pruning algorithm based on monolingual and bilingual corpora.

According to our experience, different thresholds generate different sizes of phrase table and result in different translation performance. Therefore, one of the key steps is to choose the pruning thresholds in Algorithms 2 and 3. Firstly, a baseline threshold is calculated from the logarithm of the number of sentences of the given corpus (N), that is, log(N). For the C-value threshold, the baseline is about 25 (log(3.6 × 10^7)). For the P-value threshold, the negative logarithm is chosen: because the probabilities involved are incredibly tiny, we work instead with the negative natural logs of the probabilities. For example, instead of selecting phrase pairs with a P-value less than exp(−15), we select phrase pairs with a negative-log-P-value greater than 15. The baseline threshold for τ_F is 45 (log(7.4 × 10^6)). Secondly, the different thresholds are increased or decreased relative to the baseline. The effect of the different thresholds on the phrase table and BLEU scores is shown in Table 11.
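The negative-log selection step described above can be sketched as follows (the threshold value 15 is the example from the text; the function name is ours):

```python
import math

def keep_phrase_pair(neg_log_p: float, threshold: float = 15.0) -> bool:
    """Select pairs whose P-value < exp(-threshold), i.e. whose
    negative log P-value exceeds the threshold."""
    return neg_log_p > threshold

# A pair with P = exp(-20) is kept; one with P = exp(-10) is discarded.
print(keep_phrase_pair(20.0))  # True
print(keep_phrase_pair(10.0))  # False

# Working in log space avoids floating-point underflow: exp(-800)
# is exactly 0.0 in doubles, but its negative log, 800, is
# perfectly representable.
print(math.exp(-800) == 0.0)  # True
```

This is why the comparison is done entirely in negative-log space rather than on the raw probabilities.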

From Table 11, we can see that nearly 98% of the phrases can be discarded with the monolingual and bilingual corpora while the translation quality does not become much worse (only a 2.35 BLEU decrease). Experiments show that reducing the phrase table by about 85% generates the best translation quality, which not only confirms similar results of other researchers [3, 21, 32, 38] but also indicates that most phrases are not needed at all. The best translation can be achieved with about 15% to 30% of the original phrase table. That is to say, only a small fraction of the phrases are essential, while the majority of the extracted phrases are spurious; in some sense they are redundant and should be discarded.


Table 10: Phrase comparison between Moses extraction and calculation results (phrase length K = 30).

Languages   N_moses-min   N_min        R_diff-min   N_moses-max     N_max           R_diff-max
fr-en       9,136,283     9,080,307    0.006        1,541,902,790   1,544,021,509   0.001
es-en       10,958,551    10,895,266   0.006        1,401,977,185   1,489,422,170   0.058
de-en       9,219,921     8,929,095    0.033        1,606,176,665   1,686,043,482   0.047

Table 11: C-value and P-value threshold effects on the phrase table size and BLEU scores (UM-Corpus + CWMT 2013).

C-value ε   −log(P-value) τ_F   Phrase table (%)   BLEU (%)
0           0                   100                28.13
10          20                  83.6               28.09
25          45                  52.5               27.98
100         60                  29.7               28.11
200         100                 14.3               28.27
400         200                 2.02               25.78

Table 12: Translation performance on different word alignment types (BLEU, 3,064 sentences).

Languages   A_full   A_lmost   A_sure   A_mid   A_rmost
fr-en       24.80    4.48      22.26    9.96    2.48
es-en       30.76    3.12      28.24    11.28   2.32
de-en       21.66    2.30      19.28    9.76    2.94

4.5. Word Alignment versus Translation Quality. In order to investigate the effect of different word alignment types and phrase table sizes on the translation performance, we selected five types of word alignments to generate the phrase table and then measured the translation quality by the BLEU metric: the full word alignment points (A_full), the leftmost word alignment (A_lmost), the sure alignment points (A_sure), the middle word alignment point (A_mid), and the rightmost word alignment point (A_rmost).

Clearly, A_full is the word alignment of the best quality, followed by A_sure, which contains only 1-to-1 alignments. A_lmost, A_mid, and A_rmost are much worse, as each contains only one alignment link. As mentioned previously, all these alignment points are obtained from the results generated by GIZA++. The experiment is carried out on the WMT 2006 corpus. Based on the different alignment types, the number of phrases extracted and the corresponding translation BLEU scores are presented in Figure 7 and Table 12, respectively.

The histograms show that there is no direct proportional relationship between the translation performance and the number of extracted phrases. The numbers in Figure 7 show that the German-English language pair generates the most phrases, while its translation quality is not the best. Although the fewest phrases are produced for the Spanish-English language pair, the best translation performance is obtained. As explained by (17), more phrases usually indicate worse word alignment quality; that is, there are too many unaligned words in the produced alignment results. The results in Figure 7 and Table 12 show that (1) the full word alignment contributes most to the translation performance, and there is not much decrease when only the sure alignments are provided; (2) however, the first and last word alignments do not produce high translation quality; (3) although the middle-position word alignment extracts the most phrases, its translation quality is not the best, yet among the single-link alignments the middle one contributes most to the final translation.

In fact, the translation quality from the one-link word alignments should be higher: the default full word alignments, trained from a large corpus without manual editing, contain too much noise.

5. Conclusions

In this paper, some investigations are made into the relationship among word alignment, phrase table, and machine translation. After deducing a formula for estimating the size of the phrase table, a maximal phrase pruning ratio can be calculated based on the average length of the given corpus and the word alignment positions. Finally, translation performance based on different word alignment types is compared.

Equations (13) and (17) provide a simple way to predict the size of the phrase table in advance, based on the generated word alignment points. This is beneficial for evaluating how much hardware resource is required to train the model when the size of the parallel corpus is huge.
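As a rough illustration of such a prediction, the per-sentence term I(I + 1)/2 of (13) gives an upper bound on the minimal phrase count before N_r is subtracted, so summing it over a corpus bounds the table size in advance. The sentence lengths below are hypothetical:

```python
def phrase_upper_bound(target_lengths):
    """Upper bound on the minimal number of extracted phrases, using
    the I*(I+1)/2 term of (13) before N_r is subtracted."""
    return sum(I * (I + 1) // 2 for I in target_lengths)

# Hypothetical corpus of three target-side sentence lengths.
print(phrase_upper_bound([20, 21, 20]))  # 651
```

Scaling this per-sentence bound by the corpus size gives a quick estimate of the memory needed for phrase extraction.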

Experiment results show that the effect on the translation performance is significant if the AER decreases or increases too much. There is no direct proportional relationship between the translation performance and the number of extracted phrases. In other words, most of the phrases are not needed at all; the phrases extracted from the full word alignment alone are enough. It can be seen from our corpus-motivated pruning method that only 15–30% of the phrases are needed for the basic translations, and most phrases in the phrase table are noise to the final translation.

Given the existing phrase extraction algorithm from word alignment, better word alignment will be very helpful to the final translation. There are at least three advantages of better word alignment quality, even without improving the existing phrase extraction algorithm:

(i) Better alignment will extract fewer phrase pairs and keep a manageable size of the phrase table.


Figure 7: Number of phrases extracted from different word alignment points and their BLEU. (Panels (a)-(c) show the number of phrases per alignment type (A_lmost, A_sure, A_mid, A_full, A_rmost) for fr-en, es-en, and de-en, ranging from about 9.1 × 10^6 (fr-en, A_full) to about 1.6 × 10^9 (de-en, A_mid); panels (d)-(f) show the corresponding BLEU scores, matching Table 12.)

(ii) Better alignment will reduce the decoding time when searching for the most probable translation in the phrase table.

(iii) Better alignment will produce better quality of word- or phrase-level translation.

In the next work, we want to focus on adding the corpus-motivated pruning technique into the translation model; that is, the phrases will be filtered during training time, not after generating the phrase table.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank all reviewers for their very careful reading and helpful suggestions. The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for the funding support of their research under Reference nos. MYRG076 (Y1-L2)-FST13-WF and MYRG070 (Y1-L2)-FST12-CS.

References

[1] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol. 1, pp. 48-54, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[2] R. Zens, F. J. Och, and H. Ney, Phrase-Based Statistical Machine Translation, Springer, Berlin, Germany, 2002.

[3] J. H. Shin, Y. S. Han, and K. S. Choi, "Bilingual knowledge acquisition from Korean-English parallel corpus using alignment method: Korean-English alignment at word and phrase level," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 230-235, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.

[4] K. Yamamoto, T. Kudo, Y. Tsuboi, and Y. Matsumoto, "Learning sequence-to-sequence correspondences from parallel corpora via sequential pattern mining," in Proceedings of the HLT-NAACL Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, vol. 3, pp. 73-80, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[5] C. Goutte, K. Yamada, and E. Gaussier, "Aligning words using matrix factorisation," in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2004.

[6] F. J. Och, C. Tillmann, H. Ney, and L. F. Informatik, Improved Alignment Models for Statistical Machine Translation, University of Maryland, College Park, Md, USA, 1999.

[7] F. J. Och and H. Ney, "A comparison of alignment models for statistical machine translation," in Proceedings of the 18th Conference on Computational Linguistics, vol. 2, pp. 1086-1090, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2000.

[8] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19-51, 2003.

[9] F. J. Och and H. Ney, "The alignment template approach to statistical machine translation," Computational Linguistics, vol. 30, no. 4, pp. 417-449, 2004.

[10] N. Duan, M. Li, and M. Zhou, "Improving phrase extraction via MBR phrase scoring and pruning," in Proceedings of the MT Summit XIII, pp. 189-197, 2011.

[11] G. Neubig, T. Watanabe, E. Sumita, S. Mori, and T. Kawahara, "An unsupervised model for joint phrase alignment and extraction," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 632-641, Stroudsburg, Pa, USA, June 2011.

[12] H. Setiawan, H. Li, and M. Zhang, "Learning phrase translation using level of detail approach," in The Tenth Machine Translation Summit (MT-Summit X), 2005.

[13] C. Tillmann, "A projection extension algorithm for statistical machine translation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[14] B. Zhao and S. Vogel, "A generalized alignment-free phrase extraction," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 141-144, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.

[15] Y. Zhang and S. Vogel, "Competitive grouping in integrated phrase segmentation and alignment model," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 159-162, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.

[16] A. Lopez, "Statistical machine translation," ACM Computing Surveys, vol. 40, no. 3, article 8, 2008.

[17] K. Ganchev, J. V. Graça, and B. Taskar, "Better alignments = better translations?" in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 986-993, June 2008.

[18] D. Vilar, M. Popovic, and H. Ney, "AER: do we need to 'improve' our alignments?" in Proceedings of the International Workshop on Spoken Language Translation, pp. 205-212, 2006.

[19] A. Fraser and D. Marcu, "Measuring word alignment quality for statistical machine translation," Computational Linguistics, vol. 33, no. 3, pp. 293-303, 2007.

[20] P. Lambert, S. Petitrenaud, Y. Ma, and A. Way, "What types of word alignment improve statistical machine translation?" Machine Translation, vol. 26, no. 4, pp. 289-323, 2012.

[21] J. H. Johnson, J. Martin, G. Foster, and R. Kuhn, "Improving translation quality by discarding most of the phrasetable," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 967-975, June 2007.

[22] J. H. Johnson, Conditional Significance Pruning: Discarding More of Huge Phrase Tables, 2012.

[23] M. Yang and J. Zheng, "Toward smaller, faster, and better hierarchical phrase-based SMT," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 237-240, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.

[24] N. Tomeh, N. Cancedda, and M. Dymetman, "Complexity-based phrase-table filtering for statistical machine translation," in Proceedings of the MT Summit XII, 2009.

[25] M. Eck, S. Vogel, and A. Waibel, "Translation model pruning via usage statistics for statistical machine translation," in Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '07), pp. 21-24, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.

[26] M. Eck, S. Vogel, and A. Waibel, "Estimating phrase pair relevance for machine translation pruning," in Proceedings of the MT Summit XI, pp. 159-165, 2007.

[27] Y. Chen, A. Eisele, and M. Kay, "Improving statistical machine translation efficiency by triangulation," in Proceedings of the Sixth International Conference on Language Resources and Evaluation, 2008.

[28] Y. Chen, M. Kay, and A. Eisele, "Intersecting multilingual data for faster and better statistical translations," in Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '09), pp. 128-136, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2009.

[29] G. Sanchis-Trilles, D. Ortiz-Martínez, J. González-Rubio, J. González, and F. Casacuberta, "Bilingual segmentation for phrasetable pruning in statistical machine translation," in Proceedings of the 15th International Conference of the European Association for Machine Translation (EAMT '11), pp. 257-264, May 2011.

[30] N. Tomeh, M. Turchi, G. Wisniewski, A. Allauzen, and F. Yvon, "How good are your phrases? Assessing phrase quality with single class classification," in Proceedings of the International Workshop on Spoken Language Translation, pp. 261-268, 2011.

[31] W. Ling, J. Graça, I. Trancoso, and A. Black, "Entropy-based pruning for phrase-based machine translation," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 962-971, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[32] R. Zens, D. Stanton, and P. Xu, "A systematic comparison of phrase table pruning techniques," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 972-983, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[33] K. Papineni, S. Roukos, T. Ward, and W. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pp. 311-318, 2002.

[34] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Computational Linguistics, vol. 19, no. 2, pp. 263-311, 1993.

[35] P. Koehn, Statistical Machine Translation, Cambridge University Press, New York, NY, USA, 1st edition, 2010.

[36] E. Galbrun, "Phrase table pruning for statistical machine translation," Department of Computer Science Series of Publications C, C-2009-22, University of Helsinki, 2009.

[37] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, USA, 2006.

[38] Z. He, Y. Meng, Y. Lu, H. Yu, and Q. Liu, "Reducing SMT rule table with monolingual key phrase," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 121-124, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.

[39] Z. He, Y. Meng, and H. Yu, "Discarding monotone composed rule for hierarchical phrase-based statistical machine translation," in Proceedings of the 3rd International Universal Communication Symposium (IUCS '09), pp. 25-29, New York, NY, USA, December 2009.

[40] K. T. Frantzi and S. Ananiadou, "Extracting nested collocations," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 41-46, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.

[41] P. Koehn, H. Hoang, A. Birch et al., "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177-180, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.

[42] D. Chiang, "A hierarchical phrase-based model for statistical machine translation," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 263-270, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2005.

[43] W. A. Gale and K. W. Church, "Identifying word correspondences in parallel texts," in Proceedings of the Workshop on Speech and Natural Language, pp. 152-157, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1991.

[44] A. Stolcke, "SRILM - an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing, pp. 901-904, 2002.

[45] M. Federico, N. Bertoldi, and M. Cettolo, "IRSTLM: an open source toolkit for handling large scale language models," in Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH '08), pp. 1618-1621, September 2008.

[46] P. Koehn, "Europarl: a parallel corpus for statistical machine translation," in Proceedings of the MT Summit X, 2005.

[47] S. Green and J. DeNero, "A class-based agreement model for generating accurately inflected translations," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, vol. 1, pp. 146-155, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[48] N. Xue, F. D. Chiou, and M. Palmer, "Building a large-scale annotated Chinese corpus," in Proceedings of the 19th International Conference on Computational Linguistics, vol. 1, pp. 1-8, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2002.

[49] N. Xue, F. Xia, F.-D. Chiou, and M. Palmer, "The Penn Chinese TreeBank: phrase structure annotation of a large corpus," Natural Language Engineering, vol. 11, no. 2, pp. 207-238, 2005.

[50] S. Vogel, H. Ney, and C. Tillmann, "HMM-based word alignment in statistical translation," in Proceedings of the 16th International Conference on Computational Linguistics, pp. 836-841, 1996.

[51] CWMT 2013, httpwwwliipcncwmt2013.

[52] L. Tian, F. Wong, and S. Chao, "An improvement of translation quality with adding key-words in parallel corpus," in Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC '10), pp. 1273-1278, Qingdao, China, July 2010.

[53] L. Tian, D. F. Wong, L. S. Chao et al., "UM-Corpus: a large English-Chinese parallel corpus for statistical machine translation," in Proceedings of the 9th International Conference on Language Resources and Evaluation, 2014.



Figure 1: Full word alignments between source words f_1, f_2, ..., f_J and target words e_1, e_2, ..., e_I.

Figure 2: An example of nonmonotonic alignment ("I am able to do it well"), where alignment links cross between the parallel sentences.

So, if we define the number of reduced phrases as N_ca (the number of cross-alignments), a more practical equation can be revised as

$$N_{\min} = \frac{I(I+1)}{2} - N_{\text{ca}}. \quad (10)$$

If there is more than one cross-alignment, as illustrated in Figure 3, the total number of extracted phrases can be generalized as

$$N_{\min} = \frac{I(I+1)}{2} - \sum_{k=1}^{m}\left[\left(i_{\text{begin}}^{k} - 1\right) + \left(I - i_{\text{end}}^{k}\right)\right]. \quad (11)$$

The reduced phrase pairs are determined by the cross-alignment points i_begin and i_end: i^k_begin represents the start position and i^k_end the last position of the kth pair of words (bounded by the crossing links), and m is the total number of cross-alignments.

Figure 3 illustrates the situation where there is no word between the crossing dependencies. In practice, however, there may be words in between the crossed words, as illustrated in Figure 4.

For the third case, words between e_{i_begin} and e_{i_end} may have a similar alignment situation to that illustrated in (9). Thus, the following equation can be derived to calculate N_r, the exact number of phrase pairs that cannot be generated:

$$N_{r} = \sum_{k=1}^{m}\ \sum_{\alpha=0,\,\beta=0}^{i_{\text{end}}^{k} - i_{\text{begin}}^{k} - 1}\left[\left(i_{\text{begin}}^{k} + \alpha - 1\right) + \left(I - i_{\text{end}}^{k} + \beta\right)\right], \quad (12)$$

where α indicates the shifted word position after the first crossed word i^k_begin and β indicates the word position before the end crossed word i^k_end; α and β will not be greater than the number of words between the first and the end crossed words.

Finally, the number of phrase pairs that can be extracted from the full word alignments, measured on the target side, can be described by

$$N_{\text{min-}t} = \frac{I(I+1)}{2} - N_{r}. \quad (13)$$

Figure 3: Alignments with a cross link of neighboring words (e_{i_begin}, e_{i_end} aligned to f_{j_begin}, f_{j_end}).

Figure 4: Alignments with a cross link involving distant words.

The equation shows that the size of the phrase table is decided by the target-side length of the given sentence if the word alignment is the full word alignment, that is, word alignment of the best and ideal quality. It has been proven that the bidirectional training approach for word alignment can improve the quality of word alignment [8]. If the alignment is deduced from the other direction (the source side), the phrases extracted from the source side can similarly be represented by

$$N_{\text{min-}s} = \frac{J(J+1)}{2} - N_{r}. \quad (14)$$

Equations (13) and (14) show that the number of phrases is heavily affected by the lengths of the source and target sentences. Given the best word alignment, if the length of the source sentence equals that of the target side (I = J), the number of phrases will be the minimal one (N_min-s or N_min-t). If they are not equal, the number of extracted phrases will be larger than this minimal value after the intersection and union operations during the alignment process.

3.1.2. Phrase Extraction from the Worst Alignment. Suppose only one word f_j in the source sentence is aligned to a target word e_i, as shown in Figure 5. Note that unaligned words near the word f_j may lead to multiple possibilities: each preceding and following word next to it can form a part of the phrase, and so do the neighboring words of e_i in the target sentence. In other words, if we know the different extraction situations in each source and target sentence, then the total count of phrase pairs can be derived as the product of the two numbers. Algorithm 1 describes the different phrase extraction scenarios according to different positions in the source sentence.

Obviously, there are in total i words that a phrase can start with (e_1, e_2, e_3, ..., e_i). Now, in a target sentence consisting of an odd/even number of words, the number of phrase pairs f(i) for the different positions is

f(i) = i(I − i + 1) = i(I − i) + i.  (15)

The number of phrases g(j) on the source side can be calculated in a similar way:

g(j) = j(J − j + 1) = j(J − j) + j.  (16)

The Scientific World Journal 5

Figure 5: Sentences with a single alignment link.

Based on (15) and (16), the total number of phrase pairs N_max is given by

N_max = [i(I − i) + i] × [j(J − j) + j],  (17)

where 0 ≤ i ≤ I − 1 and 0 ≤ j ≤ J − 1.

Obviously, (15) and (16) follow a symmetric distribution, and the maximal value is achieved at the middle position. Given a sentence consisting of L words, a formal description of the middle position in the sentence can be presented by

word_mid = (1 + L_odd)/2, or L_even/2, or (L_even/2 + 1).  (18)

We can see that (17) attains its maximal value at the middle position of a given sentence; that is, the nearer to the middle word, the more phrase pairs can be produced. Take Figure 2 as an example: there are 18 phrases based on the full word alignments, while there will be 240 phrases when only the fifth word (do) in the source sentence aligns to the third word (做, zuo) on the target side. From the view of the size of the phrase table, the value calculated from (17) is much larger than that from (13). This proves that the better the quality of the word alignment, the fewer phrases there will be.
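Equations (13), (17), and (18) can be checked directly against the standard consistency-based phrase extraction heuristic. The sketch below is a brute-force implementation under the usual definition (every alignment link touching a candidate pair must lie inside it, and the pair must contain at least one link); the sentence lengths and link position are illustrative, not the Figure 2 example.

```python
def consistent_pairs(I, J, links):
    """Brute-force count of phrase pairs consistent with an alignment:
    every link touching a candidate (target span, source span) must lie
    inside both spans, and the pair must contain at least one link."""
    count = 0
    for i1 in range(1, I + 1):
        for i2 in range(i1, I + 1):
            for j1 in range(1, J + 1):
                for j2 in range(j1, J + 1):
                    touching = [(i, j) for i, j in links
                                if i1 <= i <= i2 or j1 <= j <= j2]
                    if touching and all(i1 <= i <= i2 and j1 <= j <= j2
                                        for i, j in touching):
                        count += 1
    return count

# Worst case: a single link e_3 <-> f_5 in a 5-word target / 7-word source
# sentence; the count matches f(i) * g(j) from (15)/(16), i.e. (17).
I, J, i, j = 5, 7, 3, 5
assert consistent_pairs(I, J, [(i, j)]) == i * (I - i + 1) * j * (J - j + 1)

# Best case: a full 1-to-1 monotone alignment with I = J gives I(I+1)/2, per (13).
assert consistent_pairs(5, 5, [(k, k) for k in range(1, 6)]) == 5 * 6 // 2

# Per (18), the single-link count f(i) = i(I - i + 1) peaks at the middle word.
assert max(range(1, I + 1), key=lambda k: k * (I - k + 1)) == (1 + I) // 2
```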

Note that there will be some shared phrases extracted from both the full word alignment and the single word alignment, which leads to the phenomenon that there is still some translation ability even with only one alignment point under the existing phrase extraction algorithm. For instance, the phrase pairs (e_1 e_2 ⋯ e_i | f_1 f_2 f_3 ⋯ f_j), (e_2 ⋯ e_i | f_2 f_3 ⋯ f_j), ..., (e_i | f_j) are the shared ones; given a phrase such as f_1 f_2 f_3 ⋯ f_j, the phrase tables generated from both alignment types can possibly translate it correctly to e_1 e_2 ⋯ e_i.

3.2. Phrase Table & Machine Translation. Nowadays, the key step of the phrase-based statistical machine translation (SMT) process involves inferring a large table of phrase pairs that are translations of each other from a large corpus of aligned sentences. The set of all phrase pairs, together with estimates of conditional probabilities and other useful features, is called the Phrase Table (PT), which is applied during the decoding process [29].

Nevertheless, the phrases in the PT are very dependent on the system that uses them. Some phrases might be present in the PT but never be used in translations, for example, because they are either ranked too low or are erroneous translations. In other words, there are too many redundant phrases in the PT. This situation has led many researchers to propose different techniques to reduce the size of the phrase table. These include

Table 2: Pruned phrase table and translation quality.

Pruning methods     Languages   Pruning rates (%)   BLEU (Δ)
Count/significant   fr-en       64.6–85.9           −0.1
                    es-en       67.9–88.1
                    de-en       62.7–84.1
Entropy             fr-en       85–95               −1.0
                    es-en       85–95
                    de-en       85–95
Bilingual-segment   fr-en       —                   −1.2
                    es-en       98
                    de-en       94

the methods introduced in Section 2. These works can be treated as research on the relationship between the phrase table and machine translation.

All the works presented in Section 2 show that the same translation results can be achieved from smaller sets of phrase pairs, with a several-fold increase in translation speed and little or no loss in translation accuracy. As reported by [32], count-based and significance-based pruning techniques can result in large savings of between 70% and 90%, while the entropy-based pruning approach can consistently reduce the entries of the phrase table by 85% to 95%. In [29], they even pointed out that the pruning rate can reach 98% using a bilingual segmentation method with finite state transducers. More details about the pruning rates and the related translation qualities (measured by BLEU [33]) are shown in Table 2, where the BLEU column records the maximal decrease (−) from the 100% size of the phrase table to the maximal pruning rate of the phrase table.

The pruning rates presented above are counted from empirical or experimental results. We can also estimate the pruning rate from a theoretical aspect. In practice, we notice that the phrases extracted from the middle word alignment point include most of the phrases that are extracted from the full word alignment. All the phrases excluding the ones extracted from the full word alignment can be removed. In other words, a rough pruning rate can be calculated from the ratio of the phrases extracted from the best alignment (N_min) and the worst alignment (N_max):

R_pruned = 1 − N_min-t / N_max.  (19)

If we want to quickly estimate what percentage of phrases can be pruned in a given corpus, the average statistics of the given corpus can be used, as in (20). This equation shows that the pruning rate is not only decided by the sentence lengths in a parallel corpus but is also affected by the alignment position. Consider

R_pruned = 1 − [I_average(I_average + 1)/2] / ([i(I_average − i) + i] × [j(J_average − j) + j]).  (20)

In reality, the numerator tends to be bigger and the denominator smaller, so the pruning rate will be smaller


Start from e_1: N_phrases = I − i + 1
Start from e_2: N_phrases = I − i + 1
Similarly, ∀i ∈ (i, ..., I) on the target side: N_phrases = I − i + 1

Algorithm 1: Phrase extraction based on the only one alignment point e_i ↔ f_j.

Table 3: Training data statistics, WMT 2006.

Language pairs   Sentences   Ave. source len   Ave. target len
fr-en            688,031     22.67             20.07
es-en            730,740     21.52             20.83
de-en            751,088     20.31             21.37

than the calculated value. When the alignment position is in the middle (middle alignment), the maximal pruning rates will be obtained, as shown in (21). Based on this equation, we can estimate the pruning rate range for different word alignment positions in the European corpus [46]. Table 3 shows the statistics of the WMT 2006 data, and Table 4 compares different pruning rates reported in researchers' papers [22, 32] with the calculated ranges based on (20). From these, we can see that the calculated result is very close to the experimental results presented in [32]. The authors concluded their findings as follows: "the novel entropy-based pruning often achieves the same Bleu score with only half the number of phrases." A very bold assumption is that most phrases (at least half) can be removed from the original extracted phrases. Consider

R_max = 1 − [I_average(I_average + 1)/2] × ([i_mid(I_average − i_mid) + i_mid] × [j_mid(J_average − j_mid) + j_mid])^(−1).  (21)

We have seen that the phrase table pruning rate is strongly affected by the lengths and word positions in the source and target sentences. As reflected in the pruning equation (20), the phrase table might be more accurately pruned based on some features (e.g., phrase length, phrase cooccurrence) in the bilingual and monolingual corpora. The researchers in [27] reduced the phrase table based on significance testing of phrase pair cooccurrence in a bilingual corpus, while He et al. [38, 39] pruned the phrase pairs in terms of phrase frequency on the source side, using key phrases extracted from a monolingual corpus. We want to make full use of the advantages of the two approaches: the key phrases can be used to check whether the extracted phrases are often used in practice, and the significance testing can be used to measure the quality of the extracted phrase pairs. Our method considers features for both the source and target phrases.

Table 4: Comparison between theoretical estimation results and other pruning techniques' results.

Language pairs   j_mid   i_mid   R_pruned (%)
fr-en            12.34   11.04   53.5–98.6
es-en            11.26   11.42   49.3–98.4
de-en            11.16   11.19   44.9–98.3

The basic metric for phrases that are often used in practice is the frequency with which a phrase appears in a monolingual corpus. The more frequently a phrase appears in a corpus, the greater the possibility that the phrase may be used. However, longer phrase pairs will be removed in this way, because those phrases tend to appear rarely in a monolingual corpus.

The C-value is a measurement for automatic term recognition that can be used to solve this problem, as described in [40]. The method not only fully considers frequency information, such as the frequencies of a phrase and of a subphrase appearing within longer phrases, but is also an efficient algorithm.

In this algorithm (as shown in Algorithm 2), 4 factors (L, F, S, N) can be used to determine whether a phrase p is a key phrase [38]:

(i) L(p): the length of p;
(ii) F(p): the frequency with which p appears in a corpus;
(iii) S(p): the frequency with which p appears as a substring in other longer phrases;
(iv) N(p): the number of phrases that contain p as a substring.

Given a monolingual corpus, key phrases can be extracted efficiently according to Algorithm 2. Firstly (line 1), all possible phrases are extracted as candidates for key phrases (e.g., the phrase table extracted by Moses [41]). Secondly (lines 3 to 7), for each candidate phrase, the C-value is computed according to whether the phrase appears by itself (line 4) or as a substring of other longer phrases (line 6). The C-value is in direct proportion to the phrase length (L) and occurrences (F, S) but in inverse proportion to the number of phrases that contain the phrase as a substring (N). This overcomes the limitations of the frequency measurement. A phrase is regarded as a key phrase if its C-value is greater than a threshold ε. Finally (lines 11 to 14), S(q) and N(q) are updated for each substring q.

The algorithm was originally used for pruning the rule table extracted by hierarchical phrase-based SMT [42].


Input: Monolingual Corpus
(1) Extract candidate phrases
(2) for all phrases p, in length-descending order, do
(3)   if N(p) = 0 then
(4)     C-value = (L(p) − 1) × F(p)
(5)   else
(6)     C-value = (L(p) − 1) × (F(p) − S(p)/N(p))
(7)   end if
(8)   if C-value ≥ ε then
(9)     add p to KP
(10)  end if
(11)  for all substrings q of p do
(12)    S(q) = S(q) + F(p) − S(p)
(13)    N(q) = N(q) + 1
(14)  end for
(15) end for
Output: Key Phrase Table KP

Algorithm 2: Key phrase extraction from a monolingual corpus.
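A runnable sketch of Algorithm 2, assuming the phrase frequencies F(p) have already been counted from the monolingual corpus and that phrases are represented as token tuples. The toy counts and the threshold ε = 3 are illustrative only.

```python
def extract_key_phrases(freq, epsilon):
    """C-value key-phrase extraction (a sketch of Algorithm 2).

    freq maps each candidate phrase (a tuple of tokens) to F(p),
    its frequency in the monolingual corpus."""
    S = {p: 0.0 for p in freq}   # frequency of p inside longer phrases
    N = {p: 0 for p in freq}     # number of longer phrases containing p
    key_phrases = {}
    for p in sorted(freq, key=len, reverse=True):             # length-descending
        if N[p] == 0:
            c_value = (len(p) - 1) * freq[p]                  # line 4
        else:
            c_value = (len(p) - 1) * (freq[p] - S[p] / N[p])  # line 6
        if c_value >= epsilon:
            key_phrases[p] = c_value                          # lines 8-10
        for a in range(len(p)):                               # lines 11-14
            for b in range(a + 1, len(p) + 1):
                q = p[a:b]
                if q != p and q in freq:
                    S[q] += freq[p] - S[p]
                    N[q] += 1
    return key_phrases

# Toy counts (illustrative): both multiword phrases pass epsilon = 3;
# single words never do, because L(p) - 1 = 0.
freq = {("statistical", "machine", "translation"): 4,
        ("machine", "translation"): 10,
        ("machine",): 20}
kp = extract_key_phrases(freq, epsilon=3)
assert kp == {("statistical", "machine", "translation"): 8,
              ("machine", "translation"): 6.0}
```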

Input: Bilingual Corpus and KP Table from Algorithm 2
(1) for all phrases p′ in KP do
(2)   if −log(P-value(p′)) ≥ τ_F then
(3)     add p′ to P_PT
(4)   end if
(5) end for
Output: Pruned Phrase Table P_PT

Algorithm 3: Algorithm for key phrase pruning.

However, the idea can also be applied to phrase-based SMT [1, 41] perfectly. After this filtering step, a filtered phrase table, called KP, will be generated.

The phrases in the phrase table KP are all phrases possibly used in practice. However, there is still some noise in the KP table; for example, the source phrase occurring in source sentences and the target phrase occurring in target sentences may not be well matched, that is, the phrase translation is not good enough. To overcome this, it is convenient to construct a two-by-two contingency table that tabulates the sentence pairs in which the two types of matches occur and do not occur [21, 22, 43]. The calculated result is also called the P-value, as introduced in Section 2. A bilingual corpus constraint can then be added to filter out the noise phrases, as presented in Algorithm 3. After this step, a phrase table with a large pruning rate (called P_PT) will be generated.
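The P-value comes from a 2×2 contingency test on sentence-pair cooccurrence; the significance-pruning work cited here commonly uses Fisher's exact test, so the sketch below implements that one-tailed test under this assumption. The counts and the threshold τ are illustrative.

```python
from math import comb, log

def fisher_pvalue(c_st, c_s, c_t, n):
    """One-tailed Fisher's exact test on the 2x2 sentence-pair table:
    the probability of seeing c_st or more joint occurrences of source
    phrase s and target phrase t by chance, given marginal counts c_s
    and c_t out of n sentence pairs."""
    p = 0.0
    for k in range(c_st, min(c_s, c_t) + 1):
        p += comb(c_t, k) * comb(n - c_t, c_s - k) / comb(n, c_s)
    return p

def keep_pair(c_st, c_s, c_t, n, tau=15.0):
    """Algorithm 3's filter: keep a pair iff -log(P-value) >= tau."""
    return -log(fisher_pvalue(c_st, c_s, c_t, n)) >= tau

# A pair that always cooccurs is highly significant and is kept;
# a single chance cooccurrence of two frequent phrases is pruned.
assert keep_pair(10, 10, 10, 10000)
assert not keep_pair(1, 100, 100, 10000)
```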

3.3. Word Alignment & Machine Translation. Based on the discussion in Sections 3.1 and 3.2, we can find that the size of the phrase table is heavily affected by the quality of the word alignment: the better the word alignment, the smaller the size of the phrase table. Besides this, the machine translation performance is not directly decided by the size of the phrase table; at least 50% of the phrases can be removed without hurting the performance of the machine translation. Experiments show that the quality of phrase translation plays a more important role in the final translation quality in practice. Therefore, it is necessary to improve the quality of word alignment under the existing phrase extraction algorithm from word alignment.

Now it is a little easier to talk about the relationship between word alignment and machine translation. The translation of a sentence is directly derived from the phrase table, while the phrase table is extracted from the word alignment. In other words, the quality of the word alignment will affect the performance of machine translation through the phrase table. That is to say, the word alignment quality will affect the translation performance a lot. However, the works listed in [8, 18, 19] show that improved word alignment quality does not always significantly affect the translation result.

We do not agree with this conclusion. Firstly, either the training or testing corpus used in those experiments is limited to a small size, or the improvement of the word alignment quality is not significant (an AER decrease between 0.1 and 0.2). Secondly, we think it is the existing phrase extraction algorithm that leads to the inconclusive result. We have pointed out in the last part of Section 3.1 that even if the worst alignments are used, the extracted phrase table still has some ability to translate some input phrases. This is because of the shared phrases extracted from the only one


Figure 6: Moses' main modules and models. GIZA++ (aligner) and phrase extraction produce the phrase table from the bilingual corpus; SRILM/IRSTLM (LM toolkits) build the language model from the monolingual corpus; the Moses decoder uses both to translate source text into target text.

alignment link. The only-one-word alignment can already translate some input phrases or sentences, not to mention alignments with more word alignment links.

We will discuss the situation with different word alignments in the next section, where the machine translation affected by the best word alignment and the worst word alignment will be fully surveyed. After that, we can see that there are at least three advantages to improving the quality of word alignment, as follows:

(i) Better alignment will extract fewer phrase pairs andkeep a manageable size of phrase table

(ii) Better alignment will reduce the decoding time whensearching the most possible translation from thephrase table

(iii) Better alignment will produce better quality of wordor phrase level translation

4 Experiments and Discussions

In order to evaluate the conclusions drawn in the previous section, some experiments are carried out using the Moses toolkit [41], which provides a complete package of the required modules for training and generating translations of texts.

4.1. Experiment Settings. Like other SMT approaches, the Moses toolkit requires two models (as shown in Figure 6): the Language Model (LM) and the Translation Model (TM). In Moses, the LM is usually created by the external toolkits SRILM [44] or IRSTLM [45]. In our experiments, the IRSTLM toolkit is used for the large training corpus. Before training, the tokenization for all languages in the European corpus [46] uses the integrated scripts in Moses, and the segmentation for Chinese adopts the Stanford Chinese segmenter [47] with the CTB (Chinese Tree Bank) standard [48, 49]. The entire language model is trained with five-gram models.

The phrases are extracted from the results generated by GIZA++ [8]. The word alignment was trained with ten iterations of IBM Model 1 and Model 4 [34] and six iterations of the HMM alignment model [50].

To extract all the possible phrases to verify (13) and (17), the maximal phrase length is limited to 30. This can extract most

Table 5: Statistics of the first 10,000 sentences of the WMT 2006 corpus.

Languages   Ave. source len   Ave. en len
fr-en       22.89             20.15
es-en       21.71             20.94
de-en       20.55             21.42

Table 6: Statistics of test data (3064 sentences).

Languages   Ave. source len   Ave. en len
fr-en       32.95             27.81
es-en       29.94             27.81
de-en       26.92             27.81

Table 7: Statistics of CWMT 2013 + UM-Corpus.

Languages      Tokens        Average length   Vocabularies
English        152,161,233   19.37            1,655,080
Character_CE   229,110,265   29.16            397,442
CTB_CE         123,917,395   15.78            1,331,505

phrases in terms of word alignments and lets us focus on the relations among word alignment, phrase table, and machine translation.

In order to better reveal the relationships of alignment and phrase table to the translation quality, the leftmost, middle, and rightmost word alignments are extracted from the original full alignment for the worst alignment situation (i.e., only one alignment link). Here, the leftmost, middle, and rightmost word alignments indicate the word alignment links aligned to the first, middle, and end word on the source language side.

4.2. Corpus Information. Three language pairs in the WMT 2006 shared translation task are chosen as the training data: French-English (fr-en), German-English (de-en), and Spanish-English (es-en). These three language pairs are used to carry out the experiments for proving the equations deduced in Section 3 and the relationship between word alignment and machine translation. Table 3 shows the statistics of the original corpus [46]. However, the number of phrases extracted from the only one link is too large to be allowed on our server, so the first 100,000 European corpus sentences (statistics shown in Table 5) are chosen to prove the availability of the equations between word alignment and phrase table. Table 6 presents the statistics of the 3064 testing sentences from WMT 2006, with the same English sentences for all pairs.

To test our algorithm for pruning the phrase table, a large Chinese-English parallel corpus is brought in. The large corpus consists of two parts: 3,289,497 sentences are from the CWMT 2013 [51] and 4,157,556 sentences are from the UM-Corpus [52, 53]. After removing repeated and mismatched sentences in the combined two parts, 7,445,190 sentences are left; the statistics of the combined parallel corpus are presented in Table 7. Note that the statistics of the Chinese sentences are counted both at the character level (each Chinese character is treated as one token; Character_CE in Table 7) and


Table 8: Statistics of English & Chinese corpora.

Languages   Tokens          Average length   Vocabularies
English     804,584,844     24.64            2,357,374
Chinese     3,007,196,322   33.43            1,533,343

Table 9: Statistics for EC-Corpus testing data.

Languages      Tokens    Average length
English        118,802   23.76
Character_CE   153,705   30.74
CTB_CE         118,499   23.70

at the word level (segmented by the Stanford CTB segmenter, denoted CTB_CE).

The UM-Corpus also contains English, Chinese, and Portuguese monolingual corpora. In this experiment, the English and Chinese corpora are used. There are 35,938,697 English sentences from the news domain and more than 70 million Chinese sentences from Chinese portal webpages. The statistics of the two corpora are shown in Table 8.

The 5000 sentences of test data (named UM-Testing) are designed according to the domains in the UM-Corpus. There are 1500 sentences for news, 500 sentences for laws, 500 sentences for novels, 600 sentences for theses, 700 sentences for education, 600 sentences for science, and 600 for subtitles. The statistics of the testing data are shown in Table 9.

4.3. Equation of Word Alignments versus Phrases. The calculation of the number of phrases requires the full word alignment links. Full word alignment, in other words, is the best quality of word alignment. It is obviously very time-consuming to manually edit the word alignments of a large number of sentences. It has been observed in the past that generative models used for statistical word alignment create alignments of increasing quality as they are exposed to more data [19]. The intuition behind this is simple: as more cooccurrences of source and target words are observed, the word alignments become better. Following this intuition, if we wish to increase the quality of a word alignment, we allow the alignment process access to extra data, which is used only during the alignment process and then removed. If we wish to decrease the quality of a word alignment, we divide the parallel corpus into pieces, align the pieces independently of one another, and finally concatenate the results together. In our experiments, we first use the entire WMT 2006 corpus as training data, and then the first 100,000 alignments are chosen as the full word alignments.

Also, the leftmost, middle, and rightmost word alignment points based on the full word alignment are obtained from the results generated by GIZA++.

The minimal and maximal sizes of the phrase table extracted from the full and middle word alignments, together with the differential rates, are shown in Table 10. The differential rate is calculated by

R_diff-min = |N_min − N_moses-min| / N_min,  (22)

R_diff-max = |N_max − N_moses-max| / N_max,  (23)

where N_moses-min and N_moses-max indicate the minimal and maximal numbers of phrases extracted by Moses.

From Table 10, we can see that the actual number of phrases is close to the calculated result. Given the small differential rates, (13) and (17) are applicable to practical experiments. The differences may come from two reasons: (1) the assumed full word alignment produced by GIZA++ may contain too much noise; (2) equations (13) and (17) do not consider repetitions of the extracted phrases.
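A quick check of (22) and (23) against the fr-en row of Table 10 (taking the row values as given) reproduces the reported differential rates:

```python
def diff_rate(n_calc, n_moses):
    """Differential rate between the calculated phrase count and the
    count Moses actually extracts, per (22)/(23)."""
    return abs(n_calc - n_moses) / n_calc

# fr-en row of Table 10:
assert round(diff_rate(9080307, 9136283), 3) == 0.006        # R_diff-min
assert round(diff_rate(1544021509, 1541902790), 3) == 0.001  # R_diff-max
```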

4.4. Phrase Table versus Machine Translation. We have shown that there is no direct relationship between the size of the phrase table and machine translation performance. As reflected in (17) and Table 4, at least half of the phrases can be removed without hurting the performance of machine translation too much. In this part, we want to use the large Chinese-English corpus to test our pruning algorithm based on monolingual and bilingual corpora.

According to our experience, different thresholds will generate different sizes of phrase table and result in different translation performance. Therefore, one of the key steps is to choose the pruning thresholds in Algorithms 2 and 3. Firstly, a baseline threshold is calculated from the logarithm of the number of sentences in the given corpus (N), that is, log(N). For the C-value threshold, the baseline threshold is about 25 (log(3.6 × 10^7)). For the P-value threshold, the negative logarithm will be chosen: because the probabilities involved are incredibly tiny, we will work instead with the negatives of the natural logs of the probabilities. For example, instead of selecting phrase pairs with a P-value less than exp(−15), we will select phrase pairs with a negative-log-P-value greater than 15. The baseline threshold for τ_F is 45 (log(7.4 × 10^6)). Secondly, the different thresholds are increased or decreased relative to the baseline threshold. The effect of the different thresholds on the phrase table and BLEU scores is shown in Table 11.

From Table 11, we can see that nearly 98% of the phrases can be discarded using the monolingual and bilingual corpora while the translation quality does not become much worse (a decrease of only 2.35 BLEU points). Experiments show that reducing about 85% of the phrase table can generate the best translation quality, which not only again proves results similar to those of other researchers [3, 21, 32, 38] but also indicates that most phrases are not needed at all. The best translation can be achieved with about 15% to 30% of the original phrase table. That is to say, only a small fraction of the phrases are the essential elements, while the majority of the extracted phrases are spurious; in some sense, they are redundant and should be discarded.


Table 10: Phrase comparison between Moses extraction and calculation results (phrase length K = 30).

Languages   N_moses-min   N_min        R_diff-min   N_moses-max     N_max           R_diff-max
fr-en       9,136,283     9,080,307    0.006        1,541,902,790   1,544,021,509   0.001
es-en       10,958,551    10,895,266   0.006        1,401,977,185   1,489,422,170   0.058
de-en       9,219,921     8,929,095    0.033        1,606,176,665   1,686,043,482   0.047

Table 11: C-value and P-value threshold effects on the phrase table size and BLEU scores (UM-Corpus + CWMT 2013).

C-value ε   −log(P-value) τ_F   Phrase table (%)   BLEU (%)
0           0                   100                28.13
10          20                  83.6               28.09
25          45                  52.5               27.98
100         60                  29.7               28.11
200         100                 14.3               28.27
400         200                 2.02               25.78

Table 12: Translation performance on different word alignment types (BLEU, 3064 sentences).

Languages   A_full   A_lmost   A_sure   A_mid   A_rmost
fr-en       24.80    4.48      22.26    9.96    2.48
es-en       30.76    3.12      28.24    11.28   2.32
de-en       21.66    2.30      19.28    9.76    2.94

4.5. Word Alignment versus Translation Quality. In order to investigate the effect of different word alignment types and phrase table sizes on the translation performance, we selected five different types of word alignments to generate the phrase table and then measured the translation quality by the BLEU metric: the full word alignment points (A_full), the leftmost word alignment (A_lmost), the sure alignment points (A_sure), the middle word alignment point (A_mid), and the rightmost word alignment point (A_rmost).

Clearly, A_full is the best quality of word alignment, followed by A_sure, which contains only 1-to-1 alignments. A_lmost, A_mid, and A_rmost are much worse ones, each consisting of only one alignment link. As mentioned previously, all these alignment points are obtained from the results generated by GIZA++. The experiment is carried out on the WMT 2006 corpus. Based on the different alignment types, the numbers of extracted phrases and the corresponding translation BLEU scores are presented in Figure 7 and Table 12, respectively.

The histograms show that there is no direct proportional relationship between the translation performance and the number of extracted phrases. The numbers in Figure 7 show that the German-English language pair generates the most phrases, while its translation quality is not the best. Although the fewest phrases are produced from the Spanish-English language pair, the best translation performance is obtained. Explained by (17), more phrases usually indicate worse word alignment quality, or it can be said that there are too many unaligned words in the produced alignment results. The results in Figure 7 and Table 12 show the following: (1) the full word alignment contributes most to the translation performance, and there is not too much decrease when only the sure alignment is provided; (2) however, the first and last word alignments do not produce high translation quality; (3) although the middle-position word alignment extracts the most phrases and its translation quality is not the best, the middle word alignment seems to contribute more to the final translation than the leftmost and rightmost ones.

In fact, the translation quality from the only-one-link word alignment should be higher. After all, the default full word alignments, which are trained from a large training corpus without manual editing, contain too much noise.

5 Conclusions

In this paper, some investigations are made into the relationship among word alignment, phrase table, and machine translation. After deducing a formula for estimating the size of the phrase table, a maximal phrase pruning ratio can be calculated based on the average lengths of the given corpus and the word alignment positions. Finally, the translation performance based on different word alignment types is compared.

Equations (13) and (17) provide a simple way to predict the size of the phrase table in advance based on the generated word alignment points. This is beneficial for evaluating how much hardware resource is required to train the model when the size of the parallel corpus is huge.

Experimental results show that the effect on the translation performance is significant if the AER decreases or increases too much. There is no direct proportional relationship between the translation performance and the number of extracted phrases. In other words, most of the phrases are not needed at all; the phrases extracted from the full word alignment alone are enough. It can be seen from our corpus-motivated pruning method that only 15–30% of the phrases are needed for the basic translations, and most phrases in the phrase table are noise to the final translation.

Under the existing phrase extraction algorithm from word alignment, better word alignment will be very helpful to the final translation. There are at least three advantages of a better quality of word alignment, even without improving the existing phrase extraction algorithm, as follows:

(i) Better alignment will extract fewer phrase pairs andkeep a manageable size of phrase table


Figure 7: Number of phrases extracted from different word alignment points and their BLEU scores. Panels (a)-(c) show the fr-en, es-en, and de-en phrase counts for the alignment types A_lmost, A_sure, A_mid, A_full, and A_rmost; panels (d)-(f) show the corresponding BLEU scores.

(ii) Better alignment will reduce the decoding time whensearching the most possible translation from thephrase table

(iii) Better alignment will produce better quality of wordor phrase level translation

The next work we want to focus on is to add the corpus-motivated pruning technique into the translation models; that is, the phrases will be filtered during training time, not after generating the phrase table.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank all reviewers for their very careful reading and helpful suggestions. The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for the funding support of their research under Reference nos. MYRG076 (Y1-L2)-FST13-WF and MYRG070 (Y1-L2)-FST12-CS.

References

[1] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol. 1, pp. 48–54, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[2] R. Zens, F. J. Och, and H. Ney, Phrase-Based Statistical Machine Translation, Springer, Berlin, Germany, 2002.

[3] J. H. Shin, Y. S. Han, and K. S. Choi, "Bilingual knowledge acquisition from Korean-English parallel corpus using alignment method: Korean-English alignment at word and phrase level," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 230–235, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.

[4] K. Yamamoto, T. Kudo, Y. Tsuboi, and Y. Matsumoto, "Learning sequence-to-sequence correspondences from parallel corpora via sequential pattern mining," in Proceedings of the HLT-NAACL Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, vol. 3, pp. 73–80, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[5] C. Goutte, K. Yamada, and E. Gaussier, "Aligning words using matrix factorisation," in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2004.

[6] F. J. Och, C. Tillmann, and H. Ney, Improved Alignment Models for Statistical Machine Translation, University of Maryland, College Park, Md, USA, 1999.

[7] F. J. Och and H. Ney, "A comparison of alignment models for statistical machine translation," in Proceedings of the 18th Conference on Computational Linguistics, vol. 2, pp. 1086–1090, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2000.

12 The Scientific World Journal

[8] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.

[9] F. J. Och and H. Ney, "The alignment template approach to statistical machine translation," Computational Linguistics, vol. 30, no. 4, pp. 417–449, 2004.

[10] N. Duan, M. Li, and M. Zhou, "Improving phrase extraction via MBR phrase scoring and pruning," in Proceedings of the MT Summit XIII, pp. 189–197, 2011.

[11] G. Neubig, T. Watanabe, E. Sumita, S. Mori, and T. Kawahara, "An unsupervised model for joint phrase alignment and extraction," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 632–641, Stroudsburg, Pa, USA, June 2011.

[12] H. Setiawan, H. Li, and M. Zhang, "Learning phrase translation using level of detail approach," in Proceedings of the Tenth Machine Translation Summit (MT-Summit X), 2005.

[13] C. Tillmann, "A projection extension algorithm for statistical machine translation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[14] B. Zhao and S. Vogel, "A generalized alignment-free phrase extraction," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 141–144, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.

[15] Y. Zhang and S. Vogel, "Competitive grouping in integrated phrase segmentation and alignment model," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 159–162, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.

[16] A. Lopez, "Statistical machine translation," ACM Computing Surveys, vol. 40, no. 3, article 8, 2008.

[17] K. Ganchev, J. V. Graça, and B. Taskar, "Better alignments = better translations?" in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 986–993, June 2008.

[18] D. Vilar, M. Popovic, and H. Ney, "AER: do we need to 'improve' our alignments?" in Proceedings of the International Workshop on Spoken Language Translation, pp. 205–212, 2006.

[19] A. Fraser and D. Marcu, "Measuring word alignment quality for statistical machine translation," Computational Linguistics, vol. 33, no. 3, pp. 293–303, 2007.

[20] P. Lambert, S. Petitrenaud, Y. Ma, and A. Way, "What types of word alignment improve statistical machine translation?" Machine Translation, vol. 26, no. 4, pp. 289–323, 2012.

[21] J. H. Johnson, J. Martin, G. Foster, and R. Kuhn, "Improving translation quality by discarding most of the phrasetable," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 967–975, June 2007.

[22] J. H. Johnson, Conditional Significance Pruning: Discarding More of Huge Phrase Tables, 2012.

[23] M. Yang and J. Zheng, "Toward smaller, faster, and better hierarchical phrase-based SMT," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 237–240, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.

[24] N. Tomeh, N. Cancedda, and M. Dymetman, "Complexity-based phrase-table filtering for statistical machine translation," in Proceedings of the MT Summit XII, 2009.

[25] M. Eck, S. Vogel, and A. Waibel, "Translation model pruning via usage statistics for statistical machine translation," in Proceedings of the Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '07), pp. 21–24, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.

[26] M. Eck, S. Vogel, and A. Waibel, "Estimating phrase pair relevance for machine translation pruning," in Proceedings of the MT Summit XI, pp. 159–165, 2007.

[27] Y. Chen, A. Eisele, and M. Kay, "Improving statistical machine translation efficiency by triangulation," in Proceedings of the Sixth International Conference on Language Resources and Evaluation, 2008.

[28] Y. Chen, M. Kay, and A. Eisele, "Intersecting multilingual data for faster and better statistical translations," in Proceedings of the Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '09), pp. 128–136, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2009.

[29] G. Sanchis-Trilles, D. Ortiz-Martínez, J. González-Rubio, J. González, and F. Casacuberta, "Bilingual segmentation for phrasetable pruning in statistical machine translation," in Proceedings of the 15th International Conference of the European Association for Machine Translation (EAMT '11), pp. 257–264, May 2011.

[30] N. Tomeh, M. Turchi, G. Wisniewski, A. Allauzen, and F. Yvon, "How good are your phrases? Assessing phrase quality with single class classification," in Proceedings of the International Workshop on Spoken Language Translation, pp. 261–268, 2011.

[31] W. Ling, J. Graça, I. Trancoso, and A. Black, "Entropy-based pruning for phrase-based machine translation," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 962–971, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[32] R. Zens, D. Stanton, and P. Xu, "A systematic comparison of phrase table pruning techniques," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 972–983, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[33] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pp. 311–318, 2002.

[34] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Computational Linguistics, vol. 19, no. 2, pp. 263–311, 1993.

[35] P. Koehn, Statistical Machine Translation, Cambridge University Press, New York, NY, USA, 1st edition, 2010.

[36] E. Galbrun, "Phrase table pruning for statistical machine translation," Department of Computer Science Series of Publications C, C-2009-22, University of Helsinki, 2009.


[37] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, USA, 2006.

[38] Z. He, Y. Meng, Y. Lu, H. Yu, and Q. Liu, "Reducing SMT rule table with monolingual key phrase," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 121–124, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.

[39] Z. He, Y. Meng, and H. Yu, "Discarding monotone composed rule for hierarchical phrase-based statistical machine translation," in Proceedings of the 3rd International Universal Communication Symposium (IUCS '09), pp. 25–29, New York, NY, USA, December 2009.

[40] K. T. Frantzi and S. Ananiadou, "Extracting nested collocations," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 41–46, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.

[41] P. Koehn, H. Hoang, A. Birch et al., "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177–180, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.

[42] D. Chiang, "A hierarchical phrase-based model for statistical machine translation," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 263–270, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2005.

[43] W. A. Gale and K. W. Church, "Identifying word correspondences in parallel texts," in Proceedings of the Workshop on Speech and Natural Language, pp. 152–157, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1991.

[44] A. Stolcke, "SRILM—an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing, pp. 901–904, 2002.

[45] M. Federico, N. Bertoldi, and M. Cettolo, "IRSTLM: an open source toolkit for handling large scale language models," in Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH '08), pp. 1618–1621, September 2008.

[46] P. Koehn, "Europarl: a parallel corpus for statistical machine translation," in Proceedings of the MT Summit X, 2005.

[47] S. Green and J. DeNero, "A class-based agreement model for generating accurately inflected translations," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, vol. 1, pp. 146–155, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[48] N. Xue, F. D. Chiou, and M. Palmer, "Building a large-scale annotated Chinese corpus," in Proceedings of the 19th International Conference on Computational Linguistics, vol. 1, pp. 1–8, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2002.

[49] N. Xue, F. Xia, F.-D. Chiou, and M. Palmer, "The Penn Chinese TreeBank: phrase structure annotation of a large corpus," Natural Language Engineering, vol. 11, no. 2, pp. 207–238, 2005.

[50] S. Vogel, H. Ney, and C. Tillmann, "HMM-based word alignment in statistical translation," in Proceedings of the 16th International Conference on Computational Linguistics, pp. 836–841, 1996.

[51] CWMT 2013, http://www.liip.cn/cwmt2013/.

[52] L. Tian, F. Wong, and S. Chao, "An improvement of translation quality with adding key-words in parallel corpus," in Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC '10), pp. 1273–1278, Qingdao, China, July 2010.

[53] L. Tian, D. F. Wong, L. S. Chao et al., "UM-Corpus: a large English-Chinese parallel corpus for statistical machine translation," in Proceedings of the 9th International Conference on Language Resources and Evaluation, 2014.


[Figure 5: Sentences $e_1 e_2 \cdots e_i \cdots e_I$ and $f_1 f_2 f_3 \cdots f_j \cdots f_J$ with a single alignment link.]

Based on (15) and (16), the total number of phrase pairs $N_{\max}$ is given by

$$N_{\max} = [i(I - i) + i]\,[j(J - j) + j], \quad (17)$$

where $0 \le i \le I - 1$ and $0 \le j \le J - 1$. Obviously, (15) and (16) accord with a symmetric distribution, and the maximal value is achieved at the middle position. Given a sentence consisting of $L$ words, a formal description of the middle position in the sentence is

$$\mathrm{word}_{\mathrm{mid}} = \frac{1 + L_{\mathrm{odd}}}{2}, \qquad \frac{L_{\mathrm{even}}}{2} \ \text{or}\ \left(\frac{L_{\mathrm{even}}}{2} + 1\right), \quad (18)$$

that is, position $(1 + L)/2$ when the length $L$ is odd, and $L/2$ or $L/2 + 1$ when it is even.

We can see that (17) attains its maximal value at the middle position of a given sentence; that is, the nearer an alignment link is to the middle word, the more phrase pairs can be produced. Take Figure 2 as an example: there are 18 phrases based on the full word alignment, while 240 phrases will be extracted when only the fifth word (do) in the source sentence aligns to the third word (做, zuo) on the target side. From the viewpoint of phrase table size, the value calculated from (17) is much larger than that from (13). This shows that the better the quality of the word alignment, the fewer phrases there will be.
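The closed form in (17) can be checked by brute force: with a single link between source position $i$ and target position $j$ (1-indexed here), every source span containing $i$ may pair with every target span containing $j$. A minimal sketch; the sentence lengths $I = 8$ and $J = 6$ used to reproduce the 240-phrase example are an assumption, since Figure 2 is not reproduced in this extraction:

```python
def phrases_from_single_link(I, J, i, j):
    """Count phrase pairs consistent with the single link e_i <-> f_j (1-indexed)."""
    count = 0
    for s in range(1, I + 1):              # source span [s, e] must contain i
        for e in range(s, I + 1):
            if not (s <= i <= e):
                continue
            for t in range(1, J + 1):      # target span [t, u] must contain j
                for u in range(t, J + 1):
                    if t <= j <= u:
                        count += 1
    return count

def formula_17(I, J, i, j):
    """The paper's closed form [i(I - i) + i][j(J - j) + j]."""
    return (i * (I - i) + i) * (j * (J - j) + j)

# Exhaustive check of (17) against brute-force enumeration.
for I, J in [(8, 6), (5, 5)]:
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            assert phrases_from_single_link(I, J, i, j) == formula_17(I, J, i, j)

print(formula_17(8, 6, 5, 3))  # 240
```

With 1-indexed positions the number of source spans containing $i$ is $i(I - i + 1)$, which is exactly $i(I - i) + i$, so the brute-force count and the closed form agree.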

Note that some shared phrases are extracted from both the full word alignment and the single word alignment, which leads to the phenomenon that some translation ability remains even with only one alignment point under the existing phrase extraction algorithm. For instance, the phrase pairs $(e_1 e_2 \cdots e_i \mid f_1 f_2 f_3 \cdots f_j)$, $(e_2 \cdots e_i \mid f_2 f_3 \cdots f_j)$, and $(e_i \mid f_j)$ are the shared ones; given a phrase such as $f_1 f_2 f_3 \cdots f_j$, the phrase tables generated from both alignment types can possibly translate it correctly to $e_1 e_2 \cdots e_i$.

3.2. Phrase Table and Machine Translation. Nowadays, the key step in phrase-based statistical machine translation (SMT) involves inferring a large table of phrase pairs that are translations of each other from a large corpus of aligned sentences. The set of all phrase pairs, together with estimates of conditional probabilities and other useful features, is called the Phrase Table (PT), which is applied during the decoding process [29].
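As a toy illustration of this data structure (the entries, feature names, and scores below are invented, not taken from any real system), a PT maps a source phrase to scored candidate translations:

```python
# Toy phrase table: source phrase -> list of (target phrase, feature scores).
# All entries and scores here are invented for illustration.
phrase_table = {
    ("machine", "translation"): [
        (("traduction", "automatique"),
         {"p_tgt_given_src": 0.71, "p_src_given_tgt": 0.64}),
        (("traduction",),
         {"p_tgt_given_src": 0.05, "p_src_given_tgt": 0.02}),
    ],
}

def best_translation(src):
    """Pick the candidate with the highest direct translation probability."""
    candidates = phrase_table.get(src, [])
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1]["p_tgt_given_src"])[0]

print(best_translation(("machine", "translation")))  # ('traduction', 'automatique')
```

A real decoder combines several such features log-linearly rather than taking a single maximum; this sketch only shows the table's shape.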

Nevertheless, phrases in the PT depend heavily on the system that uses them. Some phrases might be present in the PT but never be used in translations, for example, because they are either ranked too low or erroneous translations. In other words, there are too many redundant phrases in the PT. This situation has led many researchers to propose different techniques to reduce the size of the phrase table. This includes

Table 2: Pruned phrase table and translation quality.

Pruning methods      Languages   Pruning rates (%)   BLEU (Δ)
Count/significant    fr-en       64.6∼85.9           −0.1
                     es-en       67.9∼88.1
                     de-en       62.7∼84.1
Entropy              fr-en       85∼95               −1.0
                     es-en       85∼95
                     de-en       85∼95
Bilingual-segment    fr-en       —                   −1.2
                     es-en       98
                     de-en       94

the methods introduced in Section 2. These works can be treated as research on the relationship between the phrase table and machine translation.

All the works presented in Section 2 show that comparable translation results can be achieved from far smaller sets of phrase pairs, with a severalfold increase in translation speed and little or no loss in translation accuracy. As reported by [32], count-based and significance-based pruning techniques can result in large savings, between 70% and 90%, while the entropy-based pruning approach can consistently reduce the entries of the phrase table by 85% to 95%. In [29], they even pointed out that the pruning rate can reach 98% using a bilingual segmentation method with finite state transducers. More details about the pruning rates and the related translation qualities (measured by BLEU [33]) are shown in Table 2, where the BLEU column records the maximal decrease (minus) from the 100% size of the phrase table to the maximal pruning rate of the phrase table.

The pruning rates presented above are counted from empirical or experimental results. We can also estimate the pruning rate from a theoretical aspect. In practice, we notice that the phrases extracted from the middle word alignment point include most of the phrases that are extracted from the full word alignment. All the phrases excluding the ones extracted from the full word alignment can be removed. In other words, a rough pruning rate can be calculated from the ratio of the phrases extracted from the best alignment ($N_{\min}$) and the worst alignment ($N_{\max}$):

119877pruned = 1 minus119873min-119905119873max

(19)

If we want to quickly estimate what percentage of phrases can be pruned in a given corpus, the average statistics of the given corpus can be used, as in (20). This equation shows that the pruning rate is decided not only by the sentence length in a parallel corpus but also by the alignment position. Consider

119877pruned = 1 minus119868average (119868average + 1) 2

[119894 (119868average minus 119894) + 119894] [119895 (119869average minus 119895) + 119895] (20)

In reality, the numerator tends to be bigger and the denominator smaller, so the actual pruning rate will be smaller


Start from $e_1$: $N_{\text{phrases}} = I - i + 1$.
Start from $e_2$: $N_{\text{phrases}} = I - i + 1$.
Similarly, for every start position up to $e_i$, and likewise on the target side: $N_{\text{phrases}} = I - i + 1$.

Algorithm 1: Phrase extraction based on the only one alignment point $e_i \leftrightarrow f_j$.

Table 3: Training data statistics, WMT 2006.

Language pairs   Sentences   Ave. source len.   Ave. target len.
fr-en            688,031     22.67              20.07
es-en            730,740     21.52              20.83
de-en            751,088     20.31              21.37

than the calculated value. When the alignment position is in the middle (middle alignment), the maximal pruning rate is obtained, as shown in (21). Based on this equation, we can estimate the range of pruning rates for different word alignment positions in the Europarl corpus [46]. Table 3 shows the statistics of WMT 2006, and Table 4 compares the pruning rates reported in researchers' papers [22, 32] with the calculated ranges based on (20), from which we can see that the calculated result is very close to the experimental results presented in [32]. The authors concluded their findings thus: "the novel entropy-based pruning often achieves the same Bleu score with only half the number of phrases." A very bold assumption is that most phrases (at least half) can be removed from the originally extracted phrases. Consider

119877max = 1 minus119868average (119868average + 1)

2

times ([119894mid (119868average minus 119894mid) + 119894mid]

times [119895mid (119869average minus 119895mid) + 119895mid])minus1

(21)

We have seen that the phrase table pruning rate is strongly affected by the lengths and word positions in the source and target sentences. As reflected in the pruning equation (20), the phrase table might be more accurately pruned based on some features (e.g., phrase length, phrase cooccurrence) in the bilingual and monolingual corpora. The researchers in [27] reduced the phrase table based on significance testing of phrase pair cooccurrence in a bilingual corpus, while He et al. [38, 39] pruned the phrase pairs in terms of phrase frequency on the source side, using key phrases extracted from a monolingual corpus. We want to make full use of the advantages of the two approaches: the key phrases can be used to check whether the extracted phrases are often used in practice, and the significance testing can be used to measure the quality of the extracted phrase pairs. Our method considers features of both the source and the target phrases.

Table 4: Comparison between theoretical estimation results and other pruning techniques' results.

Language pairs   j_mid    i_mid    R_pruned (%)
fr-en            12.34    11.04    53.5∼98.6
es-en            11.26    11.42    49.3∼98.4
de-en            11.16    11.19    44.9∼98.3

The basic metric for phrases that are often used in practice is the frequency with which a phrase appears in a monolingual corpus. The more frequently a phrase appears in a corpus, the greater the possibility that the phrase may be used. However, longer phrase pairs will be removed in this way, because those phrases rarely appear in a monolingual corpus.

The C-value is a measure for automatic term recognition which can be used to solve this problem, as described in [40]. The method not only fully considers frequency information, such as the frequencies of a phrase and of a subphrase appearing in longer phrases, but is also an efficient algorithm.

In this algorithm (as shown in Algorithm 2), 4 factors (L, F, S, N) can be used to determine if a phrase p is a key phrase [38]:

(i) L(p): the length of p;
(ii) F(p): the frequency with which p appears in a corpus;
(iii) S(p): the frequency with which p appears as a substring in other longer phrases;
(iv) N(p): the number of phrases that contain p as a substring.

Given a monolingual corpus, key phrases can be extracted efficiently according to Algorithm 2. Firstly (line 1), all possible phrases are extracted as candidates for key phrases (e.g., the phrase table extracted by Moses [41]). Secondly (lines 3 to 7), for each candidate phrase, the C-value is computed according to whether the phrase appears by itself (line 4) or as a substring of other longer phrases (line 6). The C-value is in direct proportion to the phrase length (L) and occurrences (F, S) but in inverse proportion to the number of phrases that contain the phrase as a substring (N). This overcomes the limitations of the pure frequency measure. A phrase is regarded as a key phrase if its C-value is greater than a threshold ε. Finally (lines 11 to 14), S(q) and N(q) are updated for each substring q.

The algorithm was originally used for pruning the rule table extracted by hierarchical phrase-based SMT [42].


Input: Monolingual Corpus
(1) Extract candidate phrases
(2) for all phrases p in length-descending order do
(3)     if N(p) = 0 then
(4)         C-value = (L(p) − 1) × F(p)
(5)     else
(6)         C-value = (L(p) − 1) × (F(p) − S(p)/N(p))
(7)     end if
(8)     if C-value ≥ ε then
(9)         add p to KP
(10)    end if
(11)    for all substrings q of p do
(12)        S(q) = S(q) + F(p) − S(p)
(13)        N(q) = N(q) + 1
(14)    end for
(15) end for
Output: Key Phrase Table KP

Algorithm 2: Key phrase extraction from monolingual corpus.
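A runnable sketch of Algorithm 2; the phrase representation (tuples of words), the toy frequencies, and the threshold value are assumptions for illustration:

```python
from collections import defaultdict

def key_phrases(phrases_with_freq, eps=1.0):
    """C-value based key phrase extraction (Algorithm 2).
    phrases_with_freq maps a phrase (tuple of words) to its corpus frequency F(p)."""
    S = defaultdict(float)  # S(q): frequency of q as a substring of longer phrases
    N = defaultdict(int)    # N(q): number of longer phrases containing q
    kp = set()
    for p in sorted(phrases_with_freq, key=len, reverse=True):
        F = phrases_with_freq[p]
        if N[p] == 0:
            c_value = (len(p) - 1) * F                      # line 4
        else:
            c_value = (len(p) - 1) * (F - S[p] / N[p])      # line 6
        if c_value >= eps:                                  # line 8
            kp.add(p)
        # lines 11-14: update S and N for every proper contiguous sub-phrase q of p
        for i in range(len(p)):
            for j in range(i + 1, len(p) + 1):
                q = p[i:j]
                if q != p:
                    S[q] += F - S[p]
                    N[q] += 1
    return kp

phrases = {("machine", "translation"): 10,
           ("statistical", "machine", "translation"): 4,
           ("translation",): 30}
print(sorted(key_phrases(phrases)))
# [('machine', 'translation'), ('statistical', 'machine', 'translation')]
```

Note how the single word "translation" is filtered out despite its high frequency: its length factor L(p) − 1 is zero, which is exactly the bias against trivially frequent short phrases described above.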

Input: Bilingual Corpus and KP Table from Algorithm 2
(1) for all phrases p′ in KP do
(2)     if −log(P-value(p′)) ≤ τ_F then
(3)         add p′ to P_PT
(4)     end if
(5) end for
Output: Pruned Phrase Table P_PT

Algorithm 3: Algorithm for key phrase pruning.

However, the idea can also be used perfectly well for phrase-based SMT [1, 41]. After this filtering step, a filtered phrase table, called KP, is generated.

The phrases in the phrase table KP are all phrases possibly used in practice. However, there is still some noise in the KP table; for example, the source phrase occurring in source sentences and the target phrase occurring in target sentences may not be well matched, that is, the phrase translation is not good enough. To overcome this, it is convenient to construct a two-by-two contingency table that tabulates the sentence pairs where the two types of matches occur and do not occur [21, 22, 43]. The calculation result is also called the P-value, as introduced in Section 2. A bilingual corpus constraint can then be added to filter the noise phrases, as presented in Algorithm 3. After this step, a phrase table with a large pruning rate (called P_PT) is generated.

3.3. Word Alignment and Machine Translation. Based on the discussion in Sections 3.1 and 3.2, we can find that the size of the phrase table is heavily affected by the quality of word alignment: the better the word alignment, the smaller the phrase table. Besides this, the machine translation performance is not directly decided by the size of the phrase table; at least 50% of phrases can be removed without hurting the

performance of the machine translation. Experiments show that the quality of phrase translation plays a more important role in the final translation quality in practice. Therefore, it is necessary to improve the quality of word alignment under the existing phrase extraction algorithm from word alignment.

Now it is a little easier to talk about the relationship between word alignment and machine translation. The translation of a sentence is directly derived from the phrase table, while the phrase table is extracted from the word alignment. In other words, the quality of word alignment affects the performance of machine translation through the phrase table; that is to say, the word alignment quality should affect the translation performance a lot. However, the works listed in [8, 18, 19] show that improved word alignment quality does not always significantly affect the translation result.

We do not agree with this conclusion. Firstly, either the training or testing corpus used in their experiments is limited to a small size, or the improvement of the word alignment quality is not significant (AER decrease between 0.1 and 0.2). Secondly, we think it is the existing phrase extraction algorithm that leads to the inconclusive result. We have pointed out in the last part of Section 3.1 that even if the worst alignments are used, the extracted phrase table still has some ability to translate some input phrases. This is because of the shared phrases extracted from the only one


[Figure 6: Moses' main modules and models. GIZA++ (aligner) and phrase extraction build the phrase table from the bilingual corpus; SRILM/IRSTLM (LM toolkits) build the language model from the monolingual corpus; the Moses decoder uses both to translate source text into target text.]

alignment link. Even a single word alignment link can translate some input phrases or sentences, not to mention more word alignment links.

We will discuss the situation with different word alignments in the next section, where the machine translation affected by the best and the worst word alignments will be fully surveyed. After that, we can see that there are at least three advantages to improving the quality of word alignment, as follows.

(i) Better alignment will extract fewer phrase pairs and keep the phrase table at a manageable size.

(ii) Better alignment will reduce the decoding time when searching for the most probable translation in the phrase table.

(iii) Better alignment will produce better quality word- or phrase-level translations.

4. Experiments and Discussions

In order to evaluate the conclusions drawn in the previous section, some experiments are carried out using the Moses toolkit [41], which provides a complete package of the modules required for training and generating translations of texts.

4.1. Experiment Settings. Like other SMT approaches, the Moses toolkit requires two models (as shown in Figure 6): the Language Model (LM) and the Translation Model (TM). In Moses, the LM is usually created by the external toolkits SRILM [44] or IRSTLM [45]. In our experiment, the IRSTLM toolkit is used for the large training corpus. Before training, all tokenization for the languages in the Europarl corpus [46] uses the integrated scripts in Moses, and the segmentation for Chinese adopts the Stanford Chinese segmenter [47] using the CTB (Chinese Tree Bank) standard [48, 49]. The entire language model is trained with five-gram models.

The phrases are extracted from the results generated by GIZA++ [8]. The word alignment was trained with ten iterations of IBM model 1 and model 4 [34] and six iterations of the HMM alignment model [50].

To extract all the possible phrases needed to verify (13) and (17), the maximal phrase length is limited to 30. This can extract most

Table 5: Statistics of the first 100,000 WMT 2006 corpus.

Languages   Ave. source len.   Ave. en len.
fr-en       22.89              20.15
es-en       21.71              20.94
de-en       20.55              21.42

Table 6: Statistics of test data (3064 sentences).

Languages   Ave. source len.   Ave. en len.
fr-en       32.95              27.81
es-en       29.94              27.81
de-en       26.92              27.81

Table 7: Statistics of CWMT 2013 + UM-Corpus.

Languages     Tokens        Average length   Vocabulary
English       152,161,233   19.37            1,655,080
CharacterCE   229,110,265   29.16            397,442
CTBCE         123,917,395   15.78            1,331,505

phrases in terms of word alignments and lets us focus on the relations among word alignment, phrase table, and machine translation.

In order to better reveal the relationships of alignment and phrase table to the translation quality, the leftmost, middle, and rightmost word alignments are extracted from the original full alignment in the worst alignment situation (i.e., only one alignment link). Here, the leftmost, middle, and rightmost word alignments indicate the word alignment link aligned to the first, middle, and end word on the source language side.
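The selection of those single links can be sketched as follows; the representation of an alignment as a set of 1-indexed (source, target) pairs and the tie-breaking for source words with several links are assumptions for illustration:

```python
def single_links(alignment):
    """From a full alignment (set of (src, tgt) pairs, 1-indexed), keep only the
    link whose source word is leftmost, middle, or rightmost."""
    src_positions = sorted({s for s, _ in alignment})
    picks = {}
    for name, s in (("leftmost", src_positions[0]),
                    ("middle", src_positions[len(src_positions) // 2]),
                    ("rightmost", src_positions[-1])):
        # if the chosen source word has several links, keep its first target position
        t = min(t for s2, t in alignment if s2 == s)
        picks[name] = (s, t)
    return picks

align = {(1, 1), (2, 3), (3, 2), (4, 4), (5, 5)}
print(single_links(align))
# {'leftmost': (1, 1), 'middle': (3, 2), 'rightmost': (5, 5)}
```

Each of the three reduced alignments then feeds the same phrase extraction pipeline, so the resulting phrase tables differ only in the position of the surviving link.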

4.2. Corpus Information. Three language pairs in the WMT 2006 shared translation task are chosen as the training data: French-English (fr-en), German-English (de-en), and Spanish-English (es-en). These three language pairs are used to carry out the experiments for proving the equations deduced in Section 3 and the relationship between word alignment and machine translation. Table 3 shows the statistics of the original corpus [46]. However, the number of phrases extracted from only one link is too large to be handled on our server, so the first 100,000 Europarl sentences (statistics shown in Table 5) are chosen to prove the availability of the equations between word alignment and phrase table. Table 6 presents the statistics of the 3064 test sentences from WMT 2006, which share the same English sentences.

To test our algorithm for pruning the phrase table, a large Chinese-English parallel corpus is brought in. The large corpus consists of two parts: 3,289,497 sentences from CWMT 2013 [51] and 4,157,556 sentences from UM-Corpus [52, 53]. After removing repeated and mismatched sentences from the two combined parts, 7,445,190 sentences are left; the statistics of the combined parallel corpus are presented in Table 7. Note that the statistics of the Chinese sentences are counted both at character level (each Chinese character treated as one token; CharacterCE in Table 7) and at

The Scientific World Journal 9

Table 8: Statistics of English & Chinese corpora.

Languages   Tokens           Average length   Vocabularies
English     804,584,844      24.64            2,357,374
Chinese     3,007,196,322    33.43            1,533,343

Table 9: Statistics for EC-Corpus testing data.

Languages     Tokens    Average length
English       118,802   23.76
CharacterCE   153,705   30.74
CTBCE         118,499   23.70

at the word level (segmented by the Stanford CTB segmenter, denoted CTBCE).

The UM-Corpus also contains English, Chinese, and Portuguese monolingual corpora. In this experiment, the English and Chinese corpora are used. There are 35,938,697 English sentences from the news domain and more than 70 million Chinese sentences from Chinese portal webpages. The statistics of the two corpora are shown in Table 8.

The 5,000 sentences of test data (named UM-Testing) are selected according to the domains in the UM-Corpus: 1,500 sentences for news, 500 for laws, 500 for novels, 600 for theses, 700 for education, 600 for science, and 600 for subtitles. The statistics of the testing data are shown in Table 9.

4.3. Equation of Word Alignments versus Phrases. The calculation of the number of phrases requires the full word alignment links. Full word alignment, in other words, is the best-quality word alignment. It is obviously very time-consuming to manually edit the word alignments of a large number of sentences. It has been observed in the past that the generative models used for statistical word alignment create alignments of increasing quality as they are exposed to more data [19]. The intuition behind this is simple: as more cooccurrences of source and target words are observed, the word alignments become better. Following this intuition, if we wish to increase the quality of a word alignment, we allow the alignment process access to extra data, which is used only during the alignment process and then removed. If we wish to decrease the quality of a word alignment, we divide the parallel corpus into pieces, align the pieces independently of one another, and finally concatenate the results. In our experiments, we first use the entire WMT 2006 corpus as training data, and then the first 100,000 alignments are chosen as the full word alignments.
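The degrade-by-splitting procedure can be sketched as below (a simplified illustration; `align` is a placeholder for an actual run of a word aligner such as GIZA++ on one piece, which the paper does not show as code):

```python
def align(piece):
    """Placeholder for running a word aligner (e.g., GIZA++) on one
    corpus piece; here it merely tags each sentence pair."""
    return [(src, tgt, "aligned") for src, tgt in piece]

def degraded_alignment(parallel_corpus, n_pieces):
    """Split the corpus into contiguous pieces, align each piece
    independently, and concatenate the results. Smaller pieces expose
    fewer cooccurrences to the aligner, so alignment quality drops."""
    size = -(-len(parallel_corpus) // n_pieces)   # ceiling division
    pieces = [parallel_corpus[i:i + size]
              for i in range(0, len(parallel_corpus), size)]
    results = []
    for piece in pieces:
        results.extend(align(piece))
    return results

corpus = [("le chat", "the cat"), ("le chien", "the dog"),
          ("un chat", "a cat"), ("un chien", "a dog")]
out = degraded_alignment(corpus, 2)
# The concatenated output covers the whole corpus in the original order.
```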

Also, the leftmost, middle, and rightmost word alignment points based on the full word alignment are obtained from the results generated by GIZA++.

The minimal and maximal sizes of the phrase table extracted from the full and middle word alignments, together with the differential rates, are shown in Table 10. The differential rate is calculated by

R_diff-min = |N_min − N_moses-min| / N_min,   (22)

R_diff-max = |N_max − N_moses-max| / N_max,   (23)

where N_moses-min and N_moses-max indicate the minimal and maximal numbers of phrases extracted by Moses.
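As a quick sanity check of (22) and (23) against the fr-en row of Table 10 (numbers taken directly from that table):

```python
def differential_rate(n_calc, n_moses):
    """R_diff = |N_calc - N_moses| / N_calc, as in (22) and (23)."""
    return abs(n_calc - n_moses) / n_calc

# fr-en, minimal phrase table (Table 10):
r_min = differential_rate(9080307, 9136283)
# fr-en, maximal phrase table:
r_max = differential_rate(1544021509, 1541902790)
# round(r_min, 3) == 0.006 and round(r_max, 3) == 0.001,
# matching the R_diff columns of Table 10.
```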

From Table 10 we can see that the actual number of phrases is close to the calculated result. Given the small differential rates, (13) and (17) are applicable to practical experiments. The differences may come from two sources: (1) the assumed full word alignment produced by GIZA++ may contain too much noise; (2) equations (13) and (17) do not consider repetitions of the extracted phrases.

4.4. Phrase Table versus Machine Translation. We have shown that there is no direct relationship between the size of the phrase table and machine translation performance. As reflected in (17) and Table 4, at least half of the phrases can be removed without hurting the performance of machine translation too much. In this part, we use the large Chinese-English corpus to test our pruning algorithm based on monolingual and bilingual corpora.

According to our experience, different thresholds generate phrase tables of different sizes and result in different translation performance. Therefore, one of the key steps is choosing the pruning thresholds in Algorithms 2 and 3. Firstly, a baseline threshold is calculated from the logarithm of the number of sentences in the given corpus (N), that is, log(N). For the C-value threshold, the baseline is about 25 (log(3.6 × 10^7)). For the P-value threshold, the negative logarithm is chosen: because the probabilities involved are incredibly tiny, we work instead with the negative of the natural logs of the probabilities. For example, instead of selecting phrase pairs with a P-value less than exp(−15), we select phrase pairs with a negative-log-P-value greater than 15. The baseline threshold for τ_F is 45 (log(7.4 × 10^6)). Secondly, thresholds above and below the baseline are tested. Their effect on the phrase table and BLEU scores is shown in Table 11.
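The switch to negative log probabilities can be illustrated numerically (a generic sketch, not the paper's code):

```python
import math

P_THRESHOLD_NEG_LOG = 15        # i.e., keep pairs with P-value < exp(-15)

def keep_pair(p_value):
    """Selecting P-value < exp(-15) is equivalent to selecting
    -log(P-value) > 15; the log-domain form avoids floating-point
    underflow for the extremely small probabilities involved."""
    return -math.log(p_value) > P_THRESHOLD_NEG_LOG

# exp(-20) is far below exp(-15), so this pair is kept:
assert keep_pair(math.exp(-20))
# exp(-10) is above exp(-15), so this pair is pruned:
assert not keep_pair(math.exp(-10))
```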

From Table 11 we can see that nearly 98% of the phrases can be discarded with the monolingual and bilingual corpora while the translation quality does not become much worse (a decrease of only 2.35 BLEU points). The experiments show that the phrase table can be reduced by about 98.5% while still generating the best translation quality, which not only confirms results similar to those of other researchers [3, 21, 32, 38] but also indicates that most phrases are not needed at all. The best translation is achieved at about 1.5% to 3.0% of the original phrase table size. That is to say, only a small fraction of the phrases are essential, while the majority of the extracted phrases are spurious; in some sense they are redundant and should be discarded.


Table 10: Phrase comparison between Moses extraction and calculation result (phrase length K = 30).

Languages   N_moses-min   N_min        R_diff-min   N_moses-max     N_max           R_diff-max
fr-en       9,136,283     9,080,307    0.006        1,541,902,790   1,544,021,509   0.001
es-en       10,958,551    10,895,266   0.006        1,401,977,185   1,489,422,170   0.058
de-en       9,219,921     8,929,095    0.033        1,606,176,665   1,686,043,482   0.047

Table 11: C-value and P-value threshold effect on the phrase table size and BLEU scores (UM-Corpus + CWMT 2013).

C-value ε   −log(P-value) τ_F   Phrase table (%)   BLEU (%)
0           0                   100                28.13
10          20                  8.36               28.09
25          45                  5.25               27.98
100         60                  2.97               28.11
200         100                 1.43               28.27
400         200                 2.02               25.78

Table 12: Translation performance on different word alignment types (BLEU, 3,064 sentences).

Languages   A_full   A_lmost   A_sure   A_mid   A_rmost
fr-en       24.80    4.48      22.26    9.96    2.48
es-en       30.76    3.12      28.24    11.28   2.32
de-en       21.66    2.30      19.28    9.76    2.94

4.5. Word Alignment versus Translation Quality. In order to investigate the effect of different word alignment types and phrase table sizes on translation performance, we selected five types of word alignments to generate the phrase table and then measured the translation quality with the BLEU metric: the full word alignment points (A_full), the leftmost word alignment (A_lmost), the sure alignment points (A_sure), the middle word alignment point (A_mid), and the rightmost word alignment point (A_rmost).

Clearly, A_full is the best-quality word alignment, followed by A_sure, which keeps only the 1-to-1 alignments. A_lmost, A_mid, and A_rmost are much worse, each containing only one alignment link. As mentioned previously, all these alignment points are obtained from the results generated by GIZA++. The experiment is carried out on the WMT 2006 corpus. The number of phrases extracted under each alignment type and the corresponding BLEU scores are presented in Figure 7 and Table 12, respectively.

The histograms show that there is no direct proportional relationship between the translation performance and the number of extracted phrases. The numbers in Figure 7 show that the German-English language pair generates the most phrases, while its translation quality is not the best. Although the fewest phrases are produced for the Spanish-English language pair, it obtains the best translation performance. As explained by (17), more phrases usually indicate worse word alignment quality, or, put differently, too many unaligned words in the produced alignment results. The results in Figure 7 and Table 12 show that (1) the full word alignment contributes most to the translation performance, and there is not much decrease when only the sure alignments are provided; (2) the first and last word alignments, however, do not produce high translation quality; (3) although the middle-position word alignment extracts the most phrases, its translation quality is not the best, yet among the single-link alignments the middle one contributes the most to the final translation.

In fact, the translation quality from the single-link word alignments could be higher; after all, the default full word alignments, trained from a large corpus without manual editing, contain too much noise.

5 Conclusions

In this paper, some investigations are made into the relationship among word alignment, phrase table, and machine translation. After deducing a formula for estimating the size of the phrase table, a maximal phrase pruning rate can be calculated based on the average sentence length of the given corpus and the word alignment positions. Finally, translation performance based on different word alignment types is compared.

Equations (13) and (17) provide a simple way to predict the size of the phrase table in advance based on the generated word alignment points. This is beneficial for estimating how many hardware resources are required to train the model when the parallel corpus is huge.

Experimental results show that the effect on translation performance is significant if the AER decreases or increases too much. There is no direct proportional relationship between the translation performance and the number of extracted phrases. In other words, most of the phrases are not needed at all; the phrases extracted from the full word alignment are enough. Our corpus-motivated pruning method shows that only about 1.5%–3.0% of the phrases are needed for the basic translations, and most phrases in the phrase table are noise to the final translation.

Under the existing algorithm for phrase extraction from word alignment, better word alignment will be very helpful to the final translation. There are at least three advantages of better word alignment quality, even without improving the existing phrase extraction algorithm:

(i) Better alignment will extract fewer phrase pairs and keep the phrase table at a manageable size.


[Figure 7 comprises six panels: (a)-(c) show the number of phrases extracted for fr-en, es-en, and de-en under the alignment types A_lmost, A_sure, A_mid, A_full, and A_rmost, and (d)-(f) show the corresponding BLEU scores (the same values as in Table 12). In each language pair, the full alignment yields the fewest phrases (9.13628 × 10^6, 1.09586 × 10^7, and 9.21992 × 10^6 for fr-en, es-en, and de-en, respectively) and the middle single-link alignment the most (1.5419 × 10^9, 1.40198 × 10^9, and 1.60618 × 10^9), with the remaining types between about 5.2 × 10^7 and 7.9 × 10^7.]

Figure 7: Number of phrases extracted from different word alignment points and their BLEU scores.

(ii) Better alignment will reduce the decoding time when searching for the most probable translation in the phrase table.

(iii) Better alignment will produce better word- or phrase-level translation quality.

In future work, we want to integrate the corpus-motivated pruning technique into the translation model; that is, the phrases will be filtered during training, not after the phrase table is generated.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank all reviewers for their very careful reading and helpful suggestions. The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for funding support of their research under Reference nos. MYRG076(Y1-L2)-FST13-WF and MYRG070(Y1-L2)-FST12-CS.

References

[1] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol. 1, pp. 48–54, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[2] R. Zens, F. J. Och, and H. Ney, Phrase-Based Statistical Machine Translation, Springer, Berlin, Germany, 2002.

[3] J. H. Shin, Y. S. Han, and K. S. Choi, "Bilingual knowledge acquisition from Korean-English parallel corpus using alignment method: Korean-English alignment at word and phrase level," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 230–235, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.

[4] K. Yamamoto, T. Kudo, Y. Tsuboi, and Y. Matsumoto, "Learning sequence-to-sequence correspondences from parallel corpora via sequential pattern mining," in Proceedings of the HLT-NAACL Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, vol. 3, pp. 73–80, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[5] C. Goutte, K. Yamada, and E. Gaussier, "Aligning words using matrix factorisation," in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2004.

[6] F. J. Och, C. Tillmann, H. Ney, and L. F. Informatik, Improved Alignment Models for Statistical Machine Translation, University of Maryland, College Park, Md, USA, 1999.

[7] F. J. Och and H. Ney, "A comparison of alignment models for statistical machine translation," in Proceedings of the 18th Conference on Computational Linguistics, vol. 2, pp. 1086–1090, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2000.

[8] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.

[9] F. J. Och and H. Ney, "The alignment template approach to statistical machine translation," Computational Linguistics, vol. 30, no. 4, pp. 417–449, 2004.

[10] N. Duan, M. Li, and M. Zhou, "Improving phrase extraction via MBR phrase scoring and pruning," in Proceedings of the MT Summit XIII, pp. 189–197, 2011.

[11] G. Neubig, T. Watanabe, E. Sumita, S. Mori, and T. Kawahara, "An unsupervised model for joint phrase alignment and extraction," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 632–641, Stroudsburg, Pa, USA, June 2011.

[12] H. Setiawan, H. Li, and M. Zhang, "Learning phrase translation using level of detail approach," in The Tenth Machine Translation Summit (MT-Summit X), 2005.

[13] C. Tillmann, "A projection extension algorithm for statistical machine translation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[14] B. Zhao and S. Vogel, "A generalized alignment-free phrase extraction," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 141–144, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.

[15] Y. Zhang and S. Vogel, "Competitive grouping in integrated phrase segmentation and alignment model," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 159–162, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.

[16] A. Lopez, "Statistical machine translation," ACM Computing Surveys, vol. 40, no. 3, article 8, 2008.

[17] K. Ganchev, J. V. Graça, and B. Taskar, "Better alignments = better translations?" in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 986–993, June 2008.

[18] D. Vilar, M. Popovic, and H. Ney, "AER: do we need to 'improve' our alignments?" in Proceedings of the International Workshop on Spoken Language Translation, pp. 205–212, 2006.

[19] A. Fraser and D. Marcu, "Measuring word alignment quality for statistical machine translation," Computational Linguistics, vol. 33, no. 3, pp. 293–303, 2007.

[20] P. Lambert, S. Petitrenaud, Y. Ma, and A. Way, "What types of word alignment improve statistical machine translation?" Machine Translation, vol. 26, no. 4, pp. 289–323, 2012.

[21] J. H. Johnson, J. Martin, G. Foster, and R. Kuhn, "Improving translation quality by discarding most of the phrasetable," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 967–975, June 2007.

[22] J. H. Johnson, Conditional Significance Pruning: Discarding More of Huge Phrase Tables, 2012.

[23] M. Yang and J. Zheng, "Toward smaller, faster, and better hierarchical phrase-based SMT," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 237–240, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.

[24] N. Tomeh, N. Cancedda, and M. Dymetman, "Complexity-based phrase-table filtering for statistical machine translation," in Proceedings of the MT Summit XII, 2009.

[25] M. Eck, S. Vogel, and A. Waibel, "Translation model pruning via usage statistics for statistical machine translation," in Proceedings of the Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '07), pp. 21–24, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.

[26] M. Eck, S. Vogel, and A. Waibel, "Estimating phrase pair relevance for machine translation pruning," in Proceedings of the MT Summit XI, pp. 159–165, 2007.

[27] Y. Chen, A. Eisele, and M. Kay, "Improving statistical machine translation efficiency by triangulation," in Proceedings of the Sixth International Conference on Language Resources and Evaluation, 2008.

[28] Y. Chen, M. Kay, and A. Eisele, "Intersecting multilingual data for faster and better statistical translations," in Proceedings of the Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '09), pp. 128–136, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2009.

[29] G. Sanchis-Trilles, D. Ortiz-Martínez, J. González-Rubio, J. González, and F. Casacuberta, "Bilingual segmentation for phrasetable pruning in statistical machine translation," in Proceedings of the 15th International Conference of the European Association for Machine Translation (EAMT '11), pp. 257–264, May 2011.

[30] N. Tomeh, M. Turchi, G. Wisniewski, A. Allauzen, and F. Yvon, "How good are your phrases? Assessing phrase quality with single class classification," in Proceedings of the International Workshop on Spoken Language Translation, pp. 261–268, 2011.

[31] W. Ling, J. Graça, I. Trancoso, and A. Black, "Entropy-based pruning for phrase-based machine translation," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 962–971, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[32] R. Zens, D. Stanton, and P. Xu, "A systematic comparison of phrase table pruning techniques," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 972–983, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[33] K. Papineni, S. Roukos, T. Ward, and W. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '01), pp. 311–318, 2001.

[34] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Computational Linguistics, vol. 19, no. 2, pp. 263–311, 1993.

[35] P. Koehn, Statistical Machine Translation, Cambridge University Press, New York, NY, USA, 1st edition, 2010.

[36] E. Galbrun, "Phrase table pruning for statistical machine translation," Department of Computer Science Series of Publications C, C-2009-22, University of Helsinki, 2007.

[37] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, USA, 2006.

[38] Z. He, Y. Meng, Y. Lu, H. Yu, and Q. Liu, "Reducing SMT rule table with monolingual key phrase," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 121–124, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.

[39] Z. He, Y. Meng, and H. Yu, "Discarding monotone composed rule for hierarchical phrase-based statistical machine translation," in Proceedings of the 3rd International Universal Communication Symposium (IUCS '09), pp. 25–29, New York, NY, USA, December 2009.

[40] K. T. Frantzi and S. Ananiadou, "Extracting nested collocations," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 41–46, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.

[41] P. Koehn, H. Hoang, A. Birch et al., "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177–180, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.

[42] D. Chiang, "A hierarchical phrase-based model for statistical machine translation," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 263–270, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2005.

[43] W. A. Gale and K. W. Church, "Identifying word correspondences in parallel texts," in Proceedings of the Workshop on Speech and Natural Language, pp. 152–157, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1991.

[44] A. Stolcke, "SRILM—an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing, pp. 901–904, 2002.

[45] M. Federico, N. Bertoldi, and M. Cettolo, "IRSTLM: an open source toolkit for handling large scale language models," in Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH '08), pp. 1618–1621, September 2008.

[46] P. Koehn, "Europarl: a parallel corpus for statistical machine translation," in Proceedings of the MT Summit X, 2005.

[47] S. Green and J. DeNero, "A class-based agreement model for generating accurately inflected translations," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, vol. 1, pp. 146–155, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[48] N. Xue, F. D. Chiou, and M. Palmer, "Building a large-scale annotated Chinese corpus," in Proceedings of the 19th International Conference on Computational Linguistics, vol. 1, pp. 1–8, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2002.

[49] N. Xue, F. Xia, F.-D. Chiou, and M. Palmer, "The Penn Chinese TreeBank: phrase structure annotation of a large corpus," Natural Language Engineering, vol. 11, no. 2, pp. 207–238, 2005.

[50] S. Vogel, H. Ney, and C. Tillmann, "HMM-based word alignment in statistical translation," in Proceedings of the 16th International Conference on Computational Linguistics, pp. 836–841, 1996.

[51] CWMT 2013, http://www.liip.cn/cwmt2013.

[52] L. Tian, F. Wong, and S. Chao, "An improvement of translation quality with adding key-words in parallel corpus," in Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC '10), pp. 1273–1278, Qingdao, China, July 2010.

[53] L. Tian, D. Wong, L. Chao et al., "UM-Corpus: a large English-Chinese parallel corpus for statistical machine translation," in Proceedings of the 9th International Conference on Language Resources and Evaluation, 2014.



Start from e_1: N_phrases = I − i + 1
Start from e_2: N_phrases = I − i + 1
Similarly, ∀i ∈ {i, ..., I} on the target side: N_phrases = I − i + 1

Algorithm 1: Phrase extraction based on only one alignment point e_i ↔ f_j.
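The count implied by Algorithm 1 can be cross-checked by brute force: with a single link at 1-indexed position (i, j), every source span covering i paired with every target span covering j is a consistent phrase pair, giving i·(I − i + 1)·j·(J − j + 1) pairs (length limits ignored), the same product that appears in the denominator of (21) since i(I − i) + i = i(I − i + 1). A small self-check, with illustrative sentence lengths of our own choosing:

```python
def count_one_link(I, J, i, j):
    """Phrase pairs consistent with the single link (i, j), 1-indexed:
    i choices of source start times (I - i + 1) choices of source end,
    times the analogous target-side product."""
    return i * (I - i + 1) * j * (J - j + 1)

def brute_force(I, J, i, j):
    """Enumerate all span pairs and keep those covering the link."""
    total = 0
    for a in range(1, I + 1):
        for b in range(a, I + 1):
            for c in range(1, J + 1):
                for d in range(c, J + 1):
                    if a <= i <= b and c <= j <= d:
                        total += 1
    return total

# I = J = 3 with the link in the middle (i = j = 2):
# 2 * 2 * 2 * 2 == 16 consistent phrase pairs.
assert count_one_link(3, 3, 2, 2) == brute_force(3, 3, 2, 2) == 16
```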

Table 3: Training data statistics, WMT 2006.

Language pairs   Sentences   Ave. source len   Ave. target len
fr-en            688,031     22.67             20.07
es-en            730,740     21.52             20.83
de-en            751,088     20.31             21.37

than the calculated value. When the alignment link is in the middle position (middle alignment), the maximal pruning rate is obtained, as shown in (21). Based on this equation, we can estimate the range of pruning rates for different word alignment positions in the Europarl corpus [46]. Table 3 shows the statistics of the WMT 2006 data, and Table 4 compares the pruning rates reported in other researchers' papers [22, 32] with the ranges calculated from (20). From this we can see that the calculated results are very close to the experimental results presented in [32]. The authors concluded from their findings that "the novel entropy-based pruning often achieves the same BLEU score with only half the number of phrases." A very bold assumption is thus that most phrases (at least half) can be removed from the originally extracted phrases. Consider

R_max = 1 − [I_average (I_average + 1) / 2] / ([i_mid (I_average − i_mid) + i_mid] × [j_mid (J_average − j_mid) + j_mid]).   (21)
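Plugging illustrative round numbers into (21) (toy values of our own, not the corpus averages of Table 4) shows why the upper pruning bounds approach 98%:

```python
def r_max(I_avg, J_avg, i_mid, j_mid):
    """Maximal pruning rate of (21): the minimal table size
    I(I + 1)/2 divided by the maximal table size obtained from a
    single middle alignment link."""
    n_min = I_avg * (I_avg + 1) / 2
    n_max = ((i_mid * (I_avg - i_mid) + i_mid)
             * (j_mid * (J_avg - j_mid) + j_mid))
    return 1 - n_min / n_max

# Average lengths of 20 on both sides, link in the middle:
rate = r_max(20, 20, 10, 10)
# round(rate, 3) == 0.983, close to the ~98% upper bounds of Table 4.
```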

We have seen that the phrase table pruning rate is strongly affected by the sentence lengths and word positions in the source and target sentences. As reflected in the pruning equation (20), the phrase table might be pruned more accurately based on features (e.g., phrase length, phrase cooccurrence) of the bilingual and monolingual corpora. The researchers in [27] reduced the phrase table based on significance testing of phrase-pair cooccurrence in a bilingual corpus, while He et al. [38, 39] pruned phrase pairs by source-side phrase frequency, using key phrases extracted from a monolingual corpus. We want to make full use of the advantages of both approaches: the key phrases can be used to check whether the extracted phrases are often used in practice, and the significance testing can be used to measure the quality of the extracted phrase pairs. Our method considers features of both the source and target phrases.

Table 4: Comparison between theoretical estimation results and other pruning techniques' results.

Language pairs   j_mid   i_mid   R_pruned
fr-en            12.34   11.04   53.5%–98.6%
es-en            11.26   11.42   49.3%–98.4%
de-en            11.16   11.19   44.9%–98.3%

The basic metric for deciding whether a phrase is often used in practice is the frequency with which it appears in a monolingual corpus: the more frequently a phrase appears, the greater the possibility that it will be used. However, longer phrase pairs would be removed under this metric, because such phrases tend to appear rarely in a monolingual corpus.

The C-value is a measure for automatic term recognition that can be used to solve this problem, as described in [40]. The method not only fully considers frequency information, such as the frequencies of a phrase and of a subphrase appearing within longer phrases, but is also an efficient algorithm.

In this algorithm (as shown in Algorithm 2), four factors (L, F, S, N) can be used to determine whether a phrase p is a key phrase [38]:

(i) L(p): the length of p;
(ii) F(p): the frequency with which p appears in a corpus;
(iii) S(p): the frequency with which p appears as a substring in other, longer phrases;
(iv) N(p): the number of phrases that contain p as a substring.

Given a monolingual corpus, key phrases can be extracted efficiently according to Algorithm 2. Firstly (line 1), all possible phrases are extracted as candidate key phrases (e.g., the phrase table extracted by Moses [41]). Secondly (lines 3 to 7), for each candidate phrase, the C-value is computed according to whether the phrase appears by itself (line 4) or as a substring of other, longer phrases (line 6). The C-value is in direct proportion to the phrase length (L) and occurrences (F, S), but in inverse proportion to the number of phrases that contain the phrase as a substring (N). This overcomes the limitations of the plain frequency measure. A phrase is regarded as a key phrase if its C-value is greater than a threshold ε. Finally (lines 11 to 14), S(q) and N(q) are updated for each substring q.

The algorithm was originally used for pruning the rule table extracted by hierarchical phrase-based SMT [42].


Input: Monolingual Corpus
(1) Extract candidate phrases
(2) for all phrases p in length-descending order do
(3)   if N(p) = 0 then
(4)     C-value = (L(p) − 1) × F(p)
(5)   else
(6)     C-value = (L(p) − 1) × (F(p) − S(p)/N(p))
(7)   end if
(8)   if C-value ≥ ε then
(9)     add p to KP
(10)  end if
(11)  for all substrings q of p do
(12)    S(q) = S(q) + F(p) − S(p)
(13)    N(q) = N(q) + 1
(14)  end for
(15) end for
Output: Key Phrase Table KP

Algorithm 2: Key phrase extraction from a monolingual corpus.
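As an illustration, Algorithm 2 can be sketched in a few lines of Python. This is our sketch, not the authors' implementation; `phrases` is assumed to map each candidate phrase (a tuple of tokens) to its corpus frequency F(p), while S(p) and N(p) are accumulated on the fly during the length-descending sweep, as in lines 11 to 14 of the algorithm.

```python
from collections import Counter

def extract_key_phrases(phrases, epsilon):
    """Sketch of Algorithm 2: C-value key phrase extraction.

    phrases: dict mapping a phrase (tuple of tokens) to its frequency F(p).
    Returns the key phrase table KP as {phrase: C-value}.
    """
    S = Counter()  # frequency of q as a substring of longer phrases
    N = Counter()  # number of longer phrases containing q
    key_phrases = {}
    for p in sorted(phrases, key=len, reverse=True):
        L, F = len(p), phrases[p]
        if N[p] == 0:
            c_value = (L - 1) * F                    # line (4)
        else:
            c_value = (L - 1) * (F - S[p] / N[p])    # line (6)
        if c_value >= epsilon:
            key_phrases[p] = c_value                 # line (9)
        # lines (11)-(14): update S(q), N(q) for each proper subphrase q
        for i in range(L):
            for j in range(i + 1, L + 1):
                q = p[i:j]
                if q != p:
                    S[q] += F - S[p]
                    N[q] += 1
    return key_phrases
```

Run on a toy candidate set, the sweep assigns ("machine", "translation") a C-value of (2 − 1) × (5 − 3/1) = 2, illustrating how occurrences inside longer key phrases are discounted, while the single-token "machine" gets C-value 0 and is discarded.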

Input: Bilingual Corpus and KP Table from Algorithm 2
(1) for all phrases p′ in KP do
(2)   if −log(P-value(p′)) ≥ τ_F then
(3)     add p′ to P_PT
(4)   end if
(5) end for
Output: Pruned Phrase Table P_PT

Algorithm 3: Algorithm for key phrase pruning.

However, the idea can also be applied to phrase-based SMT [1, 41] perfectly. After this filtering step, a filtered phrase table called KP is generated.

The phrases in the phrase table KP are all phrases possibly used in practice. However, there is still some noise in the KP table; for example, the source phrase occurring in the source sentences and the target phrase occurring in the target sentences may not be well matched, that is, the phrase translation is not good enough. To overcome this, it is convenient to construct a two-by-two contingency table that tabulates the sentence pairs where the two types of matches occur and do not occur [21, 22, 43]. The calculated result is also called the P-value, as introduced in Section 2. A bilingual corpus constraint can thus be added to filter out the noisy phrases, as presented in Algorithm 3. After this step, a phrase table with a large pruning rate (called P_PT) is generated.
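For concreteness, the P-value over this contingency table can be instantiated as the one-tailed Fisher exact test, as done by Johnson et al. [21]. The sketch below is our illustration, not the paper's code: c_st, c_s, c_t, and N are the counts of sentence pairs containing both phrases, the source phrase, the target phrase, and all pairs, and the hypergeometric tail is summed in log space because the raw probabilities can underflow.

```python
import math

def neg_log_p_value(c_st, c_s, c_t, N):
    """Negative log of the one-tailed Fisher exact test P-value,
    P(X >= c_st) under the hypergeometric null, computed in log space."""
    def log_comb(n, k):
        # log of the binomial coefficient C(n, k) via log-gamma
        return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

    lo = max(c_st, c_s + c_t - N)   # smallest feasible cooccurrence count
    hi = min(c_s, c_t)              # largest feasible cooccurrence count
    log_terms = [
        log_comb(c_s, k) + log_comb(N - c_s, c_t - k) - log_comb(N, c_t)
        for k in range(lo, hi + 1)
    ]
    m = max(log_terms)
    # log-sum-exp over the tail, so tiny probabilities never underflow
    log_p = m + math.log(sum(math.exp(t - m) for t in log_terms))
    return -log_p
```

Phrase pairs whose negative-log P-value falls below the threshold τ_F would then be pruned by Algorithm 3.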

3.3. Word Alignment & Machine Translation. Based on the discussion in Sections 3.1 and 3.2, we can see that the size of the phrase table is heavily affected by the quality of the word alignment: the better the word alignment, the smaller the phrase table. Besides this, machine translation performance is not directly decided by the size of the phrase table; at least 50% of the phrases can be removed without hurting the performance of the machine translation. Experiments show that the quality of the phrase translations plays a more important role in the final translation quality in practice. Therefore, it is necessary to improve the quality of word alignment under the existing algorithm for phrase extraction from word alignment.

Now it is a little easier to talk about the relationship between word alignment and machine translation. The translation of a sentence is directly derived from the phrase table, while the phrase table is extracted from the word alignment. In other words, the quality of the word alignment affects the performance of machine translation through the phrase table; that is to say, the word alignment quality affects the translation performance considerably. However, the works listed in [8, 18, 19] show that improved word alignment quality does not always significantly affect the translation result.

We do not agree with this conclusion. Firstly, either the training or testing corpus used in their experiments is limited to a small size, or the improvement in word alignment quality is not significant (AER decrease between 0.1 and 0.2). Secondly, we think it is the existing phrase extraction algorithm that leads to the inconclusive result. We pointed out in the last part of Section 3.1 that even when the worst alignments are used, the extracted phrase table still has some ability to translate some input phrases. This is because of the shared phrases extracted from the only one


Figure 6: Moses' main modules and models. (The diagram shows the bilingual corpus feeding GIZA++ (the aligner) and phrase extraction to produce the phrase table, the monolingual corpus feeding SRILM/IRSTLM (LM toolkits) to produce the language model, and the Moses decoder using both models to translate source text into target text.)

alignment link. Even a single word alignment link can translate some input phrases or sentences, not to mention alignments with more links.

We will discuss the situation with different word alignments in the next section, where the machine translation performance obtained from the best and the worst word alignments will be fully surveyed. After that, we can see that there are at least three advantages to improving the quality of word alignment, as follows.

(i) Better alignment will extract fewer phrase pairs and keep the phrase table at a manageable size.

(ii) Better alignment will reduce the decoding time when searching for the most probable translation in the phrase table.

(iii) Better alignment will produce better quality word- or phrase-level translations.

4 Experiments and Discussions

In order to evaluate the conclusions drawn in the previous section, some experiments are carried out using the Moses toolkit [41], which provides a complete package of the modules required for training and generating translations of texts.

4.1. Experiment Settings. Like other SMT approaches, the Moses toolkit requires two models (as shown in Figure 6): the Language Model (LM) and the Translation Model (TM). In Moses, the LM is usually created by the external toolkits SRILM [44] or IRSTLM [45]; in our experiments, the IRSTLM toolkit is used for the large training corpus. Before training, the tokenization for the languages in the Europarl corpus [46] uses the scripts integrated in Moses, and the segmentation for Chinese adopts the Stanford Chinese segmenter [47] using the CTB (Chinese Treebank) standard [48, 49]. The entire language model is trained with five-gram models.

The phrases are extracted from the results generated by GIZA++ [8]. The word alignment was trained with ten iterations of IBM Model 1 and Model 4 [34] and six iterations of the HMM alignment model [50].

To extract all the possible phrases to verify (13) and (17), the maximal phrase length is limited to 30. This can extract most

Table 5: Statistics of the first 100,000 sentences of the WMT 2006 corpus.

Languages | Ave. source len. | Ave. en len.
fr-en | 22.89 | 20.15
es-en | 21.71 | 20.94
de-en | 20.55 | 21.42

Table 6: Statistics of the test data (3064 sentences).

Languages | Ave. source len. | Ave. en len.
fr-en | 32.95 | 27.81
es-en | 29.94 | 27.81
de-en | 26.92 | 27.81

Table 7: Statistics of CWMT 2013 + UM-Corpus.

Languages | Tokens | Average length | Vocabularies
English | 152,161,233 | 19.37 | 1,655,080
Character_CE | 229,110,265 | 29.16 | 397,442
CTB_CE | 123,917,395 | 15.78 | 1,331,505

phrases in terms of word alignments and lets us focus on the relations among word alignment, phrase table, and machine translation.

In order to better reveal the relationships of alignment and phrase table to translation quality, the leftmost, middle, and rightmost word alignments are extracted from the original full alignment for the worst-alignment situation (i.e., only one alignment link). Here, the leftmost, middle, and rightmost word alignments indicate the word alignment link aligned to the first, middle, and last word on the source language side.
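A minimal sketch of this degradation step (our illustration; the function name is ours), with a full alignment represented as a set of (source index, target index) pairs:

```python
def single_link(alignment, which="leftmost"):
    """Keep a single link from a full word alignment, selected by the
    source-side position: the link of the first, middle, or last
    aligned source word."""
    links = sorted(alignment)  # sort by source index, then target index
    if not links:
        return set()
    if which == "leftmost":
        return {links[0]}
    if which == "rightmost":
        return {links[-1]}
    if which == "middle":
        return {links[len(links) // 2]}
    raise ValueError(which)
```

Each degraded alignment keeps exactly one link, so phrase extraction over it simulates the worst-alignment situation described above.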

4.2. Corpus Information. Three language pairs of the WMT 2006 shared translation task are chosen as the training data: French-English (fr-en), German-English (de-en), and Spanish-English (es-en). These three language pairs are used to carry out the experiments proving the equations deduced in Section 3 and the relationship between word alignment and machine translation. Table 3 shows the statistics of the original corpus [46]. However, the number of phrases extracted from the only-one-link alignments is too large to be handled by our server, so the first 100,000 Europarl sentences (statistics shown in Table 5) are chosen to verify the equations relating word alignment and phrase table. Table 6 presents the statistics of the 3064 test sentences from WMT 2006, which share the same English side.

To test our algorithm for pruning the phrase table, a large Chinese-English parallel corpus is brought in. The large corpus consists of two parts: 3,289,497 sentences from CWMT 2013 [51] and 4,157,556 sentences from the UM-Corpus [52, 53]. After removing repeated and mismatched sentences from the two combined parts, 7,445,190 sentences are left; the statistics of the combined parallel corpus are presented in Table 7. Note that the statistics of the Chinese sentences are counted both at the character level (each Chinese character is treated as one token; Character_CE in Table 7) and at the


Table 8: Statistics of the English & Chinese monolingual corpora.

Languages | Tokens | Average length | Vocabularies
English | 804,584,844 | 24.64 | 2,357,374
Chinese | 3,007,196,322 | 33.43 | 1,533,343

Table 9: Statistics for the EC-Corpus testing data.

Languages | Tokens | Average length
English | 118,802 | 23.76
Character_CE | 153,705 | 30.74
CTB_CE | 118,499 | 23.70

word level (segmented by the Stanford CTB segmenter; presented as CTB_CE).

The UM-Corpus also contains English, Chinese, and Portuguese monolingual corpora. In this experiment, the English and Chinese corpora are used. There are 35,938,697 English sentences from the news domain and more than 70 million Chinese sentences from China portal webpages. The statistics of the two corpora are shown in Table 8.

The 5000 sentences of test data (named UM-Testing) are designed according to the domains in the UM-Corpus. There are 1500 sentences for news, 500 for laws, 500 for novels, 600 for theses, 700 for education, 600 for science, and 600 for subtitles. The statistics of the testing data are shown in Table 9.

4.3. Equation of Word Alignments versus Phrases. The calculation of the number of phrases requires the full word alignment links. Full word alignment, in other words, is the best quality of word alignment. It is obviously very time-consuming to manually edit the word alignments of a large set of sentences. It has been observed in the past that the generative models used for statistical word alignment create alignments of increasing quality as they are exposed to more data [19]. The intuition behind this is simple: as more cooccurrences of source and target words are observed, the word alignments become better. Following this intuition, if we wish to increase the quality of a word alignment, we allow the alignment process access to extra data, which is used only during the alignment process and then removed. If we wish to decrease the quality of a word alignment, we divide the parallel corpus into pieces, align the pieces independently of one another, and finally concatenate the results together. In our experiments, we first use the entire WMT 2006 corpus as training data, and then the first 100,000 alignments are chosen as the full word alignments.

The leftmost, middle, and rightmost word alignment points based on the full word alignment are also obtained from the results generated by GIZA++.

The minimal and maximal sizes of the phrase table extracted from the full and the middle word alignments, together with the differential rates, are shown in Table 10. The differential rate is calculated by

\[
R_{\text{diff-min}} = \frac{\left|N_{\min} - N_{\text{moses-min}}\right|}{N_{\min}}, \quad (22)
\]
\[
R_{\text{diff-max}} = \frac{\left|N_{\max} - N_{\text{moses-max}}\right|}{N_{\max}}, \quad (23)
\]

where N_moses-min and N_moses-max indicate the minimal and maximal numbers of phrases extracted by Moses.

From Table 10, we can see that the actual number of phrases is close to the calculated result. Given the small differential rates, (13) and (17) are applicable to practical experiments. The difference may come from two reasons: (1) the assumed full word alignment produced by GIZA++ may contain too much noise; (2) equations (13) and (17) do not consider repetitions of the extracted phrases.
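Plugging the fr-en row of Table 10 into (22) and (23) reproduces the reported differential rates (a small sketch of ours):

```python
def diff_rate(n_calc, n_moses):
    """Differential rate between the calculated phrase count (N_min or
    N_max) and the count actually extracted by Moses, per (22)-(23)."""
    return abs(n_calc - n_moses) / n_calc

# fr-en row of Table 10
r_min = diff_rate(9_080_307, 9_136_283)          # ≈ 0.006
r_max = diff_rate(1_544_021_509, 1_541_902_790)  # ≈ 0.001
```

Both values round to the 0.006 and 0.001 reported in the table.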

4.4. Phrase Table versus Machine Translation. We have shown that there is no direct relationship between the size of the phrase table and machine translation performance. As reflected in (17) and Table 4, at least half of the phrases can be removed without hurting the performance of machine translation too much. In this part, we use the large Chinese-English corpus to test our pruning algorithm based on the monolingual and bilingual corpora.

According to our experience, different thresholds generate different phrase table sizes and result in different translation performance. Therefore, one of the key steps is to choose the pruning thresholds in Algorithms 2 and 3. Firstly, a baseline threshold is calculated from the logarithm of the number of sentences in the given corpus (N), that is, log(N). For the C-value threshold, the baseline is about 25 (log(3.6 × 10^7)). For the P-value threshold, the negative logarithm is chosen: because the probabilities involved are incredibly tiny, we work instead with the negative natural logs of the probabilities. For example, instead of selecting phrase pairs with a P-value less than exp(−15), we select phrase pairs with a negative-log-P-value greater than 15. The baseline threshold for τ_F is 45 (log(7.4 × 10^6)). Secondly, the thresholds are varied above and below these baselines. The effect of the different thresholds on the phrase table and BLEU scores is shown in Table 11.
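The switch to negative logs is not cosmetic: a P-value such as exp(−800) underflows IEEE double precision to exactly 0.0, while its negative log remains an ordinary, comparable number. A small sketch (the function name is ours):

```python
import math

def keep_phrase(neg_log_p, tau):
    """Selecting pairs with P-value < exp(-tau) is equivalent to
    selecting pairs with -log(P-value) > tau, but the log-space test
    never has to materialize the tiny probability itself."""
    return neg_log_p > tau

# exp(-800) underflows to exactly 0.0 in double precision, so comparing
# raw P-values would lump all highly significant pairs together; their
# negative logs (800.0 here) remain perfectly distinguishable.
print(math.exp(-800))          # 0.0
print(keep_phrase(800.0, 45))  # True
```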

From Table 11, we can see that nearly 98% of the phrases can be discarded with the monolingual and bilingual corpora, while the translation quality does not become much worse (a decrease of only 2.35 BLEU points). The experiments show that reducing the phrase table by about 85% can generate the best translation quality, which not only again confirms results similar to those of other researchers [3, 21, 32, 38] but also indicates that most phrases are not needed at all. The best translation can be achieved at about 15% to 30% of the original phrase table size. That is to say, only a small fraction of the phrases are the essential elements, while the majority of the extracted phrases are spurious; in some sense they are redundant and should be discarded.


Table 10: Phrase comparison between Moses extraction and calculation results (phrase length K = 30).

Languages | N_moses-min | N_min | R_diff-min | N_moses-max | N_max | R_diff-max
fr-en | 9,136,283 | 9,080,307 | 0.006 | 1,541,902,790 | 1,544,021,509 | 0.001
es-en | 10,958,551 | 10,895,266 | 0.006 | 1,401,977,185 | 1,489,422,170 | 0.058
de-en | 9,219,921 | 8,929,095 | 0.033 | 1,606,176,665 | 1,686,043,482 | 0.047

Table 11: C-value and P-value threshold effects on the phrase table size and BLEU scores (UM-Corpus + CWMT 2013).

C-value ε | −log(P-value) τ_F | Phrase table (%) | BLEU (%)
0 | 0 | 100 | 28.13
10 | 20 | 83.6 | 28.09
25 | 45 | 52.5 | 27.98
100 | 60 | 29.7 | 28.11
200 | 100 | 14.3 | 28.27
400 | 200 | 2.02 | 25.78

Table 12: Translation performance on different word alignment types (BLEU, 3064 test sentences).

Languages | A_full | A_lmost | A_sure | A_mid | A_rmost
fr-en | 24.80 | 4.48 | 22.26 | 9.96 | 2.48
es-en | 30.76 | 3.12 | 28.24 | 11.28 | 2.32
de-en | 21.66 | 2.30 | 19.28 | 9.76 | 2.94

4.5. Word Alignment versus Translation Quality. In order to investigate the effect of different word alignment types and phrase table sizes on the translation performance, we selected five different types of word alignments to generate phrase tables and then measured the translation quality by the BLEU metric: the full word alignment points (A_full), the leftmost word alignment (A_lmost), the sure alignment points (A_sure), the middle word alignment point (A_mid), and the rightmost word alignment point (A_rmost).

Clearly, A_full is the best quality of word alignment, followed by A_sure, which contains only the 1-to-1 alignments. A_lmost, A_mid, and A_rmost are much worse, as each keeps only one alignment link. As mentioned previously, all these alignment points are obtained from the results generated by GIZA++. The experiment is carried out on the WMT 2006 corpus. Based on the different alignment types, the number of phrases extracted and the corresponding translation BLEU are presented in Figure 7 and Table 12, respectively.

The histograms show that there is no directly proportional relationship between the translation performance and the number of extracted phrases. The numbers in Figure 7 show that the German-English language pair generates the most phrases, while its translation quality is not the best. Although the fewest phrases are produced from the Spanish-English language pair, the best translation performance is obtained. As explained by (17), more phrases usually indicate

worse word alignment quality; or, put differently, there are too many unaligned words in the produced alignment results. The results in Figure 7 and Table 12 show that (1) the full word alignment contributes most to the translation performance, and there is not much decrease when only the sure alignments are provided; (2) the leftmost and rightmost word alignments, however, do not produce high translation quality; (3) although the middle-position word alignment extracts the most phrases, its translation quality is not the best, yet the middle word alignment seems to contribute more to the final translation than the other single links.

In fact, the translation from the only-one-link word alignments should be even higher; after all, the default full word alignments, trained from a large corpus without manual editing, contain too much noise.

5 Conclusions

In this paper, some investigations are made into the relationship among word alignment, phrase table, and machine translation. After deducing a formula for estimating the size of the phrase table, a maximal phrase pruning ratio can be calculated based on the average length of the given corpus and the word alignment positions. Finally, the translation performance based on different word alignment types is compared.

Equations (13) and (17) provide a simple way to predict the size of the phrase table in advance based on the generated word alignment points. This is beneficial for evaluating how many hardware resources are required to train the model when the size of the parallel corpus is huge.

Experimental results show that the effect on the translation performance is significant if the AER decreases or increases too much. There is no directly proportional relationship between the translation performance and the number of extracted phrases. In other words, most of the phrases are not needed at all; the phrases extracted from the full word alignment alone are enough. It can be seen from our corpus-motivated pruning method that only 15–30% of the phrases are needed for the basic translations, and most phrases in the phrase table are noise for the final translation.

Based on the existing algorithm for phrase extraction from word alignment, better word alignment will be very helpful to the final translation. There are at least three advantages to a better quality of word alignment, even without improving the existing phrase extraction algorithm, as follows.

(i) Better alignment will extract fewer phrase pairs and keep the phrase table at a manageable size.


[Figure 7 consists of six bar charts over the word alignment types A_lmost, A_sure, A_mid, A_full, and A_rmost: panels (a)-(c) show the number of phrases extracted for fr-en, es-en, and de-en (ranging from about 9 × 10^6 for A_full to about 1.6 × 10^9 for A_mid), and panels (d)-(f) show the corresponding fr-en, es-en, and de-en BLEU scores, which match Table 12.]

Figure 7: Number of phrases extracted from different word alignment points and their BLEU scores.

(ii) Better alignment will reduce the decoding time when searching for the most probable translation in the phrase table.

(iii) Better alignment will produce better quality word- or phrase-level translations.

The next work we want to focus on is integrating the corpus-motivated pruning technique into the translation models; that is, the phrases will be filtered during training, not after generating the phrase table.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank all the reviewers for their very careful reading and helpful suggestions. The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for funding support for their research under Reference nos. MYRG076 (Y1-L2)-FST13-WF and MYRG070 (Y1-L2)-FST12-CS.

References

[1] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol. 1, pp. 48–54, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[2] R. Zens, F. J. Och, and H. Ney, Phrase-Based Statistical Machine Translation, Springer, Berlin, Germany, 2002.

[3] J. H. Shin, Y. S. Han, and K. S. Choi, "Bilingual knowledge acquisition from Korean-English parallel corpus using alignment method: Korean-English alignment at word and phrase level," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 230–235, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.

[4] K. Yamamoto, T. Kudo, Y. Tsuboi, and Y. Matsumoto, "Learning sequence-to-sequence correspondences from parallel corpora via sequential pattern mining," in Proceedings of the HLT-NAACL Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, vol. 3, pp. 73–80, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[5] C. Goutte, K. Yamada, and E. Gaussier, "Aligning words using matrix factorisation," in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2004.

[6] F. J. Och, C. Tillmann, H. Ney, and L. F. Informatik, Improved Alignment Models for Statistical Machine Translation, University of Maryland, College Park, Md, USA, 1999.

[7] F. J. Och and H. Ney, "A comparison of alignment models for statistical machine translation," in Proceedings of the 18th Conference on Computational Linguistics, vol. 2, pp. 1086–1090,


Association for Computational Linguistics, Stroudsburg, Pa, USA, 2000.

[8] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.

[9] F. J. Och and H. Ney, "The alignment template approach to statistical machine translation," Computational Linguistics, vol. 30, no. 4, pp. 417–449, 2004.

[10] N. Duan, M. Li, and M. Zhou, "Improving phrase extraction via MBR phrase scoring and pruning," in Proceedings of the MT Summit XIII, pp. 189–197, 2011.

[11] G. Neubig, T. Watanabe, E. Sumita, S. Mori, and T. Kawahara, "An unsupervised model for joint phrase alignment and extraction," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 632–641, Stroudsburg, Pa, USA, June 2011.

[12] H. Setiawan, H. Li, and M. Zhang, "Learning phrase translation using level of detail approach," in The Tenth Machine Translation Summit (MT-Summit X), 2005.

[13] C. Tillmann, "A projection extension algorithm for statistical machine translation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[14] B. Zhao and S. Vogel, "A generalized alignment-free phrase extraction," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 141–144, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.

[15] Y. Zhang and S. Vogel, "Competitive grouping in integrated phrase segmentation and alignment model," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 159–162, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.

[16] A. Lopez, "Statistical machine translation," ACM Computing Surveys, vol. 40, no. 3, article 8, 2008.

[17] K. Ganchev, J. V. Graça, and B. Taskar, "Better alignments = better translations?" in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 986–993, June 2008.

[18] D. Vilar, M. Popović, and H. Ney, "AER: do we need to 'improve' our alignments?" in Proceedings of the International Workshop on Spoken Language Translation, pp. 205–212, 2006.

[19] A. Fraser and D. Marcu, "Measuring word alignment quality for statistical machine translation," Computational Linguistics, vol. 33, no. 3, pp. 293–303, 2007.

[20] P. Lambert, S. Petitrenaud, Y. Ma, and A. Way, "What types of word alignment improve statistical machine translation?" Machine Translation, vol. 26, no. 4, pp. 289–323, 2012.

[21] J. H. Johnson, J. Martin, G. Foster, and R. Kuhn, "Improving translation quality by discarding most of the phrasetable," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 967–975, June 2007.

[22] J. H. Johnson, Conditional Significance Pruning: Discarding More of Huge Phrase Tables, 2012.

[23] M. Yang and J. Zheng, "Toward smaller, faster, and better hierarchical phrase-based SMT," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 237–240, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.

[24] N. Tomeh, N. Cancedda, and M. Dymetman, "Complexity-based phrase-table filtering for statistical machine translation," in Proceedings of the MT Summit XII, 2009.

[25] M. Eck, S. Vogel, and A. Waibel, "Translation model pruning via usage statistics for statistical machine translation," in Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '07), pp. 21–24, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.

[26] M. Eck, S. Vogel, and A. Waibel, "Estimating phrase pair relevance for machine translation pruning," in Proceedings of the MT Summit XI, pp. 159–165, 2007.

[27] Y. Chen, A. Eisele, and M. Kay, "Improving statistical machine translation efficiency by triangulation," in Proceedings of the Sixth International Conference on Language Resources and Evaluation, 2008.

[28] Y. Chen, M. Kay, and A. Eisele, "Intersecting multilingual data for faster and better statistical translations," in Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '09), pp. 128–136, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2009.

[29] G. Sanchis-Trilles, D. Ortiz-Martínez, J. González-Rubio, J. González, and F. Casacuberta, "Bilingual segmentation for phrasetable pruning in statistical machine translation," in Proceedings of the 15th International Conference of the European Association for Machine Translation (EAMT '11), pp. 257–264, May 2011.

[30] N. Tomeh, M. Turchi, G. Wisniewski, A. Allauzen, and F. Yvon, "How good are your phrases? Assessing phrase quality with single class classification," in Proceedings of the International Workshop on Spoken Language Translation, pp. 261–268, 2011.

[31] W. Ling, J. Graça, I. Trancoso, and A. Black, "Entropy-based pruning for phrase-based machine translation," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 962–971, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[32] R. Zens, D. Stanton, and P. Xu, "A systematic comparison of phrase table pruning techniques," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 972–983, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[33] K. Papineni, S. Roukos, T. Ward, and W. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pp. 311–318, 2002.

[34] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Computational Linguistics, vol. 19, no. 2, pp. 263–311, 1993.

[35] P. Koehn, Statistical Machine Translation, Cambridge University Press, New York, NY, USA, 1st edition, 2010.

[36] E. Galbrun, "Phrase table pruning for statistical machine translation," Department of Computer Science Series of Publications C, C-2009-22, University of Helsinki, 2009.


[37] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, USA, 2006.

[38] Z. He, Y. Meng, Y. Lu, H. Yu, and Q. Liu, "Reducing SMT rule table with monolingual key phrase," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 121–124, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.

[39] Z. He, Y. Meng, and H. Yu, "Discarding monotone composed rule for hierarchical phrase-based statistical machine translation," in Proceedings of the 3rd International Universal Communication Symposium (IUCS '09), pp. 25–29, New York, NY, USA, December 2009.

[40] K. T. Frantzi and S. Ananiadou, "Extracting nested collocations," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 41–46, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.

[41] P. Koehn, H. Hoang, A. Birch et al., "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177–180, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.

[42] D. Chiang, "A hierarchical phrase-based model for statistical machine translation," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 263–270, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2005.

[43] W. A. Gale and K. W. Church, "Identifying word correspondences in parallel texts," in Proceedings of the Workshop on Speech and Natural Language, pp. 152–157, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1991.

[44] A. Stolcke, "SRILM—an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing, pp. 901–904, 2002.

[45] M. Federico, N. Bertoldi, and M. Cettolo, "IRSTLM: an open source toolkit for handling large scale language models," in Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH '08), pp. 1618–1621, September 2008.

[46] P. Koehn, "Europarl: a parallel corpus for statistical machine translation," in Proceedings of the MT Summit X, 2005.

[47] S. Green and J. DeNero, "A class-based agreement model for generating accurately inflected translations," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, vol. 1, pp. 146–155, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[48] N. Xue, F. D. Chiou, and M. Palmer, "Building a large-scale annotated Chinese corpus," in Proceedings of the 19th International Conference on Computational Linguistics, vol. 1, pp. 1–8, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2002.

[49] N. Xue, F. Xia, F.-D. Chiou, and M. Palmer, "The Penn Chinese TreeBank: phrase structure annotation of a large corpus," Natural Language Engineering, vol. 11, no. 2, pp. 207–238, 2005.

[50] S. Vogel, H. Ney, and C. Tillmann, "HMM-based word alignment in statistical translation," in Proceedings of the 16th International Conference on Computational Linguistics, pp. 836–841, 1996.

[51] CWMT 2013, httpwwwliipcncwmt2013.

[52] L. Tian, F. Wong, and S. Chao, "An improvement of translation quality with adding key-words in parallel corpus," in Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC '10), pp. 1273–1278, Qingdao, China, July 2010.

[53] L. Tian, D. Wong, L. Chao et al., "UM-Corpus: a large English-Chinese parallel corpus for statistical machine translation," in Proceedings of the 9th International Conference on Language Resources and Evaluation, 2014.


The Scientific World Journal 7

Input: Monolingual Corpus
(1) Extract candidate phrases
(2) for all phrases p in length-descending order do
(3)   if N(p) = 0 then
(4)     C-value = (L(p) − 1) × F(p)
(5)   else
(6)     C-value = (L(p) − 1) × (F(p) − S(p)/N(p))
(7)   end if
(8)   if C-value ≥ ε then
(9)     add p to KP
(10)  end if
(11)  for all substrings q of p do
(12)    S(q) = S(q) + F(p) − S(p)
(13)    N(q) = N(q) + 1
(14)  end for
(15) end for
Output: Key Phrase Table KP

Algorithm 2: Key phrase extraction from a monolingual corpus.
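For concreteness, Algorithm 2 can be rendered as the following Python sketch. This is our illustrative implementation, not the paper's code; it assumes candidate phrases arrive as token tuples mapped to their raw corpus frequencies F(p).

```python
from collections import defaultdict

def substrings(p):
    """All proper contiguous sub-phrases of a phrase p (a token tuple)."""
    n = len(p)
    return [p[i:j] for i in range(n) for j in range(i + 1, n + 1) if j - i < n]

def extract_key_phrases(freq, epsilon):
    """Algorithm 2: freq maps candidate phrase tuples to their corpus
    frequency F(p); returns the key phrases with their C-values."""
    S = defaultdict(float)  # accumulated frequency of containing phrases
    N = defaultdict(int)    # number of containing phrases seen so far
    kp = {}
    for p in sorted(freq, key=len, reverse=True):  # length-descending order
        L, F = len(p), freq[p]
        c = (L - 1) * F if N[p] == 0 else (L - 1) * (F - S[p] / N[p])
        if c >= epsilon:
            kp[p] = c
        for q in substrings(p):  # update nested-phrase statistics
            S[q] += F - S[p]
            N[q] += 1
    return kp

kp = extract_key_phrases(
    {("machine", "translation", "system"): 2,
     ("machine", "translation"): 5,
     ("translation",): 9},
    epsilon=3)
```

Note that single-word phrases always get C-value 0 (the factor L(p) − 1 vanishes), so only multiword phrases can enter KP.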

Input: Bilingual Corpus and KP Table from Algorithm 2
(1) for all phrases p′ in KP do
(2)   if −log(P-value(p′)) ≥ τ_F then
(3)     add p′ to P_PT
(4)   end if
(5) end for
Output: Pruned Phrase Table P_PT

Algorithm 3: Algorithm for key phrase pruning.

However, the idea can also be used perfectly well for phrase-based SMT [1, 41]. After this filtering step, a filtered phrase table called KP is generated.

The phrases in the phrase table KP are all phrases possibly used in practice. However, there is still some noise in the KP table; for example, the source phrase occurring in source sentences and the target phrase occurring in the target sentences may not be well matched, that is, the phrase translation is not good enough. To overcome this, it is convenient to construct a two-by-two contingency table that tabulates the sentence pairs where the two types of matches occur and do not occur [21, 22, 43]. The calculation result is the P-value introduced in Section 2. A bilingual corpus constraint can then be added to filter out the noisy phrases, as presented in Algorithm 3. After this step, a phrase table with a large pruning rate (called P_PT) is generated.
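One way to realize the P-value on this 2×2 table is a one-sided Fisher exact test, as in [21, 43]. The sketch below is our own minimal rendering via the hypergeometric tail; the paper's exact computation may differ.

```python
from math import comb, log

def p_value(c_st, c_s, c_t, n):
    """One-sided Fisher exact test on the 2x2 contingency table of
    sentence pairs: c_st pairs contain both the source and the target
    phrase, c_s (c_t) pairs contain the source (target) phrase, and n
    is the total number of sentence pairs."""
    # Tail probability P(X >= c_st) under the hypergeometric null.
    total = comb(n, c_t)
    return sum(comb(c_s, k) * comb(n - c_s, c_t - k)
               for k in range(c_st, min(c_s, c_t) + 1)) / total

# A phrase pair that always cooccurs is highly significant ...
strong = -log(p_value(5, 5, 5, 100))
# ... while a pair that cooccurs once out of many chances is not.
weak = -log(p_value(1, 50, 50, 100))
```

A pair is kept by Algorithm 3 when its negative-log-P-value is at least the threshold τ_F; in the toy numbers above, the first pair passes a threshold of 15 while the second does not.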

3.3. Word Alignment and Machine Translation. Based on the discussion in Sections 3.1 and 3.2, we can see that the size of the phrase table is heavily affected by the quality of the word alignment: the better the word alignment, the smaller the phrase table. Besides this, machine translation performance is not directly decided by the size of the phrase table; at least 50% of phrases can be removed without hurting the performance of the machine translation. Experiments show that the quality of phrase translation plays a more important role in the final translation quality in practice. Therefore, it is necessary to improve the quality of word alignment under the existing algorithm for phrase extraction from word alignment.

It is now a little easier to discuss the relationship between word alignment and machine translation. The translation of a sentence is directly derived from the phrase table, while the phrase table is extracted from the word alignment. In other words, the quality of the word alignment affects the performance of machine translation through the phrase table, so the word alignment quality should affect the translation performance considerably. However, the works listed in [8, 18, 19] show that improved word alignment quality does not always have a significant effect on the translation result.

We do not agree with this conclusion. Firstly, either the training or testing corpus used in their experiments is limited to a small size, or the improvement in word alignment quality is not significant (an AER decrease between 0.1 and 0.2). Secondly, we think it is the existing phrase extraction algorithm that leads to the inconclusive result. We pointed out in the last part of Section 3.1 that even when the worst alignments are used, the extracted phrase table still retains some ability to translate some input phrases. This is because of the shared phrases extracted from the only one


Figure 6: Moses' main modules and models (GIZA++ aligner and phrase extraction build the phrase table from the bilingual corpus; SRILM/IRSTLM build the language model from the monolingual corpus; the Moses decoder translates source text into target text).

alignment link. A single word alignment link can already translate some input phrases or sentences, not to mention richer word alignments with more links.

We will discuss the situation with different word alignments in the next section, where the machine translation affected by the best and worst word alignments will be fully surveyed. After that, we can see that there are at least three advantages to improving the quality of word alignment, as follows.

(i) Better alignment will extract fewer phrase pairs and keep the phrase table at a manageable size.

(ii) Better alignment will reduce the decoding time when searching for the most probable translation in the phrase table.

(iii) Better alignment will produce better word- or phrase-level translation quality.

4. Experiments and Discussions

In order to evaluate the conclusions drawn in the previous section, some experiments are carried out using the Moses toolkit [41], which provides a complete package of the modules required for training and generating the translation of texts.

4.1. Experiment Settings. Like other SMT approaches, the Moses toolkit requires two models (as shown in Figure 6): the Language Model (LM) and the Translation Model (TM). In Moses, the LM is usually created by the external toolkits SRILM [44] or IRSTLM [45]; in our experiment, the IRSTLM toolkit is used for the large training corpus. Before training, tokenization for the languages in the European corpus [46] uses the scripts integrated in Moses, and segmentation for Chinese uses the Stanford Chinese segmenter [47] with the CTB (Chinese TreeBank) standard [48, 49]. The language model is a five-gram model.

The phrases are extracted from the results generated by GIZA++ [8]. The word alignment was trained with ten iterations of IBM Model 1 and Model 4 [34] and six iterations of the HMM alignment model [50].

To extract all the possible phrases to verify (13) and (17), the maximal phrase length is limited to 30. This can extract most

Table 5: Statistics of the first 100,000 WMT 2006 corpus sentences.

Languages    Ave. source len.    Ave. en len.
fr-en        22.89               20.15
es-en        21.71               20.94
de-en        20.55               21.42

Table 6: Statistics of test data (3064 sentences).

Languages    Ave. source len.    Ave. en len.
fr-en        32.95               27.81
es-en        29.94               27.81
de-en        26.92               27.81

Table 7: Statistics of CWMT 2013 + UM-Corpus.

Languages      Tokens         Average length    Vocabularies
English        152,161,233    19.37             1,655,080
CharacterCE    229,110,265    29.16             397,442
CTBCE          123,917,395    15.78             1,331,505

phrases in terms of word alignments and lets us focus on the relations among word alignment, phrase table, and machine translation.
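The extraction itself is the standard consistency-based enumeration [16]: every source span whose alignment links all fall inside a matching target span yields a phrase pair, with unaligned target words allowing the target span to extend. The sketch below is our simplified rendering (spans are inclusive index pairs), not Moses' actual code.

```python
def extract_phrase_pairs(n_src, n_tgt, links, max_len=30):
    """Enumerate all phrase pairs consistent with the alignment:
    links is a set of (i, j) source-target word index pairs; spans are
    inclusive (i1, i2) / (j1, j2) index pairs of at most max_len words."""
    tgt_aligned = {j for (_, j) in links}
    pairs = set()
    for i1 in range(n_src):
        for i2 in range(i1, min(i1 + max_len, n_src)):
            js = [j for (i, j) in links if i1 <= i <= i2]
            if not js:
                continue  # fully unaligned source span: no pair extracted
            j1, j2 = min(js), max(js)
            # consistency: no outside source word may link into [j1, j2]
            if any(j1 <= j <= j2 and not i1 <= i <= i2 for (i, j) in links):
                continue
            # extend the target span over unaligned words on both sides
            a = j1
            while True:
                b = j2
                while True:
                    if b - a + 1 <= max_len:
                        pairs.add(((i1, i2), (a, b)))
                    if b + 1 >= n_tgt or b + 1 in tgt_aligned:
                        break
                    b += 1
                if a == 0 or a - 1 in tgt_aligned:
                    break
                a -= 1
    return pairs

# two words aligned diagonally: three consistent pairs
pairs = extract_phrase_pairs(2, 2, {(0, 0), (1, 1)})
```

Unaligned words are exactly what inflates the table: with the alignment {(0, 0), (1, 2)} over a three-word target side, the single unaligned target word already raises the count from three pairs to five.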

In order to better reveal the relationships of alignment and phrase table to translation quality, the leftmost, middle, and rightmost word alignments are extracted from the original full alignment for the worst alignment situation (i.e., only one alignment link). Here, the leftmost, middle, and rightmost word alignments indicate the word alignment link attached to the first, middle, and last word on the source language side.
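Beyond "the link attached to the first, middle, and last source word," the selection is not spelled out; the following sketch is one reading of it, where ordering by source position (ties broken by target position) is our assumption.

```python
def single_link(links, which):
    """Keep only one link from a full alignment: the link at the first,
    middle, or last aligned source position (A_lmost, A_mid, A_rmost)."""
    ordered = sorted(links)  # by source position, then target position
    pick = {"leftmost": 0, "middle": len(ordered) // 2, "rightmost": -1}
    return {ordered[pick[which]]} if ordered else set()

full = {(0, 3), (2, 1), (4, 0)}
```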

4.2. Corpus Information. Three language pairs from the WMT 2006 shared translation task are chosen as training data: French-English (fr-en), German-English (de-en), and Spanish-English (es-en). These three language pairs are used to carry out the experiments for proving the equations deduced in Section 3 and the relationship between word alignment and machine translation. Table 3 shows the statistics of the original corpus [46]. However, the number of phrases extracted from only one link is too large for our server, so the first 100,000 European corpus sentences (statistics shown in Table 5) are chosen to verify the equations relating word alignment and phrase table. Table 6 presents the statistics of the 3064 test sentences from WMT 2006, which share the same English side.

To test our algorithm for pruning the phrase table, a large Chinese-English parallel corpus is brought in. The large corpus consists of two parts: 3,289,497 sentences from CWMT 2013 [51] and 4,157,556 sentences from UM-Corpus [52, 53]. After removing repeated and mismatched sentences from the combined parts, 7,445,190 sentences are left; the statistics of the combined parallel corpus are presented in Table 7. Note that the statistics of the Chinese sentences are counted both at the character level (each Chinese character is treated as one token; CharacterCE in Table 7) and


Table 8: Statistics of English & Chinese corpora.

Languages    Tokens           Average length    Vocabularies
English      804,584,844      24.64             2,357,374
Chinese      3,007,196,322    33.43             1,533,343

Table 9: Statistics of EC-Corpus testing data.

Languages      Tokens     Average length
English        118,802    23.76
CharacterCE    153,705    30.74
CTBCE          118,499    23.70

at the word level (segmented by the Stanford CTB segmenter; CTBCE in Table 7).

The UM-Corpus also contains English, Chinese, and Portuguese monolingual corpora; in this experiment, the English and Chinese corpora are used. There are 35,938,697 English sentences from the news domain and more than 70 million Chinese sentences from Chinese web portals. The statistics of the two corpora are shown in Table 8.

The 5000 sentences of test data (named UM-Testing) are designed according to the domains in the UM-Corpus: 1500 sentences for news, 500 for laws, 500 for novels, 600 for theses, 700 for education, 600 for science, and 600 for subtitles. The statistics of the testing data are shown in Table 9.

4.3. Equation of Word Alignments versus Phrases. The calculation of the number of phrases requires the full word alignment links; full word alignment, in other words, is the best quality of word alignment. It is obviously very time-consuming to manually edit the word alignments of a large set of sentences. It has been observed in the past that generative models used for statistical word alignment create alignments of increasing quality as they are exposed to more data [19]. The intuition behind this is simple: as more cooccurrences of source and target words are observed, the word alignments get better. Following this intuition, if we wish to increase the quality of a word alignment, we allow the alignment process access to extra data, which is used only during the alignment process and then removed. If we wish to decrease the quality of a word alignment, we divide the parallel corpus into pieces, align the pieces independently of one another, and finally concatenate the results. In our experiments, we first use the entire WMT 2006 corpus as training data, and then the first 100,000 alignments are chosen as the full word alignments.
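The degradation protocol (split, align independently, concatenate) can be sketched as follows; `aligner` stands in for any black-box word aligner such as a GIZA++ run, so this is illustrative scaffolding rather than the authors' tooling.

```python
def align_in_pieces(corpus, n_pieces, aligner):
    """Deliberately degrade alignment quality: split the parallel corpus
    into n_pieces chunks, align each chunk independently (the aligner
    then sees fewer cooccurrences), and concatenate the results in order."""
    size = -(-len(corpus) // n_pieces)  # ceiling division
    alignments = []
    for start in range(0, len(corpus), size):
        alignments.extend(aligner(corpus[start:start + size]))
    return alignments

# toy aligner: "aligns" each pair by recording the chunk size it saw
toy = lambda chunk: [len(chunk)] * len(chunk)
out = align_in_pieces([("s", "t")] * 10, 3, toy)
```

Each piece exposes the aligner to fewer cooccurrence statistics, which is exactly what lowers alignment quality.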

The leftmost, middle, and rightmost word alignment points are likewise obtained from the full word alignments generated by GIZA++.

The minimal and maximal sizes of the phrase table extracted from the full and middle word alignments, together with the differential rates, are shown in Table 10. The differential rate is calculated by

R_diff-min = |N_min − N_moses-min| / N_min,  (22)

R_diff-max = |N_max − N_moses-max| / N_max,  (23)

where N_moses-min and N_moses-max indicate the minimal and maximal numbers of phrases extracted by Moses.

From Table 10, we can see that the actual number of phrases is close to the calculated result. Given the small differential rates, (13) and (17) are applicable in practical experiments. The difference may come from two sources: (1) the assumed full word alignment produced by GIZA++ may contain too much noise; (2) equations (13) and (17) do not consider repetitions of the extracted phrases.
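The differential rates in Table 10 can be reproduced directly from (22) and (23); the numbers below are the fr-en row of Table 10.

```python
def diff_rate(n_calc, n_moses):
    """Differential rate of (22)/(23): |N - N_moses| / N, where N is the
    calculated count and N_moses the count Moses actually extracted."""
    return abs(n_calc - n_moses) / n_calc

# fr-en row of Table 10
r_min = diff_rate(9080307, 9136283)        # N_min vs N_moses-min
r_max = diff_rate(1544021509, 1541902790)  # N_max vs N_moses-max
```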

4.4. Phrase Table versus Machine Translation. We have shown that there is no direct relationship between the size of the phrase table and machine translation performance. As reflected in (17) and Table 4, at least half of the phrases can be removed without hurting the performance of machine translation too much. In this part, we use the large Chinese-English corpus to test our pruning algorithm based on the monolingual and bilingual corpora.

According to our experience, different thresholds generate different phrase table sizes and result in different translation performance. Therefore, one of the key steps is to choose the pruning thresholds in Algorithms 2 and 3. Firstly, a baseline threshold is calculated from the logarithm of the number of sentences in the given corpus (N), that is, log(N). For the C-value threshold, the baseline is about 25 (log(3.6 × 10^7)). For the P-value threshold, the negative logarithm is used: because the probabilities involved are incredibly tiny, we work instead with the negative natural logs of the probabilities. For example, instead of selecting phrase pairs with a P-value less than exp(−15), we select phrase pairs with a negative-log-P-value greater than 15. The baseline threshold for τ_F is 45 (log(7.4 × 10^6)).

Secondly, the thresholds are varied above and below the baseline. The effect of the different thresholds on the phrase table and BLEU scores is shown in Table 11.
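The C-value baseline and the negative-log conversion are easy to check numerically; the "log" here matching a base-2 logarithm of the monolingual corpus size is an assumption we make explicit in the sketch below.

```python
import math

# C-value baseline from the ~3.6e7-sentence monolingual corpus,
# assuming the paper's log is base 2 (log2(3.6e7) ~ 25).
c_baseline = math.log2(3.6e7)

# Working with negative natural logs of P-values: "P-value < exp(-15)"
# is the same cut as "-ln(P-value) > 15".
p = math.exp(-16)                  # an example tiny P-value
passes_raw = p < math.exp(-15)
passes_neglog = -math.log(p) > 15
```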

From Table 11, we can see that nearly 98% of phrases can be discarded using the monolingual and bilingual corpora while the translation quality does not become much worse (a decrease of only 2.35 BLEU). Experiments show that reducing the phrase table by about 85% can generate the best translation quality, which not only confirms the results of other researchers [3, 21, 32, 38] but also indicates that most phrases are not needed at all. The best translation can be achieved at about 15% to 30% of the original phrase table size. That is to say, only a small fraction of phrases are essential, while the majority of the extracted phrases are spurious; in some sense they are redundant and should be discarded.


Table 10: Phrase comparison between Moses extraction and calculation result (phrase length K = 30).

Languages    N_moses-min    N_min         R_diff-min    N_moses-max      N_max            R_diff-max
fr-en        9,136,283      9,080,307     0.006         1,541,902,790    1,544,021,509    0.001
es-en        10,958,551     10,895,266    0.006         1,401,977,185    1,489,422,170    0.058
de-en        9,219,921      8,929,095     0.033         1,606,176,665    1,686,043,482    0.047

Table 11: C-value and P-value threshold effects on the phrase table size and BLEU scores (UM-Corpus + CWMT 2013).

C-value ε    −log(P-value) τ_F    Phrase table (%)    BLEU (%)
0            0                    100                 28.13
10           20                   83.6                28.09
25           45                   52.5                27.98
100          60                   29.7                28.11
200          100                  14.3                28.27
400          200                  2.02                25.78

Table 12: Translation performance (BLEU) on different word alignment types (3064 test sentences).

Languages    A_full    A_lmost    A_sure    A_mid     A_rmost
fr-en        24.80     4.48       22.26     9.96      2.48
es-en        30.76     3.12       28.24     11.28     2.32
de-en        21.66     2.30       19.28     9.76      2.94

4.5. Word Alignment versus Translation Quality. In order to investigate the effect of different word alignment types and phrase table sizes on translation performance, we selected five different types of word alignment to generate the phrase table and then measured the translation quality by the BLEU metric: the full word alignment points (A_full), the leftmost word alignment (A_lmost), the sure alignment points (A_sure), the middle word alignment point (A_mid), and the rightmost word alignment point (A_rmost).

Clearly, A_full is the best quality of word alignment, followed by A_sure, which contains only the 1-to-1 alignments. A_lmost, A_mid, and A_rmost are much worse, each consisting of only one alignment link. As mentioned previously, all these alignment points are obtained from the results generated by GIZA++. The experiment is carried out on the WMT 2006 corpus. Based on the different alignment types, the number of phrases extracted and the corresponding translation BLEU are presented in Figure 7 and Table 12, respectively.

The histograms show that there is no directly proportional relationship between the translation performance and the number of extracted phrases. The numbers in Figure 7 show that the German-English language pair generates the most phrases while its translation quality is not the best; conversely, although the fewest phrases are produced for the Spanish-English language pair, it obtains the best translation performance. As explained by (17), more phrases usually indicate worse word alignment quality, or, put differently, too many unaligned words in the produced alignment results. The results in Figure 7 and Table 12 show that (1) the full word alignment contributes most to the translation performance, and there is not too much decrease when only the sure alignments are provided; (2) however, the first and last word alignments do not produce high translation quality; (3) although the middle-position word alignment extracts the most phrases, its translation quality is not the best, though it seems that the middle word alignment contributes more to the final translation than the leftmost or rightmost one.

In fact, the translation from only one word alignment link should score even higher; after all, the default full word alignments, trained from a large corpus without manual editing, contain too much noise.

5. Conclusions

In this paper, some investigations are made into the relationship among word alignment, phrase table, and machine translation. After deducing a formula for estimating the size of the phrase table, a maximal phrase pruning ratio can be calculated based on the average sentence length of the given corpus and the word alignment positions. Finally, translation performance based on different word alignment types is compared.

Equations (13) and (17) provide a simple way to predict the size of the phrase table in advance from the generated word alignment points. This is beneficial for evaluating how many hardware resources are required to train the model when the size of the parallel corpus is huge.

Experimental results show that the effect on translation performance is significant only if the AER decreases or increases by a large amount. There is no directly proportional relationship between the translation performance and the number of extracted phrases. In other words, most of the phrases are not needed at all; the phrases extracted from the full word alignment alone are enough. Our corpus-motivated pruning method shows that only 15–30% of phrases are needed for the basic translations, and most phrases in the phrase table are noise to the final translation.

Under the existing algorithm for phrase extraction from word alignment, better word alignment will be very helpful to the final translation. There are at least three advantages to a better quality of word alignment, even without improving the existing phrase extraction algorithm, as follows.

(i) Better alignment will extract fewer phrase pairs and keep the phrase table at a manageable size.


Figure 7: Number of phrases extracted from different word alignment points and their BLEU. Panels (a)-(c) plot the number of phrases for fr-en, es-en, and de-en over the alignment types A_lmost, A_sure, A_mid, A_full, and A_rmost; panels (d)-(f) plot the corresponding BLEU scores (cf. Table 12).

(ii) Better alignment will reduce the decoding time when searching for the most probable translation in the phrase table.

(iii) Better alignment will produce better word- or phrase-level translation quality.

Our next work will focus on adding the corpus-motivated pruning technique into the translation models; that is, the phrases will be filtered during training, not after the phrase table has been generated.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank all reviewers for their very careful reading and helpful suggestions. The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for funding support of their research under Reference nos. MYRG076 (Y1-L2)-FST13-WF and MYRG070 (Y1-L2)-FST12-CS.

References

[1] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol. 1, pp. 48–54, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[2] R. Zens, F. J. Och, and H. Ney, Phrase-Based Statistical Machine Translation, Springer, Berlin, Germany, 2002.

[3] J. H. Shin, Y. S. Han, and K. S. Choi, "Bilingual knowledge acquisition from Korean-English parallel corpus using alignment method: Korean-English alignment at word and phrase level," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 230–235, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.

[4] K. Yamamoto, T. Kudo, Y. Tsuboi, and Y. Matsumoto, "Learning sequence-to-sequence correspondences from parallel corpora via sequential pattern mining," in Proceedings of the HLT-NAACL Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, vol. 3, pp. 73–80, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[5] C. Goutte, K. Yamada, and E. Gaussier, "Aligning words using matrix factorisation," in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2004.

[6] F. J. Och, C. Tillmann, and H. Ney, "Improved alignment models for statistical machine translation," University of Maryland, College Park, Md, USA, 1999.

[7] F. J. Och and H. Ney, "A comparison of alignment models for statistical machine translation," in Proceedings of the 18th Conference on Computational Linguistics, vol. 2, pp. 1086–1090,


Association for Computational Linguistics, Stroudsburg, Pa, USA, 2000.

[8] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.

[9] F. J. Och and H. Ney, "The alignment template approach to statistical machine translation," Computational Linguistics, vol. 30, no. 4, pp. 417–449, 2004.

[10] N. Duan, M. Li, and M. Zhou, "Improving phrase extraction via MBR phrase scoring and pruning," in Proceedings of the MT Summit XIII, pp. 189–197, 2011.

[11] G. Neubig, T. Watanabe, E. Sumita, S. Mori, and T. Kawahara, "An unsupervised model for joint phrase alignment and extraction," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 632–641, Stroudsburg, Pa, USA, June 2011.

[12] H. Setiawan, H. Li, and M. Zhang, "Learning phrase translation using level of detail approach," in The Tenth Machine Translation Summit (MT-Summit X), 2005.

[13] C. Tillmann, "A projection extension algorithm for statistical machine translation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[14] B. Zhao and S. Vogel, "A generalized alignment-free phrase extraction," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 141–144, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.

[15] Y. Zhang and S. Vogel, "Competitive grouping in integrated phrase segmentation and alignment model," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 159–162, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.

[16] A. Lopez, "Statistical machine translation," ACM Computing Surveys, vol. 40, no. 3, article 8, 2008.

[17] K. Ganchev, J. V. Graça, and B. Taskar, "Better alignments = better translations?" in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 986–993, June 2008.

[18] D. Vilar, M. Popović, and H. Ney, "AER: do we need to 'improve' our alignments?" in Proceedings of the International Workshop on Spoken Language Translation, pp. 205–212, 2006.

[19] A. Fraser and D. Marcu, "Measuring word alignment quality for statistical machine translation," Computational Linguistics, vol. 33, no. 3, pp. 293–303, 2007.

[20] P. Lambert, S. Petitrenaud, Y. Ma, and A. Way, "What types of word alignment improve statistical machine translation?" Machine Translation, vol. 26, no. 4, pp. 289–323, 2012.

[21] J. H. Johnson, J. Martin, G. Foster, and R. Kuhn, "Improving translation quality by discarding most of the phrasetable," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 967–975, June 2007.

[22] J. H. Johnson, Conditional Significance Pruning: Discarding More of Huge Phrase Tables, 2012.

[23] M. Yang and J. Zheng, "Toward smaller, faster, and better hierarchical phrase-based SMT," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 237–240, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.

[24] N. Tomeh, N. Cancedda, and M. Dymetman, "Complexity-based phrase-table filtering for statistical machine translation," in Proceedings of the MT Summit XII, 2009.

[25] M. Eck, S. Vogel, and A. Waibel, "Translation model pruning via usage statistics for statistical machine translation," in Proceedings of the Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '07), pp. 21–24, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.

[26] M. Eck, S. Vogel, and A. Waibel, "Estimating phrase pair relevance for machine translation pruning," in Proceedings of the MT Summit XI, pp. 159–165, 2007.

[27] Y. Chen, A. Eisele, and M. Kay, "Improving statistical machine translation efficiency by triangulation," in Proceedings of the Sixth International Conference on Language Resources and Evaluation, 2008.

[28] Y. Chen, M. Kay, and A. Eisele, "Intersecting multilingual data for faster and better statistical translations," in Proceedings of the Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '09), pp. 128–136, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2009.

[29] G. Sanchis-Trilles, D. Ortiz-Martínez, J. González-Rubio, J. González, and F. Casacuberta, "Bilingual segmentation for phrasetable pruning in statistical machine translation," in Proceedings of the 15th International Conference of the European Association for Machine Translation (EAMT '11), pp. 257–264, May 2011.

[30] N. Tomeh, M. Turchi, G. Wisniewski, A. Allauzen, and F. Yvon, "How good are your phrases? Assessing phrase quality with single class classification," in Proceedings of the International Workshop on Spoken Language Translation, pp. 261–268, 2011.

[31] W. Ling, J. Graça, I. Trancoso, and A. Black, "Entropy-based pruning for phrase-based machine translation," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 962–971, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[32] R. Zens, D. Stanton, and P. Xu, "A systematic comparison of phrase table pruning techniques," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 972–983, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[33] K. Papineni, S. Roukos, T. Ward, and W. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pp. 311–318, 2002.

[34] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Computational Linguistics, vol. 19, no. 2, pp. 263–311, 1993.

[35] P. Koehn, Statistical Machine Translation, Cambridge University Press, New York, NY, USA, 1st edition, 2010.

[36] E. Galbrun, "Phrase table pruning for statistical machine translation," Department of Computer Science Series of Publications C, C-2009-22, University of Helsinki, 2009.


[37] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, USA, 2006.

[38] Z. He, Y. Meng, Y. Lu, H. Yu, and Q. Liu, "Reducing SMT rule table with monolingual key phrase," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 121–124, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.

[39] Z. He, Y. Meng, and H. Yu, "Discarding monotone composed rule for hierarchical phrase-based statistical machine translation," in Proceedings of the 3rd International Universal Communication Symposium (IUCS '09), pp. 25–29, New York, NY, USA, December 2009.

[40] K. T. Frantzi and S. Ananiadou, "Extracting nested collocations," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 41–46, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.

[41] P. Koehn, H. Hoang, A. Birch, et al., "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177–180, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.

[42] D. Chiang, "A hierarchical phrase-based model for statistical machine translation," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 263–270, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2005.

[43] W. A. Gale and K. W. Church, "Identifying word correspondences in parallel texts," in Proceedings of the Workshop on Speech and Natural Language, pp. 152–157, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1991.

[44] A. Stolcke, "SRILM: an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing, pp. 901–904, 2002.

[45] M. Federico, N. Bertoldi, and M. Cettolo, "IRSTLM: an open source toolkit for handling large scale language models," in Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH '08), pp. 1618–1621, September 2008.

[46] P. Koehn, "Europarl: a parallel corpus for statistical machine translation," in Proceedings of the MT Summit X, 2005.

[47] S. Green and J. DeNero, "A class-based agreement model for generating accurately inflected translations," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, vol. 1, pp. 146–155, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[48] N Xue F D Chiou and M lmer ldquoBuilding a large-scaleannotated chinese corpusrdquo in Proceedings of the 19th Interna-tional Conference on Computational Linguistics vol 1 pp 1ndash8PaAssociation for Computational Linguistics Stroudsburg PaUSA 2002

[49] N Xue F Xia F-D Chiou and M Palmer ldquoThe Penn ChineseTreeBank phrase structure annotation of a large corpusrdquoNatural Language Engineering vol 11 no 2 pp 207ndash238 2005

[50] S Vogel H Ney and C Tillmann ldquoHmm-based word align-ment in statistical translationrdquo in Proceedings of the 16thInternational Conference on Computational Linguistics pp 836ndash841 1996

[51] CWMT 2013, http://www.liip.cn/cwmt2013.

[52] L. Tian, F. Wong, and S. Chao, "An improvement of translation quality with adding key-words in parallel corpus," in Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC '10), pp. 1273–1278, Qingdao, China, July 2010.

[53] L. Tian, D. F. Wong, L. S. Chao et al., "UM-Corpus: a large English-Chinese parallel corpus for statistical machine translation," in Proceedings of the 9th International Conference on Language Resources and Evaluation, 2014.


8 The Scientific World Journal

[Figure 6 depicts the Moses pipeline: a bilingual corpus is word-aligned by GIZA++ (the aligner), from which phrase extraction builds the phrase table; a monolingual corpus is processed by the SRILM/IRSTLM language-model toolkits into the language model; the Moses decoder uses both models to translate source text into target text.]

Figure 6: Moses' main modules and models.

alignment link. Even a single word alignment link can translate some input phrases or sentences, not to mention alignments with more links.

We will discuss the situation with different word alignments in the next section, where the machine translation performance obtained under the best and the worst word alignment is fully surveyed. From that discussion we can see that improving the quality of word alignment has at least three advantages, as follows.

(i) Better alignment will extract fewer phrase pairs and keep the phrase table at a manageable size.

(ii) Better alignment will reduce the decoding time when searching for the most probable translation in the phrase table.

(iii) Better alignment will produce better word- and phrase-level translations.

4. Experiments and Discussions

In order to evaluate the conclusions drawn in the previous section, some experiments are carried out using the Moses toolkit [41], which provides a complete package of the modules required for training and for generating translations.

4.1. Experiment Settings. Like other SMT approaches, the Moses toolkit requires two models (as shown in Figure 6): the Language Model (LM) and the Translation Model (TM). In Moses, the LM is usually created by the external toolkits SRILM [44] or IRSTLM [45]; in our experiments, the IRSTLM toolkit is used for the large training corpus. Before training, tokenization of the languages in the Europarl corpus [46] uses the scripts integrated in Moses, and segmentation of Chinese uses the Stanford Chinese segmenter [47] with the CTB (Chinese Tree Bank) standard [48, 49]. All language models are five-gram models.

The phrases are extracted from the alignments generated by GIZA++ [8]. The word alignment was trained with ten iterations of IBM Model 1 and Model 4 [34] and six iterations of the HMM alignment model [50].
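GIZA++'s actual implementation is far more elaborate, but the first stage it runs, IBM Model 1 [34], can be sketched with a few lines of EM in Python. The toy corpus below is invented purely for illustration; this is a minimal sketch, not GIZA++'s code.

```python
from collections import defaultdict

def ibm_model1(corpus, iterations=10):
    """EM training of IBM Model 1 translation probabilities t(f|e)."""
    t = defaultdict(lambda: 1.0)  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in corpus:
            for f in src:
                z = sum(t[(f, e)] for e in tgt)   # normalization term
                for e in tgt:
                    c = t[(f, e)] / z              # expected (fractional) count
                    count[(f, e)] += c
                    total[e] += c
        for (f, e) in count:                       # M-step: renormalize
            t[(f, e)] = count[(f, e)] / total[e]
    return t

# Toy parallel corpus of (source, target) token lists.
corpus = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"])]
t = ibm_model1(corpus)
# After EM, "maison" aligns to "house" more strongly than to "the".
```

Ten iterations of this model, followed by HMM and Model 4 refinements, is exactly the training schedule quoted above.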

To extract all the possible phrases needed to verify (13) and (17), the maximal phrase length is limited to 30. This extracts most of the phrases permitted by the word alignments and lets us focus on the relations among word alignment, the phrase table, and machine translation.

Table 5: Statistics of the first 100,000 WMT 2006 corpus sentences.

Languages  Ave. source len  Ave. en len
fr-en      22.89            20.15
es-en      21.71            20.94
de-en      20.55            21.42

Table 6: Statistics of test data (3,064 sentences).

Languages  Ave. source len  Ave. en len
fr-en      32.95            27.81
es-en      29.94            27.81
de-en      26.92            27.81

Table 7: Statistics of CWMT 2013 + UM-Corpus.

Languages     Tokens     Average length  Vocabularies
English       152161233  19.37           1655080
Character_CE  229110265  29.16           397442
CTB_CE        123917395  15.78           1331505
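The exhaustive enumeration of phrase pairs consistent with a word alignment [16] can be sketched as follows. This is a minimal illustration of the consistency criterion (no alignment link may cross a phrase-pair boundary); a full extractor would also extend phrases over unaligned boundary words, which this sketch omits.

```python
def extract_phrases(n_src, n_tgt, alignment, max_len=30):
    """Enumerate phrase pairs consistent with a word alignment.
    alignment: set of (src_idx, tgt_idx) links, 0-based."""
    phrases = []
    for i1 in range(n_src):
        for i2 in range(i1, min(i1 + max_len, n_src)):
            # Target indices linked to the source span [i1, i2].
            tgt = [t for (s, t) in alignment if i1 <= s <= i2]
            if not tgt:
                continue
            j1, j2 = min(tgt), max(tgt)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistent only if no link maps [j1, j2] outside [i1, i2].
            if all(i1 <= s <= i2 for (s, t) in alignment if j1 <= t <= j2):
                phrases.append(((i1, i2), (j1, j2)))
    return phrases

# Toy 3-word sentence pair with alignment {(0,0), (1,2), (2,1)}.
pairs = extract_phrases(3, 3, {(0, 0), (1, 2), (2, 1)})
```

With `max_len=30`, as in the experiments above, a sparse alignment leaves many spans unconstrained, which is why single-link alignments blow the phrase count up toward the bounds of (17).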

In order to better reveal the relationship of alignment and phrase table to translation quality, the leftmost, middle, and rightmost word alignments are extracted from the original full alignment to simulate the worst alignment situation (i.e., only one alignment link). Here, the leftmost, middle, and rightmost word alignments denote the single alignment link attached to the first, middle, and last word on the source language side, respectively.
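The paper gives no pseudocode for this selection, so the following is one straightforward reading of it, with hypothetical function and parameter names: pick a single link according to the position of its source-side word.

```python
def single_link(alignment, position):
    """Pick one link from a full word alignment.
    alignment: set of (src_idx, tgt_idx) pairs;
    position: 'left', 'mid', or 'right' (source-side position)."""
    links = sorted(alignment)            # sort by source index
    if position == "left":
        return {links[0]}                # link on the first source word
    if position == "right":
        return {links[-1]}               # link on the last source word
    return {links[len(links) // 2]}      # link near the middle source word

full = {(0, 0), (1, 2), (2, 1), (3, 3)}
print(single_link(full, "left"))   # {(0, 0)}
print(single_link(full, "right"))  # {(3, 3)}
```

Each degraded alignment then feeds the same phrase extraction step, so that any change in phrase table size and BLEU is attributable to the alignment alone.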

4.2. Corpus Information. Three language pairs from the WMT 2006 shared translation task are chosen as training data: French-English (fr-en), German-English (de-en), and Spanish-English (es-en). These three pairs are used to verify the equation deduced in Section 3 and the relationship between word alignment and machine translation. Table 3 shows the statistics of the original corpus [46]. However, the number of phrases extracted from only one alignment link is too large for our server, so the first 100,000 Europarl sentences (statistics shown in Table 5) are chosen to verify the equations relating word alignment and phrase table. Table 6 presents the statistics of the 3,064-sentence test set from WMT 2006, which shares the same English sentences across the three pairs.

To test our algorithm for pruning the phrase table, a large Chinese-English parallel corpus is brought in. The large corpus consists of two parts: 3,289,497 sentences from CWMT 2013 [51] and 4,157,556 sentences from UM-Corpus [52, 53]. After removing repeated and mismatched sentence pairs from the combination, 7,445,190 sentences remain; the statistics of the combined parallel corpus are presented in Table 7. Note that the Chinese side is counted both at the character level (each Chinese character treated as one token; Character_CE in Table 7) and at the word level (segmented by the Stanford segmenter with the CTB standard; CTB_CE).

Table 8: Statistics of English & Chinese corpora.

Languages  Tokens      Average length  Vocabularies
English    804584844   24.64           2357374
Chinese    3007196322  33.43           1533343

Table 9: Statistics for EC-Corpus testing data.

Languages     Tokens  Average length
English       118802  23.76
Character_CE  153705  30.74
CTB_CE        118499  23.70

The UM-Corpus also contains English, Chinese, and Portuguese monolingual corpora; in this experiment, the English and Chinese corpora are used. There are 35,938,697 English sentences from the news domain and more than 70 million Chinese sentences from Chinese portal webpages. The statistics of the two corpora are shown in Table 8.

The 5,000 sentences of test data (named UM-Testing) are designed according to the domains in the UM-Corpus: 1,500 sentences for news, 500 for laws, 500 for novels, 600 for thesis, 700 for education, 600 for science, and 600 for subtitles. The statistics of the testing data are shown in Table 9.

4.3. Equation of Word Alignments versus Phrases. Calculating the number of phrases requires the full word alignment links; full word alignment, in other words, is the best-quality word alignment. Manually editing word alignments for large corpora is obviously very time-consuming. It has been observed that the generative models used for statistical word alignment create alignments of increasing quality as they are exposed to more data [19]. The intuition behind this is simple: as more co-occurrences of source and target words are observed, the word alignments become better. Following this intuition, if we wish to increase the quality of a word alignment, we give the alignment process access to extra data, which is used only during alignment and then removed. If we wish to decrease the quality of a word alignment, we divide the parallel corpus into pieces, align the pieces independently of one another, and finally concatenate the results. In our experiments, we first use the entire WMT 2006 corpus as training data and then choose the first 100,000 alignments as the full word alignments.

The leftmost, middle, and rightmost word alignment points are then derived from the full word alignment produced by GIZA++.

The minimal and maximal sizes of the phrase table, extracted from the full and the middle word alignment, and the differential rates are shown in Table 10. The differential rate is calculated by

R_diff-min = |N_min − N_moses-min| / N_min,  (22)

R_diff-max = |N_max − N_moses-max| / N_max,  (23)

where N_moses-min and N_moses-max denote the minimal and maximal numbers of phrases extracted by Moses.

From Table 10 we can see that the actual number of phrases is close to the calculated result. Given the small differential rates, (13) and (17) are applicable in practical experiments. The remaining difference may come from two sources: (1) the assumed full word alignment produced by GIZA++ may contain noise; (2) equations (13) and (17) do not account for repetitions among the extracted phrases.
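The differential rates of (22)–(23) are easy to recompute from the raw counts in Table 10; the helper name below is ours, and the values agree with the printed R_diff columns up to rounding.

```python
# R_diff = |N_calc - N_moses| / N_calc, per equations (22)-(23).
def diff_rate(n_moses, n_calc):
    return abs(n_calc - n_moses) / n_calc

# (N_moses-min, N_min, N_moses-max, N_max) from Table 10, K = 30.
table10 = {
    "fr-en": (9136283, 9080307, 1541902790, 1544021509),
    "es-en": (10958551, 10895266, 1401977185, 1489422170),
    "de-en": (9219921, 8929095, 1606176665, 1686043482),
}
for lang, (m_min, c_min, m_max, c_max) in table10.items():
    r_min = round(diff_rate(m_min, c_min), 3)
    r_max = round(diff_rate(m_max, c_max), 3)
    print(lang, r_min, r_max)
```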

4.4. Phrase Table versus Machine Translation. We have shown that there is no direct relationship between the size of the phrase table and machine translation performance. As reflected in (17) and Table 4, at least half of the phrases can be removed without hurting translation performance much. In this part, we use the large Chinese-English corpus to test our pruning algorithm based on monolingual and bilingual corpora.

According to our experience, different thresholds generate phrase tables of different sizes and result in different translation performance. Therefore, one of the key steps is choosing the pruning thresholds in Algorithms 2 and 3. Firstly, a baseline threshold is calculated from the logarithm of the number of sentences in the given corpus (N), that is, log(N). For the C-value threshold, the baseline is about 25 (log(3.6 × 10^7)). For the P-value threshold, the negative logarithm is chosen: because the probabilities involved are incredibly tiny, we work with the negative natural logs of the probabilities instead. For example, instead of selecting phrase pairs with a P-value less than exp(−15), we select phrase pairs with a negative-log-P-value greater than 15. The baseline threshold for τ_F is 45 (log(7.4 × 10^6)). Secondly, thresholds above and below these baselines are tested; their effect on the phrase table size and BLEU scores is shown in Table 11.
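Algorithms 2 and 3 themselves are defined earlier in the paper and not reproduced in this excerpt; the sketch below only illustrates the log-domain comparison described above, with an illustrative function name and a hypothetical phrase-pair score. The default `tau=45.0` mirrors the τ_F baseline.

```python
import math

# A phrase pair is significant (kept) when P < exp(-tau),
# which in the log domain is -ln(P) > tau. Comparing in the
# log domain avoids floating-point underflow for tiny P-values.
def passes_p_value_filter(neg_log_p, tau=45.0):
    return neg_log_p > tau

neg_log_p = 60.0  # hypothetical pair with P-value exp(-60)
print(passes_p_value_filter(neg_log_p))        # kept
# The probability-space test makes the same decision while P is
# still representable; exp(-60) is already ~1e-26.
print(math.exp(-neg_log_p) < math.exp(-45.0))
```

For realistic corpus sizes the P-values quickly fall below the smallest representable double, which is why the negative-log form is the practical one.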

From Table 11 we can see that nearly 98% of the phrases can be discarded using the monolingual and bilingual corpora without the translation quality becoming much worse (a decrease of only 2.35 BLEU). The experiments also show that reducing about 85% of the phrase table yields the best translation quality, which not only reproduces results similar to those of other researchers [3, 21, 32, 38] but also indicates that most phrases are not needed at all. The best translation is achieved at about 15% to 30% of the original phrase table size. That is to say, only a small fraction of the phrases are essential; the majority of the extracted phrases are spurious, in some sense redundant, and should be discarded.


Table 10: Phrase counts: Moses extraction versus calculation result (phrase length K = 30).

Languages  N_moses-min  N_min     R_diff-min  N_moses-max  N_max       R_diff-max
fr-en      9136283      9080307   0.006       1541902790   1544021509  0.001
es-en      10958551     10895266  0.006       1401977185   1489422170  0.058
de-en      9219921      8929095   0.033       1606176665   1686043482  0.047

Table 11: C-value and P-value threshold effect on the phrase table size and BLEU scores (UM-Corpus + CWMT 2013).

C-value ε  −log(P-value) τ_F  Phrase table (%)  BLEU (%)
0          0                  100               28.13
10         20                 83.6              28.09
25         45                 52.5              27.98
100        60                 29.7              28.11
200        100                14.3              28.27
400        200                2.02              25.78

Table 12: Translation performance (BLEU) on different word alignment types (3,064 test sentences).

Languages  A_full  A_lmost  A_sure  A_mid  A_rmost
fr-en      24.80   4.48     22.26   9.96   2.48
es-en      30.76   3.12     28.24   11.28  2.32
de-en      21.66   2.30     19.28   9.76   2.94

4.5. Word Alignment versus Translation Quality. To investigate how different word alignment types and phrase table sizes affect translation performance, we select five types of word alignment to generate phrase tables and measure translation quality with BLEU: the full word alignment points (A_full), the leftmost word alignment (A_lmost), the sure alignment points (A_sure), the middle word alignment (A_mid), and the rightmost word alignment (A_rmost).

Clearly, A_full is the best-quality word alignment, followed by A_sure, which keeps only the 1-to-1 alignment points. A_lmost, A_mid, and A_rmost are much worse, each consisting of only one alignment link. As mentioned previously, all these alignment points are obtained from the results generated by GIZA++. The experiment is carried out on the WMT 2006 corpus. Based on the different alignment types, the number of phrases extracted and the corresponding translation BLEU are presented in Figure 7 and Table 12, respectively.

The histograms show that there is no direct proportional relationship between translation performance and the number of extracted phrases. The numbers in Figure 7 show that the German-English pair generates the most phrases, yet its translation quality is not the best; conversely, although the Spanish-English pair produces the fewest phrases, it obtains the best translation performance. As explained by (17), more phrases usually indicate worse word alignment quality, that is, too many unaligned words in the produced alignment. The results in Figure 7 and Table 12 show that (1) the full word alignment contributes most to translation performance, and there is not much decrease when only the sure alignment points are provided; (2) the leftmost and rightmost single-link alignments do not produce high translation quality; (3) although the middle-position alignment extracts the most phrases, its translation quality is not the best, though the middle link appears to contribute more to the final translation than the other single links.

In fact, the translation quality obtainable from single-link word alignments should be even higher than reported here; after all, the default full word alignments, trained from a large corpus without manual editing, contain a good deal of noise.

5. Conclusions

In this paper, we investigate the relationship among word alignment, phrase table, and machine translation. After deducing a formula for estimating the size of the phrase table, a maximal phrase pruning ratio can be calculated from the average sentence length of the given corpus and the word alignment positions. Finally, translation performance under different word alignment types is compared.

Equations (13) and (17) provide a simple way to predict the size of the phrase table in advance from the generated word alignment points. This is useful for estimating how much hardware is required to train the model when the parallel corpus is huge.

Experimental results show that translation performance is affected significantly only when the AER decreases or increases substantially. There is no direct proportional relationship between translation performance and the number of extracted phrases; in other words, most of the phrases are not needed at all, and the phrases extracted from the full word alignment are enough. Our corpus-motivated pruning method shows that only 15–30% of the phrases are needed for the basic translations, and most phrases in the phrase table are noise for the final translation.

Under the existing algorithms that extract phrases from word alignments, better word alignment is very helpful to the final translation. Even without improving the phrase extraction algorithm itself, a better-quality word alignment has at least three advantages, as follows.

(i) Better alignment will extract fewer phrase pairs and keep the phrase table at a manageable size.


[Figure 7: six panels. Panels (a)–(c) plot the number of phrases extracted for fr-en, es-en, and de-en under the alignment types A_lmost, A_sure, A_mid, A_full, and A_rmost; the full alignment A_full yields the fewest phrases (about 9.14 × 10^6 for fr-en) and the single middle link A_mid the most (about 1.54 × 10^9 for fr-en). Panels (d)–(f) plot the corresponding BLEU scores, matching Table 12.]

Figure 7: Number of phrases extracted from different word alignment points and their BLEU scores.

(ii) Better alignment will reduce the decoding time when searching for the most probable translation in the phrase table.

(iii) Better alignment will produce better word- and phrase-level translations.

In future work, we plan to integrate the corpus-motivated pruning technique into the translation model itself; that is, phrases will be filtered during training rather than after the phrase table is generated.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank all reviewers for their careful reading and helpful suggestions. The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for funding support under Reference nos. MYRG076 (Y1-L2)-FST13-WF and MYRG070 (Y1-L2)-FST12-CS.

References

[1] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol. 1, pp. 48–54, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[2] R. Zens, F. J. Och, and H. Ney, Phrase-Based Statistical Machine Translation, Springer, Berlin, Germany, 2002.

[3] J. H. Shin, Y. S. Han, and K. S. Choi, "Bilingual knowledge acquisition from Korean-English parallel corpus using alignment method: Korean-English alignment at word and phrase level," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 230–235, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.

[4] K. Yamamoto, T. Kudo, Y. Tsuboi, and Y. Matsumoto, "Learning sequence-to-sequence correspondences from parallel corpora via sequential pattern mining," in Proceedings of the HLT-NAACL Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, vol. 3, pp. 73–80, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[5] C. Goutte, K. Yamada, and E. Gaussier, "Aligning words using matrix factorisation," in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2004.

[6] F. J. Och, C. Tillmann, H. Ney, and L. F. Informatik, Improved Alignment Models for Statistical Machine Translation, University of Maryland, College Park, Md, USA, 1999.

[7] F. J. Och and H. Ney, "A comparison of alignment models for statistical machine translation," in Proceedings of the 18th Conference on Computational Linguistics, vol. 2, pp. 1086–1090, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2000.

[8] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.

[9] F. J. Och and H. Ney, "The alignment template approach to statistical machine translation," Computational Linguistics, vol. 30, no. 4, pp. 417–449, 2004.

[10] N. Duan, M. Li, and M. Zhou, "Improving phrase extraction via MBR phrase scoring and pruning," in Proceedings of the MT Summit XIII, pp. 189–197, 2011.

[11] G. Neubig, T. Watanabe, E. Sumita, S. Mori, and T. Kawahara, "An unsupervised model for joint phrase alignment and extraction," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 632–641, Stroudsburg, Pa, USA, June 2011.

[12] H. Setiawan, H. Li, and M. Zhang, "Learning phrase translation using level of detail approach," in The Tenth Machine Translation Summit (MT-Summit X), 2005.

[13] C. Tillmann, "A projection extension algorithm for statistical machine translation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[14] B. Zhao and S. Vogel, "A generalized alignment-free phrase extraction," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 141–144, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.

[15] Y. Zhang and S. Vogel, "Competitive grouping in integrated phrase segmentation and alignment model," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 159–162, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.

[16] A. Lopez, "Statistical machine translation," ACM Computing Surveys, vol. 40, no. 3, article 8, 2008.

[17] K. Ganchev, J. V. Graça, and B. Taskar, "Better alignments = better translations?" in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 986–993, June 2008.

[18] D. Vilar, M. Popovic, and H. Ney, "AER: do we need to 'improve' our alignments?" in Proceedings of the International Workshop on Spoken Language Translation, pp. 205–212, 2006.

[19] A. Fraser and D. Marcu, "Measuring word alignment quality for statistical machine translation," Computational Linguistics, vol. 33, no. 3, pp. 293–303, 2007.

[20] P. Lambert, S. Petitrenaud, Y. Ma, and A. Way, "What types of word alignment improve statistical machine translation?" Machine Translation, vol. 26, no. 4, pp. 289–323, 2012.

[21] J. H. Johnson, J. Martin, G. Foster, and R. Kuhn, "Improving translation quality by discarding most of the phrasetable," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 967–975, June 2007.

[22] J. H. Johnson, Conditional Significance Pruning: Discarding More of Huge Phrase Tables, 2012.

[23] M. Yang and J. Zheng, "Toward smaller, faster, and better hierarchical phrase-based SMT," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 237–240, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.

[24] N. Tomeh, N. Cancedda, and M. Dymetman, "Complexity-based phrase-table filtering for statistical machine translation," in Proceedings of the MT Summit XII, 2009.

[25] M. Eck, S. Vogel, and A. Waibel, "Translation model pruning via usage statistics for statistical machine translation," in Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '07), pp. 21–24, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.

[26] M. Eck, S. Vogel, and A. Waibel, "Estimating phrase pair relevance for machine translation pruning," in Proceedings of the MT Summit XI, pp. 159–165, 2007.

[27] Y. Chen, A. Eisele, and M. Kay, "Improving statistical machine translation efficiency by triangulation," in Proceedings of the Sixth International Conference on Language Resources and Evaluation, 2008.

[28] Y. Chen, M. Kay, and A. Eisele, "Intersecting multilingual data for faster and better statistical translations," in Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '09), pp. 128–136, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2009.

[29] G. Sanchis-Trilles, D. Ortiz-Martínez, J. González-Rubio, J. González, and F. Casacuberta, "Bilingual segmentation for phrasetable pruning in statistical machine translation," in Proceedings of the 15th International Conference of the European Association for Machine Translation (EAMT '11), pp. 257–264, May 2011.

[30] N. Tomeh, M. Turchi, G. Wisniewski, A. Allauzen, and F. Yvon, "How good are your phrases? Assessing phrase quality with single class classification," in Proceedings of the International Workshop on Spoken Language Translation, pp. 261–268, 2011.

[31] W. Ling, J. Graça, I. Trancoso, and A. Black, "Entropy-based pruning for phrase-based machine translation," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 962–971, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[32] R. Zens, D. Stanton, and P. Xu, "A systematic comparison of phrase table pruning techniques," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 972–983, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[33] K. Papineni, S. Roukos, T. Ward, and W. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pp. 311–318, 2002.

[34] P. F. Brown, V. J. Della Pietra, S. A. Della Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Computational Linguistics, vol. 19, no. 2, pp. 263–311, 1993.

[35] P. Koehn, Statistical Machine Translation, Cambridge University Press, New York, NY, USA, 1st edition, 2010.

[36] E. Galbrun, "Phrase table pruning for statistical machine translation," Department of Computer Science Series of Publications C, C-2009-22, University of Helsinki, 2007.

[37] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, USA, 2006.

[38] Z. He, Y. Meng, Y. Lu, H. Yu, and Q. Liu, "Reducing SMT rule table with monolingual key phrase," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 121–124, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.

[39] Z. He, Y. Meng, and H. Yu, "Discarding monotone composed rule for hierarchical phrase-based statistical machine translation," in Proceedings of the 3rd International Universal Communication Symposium (IUCS '09), pp. 25–29, New York, NY, USA, December 2009.

[40] K. T. Frantzi and S. Ananiadou, "Extracting nested collocations," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 41–46, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.

[41] P. Koehn, H. Hoang, A. Birch et al., "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177–180, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.

[42] D. Chiang, "A hierarchical phrase-based model for statistical machine translation," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 263–270, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2005.

[43] W. A. Gale and K. W. Church, "Identifying word correspondences in parallel texts," in Proceedings of the Workshop on Speech and Natural Language, pp. 152–157, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1991.

[44] A. Stolcke, "SRILM: an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing, pp. 901–904, 2002.

[45] M. Federico, N. Bertoldi, and M. Cettolo, "IRSTLM: an open source toolkit for handling large scale language models," in Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH '08), pp. 1618–1621, September 2008.

[46] P. Koehn, "Europarl: a parallel corpus for statistical machine translation," in Proceedings of the MT Summit X, 2005.

[47] S. Green and J. DeNero, "A class-based agreement model for generating accurately inflected translations," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, vol. 1, pp. 146–155, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[48] N. Xue, F. D. Chiou, and M. Palmer, "Building a large-scale annotated Chinese corpus," in Proceedings of the 19th International Conference on Computational Linguistics, vol. 1, pp. 1–8, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2002.

[49] N. Xue, F. Xia, F.-D. Chiou, and M. Palmer, "The Penn Chinese TreeBank: phrase structure annotation of a large corpus," Natural Language Engineering, vol. 11, no. 2, pp. 207–238, 2005.

[50] S. Vogel, H. Ney, and C. Tillmann, "HMM-based word alignment in statistical translation," in Proceedings of the 16th International Conference on Computational Linguistics, pp. 836–841, 1996.

[51] CWMT 2013 httpwwwliipcncwmt2013

[52] L Tian F Wong and S Chao ldquoAn improvement of translationquality with adding key-words in parallel corpusrdquo in Proceed-ings of the International Conference on Machine Learning andCybernetics (ICMLC rsquo10) pp 1273ndash1278 Qingdao China July2010

[53] L Tian D Wong L Chao et al ldquoUM-Corpus a large English-Chinese parallel corpus for statistical machine translationrdquo inProceedings of the 9th International Conference on LanguageResources and Evaluation 2014


The Scientific World Journal 9

Table 8: Statistics of English & Chinese corpora.

Languages   Tokens          Average length   Vocabularies
English     804,584,844     24.64            2,357,374
Chinese     3,007,196,322   33.43            1,533,343

Table 9: Statistics for EC-Corpus testing data.

Languages      Tokens    Average length
English        118,802   23.76
Character_CE   153,705   30.74
CTB_CE         118,499   23.70

word level (segmented by the Stanford CTB segmenter, represented by CTB_CE).

The UM-Corpus also contains English, Chinese, and Portuguese monolingual corpora. In this experiment, the English and Chinese corpora are used. There are 35,938,697 English sentences from the news domain and more than 70 million Chinese sentences from a China portal webpage. The statistics of the two corpora are shown in Table 8.

The 5,000 sentences of test data (named UM-Testing) are designed according to the domains in the UM-Corpus: there are 1,500 sentences for news, 500 for laws, 500 for novels, 600 for thesis, 700 for education, 600 for science, and 600 for subtitles. The statistics of the testing data are shown in Table 9.
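As a quick sanity check, the average lengths in Table 9 are simply the token counts divided by the 5,000 test sentences:

```python
# Average lengths in Table 9 = token count / 5000 test sentences.
tokens = {"English": 118802, "Character_CE": 153705, "CTB_CE": 118499}
avg_len = {name: round(count / 5000, 2) for name, count in tokens.items()}
print(avg_len)  # {'English': 23.76, 'Character_CE': 30.74, 'CTB_CE': 23.7}
```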

4.3. Equation of Word Alignments versus Phrases. The calculation of the number of phrases requires the full word alignment links. Full word alignment, in other words, is the word alignment of the best quality. It is obviously very time-consuming to manually edit the word alignments of a large set of sentences. It has already been observed in the past that the generative models used for statistical word alignment create alignments of increasing quality as they are exposed to more data [19]. The intuition behind this is simple: as more cooccurrences of source and target words are observed, the word alignments get better. Following this intuition, if we wish to increase the quality of a word alignment, we allow the alignment process access to extra data, which is used only during the alignment process and then removed. If we wish to decrease the quality of a word alignment, we divide the parallel corpus into pieces, align the pieces independently of one another, and finally concatenate the results together. In our experiments, we first use the entire WMT 2006 corpus as training data, and then the first 100,000 alignments are chosen as the full word alignments.
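The quality-decreasing protocol just described (divide, align independently, concatenate) can be sketched as below; `run_giza` is a hypothetical stand-in for an actual GIZA++ run over a list of sentence pairs:

```python
def align_in_pieces(bitext, n_pieces, run_giza):
    """Degrade alignment quality on purpose: split the parallel corpus,
    align each piece independently (so fewer cooccurrences are observed
    per run), and concatenate the per-piece results."""
    size = -(-len(bitext) // n_pieces)  # ceiling division; no pairs dropped
    alignments = []
    for start in range(0, len(bitext), size):
        alignments.extend(run_giza(bitext[start:start + size]))
    return alignments
```

Fewer pieces mean more cooccurrence evidence per aligner run and hence better alignments; increasing `n_pieces` degrades quality in a controlled way.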

Also, the leftmost, middle, and rightmost word alignment points based on the full word alignment are obtained from the results generated by GIZA++.

The minimal and maximal sizes of the phrase table extracted from the full and middle word alignments and the differential rates are shown in Table 10. The differential rate is calculated by

\[
R_{\text{diff-min}} = \frac{\left|N_{\min} - N_{\text{moses-min}}\right|}{N_{\min}}, \tag{22}
\]

\[
R_{\text{diff-max}} = \frac{\left|N_{\max} - N_{\text{moses-max}}\right|}{N_{\max}}, \tag{23}
\]

where N_moses-min and N_moses-max indicate the minimal and maximal numbers of phrases extracted by Moses.

From Table 10, we can see that the actual number of phrases is close to the calculated result. Given the small differential rates, (13) and (17) are applicable to practical experiments. The differences may come from two sources: (1) the assumed full word alignment produced by GIZA++ may contain too much noise; (2) equations (13) and (17) do not consider repetitions of the extracted phrases.
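As a check, the differential rates in Table 10 can be recomputed directly from (22) and (23); the inputs below are the Table 10 counts, and the printed rates match the table up to rounding:

```python
def diff_rate(n_predicted, n_moses):
    """R_diff = |N_predicted - N_moses| / N_predicted, per (22)/(23)."""
    return abs(n_predicted - n_moses) / n_predicted

# Table 10 counts (K = 30): N_moses-min, N_min, N_moses-max, N_max.
table10 = {
    "fr-en": (9136283, 9080307, 1541902790, 1544021509),
    "es-en": (10958551, 10895266, 1401977185, 1489422170),
    "de-en": (9219921, 8929095, 1606176665, 1686043482),
}

for lang, (moses_min, n_min, moses_max, n_max) in table10.items():
    print(lang,
          round(diff_rate(n_min, moses_min), 3),
          round(diff_rate(n_max, moses_max), 3))
# fr-en 0.006 0.001
# es-en 0.006 0.059
# de-en 0.033 0.047
```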

4.4. Phrase Table versus Machine Translation. We have shown that there is no direct relationship between the size of the phrase table and machine translation performance. As reflected in (17) and Table 4, at least half of the phrases can be removed without hurting the performance of machine translation too much. In this part, we want to use the large Chinese-English corpus to test our pruning algorithm based on the monolingual and bilingual corpora.

According to our experience, different thresholds generate phrase tables of different sizes and result in different translation performance. Therefore, one of the key steps is choosing the pruning thresholds in Algorithms 2 and 3. Firstly, a baseline threshold is calculated from the logarithm of the number of sentences in the given corpus (N), that is, log(N). For the C-value threshold, the baseline threshold is about 25 (log(3.6 × 10^7)). For the P-value threshold, the negative logarithm is chosen: because the probabilities involved are incredibly tiny, we work instead with the negatives of the natural logs of the probabilities. For example, instead of selecting phrase pairs with a P-value less than exp(−15), we select phrase pairs with a negative-log-P-value greater than 15. The baseline threshold for τ_F is 45 (log(7.4 × 10^6)). Secondly, the different thresholds are increased or decreased from the baseline. The effect of the different thresholds on the phrase table and BLEU scores is shown in Table 11.
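The negative-log transformation above is a standard guard against floating-point underflow; a minimal sketch, with the threshold value 15 taken from the example in the text:

```python
import math

def keep_phrase_pair(p_value, threshold=15.0):
    # Selecting pairs with P < exp(-threshold) is equivalent to
    # -ln(P) > threshold, which avoids underflow for tiny probabilities.
    return -math.log(p_value) > threshold

print(keep_phrase_pair(math.exp(-20)))  # True: P-value below exp(-15)
print(keep_phrase_pair(0.5))            # False: not significant enough
```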

From Table 11, we can see that nearly 98% of the phrases can be discarded with the monolingual and bilingual corpora while the translation quality does not become much worse (a decrease of only 2.35 BLEU points). Experiments show that reducing about 85% of the phrase table can still generate the best translation quality, which not only again confirms results similar to those of other researchers [3, 21, 32, 38] but also indicates that most phrases are not needed at all. The best translation can be achieved with about 15% to 30% of the original phrase table. That is to say, only a small fraction of the phrases are essential elements, while the majority of the extracted phrases are spurious; in some sense they are redundant and should be discarded.
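Algorithms 2 and 3 themselves are not reproduced in this section; assuming the two thresholds are applied jointly (an assumption, since the text does not state the combination rule), the pruning step can be sketched as below, where `c_value` and `neg_log_p` are hypothetical per-pair scores:

```python
def prune_phrase_table(phrase_table, eps, tau_f):
    """Joint threshold filter: keep a phrase pair only if its C-value
    reaches eps and its -log(P-value) reaches tau_f (the two threshold
    columns of Table 11).  `phrase_table` maps a (source, target) pair
    to its (c_value, neg_log_p) scores."""
    return {
        pair: (c_value, neg_log_p)
        for pair, (c_value, neg_log_p) in phrase_table.items()
        if c_value >= eps and neg_log_p >= tau_f
    }
```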


Table 10: Phrase comparison between Moses extraction and calculation result (phrase length K = 30).

Languages   N_moses-min   N_min        R_diff-min   N_moses-max     N_max           R_diff-max
fr-en       9,136,283     9,080,307    0.006        1,541,902,790   1,544,021,509   0.001
es-en       10,958,551    10,895,266   0.006        1,401,977,185   1,489,422,170   0.058
de-en       9,219,921     8,929,095    0.033        1,606,176,665   1,686,043,482   0.047

Table 11: C-value and P-value threshold effect on the phrase table size and BLEU scores (UM-Corpus + CWMT 2013).

C-value ε   −log(P-value) τ_F   Phrase table (%)   BLEU (%)
0           0                   100                28.13
10          20                  83.6               28.09
25          45                  52.5               27.98
100         60                  29.7               28.11
200         100                 14.3               28.27
400         200                 2.02               25.78

Table 12: Translation performance on different word alignment types.

            BLEU (3,064 sentences)
Languages   A_full   A_lmost   A_sure   A_mid    A_rmost
fr-en       24.80    4.48      22.26    9.96     2.48
es-en       30.76    3.12      28.24    11.28    2.32
de-en       21.66    2.30      19.28    9.76     2.94

4.5. Word Alignment versus Translation Quality. In order to investigate the effect of different word alignment types and phrase table sizes on translation performance, we selected five different types of word alignments to generate the phrase table and then measured the translation quality by the BLEU metric: the full word alignment points (A_full), the leftmost word alignment point (A_lmost), the sure alignment points (A_sure), the middle word alignment point (A_mid), and the rightmost word alignment point (A_rmost).

Clearly, A_full is the word alignment of the best quality, followed by A_sure, which contains only the 1-to-1 alignments. A_lmost, A_mid, and A_rmost are much worse, as each keeps only one alignment link. As mentioned previously, all these alignment points are obtained after getting the results generated by GIZA++. The experiment is carried out on the WMT 2006 corpus. For the different alignment types, the number of phrases extracted and the corresponding translation BLEU scores are presented in Figure 7 and Table 12, respectively.
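Under one plausible reading of these definitions (an assumption, since the paper gives no pseudocode for them), the three single-link alignments can be derived from a sentence's full alignment as follows:

```python
def single_link_alignments(full_alignment):
    """From one sentence's full alignment (a list of (source, target)
    index links), keep only the leftmost, middle, or rightmost link,
    giving the hypothesized A_lmost / A_mid / A_rmost alignments."""
    links = sorted(full_alignment)
    return {
        "A_lmost": [links[0]],
        "A_mid": [links[len(links) // 2]],
        "A_rmost": [links[-1]],
    }
```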

The histograms show that there is no direct proportional relationship between translation performance and the number of extracted phrases. The numbers in Figure 7 show that the German-English language pair generates the most phrases, while its translation quality is not the best. Conversely, although the fewest phrases are produced for the Spanish-English language pair, the best translation performance is obtained. As explained by (17), more phrases usually indicate worse word alignment quality; in other words, there are too many unaligned words in the produced alignment results. The results in Figure 7 and Table 12 show that (1) the full word alignment contributes most to the translation performance, and there is not too much decrease when only the sure alignments are provided; (2) however, the leftmost and rightmost word alignments do not produce high translation quality; (3) although the middle-position word alignment extracts the most phrases, its translation quality is not the best, and it seems that the middle word alignment contributes more to the final translation than the leftmost and rightmost ones.

In fact, the translation quality from the single-link word alignments should be higher; after all, the default full word alignments, which are trained from a large training corpus without manual editing, contain too much noise.

5. Conclusions

In this paper, some investigations are made into the relationship among word alignment, phrase table, and machine translation. After deducing a formula for estimating the size of the phrase table, a maximal phrase pruning ratio can be calculated based on the average sentence length of the given corpus and the word alignment positions. Finally, translation performance based on different word alignment types is compared.

Equations (13) and (17) provide a simple way to predict the size of the phrase table in advance, based on the generated word alignment points. This is beneficial for evaluating how many hardware resources are required to train the model when the parallel corpus is huge.

Experimental results show that the effect on translation performance is significant if the AER decreases or increases too much. There is no direct proportional relationship between translation performance and the number of extracted phrases. In other words, most of the phrases are not needed at all; the phrases extracted from the full word alignment alone are enough. It can be seen from our corpus-motivated pruning method that only 15-30% of the phrases are needed for the basic translations, and most phrases in the phrase table are noise to the final translation.

Based on the existing phrase extraction algorithm from word alignment, better word alignment will be very helpful to the final translation. There are at least three advantages of a better quality of word alignment, even without improving the existing phrase extraction algorithm:

(i) Better alignment will extract fewer phrase pairs and keep the phrase table at a manageable size.

[Figure 7: Number of phrases extracted from different word alignment points and their BLEU scores; panels (a)-(c) show the phrase counts and panels (d)-(f) the BLEU scores for fr-en, es-en, and de-en over the alignment types A_lmost, A_sure, A_mid, A_full, and A_rmost.]

(ii) Better alignment will reduce the decoding time when searching for the most probable translation in the phrase table.

(iii) Better alignment will produce better quality of word- or phrase-level translation.

The next work we want to focus on is adding the corpus-motivated pruning technique into the translation model; that is, the phrases will be filtered during training, not after generating the phrase table.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank all reviewers for their very careful reading and helpful suggestions. The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for funding their research under Reference nos. MYRG076 (Y1-L2)-FST13-WF and MYRG070 (Y1-L2)-FST12-CS.

References

[1] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol. 1, pp. 48–54, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[2] R. Zens, F. J. Och, and H. Ney, Phrase-Based Statistical Machine Translation, Springer, Berlin, Germany, 2002.

[3] J. H. Shin, Y. S. Han, and K. S. Choi, "Bilingual knowledge acquisition from Korean-English parallel corpus using alignment method: Korean-English alignment at word and phrase level," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 230–235, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.

[4] K. Yamamoto, T. Kudo, Y. Tsuboi, and Y. Matsumoto, "Learning sequence-to-sequence correspondences from parallel corpora via sequential pattern mining," in Proceedings of the HLT-NAACL Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, vol. 3, pp. 73–80, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[5] C. Goutte, K. Yamada, and E. Gaussier, "Aligning words using matrix factorisation," in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2004.

[6] F. J. Och, C. Tillmann, and H. Ney, Improved Alignment Models for Statistical Machine Translation, University of Maryland, College Park, Md, USA, 1999.

[7] F. J. Och and H. Ney, "A comparison of alignment models for statistical machine translation," in Proceedings of the 18th Conference on Computational Linguistics, vol. 2, pp. 1086–1090, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2000.

[8] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.

[9] F. J. Och and H. Ney, "The alignment template approach to statistical machine translation," Computational Linguistics, vol. 30, no. 4, pp. 417–449, 2004.

[10] N. Duan, M. Li, and M. Zhou, "Improving phrase extraction via MBR phrase scoring and pruning," in Proceedings of the MT Summit XIII, pp. 189–197, 2011.

[11] G. Neubig, T. Watanabe, E. Sumita, S. Mori, and T. Kawahara, "An unsupervised model for joint phrase alignment and extraction," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 632–641, Stroudsburg, Pa, USA, June 2011.

[12] H. Setiawan, H. Li, and M. Zhang, "Learning phrase translation using level of detail approach," in The Tenth Machine Translation Summit (MT-Summit X), 2005.

[13] C. Tillmann, "A projection extension algorithm for statistical machine translation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.

[14] B. Zhao and S. Vogel, "A generalized alignment-free phrase extraction," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 141–144, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.

[15] Y. Zhang and S. Vogel, "Competitive grouping in integrated phrase segmentation and alignment model," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 159–162, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.

[16] A. Lopez, "Statistical machine translation," ACM Computing Surveys, vol. 40, no. 3, article 8, 2008.

[17] K. Ganchev, J. V. Graça, and B. Taskar, "Better alignments = better translations?" in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 986–993, June 2008.

[18] D. Vilar, M. Popovic, and H. Ney, "AER: do we need to 'improve' our alignments?" in Proceedings of the International Workshop on Spoken Language Translation, pp. 205–212, 2006.

[19] A. Fraser and D. Marcu, "Measuring word alignment quality for statistical machine translation," Computational Linguistics, vol. 33, no. 3, pp. 293–303, 2007.

[20] P. Lambert, S. Petitrenaud, Y. Ma, and A. Way, "What types of word alignment improve statistical machine translation?" Machine Translation, vol. 26, no. 4, pp. 289–323, 2012.

[21] J. H. Johnson, J. Martin, G. Foster, and R. Kuhn, "Improving translation quality by discarding most of the phrasetable," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 967–975, June 2007.

[22] J. H. Johnson, Conditional Significance Pruning: Discarding More of Huge Phrase Tables, 2012.

[23] M. Yang and J. Zheng, "Toward smaller, faster, and better hierarchical phrase-based SMT," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 237–240, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.

[24] N. Tomeh, N. Cancedda, and M. Dymetman, "Complexity-based phrase-table filtering for statistical machine translation," in Proceedings of the MT Summit XII, 2009.

[25] M. Eck, S. Vogel, and A. Waibel, "Translation model pruning via usage statistics for statistical machine translation," in Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '07), pp. 21–24, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.

[26] M. Eck, S. Vogel, and A. Waibel, "Estimating phrase pair relevance for machine translation pruning," in Proceedings of the MT Summit XI, pp. 159–165, 2007.

[27] Y. Chen, A. Eisele, and M. Kay, "Improving statistical machine translation efficiency by triangulation," in Proceedings of the Sixth International Conference on Language Resources and Evaluation, 2008.

[28] Y. Chen, M. Kay, and A. Eisele, "Intersecting multilingual data for faster and better statistical translations," in Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '09), pp. 128–136, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2009.

[29] G. Sanchis-Trilles, D. Ortiz-Martínez, J. González-Rubio, J. González, and F. Casacuberta, "Bilingual segmentation for phrasetable pruning in statistical machine translation," in Proceedings of the 15th International Conference of the European Association for Machine Translation (EAMT '11), pp. 257–264, May 2011.

[30] N. Tomeh, M. Turchi, G. Wisniewski, A. Allauzen, and F. Yvon, "How good are your phrases? Assessing phrase quality with single class classification," in Proceedings of the International Workshop on Spoken Language Translation, pp. 261–268, 2011.

[31] W. Ling, J. Graça, I. Trancoso, and A. Black, "Entropy-based pruning for phrase-based machine translation," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 962–971, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[32] R. Zens, D. Stanton, and P. Xu, "A systematic comparison of phrase table pruning techniques," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 972–983, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[33] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pp. 311–318, 2002.

[34] P. F. Brown, V. J. Della Pietra, S. A. Della Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Computational Linguistics, vol. 19, no. 2, pp. 263–311, 1993.

[35] P. Koehn, Statistical Machine Translation, Cambridge University Press, New York, NY, USA, 1st edition, 2010.

[36] E. Galbrun, "Phrase table pruning for statistical machine translation," Department of Computer Science Series of Publications C, C-2009-22, University of Helsinki, 2009.

[37] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, USA, 2006.

[38] Z. He, Y. Meng, Y. Lu, H. Yu, and Q. Liu, "Reducing SMT rule table with monolingual key phrase," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 121–124, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.

[39] Z. He, Y. Meng, and H. Yu, "Discarding monotone composed rule for hierarchical phrase-based statistical machine translation," in Proceedings of the 3rd International Universal Communication Symposium (IUCS '09), pp. 25–29, New York, NY, USA, December 2009.

[40] K. T. Frantzi and S. Ananiadou, "Extracting nested collocations," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 41–46, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.

[41] P. Koehn, H. Hoang, A. Birch et al., "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177–180, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.

[42] D. Chiang, "A hierarchical phrase-based model for statistical machine translation," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 263–270, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2005.

[43] W. A. Gale and K. W. Church, "Identifying word correspondences in parallel texts," in Proceedings of the Workshop on Speech and Natural Language, pp. 152–157, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1991.

[44] A. Stolcke, "SRILM – an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing, pp. 901–904, 2002.

[45] M. Federico, N. Bertoldi, and M. Cettolo, "IRSTLM: an open source toolkit for handling large scale language models," in Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH '08), pp. 1618–1621, September 2008.

[46] P. Koehn, "Europarl: a parallel corpus for statistical machine translation," in Proceedings of the MT Summit X, 2005.

[47] S. Green and J. DeNero, "A class-based agreement model for generating accurately inflected translations," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, vol. 1, pp. 146–155, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.

[48] N. Xue, F.-D. Chiou, and M. Palmer, "Building a large-scale annotated Chinese corpus," in Proceedings of the 19th International Conference on Computational Linguistics, vol. 1, pp. 1–8, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2002.

[49] N. Xue, F. Xia, F.-D. Chiou, and M. Palmer, "The Penn Chinese TreeBank: phrase structure annotation of a large corpus," Natural Language Engineering, vol. 11, no. 2, pp. 207–238, 2005.

[50] S. Vogel, H. Ney, and C. Tillmann, "HMM-based word alignment in statistical translation," in Proceedings of the 16th International Conference on Computational Linguistics, pp. 836–841, 1996.

[51] CWMT 2013, http://www.liip.cn/cwmt2013.

[52] L. Tian, F. Wong, and S. Chao, "An improvement of translation quality with adding key-words in parallel corpus," in Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC '10), pp. 1273–1278, Qingdao, China, July 2010.

[53] L. Tian, D. Wong, L. Chao et al., "UM-Corpus: a large English-Chinese parallel corpus for statistical machine translation," in Proceedings of the 9th International Conference on Language Resources and Evaluation, 2014.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 10: Research Article A Relationship: Word Alignment, Phrase

10 The Scientific World Journal

Table 10 Phrases comparison between Moses extraction and calculation result

Phrase length 119870 = 30

Languages 119873moses-min 119873min 119877diff-min 119873moses-max 119873max 119877diff-max

fr-en 9136283 9080307 0006 1541902790 1544021509 0001es-en 10958551 10895266 0006 1401977185 1489422170 0058de-en 9219921 8929095 0033 1606176665 1686043482 0047

Table 11: C-value and P-value threshold effect on the phrase table size and BLEU scores (UM-Corpus + CWMT 2013).

C-value ε   −log(P-value) τ   Phrase table (%)   BLEU (%)
0           0                 100                28.13
10          20                83.6               28.09
25          45                52.5               27.98
100         60                29.7               28.11
200         100               14.3               28.27
400         200               2.02               25.78
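A minimal sketch of the two-threshold filter evaluated in Table 11: a phrase pair survives only if its C-value reaches ε and its −log(P-value) reaches τ. The phrase pairs and scores below are hypothetical illustrations; computing the actual C-values follows [40] and is not reproduced here.

```python
# Hedged sketch of the two-threshold pruning behind Table 11: a phrase
# pair is kept only if c_value >= epsilon AND -log(p_value) >= tau.
# All phrase pairs and scores below are hypothetical illustrations.
import math

phrase_table = [
    # (source phrase, target phrase, C-value, P-value)
    ("the house", "la maison", 350.0, 1e-120),
    ("house", "maison", 120.0, 1e-80),
    ("of the", "de la", 15.0, 0.2),       # frequent but uninformative
    ("green house", "maison", 2.0, 0.6),  # likely alignment noise
]

def prune(table, epsilon, tau):
    """Keep pairs whose C-value >= epsilon and -log(P-value) >= tau."""
    return [(s, t) for s, t, c, p in table
            if c >= epsilon and -math.log(p) >= tau]

kept = prune(phrase_table, epsilon=100, tau=60)
print(kept)  # only the two well-attested pairs survive
```

Raising ε and τ together shrinks the table monotonically, which matches the steadily decreasing "Phrase table (%)" column of Table 11.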

Table 12: Translation performance (BLEU, 3,064 test sentences) on different word alignment types.

Languages   A_full   A_lmost   A_sure   A_mid   A_rmost
fr-en       24.80    4.48      22.26    9.96    2.48
es-en       30.76    3.12      28.24    11.28   2.32
de-en       21.66    2.30      19.28    9.76    2.94

4.5. Word Alignment versus Translation Quality. In order to investigate how different word alignment types and phrase table sizes affect translation performance, we selected five types of word alignments to generate the phrase table and then measured translation quality with the BLEU metric: the full word alignment points (A_full), the leftmost word alignment points (A_lmost), the sure alignment points (A_sure), the middle word alignment points (A_mid), and the rightmost word alignment points (A_rmost).

Clearly, A_full is the highest-quality word alignment, followed by A_sure, which keeps only the 1-to-1 links. A_lmost, A_mid, and A_rmost are much weaker, since each retains only a single alignment link. As mentioned previously, all of these alignment points are obtained from the results generated by GIZA++. The experiment is carried out on the WMT 2006 corpus. For each alignment type, the number of phrases extracted and the corresponding translation BLEU scores are presented in Figure 7 and Table 12, respectively.
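The single-link alignment types can be derived from the full GIZA++ alignment; the sketch below shows one plausible construction, keeping for each source word only the leftmost, middle, or rightmost of its target-side links (the exact convention used in the experiments may differ):

```python
# Derive single-link alignments from a full alignment: one plausible
# reading of A_lmost / A_mid / A_rmost is to keep, per source word,
# only the leftmost, middle, or rightmost of its target-side links.
from collections import defaultdict

def reduce_alignment(links, pick):
    """links: set of (src_idx, tgt_idx); pick: 'left', 'mid', or 'right'."""
    by_src = defaultdict(list)
    for s, t in links:
        by_src[s].append(t)
    reduced = set()
    for s, targets in by_src.items():
        targets.sort()
        if pick == "left":
            reduced.add((s, targets[0]))
        elif pick == "right":
            reduced.add((s, targets[-1]))
        else:  # middle link
            reduced.add((s, targets[len(targets) // 2]))
    return reduced

# Toy full alignment: source word 0 is linked to target words 0, 1, and 2.
full = {(0, 0), (0, 1), (0, 2), (1, 3)}
print(reduce_alignment(full, "left"))   # {(0, 0), (1, 3)}
print(reduce_alignment(full, "right"))  # {(0, 2), (1, 3)}
```

Each reduced alignment leaves many words unaligned, which is exactly why these types extract far more (and far noisier) phrase pairs than A_full or A_sure.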

The histograms show that there is no direct proportional relationship between translation performance and the number of extracted phrases. The numbers in Figure 7 show that the German-English language pair generates the most phrases, yet its translation quality is not the best; conversely, although the Spanish-English pair produces the fewest phrases, it achieves the best translation performance. As explained by (17), more phrases usually indicate worse word alignment quality, that is, too many unaligned words in the produced alignment results. The results in Figure 7 and Table 12 show that (1) the full word alignment contributes most to translation performance, and there is not much degradation when only the sure alignments are provided; (2) the leftmost and rightmost single-link alignments, however, do not produce high translation quality; (3) although the middle-position word alignment extracts the most phrases, its translation quality is not the best, even though the middle link appears to contribute more to the final translation than the leftmost or rightmost one.

In fact, translation from a single-link word alignment ought to score higher; after all, the default full word alignments, trained on a large corpus without manual correction, contain considerable noise.

5. Conclusions

In this paper, some investigations are made into the relationship among word alignment, phrase table, and machine translation. After deducing a formula for estimating the size of the phrase table, a maximal phrase pruning ratio can be calculated based on the average sentence length of the given corpus and the word alignment positions. Finally, translation performance based on different word alignment types is compared.

Equations (13) and (17) provide a simple way to predict the size of the phrase table in advance from the generated word alignment points. This is beneficial for evaluating how much hardware is required to train the model when the parallel corpus is huge.
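As a rough illustration of this hardware-planning use, a predicted phrase count converts directly into a storage estimate; the bytes-per-entry figure below is an assumption for illustration, not a measurement:

```python
# Back-of-the-envelope storage estimate from a predicted phrase count.
# The bytes-per-entry figure is an assumed average (source and target
# phrase strings plus the associated scores), not a measured value.
BYTES_PER_ENTRY = 100

def phrase_table_gigabytes(n_phrases, bytes_per_entry=BYTES_PER_ENTRY):
    """Approximate phrase table size in gigabytes."""
    return n_phrases * bytes_per_entry / 1e9

# Upper-bound prediction for fr-en from Table 10 (about 1.54e9 pairs):
print(round(phrase_table_gigabytes(1_544_021_509), 1))  # ~154.4 GB
```

Even under this crude assumption, the maximal-extraction table clearly does not fit in memory on commodity hardware, which motivates predicting its size before training.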

Experimental results show that the effect on translation performance is significant if the AER decreases or increases too much. There is no direct proportional relationship between translation performance and the number of extracted phrases. In other words, most of the phrases are not needed at all; the phrases extracted from the full word alignment alone are enough. Our corpus-motivated pruning method shows that only about 1.5–3.0% of the phrases are needed for the basic translations, and most phrases in the phrase table are noise to the final translation.

Given the existing algorithm for extracting phrases from word alignments, better word alignment will be very helpful to the final translation. Even without improving the phrase extraction algorithm itself, a higher-quality word alignment has at least three advantages:

(i) Better alignment will extract fewer phrase pairs and keep the phrase table at a manageable size.

[Figure 7 appears here: six bar charts over the word alignment types A_lmost, A_sure, A_mid, A_full, and A_rmost, giving (a) fr-en, (b) es-en, and (c) de-en numbers of extracted phrases, and (d) fr-en, (e) es-en, and (f) de-en BLEU scores.]

Figure 7: Number of phrases extracted from different word alignment points and their BLEU scores.

(ii) Better alignment will reduce the decoding time when searching for the most probable translation in the phrase table.

(iii) Better alignment will produce better quality word- or phrase-level translations.
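These advantages all stem from the standard consistency criterion used when extracting phrases from a word alignment: a source span and a target span form a phrase pair only if no alignment link crosses the span boundary, so unaligned words let many more spans qualify. A toy sketch of that criterion (span enumeration only, with an illustrative phrase-length limit):

```python
# Consistent phrase-pair extraction: a source span [i, j] and target
# span [k, l] form a pair iff every alignment link touching either
# span lies fully inside both, and at least one link lies inside.
# Unaligned words let many spans qualify, which inflates the table.

def extract_spans(links, src_len, tgt_len, max_len=7):
    pairs = []
    for i in range(src_len):
        for j in range(i, min(src_len, i + max_len)):
            for k in range(tgt_len):
                for l in range(k, min(tgt_len, k + max_len)):
                    touching = [(s, t) for s, t in links
                                if i <= s <= j or k <= t <= l]
                    if touching and all(i <= s <= j and k <= t <= l
                                        for s, t in touching):
                        pairs.append(((i, j), (k, l)))
    return pairs

dense = {(0, 0), (1, 1), (2, 2)}   # fully aligned 3-word sentence pair
sparse = {(1, 1)}                  # two words left unaligned
print(len(extract_spans(dense, 3, 3)))   # 6 consistent pairs
print(len(extract_spans(sparse, 3, 3)))  # 16 consistent pairs
```

The fully aligned toy sentence yields only the diagonal spans, while the single-link alignment licenses every span that covers the one link, illustrating why worse alignments produce dramatically larger phrase tables.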

The next work we want to focus on is to integrate the corpus-motivated pruning technique into the translation models; that is, the phrases will be filtered during training, not after generating the phrase table.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank all reviewers for the very careful reading and helpful suggestions. The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for funding their research under Reference nos. MYRG076 (Y1-L2)-FST13-WF and MYRG070 (Y1-L2)-FST12-CS.

References

[1] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol. 1, pp. 48–54, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.
[2] R. Zens, F. J. Och, and H. Ney, Phrase-Based Statistical Machine Translation, Springer, Berlin, Germany, 2002.
[3] J. H. Shin, Y. S. Han, and K. S. Choi, "Bilingual knowledge acquisition from Korean-English parallel corpus using alignment method: Korean-English alignment at word and phrase level," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 230–235, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.
[4] K. Yamamoto, T. Kudo, Y. Tsuboi, and Y. Matsumoto, "Learning sequence-to-sequence correspondences from parallel corpora via sequential pattern mining," in Proceedings of the HLT-NAACL Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, vol. 3, pp. 73–80, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.
[5] C. Goutte, K. Yamada, and E. Gaussier, "Aligning words using matrix factorisation," in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2004.
[6] F. J. Och, C. Tillmann, H. Ney, and L. F. Informatik, Improved Alignment Models for Statistical Machine Translation, University of Maryland, College Park, Md, USA, 1999.
[7] F. J. Och and H. Ney, "A comparison of alignment models for statistical machine translation," in Proceedings of the 18th Conference on Computational Linguistics, vol. 2, pp. 1086–1090, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2000.
[8] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.
[9] F. J. Och and H. Ney, "The alignment template approach to statistical machine translation," Computational Linguistics, vol. 30, no. 4, pp. 417–449, 2004.
[10] N. Duan, M. Li, and M. Zhou, "Improving phrase extraction via MBR phrase scoring and pruning," in Proceedings of the MT Summit XIII, pp. 189–197, 2011.
[11] G. Neubig, T. Watanabe, E. Sumita, S. Mori, and T. Kawahara, "An unsupervised model for joint phrase alignment and extraction," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 632–641, Stroudsburg, Pa, USA, June 2011.
[12] H. Setiawan, H. Li, and M. Zhang, "Learning phrase translation using level of detail approach," in The Tenth Machine Translation Summit (MT-Summit X), 2005.
[13] C. Tillmann, "A projection extension algorithm for statistical machine translation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2003.
[14] B. Zhao and S. Vogel, "A generalized alignment-free phrase extraction," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 141–144, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.
[15] Y. Zhang and S. Vogel, "Competitive grouping in integrated phrase segmentation and alignment model," in Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 159–162, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2005.
[16] A. Lopez, "Statistical machine translation," ACM Computing Surveys, vol. 40, no. 3, article 8, 2008.
[17] K. Ganchev, J. V. Graça, and B. Taskar, "Better alignments = better translations?" in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 986–993, June 2008.
[18] D. Vilar, M. Popovic, and H. Ney, "AER: do we need to 'improve' our alignments?" in Proceedings of the International Workshop on Spoken Language Translation, pp. 205–212, 2006.
[19] A. Fraser and D. Marcu, "Measuring word alignment quality for statistical machine translation," Computational Linguistics, vol. 33, no. 3, pp. 293–303, 2007.
[20] P. Lambert, S. Petitrenaud, Y. Ma, and A. Way, "What types of word alignment improve statistical machine translation?" Machine Translation, vol. 26, no. 4, pp. 289–323, 2012.
[21] J. H. Johnson, J. Martin, G. Foster, and R. Kuhn, "Improving translation quality by discarding most of the phrasetable," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 967–975, June 2007.
[22] J. H. Johnson, Conditional Significance Pruning: Discarding More of Huge Phrase Tables, 2012.
[23] M. Yang and J. Zheng, "Toward smaller, faster, and better hierarchical phrase-based SMT," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 237–240, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.
[24] N. Tomeh, N. Cancedda, and M. Dymetman, "Complexity-based phrase-table filtering for statistical machine translation," in Proceedings of the MT Summit XII, 2009.
[25] M. Eck, S. Vogel, and A. Waibel, "Translation model pruning via usage statistics for statistical machine translation," in Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '07), pp. 21–24, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.
[26] M. Eck, S. Vogel, and A. Waibel, "Estimating phrase pair relevance for machine translation pruning," in Proceedings of the MT Summit XI, pp. 159–165, 2007.
[27] Y. Chen, A. Eisele, and M. Kay, "Improving statistical machine translation efficiency by triangulation," in Proceedings of the Sixth International Conference on Language Resources and Evaluation, 2008.
[28] Y. Chen, M. Kay, and A. Eisele, "Intersecting multilingual data for faster and better statistical translations," in Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT '09), pp. 128–136, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2009.
[29] G. Sanchis-Trilles, D. Ortiz-Martínez, J. González-Rubio, J. González, and F. Casacuberta, "Bilingual segmentation for phrasetable pruning in statistical machine translation," in Proceedings of the 15th International Conference of the European Association for Machine Translation (EAMT '11), pp. 257–264, May 2011.
[30] N. Tomeh, M. Turchi, G. Wisniewski, A. Allauzen, and F. Yvon, "How good are your phrases? Assessing phrase quality with single class classification," in Proceedings of the International Workshop on Spoken Language Translation, pp. 261–268, 2011.
[31] W. Ling, J. Graça, I. Trancoso, and A. Black, "Entropy-based pruning for phrase-based machine translation," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 962–971, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.
[32] R. Zens, D. Stanton, and P. Xu, "A systematic comparison of phrase table pruning techniques," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 972–983, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.
[33] K. Papineni, S. Roukos, T. Ward, and W. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pp. 311–318, 2002.
[34] P. F. Brown, V. J. Della Pietra, S. A. Della Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Computational Linguistics, vol. 19, no. 2, pp. 263–311, 1993.
[35] P. Koehn, Statistical Machine Translation, Cambridge University Press, New York, NY, USA, 1st edition, 2010.
[36] E. Galbrun, "Phrase table pruning for statistical machine translation," Department of Computer Science Series of Publications C, C-2009-22, University of Helsinki, 2009.
[37] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, USA, 2006.
[38] Z. He, Y. Meng, Y. Lu, H. Yu, and Q. Liu, "Reducing SMT rule table with monolingual key phrase," in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), pp. 121–124, Association for Computational Linguistics, Stroudsburg, Pa, USA, August 2009.
[39] Z. He, Y. Meng, and H. Yu, "Discarding monotone composed rule for hierarchical phrase-based statistical machine translation," in Proceedings of the 3rd International Universal Communication Symposium (IUCS '09), pp. 25–29, New York, NY, USA, December 2009.
[40] K. T. Frantzi and S. Ananiadou, "Extracting nested collocations," in Proceedings of the 16th Conference on Computational Linguistics, vol. 1, pp. 41–46, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1996.
[41] P. Koehn, H. Hoang, A. Birch et al., "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177–180, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2007.
[42] D. Chiang, "A hierarchical phrase-based model for statistical machine translation," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 263–270, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2005.
[43] W. A. Gale and K. W. Church, "Identifying word correspondences in parallel texts," in Proceedings of the Workshop on Speech and Natural Language, pp. 152–157, Association for Computational Linguistics, Stroudsburg, Pa, USA, 1991.
[44] A. Stolcke, "SRILM – an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing, pp. 901–904, 2002.
[45] M. Federico, N. Bertoldi, and M. Cettolo, "IRSTLM: an open source toolkit for handling large scale language models," in Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH '08), pp. 1618–1621, September 2008.
[46] P. Koehn, "Europarl: a parallel corpus for statistical machine translation," in Proceedings of the MT Summit X, 2005.
[47] S. Green and J. DeNero, "A class-based agreement model for generating accurately inflected translations," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, vol. 1, pp. 146–155, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2012.
[48] N. Xue, F. D. Chiou, and M. Palmer, "Building a large-scale annotated Chinese corpus," in Proceedings of the 19th International Conference on Computational Linguistics, vol. 1, pp. 1–8, Association for Computational Linguistics, Stroudsburg, Pa, USA, 2002.
[49] N. Xue, F. Xia, F.-D. Chiou, and M. Palmer, "The Penn Chinese TreeBank: phrase structure annotation of a large corpus," Natural Language Engineering, vol. 11, no. 2, pp. 207–238, 2005.
[50] S. Vogel, H. Ney, and C. Tillmann, "HMM-based word alignment in statistical translation," in Proceedings of the 16th International Conference on Computational Linguistics, pp. 836–841, 1996.
[51] CWMT 2013, http://www.liip.cn/cwmt2013.
[52] L. Tian, F. Wong, and S. Chao, "An improvement of translation quality with adding key-words in parallel corpus," in Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC '10), pp. 1273–1278, Qingdao, China, July 2010.
[53] L. Tian, D. F. Wong, L. S. Chao et al., "UM-Corpus: a large English-Chinese parallel corpus for statistical machine translation," in Proceedings of the 9th International Conference on Language Resources and Evaluation, 2014.

[48] N Xue F D Chiou and M lmer ldquoBuilding a large-scaleannotated chinese corpusrdquo in Proceedings of the 19th Interna-tional Conference on Computational Linguistics vol 1 pp 1ndash8PaAssociation for Computational Linguistics Stroudsburg PaUSA 2002

[49] N Xue F Xia F-D Chiou and M Palmer ldquoThe Penn ChineseTreeBank phrase structure annotation of a large corpusrdquoNatural Language Engineering vol 11 no 2 pp 207ndash238 2005

[50] S Vogel H Ney and C Tillmann ldquoHmm-based word align-ment in statistical translationrdquo in Proceedings of the 16thInternational Conference on Computational Linguistics pp 836ndash841 1996

[51] CWMT 2013, http://www.liip.cn/cwmt2013.

[52] L. Tian, F. Wong, and S. Chao, "An improvement of translation quality with adding key-words in parallel corpus," in Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC '10), pp. 1273–1278, Qingdao, China, July 2010.

[53] L. Tian, D. Wong, L. Chao et al., "UM-Corpus: a large English-Chinese parallel corpus for statistical machine translation," in Proceedings of the 9th International Conference on Language Resources and Evaluation, 2014.






