
Published as a conference paper at ICLR 2021

NEAREST NEIGHBOR MACHINE TRANSLATION

Urvashi Khandelwal†∗, Angela Fan‡, Dan Jurafsky†, Luke Zettlemoyer‡, Mike Lewis‡
†Stanford University   ‡Facebook AI Research
{urvashik,jurafsky}@stanford.edu   {angelafan,lsz,mikelewis}@fb.com

ABSTRACT

We introduce k-nearest-neighbor machine translation (kNN-MT), which predicts tokens with a nearest neighbor classifier over a large datastore of cached examples, using representations from a neural translation model for similarity search. This approach requires no additional training and scales to give the decoder direct access to billions of examples at test time, resulting in a highly expressive model that consistently improves performance across many settings. Simply adding nearest neighbor search improves a state-of-the-art German-English translation model by 1.5 BLEU. kNN-MT allows a single model to be adapted to diverse domains by using a domain-specific datastore, improving results by an average of 9.2 BLEU over zero-shot transfer, and achieving new state-of-the-art results, without training on these domains. A massively multilingual model can also be specialized for particular language pairs, with improvements of 3 BLEU for translating from English into German and Chinese. Qualitatively, kNN-MT is easily interpretable; it combines source and target context to retrieve highly relevant examples.

1 INTRODUCTION

Non-parametric methods have recently been successfully applied to tasks such as language modeling (Khandelwal et al., 2020) and question answering (Guu et al., 2020; Lewis et al., 2020). They allow models that are (1) expressive, because they can use an arbitrary amount of data at test time; (2) adaptable, because predictions can be controlled by changing the datastore; and (3) interpretable, because the data used to make the prediction can be directly inspected. We introduce kNN-MT, a simple non-parametric method for machine translation (MT) using nearest neighbor retrieval. kNN-MT can be added to any pre-trained neural translation model without further training, and significantly improves performance for in-domain, out-of-domain, and multilingual evaluations.

More specifically, kNN-MT interpolates the target-token softmax distribution from a neural MT model with a multinomial generated using nearest neighbor search over examples cached in a datastore. The cache is over translation contexts (i.e., the complete source and prefix of the target), and is indexed by hidden states computed from the base MT model. We hypothesize that contexts which are close in representation space are more likely to be followed by the same target word. We show this is not only true for the original training data, thereby improving base model performance, but across a range of different bi-text corpora, allowing for simple and effective model adaptation.

Our work builds upon recent results showing the effectiveness of nearest neighbor methods in unconditional language models (Khandelwal et al., 2020). We generalize to conditional language models by using both source and target context, and show nearest neighbor models can be effective for generation in addition to density estimation. Compared to prior work on non-parametric methods for MT, our approach is arguably simpler (in that it requires no training, as compared to Gu et al. (2018)) and more expressive (in that it provides access to billions of key-value pairs during inference, as compared to Zhang et al. (2018); Gu et al. (2018)).

Extensive experiments show that kNN-MT scales to datastores containing billions of tokens, improving results across a range of settings. For example, it improves a state-of-the-art German-English translation model by 1.5 BLEU.

∗Work done while the first author was interning at Facebook AI Research.


[Figure 1: pipeline for computing the kNN distribution. Training translation contexts (e.g. French source "J'ai été à Paris." with target prefix "I have") are mapped to (representation, target token) pairs in the datastore; at test time, the query representation retrieves the nearest k neighbors, whose distances are converted into a distribution over targets via a temperature-scaled softmax, normalization, and aggregation over repeated tokens. Caption follows.]

Figure 1: An illustration of how the kNN distribution is computed. The datastore, which is constructed offline, consists of representations of training set translation contexts and corresponding target tokens for every example in the parallel data. During generation, the query representation, conditioned on the test input as well as previously generated tokens, is used to retrieve the k nearest neighbors from the datastore, along with the corresponding target tokens. The distance from the query is used to compute a distribution over the retrieved targets after applying a softmax temperature. This distribution is the final kNN distribution.

kNN-MT can also be used to adapt a single model to diverse domains by simply adding a domain-specific datastore, improving results by an average of 9.2 BLEU over the base model out-of-domain, and even outperforming existing models that train on these domains. Finally, language-pair-specific datastores are used to adapt a multilingual model to particular language pairs, with improvements of 3 BLEU for translating English into German and Chinese. We find that retrievals from kNN-MT are typically highly contextually relevant.

2 NEAREST NEIGHBOR MACHINE TRANSLATION

kNN-MT involves augmenting the decoder of a pre-trained machine translation model with a nearest neighbor retrieval mechanism, allowing the model direct access to a datastore of cached examples. The translation is generated word-by-word; at each time step, we find the most similar contexts in the datastore, and compute a distribution over the corresponding target tokens, as shown in Figure 1. This distribution is then interpolated with the output distribution from the pre-trained MT model.

More specifically, given an input sequence of tokens in a source language $s = (s_1, \ldots, s_{M_1})$, a neural MT model outputs a sequence of tokens $t = (t_1, \ldots, t_{M_2})$ in the target language. When using autoregressive decoders, the output distribution for each token $t_i$ in the target sequence is conditioned on the entire source sequence as well as the previous target tokens, $p(t_i \mid s, t_{1:i-1})$. Let $(s, t_{1:i-1})$ be the translation context and $t_i$ be the target token.

Datastore creation   Our datastore is constructed offline and consists of a set of key-value pairs. The key is a high-dimensional representation of the entire translation context computed by the MT decoder, $f(s, t_{1:i-1})$, where $f$ represents a mapping from the input to an intermediate representation of the decoder. The value is the corresponding ground truth target token $t_i$. For a parallel text collection $(\mathcal{S}, \mathcal{T})$, the representations are generated by a single forward pass over each example and the complete datastore is defined as follows:

$$(\mathcal{K}, \mathcal{V}) = \{(f(s, t_{1:i-1}),\, t_i),\ \forall t_i \in t \mid (s, t) \in (\mathcal{S}, \mathcal{T})\} \qquad (1)$$

Tokens from the source language are not stored directly as values in the datastore. Conditioning on the source is implicit via the keys, and the values are only target language tokens.
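The following is a minimal sketch of datastore creation as defined in Eq. 1; it is not the authors' code. The `decoder_states` callable is an assumed hook into the base MT decoder that returns $f(s, t_{1:i-1})$ for every target position from a single forward pass.

```python
from typing import Callable, Iterable, Sequence, Tuple
import numpy as np

def build_datastore(
    decoder_states: Callable[[Sequence[int], Sequence[int]], np.ndarray],
    parallel_data: Iterable[Tuple[Sequence[int], Sequence[int]]],
) -> Tuple[np.ndarray, np.ndarray]:
    """Collect (key, value) pairs: keys are decoder context representations,
    values are the ground-truth target tokens t_i (Eq. 1)."""
    keys, values = [], []
    for src, tgt in parallel_data:
        # One forward pass per example; decoder_states(src, tgt) is assumed to
        # return an array of shape (len(tgt), d), row i holding f(s, t_{1:i-1}).
        keys.append(decoder_states(src, tgt))
        values.extend(tgt)  # the value for position i is the token t_i itself
    return np.concatenate(keys).astype(np.float32), np.asarray(values, dtype=np.int64)
```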

Generation   At test time, given a source $x$, the model outputs a distribution over the vocabulary $p_{\mathrm{MT}}(y_i \mid x, y_{1:i-1})$ for the target $y_i$ at every step of generation, where $y$ represents the generated tokens. The model also outputs the representation $f(x, y_{1:i-1})$, which is used to query the datastore for the $k$ nearest neighbors $\mathcal{N}$ according to squared-$L^2$ distance, $d$. In practice, the search over billions of key-value pairs is carried out using FAISS (Johnson et al., 2017), a library for fast nearest neighbor search in high-dimensional spaces.

The retrieved set is converted into a distribution over the vocabulary by applying a softmax with temperature $T$ to the negative distances and aggregating over multiple occurrences of the same vocabulary item.


Using a temperature greater than one flattens the distribution, and prevents overfitting to the most similar retrievals.

$$p_{\mathrm{kNN}}(y_i \mid x, y_{1:i-1}) \propto \sum_{(k_j, v_j) \in \mathcal{N}} \mathbb{1}_{y_i = v_j} \exp\left(\frac{-d(k_j, f(x, y_{1:i-1}))}{T}\right) \qquad (2)$$

While a pure kNN approach is effective, we improve results by interpolating with the base model distribution, which is more robust in cases without relevant cached examples. The model and kNN distributions are interpolated with a tuned parameter $\lambda$, resulting in the final kNN-MT distribution:

$$p(y_i \mid x, y_{1:i-1}) = \lambda\, p_{\mathrm{kNN}}(y_i \mid x, y_{1:i-1}) + (1 - \lambda)\, p_{\mathrm{MT}}(y_i \mid x, y_{1:i-1}) \qquad (3)$$

The complete translation is generated using beam search.
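As an illustration only (not the authors' implementation), the sketch below computes Eqs. 2 and 3 for a single decoding step, assuming the neighbor distances, neighbor target tokens, and base model distribution are given as arrays.

```python
import numpy as np

def knn_mt_step(distances: np.ndarray,        # (k,) squared-L2 distances d(k_j, q)
                neighbor_values: np.ndarray,  # (k,) target token ids v_j
                p_mt: np.ndarray,             # (V,) base model distribution p_MT
                temperature: float = 10.0,
                lmbda: float = 0.5) -> np.ndarray:
    """Return the interpolated kNN-MT distribution for one target position."""
    # Softmax with temperature over negative distances (Eq. 2).
    logits = -distances / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Aggregate probability mass over neighbors sharing the same target token.
    p_knn = np.zeros_like(p_mt)
    np.add.at(p_knn, neighbor_values, weights)
    # Interpolate with the base MT distribution (Eq. 3).
    return lmbda * p_knn + (1.0 - lmbda) * p_mt
```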

kNN-MT vs. kNN-LM   kNN-MT is a generalization of kNN-LM applied to conditional sequence generation, with a few important differences. First, the keys are not only conditioned on prior context, but also on the source sequence (here, in a different language). This means that the representations must encode both source and target context; we show examples in Section 6. Second, there is an additional tuned parameter, the softmax temperature. Higher temperatures flatten the distribution and allow for greater diversity without overfitting to the retrieved contexts, as shown in Section 6.

3 EXPERIMENTAL SETUP

We experiment with kNN-MT in three settings: (1) single language-pair translation, (2) multilingual MT and (3) domain adaptation.

Data We use the following datasets for training and evaluation.

WMT’19: For the single language-pair experiments, we use WMT’19 data for German-English.

CCMATRIX: We train our multilingual model on CCMatrix (Schwenk et al., 2019), containing parallel data for 79 languages and 1,546 language pairs. The parallel sentences are mined from cleaned monolingual commoncrawl data created using the ccNet pipeline (Wenzek et al., 2019). Semantically similar sentences in different languages are aligned using a learned distance measure; we use examples where the distance measure is at least 1.06, resulting in 4 billion sentence-pairs.

NEWSTEST: The newstest2018 and newstest2019 test sets from WMT (Bojar et al., 2018; Barrault et al., 2019) are used as validation and test sets for the multilingual experiments. The same German-English validation and test sets are also used for evaluation in the single language-pair and domain adaptation experiments.

TED TALKS: We use the Ted Talks data prepared by Qi et al. (2018) for evaluation in the multilingual setting, particularly to explore performance for language pairs that do not include English.

MULTI-DOMAINS: We use the multi-domains dataset (Koehn & Knowles, 2017), re-split by Aharoni & Goldberg (2020) for the domain adaptation experiments. It includes German-English parallel data for train/validation/test sets in five domains: Medical, Law, IT, Koran and Subtitles.

Models   For the single language-pair and domain adaptation experiments, we use the WMT'19 German-English news translation task winner (Ng et al., 2019), available via the FAIRSEQ library (Ott et al., 2019).1 It is a Transformer encoder-decoder model (Vaswani et al., 2017) with 6 layers, 1,024 dimensional representations, 8,192 dimensional feedforward layers and 8 attention heads. Apart from WMT'19 training data, this model is trained on over 10 billion tokens of backtranslation data and fine-tuned on newstest test sets from years prior to 2018. In this work, we do not use ensembles or n-best reranking.

For multilingual MT, we trained a 418M parameter Transformer-based encoder-decoder model on the CCMatrix data for 100K updates. The model has embedding dimension 1,024, hidden dimension 4,096, and 12 layers in both the encoder and decoder, with 16 attention heads. To balance the training of different language pairs, which have various resource levels, we apply temperature up-sampling with T = 5 (Arivazhagan et al., 2019).

1 https://github.com/pytorch/fairseq/tree/master/examples/translation


The vocabulary is shared across all languages and consists of 128K subwords extracted using sentencepiece (Kudo & Richardson, 2018).2 All results use case-sensitive detokenized BLEU, measured using SACREBLEU (Post, 2018). We provide the SACREBLEU signatures, along with details on the statistical power of our experiments, in Appendix C.

kNN-MT   In this work, we use a FAISS index to represent the datastore and search for nearest neighbors. The keys are stored in clusters to speed up search and quantized to 64 bytes for space efficiency (the full-precision keys are discarded). The index is constructed offline via a single forward pass over every example in the given parallel text collection. We use the 1024-dimensional representation input to the final layer feedforward network as the key. Building the index involves a training phase to learn the cluster centroids. We use 5M keys for learning 131K cluster centroids for the multilingual experiments, and 1M keys for 4K clusters for in-domain data in the domain adaptation experiments. During inference, we query the datastore for 64 neighbors while searching 32 clusters. The interpolation and softmax temperature parameters are tuned on the validation sets.3
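A minimal sketch of this FAISS setup is shown below, using the sizes quoted for the domain adaptation setting (4K clusters trained on 1M keys, 64-byte codes, 32 clusters searched at query time). The paper does not name the exact FAISS index type, so the IVF-PQ configuration here is an assumption.

```python
import faiss
import numpy as np

def build_index(keys: np.ndarray, n_clusters: int = 4096,
                code_size: int = 64, n_train_keys: int = 1_000_000) -> faiss.Index:
    """keys: float32 array of shape (n, d), e.g. d = 1024 decoder representations."""
    d = keys.shape[1]
    quantizer = faiss.IndexFlatL2(d)                    # coarse L2 quantizer
    index = faiss.IndexIVFPQ(quantizer, d, n_clusters, code_size, 8)
    index.train(keys[:n_train_keys])                    # learn cluster centroids
    index.add(keys)                                     # only quantized keys are kept
    index.nprobe = 32                                   # clusters searched per query
    return index

# Querying 64 neighbors for a single decoder state `query` of shape (d,):
# distances, ids = index.search(query.reshape(1, -1).astype(np.float32), 64)
```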

Computational Cost   While kNN-MT does not add trainable model parameters, it does add some computational overhead. The primary cost of building the datastore is a single forward pass over all examples in the datastore, which is a fraction of the cost of training on the same examples for one epoch. During inference, retrieving 64 keys from a datastore containing billions of items results in a generation speed that is two orders of magnitude slower than the base MT system. Generation speed can be improved by searching fewer clusters, using smaller beams, or querying smaller datastores, with relatively minor trade-offs in performance, as we will see in Section 5. Developing faster nearest neighbor search tools remains an active area of research (Guo et al., 2020).

4 EXPERIMENTS

4.1 SINGLE LANGUAGE-PAIR TRANSLATION

To test whether kNN-MT can improve a model's ability to generalize from its training data, we first apply it to a state-of-the-art translation model, using a datastore containing only the original training set. We use a state-of-the-art German-English model as our base MT system, which scores 37.59 BLEU on the newstest2019 test set.4 This is a highly competitive baseline: apart from the WMT'19 training data, the base model has also been trained on over 10 billion tokens of extra backtranslation data as well as fine-tuned on newstest test sets from previous years. Providing this heavily tuned base model with a datastore containing about 770M tokens of WMT'19 training data improves performance by 1.5 BLEU to 39.08, without any additional training. This result shows that even very strong translation models can be improved with test-time access to training sets.

4.2 MULTILINGUAL MACHINE TRANSLATION

Next, we apply kNN-MT to multilingual machine translation, to measure its ability to add capacity to a model when using very large training sets. For these experiments, we create datastores using subsets of the CCMatrix parallel data that the model has been trained on.

Retrieving neighbors from same source language data   Here, we build one datastore per language-pair being tested, using the training examples for that language-pair. Table 1 shows performance for the baseline and kNN-MT on 17 language-pairs from newstest2019. Retrieving neighbors results in up to 3 BLEU improvements for English-German, English-Chinese and Chinese-English, with an average improvement of 1.4 BLEU across all 17 pairs, without any additional training.

Table 1 also shows the sizes of each of the datastores. Datastore size and the increase in BLEU are only weakly correlated across languages, though within a language, a larger datastore is decidedly better, as shown in Section 5.

2 Evaluating our model on the recently released OPUS100 (Zhang et al., 2020) corpus improves upon the result in Zhang et al. (2020) by 0.4 BLEU, suggesting that it is a very strong baseline.

3 Code for kNN-MT will be available at https://github.com/urvashik/knnlm.
4 The winning system (scoring 40.8 BLEU) extends this model with ensembling and n-best reranking.


                   de-en  ru-en  zh-en  ja-en  fi-en  lt-en  de-fr  de-cs  en-cs
Test set sizes     2,000  2,000  2,000    993  1,996  1,000  1,701  1,997  2,000

Base MT            34.45  36.42  24.23  12.79  25.92  29.59  32.75  21.15  22.78
+kNN-MT            35.74  37.83  27.51  13.14  26.55  29.98  33.68  21.62  23.76
Datastore Size     5.56B  3.80B  1.19B   360M   318M   168M  4.21B   696M   533M

                   en-de  en-ru  en-zh  en-ja  en-fi  en-lt  fr-de  cs-de   Avg.
Test set sizes     1,997  1,997  1,997  1,000  1,997    998  1,701  1,997      -

Base MT            36.47  26.28  30.22  21.35  21.37  17.41  26.04  22.78  26.00
+kNN-MT            39.49  27.91  33.63  23.23  22.20  18.25  27.81  23.55  27.40
Datastore Size     6.50B  4.23B  1.13B   433M   375M   204M  3.98B   689M      -

Table 1: Multilingual machine translation with kNN-MT. All test sets are from newstest2019, except ja-en/en-ja which are from newstest2020. Adding kNN-MT increases BLEU scores in all cases, and by over 3 points for en-de, zh-en and en-zh. Bold scores indicate significant results based on statistically powered experiments (Card et al., 2020).

                          Ted Talks                        Newstest2019           Avg.
                 de-ja  ru-ja  uk-ja  de-ru  de-zh   fr-de  cs-de  de-cs
Test set sizes   4,442  5,090  3,560  4,288  4,349   1,701  1,997  1,997       -

Base MT          10.11   9.69   8.36  17.24  20.48   26.04  22.78  21.15   16.98
+kNN-MT (en-∗)   11.08  10.42   9.64  18.02  21.22   27.85  23.71  21.74   17.96
Datastore Size    433M   433M   433M  4.23B  1.13B   6.50B  6.50B   533M       -

Table 2: Adding datastores with English source-side data can improve translation from other languages by an average of 1 BLEU. The model's representations of the source generalize across languages and make cross-lingual retrieval effective.

This suggests that underlying factors, such as the quality of the parallel data used to populate the datastore, may also factor into the size of the improvements from kNN-MT.

Finally, we also observe that improvements for languages translated into English, on average 1.23 BLEU, are lower than improvements for languages translated from English, on average 1.94 BLEU. As English is the most frequent language in the base model's training data, this suggests that kNN-MT is particularly useful for improving decoders in languages that may be underfit during training.

Retrieving neighbors using English as the source language   Here, we construct datastores from training examples where English is the source language, and the target language is the language being tested. This setting is useful for rarer language pairs with less bi-text, and is related to pivoting (Utiyama & Isahara, 2007; Cohn & Lapata, 2007). Table 2 shows that on five pairs from the Ted Talks data and three from newstest2019, kNN-MT improves performance by 1 BLEU on average. This result shows that the model's representations of the source generalize well enough across languages to make cross-lingual retrieval effective. Further investigation is needed to study the extent to which multilingual representations from related and unrelated languages can improve translation performance via kNN-MT.

4.3 DOMAIN ADAPTATION

We also measure the effectiveness of kNN-MT for domain adaptation, in which we use a domain-specific datastore to adapt the model, without further training. For these experiments, we use the German-English translation model from Section 4.1 as our base MT system and provide domain-specific data in the datastores. We also explore the effects of retrieving neighbors from a large amount of out-of-domain data as well as from a single multi-domain datastore.


                               Newstest2019  Medical    Law     IT  Koran  Subtitles   Avg.
Test set sizes                        2,000    2,000  2,000  2,000  2,000      2,000      -

Aharoni & Goldberg (2020):
  one model per domain                    -     56.5   59.0   43.0   15.9       27.3  40.34
  one model for all domains               -     53.3   57.2   42.1   20.9       27.6  40.22
  best data selection method              -     54.8   58.8   43.5   21.8       27.4  41.26

Base MT                               37.59    39.91  45.71  37.98  16.30      29.21  33.82
+kNN-MT:
  in-domain datastore                 39.08    54.35  61.78  45.82  19.45      31.73  42.63
  WMT'19 datastore                    39.08    40.22  46.74  40.27  17.99      29.23  34.89
  all-domains datastore               38.88    54.54  61.11  48.63  19.22      31.70  43.04

Datastore Size (in-domain)             770M    5.70M  18.3M  3.10M   450K       159M      -

Table 3: Domain adaptation using kNN-MT. The base MT system is trained on WMT'19 data, which is also treated as the in-domain data for newstest2019. kNN-MT improves the base model by an average of 9.2 BLEU, without training, to achieve the best reported results on this task.

Domain-specific datastores   Table 3 shows the base MT system's in-domain performance on newstest2019, as well as zero-shot transfer to five other domains. kNN-MT significantly outperforms the base MT system in all settings. For the multi-domains dataset, kNN-MT improves the base MT model performance by an average of 9.2 BLEU, with improvements as large as 16 BLEU on Law and 14.5 BLEU on Medical, all without any further training. We also provide scores from Aharoni & Goldberg (2020) for models trained on in-domain data, those trained on all domains jointly, and those trained using the best-performing data selection method proposed by the authors. We find that kNN-MT also outperforms the best reported average on the multi-domains dataset by 1.4 BLEU.

Out-of-domain and multi-domain datastores   Table 3 also shows performance for retrieving neighbors from 770M tokens of WMT'19 data that the model has been trained on. While the average BLEU for the multi-domain data is 1 point higher, the improvements are much smaller compared to using in-domain data. This illustrates the value of adding domain-specific data to the datastore over adding a large amount of arbitrary data. We also measure the effectiveness of building a single multi-domain datastore containing parallel data from all six settings. Performance on IT improves by 3 BLEU but scores for the other domains are mostly the same. This shows that kNN-MT is robust to the presence of out-of-domain examples, since retrieving neighbors from a datastore where large amounts of data are out-of-domain does not hurt performance relative to using only in-domain data.

5 TUNING KNN-MT

We investigate how key hyperparameters affect multilingual kNN-MT on validation data. We provide validation set BLEU scores as well as hyperparameter choices for our experiments in Appendix A.

Softmax temperature   A softmax temperature is used when estimating the nearest neighbor distribution in order to prevent the model from assigning most of the probability mass to a single neighbor, thus hurting diversity. Values greater than 1 will flatten the distribution, which can improve kNN-MT performance. Figure 2 shows that a temperature of 1 results in significantly lower BLEU scores. For all of our experiments, values of either 10 or 100 prove to be optimal.

Number of neighbors per query   In our experiments, we fix the value of k, the number of neighbors retrieved per query, to 64. For a fixed temperature and interpolation parameter, we find that performance does not improve when retrieving a larger number of neighbors, and in some cases, performance deteriorates. This suggests that retrieving more neighbors can add noise to the sequence generation process. Figure 2 shows that in some cases, performance improves when retrieving fewer neighbors, and further gains may be possible by tuning this parameter.


[Figure 2: validation BLEU score for en-zh as a function of the number of neighbors k (2 to 128), with one curve per softmax temperature (1, 10, 100, 1,000) and the Base MT score as a reference. Caption follows.]

Figure 2: Effect of the number of neighbors retrieved and the softmax temperature on the validation BLEU score for en-zh. Temperatures greater than 1 are important to prevent the model from overfitting to the most similar neighbor. For higher temperatures, more neighbors do not always result in improvements.

[Figure 3: difference in validation BLEU score as a function of the fraction of the datastore used, for kNN-MT on en-zh (full datastore 1.13B tokens) and ru-en (3.8B tokens). Caption follows.]

Figure 3: Effect of datastore size on the validation BLEU score for ru-en and en-zh. Performance improves monotonically with size, but retrieval can be slow for datastores containing billions of tokens. Smaller datastores, which account for a large fraction of the improvement, can be used for faster retrieval.

Datastore size   Figure 3 shows that increasing the size of the datastore improves translation performance. However, larger datastores result in slower retrieval, indicating a speed-performance trade-off. Much of the benefit can be realized with much smaller, and correspondingly faster, datastores.

6 QUALITATIVE ANALYSIS

To better understand kNN-MT, we examine the retrievals for several examples. We use the German-English model and generate with only the kNN distribution (λ = 1) with beam size 1, retrieving k = 8 neighbors from the News Commentary and Common Crawl subsets of WMT'19 data.

Figure 4 shows an example from newstest2018 where all the retrieved neighbors map to the same target, military. Many of the retrieved examples include phrases similar to tamed the military, such as Autorität gegenüber dem Militär, Kontrolle des Militärs and das Militär gezwungen on the source side and authority over the military, control over the military and forced the military given the local target side context, but differ sharply in their longer context, often describing different nations and centuries. We provide additional examples illustrating this point in Appendix B.

Another interesting observation is that kNN, even when not interpolated with the base model, is able to reconstruct named entities that are split into multiple subword tokens, even if that particular named entity does not appear in the datastore. One such example is the name Haysom, which is split into the subwords Hay and som. The retrieved neighbors for the first subword token include examples that contain the names Hayes and Haydn, while those for the second include Grissom and Folsom, showing that subword representations are used effectively in the nearest neighbor search.

7 RELATED WORK

Retrieval in Translation   Recent work has integrated retrieval of words and phrases into neural translation, to gain some of the advantages of the previous generation of word- and phrase-based methods (Brown et al., 1993; Koehn et al., 2003). For example, Zhang et al. (2018) proposed guiding models by retrieving n-grams and up-weighting the probabilities of retrieved tokens. Tu et al. (2018) use cache-based models (Grave et al., 2017a;b) to save and retrieve translation histories, so models can adapt to changing contexts. Compared to these, kNN-MT has several advantages: for instance, the external datastore only needs to be created once, whereas the cache model requires constant writes. Further, kNN-MT scales retrieval to orders-of-magnitude larger datastores, while taking advantage of neural context representations.

Other work has retrieved complete example translation sentences at test time. Nagao (1984) proposed example-based MT for translating sequences by analogy.


Test Input: Dabei schien es, als habe Erdogan das Militär gezähmt.
Generated tokens: In doing so, it seems as if Erdogan has tamed the

Training Set Translation Context (source and target) | Training Set Target | Context Probability

Dem charismatischen Ministerpräsidenten Recep Tayyip Erdogan, der drei aufeinanderfolgende Wahlen für sich entscheiden konnte, ist es gelungen seine Autorität gegenüber dem Militär geltend zu machen. / The charismatic prime minister, Recep Tayyip Erdogan, having won three consecutive elections, has been able to exert his authority over the | military | 0.132

Ein bemerkenswerter Fall war die Ermordung des gemäßigten Premierministers Inukai Tsuyoshi im Jahre 1932, die das Ende jeder wirklichen zivilen Kontrolle des Militärs markiert. / One notable case was the assassination of moderate Prime Minister Inukai Tsuyoshi in 1932, which marked the end of any real civilian control of the | military | 0.130

Sie sind Teil eines Normalisierungsprozesses und der Herstellung der absoluten zivilen Kontrolle über das Militär und bestätigen das Prinzip, dass niemand über dem Gesetz steht. / They are part of a process of normalization, of the establishment of absolute civilian control of the | military | 0.129

Diese hart formulierte Erklärung wurde als verschleierte, jedoch unmissverständliche Warnung angesehen, dass das Militär bereit wäre einzuschreiten... / That toughly worded statement was seen as a veiled but unmistakable warning that the | military | 0.123

... | ... | ...

Final kNN distribution: military = 1.0
Final Translation: In doing so, Erdogan seemed to have tamed the military.
Reference: In doing so, it seems as if Erdogan has tamed the military.

Figure 4: Example retrievals using kNN-MT. Not only do the retrievals all correctly predict the target word military, but the local contexts tend to be semantically related. Both the source and the three nearest retrievals express the concept of control over the military.

Before deep learning was widely adopted, this approach was extended to identifying portions of the training data that could be a translation based on edit distance (Doi et al., 2005), matching training examples based on local trigram contexts (van den Bosch et al., 2007), using phrase-based memories (van Gompel et al., 2010) and incorporating syntactic features when retrieving similar examples (Stroppa et al., 2007; Haque et al., 2009). Recently, Gu et al. (2018) proposed a model that retrieves examples similar to the test source sequence and then attends over this subset of retrieved source-target pairs at the token level, while generating translations. Bulte & Tezcan (2019) and Xu et al. (2020) use fuzzy-matching with translation memories and augment source sequences with retrieved source-target pairs. These techniques face challenges in identifying relevant retrieval candidates, as they focus on sentence-level retrieval. In contrast, kNN-MT focuses on token-level retrieval from billions of key-value pairs, meaning that each word can retrieve the most relevant examples for its translation.

Finally, various studies have explored retrieving additional information to improve domain adaptation, often using lexicons (Hu et al., 2019), domain-adaptive training (Farajian et al., 2017) or attending over neighbors similar to n-grams in the source (Bapna & Firat, 2019). These modifications require additional training, whereas kNN-MT provides the flexibility to use different datastores when decoding in different domains, keeping the model fixed.

Retrieval in Text Generation   Retrieval mechanisms have also been applied to generation tasks more broadly. Weston et al. (2018) and Fan et al. (2020) improve dialogue response generation systems by retrieving examples and concatenating them to model inputs.


Lewis et al. (2020) improve open-domain question answering systems by retrieving relevant contexts from Wikipedia and concatenating them to the inputs. Hashimoto et al. (2018) use a retrieve-and-edit framework to generate structured outputs such as code, by jointly training the editor and retriever. For kNN-MT, retrieval results in a distribution over the vocabulary that is used for generation directly, and does not require further training or providing the retrieval candidates as input.

8 CONCLUSION

We introduced a simple and effective method that can be applied to any neural MT model without further training. We show that similar contexts in a model's embedding space are more likely to be followed by similar next words, allowing the model to be improved by interpolation with a nearest neighbor classifier. The approach improves a state-of-the-art model in-domain, leads to large gains out-of-domain, and can specialize a multilingual model for specific language-pairs. Future work should improve efficiency, for example by down-sampling frequent target words in the datastore.

ACKNOWLEDGMENTS

The authors thank Kartikay Khandelwal for thoughtful discussions, Holger Schwenk and Sergey Edunov for sharing data and model details, and Matthijs Douze for answering questions about FAISS.

REFERENCES

Roee Aharoni and Yoav Goldberg. Unsupervised domain clusters in pretrained language models. In Association of Computational Linguistics (ACL), 2020.

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019, 2019.

Ankur Bapna and Orhan Firat. Non-parametric adaptation for neural machine translation. In North American Association of Computational Linguistics (NAACL), 2019.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pp. 1–61, Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-5301. URL https://www.aclweb.org/anthology/W19-5301.

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pp. 272–303, Belgium, Brussels, October 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-6401. URL https://www.aclweb.org/anthology/W18-6401.

Peter F Brown, Stephen A Della Pietra, Vincent J Della Pietra, and Robert L Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, 1993.

Bram Bulte and Arda Tezcan. Neural fuzzy repair: Integrating fuzzy matches into neural machine translation. In Association of Computational Linguistics (ACL), 2019.

Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky. With little power comes great responsibility. In Empirical Methods in Natural Language Processing (EMNLP), 2020.

Trevor Cohn and Mirella Lapata. Machine translation by triangulation: Making effective use of multi-parallel corpora. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 728–735, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/P07-1092.


Takao Doi, Hirofumi Yamamoto, and Eiichiro Sumita. Example-based machine translation using efficient sentence retrieval based on edit-distance. ACM Transactions on Asian Language Information Processing, 4(4):377–399, 2005.

Angela Fan, Claire Gardent, Chloé Braud, and Antoine Bordes. Augmenting transformers with knn-based composite memory for dialogue. arXiv preprint arXiv:2004.12744, 2020.

M Amin Farajian, Marco Turchi, Matteo Negri, and Marcello Federico. Multi-domain neural machine translation through unsupervised adaptation. In Proceedings of the Second Conference on Machine Translation, pp. 127–137, 2017.

Edouard Grave, Moustapha M Cisse, and Armand Joulin. Unbounded cache model for online language modeling with open vocabulary. In NIPS, pp. 6042–6052, 2017a.

Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. In International Conference on Learning Representations (ICLR), 2017b.

Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor OK Li. Search engine guided neural machine translation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning (ICML), 2020.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909, 2020.

Rejwanul Haque, Sudip Naskar, Yanjun Ma, and Andy Way. Using supertags as source language context in SMT. In 13th Annual Conference of the European Association for Machine Translation (EAMT), 2009.

Tatsunori B. Hashimoto, Kelvin Guu, Yonatan Oren, and Percy Liang. A retrieve-and-edit framework for predicting structured outputs. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Junjie Hu, Mengzhou Xia, Graham Neubig, and Jaime G Carbonell. Domain adaptation of neural machine translation by lexicon induction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2989–3001, 2019.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734, 2017.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations (ICLR), 2020.

Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation. In First Workshop on Neural Machine Translation (WNMT), 2017.

Philipp Koehn, Franz J Och, and Daniel Marcu. Statistical phrase-based translation. Technical report, University of Southern California, Marina del Rey, Information Sciences Institute, 2003.

Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP), 2018.

Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv preprint arXiv:2005.11401, 2020.

Makoto Nagao. A framework of a mechanical translation between Japanese and English by analogy principle. In Artificial and Human Intelligence, 1984.


Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. Facebook FAIR's WMT19 news translation task submission. In Fourth Conference on Machine Translation (WMT), 2019.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In North American Association of Computational Linguistics: System Demonstrations (NAACL), 2019.

Matt Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186–191, Belgium, Brussels, October 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W18-6319.

Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. When and why are pre-trained word embeddings useful for neural machine translation? In North American Association of Computational Linguistics (NAACL), 2018.

Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, and Armand Joulin. CCMatrix: Mining billions of high-quality parallel sentences on the web. arXiv preprint arXiv:1911.04944, 2019.

Nicolas Stroppa, Antal van den Bosch, and Andy Way. Exploiting source similarity for SMT using context-informed features. In Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation, 2007.

Zhaopeng Tu, Yang Liu, Shuming Shi, and Tong Zhang. Learning to remember translation history with a continuous cache. In Transactions of the Association of Computational Linguistics (TACL), 2018.

Masao Utiyama and Hitoshi Isahara. A comparison of pivot methods for phrase-based statistical machine translation. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pp. 484–491, Rochester, New York, April 2007. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/N07-1061.

Antal van den Bosch, Nicolas Stroppa, and Andy Way. A memory-based classification approach to marker-based EBMT. In METIS-II Workshop on New Approaches to Machine Translation, 2007.

Maarten van Gompel, Antal van den Bosch, and Peter Berck. Extending memory-based machine translation to phrases. In Computational Linguistics in the Netherlands (CLIN), pp. 79–86, 2010.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), 2017.

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. CCNet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359, 2019.

Jason Weston, Emily Dinan, and Alexander H Miller. Retrieve and refine: Improved sequence generation models for dialogue. arXiv preprint arXiv:1808.04776, 2018.

Jitao Xu, Josep Crego, and Jean Senellart. Boosting neural machine translation with similar translations. In Association of Computational Linguistics (ACL), 2020.

Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. Improving massively multilingual neural machine translation and zero-shot translation. arXiv preprint arXiv:2004.11867, 2020.

Jingyi Zhang, Masao Utiyama, Eiichiro Sumita, Graham Neubig, and Satoshi Nakamura. Guiding neural machine translation with retrieved translation pieces. In North American Association of Computational Linguistics (NAACL), 2018.


                   de-en  ru-en  zh-en  ja-en  fi-en  lt-en  de-fr  de-cs  en-cs
Test set sizes     2,998  3,000  3,981  1,998  3,000  2,000  1,512  3,000  2,983

Base MT            38.21  30.51  21.93  14.44  21.50  28.68  28.46  21.97  20.66
+kNN-MT            40.48   32.6  24.49  15.62  22.11  29.41  29.74  23.01  21.79
Datastore Size     5.56B  3.80B  1.19B   360M   318M   168M  4.21B   696M   533M
Interpolation (λ)    0.6    0.5    0.4    0.4    0.2    0.2    0.6    0.4    0.3
Temperature (T)       10     10     10     10     10     10    100    100     10

                   en-de  en-ru  en-zh  en-ja  en-fi  en-lt  fr-de  cs-de   Avg.
Test set sizes     2,998  3,000  3,981  1,998  3,000  2,000  1,512  3,000      -

Base MT            39.07  26.00  32.72  16.31  16.02  21.11  25.16  24.16  25.11
+kNN-MT            42.22  29.52  37.96  18.28  17.22  22.84  26.39   24.5  26.95
Datastore Size     6.50B  4.23B  1.13B   433M   375M   204M  3.98B   689M      -
Interpolation (λ)    0.6    0.7    0.7    0.6    0.4    0.4    0.5    0.4      -
Temperature (T)      100     10    100     10     10     10     10    100      -

Table 4: Multilingual machine translation with kNN-MT on the validation set. We show the tuned interpolation parameter (λ) as well as the tuned softmax temperature (T) for each language pair.

                            Newstest2018  Medical    Law     IT  Koran  Subtitles   Avg.
Test set sizes                     2,000    2,000  2,000  2,000  2,000      2,000      -

Base MT                            48.07    39.94  45.78  35.78  16.30      29.74  33.51
+kNN-MT:
  in-domain datastore              48.57    53.12  61.58  42.41  19.67      32.28  41.81

Datastore Size (in-domain)          770M    5.70M  18.3M  3.10M   450K       159M      -
Interpolation (λ)                    0.4      0.8    0.8    0.7    0.8        0.7      -
Temperature (T)                      100       10     10     10    100         10      -

Table 5: Domain adaptation using kNN-MT on the multi-domains validation data and newstest2018. The base MT system is trained on WMT'19 data, which is also treated as the in-domain data for newstest2018. We present the interpolation (λ) and softmax temperature (T) hyperparameter choices for each domain.

A HYPERPARAMETER TUNING

In this section, we present validation set results as well as the hyperparameter choices for the multilingual machine translation and domain adaptation experiments. Only two hyperparameters have been tuned on the validation sets: the interpolation parameter λ and the softmax temperature T. The number of neighbors k has been fixed to 64, the number of clusters searched has been set to 32 and the beam size has been set to 5. For the number of clusters in the FAISS index, preliminary experiments showed that for larger datastores, while using more clusters does not hurt performance, it does significantly speed up the search process, since searching within the clusters is exhaustive. Hence, we use 131K clusters for the multilingual experiments.

Table 4 shows the validation set BLEU scores for the multilingual experiments as well as the hyperparameter choices, and Table 5 shows the same for the domain adaptation experiments using only the in-domain datastores. Values for the interpolation parameter lie between 0 and 1. We also note that for a fixed value of λ = 0.5, using kNN-MT either performs similarly to or improves the base MT model's performance, but never hurts, on validation sets across the 17 language pairs evaluated in Section 4.2. For the temperature, we find that values of either 10 or 100 are optimal for all of our experiments.


Test Input: Aber in Papua hat sich wenig verändert, und heute fühlen sich die Einheimischen betrogen.
Generated tokens: But not much has

Training Set Translation Context (source and target) | Training Set Target | Context Probability

Nach einem schwer umkämpften Wahlkampf, der deutlich über zwei Milliarden Dollar kostete, sieht es für viele Beobachter aus, als hätte sich in der amerikanischen Politik nicht viel geändert... / After a hard-fought election campaign, costing well in excess of $2 billion, it seems to many observers that not much has | changed | 0.143

Geändert freilich hat sich wenig: Schlecht gehandhabte Kriege... / But not much has | changed | 0.137

Kaum etwas hat sich verändert, außer dass es jetzt nicht mehr die Bewohner des Appartement-Gebäudes... / Not much has | changed | 0.130

Es ist zwar richtig, dass sich seit dem Ausbruch der globalen Finanzkrise vor über vier Jahren und der schon 2010 verabschiedeten Dodd-Frank-Finanzmarktreformen in den Vereinigten Staaten kaum etwas daran geändert hat... / True, while the global financial crisis erupted more than four years ago, and the Dodd-Frank financial reforms were adopted in the United States back in 2010, not much has | changed | 0.121

... | ... | ...

Final kNN distribution: changed = 1.0
Final Translation: But not much has changed in Papua, and locals feel betrayed today.
Reference: But precious little has changed in Papua, and today local people feel betrayed.

Figure 5: An example where kNN-MT retrieves the same target token, changed, across all the neighbors. It matches local contexts on both the source and target sides, but not the global context regarding Papua.

B ADDITIONAL EXAMPLES

Figure 5 further illustrates the behavior of kNN-MT using local contexts in both the source and target to retrieve nearest neighbors. Figure 6 shows a case where the model has very little target-side prior context and mainly relies on the source context to retrieve the best neighbors.

C BLEU SCORES

In this paper, all results use case-sensitive detokenized BLEU, measured using SACREBLEU (Post, 2018), with the following signatures:
General: BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.4.13
For Chinese: BLEU+case.mixed+numrefs.1+smooth.exp+tok.zh+version.1.4.13
For Japanese: BLEU+case.mixed+numrefs.1+smooth.exp+tok.ja-mecab-0.996-IPA+version.1.4.13
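For reference, a minimal sketch of computing such a score with the sacrebleu Python package is shown below; the hypothesis and reference strings are placeholders, and the tokenizer argument mirrors the signatures above ("13a" in general, "zh" for Chinese).

```python
import sacrebleu

hypotheses = ["The cat sat on the mat."]      # system outputs, one per segment
references = [["The cat sat on the mat."]]    # one reference stream
bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="13a")
print(bleu.score)                             # case-sensitive, detokenized BLEU
```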

For Table 1 and other experiments, we follow advice from Card et al. (2020) regarding the statistical power of machine translation experiments, given the improvements in BLEU scores and the size of the dataset. The authors present results for a single language pair and we verify that their assumptions hold for a couple of other language pairs. Specifically, we find that for Chinese-English P0 = 0.13 and b0 = 12, and for English-Chinese P0 = 0.07 and b0 = 16. This indicates that these experiments, with test sets containing about 2,000 examples, have close to 100% power, which was verified using the notebooks provided by Card et al. (2020). We refer the reader to the original paper for more details. More generally, experiments on datasets which contain about 2,000 examples, with improvements of about 1 BLEU or higher, are statistically powered.


Test Input: Wir werden das Beste tun, mit dem, was wir haben.
Generated tokens: We

Training Set Translation Context (source and target) | Training Set Target | Context Probability

Wir werden versuchen zu beweisen, dass die Vermutung falsch ist, dies zu tun, nur um ein Gegenbeispiel Leinwände, dass die Aussage falsch ist, zu finden. / We | will | 0.145

Allerdings, wenn man sich diese große Nation, wird diese Nation aussehen zu weit, und wir werden das tun, was wichtig und wertvoll, Identity Wiederherstellen der Nation. / However, if you look at this great nation, this nation will look too wide and we | will | 0.132

Wir werden alles tun, um die Dinge für die Anfänger sehr einfach zu machen, während wir es den Experten erlauben, Dinge zu verändern, falls sie wollen. / We | will | 0.127

"Wir werden ihre Fälle und die Fälle anderer politischer Gefangener vor die Gerichte bringen und das falsche Bild zerstören... / "We | are | 0.127

... | ... | ...

Final kNN distribution: will = 0.639, are = 0.238, intend = 0.123
Final Translation: We will do the best we can with what we have.
Reference: We'll do the best we can with what we got.

Figure 6: An example where the model has access to a very short amount of target-side context that is ambiguous. kNN-MT is able to rely on the source context to resolve this and generate the correct target token, will.
