
CamemBERT: a Tasty French Language Model

Louis Martin∗1,2,3  Benjamin Muller∗2,3  Pedro Javier Ortiz Suárez∗2,3

Yoann Dupont3  Laurent Romary2  Éric Villemonte de la Clergerie2  Djamé Seddah2  Benoît Sagot2

1Facebook AI Research, Paris, France   2Inria, Paris, France   3Sorbonne Université, Paris, France

[email protected]
{benjamin.muller, pedro.ortiz, laurent.romary, eric.de_la_clergerie, djame.seddah, benoit.sagot}@[email protected]

Abstract

Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models, in all languages except English, very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best performing model CamemBERT reaches or improves the state of the art in all four downstream tasks.

1 Introduction

Pretrained word representations have a long history in Natural Language Processing (NLP), from non-contextual (Brown et al., 1992; Ando and Zhang, 2005; Mikolov et al., 2013; Pennington et al., 2014) to contextual word embeddings (Peters et al., 2018; Akbik et al., 2018). Word representations are usually obtained by training language model architectures on large amounts of textual data and then fed as input to more complex task-specific architectures. More recently, these specialized architectures have been replaced altogether by large-scale pretrained language models, which are fine-tuned for each application considered. This shift has resulted in large improvements in performance over a wide range of tasks (Devlin et al., 2019; Radford et al., 2019; Liu et al., 2019; Raffel et al., 2019).

∗Equal contribution. Order determined alphabetically.

These transfer learning methods exhibit clear advantages over more traditional task-specific approaches. In particular, they can be trained in an unsupervised manner, thereby taking advantage of the information contained in large amounts of raw text. Yet they come with implementation challenges, namely the amount of data and computational resources needed for pretraining, which can reach hundreds of gigabytes of text and require hundreds of GPUs (Yang et al., 2019; Liu et al., 2019). This has limited the availability of these state-of-the-art models to the English language, at least in the monolingual setting. This is particularly inconvenient as it hinders their practical use in NLP systems. It also prevents us from investigating their language modelling capacity, for instance in the case of morphologically rich languages.

Although multilingual models give remarkable results, they are often larger, and their results, as we will observe for French, can lag behind their monolingual counterparts for high-resource languages.

In order to reproduce and validate results that have so far only been obtained for English, we take advantage of the newly available multilingual corpus OSCAR (Ortiz Suárez et al., 2019) to train a monolingual language model for French, dubbed CamemBERT. We also train alternative versions of CamemBERT on different smaller corpora with different levels of homogeneity in genre and style in order to assess the impact of these parameters on downstream task performance. CamemBERT uses the RoBERTa architecture (Liu et al., 2019), an improved variant of the high-performing and widely used BERT architecture (Devlin et al., 2019).

We evaluate our model on four different downstream tasks for French: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER) and natural language inference (NLI).


CamemBERT improves on the state of the art in all four tasks compared to previous monolingual and multilingual approaches, including mBERT, XLM and XLM-R, which confirms the effectiveness of large pretrained language models for French.

We make the following contributions:

• First release of a monolingual RoBERTa model for the French language, using the recently introduced large-scale open-source corpora from the OSCAR collection; we are the first outside the original BERT authors to release such a large model for a language other than English.1

• We achieve state-of-the-art results on four downstream tasks: POS tagging, dependency parsing, NER and NLI, confirming the effectiveness of BERT-based language models for French.

• We demonstrate that small and diverse training sets can achieve similar performance to large-scale corpora, by analysing the importance of the pretraining corpus in terms of size and domain.

2 Previous work

2.1 Contextual Language Models

From non-contextual to contextual word embeddings The first neural word vector representations were non-contextualized word embeddings, most notably word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and fastText (Mikolov et al., 2018), which were designed to be used as input to task-specific neural architectures. Contextualized word representations such as ELMo (Peters et al., 2018) and flair (Akbik et al., 2018) improved the representational power of word embeddings by taking context into account. Among other reasons, they improved the performance of models on many tasks by handling word polysemy. This paved the way for larger contextualized models that replaced downstream architectures altogether in most tasks. Trained with language modeling objectives, these approaches range from LSTM-based architectures (Dai and Le, 2015) to the successful Transformer-based architectures such as GPT2 (Radford et al., 2019), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and, more recently, ALBERT (Lan et al., 2019) and T5 (Raffel et al., 2019).

1 Released at https://camembert-model.fr under the MIT open-source license.

Non-English contextualized models Following the success of large pretrained language models, they were extended to the multilingual setting with multilingual BERT (hereafter mBERT) (Devlin et al., 2018), a single multilingual model for 104 different languages trained on Wikipedia data, and later XLM (Lample and Conneau, 2019), which significantly improved unsupervised machine translation. More recently, XLM-R (Conneau et al., 2019) extended XLM by training on 2.5TB of data and outperformed previous scores on multilingual benchmarks. They show that multilingual models can obtain results competitive with monolingual models by leveraging higher quality data from other languages on specific downstream tasks.

A few non-English monolingual models have been released: ELMo models for Japanese, Portuguese, German and Basque2 and BERT for Simplified and Traditional Chinese (Devlin et al., 2018) and German (Chan et al., 2019).

However, to the best of our knowledge, no particular effort has been made toward training models for languages other than English at a scale similar to the latest English models (e.g. RoBERTa trained on more than 100GB of data).

BERT and RoBERTa Our approach is based on RoBERTa (Liu et al., 2019), which itself is based on BERT (Devlin et al., 2019). BERT is a multi-layer bidirectional Transformer encoder trained with a masked language modeling (MLM) objective, inspired by the Cloze task (Taylor, 1953). It comes in two sizes: the BERTBASE architecture and the BERTLARGE architecture. The BERTBASE architecture is 3 times smaller and therefore faster and easier to use, while BERTLARGE achieves increased performance on downstream tasks. RoBERTa improves the original implementation of BERT by identifying key design choices for better performance, using dynamic masking, removing the next sentence prediction task, training with larger batches, on more data, and for longer.

3 Downstream evaluation tasks

In this section, we present the four downstream tasks that we use to evaluate CamemBERT, namely: Part-Of-Speech (POS) tagging, dependency parsing, Named Entity Recognition (NER) and Natural Language Inference (NLI). We also present the baselines that we will use for comparison.

2 https://allennlp.org/elmo


Tasks POS tagging is a low-level syntactic task, which consists in assigning to each word its corresponding grammatical category. Dependency parsing consists in predicting the labeled syntactic tree in order to capture the syntactic relations between words.

For both of these tasks we run our experiments using the Universal Dependencies (UD)3 framework and its corresponding UD POS tag set (Petrov et al., 2012) and UD treebank collection (Nivre et al., 2018), which was used for the CoNLL 2018 shared task (Seker et al., 2018). We perform our evaluations on the four freely available French UD treebanks in UD v2.2: GSD (McDonald et al., 2013), Sequoia4 (Candito and Seddah, 2012; Candito et al., 2014), Spoken (Lacheret et al., 2014; Bawden et al., 2014)5, and ParTUT (Sanguinetti and Bosco, 2015). A brief overview of the size and content of each treebank can be found in Table 1.

Treebank    #Tokens    #Sentences    Genres
GSD         389,363    16,342        Blogs, News, Reviews, Wiki
Sequoia      68,615     3,099        Medical, News, Non-fiction, Wiki
Spoken       34,972     2,786        Spoken
ParTUT       27,658     1,020        Legal, News, Wikis
FTB         350,930    27,658        News

Table 1: Statistics on the treebanks used in POS tagging, dependency parsing, and NER (FTB).

We also evaluate our model in NER, which is a sequence labeling task predicting which words refer to real-world objects, such as people, locations, artifacts and organisations. We use the French Treebank6 (FTB) (Abeillé et al., 2003) in its 2008 version introduced by Candito and Crabbé (2009) and with NER annotations by Sagot et al. (2012). The FTB contains more than 11 thousand entity mentions distributed among 7 different entity types. A brief overview of the FTB can also be found in Table 1.

Finally, we evaluate our model on NLI, using the French part of the XNLI dataset (Conneau et al., 2018). NLI consists in predicting whether a hypothesis sentence is entailed by, neutral with respect to, or in contradiction with a premise sentence.

3 https://universaldependencies.org
4 https://deep-sequoia.inria.fr
5 Speech transcript, uncased, that includes annotated disfluencies without punctuation.
6 This dataset has only been stored and used on Inria's servers after signing the research-only agreement.

The XNLI dataset is the extension of the Multi-Genre NLI (MultiNLI) corpus (Williams et al., 2018) to 15 languages, obtained by manually translating the validation and test sets into each of those languages. The English training set is machine translated for all languages other than English. The dataset is composed of 122k train, 2490 development and 5010 test examples for each language. As usual, NLI performance is evaluated using accuracy.

Baselines In dependency parsing and POS tagging we compare our model with:

• mBERT: the multilingual cased version of BERT (see Section 2.1). We fine-tune mBERT on each of the treebanks with an additional layer for POS tagging and dependency parsing, in the same conditions as our CamemBERT model.

• XLMMLM-TLM: a multilingual pretrained language model from Lample and Conneau (2019), which showed better performance than mBERT on NLI. We use the version available in Hugging Face's Transformers library (Wolf et al., 2019); like mBERT, we fine-tune it in the same conditions as our model.

• UDify (Kondratyuk, 2019): a multitask and multilingual model based on mBERT. UDify is trained simultaneously on 124 different UD treebanks, creating a single POS tagging and dependency parsing model that works across 75 different languages. We report the scores from the Kondratyuk (2019) paper.

• UDPipe Future (Straka, 2018): an LSTM-based model ranked 3rd in dependency parsing and 6th in POS tagging at the CoNLL 2018 shared task (Seker et al., 2018). We report the scores from the Kondratyuk (2019) paper.

• UDPipe Future + mBERT + Flair (Straka et al., 2019): the original UDPipe Future implementation using mBERT and Flair as feature-based contextualized word embeddings. We report the scores from the Straka et al. (2019) paper.

In French, no extensive work has been done on NER due to the limited availability of annotated corpora. Thus we compare our model with the only recent available baselines, set by Dupont (2017), who trained both CRF (Lafferty et al., 2001) and


BiLSTM-CRF (Lample et al., 2016) architectures on the FTB and enhanced them using heuristics and pretrained word embeddings. Additionally, as for POS tagging and dependency parsing, we compare our model to a fine-tuned version of mBERT for the NER task.

For XNLI, we provide the scores of mBERT reported for French by Wu and Dredze (2019). We report scores from XLMMLM-TLM (described above), the best model from Lample and Conneau (2019). We also report the results of XLM-R (Conneau et al., 2019).

4 CamemBERT: a French Language Model

In this section, we describe the pretraining data, architecture, training objective and optimisation setup we use for CamemBERT.

4.1 Training data

Pretrained language models benefit from being trained on large datasets (Devlin et al., 2018; Liu et al., 2019; Raffel et al., 2019). We therefore use the French part of the OSCAR corpus (Ortiz Suárez et al., 2019), a pre-filtered and pre-classified version of Common Crawl.7

OSCAR is a set of monolingual corpora extracted from Common Crawl snapshots. It follows the same approach as Grave et al. (2018) by using a language classification model based on the fastText linear classifier (Grave et al., 2017; Joulin et al., 2016), pretrained on Wikipedia, Tatoeba and SETimes, which supports 176 languages. No other filtering is done. We use a non-shuffled version of the French data, which amounts to 138GB of raw text and 32.7B tokens after subword tokenization.
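To make this concrete, below is a minimal sketch of fastText-based language identification in the spirit of the OSCAR pipeline. It uses fastText's publicly released 176-language model (lid.176.bin); the input/output file names and the 0.8 confidence threshold are illustrative assumptions, not the exact OSCAR settings.

```python
# Sketch of fastText-based language identification for Common Crawl lines.
# lid.176.bin is fastText's public 176-language classifier; the confidence
# threshold and file names are illustrative assumptions.
import fasttext

lid_model = fasttext.load_model("lid.176.bin")

def keep_french(line: str, threshold: float = 0.8) -> bool:
    """Return True if the classifier tags the line as French with enough confidence."""
    labels, probs = lid_model.predict(line.replace("\n", " "), k=1)
    return labels[0] == "__label__fr" and probs[0] >= threshold

with open("commoncrawl_lines.txt", encoding="utf-8") as src, \
     open("french_subset.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if keep_french(line.strip()):
            dst.write(line)
```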

4.2 Pre-processing

We segment the input text data into subword units using SentencePiece (Kudo and Richardson, 2018). SentencePiece is an extension of Byte-Pair Encoding (BPE) (Sennrich et al., 2016) and WordPiece (Kudo, 2018) that does not require pre-tokenization (at the word or token level), thus removing the need for language-specific tokenisers. We use a vocabulary size of 32k subword tokens. These subwords are learned on 10^7 sentences sampled randomly from the pretraining dataset. We do not use subword regularisation (i.e. sampling from multiple possible segmentations) for the sake of simplicity.
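A minimal sketch of this step is shown below: training a 32k SentencePiece model on sampled sentences and segmenting French text with it. The file names and any trainer options other than the 32k vocabulary size are assumptions, since the paper does not list them.

```python
# Minimal SentencePiece sketch: train a 32k subword model on sampled sentences
# and segment raw French text with it. File names and trainer options other
# than the 32k vocabulary are illustrative assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="sampled_sentences.txt",   # e.g. 10^7 sentences sampled from the corpus
    model_prefix="fr_subwords",
    vocab_size=32000,
)

sp = spm.SentencePieceProcessor(model_file="fr_subwords.model")
print(sp.encode("Le camembert est délicieux.", out_type=str))
# e.g. ['▁Le', '▁camembert', '▁est', '▁délicieux', '.'] (actual pieces depend on the training data)
```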

7 https://commoncrawl.org/about/

4.3 Language Modeling

Transformer Similar to RoBERTa and BERT, CamemBERT is a multi-layer bidirectional Transformer (Vaswani et al., 2017). Given the widespread usage of Transformers, we do not describe them here and refer the reader to Vaswani et al. (2017). CamemBERT uses the original architectures of BERTBASE (12 layers, 768 hidden dimensions, 12 attention heads, 110M parameters) and BERTLARGE (24 layers, 1024 hidden dimensions, 16 attention heads, 335M parameters). CamemBERT is very similar to RoBERTa, the main differences being the use of whole-word masking and the use of SentencePiece tokenization (Kudo and Richardson, 2018) instead of WordPiece (Schuster and Nakajima, 2012).
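For illustration, the sketch below instantiates RoBERTa-style encoders at the two sizes just described using the Hugging Face transformers library. The 32k vocabulary matches Section 4.2; the intermediate sizes and any other unlisted hyper-parameters are assumptions.

```python
# Sketch: RoBERTa-style encoders at the BASE and LARGE sizes described above.
# Intermediate sizes and other unlisted hyper-parameters are assumptions.
from transformers import RobertaConfig, RobertaModel

base_config = RobertaConfig(
    vocab_size=32000,
    hidden_size=768, num_hidden_layers=12, num_attention_heads=12,
    intermediate_size=3072,
)
large_config = RobertaConfig(
    vocab_size=32000,
    hidden_size=1024, num_hidden_layers=24, num_attention_heads=16,
    intermediate_size=4096,
)

base_model = RobertaModel(base_config)
large_model = RobertaModel(large_config)
print(sum(p.numel() for p in base_model.parameters()))   # on the order of 110M
print(sum(p.numel() for p in large_model.parameters()))  # on the order of 335M
```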

Pretraining Objective We train our model on the Masked Language Modeling (MLM) task. Given an input text sequence composed of N tokens x1, ..., xN, we select 15% of the tokens for possible replacement. Among those selected tokens, 80% are replaced with the special <MASK> token, 10% are left unchanged and 10% are replaced by a random token. The model is then trained to predict the initial masked tokens using a cross-entropy loss.
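The sketch below illustrates this corruption rule and the fact that the loss is computed only on the selected positions; it is a simplified illustration in PyTorch, not the fairseq implementation, and the token ids are placeholders.

```python
# Sketch of the MLM corruption and loss described above: select 15% of token
# positions, corrupt them with the 80/10/10 rule, and compute cross-entropy
# only on the selected positions. Token ids are placeholders.
import torch
import torch.nn.functional as F

def mlm_corrupt(input_ids, mask_id, vocab_size, select_prob=0.15):
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < select_prob
    labels[~selected] = -100                              # ignored by cross_entropy

    roll = torch.rand(input_ids.shape)
    corrupted = input_ids.clone()
    corrupted[selected & (roll < 0.8)] = mask_id          # 80% -> <MASK>
    random_ids = torch.randint(vocab_size, input_ids.shape)
    swap = selected & (roll >= 0.8) & (roll < 0.9)        # 10% -> random token
    corrupted[swap] = random_ids[swap]
    # the remaining 10% of selected positions are left unchanged
    return corrupted, labels

# With logits of shape (batch, seq_len, vocab_size) from the Transformer:
# loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
```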

Following the RoBERTa approach, we dynamically mask tokens instead of fixing them statically for the whole dataset during preprocessing. This improves variability and makes the model more robust when training for multiple epochs.

Since we use SentencePiece to tokenize our corpus, the input tokens to the model are a mix of whole words and subwords. An upgraded version of BERT8 and Joshi et al. (2019) have shown that masking whole words instead of individual subwords leads to improved performance. Whole-word Masking (WWM) makes the training task more difficult because the model has to predict a whole word rather than predicting only part of the word given the rest. We train our models using WWM by using whitespaces in the initial untokenized text as word delimiters.

WWM is implemented by first randomly sampling 15% of the words in the sequence and then considering all subword tokens in each of these 15% for candidate replacement. This amounts to a proportion of selected tokens that is close to the original 15%. These tokens are then either replaced by

8 https://github.com/google-research/bert/blob/master/README.md


<MASK> tokens (80%), left unchanged (10%) or replaced by a random token.
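A minimal sketch of the word-level candidate selection is given below: group subword pieces back into whitespace-delimited words (SentencePiece marks word starts with "▁"), sample 15% of the words, and return the positions of all their subwords. The returned positions would then be corrupted with the same 80/10/10 rule as above; the example pieces are illustrative.

```python
# Sketch of whole-word masking candidate selection over SentencePiece pieces.
import random

def wwm_candidates(pieces, word_prob=0.15, seed=None):
    rng = random.Random(seed)
    words = []                      # list of lists of piece indices, one per word
    for i, piece in enumerate(pieces):
        if piece.startswith("▁") or not words:
            words.append([i])       # a new whitespace-delimited word starts here
        else:
            words[-1].append(i)     # continuation subword of the current word
    selected = []
    for word in words:
        if rng.random() < word_prob:
            selected.extend(word)   # mask all subwords of the sampled word
    return selected

pieces = ["▁Le", "▁camembert", "▁est", "▁dé", "lici", "eux", "."]
print(wwm_candidates(pieces, seed=0))
```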

Subsequent work has shown that the next sentence prediction (NSP) task originally used in BERT does not improve downstream task performance (Lample and Conneau, 2019; Liu et al., 2019), thus we also remove it.

Optimisation Following Liu et al. (2019), we optimize the model using Adam (Kingma and Ba, 2014) (β1 = 0.9, β2 = 0.98) for 100k steps with large batch sizes of 8192 sequences, each sequence containing at most 512 tokens. We enforce each sequence to only contain complete paragraphs (which correspond to lines in our pretraining dataset).

Pretraining We use the RoBERTa implementation in the fairseq library (Ott et al., 2019). Our learning rate is warmed up for 10k steps up to a peak value of 0.0007 (instead of the original 0.0001, given our large batch size) and then fades to zero with polynomial decay. Unless otherwise specified, our models use the BASE architecture and are pretrained for 100k backpropagation steps on 256 Nvidia V100 GPUs (32GB each) for a day. We do not train our models for longer due to practical considerations, even though the performance still seemed to be increasing.
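The schedule just described can be sketched as a simple function of the step count, shown below. The decay exponent (1.0, i.e. linear decay) is an assumption, since the paper only says "polynomial decay".

```python
# Sketch of the learning-rate schedule: linear warm-up to the 0.0007 peak over
# 10k steps, then polynomial decay to zero at 100k steps. The exponent is an
# assumption (the paper does not state it).
def learning_rate(step, peak_lr=7e-4, warmup_steps=10_000,
                  total_steps=100_000, power=1.0):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining) ** power

for step in (0, 5_000, 10_000, 50_000, 100_000):
    print(step, learning_rate(step))
```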

4.4 Using CamemBERT for downstream tasks

We use the pretrained CamemBERT in two ways. In the first one, which we refer to as fine-tuning, we fine-tune the model on a specific task in an end-to-end manner. In the second one, referred to as feature-based embeddings or simply embeddings, we extract frozen contextual embedding vectors from CamemBERT. These two complementary approaches shed light on the quality of the pretrained hidden representations captured by CamemBERT.

Fine-tuning For each task, we append the relevant predictive layer on top of CamemBERT's architecture. Following the work done on BERT (Devlin et al., 2019), for sequence tagging and sequence labeling we append a linear layer that respectively takes as input the last hidden representation of the <s> special token and the last hidden representation of the first subword token of each word. For dependency parsing, we plug a bi-affine graph predictor head as inspired by Dozat and Manning (2017). We refer the reader to this article for more details on this module. We fine-tune on XNLI

by adding a classification head composed of one hidden layer with a non-linearity and one linear projection layer, with input dropout for both.
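To make the two kinds of prediction layers concrete, a PyTorch sketch is given below: a sentence-level head that reads the final hidden state at the <s> position (as for XNLI) and a tagging head that reads the final hidden state of the first subword of each word. The sizes, label counts and dropout rate are illustrative placeholders, not the values used in the paper.

```python
# Sketch of the task heads described above. Sizes and label counts are
# illustrative placeholders.
import torch
import torch.nn as nn

class SentenceClassificationHead(nn.Module):
    """Hidden layer + non-linearity + linear projection, with input dropout."""
    def __init__(self, hidden_size=768, num_labels=3, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state):            # (batch, seq_len, hidden)
        cls = self.dropout(last_hidden_state[:, 0])   # <s> is the first position
        return self.out(self.dropout(torch.tanh(self.dense(cls))))

class TaggingHead(nn.Module):
    """Linear layer over the first-subword representation of each word."""
    def __init__(self, hidden_size=768, num_tags=18):
        super().__init__()
        self.out = nn.Linear(hidden_size, num_tags)

    def forward(self, last_hidden_state, first_subword_index):
        # first_subword_index: (batch, n_words) positions of each word's first subword
        gathered = torch.stack([seq[idx] for seq, idx in
                                zip(last_hidden_state, first_subword_index)])
        return self.out(gathered)                      # (batch, n_words, num_tags)
```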

We fine-tune CamemBERT independently for each task and each dataset. We optimize the model using the Adam optimiser (Kingma and Ba, 2014) with a fixed learning rate. We run a grid search on a combination of learning rates and batch sizes. We select the best model on the validation set out of the first 30 epochs. For NLI we use the default hyper-parameters provided by the authors of RoBERTa on the MNLI task.9 Except for NLI, we do not apply any regularisation techniques such as weight decay, learning rate warm-up or discriminative fine-tuning, although this might have pushed the performance even further. We show that fine-tuning CamemBERT in a straightforward manner leads to state-of-the-art results on all tasks and outperforms the existing BERT-based models in all cases. The POS tagging, dependency parsing and NER experiments are run using Hugging Face's Transformers library extended to support CamemBERT and dependency parsing (Wolf et al., 2019). The NLI experiments use the fairseq library, following the RoBERTa implementation.
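A minimal sketch of this grid search is given below. The helper finetune_and_evaluate is hypothetical (standing in for a full fine-tuning run returning a validation score), and the grid values are illustrative, not the ones used in the paper.

```python
# Sketch of the grid search over learning rates and batch sizes, keeping the
# configuration with the best validation score. The helper and grid values
# are hypothetical placeholders.
def finetune_and_evaluate(lr, batch_size, max_epochs=30):
    """Hypothetical helper: fine-tune and return the best validation score."""
    return 0.0  # placeholder

best = {"score": float("-inf")}
for lr in (1e-5, 2e-5, 3e-5, 5e-5):
    for batch_size in (16, 32):
        score = finetune_and_evaluate(lr, batch_size)
        if score > best["score"]:
            best = {"score": score, "lr": lr, "batch_size": batch_size}
print(best)
```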

Embeddings Following Straková et al. (2019) and Straka et al. (2019) for mBERT and the English BERT, we make use of CamemBERT in a feature-based embeddings setting. In order to obtain a representation for a given token, we first compute the average of each subword's representations in the last four layers of the Transformer, and then average the resulting subword vectors.
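The sketch below illustrates this extraction with the transformers library, using the camembert-base checkpoint distributed through the Hugging Face hub: average each subword's hidden states over the last four layers, then average the subword vectors of each word. The example sentence is illustrative.

```python
# Sketch of feature-based embedding extraction: average the last four layers
# per subword, then average the subword vectors of each word.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModel.from_pretrained("camembert-base", output_hidden_states=True)
model.eval()

words = ["Le", "camembert", "est", "délicieux", "."]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**enc).hidden_states             # tuple of (1, seq_len, 768)
last_four = torch.stack(hidden_states[-4:]).mean(dim=0)[0]  # (seq_len, 768)

word_vectors = []
word_ids = enc.word_ids()                                    # subword -> word alignment
for w in range(len(words)):
    positions = [i for i, wid in enumerate(word_ids) if wid == w]
    word_vectors.append(last_four[positions].mean(dim=0))
print(torch.stack(word_vectors).shape)                       # (5, 768)
```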

We evaluate CamemBERT in the embeddings setting for POS tagging, dependency parsing and NER, using the open-source implementations of Straka et al. (2019) and Straková et al. (2019).10

5 Evaluation of CamemBERT

In this section, we measure the performance of our models by evaluating them on the four aforementioned tasks: POS tagging, dependency parsing, NER and NLI.

9 More details at https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.glue.md.

10 UDPipe Future is available at https://github.com/CoNLL-UD-2018/UDPipe-Future, and the code for nested NER is available at https://github.com/ufal/acl2019_nested_ner.


MODEL                                              GSD            SEQUOIA        SPOKEN         PARTUT
                                                 UPOS    LAS    UPOS    LAS    UPOS    LAS    UPOS    LAS
mBERT (fine-tuned)                               97.48  89.73   98.41  91.24   96.02  78.63   97.35  91.37
XLMMLM-TLM (fine-tuned)                          98.13  90.03   98.51  91.62   96.18  80.89   97.39  89.43
UDify (Kondratyuk, 2019)                         97.83  91.45   97.89  90.05   96.23  80.01   96.12  88.06
UDPipe Future (Straka, 2018)                     97.63  88.06   98.79  90.73   95.91  77.53   96.93  89.63
 + mBERT + Flair (emb.) (Straka et al., 2019)    97.98  90.31   99.32  93.81   97.23  81.40   97.64  92.47
CamemBERT (fine-tuned)                           98.18  92.57   99.29  94.20   96.99  81.37   97.65  93.43
UDPipe Future + CamemBERT (embeddings)           97.96  90.57   99.25  93.89   97.09  81.81   97.50  92.32

Table 2: POS and dependency parsing scores on 4 French treebanks, reported on test sets assuming gold tokenization and segmentation (best model selected on validation out of 4). Best scores in bold, second best underlined.

Model                                  F1
SEM (CRF) (Dupont, 2017)              85.02
LSTM-CRF (Dupont, 2017)               85.57
mBERT (fine-tuned)                    87.35
CamemBERT (fine-tuned)                89.08
LSTM+CRF+CamemBERT (embeddings)       89.55

Table 3: NER scores on the FTB (best model selected on validation out of 4). Best scores in bold, second best underlined.

Model                                        Acc.   #Params
mBERT (Devlin et al., 2019)                  76.9   175M
XLMMLM-TLM (Lample and Conneau, 2019)        80.2   250M
XLM-RBASE (Conneau et al., 2019)             80.1   270M
CamemBERT (fine-tuned)                       82.5   110M

Supplement: LARGE models
XLM-RLARGE (Conneau et al., 2019)            85.2   550M
CamemBERTLARGE (fine-tuned)                  85.7   335M

Table 4: NLI accuracy on the French XNLI test set (best model selected on validation out of 10). Best scores in bold, second best underlined.

POS tagging and dependency parsing For POS tagging and dependency parsing, we compare CamemBERT with other models in the two settings: fine-tuning and as feature-based embeddings. We report the results in Table 2.

CamemBERT reaches state-of-the-art scores on all treebanks and metrics in both scenarios. The two approaches achieve similar scores, with a slight advantage for the fine-tuned version of CamemBERT, thus questioning the need for complex task-specific architectures such as UDPipe Future.

Despite a much simpler optimisation process and no task-specific architecture, fine-tuning CamemBERT outperforms UDify on all treebanks, sometimes by a large margin (e.g. +4.15% LAS on Sequoia and +5.37% LAS on ParTUT). CamemBERT also reaches better performance than other multilingual pretrained models such as mBERT and XLMMLM-TLM on all treebanks.

CamemBERT achieves overall slightly better results than the previous state-of-the-art, task-specific architecture UDPipe Future + mBERT + Flair, except for POS tagging on Sequoia and POS tagging on Spoken, where CamemBERT lags by 0.03% and 0.14% UPOS respectively. UDPipe Future + mBERT + Flair uses the contextualized string embeddings Flair (Akbik et al., 2018), which are in fact pretrained contextualized character-level word embeddings specifically designed to handle misspelled words as well as subword structures such as prefixes and suffixes. This design choice might explain the difference in score for POS tagging with CamemBERT, especially for the Spoken treebank, where words are not capitalized: a factor that might pose a problem for CamemBERT, which was trained on capitalized data, but that might be properly handled by Flair in the UDPipe Future + mBERT + Flair model.

Named-Entity Recognition For NER, we similarly evaluate CamemBERT in the fine-tuning setting and as input embeddings to the task-specific LSTM+CRF architecture. We report these scores in Table 3.

In both scenarios, CamemBERT achieves higher F1 scores than the traditional CRF-based architectures, both non-neural and neural, and than fine-tuned multilingual BERT models.11

Using CamemBERT as embeddings for the traditional LSTM+CRF architecture gives slightly higher scores than fine-tuning the model (89.55 vs. 89.08). This demonstrates that although CamemBERT can be used successfully without any task-specific architecture, it can still produce high-quality contextualized embeddings that might be useful in scenarios where powerful downstream architectures exist.

11 XLMMLM-TLM is a lower-case model. Case is crucial for NER, therefore we do not report its low performance (84.37%).


Natural Language Inference On the XNLI benchmark, we compare CamemBERT to previous state-of-the-art multilingual models in the fine-tuning setting. In addition to the standard CamemBERT model with a BASE architecture, we train another model with the LARGE architecture, referred to as CamemBERTLARGE, for a fair comparison with XLM-RLARGE. This model is trained on the CCNet corpus, described in Section 6, for 100k steps.12 We expect that training the model for longer would yield even better performance.

CamemBERT reaches higher accuracy than its BASE counterparts: +5.6% over mBERT, +2.3% over XLMMLM-TLM, and +2.4% over XLM-RBASE. CamemBERT also uses fewer than half as many parameters (110M vs. 270M for XLM-RBASE).

CamemBERTLARGE achieves a state-of-the-art accuracy of 85.7% on the XNLI benchmark, as opposed to 85.2% for the recent XLM-RLARGE.

CamemBERT uses fewer parameters than multilingual models, mostly because of its smaller vocabulary size (e.g. 32k vs. 250k for XLM-R). Two elements might explain the better performance of CamemBERT over XLM-R. Even though XLM-R was trained on an impressive amount of data (2.5TB), only 57GB of this data is in French, whereas we used 138GB of French data. Additionally, XLM-R also handles 100 languages, and the authors show that when reducing the number of languages to 7, they can reach 82.5% accuracy for French XNLI with their BASE architecture.
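A rough back-of-the-envelope sketch of how the vocabulary size drives this parameter gap is given below; the 768 hidden size is the BASE value and the counts are approximate.

```python
# Back-of-the-envelope sketch: with a 768-dimensional BASE model, the token
# embedding matrix alone accounts for most of the parameter gap between a 32k
# and a 250k vocabulary. Counts are approximate.
hidden = 768
camembert_vocab, xlmr_vocab = 32_000, 250_000
print(f"CamemBERT embeddings: {camembert_vocab * hidden / 1e6:.1f}M")  # ~24.6M
print(f"XLM-R embeddings:     {xlmr_vocab * hidden / 1e6:.1f}M")       # ~192.0M
```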

Summary of CamemBERT's results CamemBERT improves the state of the art for the 4 downstream tasks considered, thereby confirming on French the usefulness of Transformer-based models. We obtain these results when CamemBERT is used as a fine-tuned model or as contextual embeddings with task-specific architectures. This questions the need for more complex downstream architectures, similar to what was shown for English (Devlin et al., 2019). Additionally, this suggests that CamemBERT is also able to produce high-quality representations out-of-the-box without further tuning.

12 We train our LARGE model on the CCNet corpus for practical reasons. Given that BASE models reach similar performance when using OSCAR or CCNet as pretraining corpus (Appendix Table 8), we expect an OSCAR LARGE model to reach comparable scores.

6 Impact of corpus origin and size

In this section we investigate the influence of the homogeneity and size of the pretraining corpus on downstream task performance. With this aim, we train alternative versions of CamemBERT by varying the pretraining datasets. For this experiment, we fix the number of pretraining steps to 100k and allow the number of epochs to vary accordingly (more epochs for smaller dataset sizes). All models use the BASE architecture.

In order to investigate the need for homogeneous clean data versus more diverse and possibly noisier data, we use alternative sources of pretraining data in addition to OSCAR:

• Wikipedia, which is homogeneous in terms of genre and style. We use the official 2019 French Wikipedia dumps.13 We remove HTML tags and tables using Giuseppe Attardi's WikiExtractor.14

• CCNet (Wenzek et al., 2019), a dataset extracted from Common Crawl with a different filtering process than for OSCAR. It was built using a language model trained on Wikipedia, in order to filter out bad quality texts such as code or tables (a sketch of this kind of perplexity filtering is given after this list).15 As this filtering step biases the noisy data from Common Crawl toward more Wikipedia-like text, we expect CCNet to act as a middle ground between the unfiltered "noisy" OSCAR dataset and the "clean" Wikipedia dataset. As a result of the different filtering processes, CCNet contains longer documents on average compared to OSCAR, with smaller, often noisier, documents weeded out.
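The sketch below illustrates the kind of perplexity filtering referred to in the CCNet bullet and in footnote 15: score each document with a language model trained on Wikipedia and keep the lowest-perplexity third. The KenLM model file, document format and scoring details are assumptions, not the actual CCNet pipeline.

```python
# Sketch of CCNet-style perplexity filtering (see footnote 15). The KenLM
# model file and the blank-line document separator are assumptions.
import kenlm

lm = kenlm.Model("fr_wikipedia.arpa")        # hypothetical Wikipedia n-gram LM

def doc_perplexity(doc: str) -> float:
    return lm.perplexity(doc.replace("\n", " "))

docs = open("commoncrawl_docs.txt", encoding="utf-8").read().split("\n\n")
scored = sorted(docs, key=doc_perplexity)    # lowest perplexity first
head = scored[: len(scored) // 3]            # "HEAD" split: top 33% by perplexity
```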

Table 6 summarizes statistics of these different corpora.

In order to make the comparison between these three sources of pretraining data, we randomly sample 4GB of text (at the document level) from OSCAR and CCNet, thereby creating samples of both Common-Crawl-based corpora of the same size as the French Wikipedia. These smaller 4GB samples also provide us with a way to investigate the impact of pretraining data size.
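A minimal sketch of this document-level sampling is shown below: shuffle documents and keep adding them until the UTF-8 size reaches 4GB. The blank-line document separator is an assumption about the corpus layout; for a corpus that does not fit in memory, one would sample document offsets instead.

```python
# Sketch of document-level sampling to build a 4GB sub-corpus.
import random

TARGET_BYTES = 4 * 1024 ** 3

def sample_documents(path, target_bytes=TARGET_BYTES, seed=0):
    with open(path, encoding="utf-8") as f:
        docs = f.read().split("\n\n")        # one document per blank-line-separated block
    random.Random(seed).shuffle(docs)
    sample, total = [], 0
    for doc in docs:
        size = len(doc.encode("utf-8"))
        if total + size > target_bytes:
            break
        sample.append(doc)
        total += size
    return sample

# subset = sample_documents("oscar_fr.txt")
```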

13 https://dumps.wikimedia.org/backup-index.html.

14 https://github.com/attardi/wikiextractor.

15 We use the HEAD split, which corresponds to the top 33% of documents in terms of filtering perplexity.


DATASET    SIZE      GSD           SEQUOIA       SPOKEN        PARTUT        AVERAGE       NER     NLI
                    UPOS   LAS    UPOS   LAS    UPOS   LAS    UPOS   LAS    UPOS   LAS    F1      ACC.

Fine-tuning
Wiki       4GB      98.28  93.04  98.74  92.71  96.61  79.61  96.20  89.67  97.45  88.75  89.86   78.32
CCNet      4GB      98.34  93.43  98.95  93.67  96.92  82.09  96.50  90.98  97.67  90.04  90.46   82.06
OSCAR      4GB      98.35  93.55  98.97  93.70  96.94  81.97  96.58  90.28  97.71  89.87  90.65   81.88
OSCAR      138GB    98.39  93.80  98.99  94.00  97.17  81.18  96.63  90.56  97.79  89.88  91.55   81.55

Embeddings (with UDPipe Future (tagging, parsing) or LSTM+CRF (NER))
Wiki       4GB      98.09  92.31  98.74  93.55  96.24  78.91  95.78  89.79  97.21  88.64  91.23   -
CCNet      4GB      98.22  92.93  99.12  94.65  97.17  82.61  96.74  89.95  97.81  90.04  92.30   -
OSCAR      4GB      98.21  92.77  99.12  94.92  97.20  82.47  96.74  90.05  97.82  90.05  91.90   -
OSCAR      138GB    98.18  92.77  99.14  94.24  97.26  82.44  96.52  89.89  97.77  89.84  91.83   -

Table 5: Results on the four tasks using language models pretrained on datasets of varying homogeneity and size, reported on validation sets (average of 4 runs for POS tagging, parsing and NER, average of 10 runs for NLI).

Corpus       Size     #tokens   #docs    Tokens/doc percentiles
                                          5%     50%     95%
Wikipedia    4GB      990M      1.4M      102    363     2530
CCNet        135GB    31.9B     33.1M     128    414     2869
OSCAR        138GB    32.7B     59.4M      28    201     1946

Table 6: Statistics on the pretraining datasets used.

Downstream task performance for our alternative versions of CamemBERT is provided in Table 5. The upper section reports scores in the fine-tuning setting, while the lower section reports scores in the embeddings setting.

6.1 Common Crawl vs. Wikipedia?

Table 5 clearly shows that models trained on the 4GB versions of OSCAR and CCNet (Common Crawl) perform consistently better than the one trained on the French Wikipedia. This is true both in the fine-tuning and embeddings settings. Unsurprisingly, the gap is larger on tasks involving texts whose genre and style are more divergent from those of Wikipedia, such as tagging and parsing on the Spoken treebank. The performance gap is also very large on the XNLI task, probably as a consequence of the larger diversity of Common-Crawl-based corpora in terms of genres and topics. XNLI is indeed based on MultiNLI, which covers a range of genres of spoken and written text.

The downstream task performances of the models trained on the 4GB versions of CCNet and OSCAR are much more similar.16

16 We provide the results of a model trained on the whole CCNet corpus in the Appendix. The conclusions are similar when comparing models trained on the full corpora: downstream results are similar when using OSCAR or CCNet.

6.2 How much data do you need?

An unexpected outcome of our experiments is that the model trained "only" on the 4GB sample of OSCAR performs similarly to the standard CamemBERT trained on the whole 138GB OSCAR. The only task with a large performance gap is NER, where "138GB" models are better by 0.9 F1 points. This could be due to the higher number of named entities present in the larger corpora, which is beneficial for this task. On the contrary, other tasks don't seem to gain from the additional data.

In other words, when trained on corpora such as OSCAR and CCNet, which are heterogeneous in terms of genre and style, 4GB of uncompressed text is a large enough pretraining corpus to reach state-of-the-art results with the BASE architecture, better than those obtained with mBERT (pretrained on 60GB of text).17 This calls into question the need to use a very large corpus such as OSCAR or CCNet when training a monolingual Transformer-based language model such as BERT or RoBERTa. Not only does this mean that the computational (and therefore environmental) cost of training a state-of-the-art language model can be reduced, but it also means that CamemBERT-like models can be trained for all languages for which a Common-Crawl-based corpus of 4GB or more can be created. OSCAR is available in 166 languages, and provides such a corpus for 38 languages. Moreover, it is possible that slightly smaller corpora (e.g. down to 1GB) could also prove sufficient to train high-performing language models. We obtained our results with BASE architectures. Further research is needed to confirm the validity of our findings on larger architectures and other, more complex, natural language understanding tasks.

17 The OSCAR-4GB model gets slightly better XNLI accuracy than the full OSCAR-138GB model (81.88 vs. 81.55). This might be due to the random seed used for pretraining, as each model is pretrained only once.


However, even with a BASE architecture and 4GB of training data, the validation loss is still decreasing beyond 100k steps (and 400 epochs). This suggests that we are still under-fitting the 4GB pretraining dataset, and that training longer might increase downstream performance.

7 Discussion

Since the pre-publication of this work (Martin et al., 2019), many monolingual language models have appeared, e.g. (Le et al., 2019; Virtanen et al., 2019; Delobelle et al., 2020), for as many as 30 languages (Nozza et al., 2020). In almost all tested configurations they displayed better results than multilingual language models such as mBERT (Pires et al., 2019). Interestingly, Le et al. (2019) showed that using their FlauBERT, a RoBERTa-based language model for French trained on less but more edited data, in conjunction with CamemBERT in an ensemble system could improve the performance of a parsing model and establish a new state of the art in constituency parsing of French, thus highlighting the complementarity of both models.18 As was the case for English when BERT was first released, the availability of similar-scale language models for French has enabled interesting applications, such as the large-scale anonymization of legal texts, where CamemBERT-based models established a new state of the art (Benesty, 2019), or the first large question answering experiments on a French SQuAD-style dataset released very recently (d'Hoffschmidt et al., 2020), where the authors matched human performance using CamemBERTLARGE. Being the first pretrained language model that used the open-source Common Crawl OSCAR corpus, and given its impact on the community, CamemBERT paved the way for many of the works on monolingual language models that followed. Furthermore, the availability of all its training data favors reproducibility and is a step towards better understanding such models. In that spirit, we make the models used in our experiments available via our website and via the huggingface and fairseq APIs, in addition to the base CamemBERT model.

8 Conclusion

In this work, we investigated the feasibility of training a Transformer-based language model for languages other than English.

18 We refer the reader to Le et al. (2019) for a comprehensive benchmark and details therein.

Using French as an example, we trained CamemBERT, a language model based on RoBERTa. We evaluated CamemBERT on four downstream tasks (part-of-speech tagging, dependency parsing, named entity recognition and natural language inference), on which our best model reached or improved the state of the art in all tasks considered, even when compared to strong multilingual models such as mBERT, XLM and XLM-R, while also having fewer parameters.

Our experiments demonstrate that using web crawled data with high variability is preferable to using Wikipedia-based data. In addition, we showed that our models could reach surprisingly high performance with as little as 4GB of pretraining data, thus questioning the need for large-scale pretraining corpora. This shows that state-of-the-art Transformer-based language models can be trained on languages with far fewer resources than English, whenever a few gigabytes of data are available. This paves the way for the rise of monolingual contextual pretrained language models for under-resourced languages. The question of whether pretraining on small, domain-specific content is a better option than transfer learning techniques such as fine-tuning remains open, and we leave it for future work.

Pretrained on purely open-source corpora, CamemBERT is freely available and distributed with the MIT license via popular NLP libraries (fairseq and huggingface) as well as on our website camembert-model.fr.

Acknowledgments

We want to thank Clémentine Fourrier for her proofreading and insightful comments, and Alix Chagué for her great logo. This work was partly funded by three French National funded projects granted to Inria and other partners by the Agence Nationale de la Recherche, namely projects PARSITI (ANR-16-CE33-0021), SoSweet (ANR-15-CE38-0011) and BASNUM (ANR-18-CE38-0003), as well as by the last author's chair in the PRAIRIE institute funded by the French national agency ANR as part of the "Investissements d'avenir" programme under the reference ANR-19-P3IA-0001.


References

Anne Abeillé, Lionel Clément, and François Toussenel. 2003. Building a Treebank for French, pages 165–187. Kluwer, Dordrecht.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 1638–1649. Association for Computational Linguistics.

Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res., 6:1817–1853.

Rachel Bawden, Marie-Amélie Botalla, Kim Gerdes, and Sylvain Kahane. 2014. Correcting and validating syntactic dependency in the spoken French treebank Rhapsodie. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2320–2325, Reykjavik, Iceland. European Language Resources Association (ELRA).

Michaël Benesty. 2019. NER algo benchmark: spaCy, Flair, m-BERT and CamemBERT on anonymizing French commercial legal cases.

Peter F. Brown, Vincent J. Della Pietra, Peter V. de Souza, Jennifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Marie Candito and Benoit Crabbé. 2009. Improving generative statistical parsing with semi-supervised word clustering. In Proc. of IWPT'09, Paris, France.

Marie Candito, Guy Perrier, Bruno Guillaume, Corentin Ribeyre, Karën Fort, Djamé Seddah, and Éric Villemonte de la Clergerie. 2014. Deep syntax annotation of the Sequoia French treebank. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, Reykjavik, Iceland, May 26-31, 2014, pages 2298–2305. European Language Resources Association (ELRA).

Marie Candito and Djamé Seddah. 2012. Le corpus Sequoia : annotation syntaxique et exploitation pour l'adaptation d'analyseur par pont lexical (The Sequoia corpus: syntactic annotation and use for a parser lexical domain adaptation method) [in French]. In Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN, Grenoble, France, June 4-8, 2012, pages 321–334.

Branden Chan, Timo Möller, Malte Pietsch, Tanay Soni, and Chin Man Yeung. 2019. German BERT. https://deepset.ai/german-bert.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. ArXiv preprint 1911.02116.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2475–2485. Association for Computational Linguistics.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 3079–3087.

Pieter Delobelle, Thomas Winters, and Bettina Berendt. 2020. RobBERT: a Dutch RoBERTa-based Language Model. ArXiv preprint 2001.06286.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Multilingual BERT. https://github.com/google-research/bert/blob/master/multilingual.md.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.


Martin d'Hoffschmidt, Maxime Vidal, Wacim Belblidia, and Tom Brendlé. 2020. FQuAD: French question answering dataset. ArXiv preprint 2002.06071.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Yoann Dupont. 2017. Exploration de traits pour la reconnaissance d'entités nommées du français par apprentissage automatique (Feature exploration for French named entity recognition with machine learning) [in French]. In 24e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), page 42.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA).

Edouard Grave, Tomas Mikolov, Armand Joulin, and Piotr Bojanowski. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pages 427–431. Association for Computational Linguistics.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In (Korhonen et al., 2019), pages 3651–3657.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by representing and predicting spans. CoRR, abs/1907.10529.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. ArXiv preprint 1612.03651.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. ArXiv preprint 1412.6980.

Daniel Kondratyuk. 2019. 75 languages, 1 model: Parsing universal dependencies universally. CoRR, abs/1904.02099.

Anna Korhonen, David R. Traum, and Lluís Màrquez, editors. 2019. Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers. Association for Computational Linguistics.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 66–75. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018, pages 66–71. Association for Computational Linguistics.

Anne Lacheret, Sylvain Kahane, Julie Beliao, Anne Dister, Kim Gerdes, Jean-Philippe Goldman, Nicolas Obin, Paola Pietrandrea, and Atanas Tchobanov. 2014. Rhapsodie: a prosodic-syntactic treebank for spoken French. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 295–301, Reykjavik, Iceland. European Language Resources Association (ELRA).

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001, pages 282–289. Morgan Kaufmann.


Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 260–270. The Association for Computational Linguistics.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. CoRR, abs/1901.07291.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv preprint 1909.11942.

Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, and Didier Schwab. 2019. FlauBERT: Unsupervised language model pre-training for French. ArXiv preprint 1912.05372.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv preprint 1907.11692.

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. 2019. CamemBERT: a Tasty French Language Model. ArXiv preprint 1911.03894.

Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló, and Jungmee Lee. 2013. Universal dependency annotation for multilingual parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 92–97, Sofia, Bulgaria. Association for Computational Linguistics.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 3111–3119.

Joakim Nivre, Mitchell Abrams, Željko Agic,Lars Ahrenberg, Lene Antonsen, Maria Je-sus Aranzabe, Gashaw Arutie, Masayuki Asa-hara, Luma Ateyah, Mohammed Attia, AitziberAtutxa, Liesbeth Augustinus, Elena Badmaeva,Miguel Ballesteros, Esha Banerjee, SebastianBank, Verginica Barbu Mititelu, John Bauer,Sandra Bellato, Kepa Bengoetxea, Riyaz Ah-mad Bhat, Erica Biagetti, Eckhard Bick, Ro-gier Blokland, Victoria Bobicev, Carl Börstell,Cristina Bosco, Gosse Bouma, Sam Bowman,Adriane Boyd, Aljoscha Burchardt, Marie Can-dito, Bernard Caron, Gauthier Caron, Gülsen Ce-biroglu Eryigit, Giuseppe G. A. Celano, SavasCetin, Fabricio Chalub, Jinho Choi, YongseokCho, Jayeol Chun, Silvie Cinková, Aurélie Col-lomb, Çagrı Çöltekin, Miriam Connor, MarineCourtin, Elizabeth Davidson, Marie-Catherinede Marneffe, Valeria de Paiva, Arantza Diaz deIlarraza, Carly Dickerson, Peter Dirix, KajaDobrovoljc, Timothy Dozat, Kira Droganova,Puneet Dwivedi, Marhaba Eli, Ali Elkahky,Binyam Ephrem, Tomaž Erjavec, Aline Eti-enne, Richárd Farkas, Hector Fernandez Al-calde, Jennifer Foster, Cláudia Freitas, KatarínaGajdošová, Daniel Galbraith, Marcos Garcia,Moa Gärdenfors, Kim Gerdes, Filip Ginter,Iakes Goenaga, Koldo Gojenola, Memduh Gökır-mak, Yoav Goldberg, Xavier Gómez Guino-vart, Berta Gonzáles Saavedra, Matias Gri-oni, Normunds Gruzıtis, Bruno Guillaume,Céline Guillot-Barbance, Nizar Habash, JanHajic, Jan Hajic jr., Linh Hà My, Na-RaeHan, Kim Harris, Dag Haug, Barbora Hladká,Jaroslava Hlavácová, Florinel Hociung, PetterHohle, Jena Hwang, Radu Ion, Elena Irimia,


Tomáš Jelínek, Anders Johannsen, Fredrik Jør-gensen, Hüner Kasıkara, Sylvain Kahane, Hi-roshi Kanayama, Jenna Kanerva, Tolga Kayade-len, Václava Kettnerová, Jesse Kirchner, Na-talia Kotsyba, Simon Krek, Sookyoung Kwak,Veronika Laippala, Lorenzo Lambertino, TatianaLando, Septina Dian Larasati, Alexei Lavren-tiev, John Lee, Phương Lê Hông, Alessan-dro Lenci, Saran Lertpradit, Herman Leung,Cheuk Ying Li, Josie Li, Keying Li, Kyung-Tae Lim, Nikola Ljubešic, Olga Loginova, OlgaLyashevskaya, Teresa Lynn, Vivien Macketanz,Aibek Makazhanov, Michael Mandl, Christo-pher Manning, Ruli Manurung, Catalina Maran-duc, David Marecek, Katrin Marheinecke, Héc-tor Martínez Alonso, André Martins, Jan Mašek,Yuji Matsumoto, Ryan McDonald, Gustavo Men-donça, Niko Miekka, Anna Missilä, Catalin Mi-titelu, Yusuke Miyao, Simonetta Montemagni,Amir More, Laura Moreno Romero, ShinsukeMori, Bjartur Mortensen, Bohdan Moskalevskyi,Kadri Muischnek, Yugo Murawaki, KailiMüürisep, Pinkey Nainwani, Juan IgnacioNavarro Horñiacek, Anna Nedoluzhko, GuntaNešpore-Berzkalne, Lương Nguyên Thi., HuyênNguyên Thi. Minh, Vitaly Nikolaev, RattimaNitisaroj, Hanna Nurmi, Stina Ojala, Adédayò.Olúòkun, Mai Omura, Petya Osenova, RobertÖstling, Lilja Øvrelid, Niko Partanen, ElenaPascual, Marco Passarotti, Agnieszka Patejuk,Siyao Peng, Cenel-Augusto Perez, Guy Per-rier, Slav Petrov, Jussi Piitulainen, Emily Pitler,Barbara Plank, Thierry Poibeau, Martin Popel,Lauma Pretkalnin, a, Sophie Prévost, ProkopisProkopidis, Adam Przepiórkowski, Tiina Puo-lakainen, Sampo Pyysalo, Andriela Rääbis,Alexandre Rademaker, Loganathan Ramasamy,Taraka Rama, Carlos Ramisch, Vinit Ravis-hankar, Livy Real, Siva Reddy, Georg Rehm,Michael Rießler, Larissa Rinaldi, Laura Rituma,Luisa Rocha, Mykhailo Romanenko, RudolfRosa, Davide Rovati, Valentin Ros, ca, Olga Rud-ina, Shoval Sadde, Shadi Saleh, Tanja Samardžic,Stephanie Samson, Manuela Sanguinetti, BaibaSaulıte, Yanin Sawanakunanon, Nathan Schnei-der, Sebastian Schuster, Djamé Seddah, Wolf-gang Seeker, Mojgan Seraji, Mo Shen, AtsukoShimada, Muh Shohibussirri, Dmitry Sichinava,Natalia Silveira, Maria Simi, Radu Simionescu,Katalin Simkó, Mária Šimková, Kiril Simov,Aaron Smith, Isabela Soares-Bastos, Antonio

Stella, Milan Straka, Jana Strnadová, AlaneSuhr, Umut Sulubacak, Zsolt Szántó, Dima Taji,Yuta Takahashi, Takaaki Tanaka, Isabelle Tellier,Trond Trosterud, Anna Trukhina, Reut Tsarfaty,Francis Tyers, Sumire Uematsu, Zdenka Ure-šová, Larraitz Uria, Hans Uszkoreit, SowmyaVajjala, Daniel van Niekerk, Gertjan van No-ord, Viktor Varga, Veronika Vincze, Lars Wallin,Jonathan North Washington, Seyi Williams,Mats Wirén, Tsegay Woldemariam, Tak-sumWong, Chunxiao Yan, Marat M. Yavrumyan,Zhuoran Yu, Zdenek Žabokrtský, Amir Zeldes,Daniel Zeman, Manying Zhang, and HanzhiZhu. 2018. Universal dependencies 2.2. LIN-DAT/CLARIN digital library at the Institute ofFormal and Applied Linguistics (ÚFAL), Facultyof Mathematics and Physics, Charles University.

Debora Nozza, Federico Bianchi, and Dirk Hovy. 2020. What the [MASK]? Making sense of language-specific BERT models. CoRR, abs/2003.02912.

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. Challenges in the Management of Large Corpora (CMLC-7) 2019, page 9.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Demonstrations, pages 48–53. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543. ACL.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.

Slav Petrov, Dipanjan Das, and Ryan T. McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, May 23-25, 2012, pages 2089–2096. European Language Resources Association (ELRA).

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? arXiv:1906.01502.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv preprint 1910.10683.

Benoît Sagot, Marion Richard, and Rosa Stern. 2012. Annotation référentielle du corpus arboré de Paris 7 en entités nommées (Referential named entity annotation of the Paris 7 French treebank) [in French]. In Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN, Grenoble, France, June 4-8, 2012, pages 535–542. ATALA/AFCP.

Manuela Sanguinetti and Cristina Bosco. 2015. PartTUT: The Turin University Parallel Treebank. In Roberto Basili, Cristina Bosco, Rodolfo Delmonte, Alessandro Moschitti, and Maria Simi, editors, Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project, volume 589 of Studies in Computational Intelligence, pages 51–69. Springer.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152. IEEE.

Amit Seker, Amir More, and Reut Tsarfaty. 2018. Universal morpho-syntactic parsing and the contribution of lexica: Analyzing the ONLP lab submission to the CoNLL 2018 shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 208–215.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.

Milan Straka. 2018. UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 197–207.

Milan Straka, Jana Straková, and Jan Hajič. 2019. Evaluating contextualized embeddings on 54 languages in POS tagging, lemmatization and dependency parsing. ArXiv preprint 1908.07448.

Jana Straková, Milan Straka, and Jan Hajič. 2019. Neural architectures for nested NER through linearization. In (Korhonen et al., 2019), pages 5326–5331.

Wilson L. Taylor. 1953. “Cloze procedure”: A new tool for measuring readability. Journalism Bulletin, 30(4):415–433.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008.

Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. ArXiv preprint 1912.07076.

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2019. CCNet: Extracting high quality monolingual datasets from web crawl data. ArXiv preprint 1911.00359.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 1112–1122.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv preprint 1910.03771.

Shijie Wu and Mark Dredze. 2019. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. CoRR, abs/1904.09077.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.

Appendix

In the appendix, we analyse different design choices of CamemBERT (Table 8), namely with respect to the use of whole-word masking, the training dataset, the model size, and the number of training steps, in complement with the analyses of the impact of corpus origin and size (Section 6). In all the ablations, all scores come from at least 4 averaged runs. For POS tagging and dependency parsing, we average the scores on the 4 treebanks. We also report all averaged test scores of our different models in Table 7.
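For concreteness, the aggregation behind every appendix number can be written as a short sketch. The helper below is purely illustrative (it is not code from our experiments), and the per-seed scores in it are placeholder values, not actual results.

# Each reported score is a mean over >= 4 fine-tuning runs; POS tagging and
# parsing scores are additionally averaged over the 4 treebanks.
from statistics import mean

def aggregate(scores_per_treebank):
    """Average per-seed scores within each treebank, then across treebanks."""
    return mean(mean(runs) for runs in scores_per_treebank.values())

las_runs = {  # hypothetical per-seed LAS scores, one list per treebank
    "GSD":     [92.3, 92.4, 92.2, 92.5],
    "Sequoia": [94.2, 94.3, 94.1, 94.4],
    "Spoken":  [80.8, 80.9, 80.7, 81.0],
    "ParTUT":  [92.3, 92.4, 92.2, 92.5],
}
print(f"averaged LAS: {aggregate(las_runs):.2f}")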

A Impact of Whole-Word Masking

In Table 8, we compare models trained using traditional subword masking with models trained using whole-word masking. Whole-word masking positively impacts downstream performance on NLI, although only by 0.5 points of accuracy. To our surprise, this whole-word masking scheme does not benefit lower-level tasks such as named entity recognition, POS tagging and dependency parsing much.
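The difference between the two masking schemes can be illustrated with a minimal sketch, assuming SentencePiece-style tokens where a leading "▁" marks the start of a word. This is a simplification for illustration only, not the fairseq implementation used for pretraining (in particular, it omits the 80/10/10 replacement scheme).

import random

def word_spans(pieces):
    """Group subword piece indices into whole-word spans."""
    spans = []
    for i, p in enumerate(pieces):
        if p.startswith("▁") or not spans:
            spans.append([i])
        else:
            spans[-1].append(i)
    return spans

def mask(pieces, whole_word=True, ratio=0.15, seed=0):
    rng = random.Random(seed)
    out = list(pieces)
    # Subword masking samples pieces independently; whole-word masking samples
    # whole words and masks all of their pieces together.
    units = word_spans(pieces) if whole_word else [[i] for i in range(len(pieces))]
    for unit in units:
        if rng.random() < ratio:
            for i in unit:
                out[i] = "<mask>"
    return out

pieces = ["▁le", "▁fro", "mage", "▁est", "▁déli", "cieux"]
print(mask(pieces, whole_word=False, seed=1))  # pieces masked independently
print(mask(pieces, whole_word=True, seed=1))   # all pieces of a word masked together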

B Impact of model size

Table 8 compares models trained with the BASE and LARGE architectures. These models were trained on the CCNet corpus (135GB) for practical reasons. We confirm the positive influence of larger models on the NLI and NER tasks: the LARGE architecture leads to error reductions of 19.7% on NLI and 23.7% on NER respectively. To our surprise, on POS tagging and dependency parsing, having three times more parameters does not lead to a significant difference compared to the BASE model. Tenney et al. (2019) and Jawahar et al. (2019) have shown that low-level syntactic capabilities are learnt in the lower layers of BERT, while higher-level semantic representations are found in its upper layers. POS tagging and dependency parsing probably do not benefit from adding more layers, as the lower layers of the BASE architecture already capture what is necessary to complete these tasks.
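The two architectures follow the usual BERT/RoBERTa BASE and LARGE configurations; the exact hyperparameters below are an assumption based on those standard settings (Table 8 itself only reports roughly 110M vs. 335M parameters), and the parameter count is a crude estimate.

CONFIGS = {
    "BASE":  dict(layers=12, hidden=768,  heads=12, ffn=3072),
    "LARGE": dict(layers=24, hidden=1024, heads=16, ffn=4096),
}

def approx_params(cfg, vocab=32_000, max_positions=512):
    # Embedding matrix plus, per layer, the attention projections (4 * h^2)
    # and the two feed-forward matrices (2 * h * ffn); biases and layer norms
    # are ignored, so this is only a rough estimate.
    h, ffn, layers = cfg["hidden"], cfg["ffn"], cfg["layers"]
    embeddings = (vocab + max_positions) * h
    per_layer = 4 * h * h + 2 * h * ffn
    return embeddings + layers * per_layer

for name, cfg in CONFIGS.items():
    print(name, f"~{approx_params(cfg) / 1e6:.0f}M parameters")
# Prints roughly 110M for BASE and 335M for LARGE, consistent with Table 8.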

C Impact of training dataset

Table 8 compares models trained on CCNet and on OSCAR. The major difference between the two datasets is the additional filtering step of CCNet, which favors Wikipedia-like texts. The model pretrained on OSCAR obtains slightly better results on POS tagging and dependency parsing, and a larger improvement of +1.31 F1 on NER. The CCNet model obtains better performance on NLI (+0.67 accuracy).

[Figure 1: Impact of the number of pretraining steps on downstream performance for CamemBERT. Parsing, NER and NLI scores (left axis, 60–100) and masked language modelling perplexity (right axis, 4.5–8.0) are plotted against the number of pretraining steps (0–100k).]
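As a rough illustration of the filtering step mentioned in the dataset comparison above, CCNet scores documents with a language model trained on Wikipedia and keeps the documents that this model finds most predictable. The sketch below is schematic only: it is not the actual CCNet pipeline, and wiki_lm_perplexity and the threshold are hypothetical stand-ins for its KenLM-based scorer and perplexity buckets.

from typing import Callable, Iterable, Iterator

def filter_wikipedia_like(
    docs: Iterable[str],
    wiki_lm_perplexity: Callable[[str], float],  # hypothetical scorer
    threshold: float = 1000.0,                   # hypothetical cut-off
) -> Iterator[str]:
    # Keep only documents that a Wikipedia-trained LM finds "Wikipedia-like",
    # i.e. documents with low enough perplexity under that model.
    for doc in docs:
        if wiki_lm_perplexity(doc) < threshold:
            yield doc

# Usage with a dummy scorer: very short "documents" pretend to be noisy.
dummy_scorer = lambda doc: 2000.0 if len(doc) < 20 else 500.0
print(list(filter_wikipedia_like(["ok", "Un paragraphe long et bien rédigé."], dummy_scorer)))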

D Impact of number of steps

Figure 1 displays the evolution of downstream task performance with respect to the number of pretraining steps. All scores in this section are averages from at least 4 runs with different random seeds. For POS tagging and dependency parsing, we also average the scores on the 4 treebanks.

We evaluate our model at every epoch (1 epoch equals 8360 steps). We report the masked language modelling perplexity along with downstream performance. Figure 1 suggests that the more complex the task, the more impactful the number of steps is. We observe an early plateau for dependency parsing and NER at around 22k steps, whereas for NLI, even if the marginal improvement with respect to pretraining steps becomes smaller, performance is still slowly increasing at 100k steps.
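A small sketch of the evaluation schedule implied above: with 1 epoch equal to 8,360 optimisation steps, evaluating at every epoch yields the checkpoints below. The helper itself is illustrative; only the steps-per-epoch figure comes from the setup described in the text.

STEPS_PER_EPOCH = 8_360  # from the setup described above

def evaluation_checkpoints(total_steps):
    """Steps at which the model is evaluated (end of each completed epoch)."""
    return list(range(STEPS_PER_EPOCH, total_steps + 1, STEPS_PER_EPOCH))

print(len(evaluation_checkpoints(100_000)))   # 11 full epochs fit in 100k steps
print(round(100_000 / STEPS_PER_EPOCH, 2))    # ~11.96 epochs at 100k steps
print(round(500_000 / STEPS_PER_EPOCH, 2))    # ~59.81 epochs at 500k steps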

In Table 8, we compare two models trained on CCNet, one for 100k steps and the other for 500k steps, to evaluate the influence of the total number of steps. Training for 500k steps does not increase the scores much over training for 100k steps on POS tagging and parsing. The increase is slightly higher for XNLI (+0.84).

These results suggest that low-level syntactic representations are captured early in the language model training process, while more steps are needed to extract the complex semantic information required for NLI.

DATASET  MASKING     ARCH.   #STEPS    GSD           SEQUOIA       SPOKEN        PARTUT        NER     NLI
                                       UPOS   LAS    UPOS   LAS    UPOS   LAS    UPOS   LAS    F1      ACC.

Fine-tuning
OSCAR    Subword     BASE    100k      98.25  92.29  99.25  93.70  96.95  79.96  97.73  92.68  89.23   81.18
OSCAR    Whole-word  BASE    100k      98.21  92.30  99.21  94.33  96.97  80.16  97.78  92.65  89.11   81.92
CCNET    Subword     BASE    100k      98.02  92.06  99.26  94.13  96.94  80.39  97.55  92.66  89.05   81.77
CCNET    Whole-word  BASE    100k      98.03  92.43  99.18  94.26  96.98  80.89  97.46  92.33  89.27   81.92
CCNET    Whole-word  BASE    500k      98.21  92.43  99.24  94.60  96.69  80.97  97.65  92.48  89.08   83.43
CCNET    Whole-word  LARGE   100k      98.01  91.09  99.23  93.65  97.01  80.89  97.41  92.59  89.39   85.29

Embeddings (with UDPipe Future (tagging, parsing) or LSTM+CRF (NER))
OSCAR    Subword     BASE    100k      98.01  90.64  99.27  94.26  97.15  82.56  97.70  92.70  90.25   -
OSCAR    Whole-word  BASE    100k      97.97  90.44  99.23  93.93  97.08  81.74  97.50  92.28  89.48   -
CCNET    Subword     BASE    100k      97.87  90.78  99.20  94.33  97.17  82.39  97.54  92.51  89.38   -
CCNET    Whole-word  BASE    100k      97.96  90.76  99.23  94.34  97.04  82.09  97.39  92.82  89.85   -
CCNET    Whole-word  BASE    500k      97.84  90.25  99.14  93.96  97.01  82.17  97.27  92.28  89.07   -
CCNET    Whole-word  LARGE   100k      98.01  90.70  99.23  94.01  97.04  82.18  97.31  92.28  88.76   -

Table 7: Performance reported on test sets for all trained models (average over multiple fine-tuning seeds).

DATASET  MASKING     ARCH.   #PARAM.  #STEPS   UPOS    LAS    NER    XNLI

Masking Strategy
OSCAR    Subword     BASE    110M     100k     97.78   89.80  91.55  81.04
OSCAR    Whole-word  BASE    110M     100k     97.79   89.88  91.44  81.55

Model Size
CCNet    Whole-word  BASE    110M     100k     97.67   89.46  90.13  82.22
CCNet    Whole-word  LARGE   335M     100k     97.74   89.82  92.47  85.73

Dataset
CCNet    Whole-word  BASE    110M     100k     97.67   89.46  90.13  82.22
OSCAR    Whole-word  BASE    110M     100k     97.79   89.88  91.44  81.55

Number of Steps
CCNet    Whole-word  BASE    110M     100k     98.04   89.85  90.13  82.20
CCNet    Whole-word  BASE    110M     500k     97.95   90.12  91.30  83.04

Table 8: Comparing scores on the validation sets of different design choices. POS tagging and parsing datasets are averaged (average over multiple fine-tuning seeds).