Learning and Evaluating Contextual Embedding of Source Code

Learning and Evaluating Contextual Embedding of Source Code

Aditya Kanade 1 2 Petros Maniatis 2 Gogul Balakrishnan 2 Kensen Shi 2

Abstract

Recent research has achieved impressive resultson understanding and improving source code bybuilding up on machine-learning techniques de-veloped for natural languages A significant ad-vancement in natural-language understanding hascome with the development of pre-trained con-textual embeddings such as BERT which canbe fine-tuned for downstream tasks with less la-beled data and training budget while achievingbetter accuracies However there is no attemptyet to obtain a high-quality contextual embed-ding of source code and to evaluate it on multipleprogram-understanding tasks simultaneously thatis the gap that this paper aims to mitigate Specifi-cally first we curate a massive deduplicated cor-pus of 74M Python files from GitHub which weuse to pre-train CuBERT an open-sourced code-understanding BERT model and second we cre-ate an open-sourced benchmark that comprisesfive classification tasks and one program-repairtask akin to code-understanding tasks proposedin the literature before We fine-tune CuBERT onour benchmark tasks and compare the resultingmodels to different variants of Word2Vec tokenembeddings BiLSTM and Transformer modelsas well as published state-of-the-art models show-ing that CuBERT outperforms them all even withshorter training and with fewer labeled exam-ples Future work on source-code embedding canbenefit from reusing our benchmark and fromcomparing against CuBERT models as a strongbaseline

Equal contribution 1Indian Institute of Science BangaloreIndia 2Google Brain Mountain View USA Correspondence toAditya Kanade ltkanadeiiscacingt Petros Maniatis ltmani-atisgooglecomgt

Proceedings of the 37 th International Conference on MachineLearning Online PMLR 119 2020 Copyright 2020 by the au-thor(s)

1 IntroductionModern software engineering places a high value on writingclean and readable code This helps other developers under-stand the authorrsquos intent so that they can maintain and extendthe code Developers use meaningful identifier names andnatural-language documentation to make this happen (Mar-tin 2008) As a result source code contains substantialinformation that can be exploited by machine-learning algo-rithms Indeed sequence modeling on source code has beenshown to be successful in a variety of software-engineeringtasks such as code completion (Hindle et al 2012 Raychevet al 2014) source code to pseudo-code mapping (Odaet al 2015) API-sequence prediction (Gu et al 2016) pro-gram repair (Pu et al 2016 Gupta et al 2017) and naturallanguage to code mapping (Iyer et al 2018) among others

The distributed vector representations of tokens called to-ken (or word) embeddings are a crucial component ofneural methods for sequence modeling Learning usefulembeddings in a supervised setting with limited data isoften difficult Therefore many unsupervised learning ap-proaches have been proposed to take advantage of largeamounts of unlabeled data that are more readily availableThis has resulted in ever more useful pre-trained token em-beddings (Mikolov et al 2013a Pennington et al 2014Bojanowski et al 2017) However the subtle differencesin the meaning of a token in varying contexts are lost wheneach word is associated with a single representation Recenttechniques for learning contextual embeddings (McCannet al 2017 Peters et al 2018 Radford et al 2018 2019Devlin et al 2019 Yang et al 2019) provide ways to com-pute representations of tokens based on their surroundingcontext and have shown significant accuracy improvementsin downstream tasks even with only a small number oftask-specific parameters

Inspired by the success of pre-trained contextual embed-dings for natural languages we present the first attempt toapply the underlying techniques to source code In partic-ular BERT (Devlin et al 2019) produces a bidirectionalTransformer encoder (Vaswani et al 2017) by training it topredict values of masked tokens and whether two sentencesfollow each other in a natural discourse The pre-trainedmodel can be fine-tuned for downstream supervised tasksand has been shown to produce state-of-the-art results on


a number of natural-language understanding benchmarksIn this work we derive a contextual embedding of sourcecode by training a BERT model on source code We call ourmodel CuBERT short for Code Understanding BERT

In order to achieve this we curate a massive corpus ofPython programs collected from GitHub GitHub projectsare known to contain a large amount of duplicate code Toavoid biasing the model to such duplicated code we performdeduplication using the method of Allamanis (2018) Theresulting corpus has 74 million files with a total of 93billion tokens (16 million unique) For comparison wealso train Word2Vec embeddings (Mikolov et al 2013ab)namely continuous bag-of-words (CBOW) and Skipgramembeddings on the same corpus

For evaluating CuBERT we create a benchmark of five clas-sification tasks and a sixth localization and repair task Theclassification tasks range from classification of source codeaccording to presence or absence of certain classes of bugsto mismatch between a functionrsquos natural language descrip-tion and its body to predicting the right kind of exception tocatch for a given code fragment The localization and repairtask defined for variable-misuse bugs (Vasic et al 2019)is a pointer-prediction task Although similar tasks haveappeared in prior work the associated datasets come fromdifferent languages and varied sources instead we create acohesive multiple-task benchmark dataset in this work Toproduce a high-quality dataset we ensure that there is nooverlap between pre-training and fine-tuning examples andthat all of the tasks are defined on Python code

We fine-tune CuBERT on each of the classification tasksand compare the results to multi-layered bidirectionalLSTM (Hochreiter amp Schmidhuber 1997) models as wellas Transformers (Vaswani et al 2017) We train the LSTMmodels from scratch and also using pre-trainined Word2Vecembeddings Our results show that CuBERT consistentlyoutperforms these baseline models by 32 to 147across the classification tasks We perform a number ofadditional studies by varying the sampling strategies usedfor training Word2Vec models and by varying programlengths In addition we also show that CuBERT can befine-tuned effectively using only 33 of the task-specificlabeled data and with only 2 epochs and that even thenit attains results competitive to the baseline models trainedwith the full datasets and many more epochs CuBERTwhen fine-tuned on the variable-misuse localization andrepair task produces high classification localization andlocalization+repair accuracies and outperforms publishedstate-of-the-art models (Hellendoorn et al 2020 Vasic et al2019) Our contributions are as follows

bull We present the first attempt at pre-training a BERTcontextual embedding of source code

bull We show the efficacy of the pre-trained contextual em-bedding on five classification tasks Our fine-tunedmodels outperform baseline LSTM models (withwith-out Word2Vec embeddings) as well as Transformerstrained from scratch even with reduced training data

bull We evaluate CuBERT on a pointer prediction task andshow that it outperforms state-of-the-art results signifi-cantly

bull We make the models and datasets publicly available1

We hope that future work benefits from our contribu-tions by reusing our benchmark tasks and by compar-ing against our strong baseline models

2 Related WorkGiven the abundance of natural-language text and the rel-ative difficulty of obtaining labeled data much effort hasbeen devoted to using large corpora to learn about languagein an unsupervised fashion before trying to focus on taskswith small labeled training datasets Word2Vec (Mikolovet al 2013ab) computed word embeddings based on wordco-occurrence and proximity but the same embedding isused regardless of the context The continued advances inword (Pennington et al 2014) and subword (Bojanowskiet al 2017) embeddings led to publicly released pre-trainedembeddings used in a variety of tasks

To deal with varying word context contextual word embed-dings were developed (McCann et al 2017 Peters et al2018 Radford et al 2018 2019) in which an embeddingis learned for the context of a word in a particular sentencenamely the sequence of words preceding it and possiblyfollowing it BERT (Devlin et al 2019) improved natural-language pre-training by using a de-noising autoencoderInstead of learning a language model which is inherentlysequential BERT optimizes for predicting a noised wordwithin a sentence Such prediction instances are gener-ated by choosing a word position and either keeping it un-changed removing the word or replacing the word with arandom wrong word It also pre-trains with the objectiveof predicting whether two sentences can be next to eachother These pre-training objectives along with the use ofa Transformer-based architecture gave BERT an accuracyboost in a number of NLP tasks over the state-of-the-artBERT has been improved upon in various ways includingmodifying training objectives utilizing ensembles combin-ing attention with autoregression (Yang et al 2019) andexpanding pre-training corpora and time (Liu et al 2019)However the main architecture of BERT seems to hold upas the state-of-the-art as of this writing

1httpsgithubcomgoogle-researchgoogle-researchtreemastercubert


In the space of programming languages embeddings havebeen learned for specific software-engineering tasks (Chenamp Monperrus 2019) These include embeddings of variableand method identifiers using local and global context (Al-lamanis et al 2015) abstract syntax trees (ASTs) (Mouet al 2016 Zhang et al 2019) AST paths (Alon et al2019) memory heap graphs (Li et al 2016) and ASTsenriched with data-flow information (Allamanis et al 2018Hellendoorn et al 2020) These approaches require an-alyzing source code beyond simple tokenization In thiswork we derive a pre-trained contextual embedding of tok-enized source code without explicitly modeling source-code-specific information and show that the resulting embeddingcan be effectively fine-tuned for downstream tasks

CodeBERT (Feng et al 2020) targets paired natural-language (NL) and multi-lingual programming-language(PL) tasks such as code search and generation of code doc-umentation It pre-trains a Transformer encoder by treatinga natural-language description of a function and its bodyas separate sentences in the sentence-pair representationof BERT We also handle natural language directly but donot require such a separation Natural-language tokens canbe mixed with source-code tokens both within and acrosssentences in our encoding One of our benchmark tasksfunction-docstring mismatch illustrates the ability of Cu-BERT to handle NL-PL tasks

3 Experimental SetupWe now outline our benchmarks and experimental studyThe supplementary material contains deeper detail aimed atreproducing our results

31 Code Corpus for Fine-Tuning Tasks

We use the ETH Py150 corpus (Raychev et al 2016) to gen-erate datasets for the fine-tuning tasks This corpus consistsof 150K Python files from GitHub and is partitioned intoa training split (100K files) and a test split (50K files) Weheld out 10K files from the training split as a validation splitWe deduplicated the dataset in the fashion of Allamanis(2018) Finally we drop from this corpus those projectsfor which licensing information was not available or whoselicenses restrict use or redistribution We call the resultingcorpus the ETH Py150 Open corpus2 This is our Pythonfine-tuning code corpus and it consists of 74749 trainingfiles 8302 validation files and 41457 test files

32 The GitHub Python Pre-Training Code Corpus

We used the public GitHub repository hosted on GooglersquosBigQuery platform (the github repos dataset under Big-

2httpsgithubcomgoogle-research-datasetseth_py150_open

Queryrsquos public-data project bigquery-public-data)We extracted all files ending in py under open-source re-distributable licenses removed symbolic links and retainedonly files reported to be in the refsheadsmasterbranch This resulted in about 162 million files

To avoid duplication between pre-training and fine-tuningdata we removed files that had high similarity to the files inthe ETH Py150 Open corpus using the method of Allamanis(2018) In particular two files are considered similar to eachother if the Jaccard similarity between the sets of tokens(identifiers and string literals) is above 08 and in additionit is above 07 for multi-sets of tokens This brought thedataset to 143 million files We then further deduplicatedthe remaining files by clustering them into equivalenceclasses holding similar files according to the same similaritymetric and keeping only one exemplar per equivalence classThis helps avoid biasing the pre-trained embedding Finallywe removed files that could not be parsed In the end wewere left with 74 million Python files containing over 93billion tokens This is our Python pre-training code corpus

33 Source-Code Modeling

We first tokenize a Python program using the standardPython tokenizer (the tokenize package) We leave lan-guage keywords intact and produce special tokens for syn-tactic elements that have either no string representation (egDEDENT tokens which occur when a nested program scopeconcludes) or ambiguous interpretation (eg new-line char-acters inside string literals at the logical end of a Pythonstatement or in the middle of a Python statement result indistinct special tokens) We split identifiers according tocommon heuristic rules (eg snake or Camel case) Finallywe split string literals using heuristic rules on white-spacecharacters and on special characters We limit all thus pro-duced tokens to a maximum length of 15 characters Wecall this the program vocabulary Our Python pre-trainingcode corpus contained 16 million unique tokens

We greedily compress the program vocabulary into asubword vocabulary (Schuster amp Nakajima 2012) us-ing the SubwordTextEncoder from the Tensor2Tensorproject (Vaswani et al 2018)3 resulting in about 50K to-kens All words in the program vocabulary can be losslesslyencoded using one or more of the subword tokens

We tokenize programs first into program tokens as de-scribed above and then encode those tokens one by onein the subword vocabulary The objective of this encod-ing scheme is to preserve syntactically meaningful bound-aries of tokens For example the identifier ldquosnake caserdquo

3httpsgithubcomtensorflowtensor2tensorblobmastertensor2tensordata_generatorstext_encoderpy


could be encoded as ldquosna ke ca serdquo preserving thesnake case split of its characters even if the subtoken ldquoe crdquowere very popular in the corpus the latter encoding mightresult in a smaller representation but would lose the intent ofthe programmer in using a snake-case identifier Similarlyldquoi=0rdquo may be very frequent in the corpus but we still forceit to be encoded as separate tokens i = and 0 ensuring thatwe preserve the distinction between operators and operandsBoth the BERT model and the Word2Vec embeddings arebuilt on the subword vocabulary

34 Fine-Tuning Tasks

To evaluate CuBERT we design five classification tasks anda multi-headed pointer task These are motivated by priorwork but unfortunately the associated datasets come fromdifferent languages and varied sources We want the tasksto be on Python code and for accurate results we ensurethat there is no overlap between pre-training and fine-tuningdatasets We therefore create all the tasks on the ETH Py150Open corpus (see Section 31) As discussed in Section 32we ensure that there is no duplication between this and thepre-training corpus We hope that our datasets for thesetasks will be useful to others as well The fine-tuning tasksare described below A more detailed discussion is presentedin the supplementary material

Variable-Misuse Classification Allamanis et al (2018)observed that developers may mistakenly use an incorrectvariable in the place of a correct one These mistakes mayoccur when developers copy-paste similar code but forgetto rename all occurrences of variables from the originalfragment or when there are similar variable names that canbe confused with each other These can be subtle errorsthat remain undetected during compilation The task byAllamanis et al (2018) is to choose the correct variable nameat a location within a C function We take the classificationversion restated by Vasic et al (2019) wherein given afunction the task is to predict whether there is a variablemisuse at any location in the function without specifyinga particular location to consider Here the classifier has toconsider all variables and their usages to make the decisionIn order to create negative (buggy) examples we replace avariable use at some location with another variable that isdefined within the function

Wrong Binary Operator Pradel amp Sen (2018) proposedthe task of detecting whether a binary operator in a givenexpression is correct They use features extracted fromlimited surrounding context We use the entire functionwith the goal of detecting whether any binary operator inthe function is incorrect The negative examples are createdby randomly replacing some binary operator with anothertype-compatible operator

Swapped Operand Pradel amp Sen (2018) propose thewrong binary operand task where a variable or constantis used incorrectly in an expression but that task is quitesimilar to the variable-misuse task we already use Wetherefore define another class of operand errors where theoperands of non-commutative binary operators are swappedThe operands can be arbitrary subexpressions and are notrestricted to be just variables or constants To simplify ex-ample generation we restrict this task to examples in whichthe operator and operands all fit within a single line

Function-Docstring Mismatch Developers are encour-aged to write descriptive docstrings to explain the function-ality and usage of functions This provides parallel corporabetween code and natural language sentences that have beenused for machine translation (Barone amp Sennrich 2017)detecting uninformative docstrings (Louis et al 2018) andto evaluate their utility to provide supervision in neural codesearch (Cambronero et al 2019) We prepare a sentence-pair classification problem where the function and its doc-string form two distinct sentences The positive examplescome from the correct function-docstring pairs We createnegative examples by replacing correct docstrings with doc-strings of other functions randomly chosen from the datasetFor this task the existing docstring is removed from thefunction body

Exception Type While it is possible to write genericexception handlers (eg ldquoexcept Exceptionrdquo inPython) it is considered a good coding practice to catchand handle the precise exceptions that can be raised by acode fragment4 We identified the 20 most common excep-tion types from the GitHub dataset excluding the catch-allException (full list in Table 1 in the supplementary ma-terial) Given a function with an except clause for one ofthese exception types we replace the exception with a spe-cial ldquoholerdquo token The task is the multi-class classificationproblem of predicting the original exception type

Variable-Misuse Localization and Repair As an in-stance of a non-classification task we consider the jointclassification localization and repair version of the variable-misuse task from Vasic et al (2019) Given a function thetask is to predict one pointer (called the localization pointer)to identify a variable-misuse location and another pointer(called the repair pointer) to identify a variable from thesame function that is the right one to use at the faulty loca-tion The model is also trained to classify functions that donot contain any variable misuse as bug-free by making thelocalization pointer point to a special location in the func-tion We create negative examples using the same method

4httpsgooglegithubiostyleguidepyguidehtml24-exceptions


Train Validation Test

Variable-Misuse Classification 700708 8192 (75478) 378440Wrong Binary Operator 459400 8192 (49804) 251804Swapped Operand 236246 8192 (26118) 130972Function-Docstring 340846 8192 (37592) 186698Exception Type 18480 2088 (2088) 10348Variable-Misuse Localization and Repair 700708 8192 (75478) 378440

Table 1 Benchmark fine-tuning datasets Note that for validation we have subsampled the original datasets (in parentheses) down to8192 examples except for exception classification which only had 2088 validation examples all of which are included

as used in the Variable-Misuse Classification task

Table 1 lists the sizes of the resulting benchmark datasetsextracted from the fine-tuning corpus The Exception Typetask contains significantly fewer examples than the othertasks since examples for this task only come from functionsthat catch one of the chosen 20 exception types

35 BERT for Source Code

The BERT model (Devlin et al 2019) consists of a multi-layered Transformer encoder It is trained with two tasks(1) to predict the correct tokens in a fraction of all positionssome of which have been replaced with incorrect tokens orthe special [MASK] token (the Masked Language Modeltask or MLM) and (2) to predict whether the two sentencesseparated by the special [SEP] token follow each otherin some natural discourse (the Next-Sentence Predictiontask or NSP) Thus each example consists of one or twosentences where a sentence is the concatenation of con-tiguous lines from the source corpus sized to fit the targetexample length To ensure that every sentence is treated inmultiple instances of both MLM and NSP BERT by defaultduplicates the corpus 10 times and generates independentlyderived examples from each duplicate With 50 proba-bility the second example sentence comes from a randomdocument (for NSP) A token is chosen at random for anMLM prediction (up to 20 per example) and from thosechosen 80 are masked 10 are left undisturbed and10 are replaced with a random token

CuBERT is similarly formulated but a CuBERT line is a log-ical code line as defined by the Python standard Intuitivelya logical code line is the shortest sequence of consecutivelines that constitutes a legal statement eg it has correctlymatching parentheses We count example lengths by count-ing the subword tokens of both sentences (see Section 33)

We train the BERT Large model having 24 layers with 16attention heads and 1024 hidden units Sentences are cre-ated from our pre-training dataset Task-specific classifierspass the embedding of a special start-of-example [CLS]token through feed-forward and softmax layers For thepointer prediction task the pointers are computed exactly as

by Vasic et al (2019) whereas in that work the pointers arecomputed from the output of an LSTM layer in our modelthey are computed from the last-layer hiddens of BERT

36 Baselines

361 WORD2VEC

We train Word2Vec models using the same pre-trainingcorpus as the BERT model To maintain parity we gen-erate the dataset for Word2Vec using the same pipeline asBERT but by disabling masking and generation of negativeexamples for NSP The dataset is generated without anyduplication We train both CBOW and Skipgram modelsusing GenSim (Rehurek amp Sojka 2010) To deal with thelarge vocabulary we use negative sampling and hierarchicalsoftmax (Mikolov et al 2013ab) to train the two versionsIn all we obtain four types of Word2Vec embeddings

362 BIDIRECTIONAL LSTM AND TRANSFORMER

In order to obtain context-sensitive encodings of input se-quences for the fine-tuning tasks we use multi-layered bidi-rectional LSTMs (Hochreiter amp Schmidhuber 1997) (BiL-STMs) These are initialized with the pre-trained Word2Vecembeddings To further evaluate whether LSTMs aloneare sufficient without pre-training we also train BiLSTMswith an embedding matrix that is initialized from scratchwith Xavier initialization (Glorot amp Bengio 2010) Wealso trained Transformer models (Vaswani et al 2017) forour fine-tuning tasks We used BERTrsquos own Transformerimplementation to ensure comparability of results For com-parison with prior work we use the unidirectional LSTMand pointer model from Vasic et al (2019) for the Variable-Misuse Localization and Repair task

4 Experimental Results41 Training Details

CuBERTrsquos dataset generation duplicates the corpus 10 timeswhereas Word2Vec is trained without duplication To com-pensate for this difference we trained Word2Vec for 10


epochs and CuBERT for 1 epoch We chose models byvalidation accuracy both during hyperparameter searchesand during model selection within an experiment

We pre-train CuBERT with the default configuration of theBERT Large model one model per example length (128256 512 and 1024 subword tokens) with batch sizes of8192 4096 2048 and 1024 respectively and the defaultBERT learning rate of 1times 10minus4 Fine-tuned models alsoused the same batch sizes as for pre-training and BERTrsquosdefault learning rate (5times 10minus5) For both we graduallywarm up the learning rate for the first 10 of exampleswhich is BERTrsquos default value

For Word2Vec when training with negative samples wechoose 5 negative samples The embedding size for allthe Word2Vec pre-trained models is set at 1024 For thebaseline BiLSTM models we performed a hyperparametersearch on each task and pre-training configuration separately(5 tasks each trained with the four Word2Vec embeddingsplus the randomly initialized embeddings) for the 512 ex-ample length For each of these 25 task configurations wevaried the number of layers (1 to 3) the number of hid-den units (128 256 and 512) the LSTM output dropoutprobability (01 and 05) and the learning rate (1times 10minus31times 10minus4 and 1times 10minus5) We used the Adam (Kingma ampBa 2014) optimizer throughout and batch size 8192 forall tasks except the Exception-Type task for which we usedbatch size 64 Invariably the best hyperparameter selectionhad 512 hidden units per layer and learning rate of 1times 10minus3but the number of layers (mostly 2 or 3) and dropout prob-ability varied across best task configurations Though nosingle Word2Vec configuration is the best CBOW trainedwith negative sampling gives the most consistent resultsoverall

For the baseline Transformer models we originally at-tempted to train a model of the same configuration as Cu-BERT However the sizes of our fine-tuning datasets seemedtoo small to train that large a Transformer Instead weperformed a hyperparameter search for each task individ-ually for the 512 example length We varied the num-ber of transformer layers (1 to 6) hidden units (128 256and 512) learning rates (1times 10minus3 5times 10minus4 1times 10minus45times 10minus5 and 1times 10minus5) and batch sizes (512 1024 2048and 4096) The best architecture varied across the tasks forexample 5 layers with 128 hiddens and the highest learningrate worked best for the Function-Docstring task whereasfor the Exception-Type task 2 layers 512 hiddens and thesecond lowest learning rate worked best

Finally for our baseline pointer model (referred to asLSTM+pointer below) we searched over the following hy-perparameter choices hidden sizes of 512 and 1024 tokenembedding sizes of 512 and 1024 and learning rates of1times 10minus1 1times 10minus2 and 1times 10minus3 We used the Adam op-

timizer a batch size of 256 and example length 512 Incontrast to the original work (Vasic et al 2019) we gen-erated one pair of buggybug-free examples per function(rather than one per variable use per function which wouldbias towards longer functions) and use CuBERTrsquos subword-tokenized vocabulary of 50K subtokens (rather than a lim-ited full-token vocabulary which leaves many tokens out ofvocabulary)

We used TPUs for training our models except for pre-training Word2Vec embeddings and the pointer model byVasic et al (2019) For the rest and for all evaluationswe used P100 or V100 GPUs All experiments using pre-trained word or contextual embeddings continued to fine-tune weights throughout training

42 Research Questions

We set out to answer the following research questions Wewill address each with our results

1 Do contextual embeddings help with source-code anal-ysis tasks when pre-trained on an unlabeled code cor-pus We compare CuBERT to BiLSTM models withand without pre-trained Word2Vec embeddings on theclassification tasks (Section 43)

2 Does fine-tuning actually help or is the Transformermodel by itself sufficient We compare fine-tunedCuBERT models to Transformer-based models trainedfrom scratch on the classification tasks (Section 44)

3 How does the performance of CuBERT on the classifi-cation tasks scale with the amount of labeled trainingdata We compare the performance of fine-tuned Cu-BERT models when fine-tuning with 33 66 and100 of the task training data (Section 45)

4 How does context size affect CuBERT We comparefine-tuning performance for different example lengthson the classification tasks (Section 46)

5 How does CuBERT perform on complex tasks againststate-of-the-art methods We implemented and fine-tuned a model for a multi-headed pointer predictiontask namely the Variable-Misuse Localization andRepair task (Section 47) We compare it to the modelsfrom (Vasic et al 2019) and (Hellendoorn et al 2020)

Except for Section 46 all the results are presented for se-quences of length 512 We give examples of classificationinstances in the supplementary material and include visual-izations of attention weights for them


Setting Misuse Operator Operand Docstring Exception

BiLSTM

From scratch 7629 8365 8807 7601 5279

CBOW ns 8033 8682 8980 8908 6701

(100 epochs) hs 7800 8585 9014 8769 6031

Skipgram ns 7706 8514 8931 8381 6007hs 8053 8634 8975 8880 6506

CuBERT2 epochs 9404 8990 9220 9721 610410 epochs 9514 9215 9362 9808 779720 epochs 9521 9246 9336 9809 7912

Transformer 100 epochs 7828 7655 8783 9102 4956

Table 2 Test accuracies of fine-tuned CuBERT against BiLSTM (with and without Word2Vec embeddings) and Transformer trained fromscratch on the classification tasks ldquonsrdquo and ldquohsrdquo respectively refer to negative sampling and hierarchical softmax settings used for trainingCBOW and Skipgram models ldquoFrom scratchrdquo refers to training with freshly initialized token embeddings without pre-training

43 Contextual vs Word Embeddings

The purpose of this analysis is to understand how much pre-trained contextual embeddings help compared to word em-beddings For each classification task we trained BiLSTMmodels starting with each of the Word2Vec embeddingsnamely continuous bag of words (CBOW) and Skipgramtrained with negative sampling or hierarchical softmax Wetrained the BiLSTM models for 100 epochs and the Cu-BERT models for 20 epochs and all models stopped im-proving by the end

The resulting test accuracies are shown in Table 2 (first 5rows and next-to-last row) CuBERT consistently outper-forms BiLSTM (with the best task-wise Word2Vec configu-ration) on all tasks by a margin of 32 to 147 Thusthe pre-trained contextual embedding provides superior re-sults even with a smaller budget of 20 epochs comparedto the 100 epochs used for BiLSTMs The Exception-Typeclassification task has an order of magnitude less trainingdata than the other tasks (see Table 1) The difference be-tween the performance of BiLSTM and CuBERT is substan-tially higher for this task Thus fine-tuning is of much valuefor tasks with limited labeled training data

We analyzed the performance of CuBERT with the reducedfine-tuning budget of only 2 and 10 epochs (see the remain-ing rows of the CuBERT section in Table 2) Except forthe Exception Type task CuBERT outperforms the best100-epoch BiLSTM within 2 fine-tuning epochs On theException-Type task CuBERT with 2 fine-tuning epochsoutperforms all but two configurations of the BiLSTM base-line This shows that even when restricted to just a fewfine-tuning epochs CuBERT can reach accuracies that arecomparable to or better than those of BiLSTMs trained withWord2Vec embeddings

To sanity-check our findings about BiLSTMs we alsotrained the BiLSTM models from scratch without pre-

trained embeddings The results are shown in the first rowof Table 2 Compared to those the use of Word2Vec embed-dings performs better by a margin of 27 to 142

44 Is Transformer All You Need

One may wonder if CuBERTrsquos promising results derivemore from using a Transformer-based model for its classi-fication tasks and less from the actual unsupervised pre-training Here we compare our results on the classificationtasks to a Transformer-based model trained from scratchie without the benefit of a pre-trained embedding Asdiscussed in Section 41 the size of the training data limitedus to try out Transformers that were substantially smallerthan the CuBERT model (BERT Large architecture) Allthe Transformer models were trained for 100 epochs duringwhich their performance stopped improving We selectedthe best model within the chosen hyperparameters for eachtask based on best validation accuracy

As seen from the last row of Table 2 the performance of Cu-BERT is substantially higher than the Transformer modelstrained from scratch Thus for the same choice of archi-tecture (ie Transformer) pre-training seems to help byenabling training of a larger and better model

45 The Effects of Little Supervision

The big draw of unsupervised pre-training followed byfine-tuning is that some tasks have small labeled datasetsWe study here how CuBERT fares with reduced trainingdata We sampled uniformly the fine-tuning dataset to 33and 66 of its size and produced corresponding trainingdatasets for each classification task We then fine-tunedthe pre-trained CuBERT model with each of the 3 differenttraining splits Validation and testing were done with thesame original datasets Table 3 shows the results

The Function Docstring task seems robust to the reduction


Best of Epochs

TrainFraction Misuse Operator Operand Docstring Exception

2100 9404 8990 9220 9721 610466 9311 8876 9161 9704 194933 9140 8642 9052 9638 2009

10100 9514 9215 9362 9808 779766 9478 9151 9337 9793 752433 9428 9066 9258 9736 6734

20100 9521 9246 9336 9809 791266 9490 9179 9339 9799 773133 9445 9109 9282 9763 7498

Table 3 Effects of reducing training-split size on fine-tuning performance on the classification tasks

Length Misuse Operator Operand Docstring Exception Misuse on BiLSTM

128 8397 7929 7802 9819 6203 7432256 9202 8819 8803 9814 7280 7847512 9521 9246 9336 9809 7912 8033

1024 9583 9338 9562 9790 8127 8192

Table 4 Best out of 20 epochs of fine-tuning for four example lengths on the classification tasks For contrast we also include results forVariable Misuse using the BiLSTM Word2Vec (CBOW + ns) classifier as length varies

of the training dataset both early and late in the fine-tuningprocess (that is within 2 vs 20 epochs) whereas the Excep-tion Classification task is heavily impacted by the datasetreduction given that it has relatively few training exam-ples to begin with Interestingly enough for some taskseven fine-tuning for only 2 epochs and only using a third ofthe training data outperforms the baselines For examplefor Variable Misuse and Function Docstring CuBERT at 2epochs and 33 of training data substantially outperformsthe BiLSTM with Word2Vec and the Transformer baselines

46 The Effects of Context

Context size is especially useful in code tasks given thatsome relevant information may lie many ldquosentencesrdquo awayfrom its locus of interest Here we study how reducingthe context length (ie the length of the examples used topre-train and fine-tune) affects performance We producedata with shorter example lengths by first pre-training amodel on a given example length and then fine-tuning thatmodel on the corresponding task with examples of that sameexample length5 Table 4 shows the results

Although context seems to be important to most tasks theFunction Docstring task paradoxically improves with lesscontext This may be because the task primarily depends on

5Note that we did not attempt to say pre-train on length 1024and then fine-tune that model on length 256-examples which mayalso be a practical scenario

comparison between the docstring and the function signa-ture and including more context dilutes the modelrsquos focus

For comparison we also evaluated the BiLSTM model onvarying example lengths for the Variable-Misuse task withCBOW and negative sampling (last column of Table 4)More context does seem to benefit the BiLSTM Variable-Misuse classifier as well However the improvement offeredby CuBERT with increasing context is significantly greater

47 Evaluation on a Multi-Headed Pointer Task

We now discuss the results of fine-tuning CuBERT to predictthe localization and repair pointers for the variable-misusetask For this task we implement the multi-headed pointermodel from Vasic et al (2019) on top of CuBERT Thebaseline consists of the same pointer model on a unidirec-tional LSTM as used by Vasic et al (2019) We refer tothese models as CuBERT+pointer and LSTM+pointer re-spectively Due to limitations of space we omit the detailsof the pointer model and refer the reader to the above pa-per However the two implementations are identical abovethe sequence encoding layer the difference is the BERTencoder versus an LSTM encoder As reported in Section 4of that work to enable comparison with an enumerativeapproach the evaluation was performed only on 12K testexamples Instead here we report the numbers on all 378Kof our test examples for both models

We trained the baseline model for 100 epochs and fine-tuned


Model Test Data Setting True Classification Localization Loc+RepairPositive Accuracy Accuracy Accuracy

LSTM C 100 epochs 8241 7930 6439 5689

CuBERT C2 epochs 9690 9487 9114 894110 epochs 9723 9549 9233 908420 epochs 9727 9540 9212 9061

CuBERT H2 epochs 9563 9071 8350 807710 epochs 9607 9171 8537 829120 epochs 9614 9149 8485 8230

Hellendoorn et al (2020) H 8190 7380

Table 5 Variable-misuse localization and repair task Comparison of the LSTM+pointer model (Vasic et al 2019) to our fine-tunedCuBERT+pointer model We also show results on the test data by Hellendoorn et al (2020) computed by us and reported by the authors intheir Table 1 In the Test Data column C means our CuBERT test dataset and H means the test dataset used by Hellendoorn et al (2020)

CuBERT for 2 10 and 20 epochs Table 5 gives the resultsalong the same metrics as Vasic et al (2019) The metricsare defined as follows 1) True Positive is the percentage ofbug-free functions classified as bug-free 2) ClassificationAccuracy is the percentage of correctly classified examples(between bug-free and buggy) 3) Localization Accuracy isthe percentage of buggy examples for which the localizationpointer correctly identifies the bug location 4) Localiza-tion+Repair Accuracy is the percentage of buggy examplesfor which both the localization and repair pointers makecorrect predictions As seen from Table 5 (top 4 rows)CuBERT+pointer outperforms LSTM+pointer consistentlyacross all the metrics and even within 2 and 10 epochs

More recently Hellendoorn et al (2020) evaluated hybridmodels for the same task combining graph neural networksTransformers and RNNs and greatly improving prior re-sults To compare we obtained the same test dataset fromthe authors and evaluated our CuBERT fine-tuned modelon it The last four rows of Table 5 show our results andthe results reported in that work Interestingly the modelsby Hellendoorn et al (2020) make use of richer input rep-resentations including syntax data flow and control flowNevertheless CuBERT outperforms them while using onlya lexical representation of the input program

5 Conclusions and Future WorkWe present the first attempt at pre-trained contextual em-bedding of source code by training a BERT model calledCuBERT which we fine-tuned on five classification tasksand compared against BiLSTM with Word2Vec embeddingsand Transformer models As a more challenging task wealso evaluated CuBERT on a multi-headed pointer predic-tion task CuBERT outperformed the baseline models con-sistently We evaluated CuBERT with less data and fewerepochs highlighting the benefits of pre-training on a mas-

sive code corpus

We use only source-code tokens and leave it to the underly-ing Transformer model to infer any structural interactionsbetween them through self-attention Prior work (Allamaniset al 2018 Hellendoorn et al 2020) has argued for ex-plicitly using structural program information (eg controlflow and data flow) It is an interesting avenue of futurework to incorporate such information in pre-training usingrelation-aware Transformers (Shaw et al 2018) Howeverour improved results in comparison to Hellendoorn et al(2020) show that CuBERT is a simple yet powerful tech-nique and provides a strong baseline for future work onsource-code representations

While surpassing the accuracies achieved by CuBERT withnewer models and pre-trainingfine-tuning methods wouldbe a natural extension to this work we also envision otherfollow-up work There is increasing interest in develop-ing pre-training methods that can produce smaller modelsmore efficiently and that trade-off accuracy for reducedmodel size Further our benchmark could be valuable totechniques that explore other program representations (egtrees and graphs) in multi-task learning and to developrelated tasks such as program synthesis

AcknowledgementsWe are indebted to Daniel Tarlow for his guidance and gen-erous advice throughout the development of this work Ourwork has also improved thanks to feedback use cases help-ful libraries and proofs of concept offered by David BieberVincent Hellendoorn Ben Lerner Hyeontaek Lim RishabhSingh Charles Sutton and Manushree Vijayvergiya Fi-nally we are grateful to the anonymous reviewers who gaveuseful constructive comments and helped us improve ourpresentation and results


ReferencesAllamanis M The adverse effects of code duplica-

tion in machine learning models of code CoRRabs181206469 2018 URL httparxivorgabs181206469

Allamanis M Barr E T Bird C and Sutton C Sug-gesting accurate method and class names In Proceed-ings of the 2015 10th Joint Meeting on Foundationsof Software Engineering ESECFSE 2015 pp 38ndash49New York NY USA 2015 ACM ISBN 978-1-4503-3675-8 doi 10114527868052786849 URL httpdoiacmorg10114527868052786849

Allamanis M Brockschmidt M and Khademi M Learn-ing to represent programs with graphs In InternationalConference on Learning Representations 2018

Alon U Zilberstein M Levy O and Yahav E Code2vecLearning distributed representations of code Proc ACMProgram Lang 3(POPL)401ndash4029 January 2019ISSN 2475-1421 doi 1011453290353 URL httpdoiacmorg1011453290353

Barone A V M and Sennrich R A parallel corpus ofpython functions and documentation strings for auto-mated code documentation and code generation arXivpreprint arXiv170702275 2017

Bojanowski P Grave E Joulin A and Mikolov T En-riching word vectors with subword information Transac-tions of the Association for Computational Linguistics 5135ndash146 2017

Cambronero J Li H Kim S Sen K and Chandra SWhen deep learning met code search arXiv preprintarXiv190503813 2019

Chen Z and Monperrus M A literature study of embed-dings on source code arXiv preprint arXiv1904030612019

Coenen A Reif E Yuan A Kim B Pearce A ViegasF and Wattenberg M Visualizing and Measuring theGeometry of BERT ArXiv abs190602715 2019

Devlin J Chang M-W Lee K and Toutanova K BERTPre-training of deep bidirectional transformers for lan-guage understanding In Proceedings of the 2019 Con-ference of the North American Chapter of the Associa-tion for Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers) pp4171ndash4186 Minneapolis Minnesota June 2019 Asso-ciation for Computational Linguistics doi 1018653v1N19-1423 URL httpswwwaclweborganthologyN19-1423

Feng Z Guo D Tang D Duan N Feng X GongM Shou L Qin B Liu T Jiang D et al Code-bert A pre-trained model for programming and naturallanguages arXiv preprint arXiv200208155 2020

Glorot X and Bengio Y Understanding the difficultyof training deep feedforward neural networks In Pro-ceedings of the thirteenth international conference onartificial intelligence and statistics pp 249ndash256 2010

Gu X Zhang H Zhang D and Kim S Deep apilearning In Proceedings of the 2016 24th ACM SIG-SOFT International Symposium on Foundations of Soft-ware Engineering FSE 2016 pp 631ndash642 New YorkNY USA 2016 ACM ISBN 978-1-4503-4218-6 doi10114529502902950334 URL httpdoiacmorg10114529502902950334

Gupta R Pal S Kanade A and Shevade S Deep-fix Fixing common c language errors by deep learn-ing In Proceedings of the Thirty-First AAAI Confer-ence on Artificial Intelligence AAAIrsquo17 pp 1345ndash1351AAAI Press 2017 URL httpdlacmorgcitationcfmid=32982393298436

Hellendoorn V J Sutton C Singh R Maniatis P andBieber D Global relational models of source code InInternational Conference on Learning Representations2020 URL httpsopenreviewnetforumid=B1lnbRNtwr

Hindle A Barr E T Su Z Gabel M and Devanbu POn the naturalness of software In 2012 34th InternationalConference on Software Engineering (ICSE) pp 837ndash847 June 2012 doi 101109ICSE20126227135

Hochreiter S and Schmidhuber J Long short-termmemory Neural Comput 9(8)1735ndash1780 Novem-ber 1997 ISSN 0899-7667 doi 101162neco1997981735 URL httpdxdoiorg101162neco1997981735

Iyer S Konstas I Cheung A and Zettlemoyer LMapping language to code in programmatic contextIn Proceedings of the 2018 Conference on EmpiricalMethods in Natural Language Processing BrusselsBelgium October 31 - November 4 2018 pp 1643ndash1652 2018 URL httpswwwaclweborganthologyD18-1192

Kingma D P and Ba J Adam A method for stochasticoptimization arXiv preprint arXiv14126980 2014

Li Y Tarlow D Brockschmidt M and Zemel R SGated graph sequence neural networks In 4th Interna-tional Conference on Learning Representations ICLR2016 San Juan Puerto Rico May 2-4 2016 Conference


Track Proceedings 2016 URL httparxivorgabs151105493

Liu Y Ott M Goyal N Du J Joshi M Chen D LevyO Lewis M Zettlemoyer L and Stoyanov V RobertaA robustly optimized BERT pretraining approach CoRRabs190711692 2019 URL httparxivorgabs190711692

Louis A Dash S K Barr E T and Sutton C Deeplearning to detect redundant method comments arXivpreprint arXiv180604616 2018

Martin R C Clean Code A Handbook of Agile Soft-ware Craftsmanship Prentice Hall PTR Upper SaddleRiver NJ USA 1 edition 2008 ISBN 01323508829780132350884

McCann B Bradbury J Xiong C and Socher RLearned in translation Contextualized word vectors InGuyon I Luxburg U V Bengio S Wallach H Fer-gus R Vishwanathan S and Garnett R (eds) Ad-vances in Neural Information Processing Systems 30 pp6294ndash6305 Curran Associates Inc 2017

Mikolov T Chen K Corrado G and Dean J Efficientestimation of word representations in vector space In 1stInternational Conference on Learning RepresentationsICLR 2013 Scottsdale Arizona USA May 2-4 2013Workshop Track Proceedings 2013a URL httparxivorgabs13013781

Mikolov T Sutskever I Chen K Corrado G S andDean J Distributed representations of words and phrasesand their compositionality In Burges C J C BottouL Welling M Ghahramani Z and Weinberger K Q(eds) Advances in Neural Information Processing Sys-tems 26 pp 3111ndash3119 Curran Associates Inc 2013b

Mou L Li G Zhang L Wang T and Jin ZConvolutional neural networks over tree structuresfor programming language processing In Proceed-ings of the Thirtieth AAAI Conference on ArtificialIntelligence AAAIrsquo16 pp 1287ndash1293 AAAI Press2016 URL httpdlacmorgcitationcfmid=30158123016002

Oda Y Fudaba H Neubig G Hata H Sakti S TodaT and Nakamura S Learning to generate pseudo-codefrom source code using statistical machine translation(t) In 2015 30th IEEEACM International Conferenceon Automated Software Engineering (ASE) pp 574ndash584IEEE 2015

Pennington J Socher R and Manning C D GloveGlobal vectors for word representation In In EMNLP2014

Peters M E Neumann M Iyyer M Gardner M ClarkC Lee K and Zettlemoyer L Deep contextualizedword representations In Proceedings of NAACL-HLT pp2227ndash2237 2018

Pradel M and Sen K Deepbugs A learning approach toname-based bug detection Proc ACM Program Lang2(OOPSLA)1471ndash14725 October 2018 ISSN 2475-1421 doi 1011453276517 URL httpdoiacmorg1011453276517

Pu Y Narasimhan K Solar-Lezama A and Barzilay RSk p A neural program corrector for moocs In Com-panion Proceedings of the 2016 ACM SIGPLAN Interna-tional Conference on Systems Programming Languagesand Applications Software for Humanity SPLASH Com-panion 2016 pp 39ndash40 New York NY USA 2016ACM ISBN 978-1-4503-4437-1 doi 10114529840432989222 URL httpdoiacmorg10114529840432989222

Radford A Narasimhan K Salimans Tand Sutskever I Improving language un-derstanding by generative pre-training URLhttpss3-us-west-2 amazonaws comopenai-assetsresearchcoverslanguageunsupervisedlanguageunderstanding paper pdf 2018

Radford A Wu J Child R Luan D Amodei D andSutskever I Language models are unsupervised multitasklearners OpenAI Blog 1(8) 2019

Raychev V Vechev M and Yahav E Code com-pletion with statistical language models In Proceed-ings of the 35th ACM SIGPLAN Conference on Pro-gramming Language Design and Implementation PLDIrsquo14 pp 419ndash428 New York NY USA 2014 ACMISBN 978-1-4503-2784-8 doi 10114525942912594321 URL httpdoiacmorg10114525942912594321

Raychev V Bielik P and Vechev M T Probabilisticmodel for code with decision trees In Proceedings ofthe 2016 ACM SIGPLAN International Conference onObject-Oriented Programming Systems Languages andApplications OOPSLA 2016 part of SPLASH 2016 Ams-terdam The Netherlands October 30 - November 4 2016pp 731ndash747 2016

Rehurek R and Sojka P Software Framework for TopicModelling with Large Corpora In Proceedings of theLREC 2010 Workshop on New Challenges for NLPFrameworks pp 45ndash50 Valletta Malta May 2010ELRA httpismuniczpublication884893en


Schuster M and Nakajima K Japanese and korean voicesearch In International Conference on Acoustics Speechand Signal Processing pp 5149ndash5152 2012

Shaw P Uszkoreit J and Vaswani A Self-attentionwith relative position representations arXiv preprintarXiv180302155 2018

Vasic M Kanade A Maniatis P Bieber D and SinghR Neural program repair by jointly learning to localizeand repair CoRR abs190401720 2019 URL httparxivorgabs190401720

Vaswani A Shazeer N Parmar N Uszkoreit J JonesL Gomez A N Kaiser L u and Polosukhin I Atten-tion is all you need In Guyon I Luxburg U V BengioS Wallach H Fergus R Vishwanathan S and Gar-nett R (eds) Advances in Neural Information Process-ing Systems 30 pp 5998ndash6008 Curran Associates Inc2017 URL httppapersnipsccpaper7181-attention-is-all-you-needpdf

Vaswani A Bengio S Brevdo E Chollet F GomezA N Gouws S Jones L Kaiser L KalchbrennerN Parmar N Sepassi R Shazeer N and Uszko-reit J Tensor2tensor for neural machine translationIn Proceedings of the 13th Conference of the Associa-tion for Machine Translation in the Americas AMTA2018 Boston MA USA March 17-21 2018 - Volume1 Research Papers pp 193ndash199 2018 URL httpswwwaclweborganthologyW18-1819

Yang Z Dai Z Yang Y Carbonell J G Salakhut-dinov R and Le Q V Xlnet Generalized autore-gressive pretraining for language understanding CoRRabs190608237 2019 URL httparxivorgabs190608237

Zhang J Wang X Zhang H Sun H Wang K andLiu X A novel neural source code representation basedon abstract syntax tree In 2019 IEEEACM 41st Interna-tional Conference on Software Engineering (ICSE) pp783ndash794 IEEE 2019


A Open-Sourced ArtifactsWe release data and some source-code utilities athttpsgithubcomgoogle-researchgoogle-researchtreemastercubert Therepository contains the following

GitHub Manifest A list of all the file versions we includedinto our pre-training corpus after removing files simi-lar to the fine-tuning corpus6 and after deduplicationThe manifest can be used to retrieve the file contentsfrom GitHub or Googlersquos BigQuery This dataset wasretrieved from Googlersquos BigQuery on June 21 2020

Vocabulary Our subword vocabulary computed from thepre-training corpus

Pre-trained Models Pre-trained models on the pre-training corpus after 1 and 2 epochs for examplesof length 512 and the BERT Large architecture

Task Datasets Datasets containing training validation andtesting examples for each of the 6 tasks For the clas-sification tasks we provide original source code andclassification labels For the localization and repairtask we provide subtokenized code and masks speci-fying the targets

Fine-tuned Models Fine-tuned models for the 6 tasksFine-tuning was done on the 1-epoch pre-trained modelFor each classification task we provide the checkpointwith highest validation accuracy for the localizationand repair task we provide the checkpoint with highestlocalization and repair accuracy These are the check-points we used to evaluate on our test datasets and tocompute the numbers in the main paper

Code-encoding Library We provide code for tokenizingPython code and for producing inputs to CuBERTrsquospre-training and fine-tuning models

Localization-and-repair Fine-tuning Model We providea library for constructing the localization-and-repairmodel on top of CuBERTrsquos encoder layers For theclassification tasks the model is identical to that ofBERTrsquos classification fine-tuning model

Please see the README for details file encoding andschema and terms of use

B Data Preparation for Fine-Tuning TasksB1 Label Frequencies

All four of our binary-classification fine-tuning tasks hadan equal number of buggy and bug-free examples The


Exception task which is a multi-class classification taskhad a different number of examples per class (ie exceptiontypes) For the Exception task we show the breakdown ofexample counts per label for our fine-tuning dataset splits inTable 6

B2 Fine-Tuning Task Datasets

In this section we describe in detail how we produced ourfine-tuning datasets (Section 34 of the main paper)

A common primitive in all our data generation is splittinga Python module into functions We do this by parsingthe Python file and identifying function definitions in theAbstract Syntax Tree that have no other function definitionbetween themselves and the root of the tree The resultingfunctions include functions defined at module scope butalso methods of classes and subclasses Not included arefunctions defined within other function and method bodiesor methods of classes that are themselves defined withinother function or method bodies

We do not filter functions by length although task-specificdata generation may filter out some functions (see below)When generating examples for a fixed-length pre-training orfine-tuning model we prune all examples to the maximumtarget sequence length (in this paper we consider 128 256512 and 1024 subtokenized sequence lengths) Note thatif a synthetically generated buggybug-free example pairdiffers only at a location beyond the target length (say onthe 2000-th subtoken) we still retain both examples Forinstance for the Variable-Misuse Localization and Repairtask we retain both buggy and bug-free examples even ifthe error andor repair locations lie beyond the end of themaximum target length During evaluation if the error orrepair locations fall beyond the length limit of the examplewe count the example as a model failure

B21 REPRODUCIBLE DATA GENERATION

We make pseudorandom choices at various stages in fine-tuning data generation It was important to design a pseu-dorandomness mechanism that gave (a) reproducible datageneration (b) non-deterministic choices drawn from theuniform distribution and (c) order independence Orderindependence is important because our data generation isdone in a distributed fashion (using Apache Beam) so dif-ferent pseudorandom number generator state machines areused by each distributed worker

Pseudorandomness is computed based on an experiment-wide seed but is independent of the order in which exam-ples are generated Specifically to make a pseudorandomchoice about a function we hash (using MD5) the seed andthe function data (its source code and metadata about itsprovenance) and use the resulting hash as a uniform pseudo-


Exception Type Test Validation Train100 66 33

ASSERTION_ERROR 155 29 323 189 86ATTRIBUTE_ERROR 1372 274 2444 1599 834DOES_NOT_EXIST 7 2 3 3 2HTTP_ERROR 55 9 104 78 38IMPORT_ERROR 690 170 1180 750 363INDEX_ERROR 586 139 1035 684 346IO_ERROR 721 136 1318 881 427KEY_ERROR 1926 362 3384 2272 1112KEYBOARD_INTERRUPT 232 58 509 336 166NAME_ERROR 78 19 166 117 60NOT_IMPLEMENTED_ERROR 119 24 206 127 72OBJECT_DOES_NOT_EXIST 95 16 197 142 71OS_ERROR 779 131 1396 901 459RUNTIME_ERROR 107 34 247 159 80STOP_ITERATION 270 61 432 284 131SYSTEM_EXIT 105 16 200 120 52TYPE_ERROR 809 156 1564 1038 531UNICODE_DECODE_ERROR 134 21 196 135 63VALIDATION_ERROR 92 16 159 96 39VALUE_ERROR 2016 415 3417 2232 1117

Table 6 Example counts per class for the Exception Type task broken down into the dataset splits We show separately the 100 traindataset as well as its 33 and 66 subsamples used in the ablations

random value from the function for whatever needs the datagenerator has (eg in choosing one of multiple choices) Inthat way the same function will always result in the samechoices given a seed regardless of the order in which eachfunction is processed thereby ensuring reproducible datasetgeneration

To select among multiple choices we hash the functionrsquospseudorandom value along with all choices (sorted in acanonical order) and use the digest to compute an indexwithin the list of choices Note that given two choicesover different candidates but for the same function inde-pendent decisions will be drawn We also use such order-independent pseudorandomness when subsampling datasets(eg to generate the validation datasets) In those cases wehash a sample with the seed as above and turn the resultingdigest into a pseudorandom number in [0 1] which can beused to decide given a target sampling rate

B22 VARIABLE-MISUSE CLASSIFICATION

A variable use is any mention of a variable in a load scopeThis includes a variable that appears in the right-hand side ofan assignment or a field dereference We regard as definedall variables mentioned either in the formal arguments of afunction definition or on the left-hand side of an assignmentWe do not include in our defined variables those declared inmodule scope (ie globals)

To decide whether to generate examples from a function weparse it and collect all variable-use locations and all definedvariables as described above We discard the function if ithas no variable uses or if it defines fewer than two variablesthis is necessary since if there is only one variable definedthe model has no choice to make but the default one We alsodiscard the function if it has more than 50 defined variablessuch functions are few and tend to be auto-generated Forany function that we do not discard ie an eligible functionwe generate a buggy and a bug-free example as describednext

To generate a buggy example from an eligible function wechoose one variable use pseudorandomly (see above howmultiple-choice decisions are done) and replace its currentoccupant with a different pseudorandomly-chosen variabledefined in the function (with a separate multiple-choicedecision)

Note that in the work by Vasic et al (2019) a buggy andbug-free example pair was generated for every variable usein an eligible function In the work by Hellendoorn et al(2020) a buggy and bug-free example pair was generated forup to three variable uses in an eligible function ie somefunctions with one use would result in one example pairwhereas functions with many variable uses would result inthree example pairs In contrast our work produces exactlyone example pair for every eligible function Eligibility wasdefined identically in all three projects


Commutative Non-CommutativeArithmetic + -

Comparison == = is is not lt lt= gt gt=Membership in not in

Boolean and or

Table 7 Binary operators

B23 WRONG BINARY OPERATOR

This task considers both commutative and non-commutativebinary operators (unlike the Swapped-Argument Classifica-tion task) See Table 7 for the full list and note that we haveexcluded relatively infrequent operators eg the Pythoninteger division operator

If a function has no binary operators it is discarded Other-wise it is used to generate a bug-free example and a singlebuggy example as follows one of the operators is chosenpseudorandomly (as described above) and a different oper-ator chosen to replace it from the same row of Table 7 Sofor instance a buggy example would only swap == withis but not with not in which would not type-check ifwe performed static type inference on Python

We take appropriate care to ensure the code parses after abug is introduced For instance if we swap the operator inthe expression 1==2 with is we ensure that there is spacebetween the tokens (ie 1 is 2 rather than the incorrect1is2) even though the space was not needed before

B24 SWAPPED OPERAND

Since this task targets swapping the arguments of binaryoperators we only consider non-commutative operatorsfrom Table 7

Functions without eligible operators are discarded and thechoice of the operator to mutate in a function as well asthe choice of buggy operator to use are done as above butlimiting choices only to non-commutative operators

To avoid complications due to format changes we onlyconsider expressions that fit in a single line (in contrast tothe Wrong Binary Operator Classification task) We also donot consider expressions that look the same after swapping(eg a - a)

B25 FUNCTION-DOCSTRING MISMATCH

In Python a function docstring is a string literal that di-rectly follows the function signature and precedes the mainfunction body Whereas in other common programminglanguages the function documentation is a comment inPython it is an actual semantically meaningful string literal

We discard functions that have no docstring from this

dataset or functions that have an empty docstring Wesplit the rest into the function definition without the doc-string and the docstring summary (ie the first line of textfrom its docstring) discarding the rest of the docstring

We create bug-free examples by pairing a function with itsown docstring summary

To create buggy examples we pair every function with an-other functionrsquos docstring summary according to a globalpseudorandom permutation of all functions for all i wecombine the i-th function (without its docstring) with thePi-th functionrsquos docstring summary where P is a pseudoran-dom permutation under a given seed We discard pairingsin which i == P [i] but for the seeds we chose no suchpathological permuted pairings occurred

B26 EXCEPTION TYPE

Note that unlike all other tasks this task has no notion ofbuggy or bug-free examples

We discard functions that do not have any except clausesin them

For the rest we collect all locations holding exception typeswithin except clauses and choose one of those locationsto query the model for classification Note that a singleexcept clause may hold a comma-separated list of ex-ception types and the same type may appear in multiplelocations within a function Once a location is chosen wereplace it with a special HOLE token and create a clas-sification example that pairs the function (with the maskedexception location) with the true label (the removed excep-tion type)

The count of examples per exception type can be found inTable 6

B27 VARIABLE MISUSE LOCALIZATION AND REPAIR

The dataset for this task is identical to that for the Variable-Misuse Classification task (Section B22) However unlikethe classification task examples contain more features rele-vant to localization and repair Specifically in addition tothe token sequence describing the program we also extracta number of boolean input masks

bull A candidates mask which marks as True all tokensholding a variable which can therefore be either thelocation of a bug or the location of a repair The firstposition is always a candidate since it may be used toindicate a bug-free program

bull A targets mask which marks as True all tokens holdingthe correct variable for buggy examples Note that thecorrect variable may appear in multiple locations in afunction therefore this mask may have multiple True


positions Bug-free examples have an all-False targetsmask

bull An error-location mask which marks as True the loca-tion where the bug occurs (for buggy examples) or thefirst location (for bug-free examples)

All the masks mark as True some of the locations that holdvariables Because many variables are subtokenized intomultiple tokens if a variable is to be marked as True in thecorresponding mask we only mark as True its first subtokenkeeping trailing subtokens as False

C Attention VisualizationsIn this section we provide sample code snippets used to testthe different classification tasks Further Figures 1ndash5 showvisualizations of the attention matrix of the last layer of thefine-tuned CuBERT model (Coenen et al 2019) for the codesnippets In the visualization the Y-axis shows the querytokens and X-axis shows the tokens being attended to Theattention weight between a pair of tokens is the maximum ofthe weights assigned by the multi-head attention mechanismThe color changes from dark to light as weight changes from0 to 1


def on_resize(self event)eventapply_zoom()

Figure 1 Variable Misuse Example In the code snippet lsquoeventapply zoomrsquo should actually be lsquoselfapply zoomrsquo TheCuBERT variable-misuse model correctly predicts that the code has an error As seen from the attention map the query tokens areattending to the second occurrence of the lsquoeventrsquo token in the snippet which corresponds to the incorrect variable usage


def__gt__(selfother)if isinstance(otherint)and other==0return selfget_value()gt0

return other is not self

Figure 2 Wrong Operator Example In this code snippet lsquoother is not selfrsquo should actually be lsquoother lt selfrsquo TheCuBERT wrong-binary-operator model correctly predicts that the code snippet has an error As seen from the attention map the querytokens are all attending to the incorrect operator lsquoisrsquo


def__contains__(clsmodel)return cls_registry in model

Figure 3 Swapped Operand Example In this code snippet the return statement should be lsquomodel in cls registryrsquo Theswapped-operand model correctly predicts that the code snippet has an error The query tokens are paying substantial attention to lsquoinrsquoand the second occurrence of lsquomodelrsquo in the snippet


Docstring rsquoGet form initial datarsquoFunctiondef__add__(selfcov)return SumOfKernel(selfcov)

Figure 4 Function Docstring Example The CuBERT function-docstring model correctly predicts that the docstring is wrong for this codesnippet Note that most of the query tokens are attending to the tokens in the docstring


trysubprocesscall(hook_value)return jsonify(success=True) 200

except __HOLE__ as ereturn jsonify(success=Falseerror=str(e)) 400

Figure 5 Exception Classification Example For this code snippet the CuBERT exception-classification model correctly predictslsquo HOLE rsquo as lsquoOSErrorrsquo The modelrsquos attention matrix also shows that lsquo HOLE rsquo is attending to lsquosubprocessrsquo which is indicativeof an OS-related error


a number of natural-language understanding benchmarksIn this work we derive a contextual embedding of sourcecode by training a BERT model on source code We call ourmodel CuBERT short for Code Understanding BERT

In order to achieve this we curate a massive corpus ofPython programs collected from GitHub GitHub projectsare known to contain a large amount of duplicate code Toavoid biasing the model to such duplicated code we performdeduplication using the method of Allamanis (2018) Theresulting corpus has 74 million files with a total of 93billion tokens (16 million unique) For comparison wealso train Word2Vec embeddings (Mikolov et al 2013ab)namely continuous bag-of-words (CBOW) and Skipgramembeddings on the same corpus

For evaluating CuBERT we create a benchmark of five clas-sification tasks and a sixth localization and repair task Theclassification tasks range from classification of source codeaccording to presence or absence of certain classes of bugsto mismatch between a functionrsquos natural language descrip-tion and its body to predicting the right kind of exception tocatch for a given code fragment The localization and repairtask defined for variable-misuse bugs (Vasic et al 2019)is a pointer-prediction task Although similar tasks haveappeared in prior work the associated datasets come fromdifferent languages and varied sources instead we create acohesive multiple-task benchmark dataset in this work Toproduce a high-quality dataset we ensure that there is nooverlap between pre-training and fine-tuning examples andthat all of the tasks are defined on Python code

We fine-tune CuBERT on each of the classification tasksand compare the results to multi-layered bidirectionalLSTM (Hochreiter amp Schmidhuber 1997) models as wellas Transformers (Vaswani et al 2017) We train the LSTMmodels from scratch and also using pre-trainined Word2Vecembeddings Our results show that CuBERT consistentlyoutperforms these baseline models by 32 to 147across the classification tasks We perform a number ofadditional studies by varying the sampling strategies usedfor training Word2Vec models and by varying programlengths In addition we also show that CuBERT can befine-tuned effectively using only 33 of the task-specificlabeled data and with only 2 epochs and that even thenit attains results competitive to the baseline models trainedwith the full datasets and many more epochs CuBERTwhen fine-tuned on the variable-misuse localization andrepair task produces high classification localization andlocalization+repair accuracies and outperforms publishedstate-of-the-art models (Hellendoorn et al 2020 Vasic et al2019) Our contributions are as follows

bull We present the first attempt at pre-training a BERTcontextual embedding of source code

bull We show the efficacy of the pre-trained contextual em-bedding on five classification tasks Our fine-tunedmodels outperform baseline LSTM models (withwith-out Word2Vec embeddings) as well as Transformerstrained from scratch even with reduced training data

bull We evaluate CuBERT on a pointer prediction task andshow that it outperforms state-of-the-art results signifi-cantly

bull We make the models and datasets publicly available1

We hope that future work benefits from our contribu-tions by reusing our benchmark tasks and by compar-ing against our strong baseline models

2 Related WorkGiven the abundance of natural-language text and the rel-ative difficulty of obtaining labeled data much effort hasbeen devoted to using large corpora to learn about languagein an unsupervised fashion before trying to focus on taskswith small labeled training datasets Word2Vec (Mikolovet al 2013ab) computed word embeddings based on wordco-occurrence and proximity but the same embedding isused regardless of the context The continued advances inword (Pennington et al 2014) and subword (Bojanowskiet al 2017) embeddings led to publicly released pre-trainedembeddings used in a variety of tasks

To deal with varying word context contextual word embed-dings were developed (McCann et al 2017 Peters et al2018 Radford et al 2018 2019) in which an embeddingis learned for the context of a word in a particular sentencenamely the sequence of words preceding it and possiblyfollowing it BERT (Devlin et al 2019) improved natural-language pre-training by using a de-noising autoencoderInstead of learning a language model which is inherentlysequential BERT optimizes for predicting a noised wordwithin a sentence Such prediction instances are gener-ated by choosing a word position and either keeping it un-changed removing the word or replacing the word with arandom wrong word It also pre-trains with the objectiveof predicting whether two sentences can be next to eachother These pre-training objectives along with the use ofa Transformer-based architecture gave BERT an accuracyboost in a number of NLP tasks over the state-of-the-artBERT has been improved upon in various ways includingmodifying training objectives utilizing ensembles combin-ing attention with autoregression (Yang et al 2019) andexpanding pre-training corpora and time (Liu et al 2019)However the main architecture of BERT seems to hold upas the state-of-the-art as of this writing

1httpsgithubcomgoogle-researchgoogle-researchtreemastercubert







































36 Baselines

361 WORD2VEC
























BiLSTM

From scratch 7629 8365 8807 7601 5279

CBOW ns 8033 8682 8980 8908 6701

(100 epochs) hs 7800 8585 9014 8769 6031

Skipgram ns 7706 8514 8931 8381 6007hs 8053 8634 8975 8880 6506

















Best of Epochs


2100 9404 8990 9220 9721 610466 9311 8876 9161 9704 194933 9140 8642 9052 9638 2009

10100 9514 9215 9362 9808 779766 9478 9151 9337 9793 752433 9428 9066 9258 9736 6734

20100 9521 9246 9336 9809 791266 9490 9179 9339 9799 773133 9445 9109 9282 9763 7498



128 8397 7929 7802 9819 6203 7432256 9202 8819 8803 9814 7280 7847512 9521 9246 9336 9809 7912 8033

1024 9583 9338 9562 9790 8127 8192














LSTM C 100 epochs 8241 7930 6439 5689








sive code corpus
























































































Boolean and or






B24 SWAPPED OPERAND










B26 EXCEPTION TYPE





































































36 Baselines

361 WORD2VEC
























BiLSTM

From scratch 7629 8365 8807 7601 5279

CBOW ns 8033 8682 8980 8908 6701

(100 epochs) hs 7800 8585 9014 8769 6031

Skipgram ns 7706 8514 8931 8381 6007hs 8053 8634 8975 8880 6506

















Best of Epochs


2100 9404 8990 9220 9721 610466 9311 8876 9161 9704 194933 9140 8642 9052 9638 2009

10100 9514 9215 9362 9808 779766 9478 9151 9337 9793 752433 9428 9066 9258 9736 6734

20100 9521 9246 9336 9809 791266 9490 9179 9339 9799 773133 9445 9109 9282 9763 7498



128 8397 7929 7802 9819 6203 7432256 9202 8819 8803 9814 7280 7847512 9521 9246 9336 9809 7912 8033

1024 9583 9338 9562 9790 8127 8192














LSTM C 100 epochs 8241 7930 6439 5689








sive code corpus
























































































Boolean and or






B24 SWAPPED OPERAND










B26 EXCEPTION TYPE





















































36 Baselines

361 WORD2VEC
























BiLSTM

From scratch 7629 8365 8807 7601 5279

CBOW ns 8033 8682 8980 8908 6701

(100 epochs) hs 7800 8585 9014 8769 6031

Skipgram ns 7706 8514 8931 8381 6007hs 8053 8634 8975 8880 6506

















Best of Epochs


2100 9404 8990 9220 9721 610466 9311 8876 9161 9704 194933 9140 8642 9052 9638 2009

10100 9514 9215 9362 9808 779766 9478 9151 9337 9793 752433 9428 9066 9258 9736 6734

20100 9521 9246 9336 9809 791266 9490 9179 9339 9799 773133 9445 9109 9282 9763 7498



128 8397 7929 7802 9819 6203 7432256 9202 8819 8803 9814 7280 7847512 9521 9246 9336 9809 7912 8033

1024 9583 9338 9562 9790 8127 8192














LSTM C 100 epochs 8241 7930 6439 5689








sive code corpus
























































































Boolean and or






B24 SWAPPED OPERAND










B26 EXCEPTION TYPE










































36 Baselines

361 WORD2VEC
























BiLSTM

From scratch 7629 8365 8807 7601 5279

CBOW ns 8033 8682 8980 8908 6701

(100 epochs) hs 7800 8585 9014 8769 6031

Skipgram ns 7706 8514 8931 8381 6007hs 8053 8634 8975 8880 6506

















Best of Epochs


2100 9404 8990 9220 9721 610466 9311 8876 9161 9704 194933 9140 8642 9052 9638 2009

10100 9514 9215 9362 9808 779766 9478 9151 9337 9793 752433 9428 9066 9258 9736 6734

20100 9521 9246 9336 9809 791266 9490 9179 9339 9799 773133 9445 9109 9282 9763 7498



128 8397 7929 7802 9819 6203 7432256 9202 8819 8803 9814 7280 7847512 9521 9246 9336 9809 7912 8033

1024 9583 9338 9562 9790 8127 8192














LSTM C 100 epochs 8241 7930 6439 5689








sive code corpus
























































































Boolean and or






B24 SWAPPED OPERAND










B26 EXCEPTION TYPE

















































BiLSTM

From scratch 7629 8365 8807 7601 5279

CBOW ns 8033 8682 8980 8908 6701

(100 epochs) hs 7800 8585 9014 8769 6031

Skipgram ns 7706 8514 8931 8381 6007hs 8053 8634 8975 8880 6506

















Best of Epochs


2100 9404 8990 9220 9721 610466 9311 8876 9161 9704 194933 9140 8642 9052 9638 2009

10100 9514 9215 9362 9808 779766 9478 9151 9337 9793 752433 9428 9066 9258 9736 6734

20100 9521 9246 9336 9809 791266 9490 9179 9339 9799 773133 9445 9109 9282 9763 7498



128 8397 7929 7802 9819 6203 7432256 9202 8819 8803 9814 7280 7847512 9521 9246 9336 9809 7912 8033

1024 9583 9338 9562 9790 8127 8192














LSTM C 100 epochs 8241 7930 6439 5689








sive code corpus
























































































Boolean and or






B24 SWAPPED OPERAND










B26 EXCEPTION TYPE

































BiLSTM

From scratch 7629 8365 8807 7601 5279

CBOW ns 8033 8682 8980 8908 6701

(100 epochs) hs 7800 8585 9014 8769 6031

Skipgram ns 7706 8514 8931 8381 6007hs 8053 8634 8975 8880 6506

















Best of Epochs


2100 9404 8990 9220 9721 610466 9311 8876 9161 9704 194933 9140 8642 9052 9638 2009

10100 9514 9215 9362 9808 779766 9478 9151 9337 9793 752433 9428 9066 9258 9736 6734

20100 9521 9246 9336 9809 791266 9490 9179 9339 9799 773133 9445 9109 9282 9763 7498



128 8397 7929 7802 9819 6203 7432256 9202 8819 8803 9814 7280 7847512 9521 9246 9336 9809 7912 8033

1024 9583 9338 9562 9790 8127 8192














LSTM C 100 epochs 8241 7930 6439 5689








sive code corpus
























































































Boolean and or






B24 SWAPPED OPERAND










B26 EXCEPTION TYPE
































Best of Epochs


2100 9404 8990 9220 9721 610466 9311 8876 9161 9704 194933 9140 8642 9052 9638 2009

10100 9514 9215 9362 9808 779766 9478 9151 9337 9793 752433 9428 9066 9258 9736 6734

20100 9521 9246 9336 9809 791266 9490 9179 9339 9799 773133 9445 9109 9282 9763 7498



128 8397 7929 7802 9819 6203 7432256 9202 8819 8803 9814 7280 7847512 9521 9246 9336 9809 7912 8033

1024 9583 9338 9562 9790 8127 8192














LSTM C 100 epochs 8241 7930 6439 5689








sive code corpus
























































































Boolean and or






B24 SWAPPED OPERAND










B26 EXCEPTION TYPE

































LSTM C 100 epochs 8241 7930 6439 5689








sive code corpus
























































































Boolean and or






B24 SWAPPED OPERAND










B26 EXCEPTION TYPE



















































































































Boolean and or






B24 SWAPPED OPERAND










B26 EXCEPTION TYPE





























































































Boolean and or






B24 SWAPPED OPERAND










B26 EXCEPTION TYPE










































































Boolean and or






B24 SWAPPED OPERAND










B26 EXCEPTION TYPE


































































Boolean and or






B24 SWAPPED OPERAND










B26 EXCEPTION TYPE













































Boolean and or






B24 SWAPPED OPERAND










B26 EXCEPTION TYPE


































Boolean and or






B24 SWAPPED OPERAND










B26 EXCEPTION TYPE









































































































Documents

Learning and Evaluating Contextual Embedding of Source Code