
Visualizing and Understanding the Effectiveness of BERT

Yaru Hao†∗, Li Dong‡, Furu Wei‡, Ke Xu†
†Beihang University   ‡Microsoft Research

{haoyaru@,kexu@nlsde.}buaa.edu.cn   {lidong1,fuwei}@microsoft.com

Abstract

Language model pre-training, such as BERT, has achieved remarkable results in many NLP tasks. However, it is unclear why the pre-training-then-fine-tuning paradigm can improve performance and generalization capability across different tasks. In this paper, we propose to visualize loss landscapes and optimization trajectories of fine-tuning BERT on specific datasets. First, we find that pre-training reaches a good initial point across downstream tasks, which leads to wider optima and easier optimization compared with training from scratch. We also demonstrate that the fine-tuning procedure is robust to overfitting, even though BERT is highly over-parameterized for downstream tasks. Second, the visualization results indicate that fine-tuning BERT tends to generalize better because of the flat and wide optima, and the consistency between the training loss surface and the generalization error surface. Third, the lower layers of BERT are more invariant during fine-tuning, which suggests that the layers that are close to input learn more transferable representations of language.

1 Introduction

Language model pre-training has achieved strong performance in many NLP tasks (Peters et al., 2018; Howard and Ruder, 2018a; Radford et al., 2018; Devlin et al., 2018; Baevski et al., 2019; Dong et al., 2019). A neural encoder is trained on a large text corpus by using language modeling objectives. Then the pre-trained model either is used to extract vector representations for input, or is fine-tuned on the specific datasets.

Recent work (Tenney et al., 2019b; Liu et al., 2019a; Goldberg, 2019; Tenney et al., 2019a) has shown that the pre-trained models can encode syntactic and semantic information of language. However, it is unclear why pre-training is effective on downstream tasks in terms of both trainability and generalization capability. In this work, we take BERT (Devlin et al., 2018) as an example to understand the effectiveness of pre-training. We visualize the loss landscapes and the optimization procedure of fine-tuning on specific datasets in three ways. First, we compute the one-dimensional (1D) loss curve, so that we can inspect the difference between fine-tuning BERT and training from scratch. Second, we visualize the two-dimensional (2D) loss surface, which provides more information about loss landscapes than 1D curves. Third, we project the high-dimensional optimization trajectory of fine-tuning onto the obtained 2D loss surface, which demonstrates the learning properties in an intuitive way.

∗ Contribution during internship at Microsoft Research.

The main findings are summarized as follows. First, visualization results indicate that BERT pre-training reaches a good initial point across downstream tasks, which leads to wider optima on the 2D loss landscape compared with random initialization. Moreover, the visualization of optimization trajectories shows that pre-training results in easier optimization and faster convergence. We also demonstrate that the fine-tuning procedure is robust to overfitting. Second, the loss landscapes of fine-tuning partially explain the good generalization capability of BERT. Specifically, pre-training obtains flatter and wider optima, which indicates that the pre-trained model tends to generalize better on unseen data (Chaudhari et al., 2017; Li et al., 2018; Izmailov et al., 2018). Additionally, we find that the training loss surface correlates well with the generalization error. Third, we demonstrate that the lower (i.e., close to input) layers of BERT are more invariant across tasks than the higher layers, which suggests that the lower layers learn transferable representations of language. We verify this point by visualizing the loss landscape with respect to different groups of layers.


2 Background: BERT

We use BERT (Bidirectional Encoder Representations from Transformers; Devlin et al. 2018) as an example of pre-trained language models in our experiments. BERT is pre-trained on a large corpus by using the masked language modeling and next-sentence prediction objectives. Then we can add task-specific layers to the BERT model, and fine-tune all the parameters according to the downstream tasks.

BERT employs a Transformer (Vaswani et al., 2017) network to encode contextual information, which contains multi-layer self-attention blocks. Given the embeddings $\{x_i\}_{i=1}^{|x|}$ of the input text, we concatenate them into $H^0 = [x_1, \cdots, x_{|x|}]$. Then an $L$-layer Transformer encodes the input: $H^l = \mathrm{Transformer\_block}_l(H^{l-1})$, where $l = 1, \cdots, L$ and $H^L = [h^L_1, \cdots, h^L_{|x|}]$. We use the hidden vector $h^L_i$ as the contextualized representation of the input token $x_i$. For more implementation details, we refer readers to Vaswani et al. (2017).
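The contextualized representations described above can be extracted in a few lines. Below is a minimal sketch that assumes the HuggingFace transformers library and the bert-large-cased checkpoint; the paper itself does not prescribe any particular implementation.

```python
# A minimal sketch: obtain the contextualized token representations h^L_i
# described above with the HuggingFace `transformers` library (an assumed
# toolkit; the paper does not specify an implementation).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-cased")
model = BertModel.from_pretrained("bert-large-cased")
model.eval()

inputs = tokenizer("Pre-training provides a good initial point.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# H^L has one 1024-dimensional vector per input token: shape [1, |x|, 1024].
h_last = outputs.last_hidden_state
print(h_last.shape)
```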

3 Methodology

We employ three visualization methods to understand why fine-tuning the pre-trained BERT model can achieve better performance on downstream tasks compared with training from scratch. We plot both one-dimensional and two-dimensional loss landscapes of BERT on the specific datasets. Besides, we project the optimization trajectories of the fine-tuning procedure onto the loss surface. The visualization algorithms can also be used for models that are trained from random initialization, so that we can compare the difference between the two learning paradigms.

3.1 One-dimensional Loss Curve

Let θ0 denote the initialized parameters. For fine-tuning BERT, θ0 represents the pre-trained parameters. For training from scratch, θ0 represents the randomly initialized parameters. After fine-tuning, the model parameters are updated to θ1. The one-dimensional (1D) loss curve aims to quantify the loss values along the optimization direction (i.e., from θ0 to θ1).

The loss curve is plotted by linear interpolation between θ0 and θ1 (Goodfellow and Vinyals, 2015). The curve function f(α) is defined as:

$$f(\alpha) = \mathcal{J}(\theta_0 + \alpha\delta_1) \qquad (1)$$

where α is a scalar parameter, δ1 = θ1 − θ0 is the optimization direction, and J(θ) is the loss function under the model parameters θ. In our experiments, we set the range of α to [−4, 4] and sample 40 points for each axis. Note that we only consider the parameters of BERT in θ0 and θ1, so δ1 only indicates the updates of the original BERT parameters. The effect of the added task-specific layers is eliminated by keeping them fixed to the learned values.
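A minimal sketch of Equation (1) follows. Here theta_0 and theta_1 are state dictionaries holding only the BERT encoder parameters, and compute_loss is a hypothetical helper that evaluates the average training loss with the task-specific layers kept fixed; both names are assumptions for illustration.

```python
# Sketch of Equation (1): evaluate the training loss along the line segment
# from theta_0 (initialization) to theta_1 (fine-tuned parameters).
# `compute_loss(model, data)` is a hypothetical helper that returns the
# average task loss; task-specific layers are excluded from theta_* and
# therefore stay fixed at their learned values.
import numpy as np
import torch

def loss_curve_1d(model, theta_0, theta_1, data, compute_loss,
                  num_points=40, radius=4.0):
    alphas = np.linspace(-radius, radius, num_points)
    losses = []
    for alpha in alphas:
        # theta(alpha) = theta_0 + alpha * (theta_1 - theta_0)
        interpolated = {
            name: theta_0[name] + float(alpha) * (theta_1[name] - theta_0[name])
            for name in theta_0
        }
        model.load_state_dict(interpolated, strict=False)
        with torch.no_grad():
            losses.append(compute_loss(model, data))
    return alphas, losses
```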

3.2 Two-dimensional Loss Surface

The one-dimensional loss curve can be extended to the two-dimensional (2D) loss surface (Li et al., 2018). Similar to Equation (1), we need to define two directions (δ1 and δ2) as axes to plot the loss surface:

$$f(\alpha, \beta) = \mathcal{J}(\theta_0 + \alpha\delta_1 + \beta\delta_2) \qquad (2)$$

where α and β are scalar values, J(·) is the loss function, and θ0 represents the initialized parameters. Similar to Section 3.1, we are only interested in the parameter space of the BERT encoder, without taking the task-specific layers into consideration. One of the axes is the optimization direction δ1 = θ1 − θ0 on the target dataset, which is defined in the same way as in Equation (1). We compute the other axis direction via δ2 = θ2 − θ0, where θ2 represents the parameters fine-tuned on another dataset. So the other axis is the optimization direction of fine-tuning on another dataset. Even though the other dataset is randomly chosen, experimental results confirm that the optimization directions δ1 and δ2 are divergent and orthogonal to each other because of the high-dimensional parameter space.

The direction vectors δ1 and δ2 are projected onto a two-dimensional plane. It is beneficial to ensure the scale equivalence of the two axes for visualization purposes. Similar to the filter normalization approach introduced in (Li et al., 2018), we address this issue by normalizing the two direction vectors to the same norm. We re-scale δ2 to (‖δ1‖/‖δ2‖) δ2, where ‖·‖ denotes the Euclidean norm. We set the range of both α and β to [−4, 4] and sample 40 points for each axis.
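The grid evaluation and the norm rescaling of δ2 can be sketched as follows, assuming the parameter vectors have been flattened into NumPy arrays and that loss_at is a hypothetical helper that loads a flattened parameter vector into the model and returns the training loss.

```python
# Sketch of Equation (2): evaluate the training loss on a 40x40 grid spanned
# by the two optimization directions. Parameters are flattened NumPy arrays;
# `loss_at(theta)` is a hypothetical helper that loads the flattened vector
# into the model and returns the training loss.
import numpy as np

def loss_surface_2d(theta_0, delta_1, delta_2, loss_at, num_points=40, radius=4.0):
    # Rescale delta_2 so that both axis directions share the same Euclidean norm.
    delta_2 = delta_2 * (np.linalg.norm(delta_1) / np.linalg.norm(delta_2))
    alphas = np.linspace(-radius, radius, num_points)
    betas = np.linspace(-radius, radius, num_points)
    surface = np.zeros((num_points, num_points))
    for i, alpha in enumerate(alphas):
        for j, beta in enumerate(betas):
            surface[i, j] = loss_at(theta_0 + alpha * delta_1 + beta * delta_2)
    return alphas, betas, surface
```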

3.3 Optimization Trajectory

Our goal is to project the optimization trajectory of the fine-tuning procedure onto the 2D loss surface obtained in Section 3.2. Let $\{(d_i^{\alpha}, d_i^{\beta})\}_{i=1}^{T}$ denote the projected optimization trajectory, where $(d_i^{\alpha}, d_i^{\beta})$ is a projected point on the loss surface, and $i = 1, \cdots, T$ represents the $i$-th epoch of fine-tuning.

As shown in Equation (2), we already know the optimization direction δ1 = θ1 − θ0 on the target dataset. We can compute the deviation degrees between the optimization direction and the trajectory to visualize the projection results. Let θi denote the BERT parameters at the i-th epoch, and δi = θi − θ0 denote the optimization direction at the i-th epoch. The point $(d_i^{\alpha}, d_i^{\beta})$ of the trajectory is computed via:

$$v_{\cos} = \frac{\delta_i \cdot \delta_1}{\|\delta_i\|\,\|\delta_1\|} \qquad (3)$$

$$d_i^{\alpha} = v_{\cos}\,\frac{\|\delta_i\|}{\|\delta_1\|} = \frac{\delta_i \cdot \delta_1}{\|\delta_1\|^2} \qquad (4)$$

$$d_i^{\beta} = \sqrt{\left(\frac{\|\delta_i\|}{\|\delta_1\|}\right)^2 - \left(d_i^{\alpha}\right)^2} \qquad (5)$$

where $\cdot$ denotes the inner product of two vectors, and ‖·‖ denotes the Euclidean norm. To be specific, we first compute the cosine similarity between δi and δ1, which indicates the angle between the current optimization direction and the final optimization direction. Then we obtain the projection values $d_i^{\alpha}$ and $d_i^{\beta}$ by computing the deviation degrees between the optimization direction δi and the axes.
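Equations (3)-(5) translate directly into a short projection routine. The sketch below assumes each epoch checkpoint has been reduced to a flattened update vector δi = θi − θ0.

```python
# Sketch of Equations (3)-(5): project each epoch's update delta_i onto the
# 2D plane whose first axis is the final optimization direction delta_1.
# All inputs are flattened NumPy arrays.
import numpy as np

def project_trajectory(deltas, delta_1):
    points = []
    for delta_i in deltas:
        cos = np.dot(delta_i, delta_1) / (np.linalg.norm(delta_i) * np.linalg.norm(delta_1))
        ratio = np.linalg.norm(delta_i) / np.linalg.norm(delta_1)
        d_alpha = cos * ratio                                   # Equation (4)
        d_beta = np.sqrt(max(ratio ** 2 - d_alpha ** 2, 0.0))   # Equation (5)
        points.append((d_alpha, d_beta))
    return points
```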

4 Experimental Setup

We conduct experiments on four datasets: Multi-genre Natural Language Inference Corpus (MNLI; Williams et al. 2018), Recognizing Textual Entailment (RTE; Dagan et al. 2006; Bar-Haim et al. 2006; Giampiccolo et al. 2007; Bentivogli et al. 2009), Stanford Sentiment Treebank (SST-2; Socher et al. 2013), and Microsoft Research Paraphrase Corpus (MRPC; Dolan and Brockett 2005). We use the same data split as in (Wang et al., 2019). The accuracy metric is used for evaluation.

We employ the pre-trained BERT-large model in our experiments. The cased version of the tokenizer is used. We follow the settings and the hyper-parameters suggested in (Devlin et al., 2018). The Adam (Kingma and Ba, 2015) optimizer is used for fine-tuning. The number of fine-tuning epochs is selected from {3, 4, 5}. For RTE and MRPC, we set the batch size to 32, and the learning rate to 1e-5. For MNLI and SST-2, the batch size is 64, and the learning rate is 3e-5.

For the setting of training from scratch, we use the same network architecture as BERT, and randomly initialize the model parameters. Most hyper-parameters are kept the same. The number of training epochs is larger than for fine-tuning BERT, because training from scratch requires more epochs to converge. The number of epochs is set to 8 for SST-2, and 16 for the other datasets, which is validated on the development set.
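For reference, the fine-tuning hyper-parameters listed above can be summarized in a single configuration; the dictionary below simply restates the values from this section and is not taken from any released code.

```python
# The hyper-parameters stated in this section, gathered for reference only
# (not taken from released code).
FINE_TUNING = {
    "optimizer": "Adam",
    "epochs": [3, 4, 5],  # number of fine-tuning epochs, selected per dataset
    "RTE":   {"batch_size": 32, "learning_rate": 1e-5},
    "MRPC":  {"batch_size": 32, "learning_rate": 1e-5},
    "MNLI":  {"batch_size": 64, "learning_rate": 3e-5},
    "SST-2": {"batch_size": 64, "learning_rate": 3e-5},
}

TRAINING_FROM_SCRATCH_EPOCHS = {"SST-2": 8, "MNLI": 16, "RTE": 16, "MRPC": 16}
```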

5 Pre-training Gets a Good Initial Point Across Downstream Tasks

Fine-tuning BERT on downstream datasets usually performs significantly better than training the same network with random initialization, especially when the data size is small. The results indicate that language model pre-training objectives learn a good initialization for downstream tasks. In this section, we inspect the benefits of using BERT as the initial point from three aspects.

5.1 Pre-training Leads to Wider Optima

As described in Section 3.2, we plot 2D training loss surfaces on four datasets in Figure 1. The figures in the top row are for models trained from scratch, while those in the bottom row are based on fine-tuning BERT. The start point indicates the training loss of the initialized model, while the end point indicates the final training loss. We observe that the optima obtained by fine-tuning BERT are much wider than those of training from scratch. A wide optimum of fine-tuning BERT implies that small perturbations of the model parameters cannot seriously hurt the final performance, while a thin optimum is more sensitive to these subtle changes. Moreover, in Section 6.1, we further discuss how the width of the optima contributes to the generalization capability.

As shown in Figure 1, the fine-tuning path from the start point to the end point on the loss landscape is smoother than that of training from scratch. In other words, the training loss of fine-tuning BERT tends to decrease monotonically along the optimization direction, which eases optimization and accelerates training convergence. In contrast, the path from the random initial point to the end point is rougher, which requires a more carefully tweaked optimizer to obtain reasonable performance.


Figure 1: Training loss surfaces of training from scratch (top) and fine-tuning BERT (bottom) on four datasets (MNLI, RTE, SST-2, MRPC). We annotate the start point (i.e., the initialized model) and the end point (i.e., the estimated model) in the loss surfaces. Pre-training leads to wider optima, and eases optimization compared with random initialization.


Figure 2: Training loss of fine-tuning BERT and training from scratch on four datasets.

5.2 Pre-training Eases Optimization on Downstream Tasks

We fine-tune BERT and train the same network from scratch on four datasets. The learning curves are shown in Figure 2. We find that training from scratch requires more iterations to converge on the datasets, while pre-training-then-fine-tuning converges faster in terms of training loss. We also notice that the final loss of training from scratch tends to be higher than that of fine-tuning BERT, even if it undergoes more epochs. On the RTE dataset, training the model from scratch has a hard time decreasing the loss in the first few epochs.

In order to visualize the dynamic convergence process, we plot the optimization trajectories using the method described in Section 3.3. As shown in Figure 3, for training from scratch, the optimization directions of the first few epochs are divergent from the final optimization direction. Moreover, the loss landscape from the initial point to the end point is rougher than that of fine-tuning BERT; we can see that the trajectory of training from scratch on the MRPC dataset crosses an obstacle to reach the end point.

Compared with training from scratch, fine-tuning BERT finds the optimization direction in a more straightforward way, and the optimization process converges faster. Besides, the fine-tuning path is unimpeded along the optimization direction. In addition, because of the wider optima near the initial point, fine-tuning BERT tends to reach the expected optimal region even if it optimizes along the direction of the first epoch.

5.3 Pre-training-then-fine-tuning is Robust to Overfitting

The BERT-large model has 345M parameters, which is over-parameterized for the target datasets. However, experimental results show that fine-tuning BERT is robust to overfitting, i.e., the generalization error (namely, the classification error rate on the development set) does not dramatically increase with more training epochs, despite the huge number of model parameters.


Figure 3: The optimization trajectories on the training loss surfaces of training from scratch (top) and fine-tuning BERT (bottom) on the MRPC and MNLI datasets. Darker color represents smaller training loss.

We use the MRPC dataset as a case study, because its data size is relatively small, which makes it prone to overfitting if we train the model from scratch. As shown in Figure 4, we plot the optimization trajectory of fine-tuning on the generalization error surface. We first fine-tune the BERT model for five epochs as suggested in (Devlin et al., 2018). Then we continue fine-tuning for another twenty epochs, which still obtains comparable performance with the first five epochs. Figure 4 shows that even though we fine-tune the BERT model for twenty more epochs, the final estimation is not far away from its original optimum. Moreover, the optimum area is wide enough to prevent the model from jumping out of the region with good generalization capability, which explains why the pre-training-then-fine-tuning paradigm is robust to overfitting.

6 Pre-training Helps to Generalize Better

Although training from scratch can achieve comparable training losses to fine-tuning BERT, the model with random initialization usually has poor performance on unseen data. In this section, we use visualization techniques to understand why the model obtained by pre-training-then-fine-tuning tends to have better generalization capability.


Figure 4: The optimization trajectory of fine-tuning for 5 + 20 epochs on MRPC. The two-dimensional generalization error surface is presented. We find that pre-training-then-fine-tuning is robust to overfitting.


Figure 5: One-dimensional training loss curves on four datasets (SST-2, MRPC, MNLI, RTE). Dashed lines represent training from scratch, and solid lines represent fine-tuning BERT. The scales of the axes are normalized for comparison. The optima of training from scratch are sharper than those of fine-tuning BERT.

6.1 Wide and Flat Optima Lead to Better Generalization

Previous work (Hochreiter and Schmidhuber, 1997; Keskar et al., 2016; Li et al., 2018) shows that the flatness of a local optimum correlates with the generalization capability, i.e., flatter optima lead to better generalization. This finding inspires us to inspect the loss landscapes of BERT fine-tuning, so that we can understand the generalization capability from the perspective of the flatness of optima.

Section 5.1 shows that the optima obtained by fine-tuning BERT are wider than those of training from scratch. As shown in Figure 5, we further plot one-dimensional training loss curves of both fine-tuning BERT and training from scratch, which represent the transverse section of the two-dimensional loss surface along the optimization direction. We normalize the scale of the axes for flatness comparison as suggested in (Li et al., 2018). Figure 5 shows that the optima of fine-tuning BERT are flatter, while training from scratch obtains sharper optima. The results indicate that pre-training-then-fine-tuning tends to generalize better on unseen data.

6.2 Consistency Between Training Loss Surface and Generalization Error Surface

To further understand the effect of the geometry of loss functions on the generalization capability, we compare the training loss surfaces with the generalization error surfaces on different datasets. The classification error rate on the development set is used as an indicator of the generalization capability.

As shown in Figure 6, we find that the end points of fine-tuning BERT fall into wide areas with smaller generalization error. The results show that the generalization error surfaces are consistent with the corresponding training loss surfaces on these datasets, i.e., smaller training loss tends to decrease the error on the development set. Moreover, the fine-tuned BERT models tend to stay approximately optimal under subtle perturbations. The visualization results also indicate that it is preferable to converge to wider and flatter local optima, as the training loss surface and the generalization error surface are shifted with respect to each other (Izmailov et al., 2018). In contrast, training from scratch obtains thinner optimum areas and poorer generalization than fine-tuning BERT, especially on the datasets with relatively small data sizes (such as MRPC and RTE). Intuitively, the thin and sharp optima on the training loss surfaces are hard to migrate to the generalization surfaces.

For training from scratch, it is not surprising that on larger datasets (such as MNLI and SST-2) the generalization error surfaces are more consistent with the training loss surfaces. The results suggest that training the model from scratch usually requires more training examples to generalize better compared with fine-tuning BERT.

7 Lower Layers of BERT are More Invariant and Transferable

The BERT-large model has 24 layers. Different layers could have learned different granularities or perspectives of language during the pre-training procedure. For example, Tenney et al. (2019a) observe that most local syntactic phenomena are encoded in lower layers, while higher layers capture more complex semantics. They also show that most examples can be classified correctly in the first few layers. Based on these observations, we conjecture that lower layers of BERT are more invariant and transferable across tasks.

We divide the layers of the BERT-large model into three groups: low layers (0th-7th layers), middle layers (8th-15th layers), and high layers (16th-23rd layers). As shown in Figure 7, we plot the two-dimensional loss surfaces with respect to different groups of layers (i.e., a parameter subspace instead of all parameters) around the fine-tuned point. To be specific, we modify the loss surface function in Section 3.2 to $f(\alpha, \beta) = \mathcal{J}(\theta_1 + \alpha\delta_1^{G} + \beta\delta_2^{G})$, where θ1 represents the fine-tuned parameters, G ∈ {low layers, middle layers, high layers}, and the optimization direction of the layer group is used as the axis. On the visualized loss landscapes, f(0, 0) corresponds to the loss value at the fine-tuned point. Besides, f(−1, 0) corresponds to the loss value with the corresponding layer group rolled back to its original values in the pre-trained BERT model.
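Rolling a layer group back to its pre-trained values (the f(−1, 0) point) amounts to overwriting that group's parameters in the fine-tuned state dictionary. The sketch below assumes HuggingFace-style parameter names such as bert.encoder.layer.{k}., which the paper does not specify.

```python
# Sketch of the f(-1, 0) rollback: restore one layer group of a fine-tuned
# model to its pre-trained values. Parameter names such as
# "bert.encoder.layer.{k}." follow the HuggingFace convention, which is an
# assumption; the paper does not specify an implementation.
LAYER_GROUPS = {"low": range(0, 8), "middle": range(8, 16), "high": range(16, 24)}

def rollback_layer_group(finetuned_state, pretrained_state, group):
    prefixes = tuple(f"bert.encoder.layer.{k}." for k in LAYER_GROUPS[group])
    rolled_back = dict(finetuned_state)
    for name, value in pretrained_state.items():
        if name.startswith(prefixes):
            rolled_back[name] = value  # restore the pre-trained parameters
    return rolled_back
```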

Figure 7 shows that the loss surface with respect to lower layers has a wider local optimum along the optimization direction. The results demonstrate that rolling back the parameters of lower layers to their original values (the star-shaped points in Figure 7) does not dramatically hurt the model performance. In contrast, rolling back high layers makes the model fall into a region with high loss. This phenomenon indicates that the optimization of high layers is critical to fine-tuning, whereas lower layers are more invariant and transferable across tasks.

In order to make a further verification, we roll back different layer groups of the fine-tuned model to the parameters of the original pre-trained BERT model. The accuracy results on the development set are presented in Table 1. Similar to Figure 7, the generalization capability does not dramatically decrease after rolling back low layers or middle layers.


Figure 6: Two-dimensional generalization error surfaces of training from scratch (top) and fine-tuning BERT (bottom) on four datasets (MNLI, RTE, SST-2, MRPC). The dot-shaped point and the star-shaped point indicate the generalization errors at the beginning and at the end of the training procedure, respectively.

Dataset   BERT     Layer Rollback
                   0-7              8-15             16-23
MNLI      86.54    86.73 (+0.19)    84.71 (-1.83)    32.85 (-53.88)
RTE       75.45    73.29 (-2.16)    70.04 (-5.41)    47.29 (-28.16)
SST-2     94.04    93.69 (-0.35)    93.12 (-0.92)    59.29 (-34.75)
MRPC      90.20    87.99 (-2.21)    80.15 (-10.05)   78.17 (-12.03)

Table 1: Accuracy on the development sets. The second column reports the fine-tuned BERT models on the specific datasets. The last three columns report the fine-tuned models with different groups of layers rolled back.

Rolling back low layers (0th-7th layers) even improves the generalization capability on the MNLI dataset. By contrast, rolling back high layers hurts the model performance. The evaluation results suggest that low layers that are close to the input learn more transferable representations of language, which makes them more invariant across tasks. Moreover, high layers seem to play a more important role in learning task-specific information during fine-tuning.

8 Related Work

Pre-trained contextualized word representations learned from language modeling objectives, such as CoVe (McCann et al., 2017), ELMo (Peters et al., 2018), ULMFit (Howard and Ruder, 2018b), GPT (Radford et al., 2018, 2019), and BERT (Devlin et al., 2018), have shown strong performance on a variety of natural language processing tasks.

Recent work on inspecting the effectiveness of pre-trained models (Linzen et al., 2016; Kuncoro et al., 2018; Tenney et al., 2019b; Liu et al., 2019a) focuses on analyzing their syntactic and semantic properties. Tenney et al. (2019b) and Liu et al. (2019a) suggest that pre-training helps the models to encode much syntactic information and many transferable features, through evaluating models on several probing tasks. Goldberg (2019) assesses the syntactic abilities of BERT and draws similar conclusions. Our work explores the effectiveness of pre-training from another angle. We propose to visualize the loss landscapes and optimization trajectories of the BERT fine-tuning procedure. The visualization results help us understand the benefits of pre-training in a more intuitive way. More importantly, the geometry of the loss landscapes partially explains why fine-tuning BERT can achieve better generalization capability than training from scratch.

Figure 7: Layer-wise training loss surfaces on the MNLI dataset (top) and the MRPC dataset (bottom). The dot-shaped point represents the fine-tuned model. The star-shaped point represents the model with different layer groups rolled back. Low layers correspond to the 0th-7th layers of BERT, middle layers correspond to the 8th-15th layers, and high layers correspond to the 16th-23rd layers.

Liu et al. (2019a) find that different layers of BERT exhibit different transferability. Peters et al. (2019) show that classification tasks build up information mainly in the intermediate and last layers of BERT. Tenney et al. (2019a) observe that low layers of BERT encode more local syntax, while high layers capture more complex semantics. Zhang et al. (2019) also show that not all layers of a deep neural model contribute equally to model performance. We draw a similar conclusion by visualizing the layer-wise loss surfaces of BERT on downstream tasks. Besides, we find that low layers of BERT are more invariant and transferable across datasets.

In the computer vision community, many efforts have been made to visualize the loss function and to figure out how the geometry of a loss function affects generalization (Goodfellow and Vinyals, 2015; Im et al., 2016; Li et al., 2018). Hochreiter and Schmidhuber (1997) define flatness as the size of the connected region around a minimum. Keskar et al. (2016) characterize flatness using eigenvalues of the Hessian, and conclude that small-batch training converges to flat minima, which leads to good generalization. Li et al. (2018) propose a filter normalization method to reduce the influence of parameter scale, and show that the sharpness of a minimum correlates well with generalization capability. This assumption is also used to design optimization algorithms (Chaudhari et al., 2017; Izmailov et al., 2018) that aim at finding broader optima with better generalization than standard SGD.

9 Conclusion

We visualize the loss landscapes and optimization trajectories of the BERT fine-tuning procedure, aiming to inspect the effectiveness of language model pre-training. We find that pre-training leads to wider optima on the loss landscape, and eases optimization compared with training from scratch. Moreover, we give evidence that the pre-training-then-fine-tuning paradigm is robust to overfitting. We also demonstrate the consistency between the training loss surfaces and the generalization error surfaces, which explains why pre-training improves the generalization capability. In addition, we find that low layers of the BERT model are more invariant and transferable across tasks.

All our experiments and conclusions were derived from BERT fine-tuning. A further understanding of how multi-task training with BERT (Liu et al., 2019b) improves fine-tuning, and how it affects the geometry of loss surfaces, is worth exploring, which we leave to future work. Moreover, the results motivate us to develop fine-tuning algorithms that converge to wider and flatter optima, which would lead to better generalization on unseen data. In addition, we would like to apply the proposed methods to other pre-trained models.

Acknowledgements

The work was partially supported by the National Natural Science Foundation of China (NSFC) [Grant No. 61421003].

References

Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. 2019. Cloze-driven pretraining of self-attention networks. arXiv preprint arXiv:1903.07785.

Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, and Danilo Giampiccolo. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.

Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge. In Proceedings of the Text Analysis Conference (TAC 2009).

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. 2017. Entropy-SGD: Biasing gradient descent into wide valleys.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Proceedings of the First International Conference on Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognizing Textual Entailment, MLCW'05, pages 177–190, Berlin, Heidelberg. Springer-Verlag.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9, Prague. Association for Computational Linguistics.

Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. CoRR, abs/1901.05287.

Ian J. Goodfellow and Oriol Vinyals. 2015. Qualitatively characterizing neural network optimization problems. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Flat minima. Neural Computation, 9(1):1–42.

Jeremy Howard and Sebastian Ruder. 2018a. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Jeremy Howard and Sebastian Ruder. 2018b. Universal language model fine-tuning for text classification. In ACL. Association for Computational Linguistics.

Daniel Jiwoong Im, Michael Tao, and Kristin Branson. 2016. An empirical analysis of deep network loss surfaces. CoRR, abs/1612.04010.

Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2018. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2016. On large-batch training for deep learning: Generalization gap and sharp minima. CoRR, abs/1609.04836.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, San Diego, CA.

Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426–1436, Melbourne, Australia. Association for Computational Linguistics.

Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. 2018. Visualizing the loss landscape of neural nets. In Neural Information Processing Systems.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. CoRR, abs/1611.01368.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew Peters, and Noah A. Smith. 2019a. Linguistic knowledge and transferability of contextual representations. CoRR, abs/1903.08855.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019b. Multi-task deep neural networks for natural language understanding. CoRR, abs/1901.11504.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6297–6308.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Matthew Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019b. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Chiyuan Zhang, Samy Bengio, and Yoram Singer. 2019. Are all layers created equal? arXiv preprint arXiv:1902.01996.