
LANGUAGE PROCESSING IN BRAINS AND DEEP NEURAL NETWORKS: COMPUTATIONAL CONVERGENCE AND ITS LIMITS

A PREPRINT

Charlotte Caucheteux
Facebook AI Research

Jean-Rémi King
PSL University, CNRS
Facebook AI Research

June 29, 2020

ABSTRACT

Deep learning has recently allowed substantial progress in language tasks such as translation and completion. Do such models process language similarly to humans, and is this similarity driven by systematic structural, functional and learning principles? To address these issues, we tested whether the activations of 7,400 artificial neural networks trained on image, word and sentence processing linearly map onto the hierarchy of human brain responses elicited during a reading task, using source-localized magneto-encephalography (MEG) recordings of one hundred and four subjects. Our results confirm that visual, word and language models sequentially correlate with distinct areas of the left-lateralized cortical hierarchy of reading. However, only specific subsets of these models converge towards brain-like representations during their training. Specifically, when the algorithms are trained on language modeling, their middle layers become increasingly similar to the late responses of the language network in the brain. By contrast, input and output word embedding layers often diverge away from brain activity during training. These differences are primarily rooted in the sustained and bilateral responses of the temporal and frontal cortices. Together, these results suggest that the compositional - but not the lexical - representations of modern language models converge to a brain-like solution.

Keywords Natural Language Processing | Neurobiology of Language | Magneto-encephalography

1 Introduction

Deep neural networks trained on "language modeling" (i.e. guessing masked words in a given text) have recently led to substantial progress (Vaswani et al., 2017; Devlin et al., 2018; Lample and Conneau, 2019) on tasks traditionally associated with human intelligence (Turing, 2009; Chomsky, 2006). While still limited (Loula et al., 2018), the improvements in translation, dialogue and summarization allowed by these models lead to a simple question: do these artificial neural networks tend to process language similarly to humans, or do they just superficially mimic our behavior? This question is all the more challenging given that the neurobiology of language remains in its infancy: the transformation of sensory input into phonemic and orthographic representations is becoming increasingly understood (Mesgarani et al., 2014; Dehaene and Cohen, 2011; Hickok and Poeppel, 2007; Di Liberto et al., 2015; Tang et al., 2017; Donhauser and Baillet, 2019). However, the large inter-individual variability and the distributed nature of high-level representations (Fedorenko et al., 2010; Price, 2010; Fedorenko et al., 2020) have limited the study of lexical (i.e. how individual words are represented) and compositional (i.e. how words are combined with one another) representations to piecemeal investigations (Binder et al., 1997; Pylkkänen and Marantz, 2003; Binder et al., 2009; Price, 2010; Fedorenko et al., 2010; Pallier et al., 2011; Fedorenko et al., 2016; Blank et al., 2016; Brennan et al., 2016; Brennan and Pylkkänen, 2017; Pylkkanen and Brennan, 2019; Hagoort, 2019).

Reading, for example, depends on a cortical hierarchy that originates in the primary visual cortex (V1), continues within the visual word form area (in the fusiform gyrus), where letters and strings are identified, and reaches the angular gyrus, the anterior temporal lobe and the middle temporal gyrus, which are associated with word meaning (Price, 2010; Dehaene and Cohen, 2011; Hickok and Poeppel, 2007; Huth et al., 2016). This hierarchical pathway and a parallel motor route responsible for articulatory codes (Hickok and Poeppel, 2007) together connect with the inferior frontal gyrus, repeatedly associated with the compositional processes of language (Dehaene and Cohen, 2011; Hickok and Poeppel, 2007; Hagoort, 2005; Mollica et al.; Fedorenko et al., 2020). However, the precise nature, format and dynamics of each of these language regions remain largely unknown (Fedorenko et al., 2020; Dehaene and Cohen, 2011; Hagoort, 2019; Hickok and Poeppel, 2007).


[Figure 1 graphic (panels A-C; see caption below). Panel B defines the brain score as corr(W X_test, Y_test), with W = argmin_W ||Y_train - W X_train||^2 + ||W||^2.]

Figure 1: A. Hypotheses. An artificial neural network is said to converge to brain-like representations if training makes its activations become increasingly correlated with those of the brain, and vice versa if it diverges. Because artificial neural networks are high dimensional, part of a random network could significantly correlate with brain activity, and thus lead to a "fortunate relationship" between the brain and the algorithm. In this schema, each dot represents one artificial neural network frozen at a given training step. B. To quantify the similarity between an artificial neural network and the brain, a linear regression can be fit from the model's activations (X) to the brain responses (Y) to the same stimulus sequence (here: 'once upon a'). The resulting "brain score" (Yamins et al., 2014) is independent of the training objective of the model (e.g. predicting the next word). C. Average (absolute) MEG responses to word onset in various regions of the cortical network associated with reading, normalized by their maximum amplitudes to highlight their relative onsets and peaks (top ticks). See Movie 1 for additional results.


Recently, several neuroimaging studies have directly linked the representations of language models to those of the brain. In this view, the word embeddings developed for natural language processing (Bengio et al., 2003; Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2016) have been shown to linearly correlate with the brain responses elicited by words presented in isolation (Mitchell et al., 2008; Anderson et al., 2019; Sassenhagen and Fiebach, 2019) or within narratives (Reddy Oota et al., 2018; Abnar et al., 2017; Ruan et al., 2016; Brodbeck et al., 2018; Gauthier and Ivanova, 2018). Furthermore, contextualized word embeddings significantly improve such correlations in the prefrontal, temporal and parietal cortices (Jain and Huth, 2018; Athanasiou et al., 2018; Toneva and Wehbe, 2019). However, these important findings call for a principled investigation: do modern language models systematically (1) converge to, (2) anecdotally correlate with, or (3) even diverge away from the sequence of computations the brain implements to process language (Figure 1 A-B)? Or, on the contrary, do some language models correlate better with brain activity than others simply because (i) they are higher dimensional, (ii) they are trained on different texts, or (iii) they naturally express a larger variety of non-linear dynamics?

To address this issue, we assessed whether the activations of 7,400 artificial neural networks varying in their architectures, objectives, training and performances linearly correlate with source-localized magneto-encephalography (MEG) recordings of 104 subjects reading words sequentially presented in random word lists or arranged into meaningful sentences. The results reveal the functional organization of the reading cortical network with a remarkable spatio-temporal clarity (Videos 1 and 2). Critically, the extensive comparison across models reveals that, during training, the middle - but not the outer - layers of deep language models systematically converge to the sustained representations of the bilateral frontal and temporal cortices.


Figure 2: A. Average brain scores across time (0 - 1 s after word onset) and subjects for the deep CNN trained on character recognition (top), Word2Vec (middle) and the difference between the two (bottom), in response to words presented in random word lists. B. Average brain scores within each region-of-interest (panels) obtained with the CNN (gray) and with Word2Vec (W2V). The coloured area indicates when W2V is higher than the CNN. C. Second-level statistical comparison across subjects of the brain scores obtained within each region-of-interest (averaged from 0 - 1 s) with the CNN (gray) and W2V (color), resulting from a two-sided Wilcoxon signed-rank test. Error bars are the 95% confidence intervals of the scores' distribution across subjects.


2 Results

Towards a spatio-temporal decomposition of the reading network

In spite of substantial individual variability, our MEG source reconstructions were consistent with the anatomy of the reading network (Dehaene and Cohen, 2011): on average, written words elicited a sequence of brain responses originating in V1 around ≈100 ms and continuing within the left posterior fusiform gyrus around 200 ms, the superior and middle temporal gyri, as well as the pre-motor and infero-frontal cortices between 150 and 500 ms after word onset (Figure 1 C, Video 1).

Some of these brain responses likely represent non-linguistic features (e.g. visual onsets, width of the words, etc.). Following previous work (Kriegeskorte, 2015; Güçlü and van Gerven, 2015; Eickenberg et al., 2017; Yamins and DiCarlo, 2016), we thus tested whether single-trial MEG responses linearly correlated with the activations of the last layer of a deep convolutional neural network (CNN) trained on character recognition (Baek et al., 2019). Specifically, we input the CNN with 100x32 pixel images of each of the words presented to the subjects in order to convert them into 888-dimensional visual embedding vectors. We then fitted a ridge-regularized linear regression from these visual embeddings to predict the MEG responses to the corresponding words presented within random lists. Finally, we estimated the precision of this model-to-brain mapping on out-of-vocabulary predictions with a Pearson R correlation score, hereafter referred to as the "brain score".

A spatio-temporal summary of the brain scores obtained with the CNN is displayed in Figure 2. Overall, our results confirm that single-trial brain responses can be linearly predicted by the activations of a deep CNN, with a peak brain score of R=8.4% ± .46% (standard error of the mean across subjects) in the early visual cortex at 150 ms. Note that the present MEG brain scores are substantially lower than those reported in analogous fMRI and electrophysiology studies (Yamins et al., 2014; Jain and Huth, 2018; Huth et al., 2016), partly because we here predict individual recording samples rather than averaged responses or noise-corrected repetitions. These effects are nonetheless robust across subjects (Supplementary Figure 1).


Figure 3: A. Brain scores of a 256-dimensional CBOW Word2Vec embedding, trained with a context size of 5 words at different training stages (full corpus size: 278M words from Wikipedia), as a function of time (averaged across MEG sensors). B. Same as (A) using a Skip-gram Word2Vec. C. Averaged brain scores obtained across time, space (all MEG channels) and subjects (y-axis), for each of the CBOW Word2Vec embeddings (dots, n=6,889 models in total) as a function of their training (x-axis) and performance on the training task (color) (see Methods for details). D. Same as (C) using a Skip-gram Word2Vec. E-G. Averaged brain scores obtained as a function of the objective (CBOW vs Skip-gram), context size and dimensionality, irrespective of other properties (e.g. training, performance, etc.). Error bars indicate the 95% confidence interval. H. Feature importance estimated from a random forest regressor fitted for each subject separately (dots, n=95) to predict the brain scores (averaged across time and space) of an embedding model given its loss, training step, objective, context size and dimensionality. Overall, the performance of word embeddings has a modest and non-monotonic impact on brain scores.

Word embeddings specifically correlate with late, distributed and lateralized brain responses to words

Visual embeddings cannot capture the arbitrary meanings of written words. Word semantics, however, is partially captured by algorithms trained to predict word vectors from their context (CBOW) or vice versa (Skip-gram; Mikolov et al., 2013). We thus applied the above brain score analysis to Word2Vec embeddings, in order to identify where and when semantic representations may account for brain responses above and beyond low-level visual features. Word2Vec (Mikolov et al., 2013) trained with a CBOW objective better predicted brain responses than the visual embeddings of the CNN from ≈200 ms onwards, with a peak difference around 400 ms (mean ∆R = 1.6% across sources and subjects), especially in the left-lateralized temporal and prefrontal responses (statistics summarized in Figure 2C). These results confirm that the corresponding brain regions specifically generate lexical representations.

Word embeddings do not systematically converge towards brain-like representations

To test whether the similarity between word embeddings and cortical responses is fortunate or principled, we trained 18 word embeddings varying in random seeds, dimensionality (n ∈ {128, 256, 512}), context sizes (n ∈ {2, 4, 6}) and learning objectives (Skip-gram vs CBOW). We then estimated, for each of them, how their brain score (averaged over time and independently of brain location) varied as a function of their training steps (cf. Methods) and performance on the training task (log-likelihood on a test set, cf. SI-2-4). Note that, because the training tasks differ, losses between architectures are hardly comparable; we therefore focus on their relative performance (%): i.e. the loss at a given step divided by the final loss. Figure 3 summarizes how the average brain scores vary with each structural and functional property of the word embeddings.
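For reference, the relative performance used here can be written explicitly as the ratio described above (the symbols below are ours, not the paper's notation):

```latex
% Relative performance of a network at a given training step:
% the loss at that step divided by the loss at the end of training.
\[
  \text{relative performance} \;=\; \frac{\mathcal{L}_{\text{step}}}{\mathcal{L}_{\text{final}}}
\]
```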

We then used the feature-importance analysis of random forests (Breiman, 2001) in order to quantify the independent impact of each model property (relative performance, context size, dimensionality, CBOW or Skip-gram objective, and training step) on the final brain scores (cf. Methods). The dimensionality (∆R² = 7% ± 0.8%, p < 10⁻¹⁶, as estimated with a second-level Wilcoxon test across subjects), the context size (10% ± 0.6%, p < 10⁻¹⁶) and the performance of the model (∆R² = 6% ± 0.7%, p < 10⁻¹⁶) significantly but modestly modulated the extent to which a word embedding linearly correlates with brain responses to words. By contrast, the amount of training (∆R² = 64% ± 2%, p < 10⁻¹⁶) and the learning objective (CBOW vs Skip-gram) (∆R² = 50% ± 3%, p < 10⁻¹⁶) appeared to be important predictors of whether an embedding would linearly correlate with brain responses to words.


Figure 4: A. Average brain scores across time (0 - 2 s after word onset) and subjects for Word2Vec (top), the ninth layer of a 13-layer Causal Language Transformer (CLT, middle) and the difference between the two (bottom), in response to words presented within sentences. B. Average brain scores within each region-of-interest (panels) obtained with the CNN (gray), Word2Vec (color) and the CLT (black). C. Second-level statistical comparison across subjects of the brain scores obtained within each region-of-interest (averaged from 0 - 2 s) with Word2Vec (color) and the CLT (black), resulting from a two-sided Wilcoxon signed-rank test. Error bars are the 95% confidence intervals of the scores' distribution across subjects. See Movie 2 for additional results.


To our surprise, however, brain scores did not vary monotonically with the training and performance of word embeddings. For example, the brain scores of CBOW-trained Word2Vec steeply increased at the beginning of training, decreased after 60 training steps (i.e. after having been exposed to 3M words and 155,000 distinct words), and finally reached a plateau after 3,000 training steps (≈150M words and 750,000 distinct words) (Fig. 3 C). Similarly, the Word2Vec embeddings trained with a Skip-gram objective reached a plateau after only around five training steps (i.e. ≈250,000 words). Together, these results suggest that training word embedding algorithms does not make them systematically converge to brain-like representations.

Contextualized word embeddings specifically correlate with the delayed responses of the bilateral prefrontal and temporal cortices

Word embeddings are not contextualized: i.e. each word is associated with a unique vector, independently of the context in which it is used. Yet, the meaning of a sentence depends not only on the meaning of its words, but also on the rules used to combine them (Dummett, 1981). To test whether modern language models systematically combine words into representations that linearly correlate with those of the human brain, we applied the above brain score analyses to the contextualized word representations generated by a variety of transformers (Devlin et al., 2018; Vaswani et al., 2017; Radford et al., 2019) - state-of-the-art deep neural networks trained to predict words given a context. To make sure that the models, like the brain, process and combine words sequentially, we restricted our analyses to causal transformers (CLTs; i.e. unidirectional, from left to right), and now focus on the brain responses to words presented within isolated but meaningful sentences.

Figure 4 summarizes the brain scores obtained with visual, word and contextualized word embeddings, generated by the visual CNN, a representative Word2Vec embedding and the ninth layer of a representative 13-layer CLT, respectively. Video 2 displays the brain scores obtained with the visual CNN (blue), as well as the gain in brain scores obtained with the Word2Vec embedding (green) and the gain in brain scores obtained with the ninth layer of a CLT (red). Overall, the spatio-temporal brain scores obtained with visual and word embeddings in the context of sentences were largely similar to those obtained in the context of random word lists: the former peaked around 100 ms in the primary visual cortex and rapidly propagated across the reading network, whereas the latter peaked around 400 ms and were primarily distributed over the left temporal and frontal cortices (Figure 4, Video 2). The contextualized word embeddings led to significantly higher brain scores than word embeddings, especially one second after word onset. These improvements in brain scores peaked in the bilateral temporal and prefrontal cortices, especially around the infero-frontal cortex, the anterior temporal lobe and the middle and superior temporal gyri. Overall, these results show that we can precisely track the formation of visual, lexical and compositional representations in the expected brain regions (Price, 2010; Dehaene and Cohen, 2011; Hagoort, 2019).


Figure 5: A. Gain in brain score (y-axis) between the representation extracted from each layer of a 13-layer CLT (x-axis) and its input layer (average across all time samples). Each line depicts a subject. B. Same as (A) showing the average effect across subjects for clarity (left), as well as the average gain in brain scores obtained with 8-layer and 4-layer CLTs. C. Brain scores averaged over time and models as a function of their dimensionality (left), number of attention heads (middle) and total number of layers (right). D. Brain scores (averaged over time and subjects, y-axis) for the first (non-contextualized) layer of 594 CLTs (18 architectures x 33 color-coded training steps). Each dot corresponds to the averaged brain score of a frozen model. E. Same as (D) using the middle layers of the CLTs (n_layers/3 < layer ≤ 2*n_layers/3). F. Same as (D) using the deepest layers of the CLTs (layer > 2*n_layers/3). G. Feature importance estimated, for each subject separately (dots, n=95), with a random forest fitted to predict the average brain score from the model's hyperparameters (number of attention heads, total number of layers, and dimensionality of its layers), training step and performance, as well as from the relative depth of the layer used to compute this brain score. Error bars are the 95% confidence intervals of the scores' distribution across subjects.


Middle layers converge towards brain responses while outer layers diverge away from them

Is the gain in brain score obtained with the middle layer of a 13-layer CLT fortunate or, on the contrary, does it reflect a convergence of these modern language models towards brain-like representations? To address this issue, we trained 18 CLTs varying in dimensionality (n ∈ {128, 256, 512}), number of attention heads (n ∈ {4, 8}) and total number of layers (n ∈ {4, 8, 12}), all trained on the same Wikipedia corpus. We froze each of these networks at different training stages (cf. Methods), input them with the word sequences that were presented to the subjects, and extracted the representations from each of their layers in order to estimate the corresponding brain scores.

First, we observed that, at the end of training, the middle layers weakly but systematically outperformed the outer layers (e.g. first and last): the difference in brain scores between the middle and output layers of a 13-layer CLT was above chance (∆R = 1%, ±0.06 across subjects, p < 10⁻²⁹) and this difference held across architectures (∆R = 0.75% ± 0.15 across architectures, p < 10⁻⁷ as assessed with a second-level Wilcoxon test across subjects, Figure 5A-B).

Second, we investigated how the brain score of each layer varied with the training and the accuracy of the artificial neural network at predicting the word following a given context (cf. Methods, Supplementary Figure 3). On average, the brain scores of the input layer increased within the first 400k training steps (≈80M processed words), but ultimately presented a steady decrease over training steps (Pearson correlation between training steps and brain scores: R = -63% ± 3.1%, p < 10⁻³⁴; Figure 5 D). The brain scores of the deepest layers (layers above the 66th percentile of the model's depth) also rapidly started to decrease with training: -35% ± 2.6% correlation after 2 epochs, p < 10⁻²¹ (Figure 5 F). By contrast, the brain scores of the middle layers (layers between the 33rd and the 66th percentile of the model's depth) systematically increased with the training of the model (R = 33% ± 1.5%, p < 10⁻³⁵, Figure 5 E).
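For illustration, the grouping of layers by relative depth used above could be implemented as follows (a minimal sketch, not the authors' code; layers are assumed to be 1-indexed):

```python
# Minimal sketch: group transformer layers into "input", "middle" and "deep" bins
# by relative depth, following the bounds given above (middle: 1/3 < depth <= 2/3).

def layer_group(layer: int, n_layers: int) -> str:
    """Classify a 1-indexed layer by its relative depth layer / n_layers."""
    depth = layer / n_layers
    if depth <= 1 / 3:
        return "input"
    elif depth <= 2 / 3:
        return "middle"
    return "deep"

# Example for a 12-layer causal transformer:
groups = {layer: layer_group(layer, 12) for layer in range(1, 13)}
# -> layers 1-4: "input", 5-8: "middle", 9-12: "deep"
```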


To summarize how brain scores are independently modulated by the models' architecture, their training and the relative depth of their representations, we implemented a feature-importance analysis across models based on a random forest (Figure 5G). These results confirmed that the relative depth of the extracted representation (∆R² = 120% ± 3%, p < 10⁻¹⁶), the performance of the models on language modeling (∆R² = 59% ± 2%, p < 10⁻¹⁶) and the network's amount of training (∆R² = 2% ± 0.4%, p < 10⁻¹⁶) each had a major impact on brain scores (Figure 5H). By contrast, the dimensionality of the layers and the total number of layers modestly influenced the brain scores (∆R² < 0.3%).

Overall, these results suggest that, beyond the marginal effects of the models' architectures, the middle - but not the outer - layers of deep language models systematically converge towards brain-like representations.

3 Discussion

Whether modern A.I. algorithms converge to the computational solutions of the human brain, or whether they simply find far-out tricks to superficially mimic human behavior, is largely unknown. Following recent achievements in electrophysiology (Yamins et al., 2014; Tang et al., 2018) and neuroimaging (Khaligh-Razavi and Kriegeskorte, 2014; Kriegeskorte, 2015; Güçlü and van Gerven, 2015; Eickenberg et al., 2017; Huth et al., 2016; Toneva and Wehbe, 2019; Kell et al., 2018), we here tackled this challenge on the restricted issue of word and sentence processing by assessing how brain responses to random word lists (Figures 2 and 3) and meaningful sentences (Figures 4 and 5, Video 2) variably correlated with different (parts of) artificial neural networks depending on their architectures, training and language performance.

Our results reveal how and when the reading network transforms written words into visual, lexical and compositional representations. In particular, the sequential recruitment of the early visual cortex and of the fusiform gyrus matches and extends previous fMRI and electro-corticography studies (Dehaene and Cohen, 2011; Hermes et al., 2017). Furthermore, the brain scores obtained with word embeddings delineate a neural code of word semantics distributed over the frontal, temporal and parietal cortices, similarly to what has been reported in recent fMRI studies (Mitchell et al., 2008; Jain and Huth, 2018; Toneva and Wehbe, 2019). Finally, the compositional representations of deep language models peaked precisely in the brain regions traditionally associated with high-level sentence processing (Pallier et al., 2011; Hickok and Poeppel, 2007; Brennan and Pylkkänen, 2017). As expected (Fedorenko et al., 2010; Cogan et al., 2014), most of these effects appeared left-lateralized, but significantly recruited both hemispheres. As is often the case, however, such MEG source estimates should be interpreted with caution (Baillet, 2017). In particular, the ventro-medial prefrontal and anterior cingulate cortices presently observed (Video 2) are not specific to language processing. Their responses could thus relate to cognitive processes that are not specific to language, such as emotional evaluation and working memory load.

Critically, we found that the middle - but not the outer - layers of modern deep language models become increasingly similar to brain activity as they learn to accurately predict the word that should follow a given context. This suggests that the way language models learn to combine - as opposed to represent - words converges to a brain-like solution. We should stress, however, that this convergence is almost certainly partial. First, current language models are still far from human-level performance on a variety of tasks such as dialogue, summarization, and systematic generalization (Loula et al., 2018; Zellers et al., 2019). In addition, they can be disrupted by adversarial examples (Nie et al., 2019). Finally, the architecture of the popular transformer network (Vaswani et al., 2017) is in many ways not biologically plausible: while the brain has been repeatedly associated with a predictive coding architecture, where prediction errors are computed at each level of an interconnected hierarchy of recurrent networks (Friston, 2010), transformers access an unreasonably large buffer of words through an attentionally-gated set of "duplicated" feedforward networks, which together only compute prediction errors at their final layer. In light of these major architectural differences, it is thus remarkable to see that the brain and the middle layers of these models find a partially common solution to language processing.

The failure of the outer layers to converge to brain-like representations surprised us, as word embeddings have been repeatedly used to model brain activity (Mitchell et al., 2008; Sassenhagen and Fiebach, 2019; Huth et al., 2016). We can speculate that this phenomenon results from the way language models are trained. Specifically, language models aim to predict words from other words. Presumably, this artificial task is best achieved by learning syntax, pragmatics as well as general knowledge, which together can help transform a series of words into a meaningful construct - a construct which can then be used to constrain the probability of subsequent words. By contrast, the human brain may use a different objective (Richards et al.; Friston, 2010): e.g. maximizing the information exchange during dialogue, predicting the trajectory of a narrative, or minimizing surprise at each level of the representational hierarchy - as opposed to minimizing lexical surprise. These alternative objectives may also happen to require words to be combined into meaningful constructs: i.e. to require going through representations that are similar to the networks'. While speculative, this issue highlights the necessity, both for A.I. and cognitive neuroscience, to explore further what the brain aims to achieve during language processing (i.e. finding the learning objective), rather than how it achieves it (i.e. finding the representations necessary to achieve this goal).



Finally, the present study only starts to elucidate the precise nature of linguistic representations in brains and artificial neural networks. Similarly, it only starts to unravel the complex interplay between the regions of the language network. How the mind builds and organizes its lexicon, and how it parses and manipulates sentences, thus remain open questions. We hope that the present work will serve as a stepping stone to progress on these historical questions.

4 Methods

4.1 Models

We aimed to compare brain activity to three families of models, targeting visual, lexical and compositional representations, respectively.

Visual Network

To model visual representations, every word presented to the subjects was rendered on a gray 100 x 32 pixel background with a centered black Arial font, and input to a VGG network pretrained to recognize words from images (Baek et al., 2019), resulting in 888-dimensional embeddings. These embeddings aim to replicate previous work on the similarity between visual neural networks and brain activity in response to images (e.g. Yamins et al., 2014; Kriegeskorte, 2015; Güçlü and van Gerven, 2015), while ensuring that the stimuli input to the network are similar to those used for its training.
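As an illustration of this stimulus-rendering step, a word image could be generated as follows (a minimal sketch, not the authors' pipeline; the `render_word` helper and the Arial font path are assumptions and depend on the system):

```python
# Minimal sketch: render a word as a 100 x 32 pixel image on a gray background
# with a centered black Arial font, as described above.
from PIL import Image, ImageDraw, ImageFont

def render_word(word: str, size=(100, 32), font_path="Arial.ttf", font_size=20):
    img = Image.new("RGB", size, color=(128, 128, 128))   # gray background
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, font_size)        # system-dependent path
    # Center the text using its bounding box.
    left, top, right, bottom = draw.textbbox((0, 0), word, font=font)
    w, h = right - left, bottom - top
    draw.text(((size[0] - w) / 2 - left, (size[1] - h) / 2 - top),
              word, fill=(0, 0, 0), font=font)
    return img

# Each rendered image is then fed to the pretrained character-recognition CNN
# to obtain an 888-dimensional visual embedding.
image = render_word("once")
```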

Word embedding networks

To model lexical representations, every word presented to the subjects was lower-cased and input to Word2Vec models (Mikolov et al., 2013). Word2Vec consists of a one-hidden-layer neural network stacked on a look-up table (also called an embedding layer). Depending on the learning objective, the models were trained to predict the current word w_t given its context w_{t-k}, ..., w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}, ..., w_{t+k} (Continuous Bag-Of-Words, CBOW), or the context given the current word (a.k.a. Skip-gram). The models were trained using the Gensim implementation (Rehurek and Sojka, 2010) (hyper-parameters: batch size of 50,000 words, 0.001 sampling, hierarchical softmax, i.e. without negative sampling), on 288,450,116 lower-cased, tokenized (regexp tokenizer from Gensim) words from Wikipedia, without punctuation. We restricted the vocabulary to the stimulus words used in the present MEG study and to the words that appeared at least 5 times in the training corpus (resulting in ≈820K distinct vocabulary words in total).
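A single Word2Vec configuration with the hyper-parameters listed above could be trained along these lines (a minimal sketch, not the authors' script; parameter names follow Gensim 4.x, and the toy corpus stands in for the tokenized Wikipedia text):

```python
from gensim.models import Word2Vec

# Toy stand-in for the lower-cased, regexp-tokenized Wikipedia corpus.
corpus_sentences = [["once", "upon", "a", "time", "there", "was", "a", "cat"]] * 100

model = Word2Vec(
    sentences=corpus_sentences,
    vector_size=256,     # dimensionality, varied in {128, 256, 512} in the study
    window=4,            # context size, varied in {2, 4, 6} in the study
    sg=0,                # 0 = CBOW, 1 = Skip-gram
    hs=1, negative=0,    # hierarchical softmax, i.e. no negative sampling
    sample=0.001,        # subsampling of frequent words
    min_count=5,         # keep words seen at least 5 times
    batch_words=50_000,  # batch size of 50,000 words
)

# A lexical embedding is then a simple per-word lookup, e.g.:
vector = model.wv["once"]
```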

To evaluate the networks' performance, we computed their loss (negative log-likelihood) on a test dataset of 458,622 words from Wikipedia. Losses between CBOW and Skip-gram models with various context sizes are hardly comparable because their training tasks differ. We thus report (Figure 3), for each network, its performance stage (%), from the worst (0%) to the best performance (100%).

In total, we investigated 18 Word2Vec architectures varying in learning objective (Skip-gram vs CBOW), dimensionality (∈ {128, 256, 512}) and context size during training (k ∈ {2, 4, 6}), each with three different random seeds. We froze each network at different training stages: 130 steps between initialisation and epoch 1,000. Networks were stopped after 3 days of training on 6 CPUs independently of their training stages, resulting in 6,889 models of non-contextualized word embeddings in total (see Supplementary Information for model details).

Causal Language Transformer

To model compositional representations (in the sense of non-linear interactions between words), we extracted the contextualized word representations of a variety of transformers trained on language modeling (Vaswani et al., 2017). We restricted our analyses to causal networks (i.e. unidirectional, processing words from left to right) to facilitate the interpretation of the model-to-brain mapping. For example, given the sentence 'THE CAT IS ON THE MAT', the brain responses to 'ON' are solely compared to the activations of a CLT input with 'THE CAT IS ON'. CLT representations were generated for every word, by building word sequences from the three previous sentences. We did not observe qualitatively different results when using longer contexts. Note that sentences were isolated, and were not part of a narrative.
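To make the procedure concrete, the sketch below extracts per-word contextualized representations from a causal transformer, feeding each word only its preceding context as described above. The study trained its own CLTs with the XLM codebase; the publicly available GPT-2 model from Hugging Face `transformers` is used here purely as a stand-in causal model:

```python
# Illustrative sketch of extracting per-word contextualized representations from
# a causal (left-to-right) transformer. GPT-2 stands in for the study's CLTs.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

sentence = "the cat is on the mat"
words = sentence.split()

layer = 6  # e.g. a middle layer
features = []
for i, word in enumerate(words):
    # Only the words up to and including the current one are given as context,
    # mirroring the causal comparison described above.
    context = " ".join(words[: i + 1])
    inputs = tokenizer(context, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[layer]
    # Use the representation of the last token as this word's embedding.
    features.append(hidden[0, -1])

X = torch.stack(features)  # (n_words, hidden_size) matrix used to predict MEG
```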

Algorithms were trained on a language modelling task using the XLM implementation (Lample and Conneau, 2019) (see Supplementary Information for details on the hyper-parameters), on the same Wikipedia corpus of 278,386,651 words extracted using WikiExtractor (https://github.com/attardi/wikiextractor) and pre-processed using the Moses tokenizer (Koehn et al.), with punctuation. We restricted the vocabulary to the 50,000 most frequent words, concatenated with all of the words used in the study (50,341 vocabulary words in total). These design choices ensure that the differences in brain scores observed across models cannot be explained by differences in corpora and text preprocessing.



To evaluate the networks' performance on language modeling, we computed their perplexity (exponential of the entropy) and their accuracy (at predicting the current word given the previous words) on a test dataset of 180,883 words from Wikipedia. Note that we had to use a larger test dataset to evaluate the Word2Vec networks because their loss was not stable.

In total, we investigated 18 distinct CLT architectures varying in dimensionality (∈ {128, 256, 512}), number of layers (∈ {4, 8, 12}) and number of attention heads (∈ {4, 8}). We froze the networks at 33 training stages between initialisation and epoch 200. The CLTs were stopped after 3 days of training on 8 GPUs, resulting in 579 CLT models in total, and 5,223 contextualized word representations (one per layer).

4.2 Magneto-encephalography

One hundred and two subjects performed a one-hour-long reading task while being recorded with a 257-channel CTF magneto-encephalography (MEG) scanner by Schoffelen and colleagues (Schoffelen et al., 2019). Words were flashed one at a time to the subjects, and grouped into sequences of 9 - 15 words ("word sequences"), for a total of approximately 2,700 words per subject. The inter-stimulus interval varied between 300 ms and 1,400 ms, and the inter-sequence interval consisted of a 5-s-long blank screen. Word sequences were either meaningful sentences or random word lists. Sentences and word lists were blocked into groups of five sequences. Twenty percent of the sequences were followed by a yes/no question about the content of the previous sentence (e.g. "Did grandma give a cookie to the girl?") or word list (e.g. "Was the word 'grandma' mentioned?") to ensure that subjects were paying attention.

The 257 MEG time series were pre-processed using the Python library MNE (Gramfort et al., 2014). Signals were band-pass filtered between 0.1 and 40 Hz using MNE's default parameters and segmented between -0.5 s and 2 s relative to word onset.
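The filtering and epoching described above could look as follows in MNE-Python (a minimal sketch, not the authors' pipeline; the synthetic data, channel count and sampling rate are placeholders):

```python
import numpy as np
import mne

sfreq = 300.0  # assumed sampling frequency (Hz); not specified in this excerpt
info = mne.create_info([f"MEG{i:03d}" for i in range(20)], sfreq, ch_types="mag")
raw = mne.io.RawArray(np.random.randn(20, int(sfreq * 60)) * 1e-12, info)

# Band-pass filter between 0.1 and 40 Hz with MNE default parameters.
raw.filter(l_freq=0.1, h_freq=40.0)

# Epoch the continuous signal from -0.5 s to 2 s around (simulated) word onsets.
onsets = (np.arange(5, 55) * sfreq).astype(int)
events = np.column_stack([onsets, np.zeros_like(onsets), np.ones_like(onsets)])
epochs = mne.Epochs(raw, events, tmin=-0.5, tmax=2.0, baseline=(None, 0),
                    preload=True)
Y = epochs.get_data()  # (n_words, n_channels, n_times) array of MEG responses
```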

Seven subjects were excluded from the analyses because of difficulties processing metadata.

4.2.1 Brain score

To assess the linear relationship between the MEG signals of one subject and the activations of one artificial neural network (e.g. a word embedding, or the activations of the first layer of a CLT), we split the subject's MEG samples into K train-test folds (test folds = 20% of words, K = 5). For the analyses of word lists (section 2), train and test sets contained separate words (out-of-vocabulary cross-validation). For the analyses of sentences (section 2), train and test sets contained separate sentences. For each train fold, |T| cross-validated ridge regressions were fitted to predict the MEG signal elicited t seconds after word onset, given the corresponding word embedding as input (t ∈ T; |T| varies between 6 and 180 depending on the experiment, cf. 4.3.1). We used the RidgeCV regressor from scikit-learn (Pedregosa et al., 2011), with the penalization parameter varying between 10⁻³ and 10⁸ (20 values, logarithmically scaled) to reliably optimize hyperparameters on each training set.

We then evaluated the performance of the K * |T| ridge regression fits by computing Pearson's correlation between predicted and actual MEG on the test folds. We obtained K * |T| correlation scores, corresponding to the ability of the network's activations to correlate with brain activity at time step t, for fold k. We call "brain scores" the correlation scores averaged across folds.
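A minimal sketch of this brain-score computation for a single time sample is given below (not the authors' code; the synthetic X and Y are placeholders, and a plain KFold over words stands in for the out-of-vocabulary split described above):

```python
# Ridge regression from model activations X to MEG responses Y at one time sample,
# evaluated with Pearson correlation on held-out words.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_words, n_dims, n_channels = 500, 256, 50
X = rng.standard_normal((n_words, n_dims))       # model activations (one per word)
Y = rng.standard_normal((n_words, n_channels))   # MEG response at time t (one per word)

alphas = np.logspace(-3, 8, 20)                  # 20 log-spaced penalization values
scores = []
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = RidgeCV(alphas=alphas).fit(X[train], Y[train])
    Y_pred = model.predict(X[test])
    # Pearson r between predicted and actual MEG, averaged over channels.
    r = [np.corrcoef(Y_pred[:, c], Y[test][:, c])[0, 1] for c in range(n_channels)]
    scores.append(np.mean(r))

brain_score = np.mean(scores)                    # average across folds
```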

4.2.2 Source reconstruction

To evaluate the anatomical location of these effects, we computed the brain scores on source-reconstructed MEG signals, by correlating the single-trial source estimates ("true" sources) with the single-trial source predictions generated by the model-to-brain mapping regressions.

To this aim, Freesurfer (Fischl, 2012) was used to automatically segment the T1-weighted anatomical MRI of individual subjects, using the 'recon-all' pipeline. We then manually co-registered the subjects' segmented skull with the head markers digitized prior to the MEG acquisition. A single-layer forward model was made using MNE-Python's default parameters (Gramfort et al., 2014). Because of the lack of empty-room recordings, the noise covariance matrix used for the inverse operator was estimated from the zero-centered 200 ms of baseline MEG activity preceding word onset.



The average brain responses of Figure 1 C were computed as the square of the source reconstruction of the average event-related field across all words for each subject separately, then averaged across subjects and finally divided by their respective maxima, in order to highlight temporal differences. Regions of interest were selected from the PALS Brodmann area atlas (Van Essen, 2005) and the Destrieux atlas segmentation (Destrieux et al., 2010) to summarize these large-dimensional effects. Movie 1 displays the non-normalized sources. The model-to-brain mapping regressions fitted across the baseline-corrected sensors Y_true^sensors were used to generate single-trial predictions at the sensor level, Y_pred^sensors. Then, both Y_true^sensors and Y_pred^sensors were projected onto the subjects' source space using a dSPM inverse operator with default parameters (e.g. fixed orientation, depth=0.8 and lambda2=1). Finally, the brain score was computed from the correlation between Y_true^sources and Y_pred^sources.
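For illustration, projecting sensor-level data through a dSPM inverse operator with the parameters mentioned above could look as follows in MNE-Python (a sketch only, not the authors' pipeline; MNE's sample dataset and its standard file names stand in for the actual CTF recordings and individual anatomy):

```python
import mne
from mne.datasets import sample
from mne.minimum_norm import make_inverse_operator, apply_inverse_epochs

data_path = sample.data_path()
raw = mne.io.read_raw_fif(f"{data_path}/MEG/sample/sample_audvis_raw.fif")
events = mne.find_events(raw, stim_channel="STI 014")
epochs = mne.Epochs(raw, events, event_id=1, tmin=-0.2, tmax=0.5,
                    baseline=(None, 0), picks="meg", preload=True)

# The study lacked empty-room recordings, so the noise covariance was estimated
# from the pre-stimulus baseline; the same logic is applied here.
noise_cov = mne.compute_covariance(epochs, tmax=0.0)
fwd = mne.read_forward_solution(
    f"{data_path}/MEG/sample/sample_audvis-meg-oct-6-fwd.fif")

# dSPM inverse operator with fixed orientation and depth weighting of 0.8.
inverse_operator = make_inverse_operator(epochs.info, fwd, noise_cov,
                                         fixed=True, depth=0.8)
# Both the measured and the model-predicted sensor data would be passed through
# this same operator; here only the measured epochs are projected.
stcs = apply_inverse_epochs(epochs, inverse_operator, lambda2=1.0, method="dSPM")
```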

4.3 Statistics

4.3.1 Convergence analysis

All neural networks but the visual CNN were trained from scratch on the same corpus (cf. 4.1). We systematically computed the brain scores of their activations for each subject, sensor and time sample independently. For computational reasons, we restricted ourselves to six representative time samples regularly distributed between -400 and 1,600 ms. Brain scores were then averaged across channels, time samples and subjects to obtain the results reported in Figures 3 and 5. To evaluate the convergence of a model, we computed, for each subject, the Pearson correlation between the brain scores of the network and its performance and/or its training step.
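A minimal sketch of this convergence measure, combined with the second-level test of section 4.3.3, is shown below (not the authors' code; the brain scores are synthetic placeholders):

```python
# Per subject, correlate a network's brain scores across training checkpoints with
# the training steps, then test the resulting correlations across subjects.
import numpy as np
from scipy.stats import pearsonr, wilcoxon

rng = np.random.default_rng(0)
n_subjects, n_checkpoints = 95, 33
training_steps = np.arange(n_checkpoints)
# Synthetic brain scores per subject and checkpoint (stand-in for real estimates).
brain_scores = 0.01 * training_steps + rng.normal(0, 0.1, (n_subjects, n_checkpoints))

# One convergence estimate (Pearson r) per subject...
r_per_subject = np.array([pearsonr(training_steps, s)[0] for s in brain_scores])

# ...then a second-level Wilcoxon signed-rank test across subjects (cf. 4.3.3).
stat, p = wilcoxon(r_per_subject)
print(f"mean r = {r_per_subject.mean():.2f}, p = {p:.1e}")
```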

4.3.2 Feature importance

To systematically quantify how the architecture, the performance and the learning of the artificial neural networks impacted their ability to linearly correlate with brain activity, we fitted, for each subject separately, a random forest across the models' properties (e.g. dimensionality, training stage) to predict their brain scores, using scikit-learn's RandomForest (Breiman, 2001; Pedregosa et al., 2011). The performance of the random forests (R²) was evaluated with a 5-fold cross-validation across models for each subject separately.

For Word2Vec embeddings (Figure 3), we used the learning objective (Skip-gram versus continuous bag-of-words), the context size during training (∈ {2, 4, 6}), the dimensionality (∈ {128, 256, 512}), the performance stage of the model (∈ [0, 1], cf. Word embedding networks) and its training stage (epochs) as the input data to the random forest.

For CLT embeddings (Figure 5), we used the number of attention heads (∈ {4, 8}), total number of layers (∈ {4, 8, 12}), dimensionality (∈ {128, 256, 512}), training step (epochs, ∈ [0, 200]), accuracy, and the relative depth of the representation (between 0 and 1, depth = layer / n_layers) as the input data to the random forest.

In both cases, we implemented a "feature importance" analysis to assess the contribution of each characteristic to the brain score. Feature importance measures ∆R²: the decrease in R² when shuffling one feature repeatedly (here, n_repeats = 100). For each subject, we report the average decrease across the cross-validation folds (Figures 3 and 5). The resulting importance scores (∆R²) observed per feature are expected to be centered around 0 if the model property does not impact the brain score, and positive otherwise. Note that the decrease in R² can be greater than one when the model performs worse than chance (with negative R²) after shuffling a feature.
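The permutation-based feature-importance procedure described above can be illustrated with scikit-learn as follows (a minimal sketch, not the authors' code; the model properties and brain scores are synthetic placeholders):

```python
# A random forest predicts brain scores from model properties; each feature is
# then shuffled repeatedly to measure the resulting drop in R².
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_models = 500
# Synthetic model properties: [dimensionality, context size, training step, performance].
X = np.column_stack([
    rng.choice([128, 256, 512], n_models),
    rng.choice([2, 4, 6], n_models),
    rng.integers(0, 1000, n_models),
    rng.random(n_models),
])
# Synthetic brain scores, mostly driven by training step in this toy example.
y = 0.0001 * X[:, 2] + 0.01 * rng.standard_normal(n_models)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Decrease in R² when each feature is shuffled (n_repeats=100, as in the text).
result = permutation_importance(forest, X_test, y_test, n_repeats=100,
                                scoring="r2", random_state=0)
for name, delta in zip(["dim", "context", "step", "performance"],
                       result.importances_mean):
    print(f"{name}: dR2 = {delta:.3f}")
```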

4.3.3 Population statistics

For clarity, most figures report average effect sizes obtained across subjects. To estimate the robustness of these estimates, we systematically performed second-level analyses across subjects. Specifically, we applied Wilcoxon signed-rank tests across subjects' estimates to evaluate whether the effect under consideration was systematically different from chance level.

Error bars and ± refer to the standard error of the mean.

4.4 Ethics

This study was conducted in compliance with the Helsinki Declaration. No experiments on living beings were performed for this study. These data were provided (in part) by the Donders Institute for Brain, Cognition and Behaviour after having been approved by the local ethics committee (CMO – the local "Committee on Research Involving Human Subjects" in the Arnhem-Nijmegen region).


5 Acknowledgement

This work was supported by ANR-17-EURE-0017, the Fyssen Foundation and the Bettencourt Foundation to JRK for his work at PSL.

References357

S. Abnar, R. Ahmed, M. Mijnheer, and W. H. Zuidema. Experiential, distributional and dependency-based word embeddings have complementary roles in decoding brain activity. CoRR, abs/1711.09285, 2017. URL http://arxiv.org/abs/1711.09285.

A. J. Anderson, E. C. Lalor, F. Lin, J. R. Binder, L. Fernandino, C. J. Humphries, L. L. Conant, R. D. S. Raizada, S. Grimm, and X. Wang. Multiple regions of a cortical network commonly encode the meaning of words in multiple grammatical positions of read sentences. Cerebral Cortex, 29(6):2396–2411, 2019. doi: 10.1093/cercor/bhy110. URL https://academic.oup.com/cercor/article/29/6/2396/4996559.

N. Athanasiou, E. Iosif, and A. Potamianos. Neural activation semantic models: Computational lexical semantic models of localized neural activations. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2867–2878, Santa Fe, New Mexico, USA, Aug. 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/C18-1243.

J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and H. Lee. What is wrong with scene text recognition model comparisons? Dataset and model analysis. In Proceedings of the IEEE International Conference on Computer Vision, pages 4715–4723, 2019.

S. Baillet. Magnetoencephalography for brain electrophysiology and imaging. Nature Neuroscience, 20(3):327–339, 2017.

Y. Bengio, R. Ducharme, and P. Vincent. A neural probabilistic language model. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 932–938. MIT Press, 2003. URL http://papers.nips.cc/paper/1839-a-neural-probabilistic-language-model.pdf.

J. R. Binder, J. A. Frost, T. A. Hammeke, R. W. Cox, S. M. Rao, and T. Prieto. Human brain language areas identified by functional magnetic resonance imaging. Journal of Neuroscience, 17(1):353–362, 1997.

J. R. Binder, R. H. Desai, W. W. Graves, and L. L. Conant. Where is the semantic system? A critical review and meta-analysis of 120 functional neuroimaging studies. Cerebral Cortex, 19(12):2767–2796, 2009. doi: 10.1093/cercor/bhp055. URL https://academic.oup.com/cercor/article/19/12/2767/376348.

I. Blank, Z. Balewski, K. Mahowald, and E. Fedorenko. Syntactic processing is distributed across the language system. NeuroImage, 127:307–323, 2016.

P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.

L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. doi: 10.1023/A:1010933404324. URL https://doi.org/10.1023/A:1010933404324.

J. R. Brennan and L. Pylkkänen. MEG evidence for incremental sentence composition in the anterior temporal lobe. Cognitive Science, 41:1515–1531, 2017.

J. R. Brennan, E. P. Stabler, S. E. Van Wagenen, W.-M. Luh, and J. T. Hale. Abstract linguistic structure correlates with temporal activity during naturalistic comprehension. Brain and Language, 157:81–94, 2016.

C. Brodbeck, L. E. Hong, and J. Z. Simon. Rapid transformation from auditory to linguistic representations of continuous speech. Current Biology, 28(24):3976–3983, 2018.

N. Chomsky. Language and Mind. Cambridge University Press, 2006.

G. B. Cogan, T. Thesen, C. Carlson, W. Doyle, O. Devinsky, and B. Pesaran. Sensory–motor transformations for speech occur bilaterally. Nature, 507(7490):94–98, 2014.

S. Dehaene and L. Cohen. The unique role of the visual word form area in reading. Trends in Cognitive Sciences, 15(6):254–262, 2011.

C. Destrieux, B. Fischl, A. Dale, and E. Halgren. Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature. NeuroImage, 53(1):1–15, 2010.

J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018. URL http://arxiv.org/abs/1810.04805.

G. M. Di Liberto, J. A. O’Sullivan, and E. C. Lalor. Low-frequency cortical entrainment to speech reflects phoneme-level processing. Current Biology, 25(19):2457–2465, 2015.

P. W. Donhauser and S. Baillet. Two distinct neural timescales for predictive speech processing. Neuron, 2019.

M. Dummett. Frege: Philosophy of Language. Harvard University Press, 1981.

M. Eickenberg, A. Gramfort, G. Varoquaux, and B. Thirion. Seeing it all: Convolutional network layers map the function of the human visual system. NeuroImage, 152:184–194, 2017.

E. Fedorenko, P.-J. Hsieh, A. Nieto-Castañón, S. Whitfield-Gabrieli, and N. Kanwisher. New method for fMRI investigations of language: Defining ROIs functionally in individual subjects. Journal of Neurophysiology, 104(2):1177–1194, 2010.

E. Fedorenko, T. L. Scott, P. Brunner, W. G. Coon, B. Pritchett, G. Schalk, and N. Kanwisher. Neural correlate of the construction of sentence meaning. Proceedings of the National Academy of Sciences, 113(41):E6256–E6262, 2016.

E. Fedorenko, I. Blank, M. Siegelman, and Z. Mineroff. Lack of selectivity for syntax relative to word meanings throughout the language network. bioRxiv, page 477851, 2020.

B. Fischl. FreeSurfer. NeuroImage, 62(2):774–781, 2012.

K. Friston. The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010.

J. Gauthier and A. Ivanova. Does the brain represent words? An evaluation of brain decoding studies of language understanding. arXiv preprint arXiv:1806.00591, 2018.

A. Gramfort, M. Luessi, E. Larson, D. A. Engemann, D. Strohmeier, C. Brodbeck, L. Parkkonen, and M. S. Hämäläinen. MNE software for processing MEG and EEG data. NeuroImage, 86:446–460, 2014. doi: 10.1016/j.neuroimage.2013.10.027. URL http://www.sciencedirect.com/science/article/pii/S1053811913010501.

U. Güçlü and M. A. van Gerven. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience, 35(27):10005–10014, 2015.

P. Hagoort. On Broca, brain, and binding: A new framework. Trends in Cognitive Sciences, 9:416–423, 2005.

P. Hagoort. The neurobiology of language beyond single-word processing. Science, 366(6461):55–58, 2019.

D. Hermes, V. Rangarajan, B. L. Foster, J.-R. King, I. Kasikci, K. J. Miller, and J. Parvizi. Electrophysiological responses in the ventral temporal cortex during reading of numerals and calculation. Cerebral Cortex, 27(1):567–575, 2017.

G. Hickok and D. Poeppel. The cortical organization of speech processing. Nature Reviews Neuroscience, 8(5):393–402, 2007. doi: 10.1038/nrn2113. URL https://www.nature.com/articles/nrn2113.

A. G. Huth, W. A. de Heer, T. L. Griffiths, F. E. Theunissen, and J. L. Gallant. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature, 532(7600):453–458, 2016. doi: 10.1038/nature17637. URL http://www.nature.com/articles/nature17637.

S. Jain and A. Huth. Incorporating context into language encoding models for fMRI. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 6628–6637. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7897-incorporating-context-into-language-encoding-models-for-fmri.pdf.

A. Kell, D. Yamins, E. Shook, S. Norman-Haignere, and J. McDermott. A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy. Neuron, 98, 2018. doi: 10.1016/j.neuron.2018.03.044.

S.-M. Khaligh-Razavi and N. Kriegeskorte. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology, 10(11):e1003915, 2014.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180. Association for Computational Linguistics, 2007. URL https://www.aclweb.org/anthology/P07-2045.

N. Kriegeskorte. Deep neural networks: A new framework for modeling biological vision and brain information processing. Annual Review of Vision Science, 1:417–446, 2015.

G. Lample and A. Conneau. Cross-lingual language model pretraining. Advances in Neural Information Processing Systems (NeurIPS), 2019.

J. Loula, M. Baroni, and B. M. Lake. Rearranging the familiar: Testing compositional generalization in recurrent networks. CoRR, abs/1807.07545, 2018. URL http://arxiv.org/abs/1807.07545.

N. Mesgarani, C. Cheung, K. Johnson, and E. F. Chang. Phonetic feature encoding in human superior temporal gyrus. Science, 343(6174):1006–1010, 2014. doi: 10.1126/science.1245994. URL https://science.sciencemag.org/content/343/6174/1006.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space, 2013. URL http://arxiv.org/abs/1301.3781.

T. M. Mitchell, S. V. Shinkareva, A. Carlson, K.-M. Chang, V. L. Malave, R. A. Mason, and M. A. Just. Predicting human brain activity associated with the meanings of nouns. Science, 320(5880):1191–1195, 2008. doi: 10.1126/science.1152876. URL https://www.sciencemag.org/lookup/doi/10.1126/science.1152876.

F. Mollica, M. Siegelman, E. Diachek, S. T. Piantadosi, Z. Mineroff, R. Futrell, H. Kean, P. Qian, and E. Fedorenko. Composition is the core driver of the language-selective network. Neurobiology of Language, 1(1):104–134. doi: 10.1162/nol_a_00005. URL https://doi.org/10.1162/nol_a_00005.

Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela. Adversarial NLI: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599, 2019.

C. Pallier, A.-D. Devauchelle, and S. Dehaene. Cortical representation of the constituent structure of sentences. Proceedings of the National Academy of Sciences, 108(6):2522–2527, 2011. doi: 10.1073/pnas.1018711108. URL https://www.pnas.org/content/108/6/2522.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014. URL http://www.aclweb.org/anthology/D14-1162.

C. J. Price. The anatomy of language: A review of 100 fMRI studies published in 2009. Annals of the New York Academy of Sciences, 1191(1):62–88, 2010.

L. Pylkkänen and J. R. Brennan. Composition: The neurobiology of syntactic and semantic structure building. 2019.

L. Pylkkänen and A. Marantz. Tracking the time course of word recognition with MEG. Trends in Cognitive Sciences, 7(5):187–189, 2003.

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.

S. Reddy Oota, N. Manwani, and B. Raju S. fMRI semantic category decoding using linguistic encoding of word embeddings. arXiv preprint arXiv:1806.05177, 2018.

R. Rehurek and P. Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. URL http://is.muni.cz/publication/884893/en.

B. A. Richards, T. P. Lillicrap, P. Beaudoin, Y. Bengio, R. Bogacz, A. Christensen, C. Clopath, R. P. Costa, A. de Berker, S. Ganguli, C. J. Gillon, D. Hafner, A. Kepecs, N. Kriegeskorte, P. Latham, G. W. Lindsay, K. D. Miller, R. Naud, C. C. Pack, P. Poirazi, P. Roelfsema, J. Sacramento, A. Saxe, B. Scellier, A. C. Schapiro, W. Senn, G. Wayne, D. Yamins, F. Zenke, J. Zylberberg, D. Therien, and K. P. Kording. A deep learning framework for neuroscience. Nature Neuroscience, 22(11):1761–1770, 2019. doi: 10.1038/s41593-019-0520-2. URL http://www.nature.com/articles/s41593-019-0520-2.

Y.-P. Ruan, Z.-H. Ling, and Y. Hu. Exploring semantic representation in brain activity using word embeddings. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 669–679, Austin, Texas, Nov. 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1064. URL https://www.aclweb.org/anthology/D16-1064.

J. Sassenhagen and C. J. Fiebach. Traces of meaning itself: Encoding distributional word vectors in brain activity. bioRxiv, 2019. doi: 10.1101/603837. URL https://www.biorxiv.org/content/early/2019/04/09/603837.

J.-M. Schoffelen, R. Oostenveld, N. Lam, J. Udden, A. Hultén, and P. Hagoort. A 204-subject multimodal neuroimaging dataset to study language processing. Scientific Data, 6, 2019. doi: 10.1038/s41597-019-0020-y.

C. Tang, L. Hamilton, and E. Chang. Intonational speech prosody encoding in the human auditory cortex. Science, 357(6353):797–801, 2017.

H. Tang, M. Schrimpf, W. Lotter, C. Moerman, A. Paredes, J. O. Caro, W. Hardesty, D. Cox, and G. Kreiman. Recurrent computations for visual pattern completion. Proceedings of the National Academy of Sciences, 115(35):8835–8840, 2018.

M. Toneva and L. Wehbe. Interpreting and improving natural-language processing (in machines) with natural language processing (in the brain). CoRR, abs/1905.11833, 2019. URL http://arxiv.org/abs/1905.11833.

A. M. Turing. Computing machinery and intelligence. In Parsing the Turing Test, pages 23–65. Springer, 2009.

D. C. Van Essen. A population-average, landmark- and surface-based (PALS) atlas of human cerebral cortex. NeuroImage, 28(3):635–662, 2005.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.

D. L. Yamins and J. J. DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19(3):356, 2016.

D. L. Yamins, H. Hong, C. F. Cadieu, E. A. Solomon, D. Seibert, and J. J. DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619–8624, 2014.

R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
