
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5137–5154, Florence, Italy, July 28 - August 2, 2019. ©2019 Association for Computational Linguistics


Relating Simple Sentence Representations in Deep Neural Networks and the Brain

Sharmistha Jat1∗ Hao Tang2 Partha Talukdar1 Tom Mitchell2

1Indian Institute of Science, Bangalore
2School of Computer Science, Carnegie Mellon University

{sharmisthaj,ppt}@[email protected], [email protected]

Abstract

What is the relationship between sentence representations learned by deep recurrent models and those encoded by the brain? Is there any correspondence between hidden layers of these recurrent models and brain regions when processing sentences? Can these deep models be used to synthesize brain data which can then be utilized in other extrinsic tasks? We investigate these questions using sentences with simple syntax and semantics (e.g., The bone was eaten by the dog.). We consider multiple neural network architectures, including the recently proposed ELMo and BERT. We use magnetoencephalography (MEG) brain recording data collected from human subjects while they were reading these simple sentences.

Overall, we find that BERT’s activations correlate the best with MEG brain data. We also find that the deep network representation can be used to generate brain data from new sentences to augment existing brain data. To the best of our knowledge, this is the first work showing that the MEG brain recording when reading a word in a sentence can be used to distinguish earlier words in the sentence. Our exploration is also the first to use deep neural network representations to generate synthetic brain data and to show that it helps in improving accuracy on a subsequent stimuli decoding task.

1 Introduction

Deep learning methods have been very successful in a variety of Natural Language Processing (NLP) tasks. However, the representation of language learned by such methods is still opaque. The human brain is an excellent language processing engine, and the brain representation of language is of course very effective. Even though both the brain and deep learning methods represent language, the relationships among these representations have not been thoroughly studied. Wehbe et al. (2014b) and Hale et al. (2018) studied this question in some limited capacity. Wehbe et al. (2014b) studied the processing of a story context at a word level during language model computation. Hale et al. (2018) studied the syntactic composition in the RNNG model (Dyer et al., 2016) with human electroencephalography (EEG) data.

∗ This research was carried out during a research internship at the Carnegie Mellon University.

We extend this line of research by investigating the following three questions: (1) what is the relationship between sentence representations learned by deep learning networks and those encoded by the brain; (2) is there any correspondence between hidden layer activations in these deep models and brain regions; and (3) is it possible for deep recurrent models to synthesize brain data so that they can effectively be used for brain data augmentation. In order to evaluate these questions, we focus on representations of simple sentences. We employ various deep network architectures, including the recently proposed ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) networks. We use MagnetoEncephaloGraphy (MEG) brain recording data of simple sentences as the target reference. We then correlate the representations learned by these various networks with the MEG recordings. Overall, we observe that BERT representations are the most predictive of MEG data. We also observe that the deep network models are effective at synthesizing brain data which are useful in overcoming data sparsity in stimuli decoding tasks involving brain data.

In summary, in this paper we make the following contributions.

• We initiate a study to relate representations of simple sentences learned by various deep networks with those encoded in the brain. We establish correspondences between activations in deep network layers and brain areas.

• We demonstrate that deep networks are capable of predicting change in brain activity due to differences in previously processed words in the sentence.

• We demonstrate the effectiveness of using deep networks to synthesize brain data for downstream data augmentation.

We have made our code and data¹ publicly available to support further research in this area.

2 Datasets

In this section, we describe the MEG dataset and Simple Sentence Corpus used in the paper.

2.1 MEG Dataset

Magnetoencephalography (MEG) is a non-invasive functional brain imaging technique which records magnetic fields produced by electrical currents in the brain. Sensors in the MEG helmet allow for recording of magnetic fluctuations caused by changes in neural activity of the brain. For the experiments in this paper, we used three different MEG datasets collected when subjects were shown simple sentences as stimulus. These datasets are summarized in Table 1; please see Rafidi (2014) for more details. Additional dataset details are given in appendix Section A.1. In the MEG helmet, 306 sensors were distributed over 102 locations and sampled at 1 kHz. Native English speaking subjects were asked to read simple sentences. Each word within a sentence was presented for 300 ms with 200 ms subsequent rest. To reduce noise in the brain recordings, we represent a word’s brain activity by averaging 10 sentence repetitions (Sudre et al., 2012). Comprehension questions followed 10% of sentences, to ensure semantic engagement. MEG data was acquired using a 306-channel Elekta Neuromag device. Preprocessing included spatial filtering using temporal signal space separation (tSSS), low-pass filtering at 150 Hz with notch filters at 60 and 120 Hz, and downsampling to 500 Hz (Wehbe et al., 2014b). Artifacts from tSSS-filtered same-day empty room measurements, as well as ocular and cardiac artifacts, were removed via Signal Space Projection (SSP).

¹ https://github.com/SharmisthaJat/ACL2019-SimpleSentenceRepr-DNN-Brain

Dataset    #Sentences  Voice  Repetition
PassAct1   32          P+A    10
PassAct2   32          P+A    10
Act3       120         A      10

Table 1: MEG datasets used in this paper. Column ‘Voice’ refers to the sentence voice, ‘P’ is for passive sentences and ‘A’ is for active. Repetition is the number of times the human subject saw a sentence. For our experiments, we average MEG data corresponding to multiple repetitions of a single sentence.

2.2 Simple Sentence Corpus

In this paper, we aim to understand simple sentence processing in deep neural networks (DNN) and the brain. In order to train DNNs to represent simple sentences, we need a sizeable corpus of simple sentences. While the MEG datasets described in Section 2.1 contain a few simple sentences, that set is too small to train DNNs effectively. In order to address this, we created a new Simple Sentence Corpus (SSC), consisting of a mix of simple active and passive sentences of the form “the woman encouraged the girl” and “the woman was encouraged by the boy”, respectively. The SSC dataset consists of 256,145 sentences constructed using the following two sets.

• Wikipedia: We processed the 2009 Wikipedia dataset to get sentences matching the following patterns.
“the [noun+] was [verb+] by the [noun+]”
“the [noun+] [verb+] the [noun+]”
If the last word in the matched pattern is not a noun, then we retain the additional dependent clause in the sentence. We were able to extract 117,690 active and 8,210 passive sentences from Wikipedia.

• NELL triples: In order to ensure broader coverage of Subject-Verb-Object (SVO) triples in our sentence corpus, we used the NELL SVO triples² (Talukdar et al., 2012). We subsample SVO triples based on their frequency (threshold = 6), a frequent verb list, and Freebase to get meaningful sentences. Any triple with subject or object or verb not in Freebase is discarded from the triple set.

– Active sentence: Convert the verb to its past tense and concatenate the triple using the following pattern: “the [subject] [verb-past-tense] the [object]”.

– Passive sentence: Concatenate the triple using the pattern: “the [object] was [verb-past-tense] by the [subject]”.

We generate 86,452 active and 43,793 passive sentences in total from the NELL triples.

² NELL SVO triples: http://rtw.ml.cmu.edu/resources/svo/

Figure 1: Encoding model for MEG data. The 306-channel, 500 ms MEG signal for a single word was compressed to 306 × 5 by averaging each 100 ms of data into a single column. The text representation vector is then mapped to this MEG brain recording data using ridge regression. The evaluation is done using 5-fold cross-validation. Please see Section 4 for more details.

We train our deep neural network models with 90% of sentences in this dataset and test on the remaining 10%. We used the spaCy (Honnibal and Montani, 2017) library to predict POS tags for words in this dataset.
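The two NELL templates described in this section can be sketched as follows. This is a minimal illustration rather than the pipeline used to build the corpus: the helper name `triple_to_sentences` is hypothetical, and the past-tense verb form is assumed to be supplied by the caller instead of being derived automatically.

```python
def triple_to_sentences(subject, verb_past, obj):
    """Build the active and passive sentence for one SVO triple.

    `verb_past` is assumed to be already inflected (e.g. "encouraged");
    tense conversion and the Freebase filtering are outside this sketch.
    """
    active = f"the {subject} {verb_past} the {obj}"
    passive = f"the {obj} was {verb_past} by the {subject}"
    return active, passive


active, passive = triple_to_sentences("woman", "encouraged", "girl")
print(active)   # the woman encouraged the girl
print(passive)  # the girl was encouraged by the woman
```
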

3 Methods

We test correlations between brain activity and deep learning model activations (LeCun et al., 2015) for a given sentence using a classification task, similar to previous works (Mitchell et al., 2008; Wehbe et al., 2014a,b). If we are able to predict brain activity from the neural network activation, then we hypothesize that there exists a relationship between the process captured by the neural network layer and the brain. The schematic of our encoding approach is shown in Figure 1.

We investigate various deep neural network models using context sensitivity tests to evaluate their performance in predicting brain activity. Working with these models and their respective training assumptions helps us understand which assumption contributes to the correlations with the brain activity data. We process the sentences incrementally for each model to prevent information from future words from affecting the current representation, in line with how information is processed by the brain. For example, in the sentence “the dog ate the biscuit”, the representation of the word “ate” is calculated by processing the sentence segment “the dog ate” and taking the last representation in each layer as the context for the word “ate”. The following embedding models are used to represent sentences.
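The incremental processing described above amounts to feeding each model only the prefix that ends at the current word. The sketch below shows the prefix construction alone; the function name `incremental_prefixes` is hypothetical, and the model call (running each prefix through a network and keeping the last position's activation in each layer) is deliberately left out.

```python
def incremental_prefixes(sentence):
    """Return (word, prefix) pairs; each prefix ends at the current word."""
    words = sentence.split()
    return [(word, " ".join(words[: i + 1])) for i, word in enumerate(words)]


for word, prefix in incremental_prefixes("the dog ate the biscuit"):
    # A real run would encode `prefix` and keep the final-position
    # activation of every layer as the representation of `word`.
    print(f"{word!r} is represented using {prefix!r}")
```
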

• Random Embedding Model: In this model, we represent each word in a context by a randomly generated 300-dimensional vector. Each dimension is uniformly sampled between [0,1]. The results from this model help us establish the random baseline.

• GloVe Additive Embedding Model: This model represents a word context as the average of the current word’s GloVe embedding (Pennington et al., 2014) and the previous word context. The first word in a sentence is initialized with its GloVe embedding as context.

• Simple Bi-directional LSTM Language Model: We build a language model following Inan et al. (2016). Given a sequence of words w1 . . . wt, we predict the next word wt+1 using a two-layer bidirectional LSTM model (Hochreiter and Schmidhuber, 1997). The model is trained on the Simple Sentence Corpus described in Section 2.2 with a cross-entropy loss. We evaluate our model on the 10% held-out text data. The perplexity of the bidirectional language model is 9.97 on the test data (the low perplexity value is due to the simple train and test dataset).

• Multi-task Model: Motivated by the brain’s multitask capability, we build a model to predict the next word and POS tag information. The multitask model is a simple two-layer bidirectional LSTM model with separate linear layers predicting each of the tasks given the output of the last LSTM layer (Figure 2). The model is trained on the Simple Sentence Corpus described in Section 2.2 with a cross-entropy loss. The model’s accuracy is 96.9% on the POS-tag prediction task and it has a perplexity of 9.09 on the 10% test data. The high accuracy and low perplexity are due to the simple nature of our language dataset.

• ELMo (Peters et al., 2018): ELMo is a recent state-of-the-art deep contextualized word representation method which models a word’s features as internal states of a deep bidirectional language model (biLM) pretrained on a large text corpus. The contextualized word vectors are able to capture interesting word characteristics like polysemy. ELMo has been shown to improve performance across multiple tasks, such as sentiment analysis and question answering.

• BERT (Devlin et al., 2019): BERT uses a novel technique called Masked Language Model (MLM). MLM randomly masks some input tokens and then predicts them. Unlike previous models, this technique can use both left and right context to predict the masked token. The training also predicts the next sentence. The embedding in this model consists of 3 components: token embedding, sentence embedding and transformer positional embedding. Due to the presence of sentence embeddings, we observe an interesting performance of the embedding layer in our experiments.
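The two simplest models in the list above, the random baseline and the GloVe additive context, can be sketched in a few lines. This is an illustration under stated assumptions: the function names and the two-dimensional `toy_glove` table are hypothetical stand-ins for real 300-dimensional GloVe vectors, "average" is read as an equally weighted mean of the current embedding and the previous context, and `random.random()` samples from the half-open interval [0, 1).

```python
import random


def random_embedding(dim=300, rng=None):
    """Random baseline: each dimension drawn uniformly from [0, 1)."""
    rng = rng or random.Random(0)
    return [rng.random() for _ in range(dim)]


def additive_contexts(words, glove):
    """Running context: mean of the current embedding and the previous context."""
    contexts = []
    previous = None
    for word in words:
        vec = glove[word]
        if previous is None:
            # The first word is initialized with its own embedding.
            context = list(vec)
        else:
            context = [(v + p) / 2.0 for v, p in zip(vec, previous)]
        contexts.append(context)
        previous = context
    return contexts


# Hypothetical toy vectors standing in for GloVe embeddings.
toy_glove = {"the": [1.0, 0.0], "dog": [0.0, 1.0], "ate": [1.0, 1.0]}
contexts = additive_contexts(["the", "dog", "ate"], toy_glove)
print(contexts)  # [[1.0, 0.0], [0.5, 0.5], [0.75, 0.75]]
```

Under this reading, the contribution of a word halves with every later word, which matches the later observation that the additive model retains little context from past words.
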

4 Experiments and Results

With the human brain as the reference language processing engine, we investigate the relationship between deep neural network representation and brain activity recorded while processing the same sentence. For this task, we perform experiments at both the macro and micro sentence context level. The macro-context experiments evaluate the overall performance of deep neural networks in predicting brain data for input words (all words, nouns, verbs etc.). The micro-context experiments, by contrast, focus on evaluating the performance of deep neural network representations in detecting minor changes in sentence context prior to the token being processed.

Figure 2: Architecture diagram for the simple multitask model. The second LSTM layer’s output is processed by 2 linear layers, one producing the next-word prediction and the other the POS-tag prediction. We process each sentence incrementally to get the prediction for the word at the nth position. This helps in removing forward bias from future words and is therefore consistent with the information our brain receives when processing the same sentence. Our Simple Bi-directional LSTM language model has a similar architecture, with just one output linear layer for next-word prediction.

Regression task: Similar to previous research (Mitchell et al., 2008; Wehbe et al., 2014b), we use a classification task to align model representations with brain data. MEG data (Section 2.1) is used for these experiments. The task classifies between a candidate word and the true word a subject is reading at the time of brain activity recording. The classifier uses an intermediate regression step to predict the MEG activity from the deep neural network representation for the true and the candidate word. The classifier then chooses the word with the least Euclidean distance between the predicted and the true brain activity. A correct classification suggests that the deep neural network representation captures important information to differentiate between brain activity at words in different contexts. Detailed steps of this process are described as follows.

Regression training: We perform regression from the neural-network representation (for each layer) to the brain activity for the same input words in context. We normalized, preprocessed and trained on the MEG data as described by Wehbe et al. (2014b) (Section 2.3.2). We average the signal from every sensor (306 in total) over 100 ms non-overlapping windows, yielding 306 × 5 sized MEG data for each word. To train the regression model, we take the training portion of the data in each fold, (X, Y). In each tuple (xi, yi), xi is the layer representation for an input word i in a neural network model, and yi is the corresponding MEG recording of size 1530 (flattened 306 × 5). The Ridge regression model (f) (Pedregosa et al., 2011) is learned with generalized cross-validation to select the λ parameter (Golub et al., 1979). The Ridge regression model’s α parameter is selected from the range [0.1, . . . , 100, 1000]. The trained regression model is used to estimate MEG activity from the stimulus features, i.e., yi = f(xi).
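The window averaging and the ridge fit described in this paragraph can be sketched with NumPy. This is a simplified illustration, not the setup used in the experiments: the data are random toy arrays, the helper names are hypothetical, each 500 ms word epoch is assumed to hold 250 samples (500 Hz), and a single fixed regularization value replaces the cross-validated selection over the candidate range.

```python
import numpy as np


def window_average(meg, n_windows=5):
    """Average a (sensors, samples) array over equal non-overlapping windows."""
    n_sensors, n_samples = meg.shape
    samples_per_window = n_samples // n_windows
    trimmed = meg[:, : samples_per_window * n_windows]
    return trimmed.reshape(n_sensors, n_windows, samples_per_window).mean(axis=2)


def ridge_fit(X, Y, lam):
    """Closed-form ridge weights W minimising ||XW - Y||^2 + lam * ||W||^2."""
    gram = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)


rng = np.random.default_rng(0)

# One word epoch: 306 sensors x 250 samples (500 ms at an assumed 500 Hz).
word_meg = rng.normal(size=(306, 250))
y = window_average(word_meg).reshape(-1)  # 306 x 5 flattened to 1530

# Toy training fold: 40 words with 8-dimensional layer representations.
X = rng.normal(size=(40, 8))
true_W = rng.normal(size=(8, 1530))
Y = X @ true_W + 0.01 * rng.normal(size=(40, 1530))

W = ridge_fit(X, Y, lam=0.1)  # fixed lam; the selection step is omitted here
Y_hat = X @ W                 # predicted MEG activity f(x_i) for each word
print(y.shape, Y_hat.shape)
```

In practice the regularization strength would be chosen per fold from the candidate range rather than fixed, and the regression would be fit separately for every network layer.
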

Regression testing: The trained regression model is used to predict yi for each word stimulus (xi) in the test fold during cross-validation. We perform a pair-wise test for the classification accuracy (Acc) (Mitchell et al., 2008). The chance accuracy of this measure is 0.5. We use Euclidean distance (Edist) as given in (1) for the measure.

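\[
\mathrm{Acc} =
\begin{cases}
1, & \text{if } \mathrm{Edist}(f(x_i), y_i) + \mathrm{Edist}(f(x_j), y_j) \le \mathrm{Edist}(f(x_i), y_j) + \mathrm{Edist}(f(x_j), y_i) \\
0, & \text{otherwise}
\end{cases}
\tag{1}
\]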
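Equation (1) compares the matched assignment of predictions to recordings against the swapped one. A minimal sketch follows; the function names and the toy two-dimensional vectors are hypothetical, and averaging the per-pair score over all unordered pairs in a test fold is an assumption about how the scores are aggregated.

```python
import math


def pairwise_correct(pred_i, pred_j, true_i, true_j):
    """Equation (1): 1 if the matched pairing is at least as close as the swapped one."""
    matched = math.dist(pred_i, true_i) + math.dist(pred_j, true_j)
    swapped = math.dist(pred_i, true_j) + math.dist(pred_j, true_i)
    return 1 if matched <= swapped else 0


def pairwise_accuracy(preds, trues):
    """Mean of Equation (1) over all unordered pairs of test items."""
    scores = [
        pairwise_correct(preds[i], preds[j], trues[i], trues[j])
        for i in range(len(preds))
        for j in range(i + 1, len(preds))
    ]
    return sum(scores) / len(scores)


# Toy predicted and true "brain activity" vectors for three words.
preds = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
trues = [[1.0, 0.0], [0.0, 1.0], [0.4, 0.6]]
print(pairwise_accuracy(preds, trues))  # 1.0
```

A predictor that carries no information about the stimulus gets the pairing right about half of the time, which is where the chance level of 0.5 comes from.
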

4.1 Macro-context Experiments

The macro-context experiments aggregate classification performance of each model’s layer on the entire stimuli set. We also evaluate on smaller sets such as only the nouns, verbs, passive sentence words, active sentence words, etc. The macro experiments help us to compare all the models on a large stimuli set. In summary, we observe the following: (1) the intermediate layers of state-of-the-art deep neural network models are most predictive of brain activity (Jain and Huth (2018) also observe this on a 3-layer LSTM language model), (2) in-context representations are better at predicting brain activity than out-of-context representations (embeddings), and (3) the temporal lobe is predicted with the highest accuracy from deep neural network representations.

Detailed Observations: The results of pair-wise classification tests for various models are presented in Figure 3. All the results reported in this section are for the PassAct1 dataset. From the figure, we observe that BERT and ELMo outperform the simple models in predicting brain activity data. In the neural network language models, the middle layers perform better at predicting brain activity than the shallower or deeper layers. This could be due to the fact that the shallower layers represent low-level features and the deeper layers represent more task-oriented features. We tested this hypothesis by examining the performance scores at each lobe of the brain. For each area, we tested the left and right hemispheres independently and compared these performances with the bilateral frontal lobe as well as the activity across all regions. In particular, we examined the primary visual areas (left and right occipital lobe), speech and language processing areas (left temporal) and verbal memory (right temporal), sensory perception (left parietal) and integration (right parietal), language related movements (left frontal) and non-verbal functioning (right frontal). The frontal lobe was tested bilaterally as it is associated with higher level processing such as problem solving, language processing, memory, judgement, and social behavior.

From our results, we observe that lower layers such as BERT layer 5 have very high accuracy for the right and left occipital lobes, which are associated with low-level visual processing. In contrast, higher layers such as the linear layers in the Multitask Model and in the Language Model have the highest accuracy in the left temporal region of the brain. Figure 4 shows the pairwise classification accuracy for a given brain region for the best layers from each model. The accuracy is highest in the left temporal region, responsible for syntactic and semantic processing of language. These results establish correspondences between representations learned by deep neural methods and those in the brain. Further experiments are needed to improve our understanding of this relationship.

We performed additional experiments to predict on a restricted stimuli set. In each of these experiments, a subset of stimuli, for example active sentences, passive sentences, noun, and verb stimuli, was used in classification training and testing. Detailed results for this experiment are documented in the appendix section (Figure 9). From the results, we observe that active sentences are predicted better (best accuracy = 0.93) than passive sentences (best accuracy = 0.87). This might be attributed to the nature of training datasets for deep neural networks, as active sentences are dominant in the training data of most of the pre-trained models. We also observe that our simple multitask model (trained using about 250K active and passive sentences) has a lower performance gap between active and passive sentences than the ELMo and BERT models. This may be due to the more balanced mix of active and passive sentences used to train the multitask model. Noun stimuli are predicted with the highest accuracy of 0.81, while the accuracy for verbs is 0.65. Both Multitask and ELMo models dominate verb prediction results, while BERT lags in this category. Further experiments should be done to compare the ability of Transformer (Vaswani et al., 2017) versus Recurrent Neural Network based models to represent verbs.

Figure 3: Pairwise classification accuracy of brain activity data predicted from various model layer representations. We average 4 consecutive layers of BERT into one value. We find that BERT and ELMo model layers perform the best. The middle layers of most models, and of BERT in particular, are good at predicting brain activity. Read ‘ f’ as forward layer and ‘Emb’ as the embedding layer.

Figure 4: Pairwise accuracy of various brain regions from some selected deep neural network model layers. The left part of the brain, which is considered central to language understanding, is predicted with higher accuracy, especially the left temporal region (L = left, R = right).

4.2 Micro-context Experiments

In these micro-context experiments, we evaluate whether our models are able to retain information from words in the sentence prior to the word being processed. For such context sensitivity tests, we only use the first repetition of the sentence shown to human subjects. This helps to ensure that the sentence has not been memorized by the subjects, which might affect the context sensitivity tests.

Training: The micro-context experiment setup is illustrated in Figure 5. To train the regression model, each training instance corresponding to a word has the form (xi, yi), where xi is the layer representation for an input word i in a neural network model, and yi is the corresponding MEG brain recording data of size 1530 (flattened 306 × 5). During testing, we restrict the pairwise tests to word pairs (xi, xj) which satisfy some conditions. For example, in the noun context sensitivity test, the two words must appear in sentences that are identical except for the noun. We describe these candidate word test pairs in detail in the following sections.

Figure 5: Experimental setup for micro-context tests. Given two sentences with similar words except one in the past (underlined), the test evaluates if the deep neural network model representation contains sufficient information to tell the two words apart. Please see Section 4.2 for more details.

In each of the following sensitivity tests, we perform a pair-wise accuracy test among the same candidate word (bold items) from sentences which are identical except for one word (underlined items). We vary the non-identical word type (noun, verb, adjective, determiner) among the two sentences to test the contribution of each of these word types to the context representation further in the sentence. This test helps us understand what parts of the context are retained or forgotten by the neural network model representation. Detailed results of each test are included in the appendix section (Figure 10). Please note that part of the BERT word embedding is the sentence embedding; therefore the BERT embedding performs better than 0.5, unlike other embeddings.
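The restriction on test pairs can be made concrete with a small filter over sentence prefixes. This is a hedged sketch only: the function names are hypothetical, it handles same-length prefixes that differ in exactly one earlier word (as in the noun and verb tests), and it does not cover pairs of different length such as the adjective test.

```python
from itertools import combinations


def differs_in_one_word(prefix_a, prefix_b):
    """Return the index of the single differing word, or None."""
    words_a, words_b = prefix_a.split(), prefix_b.split()
    if len(words_a) != len(words_b):
        return None
    diffs = [i for i, (a, b) in enumerate(zip(words_a, words_b)) if a != b]
    return diffs[0] if len(diffs) == 1 else None


def candidate_pairs(prefixes):
    """Keep prefix pairs that share the final (candidate) word and differ once earlier."""
    pairs = []
    for a, b in combinations(prefixes, 2):
        index = differs_in_one_word(a, b)
        # The candidate word is the last one, so the single difference
        # must lie strictly before the final position.
        if index is not None and index < len(a.split()) - 1:
            pairs.append((a, b, index))
    return pairs


prefixes = ["the dog ate the", "the girl ate the", "the dog saw the"]
for a, b, index in candidate_pairs(prefixes):
    print(f"{a!r} vs. {b!r} differ at position {index}")
```

Here the first pair differs in the noun (position 1) and the second in the verb (position 2); the remaining combination differs in two words and is excluded.
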

4.2.1 Noun sensitivity

“The dog ate the” vs. “The girl ate the”

For the PassAct1 dataset, we observe that the simple GloVe additive model (classification accuracy = 0.52) loses information about the noun, while it is retained by most layers of other models like BERT (accuracy = 0.92) and ELMo (accuracy = 0.91). Higher level layers, such as the linear layer for POS-tag prediction (accuracy = 0.65), also perform poorly. This is expected given the task that layer solves, which focuses on the POS-tag property at the word ‘the’ rather than on the previous context. In summary, we observe that the language model context preserves noun information well.

4.2.2 Verb sensitivity

“The dog saw the” vs. “The dog ate the”

For the PassAct1 dataset, we observe that, similar to noun sensitivity, most language model layers (accuracy = 0.92), except for the simple GloVe Additive model, preserve the verb memory. By design, the GloVe Additive model retains little context from the past words, and therefore the result verifies the experiment setup.

4.2.3 First determiner sensitivity

“A dog” vs. “The dog”

For the PassAct2 dataset, we observe that determiner information is retained well by most layers. However, the shallow layers retain information better than the deeper layers. For example, compare BERT layer 3 (accuracy = 0.82) and Multitask lstm 0 backward (accuracy = 0.82) with BERT layer 18/19 (accuracy = 0.78). Since the earlier layers have a higher correlation with shallow feature processing, the determiner information may be useful for the early features in the neural network representation.

4.2.4 Adjective sensitivity

“The happy child” vs. “The child”

For the Act3 dataset, we observe that middle layers of most models (BERT, Multitask) retain the adjective information well. Surprisingly, however, the simple multitask model (lstm 1 forward layer accuracy = 0.89) retains adjective information better than the BERT model (layer 7 accuracy = 0.84). This could be due to the importance of the adjective in context for POS tag prediction. This result encourages the design of language models with diverse cost functions based on the kind of sentence context information that needs to be preserved in the final task.

4.2.5 Visualisation

We visualise the average agreement of model predicted brain activity (from BERT layer 18) and true brain activity for candidate stimuli in micro-sensitivity tests. Please note that the micro-sensitivity tests predict brain activity for stimuli whose past context is identical except for one word, which makes the task harder. We preprocess the brain activity values to be +1 for all positive values and -1 for all negative values. The predicted brain activity (y′) and the true brain activity (y) are then compared to form an agreement activity (y′′), resulting in a zero value for all locations where the sign predicted was incorrect. We average these agreement activities (y′′) for all test examples in a cross-validation fold to form a single activity image (Y′′). Figure 8 shows Y′′ for the word ‘the’ in the noun-sensitivity tests of Section 4.2.1 (additional results are in the appendix section). We observe that our model prediction direction agrees with brain prediction direction in most of the brain regions. This shows that our neural network layer representation can preserve information from earlier words in the sentence.
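The sign-agreement computation described in this section can be sketched with NumPy. The function name and the small arrays are hypothetical toy values; the handling of exact zeros (which `np.sign` maps to 0) is an assumption, since the text only specifies positive and negative values.

```python
import numpy as np


def sign_agreement(predicted, true):
    """Agreement activity y'': the shared sign where signs match, else 0."""
    pred_sign = np.sign(predicted)
    true_sign = np.sign(true)
    return np.where(pred_sign == true_sign, true_sign, 0.0)


# Two test examples with three sensor-time locations each (toy values).
predicted = np.array([[0.5, -1.0, 2.0],
                      [0.1, -0.2, -3.0]])
true = np.array([[1.0, -0.5, -1.0],
                 [2.0, 0.3, -1.0]])

# Averaging y'' over test examples gives the activity image Y''.
mean_agreement = sign_agreement(predicted, true).mean(axis=0)
print(mean_agreement)  # [ 1.  -0.5 -0.5]
```

Values near +1 or -1 mark locations where predicted and true activity consistently share a sign, and values near 0 mark locations where the sign was often predicted incorrectly.
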

4.3 Semi-supervised training using synthesized brain activity

In this section, we consider the question of whether a previously trained linear regression model (X1), which predicts brain activity for a given sentence, can be used to produce useful synthetic brain data (i.e., sentence-brain activity pairs). Constraints like the high cost of MEG recording and physical limits on an individual subject during data collection favor such synthetic data generation. We evaluate the effectiveness of this synthetically generated brain data for data augmentation in the stimulus prediction task (Mitchell et al., 2008). Specifically, we train a decoding model (X2) to predict brain activity during a stimulus reading based on GloVe vectors for nouns. We consider two approaches. In the first approach, the same brain activity data as in previous sections was used. In the second approach, the real brain activity data is augmented with the synthetic activities generated by the regression model (X1).

In our experiment, we generate new sentences using the same vocabulary as the original sentences in the PassAct1 dataset. Details of the original 32 sentences (Section A.1.1) along with the 160 generated sentences (Section A.1.2) are given in the appendix section. We process the 160 generated sentences with BERT layer 18 to get word stimulus features in context. The encoding model (X1) was trained using the PassAct1 dataset. Please note that BERT layer 18 was chosen based on its high accuracy in the macro-context tests, which indicates that the layer aligns well with whole-brain activity. The choice of representation (deep neural network layer) to encode brain activity should be made carefully, as each representation may be good at encoding different parts of the brain. A good criterion for representation selection requires further research.
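The augmentation procedure can be outlined as follows. This is only a sketch under stated assumptions: all arrays are random toy data with hypothetical sizes, the helper `ridge_fit` is a plain closed-form ridge solver with a fixed regularization value, and the variable names `encoder` (for X1) and `decoder` (for X2) are introduced here for illustration.

```python
import numpy as np


def ridge_fit(X, Y, lam=1.0):
    """Closed-form ridge weights minimising ||XW - Y||^2 + lam * ||W||^2."""
    gram = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)


rng = np.random.default_rng(0)
n_meg = 1530  # flattened 306 x 5 MEG vector per word

# Real data: in-context layer features, GloVe features and MEG per word.
real_context = rng.normal(size=(32, 16))
real_glove = rng.normal(size=(32, 10))
real_meg = rng.normal(size=(32, n_meg))

# X1: encoder from in-context features to MEG, trained on real data only.
encoder = ridge_fit(real_context, real_meg)

# New sentences: features are available, MEG recordings are not.
new_context = rng.normal(size=(160, 16))
new_glove = rng.normal(size=(160, 10))
synthetic_meg = new_context @ encoder  # synthesized brain activity

# X2: GloVe-to-MEG model trained on real plus synthetic pairs.
augmented_glove = np.vstack([real_glove, new_glove])
augmented_meg = np.vstack([real_meg, synthetic_meg])
decoder = ridge_fit(augmented_glove, augmented_meg)
print(augmented_meg.shape, decoder.shape)
```

The baseline for comparison would fit the same second model on `real_glove` and `real_meg` alone and evaluate both with the pairwise test on held-out real recordings.
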

To demonstrate the efficacy of the synthetic dataset, we present the accuracy in predicting noun (or verb) stimuli from observed MEG activity with and without the additional synthetic MEG data. Using a linear ridge regression model (X2), GloVe (Pennington et al., 2014) feature to brain-activity prediction models were trained to predict the MEG activity when a word is observed. To test the model performance, we calculate the accuracy of the predicted brain activity given the true brain activity during word processing (Equation 1). All the experiments use 4-fold cross-validation. Figure 7 shows the increase in the noun/verb prediction accuracy with additional synthetically generated data. The statistical significance is calculated over 400 random label permutation tests.

To summarize, these results show the utility of using a previously trained regressor model to produce synthetic training data to improve accuracy on additional tasks. Given the high cost of collecting MEG recordings from human subjects and their individual capacity to complete the task, this data augmentation approach may provide an effective alternative in many settings.

5 Related Work

The use of machine learning models in neuroscience has been gaining popularity. Methods in this field use features of words and contexts to predict brain activity using various techniques (Agrawal et al., 2014). Previous research has used functional magnetic resonance imaging (fMRI) (Glover, 2011) and magnetoencephalography (MEG) (Hämäläinen et al., 1993) to record brain activity. Mante et al. (2013) studied the prefrontal cortex in rhesus monkeys and showed that an appropriately trained recurrent neural network model reproduces key physiological observations and suggests a new mechanism of input selection and integration. Barak (2017) argues that reverse-engineered RNNs can provide a framework for modeling in neuroscience, potentially serving as a powerful hypothesis generation tool. Prior research by Mitchell et al. (2008), Wehbe et al. (2014b), Jain and Huth (2018), Hale et al. (2018), Pereira et al. (2018), and Sun et al. (2019) has established a general correspondence between a computational model and the brain's response to naturalistic language. We follow this prior research in our analysis and extend the results with a fine-grained analysis of the sentence context. Additionally, we use deep neural network representations to generate synthetic brain data for extrinsic experiments.

Figure 6: Average sign agreement activity for noun sensitivity stimuli ‘the’. The red and blue colored areas are the positively and negatively signed brain region agreement, respectively, while the white colored region displays brain regions with prediction error. We observe that in most regions of the brain, the predicted and true activity agree on the activity sign, thereby providing evidence that deep learning representations can capture useful information about language processing consistent with the brain recording.

(a) Noun prediction results (b) Verb prediction results

Figure 7: Accuracy with and without synthetically generated MEG brain data on two stimuli prediction tasks: (a) Nouns (left) and (b) Verbs (right). We trained two models: one using true MEG brain recordings and the other using both true and synthetically generated MEG brain data (Augmented data model). We observe that the augmented data model results in an accuracy improvement on both tasks, on average 2.1% per subject for noun prediction and 2.4% for verb prediction. Accuracy (chance) is the random permutation test accuracy, with the green shaded area representing standard deviation. Please see Section 4.3 for details.
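The synthetic-data augmentation described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes a closed-form ridge regression from sentence embeddings to MEG sensor values, and the array shapes, the `alpha` value, and the helper names `fit_ridge` and `augment_with_synthetic` are hypothetical. Random arrays stand in for real embeddings and recordings.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression weights: (X^T X + alpha I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def augment_with_synthetic(X_real, Y_real, X_new, alpha=1.0):
    """Predict brain data for new stimuli and append it to the real data."""
    W = fit_ridge(X_real, Y_real, alpha)
    Y_synth = X_new @ W  # synthetic "brain" responses for unseen sentences
    X_aug = np.vstack([X_real, X_new])
    Y_aug = np.vstack([Y_real, Y_synth])
    return X_aug, Y_aug

# Toy stand-ins (assumed shapes): 32 recorded stimuli, 8 new stimuli,
# 16-dimensional embeddings, 10 sensor values per stimulus.
rng = np.random.default_rng(0)
X_real = rng.normal(size=(32, 16))
Y_real = rng.normal(size=(32, 10))
X_new = rng.normal(size=(8, 16))

X_aug, Y_aug = augment_with_synthetic(X_real, Y_real, X_new)
print(X_aug.shape, Y_aug.shape)
```

A downstream decoder would then be trained on the augmented pairs and evaluated only on held-out real recordings, so that synthetic data never leaks into the test set.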

6 Conclusion

In this paper, we study the relationship between sentence representations learned by deep neural network models and those encoded by the brain. We encode simple sentences using multiple deep networks, such as ELMo and BERT, and use MEG brain imaging data as reference. Representations learned by BERT are the most effective in predicting brain activity. In particular, most models are able to predict activity in the left temporal region of the brain with high accuracy. This brain region is also known to be responsible for processing syntax and semantics for language understanding. To the best of our knowledge, this is the first work showing that the MEG data recorded when reading a word in a sentence can be used to distinguish earlier words in the sentence. Encouraged by these findings, we use deep networks to generate synthetic brain data and show that it helps improve accuracy in a subsequent stimulus decoding task. Such a data augmentation approach is very promising, as collecting actual brain data in large quantities from human subjects is an expensive and labor-intensive process. We are hopeful that the ideas explored in the paper will promote further research in understanding relationships between representations learned by deep models and the brain during language processing tasks.

7 Acknowledgments

This work was supported by the Government of India (MHRD) scholarship and the BrainHub CMU-IISc Fellowship awarded to Sharmistha Jat. We thank Dan Howarth and Erika Laing for help with MEG data preprocessing.


References

Pulkit Agrawal, Dustin Stansbury, Jitendra Malik, and Jack L. Gallant. 2014. Pixels to voxels: Modeling visual representation in the human brain. CoRR, abs/1407.5104.

Omri Barak. 2017. Recurrent neural networks as versatile tools of neuroscience research. Current Opinion in Neurobiology, 46:1–6. Computational Neuroscience.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proc. of NAACL.

Gary H. Glover. 2011. Overview of functional magnetic resonance imaging. Neurosurgery Clinics of North America, 22(2):133–vii.

Gene H. Golub, Michael Heath, and Grace Wahba. 1979. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223.

John Hale, Chris Dyer, Adhiguna Kuncoro, and Jonathan Brennan. 2018. Finding syntax in human encephalography with beam search. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2727–2736. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Matti Hämäläinen, Riitta Hari, Risto Ilmoniemi, Jukka Knuutila, and Olli V. Lounasmaa. 1993. Magnetoencephalography: Theory, instrumentation, and applications to noninvasive studies of the working human brain. Rev. Mod. Phys., 65:413.

Hakan Inan, Khashayar Khosravi, and Richard Socher. 2016. Tying word vectors and word classifiers: A loss framework for language modeling. CoRR, abs/1611.01462.

Shailee Jain and Alexander Huth. 2018. Incorporating context into language encoding models for fMRI. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 6628–6637. Curran Associates, Inc.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature, 521:436.

Valerio Mante, David Sussillo, Krishna V. Shenoy, and William T. Newsome. 2013. Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature, 503:78.

Tom M. Mitchell, Svetlana V. Shinkareva, Andrew Carlson, Kai-Min Chang, Vicente L. Malave, Robert A. Mason, and Marcel Adam Just. 2008. Predicting human brain activity associated with the meanings of nouns. Science, 320(5880):1191–1195.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Francisco Pereira, Bin Lou, Brianna Pritchett, Samuel Ritter, Samuel J. Gershman, Nancy Kanwisher, Matthew Botvinick, and Evelina Fedorenko. 2018. Toward a universal decoder of linguistic meaning from brain activation. Nature Communications, 9(1):963.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.

Nicole Rafidi. 2014. The role of syntax in semantic processing: A study of active and passive sentences. [Online; accessed 2-March-2019].

Gustavo Sudre, Dean Pomerleau, Mark Palatucci, Leila Wehbe, Alona Fyshe, Riitta Salmelin, and Tom Mitchell. 2012. Tracking neural coding of perceptual and semantic features of concrete nouns. NeuroImage, 62:451–63.

Jingyuan Sun, Shaonan Wang, Jiajun Zhang, and Chengqing Zong. 2019. Towards sentence-level brain decoding with distributed representations. AAAI Press.

Partha Pratim Talukdar, Derry Wijaya, and Tom Mitchell. 2012. Acquiring temporal constraints between relations. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM ’12, pages 992–1001, New York, NY, USA. ACM.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.


Leila Wehbe, Brian Murphy, Partha Talukdar, Alona Fyshe, Aaditya Ramdas, and Tom Mitchell. 2014a. Simultaneously uncovering the patterns of brain regions involved in different story reading subprocesses. PLoS ONE, 9:e112575.

Leila Wehbe, Ashish Vaswani, Kevin Knight, and Tom M. Mitchell. 2014b. Aligning context-based statistical models of language with brain activity during reading. In EMNLP, pages 233–243. ACL.


A Appendices

A.1 Dataset details

The following are the sentences used in the experiments described in Section 4. The sentences in the PassAct1 dataset and the artificially generated sentences are listed in Sections A.1.1 and A.1.2, respectively. The two datasets are disjoint in terms of the sentences they contain, but are built using the same vocabulary. The PassAct2 and Act3 datasets are detailed in Sections A.1.3 and A.1.4, respectively.
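To illustrate how new sentences can be built from the same vocabulary, the sketch below enumerates active and passive sentences from the nouns and verbs that appear in the PassAct1 sentences and drops any that were already recorded. It is an assumption-laden illustration, not the authors' generation script: the helper names `generate_sentences` and `held_out` are hypothetical, and the two-sentence `recorded` list is only a stand-in for the full PassAct1 list.

```python
from itertools import permutations

# Vocabulary read off the PassAct1 sentences.
NOUNS = ["boy", "girl", "man", "woman"]
VERBS = ["liked", "watched", "despised", "encouraged"]

def generate_sentences(nouns=NOUNS, verbs=VERBS):
    """Enumerate every active and passive sentence over distinct noun pairs."""
    sentences = []
    for agent, patient in permutations(nouns, 2):
        for verb in verbs:
            sentences.append(f"the {agent} {verb} the {patient}")         # active
            sentences.append(f"the {patient} was {verb} by the {agent}")  # passive
    return sentences

def held_out(candidates, recorded):
    """Keep only candidates that are not among the recorded sentences."""
    recorded_set = set(recorded)
    return [s for s in candidates if s not in recorded_set]

all_sentences = generate_sentences()
# Stand-in for the recorded PassAct1 sentences (two examples only).
recorded = ["the boy liked the girl", "the boy was liked by the girl"]
new_sentences = held_out(all_sentences, recorded)
print(len(all_sentences), len(new_sentences))  # 96 94
```

Filtering against the recorded set is what keeps the generated sentences disjoint from the sentences for which real brain data exists.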

A.1.1 PassAct1 dataset sentences

the boy was liked by the girl
the girl was watched by the man
the man was despised by the woman
the woman was encouraged by the boy
the girl was liked by the woman
the man was despised by the boy
the girl was liked by the boy
the boy was watched by the woman
the man was encouraged by the girl
the woman was despised by the man
the woman was watched by the boy
the girl was encouraged by the woman
the man was despised by the girl
the boy was liked by the man
the boy was watched by the girl
the woman was encouraged by the man
the man despised the woman
the girl encouraged the man
the man liked the boy
the girl despised the man
the woman encouraged the girl
the boy watched the woman
the man watched the girl
the girl liked the boy
the woman despised the man
the boy encouraged the woman
the woman liked the girl
the boy despised the man
the man encouraged the woman
the girl watched the boy
the woman watched the boy
the boy liked the girl

A.1.2 PassAct1 dataset artificially generated sentences

the girl was despised by the man
the man despised the girl
the man was liked by the girl
the girl was liked by the man
the girl liked the man
the man liked the girl
the girl was encouraged by the man
the man encouraged the girl
the man was watched by the girl
the girl watched the man
the boy was despised by the man
the man despised the boy
the man was liked by the boy
the boy liked the man
the man was encouraged by the boy
the boy was encouraged by the man
the boy encouraged the man
the man encouraged the boy
the man was watched by the boy
the boy was watched by the man
the boy watched the man
the man watched the boy
the man was despised by the women
the women was despised by the man
the women despised the man
the man despised the women
the man was liked by the women
the women was liked by the man
the women liked the man
the man liked the women
the man was encouraged by the women
the women was encouraged by the man
the women encouraged the man
the man encouraged the women
the man was watched by the women
the women was watched by the man
the women watched the man
the man watched the women
the girl was despised by the man
the man despised the girl
the girl was liked by the man
the man was liked by the girl
the man liked the girl
the girl liked the man
the girl was encouraged by the man
the man encouraged the girl
the man was watched by the girl
the girl watched the man
the girl was despised by the boy
the boy was despised by the girl
the boy despised the girl
the girl despised the boy
the girl was encouraged by the boy


the boy was encouraged by the girl
the boy encouraged the girl
the girl encouraged the boy
the girl was watched by the boy
the boy watched the girl
the girl was despised by the women
the women was despised by the girl
the women despised the girl
the girl despised the women
the girl was liked by the women
the women was liked by the girl
the women liked the girl
the girl liked the women
the girl was encouraged by the women
the women was encouraged by the girl
the women encouraged the girl
the girl encouraged the women
the girl was watched by the women
the women was watched by the girl
the women watched the girl
the girl watched the women
the boy was despised by the man
the man despised the boy
the man was liked by the boy
the boy liked the man
the boy was encouraged by the man
the man was encouraged by the boy
the man encouraged the boy
the boy encouraged the man
the boy was watched by the man
the man was watched by the boy
the man watched the boy
the boy watched the man
the boy was despised by the girl
the girl was despised by the boy
the girl despised the boy
the boy despised the girl
the boy was encouraged by the girl
the girl was encouraged by the boy
the girl encouraged the boy
the boy encouraged the girl
the girl was watched by the boy
the boy watched the girl
the boy was despised by the women
the women was despised by the boy
the women despised the boy
the boy despised the women
the boy was liked by the women
the women was liked by the boy
the women liked the boy
the boy liked the women
the boy was encouraged by the women
the women was encouraged by the boy
the women encouraged the boy
the boy encouraged the women
the boy was watched by the women
the women was watched by the boy
the women watched the boy
the boy watched the women
the women was despised by the man
the man was despised by the women
the man despised the women
the women despised the man
the women was liked by the man
the man was liked by the women
the man liked the women
the women liked the man
the women was encouraged by the man
the man was encouraged by the women
the man encouraged the women
the women encouraged the man
the women was watched by the man
the man was watched by the women
the man watched the women
the women watched the man
the women was despised by the girl
the girl was despised by the women
the girl despised the women
the women despised the girl
the women was liked by the girl
the girl was liked by the women
the girl liked the women
the women liked the girl
the women was encouraged by the girl
the girl was encouraged by the women
the girl encouraged the women
the women encouraged the girl
the women was watched by the girl
the girl was watched by the women
the girl watched the women
the women watched the girl
the women was despised by the boy
the boy was despised by the women
the boy despised the women
the women despised the boy
the women was liked by the boy
the boy was liked by the women
the boy liked the women
the women liked the boy
the women was encouraged by the boy
the boy was encouraged by the women
the boy encouraged the women


the women encouraged the boy
the women was watched by the boy
the boy was watched by the women
the boy watched the women
the women watched the boy

A.1.3 PassAct2 dataset sentences

the monkey inspected the peach
a monkey touched a school
the school was inspected by the student
a peach was touched by a student
the peach was inspected by the monkey
a school was touched by a monkey
a doctor inspected a door
the doctor touched the hammer
the student found a door
a student kicked the hammer
the student inspected the school
a student touched a peach
a monkey found the hammer
the monkey kicked a door
a dog inspected a hammer
the dog touched the door
a dog found the peach
the dog kicked a school
the doctor found a school
a doctor kicked the peach
a school was kicked by the dog
the peach was found by a dog
the door was touched by the dog
a hammer was inspected by a dog
the peach was kicked by a doctor
a school was found by the doctor
the hammer was touched by the doctor
a door was inspected by a doctor
the hammer was kicked by a student
a door was found by the student
the hammer was found by a monkey
a door was kicked by the monkey

A.1.4 Act3 dataset sentences

the teacher broke the small camera
the student planned the protest
the student walked along the long hall
the summer was hot
the storm destroyed the theater
the storm ended during the morning
the duck flew
the duck lived at the lake
the activist dropped the new cellphone
the editor carried the magazine to the meeting
the boy threw the baseball over the fence
the bicycle blocked the green door
the boat crossed the small lake
the boy held the football
the bird landed on the bridge
the bird was red
the reporter wrote about the trial
the red plane flew through the cloud
the red pencil was on the desk
the reporter met the angry doctor
the reporter interviewed the politician during the debate
the tired lawyer visited the island
the tired jury left the court
the artist found the red ball
the artist hiked along the mountain
the angry lawyer left the office
the army built the small hospital
the army marched past the school
the artist drew the river
the actor gave the football to the team
the angry activist broke the chair
the cellphone was black
the company delivered the computer
the priest approached the lonely family
the patient put the medicine in the cabinet
the pilot was friendly
the policeman arrested the angry driver
the policeman read the newspaper
the politician celebrated at the hotel
the trial ended in spring
the tree grew in the park
the tourist hiked through the forest
the activist marched at the trial
the tourist ate bread on vacation
the vacation was peaceful
the dusty feather landed on the highway
the accident destroyed the empty lab
the horse kicked the fence
the happy girl played in the forest
the guard slept near the door
the guard opened the window
the glass was cold
the green car crossed the bridge
the voter read about the election
the wealthy farmer fed the horse
the wealthy family celebrated at the party
the window was dusty
the boy kicked the stone along the street
the old farmer ate at the expensive hotel


the man saw the fish in the river
the man saw the dead mouse
the man read the newspaper in church
the lonely patient listened to the loud television
the girl dropped the shiny dime
the couple laughed at dinner
the council read the agreement
the couple planned the vacation
the fish lived in the river
the flood damaged the hospital
the big horse drank from the lake
the corn grew in spring
the woman bought medicine at the store
the woman helped the sick tourist
the woman took the flower from the field
the worker fixed the door at the church
the businessman slept on the expensive bed
the businessman lost the computer at the airport
the businessman laughed in the theater
the chicken was expensive at the restaurant
the lawyer drank coffee
the judge met the mayor
the judge stayed at the hotel during the vacation
the jury listened to the famous businessman
the hurricane damaged the boat
the journalist interviewed the judge
the dog ate the egg
the doctor helped the injured policeman
the diplomat bought the aggressive dog
the council feared the protest
the park was empty in winter
the parent watched the sick child
the cloud blocked the sun
the coffee was hot
the commander ate chicken at dinner
the commander negotiated with the council
the commander opened the heavy door
the old judge saw the dark cloud
the young engineer worked in the office
the farmer liked soccer
the mob approached the embassy
the mob damaged the hotel
the minister spoke to the injured patient
the minister visited the prison
the minister found cash at the airport
the minister lost the spiritual magazine
the mouse ran into the forest
the parent took the cellphone
the soldier delivered the medicine during the flood
the soldier arrested the injured activist
the small boy feared the storm
the egg was blue
the editor gave cash to the driver
the editor damaged the bicycle
the expensive camera was in the lab
the engineer built the computer
the family survived the powerful hurricane
the child held the soft feather
the clever scientist worked at the lab
the author interviewed the scientist after the flood
the artist shouted in the hotel


(a) Verb sign agreement image between true and predicted brain activations

(b) Adjective sign agreement image between true and predicted brain activations

(c) Determiner sign agreement image between true and predicted brain activations

Figure 8: Sign agreement image for verb, adjective and determiner sensitivity test stimuli. The red and blue colored areas are the positively and negatively signed brain region agreement, respectively, while the white colored region displays brain regions with prediction error. We observe that in most regions of the brain the predicted and true images agree on the activity sign, suggesting that deep learning representations can capture useful information about language processing.
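One way such a sign agreement map could be computed is sketched below. The exact procedure behind the figure is not specified here, so this is an assumed formulation: each sensor gets +1 if the true and predicted activity are both positive, -1 if both are negative, and 0 (the white regions) if the signs differ. The function name `sign_agreement` and the toy values are hypothetical.

```python
import numpy as np

def sign_agreement(true_act, pred_act):
    """Per-sensor agreement: +1 (both positive), -1 (both negative), 0 (mismatch)."""
    true_sign = np.sign(true_act)
    pred_sign = np.sign(pred_act)
    # Where the signs match, keep the shared sign; elsewhere mark a mismatch with 0.
    # Sensors where both values are exactly zero also come out as 0.
    return np.where(true_sign == pred_sign, true_sign, 0.0)

# Toy activity for four sensors (illustrative values only).
true_act = np.array([0.8, -0.3, 0.5, -0.9])
pred_act = np.array([0.2, -0.1, -0.4, -0.6])

agreement = sign_agreement(true_act, pred_act)
print(agreement)                 # [ 1. -1.  0. -1.]
print(np.mean(agreement != 0))   # 0.75, the fraction of sensors that agree
```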


Figure 9: Pairwise accuracy of predicting brain encodings for nouns, verbs, and passive and active sentences. For each category, the ridge regression model is trained and tested on the corresponding stimulus subset, such as only nouns or only passive sentences. The color of a cell represents the value within the overall accuracy scale, with red indicating small values, yellow intermediate values, and green high values. We observe that nouns are predicted better than verbs, and active sentences are predicted better than passive sentences.
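For readers unfamiliar with pairwise accuracy, the sketch below shows one common formulation, the 2-vs-2 test in the style of Mitchell et al. (2008). It is an assumption that this matches the metric used for the figure: for every pair of held-out stimuli, the prediction counts as correct when the matched true/predicted assignment has a smaller total Euclidean distance than the swapped one. The function name `pairwise_accuracy` and the toy data are hypothetical.

```python
from itertools import combinations

import numpy as np

def pairwise_accuracy(Y_true, Y_pred):
    """2-vs-2 test: fraction of pairs where matched distances beat swapped distances."""
    correct, total = 0, 0
    for i, j in combinations(range(len(Y_true)), 2):
        matched = (np.linalg.norm(Y_true[i] - Y_pred[i])
                   + np.linalg.norm(Y_true[j] - Y_pred[j]))
        swapped = (np.linalg.norm(Y_true[i] - Y_pred[j])
                   + np.linalg.norm(Y_true[j] - Y_pred[i]))
        correct += int(matched < swapped)
        total += 1
    return correct / total

# Toy data: 6 held-out stimuli, 20 sensor values each (illustrative only).
rng = np.random.default_rng(0)
Y_true = rng.normal(size=(6, 20))
Y_pred = Y_true + 0.1 * rng.normal(size=(6, 20))  # predictions close to the truth

acc = pairwise_accuracy(Y_true, Y_pred)
print(acc)
```

Under this formulation chance performance is 0.5, which is why pairwise accuracy is usually reported alongside a permutation-based chance estimate.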


Figure 10: Micro-context sensitivity test results for all the layers. The color of a cell represents the value within the overall accuracy scale, with red indicating small values, yellow intermediate values, and green high values. We observe that nouns and verbs are retained in the context with the same accuracy, followed by determiners and then adjectives.
