
Interpreting Hierarchical Linguistic Interactions in DNNs

Die Zhang Huilin Zhou Xiaoyi Bao Da Huo

Ruizhao Chen Xu Cheng Hao Zhang Mengyue Wu

Quanshi Zhang
Shanghai Jiao Tong University

{zizhan52, zhouhuilin116, zjbaoxiaoyi, sjtuhuoda, stelledge, xcheng8, 1603023-zh, mengyuewu, zqs1022}@sjtu.edu.cn

Abstract

This paper proposes a method to disentangle and quantify interactions among words that are encoded inside a DNN for natural language processing. We construct a tree to encode salient interactions extracted by the DNN. Six metrics are proposed to analyze properties of interactions between constituents in a sentence. The interaction is defined based on Shapley values of words, which are considered as an unbiased estimation of word contributions to the network prediction. Our method is used to quantify word interactions encoded inside the BERT, ELMo, LSTM, CNN, and Transformer networks. Experimental results have provided a new perspective to understand these DNNs, and have demonstrated the effectiveness of our method.

1 Introduction

Deep neural networks (DNNs) have shown promise in various tasks of natural language processing (NLP), but a DNN is usually considered as a black-box model. In recent years, explaining features encoded inside a DNN has become an emerging direction. Based on the inherent hierarchical structure of natural language, many methods use latent tree structures of language to guide the DNN to learn interpretable feature representations [4, 8, 29, 30, 31, 36, 42, 45]. However, the interpretability usually conflicts with the discrimination power [1]. There is a considerable gap between pursuing the interpretability of features and pursuing superior performance.

Therefore, in this study, we aim to explain a trained black-box DNN in a post-hoc manner, so that the explanation of the DNN does not affect its performance. This is essentially different from previous studies of designing new network architectures or losses to learn interpretable features, e.g. physically embedding tree structures into a DNN.

Given a trained DNN, in this paper, we propose to analyze interactions among input words, which are used by the DNN to make a prediction. Our method generates a tree structure to objectively reflect interactions among words. Mathematically, the interaction of several words is quantified as the difference of the contribution when these words contribute jointly to the prediction w.r.t. when each individual word contributes independently to the prediction. The interaction between words may bring either positive or negative effects on the prediction. For example, the word green and the word hand in the sentence he is a green hand have a strong and positive interaction, because the words green and hand contribute to the person's identity jointly, rather than independently.

Preprint. Under review.


Figure 1: A tree to represent interactions among words. The tree is built to explain a trained DNN. Each leaf node (blue) represents an input word in the sentence. Each non-leaf node encodes the significance of interactions within a constituent.

The core challenge in this study is to guarantee the objectiveness of the explanation, i.e. the tree needs to reflect true interactions among words without significant bias. We notice that the Shapley value is widely considered as a unique unbiased estimation of the word contribution [21], which satisfies four desirable properties (linearity, dummy, symmetry and efficiency) [10]. Thus, we define the interaction benefit among words based on the Shapley value. Let us consider a constituent with m words. φ_1, φ_2, ..., φ_m denote numerical contributions of each word to the prediction of a DNN, respectively. φ_all represents the numerical contribution of the entire constituent to the prediction. Hence, B = φ_all − ∑_{i=1}^{m} φ_i measures the interaction benefit of this constituent. If B > 0, interactions among these m words have positive effects on the prediction; otherwise, negative effects. Here, φ_1, ..., φ_m, φ_all can be computed as Shapley values.

Given a trained DNN and an input sentence with n words, Figure 1 shows the tree structure that reflects word interactions encoded inside the DNN. In the tree, n leaf nodes represent n input words. Each non-leaf node corresponds to a constituent of the input sentence. A parent node connects two child nodes with significant interaction benefits. We use the parent node to encode interactions among its child sub-constituents.

More specifically, there are two types of interactions among words, i.e. (1) interactions within a constituent and (2) interactions between constituents.

• Interactions within a constituent exist among any two or more words in the constituent. For the sentence the sun is shining in the sky, interactions within the constituent in the sky consist of interactions among all combinations of words, including interactions (1) between (in, the), (2) between (the, sky), (3) between (in, sky) and (4) among (in, the, sky).

• Interactions between constituents. In the aforementioned sentence, interactions between the constituent the sun and its adjacent constituent is shining are composed of all potential interactions among all combinations of words from the two constituents, including interactions (1) between the and is; (2) between the and shining; (3) between sun and is; (4) between sun and shining; (5) between the and is shining; (6) between sun and is shining; (7) between the sun and is; (8) between the sun and shining; (9) between the sun and is shining.

We use a tree structure to select and encode the most salient interactions among words, in order to reveal the signal processing in a DNN. We further propose additional metrics to diagnose interactions among words, e.g. the quantification of interactions within a constituent, the quantification of interactions between two adjacent constituents, and ratios of interactions that are modeled and unmodeled by the tree.

Theoretically, our method can be used as a generic tool to analyze DNNs with different architectures for various tasks, including the BERT [7], ELMo [25], LSTM [12], CNN [16] and Transformer [39]. Experimental results have demonstrated the effectiveness of our method.

Contributions of this paper can be summarized as follows. (1) We propose a method to extract and quantify interactions among words. (2) A tree structure is automatically generated to represent salient interactions encoded in a DNN. (3) We further design six metrics to analyze interactions, which provides new perspectives to understand DNNs.

2 Related Work

• Hierarchical representations of natural language. Many studies integrated hierarchical structures of natural language into DNNs for better representations [9, 36, 41, 42]. Other studies learned syntactic parsers [8, 13, 18, 19, 20, 23], although these methods pursued a high parsing accuracy, instead of explaining the DNN. Essentially, the learning of the syntactic parser aimed to make the parser fit syntactic structures defined by people. Nevertheless, the post-hoc explanation of a DNN was proposed to objectively explain the signal processing in a DNN. In this way, we hope to provide a generic tool to analyze DNNs in a post-hoc manner, without being affected by the subjective bias in human annotations.

Learning interpretable DNNs: Some studies designed specific network architectures to learn interpretable feature representations, which reflected hierarchical structures of natural language. Chung et al. [5] revised an RNN to generate a hierarchical structure. Shen et al. [30] designed a novel network to automatically capture the latent tree structure of an input sentence.

Post-hoc explanation of DNNs: Another important direction was to explain DNNs using hierarchical structures. Yogatama et al. [46] evaluated the ability of various RNNs for natural language to capture syntactic dependencies. Murdoch et al. [24] estimated contributions of input words to the prediction of an LSTM as well as inter-word relationships.1 Singh et al. [33] generated a tree structure to explain the predictions of a DNN. Reif et al. [27] found that the attention matrices in BERT contained syntactic representations. Raganato and Tiedemann [26] exploited attention weights of the Transformer to analyze what kind of linguistic information was learned by the encoder of the model. Jin et al. [15] provided hierarchical explanations by quantifying the importance of each word or phrase. Voita et al. [40] studied the evolution of token representations across layers in the Transformer under different learning objectives. Lundberg and Lee [21] proposed the SHAP value to assign each feature an importance value for a prediction. Simonyan et al. [32] visualized saliency maps for the class prediction to understand deep CNNs.

Unlike the above studies of estimating the importance/attribution/contribution/saliency of inputs, we focus on interactions among words encoded inside DNNs. Chen and Jordan [3] used a "predefined" syntactic constituency structure to assign an importance score to each word in a sentence, and to quantify interactions2 between sibling nodes on a parse tree, instead of learning the linguistic structure. Janizek et al. [14] explained pairwise feature interactions by extending the Integrated Gradients explanation method. Lundberg et al. [22] defined the SHAP interaction values to quantify interaction effects between two features. Cui et al. [6] estimated global pairwise interactions from a trained Bayesian neural network. Tsang et al. [37] detected statistical interactions from the weights of feedforward neural networks. An ensemble tree-based method [35] was proposed to detect variable interactions. It compared the predictive performance of two regression trees, one with interactions between two variables of interest, and the other with the absence of the interactions. The neural interaction transparency framework [38] was presented to separate feature interactions by way of regularization, and could only be applied to fully connected vanilla multi-layer perceptrons. Greenside et al. [11] identified interactions between all pairs of discrete features in an input DNA sequence. However, these studies mainly focus on interactions between two variables [6, 11, 14, 22] or are limited to specific network architectures [35, 37, 38]. Instead, we aim to quantify interactions among multiple variables in DNNs with arbitrary architectures without any prior linguistic structure. More specifically, our method uses a tree to organize the extracted interactions hierarchically.

• Shapley values. The Shapley value [28] was first introduced in game theory. Given a game with multiple players, each player is supposed to pursue a high score/award. Sometimes, some players may form a coalition in order to pursue more awards. Since each player contributes differently to the coalition, the final award distributed to each player should be unequal. The Shapley value is widely considered as a unique unbiased approach to fairly allocating the total award to each player, which satisfies four desirable properties, including the linearity, dummy, symmetry and efficiency properties. Please see the supplementary material for details of these properties.

Given a game v^N with n players, let N = {1, 2, ..., n} represent the set of n players. The superscript N indicates the set of players participating in the game. Let 2^N denote all the potential subsets of N. For example, suppose there are three players a, b and c in a game. Hence, N = {a, b, c} and 2^N = {∅, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c}}. v^N is a set function mapping from each subset to a real number (i.e. v^N : 2^N → R). For any subset of players S ⊆ N, where S can be regarded as a coalition, v^N(S) represents the award of the coalition. Considering that the player a is not in the coalition S (i.e. a ∉ S), if player a joins the coalition S, the overall award of the coalition would be v^N(S ∪ {a}). v^N(S ∪ {a}) − v^N(S) is considered as the marginal award of player a. The Shapley value φ_v(a) is an unbiased contribution estimation of player a in the game. φ_v(a) is formulated as the weighted sum of marginal awards of player a brought to all possible

1Although Murdoch et al. [24] called the inter-word relationships interactions, such interactions had an essential difference from the interaction defined in this paper.

2The interaction was defined as deviation of composition from linearity.


coalitions S ⊆ N \ {a}:

φ_v(a) = ∑_{S ⊆ N\{a}} [(|N| − |S| − 1)! |S|! / |N|!] · (v^N(S ∪ {a}) − v^N(S))    (1)

Due to the exponential number of coalitions, the computation of Shapley values is NP-hard. A sampling-based method [2] can be used to approximate Shapley values.
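As an illustration, the following is a minimal sketch of such a permutation-sampling estimator (in the spirit of [2]), assuming the award is available as a Python callable v over sets of players; the function name and interface are ours, not an API defined in [2].

```python
import random

def sampling_shapley(v, players, num_samples=1000, seed=0):
    """Monte-Carlo estimate of Shapley values via random permutations.

    v:        callable mapping a set of players to a scalar award, v(S).
    players:  list of player identifiers (e.g. word positions in a sentence).
    Returns a dict {player: estimated Shapley value}.
    """
    rng = random.Random(seed)
    phi = {p: 0.0 for p in players}
    for _ in range(num_samples):
        order = list(players)
        rng.shuffle(order)                    # one random ordering of all players
        coalition = set()
        prev_award = v(coalition)
        for p in order:                       # add players one by one
            coalition.add(p)
            award = v(coalition)
            phi[p] += award - prev_award      # marginal award of p in this ordering
            prev_award = award
    return {p: total / num_samples for p, total in phi.items()}
```

For the DNNs analyzed later (Section 3.3), v(S) would be the network output obtained when only the words in S are kept and all other words are masked.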

3 Algorithm

3.1 Interactions

Interactions between two players. In game theory, some players may form a coalition to compete with other players and win an award. Considering that the Shapley value is an unbiased estimation of each player's contribution [21], we quantify interactions based on the Shapley value. Suppose that there are n players N = {1, 2, ..., n} in a game v. Without loss of generality, we randomly select a pair of players a, b ∈ N. Shapley values of players a and b are denoted by φ_v(a) and φ_v(b), respectively. If players a and b cooperate to form a coalition S_ab = {a, b}, we can consider the coalition as a new singleton player, which is represented using brackets, [S_ab]. In this way, the game can be considered to have n − 1 players, and one of them is the singleton [S_ab], i.e. a and b always appear together in the game. The interaction benefit between a and b is defined as B([S_ab]) = φ_{v^{(N\{a,b})∪{[S_ab]}}}([S_ab]) − (φ_{v^{N\{b}}}(a) + φ_{v^{N\{a}}}(b)). (N \ {a, b}) ∪ {[S_ab]} represents the set of players in N excluding a, b and being added a new singleton player [S_ab]. The absolute value of the interaction benefit |B([S_ab])| represents the significance of the interaction. B([S_ab]) > 0 indicates a cooperative relationship between a and b, whereas B([S_ab]) < 0 indicates an adversarial relationship between a and b.
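To make the definition concrete, here is a small, hedged sketch that computes B([S_ab]) exactly on a toy game by brute-force enumeration of Equation (1); the helper names and the toy award function are ours and only illustrate the definition, not the implementation used in the paper.

```python
from itertools import combinations
from math import factorial

def exact_shapley(v, players, target):
    """Exact Shapley value of `target` in a game over `players` (Equation (1)).
    Each player is a tuple of underlying items; v takes a frozenset of items."""
    others = [p for p in players if p != target]
    n = len(players)
    phi = 0.0
    for k in range(len(others) + 1):
        for subset in combinations(others, k):
            items = frozenset(i for p in subset for i in p)
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            phi += weight * (v(items | frozenset(target)) - v(items))
    return phi

def interaction_benefit_pair(v, items, a, b):
    """B([S_ab]): Shapley value of the merged player (a, b), minus the Shapley
    values of a and b computed in the games where the other one is removed."""
    singles = [(i,) for i in items]
    game_merged = [p for p in singles if p not in ((a,), (b,))] + [(a, b)]
    game_without_b = [p for p in singles if p != (b,)]
    game_without_a = [p for p in singles if p != (a,)]
    phi_ab = exact_shapley(v, game_merged, (a, b))
    phi_a = exact_shapley(v, game_without_b, (a,))
    phi_b = exact_shapley(v, game_without_a, (b,))
    return phi_ab - (phi_a + phi_b)

# Toy award: "green" and "hand" only contribute when they appear together,
# so their interaction benefit is positive (a cooperative relationship).
v = lambda s: 1.0 if {"green", "hand"} <= s else 0.0
print(interaction_benefit_pair(v, ["he", "green", "hand"], "green", "hand"))  # > 0
```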

Extension to interactions among multiple players. We extend the two-player interaction to interactions among multiple players. When the game has n players, let us consider a subset of players S ⊊ N as a coalition, which is regarded as a new singleton player [S]. The interaction benefit of the coalition S is defined as follows (please see the supplementary material for more discussions).

B([S]) = φ_{v^{(N\S)∪{[S]}}}([S]) − ∑_{a∈S} φ_{v^{(N\S)∪{a}}}(a)    (2)

In this way, the interaction benefit measures the additional award brought by the singleton player [S] w.r.t. the individual contribution of each player computed in Equation (1) without requiring all players in S to appear together. The Shapley value φ_{v^{(N\S)∪{[S]}}}([S]) is computed only considering the set of players when we remove all players in S from N and add a new singleton player [S] to the game. Similarly, φ_{v^{(N\S)∪{a}}}(a) is computed only considering the set of players when we remove all players in S from N and add the player a. If B([S]) is greater/less than 0, interactions of players in S have positive/negative effects, revealing the cooperative/adversarial relationship among players.

Furthermore, players in S can be divided into two disjoint subsets S_1, S_2 (i.e. S_1 ∩ S_2 = ∅, S_1 ∪ S_2 = S). Accordingly, the interaction benefit can be decomposed into three terms:

B([S]) = B([S_1]) + B([S_2]) + B_between(S_1, S_2)    (3)

The first and second terms B([S_1]) and B([S_2]) indicate interaction benefits among players within S_1 and S_2, respectively. The third term B_between(S_1, S_2) indicates interaction benefits among players selected from both S_1 and S_2. B_between(S_1, S_2) will be introduced in detail in Section 3.2.

Properties of interaction benefits. Theoretically, the overall interaction benefit B([S]), S ⊆ N, can be decomposed into elementary interaction components I_{v^N}(S). The elementary interaction component was originally proposed in [10] (please see the supplementary material for details). The elementary interaction component I_{v^N}(S) measures the marginal benefit received from the coalition [S], from which benefits of all potential smaller coalitions S' ⊊ S are removed. For example, let S = {a, b, c}. Then, I_{v^N}(S) measures interactions caused by [S] = (a, b, c), and ignores all potential interactions caused by coalitions of (a, b), (a, c), (b, c), (a), (b), (c). Therefore, the elementary interaction component is formulated as follows.

I_{v^N}(S) = I_{v^{(N\S)∪{[S]}}}([S]) − ∑_{S'⊊S, S'≠∅} I_{v^{(N\S)∪S'}}(S')    (4)


In particular, for any singleton player [S], we have I_{v^{(N\S)∪{[S]}}}([S]) = φ_{v^{(N\S)∪{[S]}}}([S]). Thus, we can compute I_{v^N}(S) via dynamic programming. Therefore, B([S]) can be decomposed into elementary interaction components as follows (please see the supplementary material for the proof).

B([S]) = ∑_{S'⊆S, |S'|>1} I_{v^{(N\S)∪S'}}(S')    (5)
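The sketch below illustrates, on a toy game, how Equation (4) can be evaluated recursively with memoization (the dynamic programming mentioned above) and how the resulting elementary components sum to B([S]) as in Equation (5). It reuses the exact_shapley helper from the sketch in Section 3.1; as before, the code is illustrative rather than the paper's implementation.

```python
from itertools import chain, combinations

def nonempty_subsets(s):
    s = tuple(s)
    return chain.from_iterable(combinations(s, k) for k in range(1, len(s) + 1))

def decompose_interaction_benefit(v, all_items, S):
    """Returns (B([S]) computed directly via Equation (2),
                B([S]) recovered from elementary components via Equation (5))."""
    rest = [(p,) for p in all_items if p not in S]        # players outside S stay singletons
    memo = {}
    def I(sub):                                           # Equation (4), computed recursively
        sub = tuple(sorted(sub))
        if sub not in memo:
            phi_merged = exact_shapley(v, rest + [sub], sub)   # Shapley value of the singleton [sub]
            memo[sub] = phi_merged - sum(I(s2) for s2 in nonempty_subsets(sub) if s2 != sub)
        return memo[sub]
    S_merged = tuple(sorted(S))
    B_direct = exact_shapley(v, rest + [S_merged], S_merged) \
               - sum(exact_shapley(v, rest + [(a,)], (a,)) for a in S)
    B_from_components = sum(I(s2) for s2 in nonempty_subsets(S) if len(s2) > 1)
    return B_direct, B_from_components
```

On small toy games (e.g. the green/hand example above), the two returned values coincide, which is exactly the decomposition stated in Equation (5).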

3.2 Fine-Grained analysis of interactions between two sets of players

Interactions between two sets of players B_between(S_1, S_2) can be further decomposed into three parts ψ^inter, ψ^intra_1, ψ^intra_2. Please see the supplementary material for the proof.

B_between(S_1, S_2) = ψ^inter + ψ^intra_1 + ψ^intra_2    (6)

where

ψ^inter = ∑_{L⊆S, L⊄S_1, L⊄S_2, |L|>1} I_{v^{(N\S)∪L}}(L)    (7)

ψ^intra_1 = ∑_{L⊆S_1, |L|>1} I_{v^{(N\S)∪L}}(L) − ∑_{L⊆S_1, |L|>1} I_{v^{(N\S_1)∪L}}(L) = B([S_1])|_{N'=(N\S_2)} − B([S_1])    (8)

ψ^intra_2 = ∑_{L⊆S_2, |L|>1} I_{v^{(N\S)∪L}}(L) − ∑_{L⊆S_2, |L|>1} I_{v^{(N\S_2)∪L}}(L) = B([S_2])|_{N'=(N\S_1)} − B([S_2])    (9)

ψ^inter represents all potential interaction benefits caused by coalitions whose elements are selected from both S_1 and S_2. B([S_1])|_{N'=(N\S_2)} denotes the interaction benefit of the singleton [S_1] when the set of players in the game is N' = (N \ S_2). ψ^intra_1 indicates the difference of internal interactions among players in the set S_1 in the absence and presence of players in the set S_2.

3.3 Interactions encoded inside a DNN

We aim to analyze interactions among words, which are encoded inside a trained DNN. Given an input sentence, we construct a tree to disentangle and quantify interactions among input words.

Given an input sentence with n words, we first introduce the Shapley value of input words w.r.t. the prediction of the DNN. Here, we consider each word as a player, and the scalar output of a DNN as the aforementioned award in the game. If a DNN has a scalar output, we can take the scalar output as the award v. If the DNN outputs a vector for multi-category classification, we select the score before the softmax layer corresponding to the true class as the award score. To compute v(S), we mask the words in N \ S in the input sentence, and feed the modified input into the DNN. The embedding of each masked word is set to a dummy vector, which refers to a padding of the input to the DNN. Then, the Shapley value of each word a is approximated using a sampling-based method [2].
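As a hedged sketch, the award v(S) for a sentence could be computed as follows; `model`, `pad_embedding`, and the embedding-level interface are placeholders for whatever the analyzed DNN actually exposes, and the true-class score before the softmax layer is taken as the award.

```python
import torch

def award(model, word_embeddings, keep_ids, pad_embedding, true_class):
    """v(S): the pre-softmax score of the true class when only the words in S
    (indexed by keep_ids) are kept and all other words are replaced by a dummy
    (padding) embedding."""
    masked = word_embeddings.clone()              # [seq_len, dim]
    for t in range(masked.shape[0]):
        if t not in keep_ids:
            masked[t] = pad_embedding             # mask the words in N \ S
    with torch.no_grad():
        logits = model(masked.unsqueeze(0))       # [1, num_classes], scores before softmax
    return logits[0, true_class].item()           # scalar award v(S)
```

Passing a closure such as `lambda S: award(model, emb, S, pad, y)` to the sampling estimator sketched in Section 2 then yields the word-wise Shapley values.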

As Figure 1 shows, we construct a binary tree with n leaf nodes. Each leaf node represents a word, while each non-leaf node represents a constituent. Two adjacent nodes with strong interactions will be merged into a node in the next layer. For each sub-structure of a parent node S with two child nodes S_l and S_r, we can obtain the following equation according to Equation (3).

B([S]) = B([S_l]) + B([S_r]) + B_between(S_l, S_r)
       = B([S_ll]) + B([S_lr]) + B([S_rl]) + B([S_rr]) + B_between(S_ll, S_lr) + B_between(S_rl, S_rr) + B_between(S_l, S_r)
       = ∑_{H ∈ non-leaf nodes} B_between(H_l, H_r)    (10)

B([S]) can be recursively decomposed into the sum of interaction benefits between the two child nodes of all non-leaf nodes. Please see the supplementary material for the proof.

3.4 Metrics for interactions and the construction of a tree

Metrics for interactions. Besides B([S_l]), B([S_r]) and B_between(S_l, S_r), we define three additional metrics to provide an insightful analysis of interactions among words.


Figure 2: The instability of sampling-based Shapley values (a), and errors of the estimated interaction benefits (b), along with the number of sampling times (plots for BERT, ELMo, Transformer, CNN and LSTM on the SST-2 and CoLA datasets).

Figure 3: Interaction benefits between constituents. The interaction benefit B_ab is more significant than B_a'a and B_bb', so the tree merges a and b to form a coalition c.

Figure 4: A model in the AND-OR dataset. Each leaf node is a binary variable.

Let us consider a sub-structure of a parent node c (corresponding to the constituent S) and two child nodes a and b (corresponding to sub-constituents S_l and S_r). As Figure 3 shows, a' is the left adjacent node of a, and b' is the right adjacent node of b. We propose the metric density of modeled interactions for a candidate coalition such as {a, b}, denoted by r(a, b). This metric measures the ratio of interaction benefits between two adjacent nodes a and b to the total interaction benefits related to a and b. The density of modeled interactions is approximated as follows.

r(a, b) = (interaction benefits between a and b) / (total interaction benefits related to a and b) ≈ |B_ab| / (|B_ab| + |B_a'a| + |B_bb'| + |φ_a| + |φ_b|)    (11)

where B_ab = B_between(S_a, S_b); φ_a and φ_b can be approximated as φ_{v^{(N\S_a)∪{[S_a]}}}([S_a]) and φ_{v^{(N\S_b)∪{[S_b]}}}([S_b]), respectively. To measure interaction benefits that are not represented by the tree, a metric called the density of unmodeled interactions, denoted by s(a, b), is given.

s(a, b) = (unmodeled interaction benefits) / (total interaction benefits related to a and b) ≈ (|B_a'a| + |B_bb'|) / (|B_ab| + |B_a'a| + |B_bb'| + |φ_a| + |φ_b|)    (12)

Note that neither r(a, b) nor s(a, b) is an accurate estimation of the ratio of interactions. If two constituents are far away (e.g. not adjacent), their interaction benefits are usually small and sometimes can be neglected. Therefore, we only consider interaction benefits between adjacent nodes (i.e. B_a'a, B_ab, B_bb'). We have demonstrated that such neglect has very little effect in Table 4. In addition, according to Equation (6), we have B_between(S_l, S_r) = ψ^inter + ψ^intra_l + ψ^intra_r. Therefore, we define the following metric to measure the ratio of inter-constituent interactions.

t = |ψ^inter| / (|ψ^inter| + |ψ^intra_l + ψ^intra_r|)    (13)

Construction of a tree. We introduce a method to construct a tree structure. We use the metric r(a, b) in Equation (11) to quantify the significance of interactions between two adjacent constituents, and to guide the construction of the tree. We are given a trained DNN and an input sentence. The DNN can be trained for various tasks, such as sentiment classification and estimation of linguistic acceptability. We construct the tree in a bottom-up manner. Let Ω denote the set of current candidate nodes to merge. In the beginning, each word a_i of the input sentence is initialized as a leaf node, Ω = {a_1, a_2, ..., a_n}. In each step, we compute the value r(a_i, a_{i+1}) of each pair of adjacent nodes. Then, we select and merge the two adjacent nodes with the largest value of r(a_i, a_{i+1}). In this way, we use a greedy strategy to build up the tree, so that salient interactions among words are represented.
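The greedy construction can be summarized by the following sketch; `pairwise_B` and `node_phi` stand for (approximate) routines that return B_between between two adjacent nodes and the Shapley value of a node treated as a single player, and are assumptions of this sketch rather than functions defined in the paper.

```python
def build_interaction_tree(num_words, pairwise_B, node_phi):
    """Greedy bottom-up tree construction guided by r(a, b) (Equation (11)).
    Nodes are represented as tuples of word indices; returns the merge order,
    which defines the binary tree."""
    nodes = [(i,) for i in range(num_words)]              # leaf nodes, one per word

    def r(i):                                             # density of modeled interactions
        a, b = nodes[i], nodes[i + 1]
        B_ab = abs(pairwise_B(a, b))
        B_left = abs(pairwise_B(nodes[i - 1], a)) if i > 0 else 0.0
        B_right = abs(pairwise_B(b, nodes[i + 2])) if i + 2 < len(nodes) else 0.0
        denom = B_ab + B_left + B_right + abs(node_phi(a)) + abs(node_phi(b))
        return B_ab / denom if denom > 0 else 0.0

    merges = []
    while len(nodes) > 1:
        best = max(range(len(nodes) - 1), key=r)          # adjacent pair with the largest r
        merges.append((nodes[best], nodes[best + 1]))
        nodes[best:best + 2] = [nodes[best] + nodes[best + 1]]   # merge into a parent node
    return merges
```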

4 Experiments

• Instability and accuracy of Shapley values. According to Equation (1), the accurate computation of Shapley values was NP-hard. Castro et al. [2] proposed a sampling-based method to approximate


Table 1: The rate of incorrect extractions of word interactions, which verifies the assumption that effects of non-adjacent nodes can be neglected, on the SST-2 dataset (see the supplementary material for more results).

# of merges   BERT   ELMo   CNN    LSTM
1             0.00   0.02   0.01   0.06
2             0.00   0.06   0.02   0.13
3             0.00   0.12   0.02   0.19
4             0.03   0.15   0.07   0.15
5             0.03   0.16   0.07   0.14

Table 2: Fitness (the unlabeled F1) between the extracted trees from NLP models and syntactic trees, which demonstrates that interactions encoded in a DNN are not quite related to the syntactic structure.

Dataset   BERT     ELMo     CNN      LSTM     Transformer   Random   LB       RB
CoLA      39.85%   17.08%   16.69%   14.07%   3.79%         15.18%   2.68%    60.46%
SST-2     19.58%   18.65%   12.82%   32.68%   26.19%        19.95%   12.27%   47.35%

Table 3: Comparison of the correctness of the extracted interactions on the AND-OR dataset.

Method             F1      Recall
Our method         45.1%   96.8%
SHAP interaction   38.6%   80.9%
Random             13.2%   27.6%
LB                 8.4%    18.1%
RB                 4.3%    10.0%

Figure 5: Examples of the phenomenon that constituents with distinct emotional attitudes have strong interactions and are extracted in the first three steps for BERT learned on the SST-2 dataset.

Shapley values with polynomial computational complexity. In order to evaluate the instability of B([S]), we quantified the change of the instability of Shapley values along with the increase of the number of sampling times. Let us compute the Shapley value φ_v(a) for each word by sampling T times. We repeated such a procedure of computing Shapley values two times. Then, the instability of the computation of Shapley values was measured as 2‖φ − φ'‖ / (‖φ‖ + ‖φ'‖), where φ and φ' denote the two vectors of word-wise Shapley values computed in these two trials. The overall instability of Shapley values was reported as the average value of the instability over all sentences. Figure 2(a) shows the change of the instability of Shapley values along with the number of sampling times T. We found that when T ≥ 1000, we obtained stable Shapley values.
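For completeness, the instability metric itself is a direct transcription of the formula above, as in the following short sketch.

```python
import numpy as np

def instability(phi_1, phi_2):
    """2 * ||phi - phi'|| / (||phi|| + ||phi'||) for two repeated estimates of the
    word-wise Shapley values of the same sentence."""
    phi_1, phi_2 = np.asarray(phi_1, dtype=float), np.asarray(phi_2, dtype=float)
    return 2 * np.linalg.norm(phi_1 - phi_2) / (np.linalg.norm(phi_1) + np.linalg.norm(phi_2))
```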

In addition, we also evaluated the accuracy of the estimation of interaction benefits B([S]). The problem was that the ground-truth value of B([S]) was computed using the NP-hard brute-force method in Equation (1). Considering the NP-hard computational cost, we only conducted such evaluations on sentences with no more than 10 words. The average absolute difference (i.e. the error) between the estimated B([S]) and its ground-truth value over all sentences is reported in Figure 2(b). We found that the estimated interaction benefits were accurate enough when the number of sampling times was greater than 1000.

• Effects of non-adjacent nodes. To compute r(a, b), we only considered interaction benefits between two adjacent nodes, and assumed that interactions of non-adjacent nodes were much less significant than those of adjacent nodes. To verify this assumption, we defined the following metric to quantify the interaction benefit r'(a, c) between two non-adjacent nodes a and c, and evaluated whether the most salient interaction between adjacent nodes a, b detected by our method was more significant than interactions between all potential non-adjacent nodes. We use r'(a, c) = |B_ac| / (|B_ac| + |B_a'a| + |B_aa''| + |B_c'c| + |B_cc''| + |φ_a| + |φ_c|) to quantify the interaction density between non-adjacent nodes a and c, where a' and a'' were the left and right adjacent nodes of a, and c' and c'' were the left and right adjacent nodes of c. If the interaction density r(a, b) estimated by our method was higher than that between potential non-adjacent nodes, we considered this as a correct extraction of word interactions. Table 4 reports the rate of incorrect extractions of word interactions over all sentences during the construction of the tree (please see the supplementary material for more results). Based on this assumption, our method performed correctly in most cases.

• Correctness of the extracted interaction. We aimed to evaluate whether the extracted interaction objectively reflected the true interaction in the model, but the core challenge was that it was impossible


Figure 6: Trees extracted from BERT trained on the SST-2 dataset (left) and the CoLA dataset (right), respectively. Metrics are shown around each non-leaf node. Please see the supplementary material for more results of different models.

to annotate ground-truth interactions between words. It was because the human's understanding of word interactions was not necessarily equivalent to objective interactions encoded in a DNN. In this way, we constructed a dataset with ground-truth interactions between the inputs, as follows.

We constructed a dataset with 2048 models. Each model was implemented as a boolean function, whose input was 11 binary variables a_1, a_2, ..., a_11 ∈ {0, 1}. The output of the model was a binary variable computed by AND and OR operations organized in a tree structure (e.g. the tree in Figure 4). We evaluated whether the extracted interaction could reflect the true AND, OR constituents in the input.
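For intuition, one such model could look like the following sketch; the specific tree below is illustrative and is not claimed to be one of the 2048 models actually used.

```python
def and_or_model(a):
    """A boolean function over 11 binary inputs a[0..10], built from AND/OR nodes
    arranged in a tree (in the style of Figure 4); the exact structure is illustrative."""
    left = (a[0] and a[1]) or (a[2] and a[3] and a[4])   # an OR node over two AND constituents
    right = a[5] and a[6]                                # another AND constituent
    tail = a[7] and a[8] and a[9] and a[10]              # a third AND constituent
    return int(left and right and tail)
```

The AND/OR groups of such a model provide the ground-truth constituents against which the extracted tree is scored.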

The unlabeled F1 and unlabeled recall were used to evaluate the correctness of the extracted interaction. We compared our method with four baselines. The first baseline was [22], which defined a type of two-player interaction (i.e. the SHAP interaction), and we extended this technique to construct a tree, i.e. we merged the two adjacent nodes with the largest absolute SHAP interaction value. Since there was no other method to construct a tree for interactions, the other three baselines, Random, left-branching (LB) and right-branching (RB) trees (used in [29]), were selected to show the performance of trivial solutions. As Table 3 shows, our method outperformed all baselines. Note that, theoretically, a 100% F1 score was not achievable, because the extracted binary tree was naturally different from the ground-truth n-ary tree.

• Analysis of DNNs based on interactions. We learned DNNs for binary sentiment classification based on the SST-2 dataset [34], and learned DNNs to predict whether a sentence was linguistically acceptable based on the CoLA dataset [43]. For each task, we learned five DNNs, including the BERT [7], the ELMo [25], the CNN proposed in [16], the two-layer unidirectional LSTM [12], and the Transformer [39].

We used our method to extract tree structures that encoded interactions among words inside various trained DNNs. Figure 6 illustrates trees extracted from BERT on different tasks. (1) For the linguistic acceptability task, BERT usually combined noun phrases first, while the subject was combined almost last. ELMo and LSTM were prone to construct a tree with a "subject + verb-phrase + noun/adjective-phrase" structure. CNN usually extracted small constituents including a preposition or an article, e.g. "afraid of," "fix the." Transformer tended to encode interactions among adjacent constituents sequentially. (2) For the sentiment analysis task, as Figure 5 shows, most trees of these DNNs usually extracted constituents with distinct positive/negative emotional attitudes in early stages (please see the supplementary material for more results of different models).

Comparison of the fitness between the extracted trees and syntactic trees: Furthermore, we compared the fitness between the automatically extracted tree and the syntactic tree of the sentence. To this end, given an input sentence, we used the Berkeley Neural Parser [17] to generate the syntactic tree as the ground-truth.3 We used the unlabeled F1 to evaluate the fitness. Experimental results are reported in Table 2, which demonstrates that the logic of interactions modeled by the DNN was significantly different from human knowledge.

3The parser’s performance was good enough to take its parsing results as ground-truth.


In addition, our method can also be applied to build a tree for interactions w.r.t. the computation of features in an intermediate layer. Please see the supplementary material for details of such experiments.

5 Conclusions

In this paper, we have defined and extracted interaction benefits among words encoded in a DNN, and have used a tree structure to organize word interactions hierarchically. Besides, six metrics are defined to disentangle and quantify interactions among words. Our method can be regarded as a generic tool to objectively diagnose various DNNs for NLP tasks, which provides new insights into these DNNs.

References

[1] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6541–6549, 2017.

[2] Javier Castro, Daniel Gómez, and Juan Tejada. Polynomial calculation of the Shapley value based on sampling. Computers & Operations Research, 36(5):1726–1730, 2009.

[3] Jianbo Chen and Michael I Jordan. LS-Tree: Model interpretation when the data are linguistic. arXiv preprint arXiv:1902.04187, 2019.

[4] Jihun Choi, Kang Min Yoo, and Sang-goo Lee. Learning to compose task-specific tree structures. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[5] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. CoRR, abs/1609.01704, 2016. URL http://arxiv.org/abs/1609.01704.

[6] Tianyu Cui, Pekka Marttinen, and Samuel Kaski. Learning global pairwise interactions with Bayesian neural networks. arXiv: Learning, 2019.

[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[8] Andrew Drozdov, Patrick Verga, Mohit Yadav, Mohit Iyyer, and Andrew McCallum. Unsupervised latent tree induction with deep inside-outside recursive auto-encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1129–1141, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1116. URL https://www.aclweb.org/anthology/N19-1116.

[9] Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1024. URL https://www.aclweb.org/anthology/N16-1024.

[10] Michel Grabisch and Marc Roubens. An axiomatic approach to the concept of interaction among players in cooperative games. International Journal of Game Theory, 28:547–565, 1999.

[11] Peyton Greenside, Tyler Shimko, Polly Fordyce, and Anshul Kundaje. Discovering epistatic feature interactions from neural network models of regulatory DNA sequences. Bioinformatics, 34(17):i629–i637, 09 2018. ISSN 1367-4803. doi: 10.1093/bioinformatics/bty575. URL https://doi.org/10.1093/bioinformatics/bty575.

[12] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[13] Phu Mon Htut, Kyunghyun Cho, and Samuel R Bowman. Inducing constituency trees through neural machine translation. arXiv preprint arXiv:1909.10056, 2019.

[14] Joseph D Janizek, Pascal Sturmfels, and Su-In Lee. Explaining explanations: Axiomatic feature interactions for deep networks. arXiv preprint arXiv:2002.04138, 2020.


[15] Xisen Jin, Zhongyu Wei, Junyi Du, Xiangyang Xue, and Xiang Ren. Towards hierarchical importance attribution: Explaining compositional semantics for neural sequence models. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BkxRRkSKwr.

[16] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1181. URL https://www.aclweb.org/anthology/D14-1181.

[17] Nikita Kitaev and Dan Klein. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, July 2018. Association for Computational Linguistics.

[18] Nikita Kitaev, Steven Cao, and Dan Klein. Multilingual constituency parsing with self-attention and pre-training. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3499–3505, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1340. URL https://www.aclweb.org/anthology/P19-1340.

[19] Bowen Li, Lili Mou, and Frank Keller. An imitation learning approach to unsupervised parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3485–3492, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1338. URL https://www.aclweb.org/anthology/P19-1338.

[20] Xiang Lisa Li and Jason Eisner. Specializing word embeddings (for parsing) by information bottleneck. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2744–2754, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1276. URL https://www.aclweb.org/anthology/D19-1276.

[21] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in neural information processing systems, pages 4765–4774, 2017.

[22] Scott M Lundberg, Gabriel G Erion, and Su-In Lee. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888, 2018.

[23] Khalil Mrini, Franck Dernoncourt, Trung Bui, Walter Chang, and Ndapa Nakashole. Rethinking self-attention: An interpretable self-attentive encoder-decoder parser. arXiv preprint arXiv:1911.03875, 2019.

[24] W. James Murdoch, Peter J. Liu, and Bin Yu. Beyond word importance: Contextual decomposition to extract interactions from LSTMs. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rkRwGg-0Z.

[25] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1202. URL https://www.aclweb.org/anthology/N18-1202.

[26] Alessandro Raganato and Jörg Tiedemann. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 287–297, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5431. URL https://www.aclweb.org/anthology/W18-5431.

[27] Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B Viegas, Andy Coenen, Adam Pearce, and Been Kim. Visualizing and measuring the geometry of BERT. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8594–8603. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/9065-visualizing-and-measuring-the-geometry-of-bert.pdf.

[28] Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.

[29] Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron Courville. Neural language modeling by jointly learning syntax and lexicon. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rkgOLb-0W.


[30] Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. Ordered neurons: Integrating tree structures into recurrent neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1l6qiR5F7.

[31] Haoyue Shi, Hao Zhou, Jiaze Chen, and Lei Li. On tree-based neural sentence modeling. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018.

[32] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2014.

[33] Chandan Singh, W. James Murdoch, and Bin Yu. Hierarchical interpretations for neural network predictions. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SkEqro0ctQ.

[34] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642, 2013.

[35] Daria Sorokina, Rich Caruana, Mirek Riedewald, and Daniel Fink. Detecting statistical interactions with additive groves of trees. In Proceedings of the 25th international conference on Machine learning, pages 1000–1007, 2008.

[36] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566, Beijing, China, July 2015. Association for Computational Linguistics. doi: 10.3115/v1/P15-1150. URL https://www.aclweb.org/anthology/P15-1150.

[37] Michael Tsang, Dehua Cheng, and Yan Liu. Detecting statistical interactions from neural network weights. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ByOfBggRZ.

[38] Michael Tsang, Hanpeng Liu, Sanjay Purushotham, Pavankumar Murali, and Yan Liu. Neural interaction transparency (NIT): Disentangling learned interactions for improved interpretability. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5804–5813. Curran Associates, Inc., 2018.

[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.

[40] Elena Voita, Rico Sennrich, and Ivan Titov. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4396–4406, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1448. URL https://www.aclweb.org/anthology/D19-1448.

[41] Xing Wang, Zhaopeng Tu, Longyue Wang, and Shuming Shi. Self-attention with structural position representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1403–1409, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1145. URL https://www.aclweb.org/anthology/D19-1145.

[42] Yaushian Wang, Hung-Yi Lee, and Yun-Nung Chen. Tree transformer: Integrating tree structures into self-attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1061–1070, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1098. URL https://www.aclweb.org/anthology/D19-1098.

[43] Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471, 2018.

[44] Robert J Weber. Probabilistic values for games. The Shapley Value. Essays in Honor of Lloyd S. Shapley, pages 101–119, 1988.

[45] Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling. Learning to compose words into sentences with reinforcement learning. arXiv preprint arXiv:1611.09100, 2016.


[46] Dani Yogatama, Yishu Miao, Gabor Melis, Wang Ling, Adhiguna Kuncoro, Chris Dyer, and Phil Blunsom. Memory architectures in recurrent neural network language models. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SkFqf0lAZ.


A Properties of Shapley values

In this section, we discuss the four desirable properties of Shapley values, which are mentioned in Line 118 of the paper.

In game theory, the Shapley value is a unique value function that satisfies all the following axioms [44]:

• Linearity axiom: When two games v and w are combined into a single game v + w, their Shapley values can be added, i.e. φ_{(v+w)}(i) = φ_v(i) + φ_w(i) for each player i in N. Similarly, for any c ∈ R and i ∈ N, there will be φ_{cv}(i) = c·φ_v(i).

• Dummy axiom: A player i ∈ N is referred to as a dummy player if v(S ∪ {i}) = v(S) + v({i}) for each subset S ⊆ N \ {i}. Thus, if i ∈ N is a dummy player, φ_v(i) = v({i}), which indicates that player i has no interactions with any coalition.

• Symmetry axiom: Given two players i, j ∈ N, if v(S ∪ {i}) = v(S ∪ {j}) for each subset S ⊆ N \ {i, j}, then φ_v(i) = φ_v(j). In other words, if two players have the same interactions with all other players in the game, then they have the same Shapley value.

• Efficiency axiom: The sum of Shapley values of all players in N is equal to the award of all players in N (i.e. ∑_{i∈N} φ_v(i) = v(N)). This axiom guarantees that the overall award can be distributed to all players in the game.

B Interactions among multiple players

In this section, we mainly discuss how to extend interactions between two players to interactions among multiple players, which is mentioned in Line 151 of the paper.

Given a game v with n players, N = {1, 2, ..., n} is the set of n players. If player a and player b form a coalition S_ab = {a, b}, we regard the coalition as a new singleton player [S_ab]. We define the interaction benefit between players a and b as B([S_ab]).

B([S_ab]) = φ_{v^{(N\{a,b})∪{[S_ab]}}}([S_ab]) − (φ_{v^{N\{b}}}(a) + φ_{v^{N\{a}}}(b))    (14)

(N \ {a, b}) ∪ {[S_ab]} represents the set of players in N excluding a, b and being added a new singleton player [S_ab].

Then, we extend the interaction between two players to interactions among multiple players. For example, if a set of players S form a coalition, which is regarded as a new singleton player [S], the interaction benefit among players in the coalition is defined as follows (also see Equation (2) of the paper).

B([S]) = φ_{v^{(N\S)∪{[S]}}}([S]) − ∑_{a∈S} φ_{v^{(N\S)∪{a}}}(a)    (15)

C Elementary interaction components

In this section, we introduce the elementary interaction component in more detail, which is mentioned in Line 166 of the paper.

In a game v, the elementary interaction component of players among a coalition S ⊆ N is denoted by I_v(S). The definition of the elementary interaction component [10] is given as follows.

∀S ⊆ N, I_v(S) = ∑_{T⊆N\S} [(n − t − s)! t! / (n − s + 1)!] ∑_{L⊆S} (−1)^{s−l} v(L ∪ T)    (16)

where n, t, s, and l are the sizes of the corresponding sets N, T, S, and L, respectively. Note that for a singleton player, the elementary interaction component corresponds to the Shapley value, i.e. I_v({a}) = φ_v(a) where a is a singleton player.

If a set of players S form a coalition, we regard the coalition as a singleton player [S]. Let us take two players a and b as an example. If players a and b form a coalition S = {a, b}, it can be considered as a new singleton player [S]. The interaction benefit between players a and b is as follows.

I_v({a, b}) = I_{v^{(N\{a,b})∪{[{a,b}]}}}([{a, b}]) − I_{v^{N\{b}}}({a}) − I_{v^{N\{a}}}({b})
            = φ_{v^{(N\{a,b})∪{[{a,b}]}}}([{a, b}]) − φ_{v^{N\{b}}}(a) − φ_{v^{N\{a}}}(b)    (17)


Therefore, if the marginal award of the coalition, φ_{v^{(N\{a,b})∪{[{a,b}]}}}([{a, b}]), is larger than the sum of marginal awards of players a and b (i.e. φ_{v^{N\{b}}}(a) + φ_{v^{N\{a}}}(b)), players a and b are likely to cooperate in the game v. In other words, a positive (or negative) value of I_v({a, b}) indicates a positive (or negative) interaction between players a and b.

Besides, I_v(S) satisfies the following recursive axiom for each S ⊆ N, |S| > 1, where K is a non-empty proper subset of S.

I_v(S) = I_{v^{(N\S)∪{[S]}}}([S]) − ∑_{K⊊S, K≠∅} I_{v^{N\K}}(S \ K)
       = I_{v^{(N\S)∪{[S]}}}([S]) − ∑_{S'⊊S, S'≠∅} I_{v^{(N\S)∪S'}}(S')    (18)

D Proof of the relationship between interaction benefits and elementary interaction components

In this section, we mainly prove the relationship between interaction benefits and elementary interaction components, which is mentioned in Line 174 of the paper (also see Equation (5) of the paper).

According to Equation (15) and Equation (18), we can establish the relationship between the interaction benefit and the elementary interaction component.

B([S]) = φ_{v^{(N\S)∪{[S]}}}([S]) − ∑_{a∈S} φ_{v^{(N\S)∪{a}}}(a)
       = I_{v^{(N\S)∪{[S]}}}([S]) − ∑_{a∈S} I_{v^{(N\S)∪{a}}}({a})
       = ∑_{S'⊆S, |S'|>1} I_{v^{(N\S)∪S'}}(S')    (19)

E Proof of interactions between two sets of players Bbetween(S1, S2)

In this section, we mainly prove the fine-grained analysis of interactions between two sets of players, which is mentioned in Line 177 of the paper (also see Equation (6) of the paper).

Given a set of players S, we can split S into two subsets S_1 and S_2, S_1 ∩ S_2 = ∅, S_1 ∪ S_2 = S. According to Equation (19), we have:

B([S]) = ∑_{L⊆S, |L|>1} I_{v^{(N\S)∪L}}(L)    (20)

B([S_1]) = ∑_{L⊆S_1, |L|>1} I_{v^{(N\S_1)∪L}}(L)    (21)

B([S_2]) = ∑_{L⊆S_2, |L|>1} I_{v^{(N\S_2)∪L}}(L)    (22)

Therefore, we derive the following equation.

B([S]) = B([S_1]) + B([S_2]) + ∑_{L⊆S, |L|>1} I_{v^{(N\S)∪L}}(L) − ∑_{L⊆S_1, |L|>1} I_{v^{(N\S_1)∪L}}(L) − ∑_{L⊆S_2, |L|>1} I_{v^{(N\S_2)∪L}}(L)
       = B([S_1]) + B([S_2]) + B_between(S_1, S_2)    (23)


B_{between}(S_1, S_2) = \sum_{L\subseteq S,\,|L|>1} I_v^{(N\setminus S)\cup L}(L) - \sum_{L\subseteq S_1,\,|L|>1} I_v^{(N\setminus S_1)\cup L}(L) - \sum_{L\subseteq S_2,\,|L|>1} I_v^{(N\setminus S_2)\cup L}(L)
 = \sum_{L\subseteq S,\,L\not\subset S_1,\,L\not\subset S_2,\,|L|>1} I_v^{(N\setminus S)\cup L}(L)
   + \Big[ \sum_{L\subseteq S_1,\,|L|>1} I_v^{(N\setminus S)\cup L}(L) - \sum_{L\subseteq S_1,\,|L|>1} I_v^{(N\setminus S_1)\cup L}(L) \Big]
   + \Big[ \sum_{L\subseteq S_2,\,|L|>1} I_v^{(N\setminus S)\cup L}(L) - \sum_{L\subseteq S_2,\,|L|>1} I_v^{(N\setminus S_2)\cup L}(L) \Big]
 = \psi^{inter} + \psi^{intra}_1 + \psi^{intra}_2    (24)

where

\psi^{inter} = \sum_{L\subseteq S,\,L\not\subset S_1,\,L\not\subset S_2,\,|L|>1} I_v^{(N\setminus S)\cup L}(L)    (25)

\psi^{intra}_1 = \sum_{L\subseteq S_1,\,|L|>1} I_v^{(N\setminus S)\cup L}(L) - \sum_{L\subseteq S_1,\,|L|>1} I_v^{(N\setminus S_1)\cup L}(L) = B([S_1])\big|_{N'=(N\setminus S_2)} - B([S_1])    (26)

\psi^{intra}_2 = \sum_{L\subseteq S_2,\,|L|>1} I_v^{(N\setminus S)\cup L}(L) - \sum_{L\subseteq S_2,\,|L|>1} I_v^{(N\setminus S_2)\cup L}(L) = B([S_2])\big|_{N'=(N\setminus S_1)} - B([S_2])    (27)

Thus, B_{between}(S_1, S_2) reflects all interactions across players from S_1 and players from S_2.
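A minimal, self-contained Python sketch (our own toy verification, not the paper's code) of the decomposition in Equation (23) is given below; the permutation-based Shapley computation and the toy game v are assumptions made only for this illustration.

```python
# A minimal sketch (our own toy verification, not the paper's code) of Equation (23):
# B_between(S1, S2) = B([S1 union S2]) - B([S1]) - B([S2]).
from itertools import permutations
from math import factorial

def shapley(players, v, target):
    """Exact Shapley value of `target`, averaged over all orderings of `players`."""
    total = 0.0
    for order in permutations(players):
        before = frozenset(order[:order.index(target)])
        total += v(before | {target}) - v(before)
    return total / factorial(len(players))

def B(players, v, S):
    """Interaction benefit of treating S as one singleton player [S] (Equation (15))."""
    S, token = frozenset(S), "[S]"
    rest = [p for p in players if p not in S]
    def v_coal(T):
        T = frozenset(T)
        return v((T - {token}) | (S if token in T else frozenset()))
    return shapley(rest + [token], v_coal, token) - sum(
        shapley(rest + [a], v, a) for a in S)

def B_between(players, v, S1, S2):
    """Interactions across the two disjoint subsets S1 and S2 (Equation (23))."""
    return B(players, v, set(S1) | set(S2)) - B(players, v, S1) - B(players, v, S2)

# Toy game on N = {0, 1, 2, 3}: the reward appears only when player 0 (from S1) and
# player 2 (from S2) are both present, i.e. a purely cross-subset interaction.
players = [0, 1, 2, 3]
v = lambda T: 1.0 if {0, 2} <= frozenset(T) else 0.0
print(B_between(players, v, {0, 1}, {2, 3}))  # 1.0: all benefit lies across S1 and S2
```

In this toy game the only reward comes from player 0 (in S_1) meeting player 2 (in S_2), so B([S_1]) = B([S_2]) = 0 and the whole benefit of the full coalition is attributed to B_between(S_1, S_2).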

F Proof of the decomposition of B([S])

In this section, we mainly prove the decomposition of the interaction benefit B([S]), which is mentioned in Line 199 of the paper (also see Equation (10) of the paper).

B([S]) = B([S_l]) + B([S_r]) + B_{between}(S_l, S_r)
       = B([S_{ll}]) + B([S_{lr}]) + B([S_{rl}]) + B([S_{rr}]) + B_{between}(S_{ll}, S_{lr}) + B_{between}(S_{rl}, S_{rr}) + B_{between}(S_l, S_r)
       = \sum_{H\,\in\,\text{non-leaf nodes}} B_{between}(H_l, H_r)    (28)

Note that the interaction benefit of a leaf node is zero, so only interaction benefits between the two child nodes of non-leaf nodes will be left at the end of the recursion in Equation (28).
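For instance (our own illustration), consider a four-word sentence S = \{w_1, w_2, w_3, w_4\} parsed into the binary tree ((w_1\,w_2)(w_3\,w_4)). Since the interaction benefit of each singleton leaf is zero, Equation (28) gives

B([S]) = B_{between}(\{w_1, w_2\}, \{w_3, w_4\}) + B_{between}(\{w_1\}, \{w_2\}) + B_{between}(\{w_3\}, \{w_4\}),

i.e. the total interaction benefit of the sentence is distributed over the three non-leaf nodes of the tree.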

G Experimental Results

We provided more results of the “Effects of non-adjacent nodes” (Line 246) and “Analysis of DNNs based on interactions” (Line 277) experiments in Section 4 (Line 227) of the paper, as well as further experiments to analyze interactions encoded in intermediate layers, which were mentioned in Line 297 of the paper.

• Effects of non-adjacent nodes. Here, we provided more results of this experiment. Specifically, as Table 4 shows, we reported the rate of incorrect extractions of word interactions over all sentences during the construction of the tree on the SST-2 dataset and the CoLA dataset, respectively. Note that Table 4 was a complement to Table 1 of the paper.

• Analysis of DNNs based on interactions. (1) For the linguistic acceptability task, we provided more results of trees extracted from different NLP models on the SST-2 dataset and the CoLA dataset, respectively (see Figures 9–18, which were complements to Figure 6 of the paper). We found that BERT usually combined noun phrases first, while the subject was combined almost last. ELMo and LSTM were prone to construct a tree with a “subject + verb-phrase + noun/adjective-phrase” structure. CNN usually extracted small constituents including a preposition or an article. Transformer tended to encode interactions among adjacent constituents sequentially. (2) For the sentiment analysis task, we found that most trees of these DNNs usually extracted constituents with distinct positive/negative emotional attitudes in early stages of the construction of the tree. More examples of this phenomenon were given in Tables 5–8, which were complements to Figure 5 of the paper.


Table 4: The rate of incorrect extractions of word interactions, which verifies the assumption that effects of non-adjacent nodes can be neglected, on the SST-2 dataset (top) and the CoLA dataset (bottom).

SST-2 dataset
# of merges   BERT   ELMo   CNN    LSTM
1             0.00   0.02   0.01   0.06
2             0.00   0.06   0.02   0.13
3             0.00   0.12   0.02   0.19
4             0.03   0.15   0.07   0.15
5             0.03   0.16   0.07   0.14

CoLA dataset
# of merges   BERT   ELMo   CNN    LSTM
1             0.02   0.02   0.01   0.06
2             0.04   0.06   0.02   0.13
3             0.01   0.12   0.02   0.19
4             0.03   0.15   0.07   0.15
5             0.02   0.16   0.07   0.14

Table 5: Constituents extracted from ELMo in the first three steps during the construction of the tree.

sentence | 1st merge | 2nd merge | 3rd merge
I just loved every minute of this film . | just loved | this film | of this film
But it could have been worse . | it could | But it could | been worse
Too much of humor falls flat . | falls flat | falls flat . | humor falls flat .
There is no pleasure in watching a child suffer . | no pleasure | no pleasure in | no pleasure in watching
It all adds up to good fun . | good fun | good fun . | to good fun .

Table 6: Constituents extracted from CNN in the first three steps during the construction of the tree.

sentence | 1st merge | 2nd merge | 3rd merge
A deep and meaningful film . | A deep | meaningful film | meaningful film .
Dense with characters and contains some thrilling moments . | thrilling moments | Dense with | and contains
It treats women like idiots . | like idiots | treats women | It treats women
Just embarrassment and a vague sense of shame . | sense of | shame . | embarrassment and
Just one bad idea after another . | one bad | idea after | idea after another

Table 7: Constituents extracted from LSTM in the first three steps during the construction of the tree.

sentence | 1st merge | 2nd merge | 3rd merge
Just one bad idea after another . | bad idea | one bad idea | after another
But it could have been worse . | been worse | But it | could have
There is no pleasure in watching a child suffer . | no pleasure | is no pleasure | child suffer
Too slow , too long and too little happens . | too slow | too long | too slow ,
It treats women like idiots . | like idiots | like idiots . | It treats

Table 8: Constituents extracted from Transformer in the first three steps during the construction of the tree.

sentence | 1st merge | 2nd merge | 3rd merge
No way i can believe this load of junk . | this load | junk . | i can
Just one bad idea after another . | bad idea | one bad idea | Just one bad idea
There is no pleasure in watching a child suffer . | no pleasure | no pleasure in | is no pleasure in
I just loved every minute of this film . | loved every | just loved every | just loved every minute
But it could have been worse . | been worse | it could | have been worse


G.1 Interactions encoded in intermediate layers.

Besides interactions w.r.t. the network output, we used our method to analyze interactions w.r.t. the computation of an intermediate-layer feature f. More specifically, we used f_N and f_S to represent the intermediate-layer features when the input of the network was the set of words N and the set of words S in the sentence, respectively. Since the intermediate-layer features f_N and f_S were high-dimensional vectors, we used the scalar ⟨f_N, f_S⟩ / ||f_N|| to represent the award v(S), where ||f_N|| was used for normalization. In this way, we evaluated interactions encoded in different layers of BERT. The extracted trees from different intermediate layers of BERT are shown in Figure 7 (for the BERT learned on the SST-2 dataset) and Figure 8 (for the BERT learned on the CoLA dataset).
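The following Python sketch (our own illustration, not the paper's code) shows how such an intermediate-layer award v(S) = ⟨f_N, f_S⟩ / ||f_N|| can be computed; the helper get_feature is hypothetical and stands for any routine that returns an intermediate-layer feature of the network when only the words in the given subset are fed in.

```python
# A minimal sketch (our own illustration, not the paper's code) of the intermediate-layer
# award v(S) = <f_N, f_S> / ||f_N||. The helper `get_feature` is hypothetical: it stands
# for any routine that returns an intermediate-layer feature (e.g. a pooled hidden state
# of the 4th or 8th BERT layer) when only the words in `kept` are fed to the network.
import numpy as np

def intermediate_award(get_feature, sentence, kept):
    """Scalar award of the word subset `kept`, i.e. <f_N, f_S> / ||f_N||."""
    f_N = get_feature(sentence, kept=set(range(len(sentence))))  # feature for all words N
    f_S = get_feature(sentence, kept=set(kept))                  # feature for the subset S
    return float(np.dot(f_N, f_S) / np.linalg.norm(f_N))

# Stand-in feature extractor, used only to make this sketch runnable: it mimics an
# intermediate layer by mean-pooling fake word embeddings of the kept words.
def get_feature(sentence, kept):
    rng = np.random.default_rng(0)                 # fixed seed: the same fake embeddings
    emb = rng.standard_normal((len(sentence), 8))  # fake word embeddings
    mask = np.array([i in kept for i in range(len(sentence))], dtype=float)
    return (emb * mask[:, None]).mean(axis=0)

sentence = ["a", "deep", "and", "meaningful", "film", "."]
print(intermediate_award(get_feature, sentence, kept={1, 3}))  # award of {deep, meaningful}
```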

Figure 7: The extracted trees of interactions encoded in different intermediate layers (the 4th and the 8th layers) of BERT learned on the SST-2 dataset; the example sentence is “A deep and meaningful film .”


Figure 8: The extracted trees of interactions encoded in different intermediate layers (the 4th and the 8th layers) of BERT learned on the CoLA dataset; the example sentence is “John tried to be a good boy .”


Figure 9: Extracted trees of different NLP models (BERT, ELMo, LSTM, CNN, and Transformer) trained on the SST-2 dataset, for the sentence “But it could have been worse .”


Figure 10: Extracted trees of different NLP models (BERT, ELMo, LSTM, CNN, and Transformer) trained on the SST-2 dataset, for the sentence “It all adds up to good fun .”


Figure 11: Extracted trees of different NLP models (BERT, ELMo, LSTM, CNN, and Transformer) trained on the SST-2 dataset, for the sentence “Just one bad idea after another .”


Figure 12: Extracted trees of different NLP models (BERT, ELMo, LSTM, CNN, and Transformer) trained on the SST-2 dataset, for the sentence “Too much of the humor falls flat .”


Figure 13: Extracted trees of different NLP models (BERT, ELMo, LSTM, CNN, and Transformer) trained on the SST-2 dataset, for the sentence “This is so bad .”


Figure 14: Extracted trees of different NLP models (BERT, ELMo, LSTM, CNN, and Transformer) trained on the CoLA dataset, for the sentence “He can not have been working .”


Figure 15: Extracted trees of different NLP models (BERT, ELMo, LSTM, CNN, and Transformer) trained on the CoLA dataset, for the sentence “The farmer loaded the cart with apples .”


Figure 16: Extracted trees of different NLP models (BERT, ELMo, LSTM, CNN, and Transformer) trained on the CoLA dataset, for the sentence “Carmen bought Mary a dress .”


Figure 17: Extracted trees of different NLP models (BERT, ELMo, LSTM, CNN, and Transformer) trained on the CoLA dataset, for the sentence “Anson believed himself to be handsome .”


Figure 18: Extracted trees of different NLP models (BERT, ELMo, LSTM, CNN, and Transformer) trained on the CoLA dataset, for the sentence “Mike talked to my friends about politics yesterday .”
