Abstract - arXivdepending on the PLTM, we select an appropriate task (Section3), transform each statement in the set into its task-speciﬁc probe, and evaluate how well the PTLM can

RICA: Evaluating Robust Inference CapabilitiesBased on Commonsense Axioms

Pei Zhou Rahul Khanna Seyeon Lee Bill Yuchen Lin Daniel HoJay Pujara Xiang Ren

Department of Computer Science and Information Sciences InstituteUniversity of Southern California

{peiz,rahulkha,seyeonle,yuchen.lin,hsiaotuh,jpujara,xiangren}@usc.edu

AbstractPre-trained language models (PTLMs) haveachieved impressive performance on common-sense inference benchmarks, but their abilityto employ commonsense to make robust infer-ences, which is crucial for effective communi-cations with humans, is debated. In the pursuitof advancing fluid human-AI communication,we propose a new challenge, RICA: RobustInference capability based on CommonsenseAxioms, that evaluates robust commonsenseinference despite textual perturbations. To gen-erate data for this challenge, we develop asystematic and scalable procedure using com-monsense knowledge bases and probe PTLMsacross two different evaluation settings. Ex-tensive experiments on our generated probesets with more than 10k statements show thatPTLMs perform no better than random guess-ing on the zero-shot setting, are heavily im-pacted by statistical biases, and are not ro-bust to perturbation attacks. We also findthat fine-tuning on similar statements offer lim-ited gains, as PTLMs still fail to generalizeto unseen inferences. Our new large-scalebenchmark exposes a significant gap betweenPTLMs and human-level language understand-ing and offers a new challenge for PTLMs todemonstrate commonsense. 1

1 Introduction

Smooth and effective communication requires theability to make various forms of commonsense in-ferences (Clark and Brennan, 1991). When a friendtexts, “I’m going to perform in front of thousandstomorrow,” you may reply reassuringly, “Deepbreaths, you’ll do great!” Implicit to this com-munication is a commonsense logical inferencethat a person performing in front of a crowd mayfeel anxious, and that a reassuring remark helpsease anxiety (Figure 1). A growing body of liter-ature (Bosselut et al., 2019; Petroni et al., 2019)

1Links to our code and leaderboard are our project page:https://sites.google.com/usc.edu/rica.

Text Message:“I’m going to perform in front of thousands tomorrow…”----------------------------Explicit Knowledge:Friend is going to perform in front of many people tomorrow-----------------------------Commonsense Axiom:Performing in front of people can cause anxiety

Text Message“Deep breaths, you’ll do great!”------------------------------Inference Made:My friend might be anxious, let me try to calm them

Linguistically-VariedStatements of the same Commonsense Axiom• A person performing

in front of people might be nervous

• People performing in front of people find it harder to be relaxed

• It can be hard for someone to be calm when they’re about to perform

Figure 1: Human communication requires common-sense inferences. RICA evaluates such inferences viacommonsense axioms with many linguistic variations.

shows pre-trained language models (PTLMs) areable to catalog the types of commonsense relation-ships necessary for fluid communication. However,as we show in this paper, PTLMs have a shockinginability to leverage such commonsense knowledgeto make robust inferences.

Here we focus on two specific characteristicscrucial to human-AI communications: (1) com-bining commonsense knowledge with informationexpressed in natural language to make inferencesand (2) producing consistent inferences amidstlogically-equivalent yet linguistically-varied para-phrases. We focus on commonsense axioms, suchas “Performing in front of people can cause anxi-ety”, and exploit the flexibility of language to ex-press the same axiom in many forms — e.g., “Per-forming in front of people makes it hard to staycalm.” We test these characteristics by generatingself-contained commonsense statements involvingnovel entities (“Prindag is going to perform in frontof a crowd, so prindag is more likely to feel ner-vous.”) and adapt them to two evaluation settings.

Unfortunately, these two capabilities havelargely been overlooked by existing natural lan-guage inference (NLI) benchmarks (Williamset al., 2018) and knowledge probing studies fortransformer-based PTLMs (Vaswani et al., 2017;Devlin et al., 2019; Liu et al., 2019; Clark et al.,

arX

iv:2

005.

0078

2v3

[cs

.CL

] 1

3 A

pr 2

021

https://sites.google.com/usc.edu/rica

2020; Petroni et al., 2019). Most existing com-monsense reasoning-focused datasets (Zhang et al.,2017; Williams et al., 2018; Ostermann et al., 2019;Talmor et al., 2019b) do not systematically evaluaterobustness against linguistic variations, meaningwe cannot preclude the possibility that models arelearning spurious patterns to solve the needed task.

To fill this gap, we introduce RICA, a challengeto evaluate a model’s Robust Inference capabil-ity based on Commonsense Axioms in English.RICA draws on linguistic and cognitive scienceresearch (Schank and Abelson, 1977; Alshawi andvan Eijck, 1989) suggesting humans translate lan-guage to logical representations and reason usingthese abstract representations. RICA consists of aset of natural language statements in the “premise-conclusion” format that require reasoning usinglatent (implicit) commonsense relationships. Weformulate these abstract commonsense relations be-tween entities in first-order logic and refer to themas commonsense axioms (see Fig. 1). To insulatefrom PTLM biases and test human-like acquisitionability on new words (Carey and Bartlett, 1978),RICA uses novel entities, which are unseen stringsused to ground axioms into natural language. Fi-nally, we introduce a set of linguistic perturbationsthat paraphrase a commonsense axiom into naturallanguage in various forms.

Each component of RICA is generalizable, pro-viding a systematic procedure to generate myr-iad commonsense statements. In this paper, wegenerate 257k commonsense statements capturing43k axioms comprising different types of com-monsense, such as physical, material, and socialproperties. To demonstrate the quality of RICA,we create a manually-curated set of 1.6k probesbased on commonsense axioms, and also under-take a large-scale, crowdsourced verification of 10kgenerated statements with multiple human annota-tors. RICA is built by leveraging existing common-sense knowledge bases such as ConceptNet (Liuand Singh, 2004) and ATOMIC (Sap et al., 2019a)to support easy expansion. Furthermore, RICA’sstatements can be posed as popular PTLM taskssuch as masked word prediction or sentence prob-ability, making our benchmark widely applicable.RICA provides an extensible platform for evaluat-ing commonsense reasoning in a variety of PTLMs.

When evaluating state-of-the-art transformer-based PTLMs on the RICA probes following azero-shot setting (e.g., predicting “more” vs. “less”

Masked Word Prediction: A prindag is lighter than a fluberg, so a prindag should float [MASK]

than a fluberg. [more] or [less]

Novel Entity Pair: prindag and fluberg

Sentence Probability: A prindag is …, so …. more than a fluberg. [Correct]

VSA prindag is …, so … less than a fluberg. [Incorrect]

Textual Entailment: (S1: A prindag is lighter than a fluberg. S2: A prindag should float more

than a fluberg.): [Entailment](S1: A prindag is lighter than a fluberg. S2: A prindag should float less

than a fluberg.): [Contradiction](S1: A prindag … S2: A prindag is more skilled than a fluberg.): [Neutral]

Figure 2: Illustration of two evaluation settings with apair of novel entities used by RICA probes.

in the first example in Fig. 2), we consistentlydiscover their performance is on par with ran-dom guessing. Even after fine-tuning with largeamounts of labeled examples, PTLMs exhibit a sig-nificant gap relative to human performance. Wedrill down into this finding through (1) zero-shot,(2) low-resource, (3) high-resource, and (4) noisytraining settings and find that even with apprecia-ble performance gains on automatically generatedprobes in high resource settings, PTLMs still re-main on par with random guessing on difficult,human-curated RICA probes. To better understandthese results, we identify a pervasive intrinsic biasin PTLMs that demonstrates positivity bias in hu-man languages (Dodds et al., 2015).

Contributions. Our contributions are summarizedas follows: (1) We propose a new textual inferencechallenge, RICA. Our challenge tests PTLMs’ abil-ity to use commonsense axioms in many differentlinguistic forms, and can be framed in two prob-ing tasks for PTLMs. (2) We propose a system thatallows for the expansion of our challenge and show-case its usefulness by generating more than 257kprobes for RICA. (3) We conduct a large-scaleevaluation on a human-verified probe set (10k) anda more diverse, manually-curated probe set (1.6k)and find that current PTLMs perform similarly toa random baseline on our probes, are heavily im-pacted by statistical biases, and are not robust tolinguistic perturbations. We will release the codeand the probe dataset for future research.

2 The RICA ChallengeThe RICA challenge is posed as a set of tex-tual statements (sentences), each expressing a la-tent commonsense relationship in the “premise-conclusion” format (see Stage 5 in Fig. 3 for ex-amples). These statements use generated novelentities such as “prindag” and “fluberg” instead ofreal-world entities such as “thimble” and “elephant”to separate factual recalling from reasoning. Each

2. Logical TemplateRel(A,B,r) à

Comp(Prop(A,p), Prop(B,p))

1. Base Predicates• Property(A,p)• Relation(A,B,r)• Comparator(x,y)

5. Commonsense Statement SetA is B’s lawyer, so A is more knowledgeable about law than BB is A’s lawyer, so A is not more knowledgeable about law than BA is B’s lawyer, so A is less clueless about law than BA is B’s lawyer, so B is less informed on the law than A

…Replace A and B with Novel Entities: A à prindag B à fluberg

Relation Property

Lawyer Knowledge of Law

Doctor Takes care of people

… …

4. Created AxiomRel(A,B,lawyer) àComp(Prop(A,knowledge of law), Prop(B,knowledge of law))

Text Conversion Module

Perturbation Functions

3. Knowledge Table

Figure 3: Overview of the workflow of our statement construction process. The output is a set linguistically-diverse of masked sentences that follow the same reasoning template.

statement can be viewed as an instantiation of acommonsense principle, such as “smaller objectscannot contain larger objects.”

We express these commonsense principles infirst-order logic, further generalizing statementsthrough the use of general predicates for objectproperties (e.g., size) and object-object relations(e.g., containment). We turn these logical formulaeinto the associated textual statements using a setof perturbation operators and a conversion module,which together produce a logically-equivalent setof commonsense statements. In the rest of this sec-tion, we first provide a formal definition of RICAchallenge, then provide a detailed description ofthe statement construction process.

2.1 Challenge FormulationFormally, we define a commonsense axiom ai, ex-pressed via a first-order-logic (FOL) formula, as arelationship between entities that can be inferredusing commonsense knowledge (see Stage 4 inFig. 3). To test whether PTLMs understand an ax-iom ai, as well as examine their robustness to lin-guistic variations, we instantiate the axiom ai by aset of m syntactically-different commonsense state-ments

{si1, s

i2, ..., s

im

}, each expressing the same

logic of the axiom. Each statement takes the formof an inferential implication with a premise andconclusion. Finally, depending on the PLTM, weselect an appropriate task (Section 3), transformeach statement in the set into its task-specific probe,and evaluate how well the PTLM can leveragethe logic of ai to solve each of ai’s correspond-ing probes. We deem a model “successful” on thechallenge (or, understands the axioms) only if it canperform like humans on all probes of the axioms.

2.2 Statement Set Construction Process

This subsection introduces our proposed procedurefor the construction of commonsense inference

TERMINOLOGY Description

Logical Template (LT) General FOL formula constructed frompredicates and logical connectives

Arguments Specific entities and relations to fillpredicates in LTs

Axiom Commonsense relationship expressedin FOL by filling a LT with arguments

Commonsense Statement Natural language sentence afterconverting an axiom using a TT

Statement Set Statements that inform the sameaxiom after applying perturbations

Evaluation Instances/Probe A set of statements after adoptingto an evaluation task

Table 1: Description of terminology used in RICA.

statement sets for the challenge. A list of terminolo-gies and descriptions can be found in Table 1 andan overview of our workflow is shown in Figure 3.

Stage 1. Define Predicates. In FOL, predicatesare used to denote a property of objects or a re-lation between objects and every predicate sym-bol comes with an arity larger or equal to 1.We define three general high-level predicates thatserve as the backbone for the logical formulationsof our axioms: Property, Comparator and Re-lation. (1) PROP(A, p) represents that entity Ahas a certain property p. “PROP(A, glass)” indi-cates that A is made of glass. (2) REL(A,B, r)represents that A and B have a certain relationr. “REL(A,B, lawyer)” indicates that A is B’slawyer. (3) COMP(x, y) represents a compara-tive relationship between values x and y, where“COMP” will be replaced with comparison wordslike “better,” “more,” or “easier.” We will later de-fine multiple sub-types of these predicates to crawlfrom Knowledge Bases (KBs) to ensure a widecoverage of common knowledge.

Stage 2. Compose Logical Templates. We manu-ally create first-order logical formulae, referred toas logical templates (LT), using the predicates de-fined in Stage 1. Each formula takes the form of animplication, expressing an inference based on com-monsense knowledge. For example, REL(A,B, r) →

COMP(PROP(A, p), PROP(B, p)) expresses the logicalinference that can be made based on relationabout two entities A and B, and the comparisonof their common property. An instantiated ver-sion of this template can be REL(A,B, lawyer) →MORE(PROP(A, know law), PROP(B, know law)).

Stage 3. Populating Knowledge Tables. Materi-alizing the abstract relationships in a logical tem-plate requires connecting abstract logic to com-monsense knowledge. We define a structure calledknowledge table (KT) that contains valid argumentsto populate a specific LT and form a FOL represen-tation of the axiom. KTs are generated by crawl-ining commonsense KBs such as ConceptNet (Liuand Singh, 2004) and ATOMIC (Sap et al., 2019a).The first step of the crawling process is to narrowdown the predicates to specific types. For example,PROP is general enough to capture an entity’s ca-pabilities (e.g., knowledge of law) or its intrinsicproperties (e.g., hardness). We pre-define severaltype constraints for both properties (PROP) and re-lations (REL). For PROP, we consider Capability,Attribute, and Condition. For REL, we considerRole and Action. Note that these categories canbe extended for wider coverage of knowledge andallow our LTs to be adapted to a broader rangeof KB schemas. After specifying type constraints,we specify steps for crawling the arguments ei-ther from commonsense KBs such as Concept andATOMIC or general web KB such as Wikipedia. Inour example in Fig. 3, we can crawl occupationsfrom Wikipedia, and then query ConceptNet fortriples with the occupation as the subject and Ca-pableOf as the relationship to create a KT withprofessions and capabilities. We show more exam-ples in Appendix A.

Stage 4. Creating Axioms. Combining knowl-edge tables and logical templates allows us to gen-erate commonsense axioms at scale, which arepartially-filled LT formulae. For example in Fig. 3Stage 3, the arguments of predicates REL, PROP,and COMP are set in order to reflect the common-sense relationship between lawyer and knowledgeof law, while leaving the entities A and B un-grounded. Once the predicates are instantiated, wecall this partially-filled LT a commonsense axiom.

Stage 5. Generate Statement Sets. After fillingthe logical templates, each partially-filled LT rep-resents one commonsense axiom. To comprehen-sively challenge models’ understanding of an ab-stract axiom, we construct a statement set express-

LINGUISTIC OPERATOR EXAMPLE

NEGATION NEG(fit into) = not fit intoANTONYM ANT(fit into) = contain

PARAPHRASE PARA(fit into) = put into

PARAPHRASE INVERSIONPARA(ANT(fit into)) = Para(contain)

= hold inside

NEGATION ANTONYMNEG(ANT(fit into)) = NEG(contain)

= not contain

NEGATION PARAPHRASENEG(PARA(fit into)) = NEG(put into)

= not put into

NEGATION PARA INVNEG(PARA(ANT(fit into))) = NEG(PARA(

contain))= NEG(hold inside)=not hold inside

Table 2: Linguistic operators, logic, and examples.

ing the same axiom with different phrasings, i.e.,logically-equivalent yet linguistically-varied. Wedefine several perturbations to apply on the argu-ments from knowledge tables.

(1) Linguistic Operators. We define seven typesof linguistic operators to facilitate and formalizeperturbations, shown in Table 2. We construct thelast four operators by combining some of the sin-gle operators listed in the first three rows. Notethat for NEGATION, ANTONYM, PARAPHRASE IN-VERSION, and NEGATION PARAPHRASE types, thelogic of the original phrase is changed, so words inthe statements have to be changed accordingly. Forexample, if we apply ANTONYM to “fit into” in theprobe “A is smaller than B, so A is more likely tofit into B,” we will get “A is smaller than B, so A isless likely to contain B.” (2) Asymmetry Operator.Most of our logical templates use several strongly-ordered comparisons and relationships allowing usto introduce asymmetries that preserve meaning.For example, MORE(A,B) → ¬MORE(B,A)and REL(A,B, parent) → ¬REL(B,A, parent).Using this invariant, we can swap the positions oftwo entities for these predicates and the logic willalso be negated, so we denote this perturbation asASYM(P(A,B)) → P(B,A) = ¬P(A,B).

We apply the defined operators to the argumentsin the predicates to first form a set of partially-filledLTs (axioms) and use for a conversion module toconvert axioms to statements with diverse perturba-tions. In practice, this module can be a sequence-to-sequence (seq2seq) model (that takes in FOL andoutputs natural language text), or human-writtentemplates. Finally, commonsense axioms are gen-eral logical relationships that hold for all entities.To formulate specific commonsense statements, wegenerate specific novel entities. These entities arerandomly generated character strings from length3 to 12 that are not seen in the training data of thePTLMs. Using novel entities enables us to avoid

E.g.: A is made of glass, B is made of stone, so A is less opaque than B

LT1: Prop(A,p)ΛProp(B,q)àComp(Prop(A,m),Prop(B,m))

E.g.: A is B's priest, so A spends more time praying than B

LT2: Rel(A,B,r)à Comp(Prop(A,m),Prop(B,m))

E.g.: A is able to concentrate more than B , so A is more effective than B

LT4:Comp(Prop(A,m),Prop(B,m))àComp(Prop(A,n),Prop(B,n))

E.g.: A makes the varsity team but not B, so A is more skilled than B

LT3: Prop(A,p)Λ¬Prop(B,p)àComp(Prop(A,m),Prop(B,m))

E.g.: A turned on the heater, so A was cold before turning on the heater

LT5: Prop(A,p)à Comp(Prop(A,m),Prop(B,m))

Table 3: Example first-order logical templates we con-struct for our probes and an example for each template.

conflating fact-based recall with commonsense rea-soning when evaluating PTLMs.

3 Experiment Setup

3.1 Probing Tasks

To examine transformer-based PTLMs’ perfor-mance on RICA challenge, we draw conclusionsfrom evaluation results on two distinct probingtasks shown in Figure 2, described as follows.

Masked Word Prediction (MWP) Inspired bythe masked word prediction objective in BERT (De-vlin et al., 2019), we examine if the models can re-cover masked-out keywords in the statement giventhe remaining context. Since RICA’s statementstake the form of implications, we mask words inthe consequent to evaluate the inference perfor-mance, given the premise. Specifically, we chooseto mask the comparative words (from COMP) suchas ‘‘more/less” and “better/worse” since they notonly capture the commonsense relationship, butalso focus on masking positions where only a fewoptions are appropriate logically and syntactically.For example, in the statement “A is B’s parent, soA is more likely to care for B”, we mask “more”.

Sentence Probability (SP) evaluates if PTLMsassign higher probability for statements that ex-press commonsense axioms versus contradictorystatements. We input RICA statements into thePTLM, computing probabilities by multiplyingeach word’s probability conditioned on the previ-ous words, i.e., the left-to-right language modelingloss. For each RICA statement, we pair it with anincorrect (non-commonsense) statement by swap-ping the comparative word (i.e., the masked wordin MWP) with its opposite word. In the same ex-ample above, we create that probe’s pair as: “A isB’s parent, so A is less likely to care for B”.

3.2 Probing Data Details

Raw Set Following the process in Section 2, weuse the three high-level predicates to generate fiveLTs as shown in Table 3. Then we construct knowl-edge tables to fill in each template by crawlingfrom two commonsense KBs: ConceptNet (Liuand Singh, 2004) and ATOMIC (Sap et al., 2019a).Specifically, for each LT, we design 1 to 4 crawl-ing strategies based on the type constraints we im-pose on the predicates so that it covers multipleaspects of commonsense knowledge (for all strate-gies please see Table 4 in Appendix A). For ex-ample, the example shown for LT1 in Table 3 isabout inference of physical properties based on thematerial of two objects as we constrain PROP inthe premise to be materials. However, we can alsoconstrain PROP in the premise to be animals sothat we can use the same template to examine infer-ence of properties based on the animal types of Aand B, e.g., “A is a fish, B is a horse, so A is morelikely to be in the bottom of the sea than B.”

We have 11 type-constrained LTs and we pop-ulate the KTs using 11 human-designed crawlingstrategies, resulting in around 43k axioms. Thenwe apply the perturbation operators as describedbefore to form a set of 257k perturbed axioms.For this large set, we apply negation and asym-metry operators automatically by adding negationand switching the order of entities. To convertFOL axioms to text, we train a seq2seq modelbased on BART (Lewis et al., 2019) on 200 manu-ally converted axiom-text pairs covering each type-constrained LT and each perturbation type. Tocheck for language quality of the generated probesfrom BART, we manually inspect 5% of the 10kset and we found that only 4 out of 500 (0.8%)randomly sampled probes contain grammar or flu-ency issues. Since all probes follow a premise-conclusion format, we find that using 200 pairs offirst-order logic (FOL) and aligned text for fine-tuning BART is sufficient to convert FOL into text,both from our manual inspection and the crowd-sourcing verification of the generated probes. Wetried increasing the training set size and didn’t ob-serve a clear difference in quality. Finally, we re-place entities to unseen entities to form a set of257k commonsense statements.

Human-Verified Set To ensure the quality ofcrawled data, we conduct human evaluation us-ing Amazon Mechanical Turk (AMT) on 10k ofour collected 257k statements covering 1.7k dif-

ferent commonsense axioms. We present a pair ofstatements by flipping the comparative term in theoriginal statement to its opposite, and ask two anno-tators to choose the one that follows commonsense.If they disagree, we then take the pairs and do a sec-ond round of turking by asking three annotators anduse majority voting to decide what is the right sen-tence in the pair. We replace the original statementwith the opposite one if there are more annotatorsthink that the other one in the pair follows morecommonsense. The fleiss-kappa agreement (Fleiss,1971) on the two rounds of turking is 0.72 and 0.52,indicating that some statements are difficult for hu-mans to verify. Of 10k statements in the verifiedset, we sample 10% (1k with 170 axioms) that moreworkers tend to agree (probes from the first roundand from the second round that have more than 2people agreed on) to form our Human-Verified TestSet.Human-Curated Set To further challenge themodel on more flexible forms of text, we ask hu-mans to paraphrase axioms to contain all 7 typesof linguistic perturbations including compositeones that are hard to generate using automated ap-proaches. Specifically, given an axiom in FOL,a human annotator is asked to provide input thatholds pieces of a probe, for example the origi-nal conclusion (“A knows more about law thanB”), the paraphrased conclusion (“more knowledge-able. . . ”), double negation of the phrase (“not moreignorant. . . ”), etc, that are then programmatically(code included) joined together via templates toform all the probes of a probe set. We focus on80 axioms covering physical, social, and temporaltypes of knowledge and create 1.6k commonsensestatements.Joint Test Set Combines the Human-Curated andHuman-Verified sets, for a total of 2.6k statements.

3.3 Evaluation SettingsUsing the collected probe data introduced above,we consider four evaluation settings to examinemodels’ capabilities to perform robust inference onour dataset.1. Zero-Shot: In the zero-shot setting, we testmodels without any exposure to training data.2. Low-Resource: For low-resource setting, wefine-tune the models on 1k (10%) of the verified10k set to determine how a small amount of in-domain text influences PTLM performance.3. High-Resource: We use 90% of the verifiedtraining set (8k for training, 1k for validation). We

further increase the number of training instances byintroducing 5 different novel entities for each state-ment, yielding 40k training instances that include 5repetitions of each probe with different novel enti-ties, providing models more opportunities to learnpatterns in the training set.4. Raw Large-Scale Training: Finally, to analyzethe effects of training on an even larger but noisierset with the similar format. Starting from the rawset of 257k crawled statements, we sample 100kstatements from 17k axioms ensuring no overlapwith the test set.

3.4 Baseline MethodsWe evaluate multiple state-of-the-art transformer-based PTLMs covering both masked and generativelanguage models. For the masked word predic-tion task, we consider the BERT-base and BERT-large uncased (Devlin et al., 2019) and RoBERTa-base and RoBERTa-large (Liu et al., 2019), twotransformer-based (Vaswani et al., 2017) maskedlanguage models that show strong results on manybenchmarks. For sentence probability, we considerGPT-2 (Radford et al.), a unidirectional languagemodel for left-to-right language generation.

4 Results and AnalysisWe examine the performance of multiple languagemodels on each evaluation setting on our probedata, including zero-shot and fine-tuning on vari-ous splits, and present ablation studies to analyzeperformance more thoroughly. All of our resultsare averages of testing on 3 seeds.

4.1 Zero-Shot PerformanceAs shown in the first group of bars in Figures 4aand 4b, the average binary accuracies of all fivemodels on both MWP and SP tasks are around0.5, regardless of the test data. A random baselinethat chooses between the two comparative wordswould have an accuracy of 0.5. This shows thatthe tested models barely beat a random guessingbaseline without training.

Is Knowledge-Augmented Model Better? To seeif adding commonsense knowledge during traininghelps, we also test COMeT (Bosselut et al., 2019),a generative model for knowledge graph comple-tion whose backbone is GPT, but is further trainedon large knowledge bases such as ConceptNet (Liuand Singh, 2004) or ATOMIC (Sap et al., 2019a)(we test both)––we consider (and anecdotally ob-serve) COMET to posses knowledge of our com-

Zero-Shot Low-Res. High-Res. Noisy 100kData Settings

0

20

40

60

80

100Av

g Ac

cura

cy

4956

64

4549

6372

4653

59

8272

5260

8778

51 50

66

33

Performance on Different Settings on Human-Verified SetBERT-BaseBERT-LargeRoBERTa-BaseRoBERTa-LargeGPT2

(a) Performance on Human-Verified Set

Zero-Shot Low-Res. High-Res. Noisy 100kData Settings

0

20

40

60

80

100

Avg

Accu

racy

50 50 50 5049 50 50 5149 49 50 5050 51 525450 50 50 49

Performance on Different Settings on Human-Curated SetBERT-BaseBERT-LargeRoBERTa-BaseRoBERTa-LargeGPT2

(b) Performance on Human-Curated Set

0 10000 20000 30000 40000 50000 60000 70000 80000Number of Training Examples

50

55

60

65

70

75

80

85

Test

Acc

urac

y

Performance of RoBERTa-large on Different Test SetsHuman-VerifiedHuman-CuratedJoint Set

(c) Fine-tuning curve for RoBERTa-largeFigure 4: Performance of different transformer-based models on different settings of our data. BERT and RoBERTaare evaluated using masked word prediction and GPT2 is evaluated using sentence probability. Zero-shot perfor-mance is no better than random guessing. More data helps greatly for human-verified test set (10k) although noisytraining hinders the improvement. Increasing data does not help at all for our human-curated set.

monsense axioms. However, COMET performs onpar with standard GPT-2, demonstrating the dis-tinction between storing commonsense axioms andreasoning with axioms.

Human Performance To benchmark human per-formance, we sampled 5% of our joint test set con-sisting of both human-verified and human-curateddata and gathered answers from 20 subjects (an-notators) with diverse backgrounds who were notinvolved in the probe construction process. We con-sider this as zero-shot testing for humans as theyhave not seen the training set before. Humans ob-tained 91.7% accuracy, taking a majority vote foreach probe, with substantial inter-annotator agree-ment (0.768 Kappa (Cohen, 1960)).

4.2 Fine-tuning Performance

To study if poor performance in §4.1 is due to alack of exposure to RICA’s probe sets, we conductexperiments to fine-tune baseline language models.As introduced in §3.3, we consider training on low-resource data by sampling a subset of the verifiedset, high-resource by filling multiple novel entitiesin the verified set, and the noisy 100k data. Wefine-tune the base and large versions of BERT andRoBERTa using the same masking approach asthe MWP evaluation, and fine-tune GPT-2 on thecausal language modeling task. Details for thetraining are in the appendix.

More Data Helps on Human-Verified Set Fig-ure 4a, shows fine-tuning on our probe set helpsthe model, especially for RoBERTa-base andRoBERTa-large, where the high-resource settingsurpasses 80% accuracy. This demonstrates withenough data, PTLMs are able to reach near-humanperformance on generated axioms. The low-resource and raw training settings, however, pose

an enduring challenge for all tested models.Diversity of Curated Set Stumps All. Evaluat-ing models fine-tuned on human-verified data onthe human-curated set, where human editors pro-vide greater diversity in probes, tells a differentstory. The model accuracy (Figure 4b) remainsnear 50%, on par with random guessing, for allmodels in all settings. This indicates that expos-ing these models to numerous linguistically sim-ilar sentences does not improve inference ability.Furthermore, we evaluate training data sensitivityfor both the human-verified and human-curatedset (Figure 4c). We vary training set size from0 to 80k for RoBERTa-large. Our results showthat performance on the human-verified set sat-urates around 80% accuracy after 10k instances,but human-curated accuracy remains close to 50%throughout. This casts doubt on the model’s gener-alizability and whether the improved performancemay be due to pattern-matching seq2seq generation,not commonsense acquisition. An inability to im-prove on reasoning tasks after fine-tuning supportsthe challenging nature of RICA, which cannot betrivially solved by fine-tuning.4.3 Performance AnalysisPositivity Bias in PTLMs. We find a pattern thatwhen PTLMs are asked to infer a comparative rela-tionship between the property of two entities, themodel is heavily biased towards predicting wordsthat evoke positive emotions (positive valence) re-gardless of what commonsense axiom is embeddedin the statement. Figure 5a shows that the accu-racy for “positive valence” words such as “more”and “easier” is much higher than “negative valence”words such as “less” and “harder”. Fine-tuningon our probes, which have a balanced number ofsentences containing positive and negative compar-atives, helps mitigate this bias for RoBERTa-base

Human RB-base RB-base-ft RB-largeRB-large-ft GPT2 GPT2-ftPre-trained Language Models and Human

0

20

40

60

80

100

Avg

Accu

racy

90.0 87.2

45.8

89.9

60.5

77.7

50.9

93.8

12.5

58.7

12.2

49.8

20.9

48.3

Performance on Positive and Negative Comparators"Pos" Words"Neg" WordsRandom Guess

(a) Results on Positivity Bias

B-base B-large RB-base RB-large GPT-2Pre-trained Language Models

0

20

40

60

80

100

Avg

Accu

racy

Comparison of Accuracy between Novel Entities and Real Names

Novel Entities (Random Strings)English Names

(b) Ablation on Novel Entities

orig neg ant para para_inv

neg_antneg_pa

raneg_pa

ra_inv

Linguistic Perturbations

0

20

40

60

80

100

Avg

Accu

racy

Linguistic Variation Test on RoBERTa-Base Fine-tunedAvg AccuracyRandom Guess

(c) Results per Linguistic PerturbationFigure 5: Results of fine-tuning and the ablation study on novel entities. Shows that (a) models are biased topositive words, requiring fine-tuning to correct (b) poor performance persists after replacing novel entities withreal names––indicating the use of random strings is not hindering PTLMs’ abilities, (c) fine-tuning mitigates thebias towards positive words, but the inconsistency issue for linguistic variation become obvious.

and GPT-2. We conjecture that this may be due tothe frequency difference between positive valencewords and negative valence words related to re-porting bias in language (Gordon and Van Durme,2013). Dodds et al. (2015) shows a universal pos-itivety bias in human languages and to check ifour comparators also possess it, we use GoogleNgram Viewer 2 to find frequencies for the maskedwords, and confirm that the positive valence wordsare about 5 times more frequent than their negativecounterparts. This correlation supports the claimthat PTLMs do not reason as humans do, but areguided by statistical patterns. Our challenge revealsthis bias clearly for PTLMs and show that trainingon our data helps mitigate it.

Ablation of Novel Entities In order to ensurenovel entities used in RICA did not impact PTLMperformance, we conducted an ablation study on4,800 of our human-curated set (each statement isrepeated for 3 times). These probes involved socialcommonsense, where novel entities took the placeof names. We conduct an ablation by choosingcommon names instead of novel entities, producingprobes containing only previously-seen words. AsFigure 5b shows, the performance of all models inthree settings did not change significantly, stronglysuggesting that novel entities are not critical toPTLM performance. We conclude novel entitiesdo not introduce helpful or distracting sub-words.

Impact of Linguistic Perturbations Beforefine-tuning, a heavy bias for positive valence wordsinterfered with the perturbations analysis, sinceeach perturbation has a balanced number of posi-tive and negative valence words. After fine-tuning,however, the bias is mitigated and we find signif-icant variations in performance for different per-turbation types (Figure 5c). This shows that lan-

2https://books.google.com/ngrams

guage variation greatly affects a model’s capabilityto make inference on our commonsense probes,while suggesting models do not comprehend theaxioms. Interestingly, the composite perturbationtypes such as NEGATION ANTONYM are not neces-sarily harder for PTLMs, even though performanceon ANTONYM is the lowest. We speculate thatthe model is exploiting some pattern in NEGATIONANTONYM that is not present for just ANTONYM.

5 Related Work

Machine Commonsense has a long history inAI, with classical work primarily focusing onexecuting symbolic rules as hand-crafted pro-grams for machines to learn (Mccarthy, 1960).The majority of recent commonsense reasoningbenchmarks (Zellers et al., 2018; Talmor et al.,2019b; Bisk et al., 2020; Sap et al., 2019b) testa model’s ability to choose the correct option givena context and a question; PTLMs have reachedhigh performance on these benchmarks after fine-tuning. We differ from these benchmarks by fo-cusing on robustness to linguistic variation viaour linguistically-varied commonsense statements.RICA also challenges PTLMs on two evaluationtasks to better probe the PTLMs’ representations.

Reasoning-focused Inference There have beenmany benchmarks that focus on reasoning abili-ties in multiple tasks such as reading comprehen-sion (Huang et al., 2019; Yu et al., 2020), dialoguesystems (Cui et al., 2020), and NLI (Williams et al.,2018), that involve inferences on language. Re-cent work also aims to probe models in these tasksto see if reasoning is actually achieved (Richard-son and Sabharwal, 2020; Richardson et al., 2020).RICA focuses on two missing aspects in thesedatasets, namely, we formalize commonsense us-ing logical forms and propose perturbations to testrobustness of models.

https://books.google.com/ngrams

Probing PTLMs Prior works in analyzing the(commonsense) reasoning ability of PTLMs haveprimarily focused on creating probing tasks by gen-erating ad-hoc masked sentences either from knowl-edge bases (Petroni et al., 2019; Feldman et al.,2019) or existing datasets (Zhou et al., 2020; Tal-mor et al., 2019a; Kwon et al., 2019). This first lineof works aim to test if PTLMs can work as knowl-edge bases, i.e. can they retrieve factual knowledge;our work focuses on implicit commonsense rela-tions, not facts. We differ from the second lineof work by proposing a systematic procedure togenerate probes and evaluate for robustness. Clarket al. (2020) shows that PTLMs can emulate deduc-tive reasoning given explicit rules, but we focus onunstated commonsense relations.

6 ConclusionWe design RICA as an AI challenge to test ro-bust inference capabilities on linguistically-variedprobes covering different commonsense axioms.RICA is built on a systematic process to constructprobes using FOL formulae, perturbation operators,and novel entities. Following this approach, wegenerate and verify more than 10k statements from1.7k axioms and test multiple PTLMs in varioussettings. We find that PTLMs perform on par withrandom guessing on zero-shot setting, have strongpositivity bias, and are not robust under linguisticperturbations.

7 Ethical Considerations

Our work aims to pose a new challenge to improveeffective human-AI communications by collectingnew data in English, which benefits English speak-ers more. We have conducted human evaluationusing Amazon Mechanical Turks. We pay turkersaround $11 per hour, above the national minimumwage and engage in constructive discussions if theyhave concerns about the process. We also give eachannotation instance enough time so that we do notpressure annotators.

Our data construction process makes uses ofavailable public resources: Wikipedia, Concept-Net (Liu and Singh, 2004), and ATOMIC (Sapet al., 2019a), which could contain societal biases.Although our probes do not involve specific demo-graphics, we admit the possibility that biases inknowledge resources are included in our data. Wehave provided detailed descriptions about our dataconstruction process to minimize potential confu-

sions.

ReferencesHiyan Alshawi and Jan van Eijck. 1989. Logical forms

in the core language engine. In 27th Annual Meet-ing of the Association for Computational Linguistics,pages 25–32.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, JianfengGao, and Yejin Choi. 2020. Piqa: Reasoning aboutphysical commonsense in natural language. AAAI.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chai-tanya Malaviya, Asli Celikyilmaz, and Yejin Choi.2019. Comet: Commonsense transformers for au-tomatic knowledge graph construction. In Proceed-ings of the 57th Annual Meeting of the Associationfor Computational Linguistics, pages 4762–4779.

Susan Carey and Elsa Bartlett. 1978. Acquiring a sin-gle new word.

Herbert H Clark and Susan E Brennan. 1991. Ground-ing in communication.

Peter Clark, Oyvind Tafjord, and Kyle Richardson.2020. Transformers as soft reasoners over language.IJCAI 2020.

Jacob Cohen. 1960. A coefficient of agreement fornominal scales. Educational and psychological mea-surement, 20(1):37–46.

Leyang Cui, Yu Wu, Shujie Liu, Yue Zhang, and MingZhou. 2020. Mutual: A dataset for multi-turn dia-logue reasoning. arXiv preprint arXiv:2004.04494.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. Bert: Pre-training of deepbidirectional transformers for language understand-ing. In Proc. of NAACL-HLT.

Peter Sheridan Dodds, Eric M Clark, Suma Desu,Morgan R Frank, Andrew J Reagan, Jake RylandWilliams, Lewis Mitchell, Kameron Decker Harris,Isabel M Kloumann, James P Bagrow, et al. 2015.Human language reveals a universal positivity bias.Proceedings of the National Academy of Sciences,112(8):2389–2394.

Joshua Feldman, Joe Davison, and Alexander M. Rush.2019. Commonsense knowledge mining from pre-trained models. In EMNLP/IJCNLP.

Joseph L Fleiss. 1971. Measuring nominal scale agree-ment among many raters. Psychological bulletin,76(5):378.

Jonathan Gordon and Benjamin Van Durme. 2013. Re-porting bias and knowledge acquisition. In Proceed-ings of the 2013 workshop on Automated knowledgebase construction, pages 25–30.

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, andYejin Choi. 2019. Cosmos qa: Machine readingcomprehension with contextual commonsense rea-soning. arXiv preprint arXiv:1909.00277.

Sunjae Kwon, Cheongwoong Kang, Jiyeon Han, andJaesik Choi. 2019. Why do masked neural languagemodels still need common sense knowledge? arXivpreprint arXiv:1911.03024.

Mike Lewis, Yinhan Liu, Naman Goyal, Mar-jan Ghazvininejad, Abdelrahman Mohamed, OmerLevy, Ves Stoyanov, and Luke Zettlemoyer. 2019.Bart: Denoising sequence-to-sequence pre-trainingfor natural language generation, translation, andcomprehension. arXiv preprint arXiv:1910.13461.

Hugo Liu and Push Singh. 2004. Conceptnet—a practi-cal commonsense reasoning tool-kit. BT technologyjournal, 22(4):211–226.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,Luke Zettlemoyer, and Veselin Stoyanov. 2019.Roberta: A robustly optimized bert pretraining ap-proach. arXiv preprint arXiv:1907.11692.

John W. Mccarthy. 1960. Programs with commonsense.

Simon Ostermann, Sheng Zhang, Michael Roth, andPeter Clark. 2019. Commonsense inference in nat-ural language processing (COIN) - shared task re-port. In Proceedings of the First Workshop on Com-monsense Inference in Natural Language Process-ing, pages 66–74, Hong Kong, China. Associationfor Computational Linguistics.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel,Patrick Lewis, Anton Bakhtin, Yuxiang Wu, andAlexander Miller. 2019. Language models as knowl-edge bases? In Proceedings of the 2019 Confer-ence on Empirical Methods in Natural LanguageProcessing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan,Dario Amodei, and Ilya Sutskever. Language mod-els are unsupervised multitask learners. OpenAIBlog 1.8 (2019): 9.

Kyle Richardson, Hai Hu, Lawrence S Moss, andAshish Sabharwal. 2020. Probing natural languageinference models through semantic fragments. InAAAI, pages 8713–8721.

Kyle Richardson and Ashish Sabharwal. 2020. Whatdoes my qa model know? devising controlled probesusing expert knowledge. Transactions of the Associ-ation for Computational Linguistics, 8:572–588.

Maarten Sap, Ronan Le Bras, Emily Allaway, Chan-dra Bhagavatula, Nicholas Lourie, Hannah Rashkin,Brendan Roof, Noah A Smith, and Yejin Choi.2019a. Atomic: An atlas of machine commonsense

for if-then reasoning. In Proceedings of the AAAIConference on Artificial Intelligence, volume 33,pages 3027–3035.

Maarten Sap, Hannah Rashkin, Derek Chen, RonanLe Bras, and Yejin Choi. 2019b. Social iqa: Com-monsense reasoning about social interactions. InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP), pages 4453–4463.

Roger C Schank and Robert P Abelson. 1977. Scripts,plans, goals and understanding: An inquiry into hu-man knowledge structures.

Daniel Smilkov, Nikhil Thorat, Been Kim, FernandaViégas, and Martin Wattenberg. 2017. Smoothgrad:removing noise by adding noise. ICML WorkshopWorkshop on Visualization for Deep Learning.

Alon Talmor, Yanai Elazar, Yoav Goldberg, andJonathan Berant. 2019a. olmpics–on what lan-guage model pre-training captures. arXiv preprintarXiv:1912.13283.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, andJonathan Berant. 2019b. Commonsenseqa: A ques-tion answering challenge targeting commonsenseknowledge. In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics: Human LanguageTechnologies, Volume 1 (Long and Short Papers),pages 4149–4158.

Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, ŁukaszKaiser, and Illia Polosukhin. 2017. Attention is allyou need. In Advances in neural information pro-cessing systems, pages 5998–6008.

Eric Wallace, Jens Tuyls, Junlin Wang, Sanjay Subra-manian, Matt Gardner, and Sameer Singh. 2019. Al-lennlp interpret: A framework for explaining predic-tions of nlp models. In Proceedings of the 2019 Con-ference on Empirical Methods in Natural LanguageProcessing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 7–12.

Adina Williams, Nikita Nangia, and Samuel Bowman.2018. A broad-coverage challenge corpus for sen-tence understanding through inference. In Proceed-ings of the 2018 Conference of the North AmericanChapter of the Association for Computational Lin-guistics: Human Language Technologies, Volume1 (Long Papers), pages 1112–1122. Association forComputational Linguistics.

Weihao Yu, Zihang Jiang, Yanfei Dong, and JiashiFeng. 2020. Reclor: A reading comprehensiondataset requiring logical reasoning. arXiv preprintarXiv:2002.04326.

https://doi.org/10.18653/v1/D19-6007https://doi.org/10.18653/v1/D19-6007https://doi.org/10.18653/v1/D19-6007http://aclweb.org/anthology/N18-1101http://aclweb.org/anthology/N18-1101

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and YejinChoi. 2018. Swag: A large-scale adversarial datasetfor grounded commonsense inference. In Proceed-ings of the 2018 Conference on Empirical Methodsin Natural Language Processing (EMNLP).

Sheng Zhang, Rachel Rudinger, Kevin Duh, and Ben-jamin Van Durme. 2017. Ordinal common-sense in-ference. Transactions of the Association for Compu-tational Linguistics, 5:379–395.

Xuhui Zhou, Yue Zhang, Leyang Cui, and DandanHuang. 2020. Evaluating commonsense in pre-trained language models. AAAI.

A Probing Data Details

A.1 Raw Set Collection

We define 1-4 combinations of type constraints onthe predicates for each LT and designe crawlingstrategies accordingly using resources: Concept-Net, ATOMIC, and Wikipedia. Descriptions foreach of the 11 strategies are included in Table 4.All data and code for crawling strategies is includedin the supplementary materials.

A.2 Turking Details for Human-Verified Set

We present a pair of statements by flipping thecomparative term in the original statement to itsopposite, and ask two annotators to choose the onethat follows commonsense. The AMT page forturkers to annotate is shown in Figure 8. If theydisagree, we then take the pairs and do a secondround of turking by asking three annotators and usemajority voting to decide what is the right sentencein the pair. We replace the original statement withthe opposite one if there are more annotators thinkthat the other one in the pair follows more common-sense. In total, around 2500 pairs are sent to thesecond round and 300 pairs are flipped to the op-posite according to annotators. The estimated timefor completing each instance is around 20 secondsand we pay each instance $0.06, which translatesto around $11 per hour.

A.3 Human-Curated Set Details

We show all perturbations for one probe in Table 7and 60 of our human-curated set’s unperturbedstatement in Table 8 (for temporal refer to sup-plementary material). Full data is included in thesupplementary material.

B Experimental Details

Model Detail We test our probes on in total 10models, with the number of parameters and otherdetails in Table 6. For RoBERTa-base, RoBERTa-large, RoBERTa-large-MNLI, and BART-large-MNLI, we use the fairseq implementation 3. ForBERT-base-uncased, BERT-large-uncased, AL-BERT, and GPT-2, we use the huggingface trans-formers library 4. For COMET trained on Concept-

3https://github.com/pytorch/fairseq/tree/master/examples/roberta, https://github.com/pytorch/fairseq/tree/master/examples/bart

4https://huggingface.co/transformers/model_doc/albert.html, https://

Net and ATOMIC, we follow their github repo 5.Fine-tuning Details We fine-tune BERT-base-uncased, BERT-large-uncased, RoBERTa-base,and RoBERTa-large based on HappyTransform-ers 6 framework, using a consistent learning rateof 1e-5. We fine-tune GPT-2 based on hugging-face transformers library’s example code 7, us-ing their default parameters. We train them onone NVIDIA Quadro RTX 6000 GPU for 10epochs and after each epoch we test the fine-tuned model on our validation set, and savethe model with the highest validation set perfor-mance. Fine-tuning RoBERTa-base and GPT-2 takes around 30 minutes for each epoch andRoBERTa-large takes around 1 hour. The best vali-dation performance for RoBERTa-base is the fourthepoch, with perplexity 1.3378140926361084 andevaluation loss’: 0.2910370217429267. ForRoBERTa-large, the best is epoch 5, with per-plexity 1.3949965238571167 and evaluation loss0.3328918993473053. For GPT-2, the best isepoch 3, with perplexity 1.2786548795017285.Interpretation Details We use the AllenInterpretdemo 8. To identify important context words, werun the algorithm over the same probe for 5 times,each with different entity names, and select thewords that are ranked in the top 5 most importantwords at least 3 times. We find that the interpreta-tions are not very consistent as the most importantwords change when we input the same sentence formultiple times and will also change when differentnames are used, so we conduct 5 trials with differ-ent names for each probe and pick the words thatappear in the majority of the trials.

C Additional Studies

Does explicitly providing commonsense knowl-edge help? Shocked by the severe bias observedin PTLMs, we construct an easier set of probes,where we explicitly state all knowledge needed tomake the correct logical inference. We have twosettings for this test, one where parroting the now-provided commonsense fact is all that is needed to

huggingface.co/transformers/model_doc/gpt2.html

5https://github.com/atcbosselut/comet-commonsense

6https://github.com/EricFillion/happy-transformer

7https://github.com/huggingface/transformers/tree/master/examples/language-modeling

8https://demo.allennlp.org/masked-lm

https://github.com/pytorch/fairseq/tree/master/examples/robertahttps://github.com/pytorch/fairseq/tree/master/examples/robertahttps://github.com/pytorch/fairseq/tree/master/examples/barthttps://github.com/pytorch/fairseq/tree/master/examples/barthttps://github.com/pytorch/fairseq/tree/master/examples/barthttps://huggingface.co/transformers/model_doc/albert.htmlhttps://huggingface.co/transformers/model_doc/albert.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://huggingface.co/transformers/model_doc/gpt2.htmlhttps://github.com/atcbosselut/comet-commonsensehttps://github.com/atcbosselut/comet-commonsensehttps://github.com/EricFillion/happy-transformerhttps://github.com/EricFillion/happy-transformerhttps://github.com/huggingface/transformers/tree/master/examples/language-modelinghttps://github.com/huggingface/transformers/tree/master/examples/language-modelinghttps://github.com/huggingface/transformers/tree/master/examples/language-modelinghttps://demo.allennlp.org/masked-lm

Logical Template Type Constraint Crawling Strategy Example Axiom (Adjusted for readability)

1

Attribute-Material (126)Get a list of materials, and find properties in ConceptNet using HasProperty;then find a second material using NotMadeOf from the previous property.

Material(A, glass) and Material(B, wood), so More(clear(A), clear(B))

Attribute-Grade (132)Input an ordered list of numbers and form pairs and comparative relationsfollowing the orders

Grade(A, first) and Grade(B, third), so More(young(A), young(B))

Condition-Location (1k)Get a list of places with descending latitude from Wikipedia and form pairsby the relation (higher latitude is colder than lower latitude), consideringboth hemispheres.

Location(A,equator) and Location(B, north pole), soMore(living in hot condition(A), living in hot condition(B))

Attribute-Animal (10k)Get a list of animals from Wikipedia and find properties in ConceptNetusing CapableOf and LocateAt

Animal(A, fish) and Animal(B, horse), so More(locate at the bottomof the sea(A), locate at the bottom of the sea(B))

2Role (1.2k)

Input a list of occupations from Wikipedia and find properties in ConceptNetusing CapableOf

Priest(A,B), so More(Pray(A), Pray(B))

Action (10k)For every event in ATOMIC that involves two people, we find propertiesby following the Attribute edge in ATOMIC

Forces upon(A, B), so More(pushy(A), pushy(B))

3Action (10k)

For each event in ATOMIC that involves people, we find propertiesby following the Attribute edge in ATOMIC, note that we replace PersonXwith ”themself” and PersonY with ”another person” to sound natural

Assesses patient(A) and not Assesses patient(B),so More(analytical(A), analytical(B))

Capability-Physical (100)Input a list of adjectives describing objects, we find properties by followingUsedFor edge in ConceptNet

Tie knot(A) and not Tie knot(B), so More(elastic(A), elastic(B))

4Action (10k) Similarly to LT3-Event More(Concentrate(A), Concentrate(B)), so More(Effective(A), Effective(B))

Capability-Physical (100) Simiarly to LT3-PhysicalMore(square(A), square(B)), soBetter(divide two space(A), divide two space(B))

5 Attribute-Temporal (100)Manually come up with temporal ordered-events, included in Human-CuratedSet

entered the building(A),so before(outside(A))

Table 4: Crawling strategies for 11 type-constrained KT crawling for our Raw Set.

CATEGORY EXAMPLE

Physical (30%)A is smaller than B,so A is easier to put into a box than B.

Material (30%)A is made out of glass and B is made out of stone,so A is more transparent than B.

Social (30%)A makes the varsity team while B does not,so A is more skilled than B.

Temporal (10%)A was eating dinner,so A was hungry before eating dinner.

Table 5: Different types of commonsense axioms in-cluded in our human-curated probe set

Model DetailsBERT-base-uncased 12-layer, 768-hidden, 12-heads, 125M parametersBERT-large-uncased 24-layer, 1024-hidden, 16-heads, 355M parameters

RoBERTa-base 12-layer, 768-hidden, 12-heads, 125M parametersRoBERTa-large 24-layer, 1024-hidden, 16-heads, 355M parameters

ALBERT 12 repeating layer, 128 embedding,4096-hidden, 64-heads, 223M parametersGPT-2 12-layer, 768-hidden, 12-heads, 117M parameters.

COMET-Concept GPT-2 config + Traning on ConceptNetCOMET-ATOMIC GPT-2 config + Traning on ATOMICRoBERTa-L-MNLI 24-layer, 1024-hidden, 16-heads, 355M parameters

BART-L-MNLI 24-layer, 1024-hidden, 16-heads, 406M parameters+ a classification head

Table 6: Models tested and details.

correctly answer the probe, and the other where asimple negation switch of the commonsense fact isneeded to solve the probe:

• A is made of glass, B is made of stone, andglass is more transparent than stone, so A is[MASK] transparent than stone. (parrot)

• A is made of glass, B is made of stone, andglass is more transparent than stone, so A isnot [MASK] transparent than stone. (negationswitch)

We do this so to investigate whether RoBERTais actually able to use the provided commonsensefact, or is it possibly just pattern matching.

We add this piece of background knowledge tothe 60 original (unperturbed) statements along withtheir corresponding negated statements to form an“easier” setting of our task. As shown in Figure 6,we find two patterns PTLMs exhibit. For RoBERTa,ALBERT, and GPT-2, there is a stark difference inperformance between the two settings. When theyare being asked to parrot the commonsense fact,the performances jump up to near perfect scores,however when all they have to do is the equivalentof applying a negation operator on the fact, theyfail even worse than when they are not provided thefact. These results suggest that in the parrot easiersetting, it is likely RoBERTa, ALBERT, and GPT-2are just parroting the commonsense fact they see inthe sentence and not utilizing some sort of reason-ing ability, as when asked to perform the simplestof logical operations they fail. The other patternwe notice is that providing background knowledgedoes not help or hurt the performances for COMETand models tested on the textual entailment task.For COMET models, this may be due to the factthat COMET is trained on triplets from knowledgebases: given a head entity and a relation, predictthe tail entity, so it is not used to taking auxiliaryknowledge into its input. As for models fine-tunedon MNLI, the performance stays unchanged be-cause they still think most of the sentence pairs ofour probes are neutral, failing to grasp the embed-ded logical inference step.

Case Study on Contextual Clues To gain a bet-ter understanding on model behaviors, we con-duct analysis to identify context words that themodel relies on when solving our probes. We usethe SmoothGrad (Smilkov et al., 2017) algorithmfrom AllenNLP Interpret (Wallace et al., 2019) for

RB-base RB-large ALBERT GPT-2 CM-CN CM-A RB-NLI BART-NLIPre-trained Language Models

0.0

0.2

0.4

0.6

0.8

1.0Av

g Ac

cura

cyAccuracy on Probes with Background Knowledge

Background KnowledgeKnowledge + Negation Switch

Figure 6: Results of average performance of PTLMs whenwe provide background knowledge in our probes. ForRoBERTa, ALBERT, and GPT-2, we notice a huge increasein accuracy when provided knowledge. However, we find thatthey are merely parroting what appears in the context sincewhen we apply a negation in the probe, which should changethe prediction, they are simply predicting the same as thecontext shows, resulting in performance drop. For COMETmoddels and models tested on the NLI setting, we do notobserve the same pattern and it seems that adding knowledgedoes not help or hurt.

masked word prediction on our probes with realpeople’s names (the same set as our ablation study)using BERT. Aggregated across all probe sets, wefind that the three words BERT finds most impor-tant are: “than”, “not”, and “so”, which make senseas they are indicators for comparison, negation, andcausality, respectively.

“Not” and “so” are also the textual forms of thelogical connectives ¬ and →, which we use toconstruct LTs.

Furthermore, we find that BERT also regardsargument words (inputs into LTs’ predicates via aknowledge table, such as “lawyer” or “knowledgeof law”) important. The model finds on average3.4 words as contextual clues and 1.5 out of themare knowledge-specific argument words. This find-ing shows that a PTLM is able to recognize wordsspecific to the commonsense axiom tested. How-ever, noticing all these clues does not necessarilyaid in a PTLM’s ability to understand their logicalimplications, as evidenced by their performances.In other words, a PTLM, in this case BERT, knowsthat these words are important when making a deci-sion, but it does not know how to properly answerRICA’s questions based on these lexical signals.

10 20 30 40 50 60 70 80Equivalent Sets

0.0

0.2

0.4

0.6

0.8

1.0

Avg

Accu

racy

%

Avg Accuracy per Equivalent Set across all Perturbations

Avg Accuracy %Random Guess

Figure 7: Results of average accuracy of RoBERTa-large on MWP. We can see that the PTLM makesrandom-guessing like predictions across all sets.

linguistic perturbation asymmetric perturbation probe

original original A is wider than B, so A finds it harder to slip through cracks than Boriginal asymmetric premise B is wider than A, so A finds it easier to slip through cracks than Boriginal asymmetric conclusion A is wider than B, so B finds it easier to slip through cracks than Anegation original A is wider than B, so A does not find it easier to slip through cracks than Bnegation asymmetric premise B is wider than A, so A does not find it harder to slip through cracks than Bnegation asymmetric conclusion A is wider than B, so B does not find it harder to slip through cracks than Aantonym original A is wider than B, so A finds it easier to be blocked by cracks than Bantonym asymmetric premise B is wider than A, so A finds it harder to be blocked by cracks than Bantonym asymmetric conclusion A is wider than B, so B finds it harder to be blocked by cracks than Aparaphrase original A is wider than B, so A is worse at fitting into openings than Bparaphrase asymmetric premise B is wider than A, so A is better at fitting into openings than Bparaphrase asymmetric conclusion A is wider than B, so B is better at fitting into openings than Aparaphrase inversion original A is wider than B, so A is more impeded by small openings than Bparaphrase inversion asymmetric premise B is wider than A, so A is less impeded by small openings than Bparaphrase inversion asymmetric conclusion A is wider than B, so B is less impeded by small openings than Anegation antonym original A is wider than B, so A does not find it harder to be blocked by cracks than Bnegation antonym asymmetric premise B is wider than A, so A does not find it easier to be blocked by cracks than Bnegation antonym asymmetric conclusion A is wider than B, so B does not find it easier to be blocked by cracks than Anegation paraphrase original A is wider than B, so A is not better at fitting into openings than Bnegation paraphrase asymmetric premise B is wider than A, so A is not worse at fitting into openings than Bnegation paraphrase asymmetric conclusion A is wider than B, so B is not worse at fitting into openings than Anegation paraphrase inversion original A is wider than B, so A is not less impeded by small openings than Bnegation paraphrase inversion asymmetric premise B is wider than A, so A is not more impeded by small openings than Bnegation paraphrase inversion asymmetric conclusion A is wider than B, so B is not more impeded by small openings than A

Table 7: An example probe set––24 logically equivalent, but semantically different statements.

Figure 8: AMT annotation user interface for human verification on collected set.

template probe

1 A is made out of glass and B is made out of stone, so A is more transparent than B1 A is made out of cotton and B is made out of glass, so A is less sharp than B1 A is made out of concrete and B is made out of paper, so A should be more heavy than B1 A is made out of metal and B is made out of rubber, so A should float worse than B1 A is made out of glass and B is made out of copper, so A is more fragile than B1 A is made out of steel and B is made out of wool, so A is less soft than B1 A is made out of wood and B is made out of glass, so A is more combustible than B1 A is made out of sponge and B is made out of nylon, so A is worse for water resistance than B1 A is made out of copper and B is made out of concrete, so A is more ductile than B1 A is made out of metal and B is made out of cloth, so A is less foldable than B1 A is made out of chocolate and B is made out of metal, so A is harder to keep frozen than B1 A is made out of metal and B is made out of dirt, so A is a better electrical conductor than B1 A is made out of stone and B is made out of helium, so A has a harder time flying than B1 A is made out of honey and B is made out of water, so A is more viscous than B1 A is made out of titanium and B is made out of rubber, so A is less elastic than B1 A is made out of water and B is made out of methane, so A is more safe to store than B1 A is made out of mercury and B is made out of oxygen, so A is worse for your health to consume than B1 A is made out of wood and B is made out of fur, so A will more easily expand when heated than B1 A is made out of concrete and B is made out of wood, so A is less penetrable than B1 A is made out of glass and B is made out of tar, so A will reflect light better than B3 A makes the varsity team while B does not, so A is more skilled than B3 A is going to perform for people while B is not, so A finds it harder to be relaxed than B3 A won the competition while B did not, so A finds it easier to be happy than B4 A is able to concentrate more than B, so A finds it easier to be productive than B3 A bullies people while B does not, so A is less kind than B2 A is B’s boss, so A commands more respect than B4 A has more work than B, so A finds it harder to be at ease than B2 A has a crush on B, so A finds it harder to be relaxed around B4 A has more dedication than B, so A will have a harder time failing than B2 A is B’s parent, so A initially takes more care of B2 A is B’s doctor, so A takes more care of B2 A hurt B’s feelings, so A must be more insensitive than B2 A is B’s priest, so A spends less time sinning than B2 A is B’s lawyer, so A is less ignorant of the law than B4 A has a lot less money than B, so A is less financially secure than B4 A watches more tv shows than B, so A is more capable of understanding pop-culture references than B2 A always loses to B in tennis, so A is a less proficient tennis player than B2 A makes B late, so A has less reason to be annoyed at B4 A is a better friend than B, so A is more thoughtful than B2 A is B’s teacher, so A should be more informed than B4 A is smaller than B, so A is easier to put into a box than B4 A is heavier than B, so A is better at sinking than B4 A is denser than B, so A should withstand piercing more easily than B4 A is wider than B, so A finds it harder to slip through cracks than B4 A is hotter than B, so A should be easier to melt than B4 A is more elastic than B, so A should bounce better than B4 A is tougher than B, so A is harder to rip apart than B4 A is harder than B, so A is less comfortable than B4 A is taller than B, so A will cast a more lengthy shadow than B4 A is lighter than B, so A finds it harder to support weight than B4 A has less momentum than B, so A has a worse ability to damage on impact than B4 A is more luminous than B, so A is more dangerous to look at than B4 A is more soluble than B, so A is harder to discern in water than B4 A is more pungent than B, so A is easier to detect than B4 A is smaller than B, so A finds it harder to displace liquid in a tub than B4 A is shorter than B, so A is worse for keeping things out of reach than B4 A is larger than B, so A is more difficult to carry than B4 A is more taut than B, so A is worse at withstanding additional force than B4 A is much hotter than B, so A will be more painful to hold onto than B4 A is more magnetic than B, so A is harder to separate from another magnet than B

Table 8: Sixty probes and their corresponding logical templates

BERT-base BERT-largeEasy Set Hard Set Joint Set Easy Set Hard Set Joint Set

Zero-shot 49.32 49.7 49.56 49.15 49.35 49.27

Low resource

10% 56.38 49.85 52.37 63.08 50.18 55.1620% 59.3 50.5 53.89 65.64 50.37 56.2630% 55.09 50.22 52.1 65.3 50.27 56.0750% 60.62 50.33 54.3 63.48 50.25 55.35

Full resourcewith 1 novel entity 54.26 50.5 51.95 61.02 49.46 53.97

with 5 novel entities 64.48 50.58 55.94 69.92 49.72 57.51with 10 novel entities 82.58 51.18 63.3 85.93 50.64 64.74

100k 45.4 50.09 46.03 46.14 50.93 47.02

RoBERTa-base RoBERTa-largeEasy Set Hard Set Joint Set Easy Set Hard Set Joint Set

Zero-shot 53.6 49.39 55.32 51.81 49.69 58.79

Low resource

10% 59.08 49.36 53.1 59.93 50.65 54.2320% 60.53 49.41 53.7 64.31 50.31 55.7130% 60.79 49.93 54.21 65.47 50.95 56.5550% 64.11 49.4 55.07 75.88 52.58 61.57

Full resourcewith 1 novel entity 64.24 49.89 55.43 79.1 52.84 62.97

with 5 novel entities 82.32 49.93 62.43 87.03 51.69 65.32with 10 novel entities 85.63 50.64 64.14 84.74 51.08 64.07

100k 72.35 50.02 70.06 78.13 53.92 73.71

GPT2Easy Set Hard Set Joint Set

Zero-shot 51.27 49.6 50.1

Low resource

10% 50.57 49.91 50.2920% 48.22 49.33 49.0130% 48.18 49.2 48.9750% 50.44 49.87 49.96

Full resourcewith 1 novel entity 55.95 49.34 52.16

with 5 novel entities 66.3 49.53 55.98with 10 novel entities 71.6 49.91 58.25

100k 32.94 49.16 35.21

Documents

Abstract - arXivdepending on the PLTM, we select an appropriate task (Section3), transform each statement in the set into its task-speciﬁc probe, and evaluate how well the PTLM can