
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, pages 210–222, April 20, 2021. ©2021 Association for Computational Linguistics


“Sharks are not the threat humans are”: Argument Component Segmentation in School Student Essays

Tariq Alhindi*
Department of Computer Science, Columbia University
[email protected]

Debanjan Ghosh
Educational Testing Service
[email protected]

Abstract

Argument mining is often addressed by a pipeline method where segmentation of text into argumentative units is conducted first, followed by an argument component identification task. In this research, we apply token-level classification to identify claim and premise tokens from a new corpus of argumentative essays written by middle school students. To this end, we compare a variety of state-of-the-art models such as discrete features and deep learning architectures (e.g., BiLSTM networks and BERT-based architectures) to identify the argument components. We demonstrate that a BERT-based multi-task learning architecture (i.e., token and sentence level classification) adaptively pretrained on a relevant unlabeled dataset obtains the best results.

1 Introduction

Computational argument mining focuses on subtasks such as identifying the Argumentative Discourse Units (ADUs) (Peldszus and Stede, 2013), their nature (i.e., claim or premise), and the relation (i.e., support/attack) between them (Ghosh et al., 2014; Wacholder et al., 2014; Stab and Gurevych, 2014, 2017; Stede and Schneider, 2018; Nguyen and Litman, 2018; Lawrence and Reed, 2020). Argumentation is essential in academic writing as it enhances the logical reasoning as well as critical thinking capacities of students (Ghosh et al., 2020). Thus, in recent times, argument mining has been used to assess students' writing skills in essay scoring and to provide feedback on the writing (Song et al., 2014; Somasundaran et al., 2016; Wachsmuth et al., 2016; Zhang and Litman, 2020).

*Work performed during internship at ETS

Should Artificial Sweeteners be Banned in America?

Diet soda, sugar-free gum, and low-calorie sweeteners are what most people see as a way to sweeten up a day without the calories.

Despite the lack of calories, artificial sweeteners have multiple negative health effects.

Over the past century, science has made it possible to replicate food with fabricated alternatives that simplify weight loss.

Although many thought these new replacements would benefit overall health, there are more negative effects on manufactured food than the food they replaced.

Artificial sweeteners have a huge impact on current day society.

Legend: B-Claim, I-Claim, B-Premise, I-Premise, O-Arg

Figure 1: Excerpt from an annotated essay with Claim and Premise segments in BIO notation.

While the argument mining literature has addressed students' writing in the educational context, so far it has primarily addressed college-level writing (Blanchard et al., 2013; Persing and Ng, 2015; Beigman Klebanov et al., 2017; Eger et al., 2017), except for a very few studies (Attali and Burstein, 2006; Lugini et al., 2018; Correnti et al., 2020). Instead, in this paper, we concentrate on identifying arguments from essays written by middle school students. To this end, we perused a new corpus of 145 argumentative essays written by middle school students to identify the argument components. These essays are obtained from an educational app, Writing Mentor, which operates as a Google Docs add-on.1

1 https://mentormywriting.org


Normally, research that investigates college students' writing in the context of argument mining applies a pipeline of subtasks to first detect argument text units at the token level and subsequently classify the text units into argument components (Stab and Gurevych, 2017). However, middle school student essays are vastly different from college students' writing (detailed in Section 3). We argue they are more difficult to analyze through the pipeline approach due to run-on sentences, unsupported claims, and the presence of several claims in a sentence. Thus, instead of segmenting the text into argumentative/non-argumentative units first, we conduct a token-level classification task to identify the type of the argument component (e.g., B/I tokens from claims and premises) directly, joining the first and the second subtask into a single task. Figure 1 presents an excerpt from an annotated essay with its corresponding gold annotations of claims (e.g., "Diet soda ... the calories") and premises (e.g., "there are ... replaced"). The legend represents the tokens in the standard BIO notation.

We propose a detailed experimental setup to identify the argument components using both feature-based machine learning techniques and deep learning models. For the former, we use several structural, lexical, and syntactic features in a sequence classification framework using the Conditional Random Field (CRF) classifier (Lafferty et al., 2001). For the latter, we employ a BiLSTM network and, finally, a transformer architecture, BERT (Devlin et al., 2019), with its pretrained and task-specific fine-tuned models. We achieve the best result from a particular BERT architecture (a 7.5% accuracy improvement over the discrete features) that employs a joint multitask learning objective with an uncertainty-based weighting of two task-specific losses: (a) the main task of token-level sequence classification, and (b) the auxiliary task of sentence classification (i.e., whether a sentence contains an argument or not). We make the dataset (student essays) from our research publicly available.2

2 https://github.com/EducationalTestingService/argument-component-essays

2 Related Work

The majority of the prior work on argument mining addressed the problems of argument segmentation, component, and relation identification modeled as a pipeline of subtasks (Peldszus and Stede, 2015; Stab and Gurevych, 2017; Potash et al., 2017; Niculae et al., 2017), with a few exceptions (Schulz et al., 2019). However, most of the research assumes the availability of segmented argumentative units and addresses the subsequent tasks such as the classification of argumentative component types (Biran and Rambow, 2011; Stab and Gurevych, 2014; Park and Cardie, 2014), argument relations (Ghosh et al., 2016; Nguyen and Litman, 2016), and argument schemes (Hou and Jochim, 2017; Feng and Hirst, 2011).

Previous work on argument segmentation includes approaches that model the task as sentence classification into argumentative or non-argumentative sentences (Moens et al., 2007; Palau and Moens, 2009; Mochales and Moens, 2011; Rooney et al., 2012; Lippi and Torroni, 2015; Ajjour et al., 2017; Chakrabarty et al., 2019), or that define heuristics to identify argumentative segment boundaries (Madnani et al., 2012; Persing and Ng, 2015; Al-Khatib et al., 2016). Although we conduct segmentation, we focus on token-level classification to directly identify the argument component's type. This setup is related to Schulz et al. (2018), where the authors analyzed students' diagnostic reasoning skills via token-level identification. Our joint model using BERT is similar to that of Eger et al. (2017). However, we set the main task as the token-level classification, where the auxiliary task of argumentative sentence identification assists the main task to attain better performance.

As stated earlier, most of the research on argumentative writing in an educational context focuses on identifying argument structures (i.e., argument components and their relations) (Persing and Ng, 2016; Nguyen and Litman, 2016) as well as on predicting essay scores from features derived from the essays (e.g., number of claims and premises, number of supported claims, number of dangling claims) (Ghosh et al., 2016). Related investigations have also examined the challenge of scoring a certain dimension of essay quality, such as relevance to the prompt (Persing and Ng, 2014), opinions and their targets (Farra et al., 2015), and argument strength (Persing and Ng, 2015), among others.


The majority of the above research is conducted in the context of college-level writing. For instance, Nguyen and Litman (2018) investigated argument structures in the TOEFL11 corpus (Blanchard et al., 2013), which was also the main focus of Ghosh et al. (2016). Beigman Klebanov et al. (2017) and Persing and Ng (2015) analyzed the writing of university students, and Stab and Gurevych (2017) used data from "essayforum.com", where college entrance examination is the largest forum. Although writing quality in essays by young writers has been addressed (Attali and Burstein, 2006; Attali and Powers, 2008; Deane, 2014), identification of arguments was not part of these studies. Computational analysis of arguments from school students is in its infancy except for a few studies (Lugini et al., 2018; Afrin et al., 2020; Ghosh et al., 2020). We believe our dataset (Section 3) will be useful for researchers working at the intersection of argument mining and education.

3 Data

We obtained a large number of English essays (over 10K) through the Writing Mentor Educational App. This app is a Google Docs add-on designed to provide instructional writing support, especially for academic writing. The add-on allows students to write argumentative or narrative essays and receive feedback on their writing. We selected a subset of 145 argumentative essays for the annotation purpose. Essays were either self-labeled as "argumentative" or annotators identified their argumentative nature from the titles (e.g., "Should Artificial Sweeteners be Banned in America?").3 Essays covered various social issues related to climate change, veteran care, effects of wars, whether sharks are dangerous or not, etc. We denote this corpus as ARG2020 in the remaining sections of the paper. We employed three expert annotators (with academic and professional backgrounds in Linguistics and Education) to identify the argument components.

3 Other metadata reveal that middle school students write these essays. However, we did not use any such information while annotating the essays.

The annotators were instructed to read sentences from the essays and identify the claims (defined as "a potentially arguable statement that indicates a person is arguing for or arguing against something. Claims are not clarification or elaboration statements.") that the argument is in reference to. Next, once the claims were identified, the annotators annotated the premises (defined as "reasons given either for supporting or attacking the claims, making those claims more than mere assertions").4 Earlier research has addressed college-level writing, and even such resources are scarce except for a few corpora (Stab and Gurevych, 2017) (denoted as SG2017 in this paper). On the contrary, ARG2020 is based on middle school students' writing, which differs from the college-level writing of SG2017 in several aspects, briefly discussed in the next paragraph.

First, we notice that essays in SG2017 maintain distinct paragraphs such as the introduction (initiates the major claim in the essay), the conclusion (summarizes the arguments), and a few paragraphs in between that express many claims and their premises. However, essays written by middle school students do not always comply with such writing conventions of keeping a concrete introduction and conclusion paragraph; rather, they write many short paragraphs (7-8 paragraphs on average) per essay, while each paragraph contains multiple claims. Second, in general, claims in college essays in SG2017 are justified by one or multiple premises, whereas ARG2020 has many unsupported claims. For instance, the excerpt from the annotated essay in Figure 1 contains two unsupported claims (e.g., "Diet soda, sugar ... without the calories" and "artificial sweeteners ... health effects"). Third, middle school students often put opinions (e.g., "Sugar substitutes produce sweet food without feeling guilty consequences") or matter-of-fact statements (e.g., "Even canned food and dairy products can be artificially sweetened") that are not argumentative claims, but structurally they are identical to claims. Fourth, multiple claims frequently appear in a single sentence, separated by discourse markers or commas. Fifth, many essays contain run-on sentences (e.g., "this is hard on the family, they have a hard time adjusting") that make the task of parsing even trickier.

4 Definitions are from Stab and Gurevych (2017) and Argument: Claims, Reasons, Evidence - Department of Communication, University of Pittsburgh (https://bit.ly/396Ap3H).


Corpora   Split     Essays  B-Claim  I-Claim  B-Premise  I-Premise  O-Arg   Total
ARG2020   training   100    1,780    21,966     317       3,552     51,478  79,093
          dev         10      171     1,823      32         371      4,008   6,405
          test        35      662     8,207      92       1,018     14,987  24,966

Table 1: Token counts of each category in the training, dev, and test sets of ARG2020.

We argue these reasons make identifying argument claims and premises from ARG2020 more challenging.

The annotators were presented with specific guidelines and examples for annotation. We conducted a pilot task first, where all three annotators annotated ten essays and exchanged their notes for calibration. Following that, we continued pair-wise annotation tasks (30 essays for each pair of annotators), and finally, individual annotators annotated the remaining essays. Since the annotation task involves identifying each argumentative component's words, we have to account for fuzzy boundaries (e.g., in-claim vs. not-in-claim tokens) to measure the inter-annotator agreement (IAA). We considered Krippendorff's α (Krippendorff, 2004) metric to compute the IAA. We measure the α between each pair of annotators and report the average. For claim, we have a modest agreement of 0.71 that is comparable to Stab and Gurevych (2014), and for premise, we have a high agreement of 0.90.

Out of the 145 essays from ARG2020, we randomly assign 100 essays for training, 10 essays for dev, and the remaining 35 essays for test. Table 1 presents the data statistics in the standard BIO format. We find the number of claims is almost six times the number of premises, showing that the middle school students often fail to justify their proposed claims. We leave identifying opinions and argumentative relations (support/attack) as future work.
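The pairwise token-level agreement computation can be illustrated with a small, hedged sketch; the annotation triples below are toy examples that only show the data format, not corpus annotations.

```python
# Hedged sketch: pairwise token-level agreement with Krippendorff's alpha, treating
# each token as an item labeled in-claim / not-in-claim (nltk's default binary distance).
from nltk.metrics.agreement import AnnotationTask

# (annotator, item, label) triples for one pair of annotators (toy data)
data = [("a1", "tok_0", "in-claim"),     ("a2", "tok_0", "in-claim"),
        ("a1", "tok_1", "in-claim"),     ("a2", "tok_1", "not-in-claim"),
        ("a1", "tok_2", "not-in-claim"), ("a2", "tok_2", "not-in-claim")]

print(AnnotationTask(data=data).alpha())   # averaged over annotator pairs in practice
```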

4 Experimental Setup

The majority of argumentation research first segments the text into argumentative and non-argumentative segments and then identifies structures such as components and relations (Stab and Gurevych, 2017). Petasis (2019) mentioned that the granularity of computational approaches addressing the second task of argument component identification is diverse, because some approaches detect components at the clause level (e.g., approaches focused on the SG2017 corpus (Stab and Gurevych, 2014, 2017; Ajjour et al., 2017; Eger et al., 2017)) and others at the sentence level (Chakrabarty et al., 2019; Daxenberger et al., 2017). We avoided both approaches for the following two reasons. First, middle school student essays often contain run-on sentences, and it is unclear how to handle clause-level annotations because parsing might be inaccurate. Second, around 62% of the premises in the training set appear in the same sentence as their claims. This makes sentence-level classification into either claim or premise impractical (Figure 1 contains one such example). Thus, instead of relying on the pipeline approach, we tackle the problem by identifying argument components via token-level classification, akin to Schulz et al. (2019). Our unit of sequence tagging is a sentence, unlike a passage (Eger et al., 2017). We apply a five-way token-classification (or sequence tagging) task using the standard BIO notation for the claim and premise tokens (see Table 1). Any token that is not "B-Claim", "I-Claim", "B-Premise", or "I-Premise" is denoted as "O-Arg". As expected, the number of "O-Arg" tokens is much larger than that of the other categories (see Table 1).
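As a concrete illustration of the five-way tagging scheme, the following minimal sketch maps annotated component spans onto per-token BIO labels; the span format and the helper name spans_to_bio are illustrative, not the authors' preprocessing code.

```python
# Minimal sketch of the five-way BIO labeling: map annotated component spans onto labels.
def spans_to_bio(tokens, spans):
    """tokens: list of words; spans: (start, end, 'Claim'|'Premise') with end exclusive."""
    labels = ["O-Arg"] * len(tokens)
    for start, end, comp in spans:
        labels[start] = "B-" + comp
        for i in range(start + 1, end):
            labels[i] = "I-" + comp
    return labels

tokens = "Despite the lack of calories , artificial sweeteners have multiple negative health effects .".split()
# Claim span per the Figure 1 excerpt: "artificial sweeteners ... health effects"
print(spans_to_bio(tokens, [(6, 13, "Claim")]))
```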

We explore three separate machine learning approaches that are well established for token-based classification. First, we experiment with the sequence classifier Conditional Random Field (CRF) that exploits state-of-the-art discrete features. Second, we implement a BiLSTM network (with and without CRF) based on BERT embeddings. Finally, we experiment with fine-tuned BERT models with and without a multitask learning setting.

4.1 Feature-based Models

Akin to Stab and Gurevych (2017), we experiment with three groups of discrete features: structural, syntactic, and lexico-syntactic, with some modifications. In addition, we experiment with embedding features extracted from the contextualized pre-trained language model BERT.


Discrete Features  For each token in a given essay, we extract structural features that include token position (e.g., the relative and absolute position of the token in the sentence, paragraph, and essay from the beginning of the essay) and punctuation features (e.g., whether the token is a punctuation mark, or is preceded or succeeded by punctuation). Such position features have been shown to be useful in identifying claims and premises against sentences that do not contain any argument (Stab and Gurevych, 2017). We also extract syntactic features for each token that include the part-of-speech tag of the token and the normalized length to the lowest common ancestor (LCA) of the token and its preceding (and succeeding) token in the parse tree. In contrast with Stab and Gurevych (2017), we use dependency parsing as the base for the syntactic features rather than constituency parsing. Finally, we extract lexico-syntactic features (denoted as lexSyn in Table 2) that include the dependency relation governing the token in the dependency parse tree, and the token itself plus its governing dependency relation as another feature. This also differs from Stab and Gurevych (2017), where the authors used a lexicalized parse tree (Collins, 2003) to generate their lexico-syntactic features. These features are effective in identifying the argumentative discourse units. We also observed in pilot experiments that using dependency parse trees as a basis for the lexico-syntactic features yields better results than constituency parse trees.
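A hedged sketch of how such per-token features could be assembled with a dependency parser follows; spaCy is an assumption (the paper does not name its parser), and the LCA-distance feature is omitted for brevity.

```python
# Hedged sketch of per-token discrete features: structural (position, punctuation
# context), syntactic (POS tag), and lexico-syntactic (dependency relation, token+relation).
import spacy

nlp = spacy.load("en_core_web_sm")   # requires the small English model to be installed

def token_features(doc, i):
    tok = doc[i]
    return {
        "rel_pos_in_sent": i / len(doc),                               # structural
        "is_first_token": i == 0,
        "prev_is_punct": i > 0 and doc[i - 1].is_punct,
        "next_is_punct": i < len(doc) - 1 and doc[i + 1].is_punct,
        "pos": tok.tag_,                                               # syntactic
        "dep": tok.dep_,                                               # lexico-syntactic
        "tok+dep": tok.lower_ + "_" + tok.dep_,
    }

doc = nlp("Artificial sweeteners have a huge impact on current day society.")
features = [token_features(doc, i) for i in range(len(doc))]   # one dict per token for the CRF
```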

Embedding Features from BERT  BERT (Devlin et al., 2019), a bidirectional transformer model, has achieved state-of-the-art performance on many NLP tasks. BERT is initially trained on the tasks of masked language modeling (MLM) and next sentence prediction (NSP) over the very large corpora of English Wikipedia and BooksCorpus. During its training, a special token "[CLS]" is added to the beginning of each training instance, and "[SEP]" tokens are added to indicate the end of the utterance(s) and to separate them in the case of two utterances.

Pretrained BERT ("bert-base-uncased") can be used directly by extracting the embeddings of the token representations. We use the average embeddings of the top four layers, as suggested in Devlin et al. (2019). For tokens split into more than one word-piece by BERT's tokenizer, the final embedding feature is the average vector of all of their word-pieces. This feature yields a 768-dimensional vector that we use individually as well as in combination with the other discrete features in our experiments. We utilize the sklearn-crfsuite tool for our CRF experiments.5
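The embedding features can be sketched as follows, assuming the HuggingFace transformers API: average the top four hidden layers and pool word-pieces back to word level. The function name and the later flattening into CRF feature dicts are illustrative.

```python
# Hedged sketch of the BERT embedding features described above.
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def word_embeddings(words):
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states              # 13 layers of [1, seq_len, 768]
    top4 = torch.stack(hidden[-4:]).mean(0).squeeze(0)   # average of the top four layers
    vecs = []
    for w in range(len(words)):                          # average the word-pieces of each word
        piece_ids = [i for i, wid in enumerate(enc.word_ids()) if wid == w]
        vecs.append(top4[piece_ids].mean(0).numpy())
    return vecs                                          # one 768-d vector per word

# Each vector can be unpacked into a feature dict ({"emb_0": v[0], ...}) and combined
# with the discrete features as input to sklearn_crfsuite.CRF.
```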

4.2 BiLSTM-CRF Models

To compare our models with standard sequence tagging models for argument segmentation (Petasis, 2019; Ajjour et al., 2019; Hua et al., 2019), we experiment with the BiLSTM-CRF sequence tagging model introduced by Ma and Hovy (2016), using the flair library (Akbik et al., 2019). We use the standard BERT ("bert-base-uncased") embeddings (768D) in the embedding layer, projected to a single-layer BiLSTM of 256D. BiLSTMs provide context from the token's left and right, which has proved to be useful for sequence tagging tasks. We train this model with and without a CRF decoder to see its effect on this task. The CRF layer considers both the output of the BiLSTM layer and the labels of neighboring tokens, which improves the accuracy of modeling the desired transitions between labels (Ma and Hovy, 2016).
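A hedged sketch of such a BiLSTM(-CRF) tagger with the flair library is shown below; the column-format file names are placeholders and some API details vary across flair versions.

```python
# Hedged sketch: BiLSTM(-CRF) tagger over BERT embeddings with the flair library.
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# CoNLL-style files: one token and its BIO label per line, blank line between sentences
corpus = ColumnCorpus("data/", {0: "text", 1: "label"},
                      train_file="train.txt", dev_file="dev.txt", test_file="test.txt")
tag_dict = corpus.make_tag_dictionary(tag_type="label")

tagger = SequenceTagger(hidden_size=256,                # single 256-d BiLSTM layer
                        embeddings=TransformerWordEmbeddings("bert-base-uncased"),
                        tag_dictionary=tag_dict,
                        tag_type="label",
                        use_crf=True)                   # False for the BiLSTM-only variant

ModelTrainer(tagger, corpus).train("models/bilstm_crf",
                                   mini_batch_size=32, max_epochs=100)
```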

4.3 Transformers Fine-tuned Models

Pre-trained BERT can also be used for transfer learning by fine-tuning on a downstream task, i.e., the claim and premise token identification task where training instances are from the labeled dataset ARG2020. We denote this model as BERTbl. Besides fine-tuning with the labeled data, we also experiment with a multitask learning setting as well as with adaptive pretraining (Gururangan et al., 2020), that is, continued pretraining on unlabeled corpora that can be task- and domain-relevant. We discuss these settings below.
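A minimal sketch of BERTbl-style fine-tuning with the HuggingFace Trainer follows, using the hyperparameters reported in the appendix (5 epochs, batch size 16, learning rate 3e-5); dataset construction is only indicated, and labeling the first word-piece of each word (others ignored via -100) is an assumption, not necessarily the authors' alignment strategy.

```python
# Hedged sketch of fine-tuning BERT for five-way token classification.
from transformers import (BertTokenizerFast, BertForTokenClassification,
                          Trainer, TrainingArguments)

LABELS = ["O-Arg", "B-Claim", "I-Claim", "B-Premise", "I-Premise"]
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained("bert-base-uncased",
                                                   num_labels=len(LABELS))

def encode(words, word_labels):
    """Align word-level BIO labels to BERT word-pieces for one sentence."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, prev = [], None
    for wid in enc.word_ids():
        # -100 is ignored by the loss: special tokens and non-first word-pieces
        labels.append(-100 if wid is None or wid == prev else LABELS.index(word_labels[wid]))
        prev = wid
    enc["labels"] = labels
    return enc

args = TrainingArguments("bert_token_clf", num_train_epochs=5,
                         per_device_train_batch_size=16, learning_rate=3e-5)
# train_dataset / dev_dataset: apply `encode` to every sentence of ARG2020, then
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset, eval_dataset=dev_dataset)
# trainer.train()
```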

Transformers Multitask Learning  Multitask learning aims to leverage useful information in multiple related tasks to improve the performance of each task (Caruana, 1997).

5 https://sklearn-crfsuite.readthedocs.io


We treat the sequence labeling task of five-way token-level argument classification as the main task, while we adopt the binary task of sentence-level argument identification (i.e., whether the candidate sentence contains an argument (Ghosh et al., 2020)) as the auxiliary task. Here, if any sentence in the candidate essay contains claim or premise token(s), the sentence is labeled as the positive category (i.e., argumentative), and otherwise as non-argumentative. We hypothesize that this auxiliary task of identifying argumentative sentences in a multitask setting could be useful for the main task of token-level classification.

We deploy two classification heads, one for each task, and the relevant gold labels are passed to them. For the auxiliary task, the learned representation of the "[CLS]" token is passed to the classification head. The two losses from these individual heads are added and propagated back through the model. This allows BERT to model the nuances of both tasks and their interdependence simultaneously. However, instead of simply adding the losses from the two tasks, we employ dynamic weighting of the task-specific losses during the training process, based on the homoscedastic uncertainty of the tasks, as proposed in Kendall et al. (2018):

L = \sum_{t} \left( \frac{1}{2\sigma_t^2} L_t + \log \sigma_t^2 \right)    (1)

where L_t and \sigma_t depict the task-specific loss and its variance (updated through backpropagation), respectively, over the training instances. We denote this model as BERTmt.

Adaptive Pretraining  We adaptively pretrained BERT over two unlabeled corpora. First, we train on a task-relevant Reddit corpus of 5.5 million opinionated claims released by Chakrabarty et al. (2019). These claims are self-labeled with the acronym IMO/IMHO (in my (humble) opinion), which is commonly used on Reddit. We denote this model as BERTIMHO. Next, we train on a task- and domain-relevant corpus of around 10K essays that we obtained originally (see Section 3) from the Writing Mentor App, excluding the annotated set of ARG2020 essays. We denote this model as BERTessay. Figure 2 displays the adaptive pretraining step and the two classification heads employed for the multitask variation.
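Adaptive pretraining can be sketched as continued masked-language-model training on the unlabeled corpus before fine-tuning, assuming the HuggingFace transformers API; the file path and training arguments below are placeholders, not the authors' settings.

```python
# Hedged sketch of adaptive (continued MLM) pretraining on an unlabeled relevant corpus.
from transformers import (BertTokenizerFast, BertForMaskedLM, Trainer, TrainingArguments,
                          DataCollatorForLanguageModeling, LineByLineTextDataset)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="unlabeled_essays.txt",   # hypothetical file
                                block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                            mlm_probability=0.15)

Trainer(model=model,
        args=TrainingArguments("bert_essay_adapted", num_train_epochs=1,
                               per_device_train_batch_size=16),
        data_collator=collator,
        train_dataset=dataset).train()

model.save_pretrained("bert_essay_adapted")   # later fine-tuned on ARG2020 (BERTessay)
```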

Figure 2: BERT fine-tuning with adaptive pretraining on unlabeled data from a relevant domain, followed by fine-tuning on the labeled dataset with the multitask variation. (Schematically: pretrained BERT is adaptively pretrained on unlabeled domain- and task-relevant data (IMHO/essays); the adaptively pretrained BERT is then fine-tuned with two classification heads, one for sentence-level (Arg/No-Arg) classification and one for token-level classification.)

For brevity, the parameter tuning description for all the models and experiments, both the discrete feature-based and the deep learning ones (e.g., CRF, BiLSTM, BERT), is in the supplemental material.

5 Results and Discussion

We present the results of our experiments using the CRF, BiLSTM, and BERT models under different settings. We report the individual F1 scores, Accuracy, and Macro-F1 (abbreviated as "Acc." and "F1") for all the categories in Table 2 and Table 3.

We apply the discrete features (structural, syntactic, lexico-syntactic ("lexSyn")) together and individually to the CRF model. We observe that the structural and syntactic features do not perform well individually, especially in the case of premise tokens (see Table 4 in Appendix A.2); therefore, we only report the results of all discrete features (Discrete* in Table 2) and, individually, only the performance of the lexSyn features. Stab and Gurevych (2017) noticed that structural features are effective in identifying argument components, especially in the introduction and conclusion sections of college-level essays, because those sections contain little argumentatively relevant content.


CRF
Features               B-Claim  I-Claim  B-Premise  I-Premise  O-Arg  Acc.  F1
lexSyn                 .395     .530     .114       .176       .768   .673  .397
Discrete*              .269     .504     0          .013       .695   .595  .296
Embeddings             .401     .560     .048       .139       .769   .676  .384
Embeddings+lexSyn      .482     .610     .134       .180       .764   .682  .434
Embeddings+Discrete*   .434     .593     .055       .152       .762   .676  .399

BiLSTM
Setup                  B-Claim  I-Claim  B-Premise  I-Premise  O-Arg  Acc.  F1
BiLSTM                 .556     .680     .239       .438       .797   .735  .542
BiLSTM-CRF             .558     .676     .199       .378       .789   .727  .520

BERT
Setup                  B-Claim  I-Claim  B-Premise  I-Premise  O-Arg  Acc.  F1
BERTbl                 .563     .674     .274       .425       .795   .728  .546
BERTblIMHO             .571     .681     .304       .410       .795   .730  .540
BERTblessay            .564     .676     .261       .406       .792   .747  .561
BERTmt                 .567     .685     .242       .439       .805   .741  .548
BERTmtIMHO             .562     .684     .221       .413       .794   .731  .534
BERTmtessay            .580     .702     .254       .427       .810   .752  .574

Table 2: F1 scores for Claim and Premise Token Detection on the test set. Underlined: highest Accuracy/F1 in group. Bold: highest Accuracy/F1 overall. *Discrete: includes structural, syntactic, and lexSyn features.

On the contrary, as stated earlier, school student essays do not always comply with such writing conventions. Table 2 shows that the lexSyn features independently perform better, by almost 8% accuracy, than the combination of the other discrete features. This correlates with the findings from prior work on SG2017 (Stab and Gurevych, 2017), where the lexSyn features reached the highest F1 on a similar corpus. Next, we augment the embedding features from the pre-trained BERT model with the discrete features and notice a marginal improvement in the accuracy score (less than 1%) over the performance of the lexSyn features. This improvement comes from higher accuracy in detecting the claim tokens (e.g., Embeddings+Discrete* achieves improvements of around 17% and 10% over the Discrete* features in the case of B-Claim and I-Claim, respectively). However, the accuracy of detecting the premise tokens is still significantly low. We assume that this could be due to the low frequency of premises in the training set, which makes it more challenging for the CRF model to learn useful patterns from the pre-trained embeddings. On the contrary, the O-Arg token is the most frequent in the essays, and that is reflected in the overall high accuracy scores for the O-Arg tokens (i.e., over 76% on average).

The overall performance improves when we apply the BiLSTM networks on the test data. Accuracy improves by 5.3% in the case of BiLSTM against the Embeddings+lexSyn features. However, results do not improve when we add the CRF classifier on top of the BiLSTM network (BiLSTM-CRF). Instead, the performance drops by 0.8% accuracy (see Table 2). In related research, Petasis (2019) conducted extensive experiments with the BiLSTM-CRF architecture with various types of embeddings and demonstrated that only a specific combination of embeddings (e.g., GloVe+Flair+BERT) achieves higher performance than the BiLSTM-only architecture; we leave such experiments for future work.

In the case of the BERT-based experiments, we observe that BERTbl obtains an accuracy of 73%, which is comparable to the BiLSTM performance. In terms of the individual categories, we observe that BERTbl achieves around a 7.5% improvement over the BiLSTM-CRF classifier for the B-Premise tokens. We also observe that the two adaptively pretrained models (e.g., BERTIMHO and BERTessay) perform better than BERTbl, where BERTessay achieves the best accuracy of 74.7%, a 2% improvement over BERTbl. Although BERTIMHO was trained on a much larger corpus than BERTessay, we assume that BERTessay achieves the highest F1 because it was trained on a domain-relevant corpus. Likewise, in the case of the multitask models, we observe that BERTmt performs better than BERTbl by 1.3%. This shows that using argumentative sentence identification as an auxiliary task is beneficial for token-level classification. With regard to the adaptively pretrained models, akin to the BERTbl-based experiments, we observe that BERTmtessay performs best, achieving the highest accuracy of over 75%.

Argument Segmentation  We choose the five-way token-level classification of argument components over the standard pipeline approach because the standard level of granularity (sentence- or clause-based) is not applicable to our training data. In order to test the benefit of the five-way token-level classification, we also compare it against the traditional approach of segmenting argumentative units into argumentative and non-argumentative tokens. We again follow the standard BIO notation, in a three-way token classification setup (B-Arg, I-Arg, and O-Arg), for argument segmentation. In this setup, the B-Claim and B-Premise classes are merged into B-Arg, and I-Claim and I-Premise are merged into I-Arg, while the O-Arg class remains unchanged. The results of all of our models on this task are shown in Table 3. We notice similar patterns in this three-way classification task as we saw in the five-way classification (except for BERTmtIMHO, which performs better than BERTmt this time). The best model remains BERTmtessay with 77.3% accuracy, an improvement of 2-3% over the BiLSTM and the other BERT-based architectures.
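The collapse from the five-way to the three-way label set is a simple mapping; a minimal sketch (label names as in Table 1 and Table 3):

```python
# Minimal sketch of collapsing the five-way labels into the three-way segmentation labels.
COLLAPSE = {"B-Claim": "B-Arg", "B-Premise": "B-Arg",
            "I-Claim": "I-Arg", "I-Premise": "I-Arg",
            "O-Arg": "O-Arg"}

def to_three_way(labels):
    return [COLLAPSE[label] for label in labels]

print(to_three_way(["B-Claim", "I-Claim", "O-Arg", "B-Premise", "I-Premise"]))
# ['B-Arg', 'I-Arg', 'O-Arg', 'B-Arg', 'I-Arg']
```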

In summary, we have two main observations from Table 2 and Table 3. First, the best model in Table 3 reports only about a 3% improvement over the result from Table 2, which shows that the five-way token-level classification is comparable to the standard task of argument segmentation. Second, the accuracy of the argument segmentation task is much lower than the accuracy reported on the college-level essay corpus SG2017 (Stab and Gurevych, 2017), which was 89.5%. This underscores the challenges of analyzing middle school student essays.

5.1 Qualitative Analysis

CRF
Features       B-Arg  I-Arg  O-Arg  Acc.  F1
lexSyn         .385   .518   .768   .683  .557
Discrete       .288   .493   .710   .625  .497
Embeddings     .379   .596   .767   .699  .581
Emb+lexSyn     .468   .622   .768   .708  .619
Emb+Discrete   .381   .599   .767   .699  .582

BiLSTM
Setup          B-Arg  I-Arg  O-Arg  Acc.  F1
BiLSTM         .546   .730   .792   .759  .689
BiLSTM-CRF     .553   .707   .793   .752  .684

BERT
Setup          B-Arg  I-Arg  O-Arg  Acc.  F1
BERTbl         .567   .698   .795   .750  .687
BERTblIMHO     .558   .717   .778   .744  .684
BERTblessay    .567   .707   .795   .754  .690
BERTmt         .555   .702   .803   .758  .688
BERTmtIMHO     .568   .719   .804   .764  .700
BERTmtessay    .563   .735   .811   .773  .710

Table 3: F1 scores for Argument Token Detection on the test set. Underlined: highest Accuracy/F1 in group. Bold: highest Accuracy/F1 overall.

Since we have explored three separate machine learning approaches with a variety of experiments, we analyze the results obtained from the BERTmtessay model, which performed the best (Table 2). According to the confusion matrix, there are three major sources of error: (a) around 2,500 "O-Arg" tokens are wrongly classified as "I-Claim", (b) 2,162 "I-Claim" tokens are wrongly classified as "O-Arg", and (c) 273 "I-Premise" tokens are erroneously classified as "I-Claim". Here, (a) and (b) are not surprising given that these are the two categories with the largest number of tokens. For (c), we looked at a couple of examples, such as "because of [Walmart's goal of saving money]premise, [customers see value in Walmart that is absent from other retailers]claim". Here, the premise tokens are wrongly classified as claim tokens. This is probably because the premise appears before the claim, which is uncommon in our training set. We notice some other sources of error, which we discuss as follows:

Non-arguments classified as arguments: This error occurs often, but it is more challenging for opinions or hypothetical examples that resemble arguments but are not necessarily arguments. For instance, the opinion "that actually makes me feel good afterward ..." and the hypothetical example "Next, you will not be eating due to your lack of money" are similar to arguments, and the classifier erroneously classifies them as claims.


In the future, we plan to include the labeled opinions during training to investigate how the model(s) handle opinions vs. arguments during classification.

Missing multiple claims in a sentence: In many examples, we observe multiple claims appearing in a single sentence, such as "[Some coral can recover from this]claim though [for most it is the final straw.]claim". During prediction, the model predicts the first claim correctly but then starts the second claim with an "I-Claim" label, which is an impossible transition from "O-Arg" (i.e., the model does not enforce well-formed spans). Besides, the model starts the second claim wrongly at the word "most" rather than "for". This indicates the model's inability to distinguish discourse markers such as "though" as potential separators between argument components. This could be explained by the fact that markers such as "though" or "because" are frequently part of an argument claim. For example, in "[those games do not seem as violent even though they are at the same level]claim", "though" is labeled as "I-Claim".

Investigating run-on sentences: Some sentences contain multiple claims written as one sentence via a comma-splice run-on, such as "[Humans in today's world do not care about the consequences]claim, [only the money they may gain.]claim", which has two claims in the gold annotations but was predicted as one long claim by our best model. Another example is "[The oceans are also another dire need in today's environment]claim, each day becoming more filled with trash and plastics.", in which the claim is predicted correctly in addition to another predicted claim starting at the word "each". The model tends to over-predict claims when a comma in the middle of the sentence is followed by a noun. However, in the former example, the adverb "only", which has a "B-Claim" label, follows the comma rather than the more frequent nouns. Such instances add more complexity to understanding and modeling argument structures in middle school student writing.

Effect of multitask learning: We examined the impact of multitask learning and noticed two characteristics. First, as expected, the multitask model can identify claims and premises that are missed by the single-task model(s), such as "[many more negative effects that come with social media ...]claim", which was correctly identified by the multitask model. Second, the clever handling of the back-propagation helps the multitask model reduce false positives and be more precise. Many non-argumentative sentences, such as "internet's social networks help teens find communities ...", and opinions, such as "take $1.3 billion off $11.3 billion the NCAA makes and give it to players", are wrongly classified as claims by the single-task models but are correctly classified as non-argumentative by the multitask model.

6 Conclusion

We conduct a token-level classification task to identify the type of the argument component tokens (e.g., claims and premises) by combining argument segmentation and component identification into one single task. We perused a new corpus collected from essays written by middle school students. Our findings show that a multitask BERT performs the best, with an absolute gain of 7.5% accuracy over the discrete features. We also conducted an in-depth comparison against the standard segmentation step (i.e., classifying argumentative vs. non-argumentative units) and proposed a thorough qualitative analysis.

Middle school student essays often contain run-on sentences or unsupported claims that make the task of identifying argument components much harder. We achieve the best performance using a multitask framework with an adaptively pretrained model, and we plan to continue to augment other tasks (e.g., opinion and stance identification) under a similar multitask framework (Eger et al., 2017). We also plan to generate personalized feedback for the students (e.g., which are the supported claims in the essay?) that is useful in automated writing assistance.

Acknowledgments

The authors would like to thank Tuhin Chakrabarty, Elsbeth Turcan, Smaranda Muresan, and Jill Burstein for their helpful comments and suggestions.


References

Tazin Afrin, Elaine Lin Wang, Diane Litman, Lindsay Clare Matsumura, and Richard Correnti. 2020. Annotation and classification of evidence and reasoning revisions in argumentative writing. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 75–84, Seattle, WA, USA → Online. Association for Computational Linguistics.

Yamen Ajjour, Milad Alshomary, Henning Wachsmuth, and Benno Stein. 2019. Modeling frames in argumentation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2922–2932, Hong Kong, China. Association for Computational Linguistics.

Yamen Ajjour, Wei-Fan Chen, Johannes Kiesel, Henning Wachsmuth, and Benno Stein. 2017. Unit segmentation of argumentative texts. In Proceedings of the 4th Workshop on Argument Mining, pages 118–128.

Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59, Minneapolis, Minnesota. Association for Computational Linguistics.

Khalid Al-Khatib, Henning Wachsmuth, Johannes Kiesel, Matthias Hagen, and Benno Stein. 2016. A news editorial corpus for mining argumentation strategies. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3433–3443, Osaka, Japan. The COLING 2016 Organizing Committee.

Yigal Attali and Jill Burstein. 2006. Automated essay scoring with e-rater v.2. The Journal of Technology, Learning and Assessment, 4(3).

Yigal Attali and Don Powers. 2008. A developmental writing scale. ETS Research Report Series, 2008(1):i–59.

Beata Beigman Klebanov, Binod Gyawali, and Yi Song. 2017. Detecting Good Arguments in a Non-Topic-Specific Way: An Oxymoron? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 244–249, Vancouver, Canada. Association for Computational Linguistics.

Or Biran and Owen Rambow. 2011. Identifying justifications in written dialogs. In 2011 IEEE Fifth International Conference on Semantic Computing, pages 162–168. IEEE.

Daniel Blanchard, Joel Tetreault, Derrick Higgins, Aoife Cahill, and Martin Chodorow. 2013. TOEFL11: A corpus of non-native English. ETS Research Report Series, 2013(2):i–15.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.

Tuhin Chakrabarty, Christopher Hidey, and Kathy McKeown. 2019. IMHO fine-tuning improves claim detection. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 558–563, Minneapolis, Minnesota. Association for Computational Linguistics.

Michael Collins. 2003. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4):589–637.

Richard Correnti, Lindsay Clare Matsumura, Elaine Wang, Diane Litman, Zahra Rahimi, and Zahid Kisa. 2020. Automated scoring of students' use of text evidence in writing. Reading Research Quarterly, 55(3):493–520.

Johannes Daxenberger, Steffen Eger, Ivan Habernal, Christian Stab, and Iryna Gurevych. 2017. What is the essence of a claim? Cross-domain claim identification. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2055–2066.

Paul Deane. 2014. Using writing process and product features to assess writing quality and explore how those features relate to other literacy tasks. ETS Research Report Series, (1):1–23.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Steffen Eger, Johannes Daxenberger, and Iryna Gurevych. 2017. Neural end-to-end learning for computational argumentation mining. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11–22.

Noura Farra, Swapna Somasundaran, and Jill Burstein. 2015. Scoring persuasive essays using opinions and their targets. In Proceedings of the Workshop on Innovative Use of NLP for Building Educational Applications, pages 64–74.


Vanessa Wei Feng and Graeme Hirst. 2011. Classifying arguments by scheme. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 987–996, Portland, Oregon, USA. Association for Computational Linguistics.

Debanjan Ghosh, Beata Beigman Klebanov, and Yi Song. 2020. An exploratory study of argumentative writing by young students: A transformer-based approach. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 145–150, Seattle, WA, USA. Association for Computational Linguistics.

Debanjan Ghosh, Aquila Khanam, Yubo Han, and Smaranda Muresan. 2016. Coarse-grained argumentation features for scoring persuasive essays. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 549–554, Berlin, Germany. Association for Computational Linguistics.

Debanjan Ghosh, Smaranda Muresan, Nina Wacholder, Mark Aakhus, and Matthew Mitsui. 2014. Analyzing argumentative discourse units in online interactions. In Proceedings of the First Workshop on Argumentation Mining, pages 39–48.

Suchin Gururangan, Ana Marasovic, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360. Association for Computational Linguistics.

Yufang Hou and Charles Jochim. 2017. Argument relation classification using a joint inference model. In Proceedings of the 4th Workshop on Argument Mining, pages 60–66, Copenhagen, Denmark. Association for Computational Linguistics.

Xinyu Hua, Zhe Hu, and Lu Wang. 2019. Argument generation with retrieval, planning, and realization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2661–2672, Florence, Italy. Association for Computational Linguistics.

Alex Kendall, Yarin Gal, and Roberto Cipolla. 2018. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7482–7491.

Klaus Krippendorff. 2004. Measuring the reliability of qualitative text analysis data. Quality and Quantity, 38:787–800.

John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.

John Lawrence and Chris Reed. 2020. Argument mining: A survey. Computational Linguistics, 45(4):765–818.

Marco Lippi and Paolo Torroni. 2015. Context-independent claim detection for argument mining. In Twenty-Fourth International Joint Conference on Artificial Intelligence.

Luca Lugini, Diane Litman, Amanda Godley, and Christopher Olshefski. 2018. Annotating student talk in text-based classroom discussions. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 110–116.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074, Berlin, Germany. Association for Computational Linguistics.

Nitin Madnani, Michael Heilman, Joel Tetreault, and Martin Chodorow. 2012. Identifying high-level organizational elements in argumentative discourse. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 20–28, Montreal, Canada. Association for Computational Linguistics.

Raquel Mochales and Marie-Francine Moens. 2011. Argumentation mining. Artificial Intelligence and Law, 19(1):1–22.

Marie-Francine Moens, Erik Boiy, Raquel Mochales Palau, and Chris Reed. 2007. Automatic detection of arguments in legal texts. In Proceedings of the 11th International Conference on Artificial Intelligence and Law, pages 225–230.

Huy Nguyen and Diane Litman. 2016. Context-aware argumentative relation mining. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1127–1137.

Huy V. Nguyen and Diane J. Litman. 2018. Argument mining for improving the automated scoring of persuasive essays. In Thirty-Second AAAI Conference on Artificial Intelligence.

Vlad Niculae, Joonsuk Park, and Claire Cardie. 2017. Argument mining with structured SVMs and RNNs. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 985–995, Vancouver, Canada. Association for Computational Linguistics.


Raquel Mochales Palau and Marie-Francine Moens. 2009. Argumentation mining: The detection, classification and structure of arguments in text. In Proceedings of the 12th International Conference on Artificial Intelligence and Law, ICAIL '09, pages 98–107, New York, NY, USA. Association for Computing Machinery.

Joonsuk Park and Claire Cardie. 2014. Identifying appropriate support for propositions in online user comments. In Proceedings of the First Workshop on Argumentation Mining, pages 29–38, Baltimore, Maryland. Association for Computational Linguistics.

Andreas Peldszus and Manfred Stede. 2013. From argument diagrams to argumentation mining in texts: A survey. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 7(1):1–31.

Andreas Peldszus and Manfred Stede. 2015. Joint prediction in MST-style discourse parsing for argumentation mining. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 938–948, Lisbon, Portugal. Association for Computational Linguistics.

Isaac Persing and Vincent Ng. 2014. Modeling prompt adherence in student essays. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1534–1543.

Isaac Persing and Vincent Ng. 2015. Modeling argument strength in student essays. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 543–552.

Isaac Persing and Vincent Ng. 2016. End-to-end argumentation mining in student essays. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1384–1394.

Georgios Petasis. 2019. Segmentation of argumentative texts with contextualised word representations. In Proceedings of the 6th Workshop on Argument Mining, pages 1–10, Florence, Italy. Association for Computational Linguistics.

Peter Potash, Alexey Romanov, and Anna Rumshisky. 2017. Here's my point: Joint pointer architecture for argument mining. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1364–1373.

Niall Rooney, Hui Wang, and Fiona Browne. 2012. Applying kernel methods to argumentation mining. In FLAIRS Conference, volume 172.

Claudia Schulz, Steffen Eger, Johannes Daxenberger, Tobias Kahse, and Iryna Gurevych. 2018. Multi-task learning for argumentation mining in low-resource settings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 35–41, New Orleans, Louisiana. Association for Computational Linguistics.

Claudia Schulz, Christian M. Meyer, and Iryna Gurevych. 2019. Challenges in the automatic analysis of students' diagnostic reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6974–6981.

Swapna Somasundaran, Brian Riordan, Binod Gyawali, and Su-Youn Yoon. 2016. Evaluating argumentative and narrative essays using graphs. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1568–1578.

Yi Song, Michael Heilman, Beata Beigman Klebanov, and Paul Deane. 2014. Applying argumentation schemes for essay scoring. In Proceedings of the First Workshop on Argumentation Mining, pages 69–78.

Christian Stab and Iryna Gurevych. 2014. Identifying argumentative discourse structures in persuasive essays. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 46–56.

Christian Stab and Iryna Gurevych. 2017. Parsing argumentation structures in persuasive essays. Computational Linguistics, 43(3):619–659.

Manfred Stede and Jodi Schneider. 2018. Argumentation mining. Synthesis Lectures on Human Language Technologies, 11(2):1–191.

Nina Wacholder, Smaranda Muresan, Debanjan Ghosh, and Mark Aakhus. 2014. Annotating multiparty discourse: Challenges for agreement metrics. In Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop, pages 120–128.

Henning Wachsmuth, Khalid Al-Khatib, and Benno Stein. 2016. Using argument mining to assess the argumentation quality of essays. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1680–1691, Osaka, Japan. The COLING 2016 Organizing Committee.


Haoran Zhang and Diane Litman. 2020. Automated topical component extraction using neural network attention scores from source-based essay scoring. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8569–8584, Online. Association for Computational Linguistics.

A Appendix

A.1 Parameter Tuning

CRF experiment: For the CRF model, we search over the two regularization parameters c1 and c2 by sampling from exponential distributions with a 0.5 scale for c1 and a 0.05 scale for c2, using 3-fold cross-validation over 50 iterations, which takes about 20 minutes of run-time. The final values are 0.8 for c1 and 0.05 for c2 for the best CRF model, which uses the lexSyn and BERT embedding features.
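A hedged sketch of this randomized search, following the common sklearn-crfsuite plus scikit-learn pattern; the variables X_train and y_train (per-token feature dicts and BIO label sequences) are assumed to be built as in Section 4.1, and CRF/RandomizedSearchCV compatibility depends on the installed scikit-learn version.

```python
# Hedged sketch of the randomized search over the CRF regularization parameters c1/c2.
import scipy.stats
import sklearn_crfsuite
from sklearn_crfsuite import metrics
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100,
                           all_possible_transitions=True)
params = {"c1": scipy.stats.expon(scale=0.5),    # L1 regularization
          "c2": scipy.stats.expon(scale=0.05)}   # L2 regularization
scorer = make_scorer(metrics.flat_f1_score, average="macro")

search = RandomizedSearchCV(crf, params, cv=3, n_iter=50, scoring=scorer, n_jobs=-1)
# search.fit(X_train, y_train)
# print(search.best_params_)   # e.g., c1 of roughly 0.8 and c2 of roughly 0.05 as reported above
```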

BiLSTM experiment: For the BiLSTM-network-based experiments, we searched the hyperparameters over the dev set. In particular, we experimented with different mini-batch sizes (e.g., 16, 32), dropout values (e.g., 0.1, 0.3, 0.5, 0.7), numbers of epochs (e.g., 40, 50, 100, with early stopping), and hidden state vectors of size 256. Embeddings were generated using BERT ("bert-base-uncased") (768 dimensions). After tuning, we use the following hyperparameters for the test set: a mini-batch size of 32, number of epochs = 100 (stopping between 30-40 epochs), and a dropout value of 0.1. The model has one BiLSTM layer with a hidden layer size of 256.

BERT-based models: We use the dev partition for hyperparameter tuning (batch sizes of 8, 16, 32, 48; 3, 5, or 6 epochs; learning rate of 3e-5) and optimized the networks with the Adam optimizer. The training partitions were fine-tuned for 5 epochs with batch size = 16. Each training epoch took approximately 8:46 to 9 minutes on a K-80 GPU with 48GB vRAM.

A.2 Results of Discrete Feature Groups

We show below the results of using each of the three feature groups individually: structural, syntactic, and lexico-syntactic. As mentioned in the results section of the paper, the structural and syntactic features do not do well when used individually.

Features    Structural  Syntactic  LexSyn  All Discrete
B-Claim     .294        .218       .395    .269
I-Claim     .461        .390       .530    .504
B-Premise   0           0          .114    0
I-Premise   .018        .009       .176    .013
O-Arg       .655        .745       .768    .695
Accuracy    .560        .625       .673    .595
Macro F1    .285        .272       .397    .296

Table 4: Accuracy and F1 scores for Claim and Premise Token Detection on the test set for each group of the discrete features in the CRF model.

Therefore, they were excluded from further experimentation with BERT embeddings; only the lexSyn features were tested individually with the embeddings.