Proceedings of the First Workshop on Scholarly Document Processing, pages 10–19, Online, November 19, 2020. ©2020 Association for Computational Linguistics
https://doi.org/10.18653/v1/P17
Acknowledgement Entity Recognition in CORD-19 Papers

Jian Wu, Computer Science, Old Dominion University, Norfolk, VA, USA, jwu@cs.odu.edu
Pei Wang, Xin Wei, Computer Science, Old Dominion University, Norfolk, VA, USA, {pwang001, xwei001}@odu.edu
Sarah Michele Rajtmajer, C. Lee Giles, Information Sciences and Technology, Pennsylvania State University, University Park, PA, USA
Christopher Griffin, Applied Research Laboratory, Pennsylvania State University, University Park, PA, USA
Abstract
Acknowledgements are ubiquitous in scholarly papers. Existing acknowledgement entity recognition methods assume all named entities are acknowledged. Here, we examine the nuances between acknowledged and named entities by analyzing sentence structure. We develop an acknowledgement extraction system, ACKEXTRACT, based on open-source text mining software, and evaluate our method using manually labeled data. ACKEXTRACT uses the PDF of a scholarly paper as input and outputs acknowledgement entities. Results show an overall performance of F1 = 0.92. We built a supplementary database by linking CORD-19 papers with acknowledgement entities extracted by ACKEXTRACT, including persons and organizations, and find that only up to 50–60% of named entities are actually acknowledged. We further analyze chronological trends of acknowledgement entities in CORD-19 papers. All code and labeled data are publicly available at https://github.com/lamps-lab/ackextract
1 Introduction
Acknowledgements have been an institutionalized part of research publications for some time (Blaise, 2001). Acknowledgement statements show the authors' public gratitude and recognition to individuals, organizations, and grants for various contributions. Acknowledged individuals and organizations have been under-represented in author ranking and citation impact analysis, mostly due to their presumed sub-authorship contribution. A recent survey found that discipline, academic rank, and gender have a significant effect on coauthorship disagreement rate (Smith et al., 2019), leading to non-author collaborators receiving less attention. A recent study of non-author collaborators in the biomedical and social sciences (Paul-Hus et al., 2017) showed that non-author collaborators are not rare and that their presence varies significantly by discipline.
Acknowledgements can be classified depending on the nature of the contribution. Song et al. (2020) classified sentences in acknowledgement sections into six categories: declaration, financial, peer interactive communication and technical support, presentation, general acknowledgement, and general statement. They can also be classified based on the type of entities, such as individual, organization, and grant. Since 2008, funding acknowledgements have been indexed by Web of Science. However, there is still no dedicated software to accurately recognize acknowledged people and organizations and generate a centralized acknowledgement database. Early works on acknowledgements were based on datasets manually extracted from specific journals, which was not scalable. Building such a large database can support further study of acknowledgements at a larger scale.
There are several scenarios that make the acknowledgement entity recognition (AER) task challenging. The upper panel of Figure 1 shows examples of sentences appearing in an isolated acknowledgement section. "Utrecht University" is mentioned but should not be counted as an acknowledgement entity, because it is just the affiliation of "Arno van Vliet", who is acknowledged. Acknowledgement statements can also appear in footnotes (Figure 1, Bottom), mixed with other footnote and/or body text. Author names may also appear in the statements, such as "Andreoni" in this example, and should be excluded.
Existing works on AER leverage off-the-shelf named entity recognition (NER) packages, such
Figure 1: Upper: acknowledgement statements appear in an isolated section (Stewart et al., 2018). Bottom: acknowledgement statements appear in a footnote (Andreoni and Gee, 2015).
as the Natural Language Toolkit (NLTK) (Bird, 2006), e.g., Khabsa et al. (2012) and Paul-Hus et al. (2020), followed by simple semi-manual data cleansing, resulting in a fraction of entities that are mentioned but not actually acknowledged. In this paper, we design an automatic AER system called ACKEXTRACT that further classifies extraction results from open-source NER packages that recognize people and organizations, and distinguishes entities that are actually acknowledged. The extractor finds acknowledgement statements in isolated sections and at other locations such as footnotes, which is common for papers in the social and behavioral sciences. Our contributions are:
1. Develop ACKEXTRACT as open-source software to automatically extract acknowledgement entities from research papers.
2. Apply ACKEXTRACT to the CORD-19 dataset and supplement the dataset with a corpus of classified acknowledgement entities for further studies.
3. Use the CORD-19 dataset as a case study to demonstrate that acknowledgement studies without classifying named entities can significantly overestimate the number of entities that are actually acknowledged, because many people and organizations are mentioned but not explicitly acknowledged.
2 Related Works
Early work on acknowledgement extraction was performed manually, which was labor-intensive. Cronin et al. (1993) extracted a total of 9,561 peer interactive communication (PIC) names from a total of 4,200 research sociology articles; most were persons' names. They also defined the following six categories of acknowledgement: moral support, financial support, editorial support, presentational support, instrumental/technical support, and conceptual support or PIC (Cronin et al., 1992).
Councill et al. (2005) used a hybrid method for automatic AER from research papers and automatically created an acknowledgement index (Giles and Councill, 2004). The algorithm first used a heuristic method for identifying acknowledgement passages. It then used an SVM model for identifying lines containing acknowledgement sentences outside labeled acknowledgement sections. A regular expression was used to extract entity names from acknowledging text. This method achieved an overall precision of about 0.785 and a recall of 0.896 on CiteSeer papers (Giles et al., 1998). The algorithm does not distinguish entity types.
Khabsa et al. (2012) leveraged OpenCalais1 and AlchemyAPI2, free services at that time, to extract named entities from acknowledgement sections and built ACKSEER, a search engine for acknowledgement entities. They merged the outputs of both NER APIs into a single list and disambiguated entity mentions using the longest common subsequence (LCS) algorithm. The ground truth contains 200 top-cited CiteSeerX papers, of which 130 had acknowledgement sections. They achieved 92.3% precision and 91.6% recall for acknowledgement section extraction but did not evaluate entity extraction.
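LCS-based mention disambiguation can be sketched in a few lines of Python. This is an illustrative reconstruction, not ACKSEER's actual code; the `same_entity` helper and its 0.8 coverage threshold are our own assumptions:

```python
def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings (classic DP)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            # Extend the LCS on a character match, otherwise carry the best so far.
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def same_entity(m1: str, m2: str, threshold: float = 0.8) -> bool:
    """Treat two mentions as one entity when the LCS covers most of the shorter mention."""
    return lcs_len(m1.lower(), m2.lower()) / min(len(m1), len(m2)) >= threshold

# Two spellings of the same name collapse; unrelated names do not.
print(same_entity("C. Lee Giles", "C Lee Giles"))   # True
print(same_entity("Wellcome Trust", "NSF"))          # False
```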
Recent studies of acknowledgements tend to use results from off-the-shelf NER packages with simple filters, assuming that named entities were acknowledged entities. For example, Paul-Hus et al. (2020) use the Stanford NER module in NLTK to extract persons. Song et al. (2020) also directly use people and organizations recognized by Stanford CoreNLP (Manning et al., 2014). These works achieved a high recall by recognizing most named entities in the acknowledgements but ignored their relations to the papers in which they appear, resulting in a fraction of entities that are mentioned but not actually acknowledged. Song et al. (2020) consider grammatical structure, such as verb tense and voice, and sentence patterns when labeling sentences with their six categories. For example, "was funded" is followed by an "organization". However, they only label sentences and do not annotate them down to
1https://web.archive.org/web/20081023021111/http://opencalais.com/calaisAPI
2https://web.archive.org/web/20090921044923/http://www.alchemyapi.com
Figure 2: Architecture of ACKEXTRACT.
the entity level. Our system examines the relationship between entities and the current work with the purpose of discriminating acknowledgement entities from named entities. In this system, we focus on people and organizations.
Recently, Dai et al. (2019) proposed GrantExtractor, a pipeline system to extract grant support information from scientific papers. A model combining BiLSTM-CRF and pattern matching was used to extract grant numbers and agencies from funding sentences, which are identified using heuristic methods. The system achieves a micro-F1 of up to 0.90 in extracting grant pairs (agency, number).
Kayal et al. (2019) proposed an ensemble approach called FUNDINGFINDER for extracting funding information from text. The authors construct feature vectors for candidate entities based on whether the entities are recognized by four NER implementations: Stanford (Conditional Random Field model), LingPipe (Hidden Markov model), OpenNLP (Maximum Entropy model), and Elsevier's Fingerprint Engine. The F1-measure for funding body is only 0.68 ± 0.03.
Our method differs from existing methods in three ways: (1) it is built on top of state-of-the-art neural NER methods, which results in a relatively high recall; (2) it uses a heuristic method to filter out entities that are merely mentioned but not acknowledged; (3) it extracts both organizations and people.
3 Dataset
On March 16, 2020, the Allen Institute for Artificial Intelligence released the first version of the COVID-19 Open Research Dataset (CORD-19) (Wang et al., 2020) in collaboration with several other institutions. The dataset contains metadata and segmented full text of research articles, selected by searching a list of keywords about coronavirus, SARS-CoV, MERS, and other related terms from four digital libraries: WHO, PubMed Central, BioRxiv, and MedRxiv. The initial dataset contained about 28k papers and was updated weekly with papers from new sources and the latest publications. We used the dataset released on April 10, 2020, containing 59,312 metadata records, among which 54,756 have full text in JSON format. The CORD-19 papers were generated by processing PDFs using the S2ORC pipeline (Lo et al., 2019), in which GROBID (Lopez, 2009) was employed for document segmentation and metadata extraction.

The full text in the JSON files is directly used for sentence segmentation. However, we observe that GROBID extraction results are not perfect. In particular, we estimate the fraction of acknowledgement sections omitted. We also estimate the number of acknowledgement entities omitted from the data release due to GROBID extraction errors. To do this, we downloaded 45,916 full-text PDF papers collected by the Internet Archive (IA), because the CORD-19 dataset does not include PDF files3. We found 13,103 CORD-19 papers in the IA dataset.
4 Acknowledgement Extraction
4.1 Overview

The architecture of our acknowledgement entity extraction system is depicted in Figure 2. The system can be divided into the following modules:
1. Document segmentation: CORD-19 provides full text as JSON files, but in general most research articles are published in PDF, so our first step is converting a PDF document to text and segmenting it into sections. We use GROBID, which has shown superior performance over many other document extraction methods (Lipinski et al., 2013). The output is an XML file in the TEI schema.
2. Sentence segmentation: Paragraphs are segmented into sentences. We compare several sentence segmentation software packages and choose Stanza (Qi et al., 2020) because of its relatively high accuracy.
3. Sentence classification: Sentences are classified into acknowledgement and non-acknowledgement statements. The result is a set of acknowledgement statements inside or outside the acknowledgement sections.
3https://archive.org/download/covid19_fatcat_20200410
4. Entity recognition: Named entities are extracted from acknowledgement statements. We compare four commonly used NER software packages and choose Stanza because of its relatively high performance. In this work, we focus on persons and organizations.
5. Entity classification: In this module, we classify named entities by analyzing sentence structures, aiming at discriminating named entities that are actually acknowledged from those that are merely mentioned. We demonstrate that triple extraction packages such as REVERB and OLLIE fail to handle acknowledgement statements with multiple entities in their objects in our dataset. The results are acknowledgement entities, including people and organizations.
4.2 Document Segmentation

The majority of scholarly papers are published in PDF format, which is not readily readable by text processors. Several attempts have been made to convert PDF into text (Bast and Korzen, 2017) and segment the document into section and sub-section levels. GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents into TEI-encoded documents. Other similar methods have been developed recently, such as OCR++ (Singh et al., 2016) and Science Parse4. Lipinski et al. (2013) compared 7 metadata extraction methods and found that GROBID (version 0.4.0) achieved superior performance over the others. GROBID trained a cascade of conditional random field (CRF) models on PubMed and computer science papers. The recent version (0.6.0) has a set of powerful functionalities, such as extracting and parsing headers and segmenting full text. GROBID supports a batch mode and an API service mode; the latter enables large-scale document processing on multi-core servers, such as in CiteSeerX (Wu et al., 2015). A benchmarking result for version 0.6.0 shows that section title parsing achieves F1 = 0.70 under the strict matching criteria and F1 = 0.75 under the soft matching criteria5. Singh et al. (2016) claim OCR++ achieves better performance than GROBID in several fields, evaluated on computer science papers. However, the lack of a service-mode API and multi-domain adaptability limits its usability. Science Parse only extracts key metadata such as
4https://github.com/allenai/spv2
5https://grobid.readthedocs.io/en/latest/Benchmarking-pmc
title, authors, year, and venue. Therefore, we adopt GROBID to convert PDF documents into XML files.
Depending on the structure and provenance of PDFs, GROBID may miss acknowledgements in certain papers. To estimate the fraction of papers in which acknowledgements were missed by GROBID, we visually inspected a random sample of 200 papers from the CORD-19 dataset and found that only 146 papers (73%) contain acknowledgement statements, out of which GROBID successfully extracted all acknowledgement statements from 120 papers (82%). For the remaining 26 papers that GROBID failed to parse, 17 papers have their acknowledgements in sections and 9 in footnotes. We developed a heuristic method that can extract acknowledgement statements from all 120 papers with acknowledgement statements output by GROBID.
4.3 Sentence Segmentation
The acknowledgement sections or statements extracted above are paragraphs, which need to be segmented (or tokenized) into sentences. We compared four software packages for sentence segmentation: NLTK (Bird, 2006), Stanza (Qi et al., 2020), Gensim (Rehurek and Sojka, 2010), and the Pragmatic Segmenter6.
NLTK includes a sentence tokenization method, sent_tokenize(), which uses an unsupervised algorithm to build a model for abbreviated words, collocations, and words that start sentences, and then uses that model to find sentence boundaries. Stanza is a Python natural language analysis package developed by the Stanford NLP group. Its sentence segmentation is modeled as a tagging problem over character sequences, where a neural model predicts whether a given character is the end of a sentence. The split_sentences() function in the Gensim package splits a text and returns a list of sentences from a given text string using unsupervised pattern recognition. The Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
To compare the above methods, we created a ground truth corpus by randomly selecting acknowledgement sections or statements from 47 papers and manually segmenting them, resulting in 100 sentences. Table 1 shows the comparison results for the four methods. The precision is calculated
6https://github.com/diasks2/pragmatic_segmenter
Method      Precision  Recall  F1
Gensim      0.65       0.64    0.65
NLTK        0.72       0.69    0.70
Pragmatic   0.86       0.76    0.81
Stanza      0.81       0.88    0.84

Table 1: Sentence segmentation performance.
by dividing the number of correctly segmented sentences by the total number of sentences segmented. The recall is calculated by dividing the number of correctly segmented sentences by the total number of manually segmented sentences. Stanza outperforms the other three, achieving an F1 = 0.84.
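These definitions translate directly into code. The counts below are made-up illustrative numbers, not the paper's evaluation data:

```python
def segmentation_scores(correct: int, predicted: int, gold: int):
    """Precision = correct/predicted, recall = correct/gold, F1 = harmonic mean."""
    p = correct / predicted
    r = correct / gold
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# Hypothetical run: 108 sentences predicted, 88 of them correct, 100 gold sentences.
p, r, f1 = segmentation_scores(correct=88, predicted=108, gold=100)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.81 0.88 0.85
```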
4.4 Sentence Classification
Not all sentences in acknowledgement sections express acknowledgement, such as the following sentence: "The funders played no role in the study or preparation of the manuscript" (Song et al., 2020). In this module, we classify sentences into acknowledgement and non-acknowledgement statements. We developed a set of regular expressions that match verbs (e.g., "thank", "gratitude to", "indebted to"), adjectives (e.g., "grateful to"), and nouns (e.g., "helpful comments", "useful feedback") to cover as many cases as possible. To evaluate this method, we manually selected 100 sentences, including 50 positive and 50 negative samples, from the sentences obtained in Section 4.3. Our results show that 96 out of 100 sentences were classified correctly, resulting in an accuracy of 0.96.
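A minimal sketch of such a rule-based classifier, with a deliberately small cue list; the actual expressions live in the released code, so the patterns below are our own assumptions:

```python
import re

# Hypothetical cue patterns in the spirit of the paper's rule set:
# verbs, adjectives, and noun phrases that signal acknowledgement.
ACK_CUES = re.compile(
    r"\b(thank(s|ed)?|gratitude to|indebted to|grateful to|"
    r"helpful comments|useful feedback|acknowledge[sd]?|"
    r"(was|were|is|are) (supported|funded)( by)?)\b",
    re.IGNORECASE,
)

def is_ack_statement(sentence: str) -> bool:
    """True when the sentence contains an acknowledgement cue."""
    return bool(ACK_CUES.search(sentence))

print(is_ack_statement("We thank Jane Doe for helpful comments."))  # True
# The paper's negative example carries no cue, so it is rejected:
print(is_ack_statement("The funders played no role in the study."))  # False
```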
4.5 Entity Recognition
In this step, named entities are extracted using state-of-the-art NER software packages, including NLTK (Bird, 2006), Stanford CoreNLP (Manning et al., 2014), spaCy (Honnibal and Montani, 2017), and Stanza (Qi et al., 2020). Stanza is a Python library offering fully neural pre-trained models that provided state-of-the-art performance on many raw text processing tasks when it was released. Its NER model adopted the contextualized sequence tagger in Akbik et al. (2018). The architecture includes a Bi-LSTM character-level language model followed by a one-layer Bi-LSTM sequence tagger with a conditional random field (CRF) encoder. Although Stanza was developed based on Stanford CoreNLP, they exhibit different performance on our NER task.
The ground truth is built by randomly selecting
Method            Entity  Precision  Recall  F1
NLTK              Person  0.45       0.68    0.55
                  Org     0.59       0.77    0.67
spaCy             Person  0.74       0.88    0.80
                  Org     0.63       0.74    0.68
Stanford CoreNLP  Person  0.88       0.87    0.87
                  Org     0.68       0.80    0.73
Stanza            Person  0.89       0.93    0.91
                  Org     0.60       0.89    0.72

Table 2: Comparison of NER software packages.
100 acknowledgement paragraphs from sections, footnotes, and body text, and manually annotating person and organization entities without discriminating whether they are acknowledged or not. This results in 146 person and 209 organization entities. The comparison results indicate that, overall, Stanza outperforms the other three, achieving F1 = 0.91 for person and F1 = 0.72 for organization. In particular, the recall of Stanza is 9% higher than that of Stanford CoreNLP (Table 2).
4.6 Entity Classification
As we showed, not all named entities are acknowledged, such as "Utrecht University" in Figure 1. Therefore, it is necessary to build a classifier to discriminate acknowledgement entities, i.e., entities that are thanked by the paper or the authors, from named entities.
The majority of acknowledgement statements in academic articles have a relatively standard subject-predicate-object (SPO) structure. They use a limited number of words or phrases, such as "thank", "acknowledge", "are grateful", "is supported", and "is funded", as predicates. However, the object can contain multiple named entities, some of which are used as attributes of the others. In rare cases, certain sentences may lack subjects and predicates, such as the first sentence in Figure 4.
Our approach can be divided into three steps. Two representative examples are illustrated in Figure 3. The pseudo-code is shown in Algorithm 1.
Step 1: We resolve the type of voice (active or passive), the subject, and the predicate of a sentence using dependency parsing by Stanza. This is because named entities can appear in either the subjective or the objective part. We then locate all named entities. The semantic meaning of a predicate and its type of voice can be used to determine whether the acknowledged entities are in the objective part or the subjective part. In most cases, the target entities are objects.
Step 2: A sentence with multiple entities in the objective part is split into shorter sentences, called "subsentences", so that each subsentence is associated with at most one named entity. This is done by first splitting the sentence by "and". For each subsentence, if the subject and predicate are missing, we fill them in using the subject and predicate of the original sentence. The object in each subsentence does not necessarily contain an entity. For example, in the right panel of Figure 3, because "expertise" is not a named entity, it is replaced by "none" in the subsentence. The SPO relations that do not contain named entities are removed.
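Step 2 can be sketched as follows. `split_subsentences` is a hypothetical helper that splits the object on commas and "and" and re-attaches the original subject and predicate to every fragment; the real system additionally uses dependency parsing, which is omitted here:

```python
import re

def split_subsentences(subject: str, predicate: str, obj: str):
    """Split a multi-entity object on commas/'and' and inherit the
    original subject and predicate in every resulting subsentence."""
    parts = [p.strip() for p in re.split(r",|\band\b", obj) if p.strip()]
    return [(subject, predicate, part) for part in parts]

# Example object from Figure 3 (left panel of the second example).
triples = split_subsentences(
    "we", "thank",
    "Anne Broadwater and Nicolas Olivarez of the de Silva lab")
print(triples)
# [('we', 'thank', 'Anne Broadwater'),
#  ('we', 'thank', 'Nicolas Olivarez of the de Silva lab')]
```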
There are two scenarios. In the first scenario, a sentence or a subsentence may contain a list of entities with only the first being acknowledged. In the third example of Figure 4, the named entities are ['Qi Yang', 'Morehouse School of Medicine', 'Atlanta', 'GA'], but only the first entity is acknowledged. The remaining entities, which are recognized as organizations or locations, are used to supplement additional information. In this scenario, only the first named entity is extracted.
Step 3: In the second scenario, acknowledged entities are connected by commas or "and", such as in "We thank Shirley Hauta, Yurij Popowych, Elaine van Moorlehem and Yan Zhou for help in the virus production." In this scenario, the entities in this structure have the same type, indicating that they play similar roles. This parallel pattern can be captured by regular expressions. The entities resolved in Steps 2 and 3 are merged to form the final set.
The method to find the parallel structure is as follows. First, check whether each entity's type is person. If so, the entities are substituted with integer indexes. The sentence becomes "The authors would like to thank 0, 1 and 2 and the kind support of Bayer Animal Health GmbH and Virbac Group." If there are 3 or more consecutive numbers in this form, this is the parallel pattern, which is captured by regular expressions. The pattern also allows text between names (Figure 5). Next, the numbers in this part are extracted and mapped to the corresponding entities. In the example above, the numbers [0, 1, 2] correspond to the indexes of the entities ['Norbert Mencke', 'Lourdes Mottier', 'David McGahie']. Similar pattern recognition is performed for organizations. The process is depicted in Figure 5.
Figure 3: Process mapping multiple subject-predicate-object relations to acknowledgement entities, with two representative examples: "We would also like to thank the Canadian Wildlife Health Cooperative at the University of Saskatchewan for their support and expertise" and "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab for technical assistance with the assays." "None" means the object does not contain named entities.
Method               Precision  Recall  F1
Stanza               0.57       0.91    0.70
RB (without Step 3)  0.94       0.78    0.85
RB                   0.94       0.90    0.92

Table 3: Performance of Stanza and our relation-based (RB) classifier. Precision and recall show overall results combining person and organization.
We investigated open information extraction methods such as REVERB (Fader et al., 2011) and OLLIE (Mausam et al., 2012). We found that they only work for sentences with relatively simple structures, such as "The authors wish to thank Ming-Wei Guo for reagents and technical assistance," but fail on more complicated sentences with long objective parts or parallel subsentences (Figure 4). We also investigated the semantic role labeling (SRL) library in the AllenNLP toolkit (Shi and Lin, 2019). For the third sentence in Figure 4, the AllenNLP SRL library resolves the entire string after "thank" as one argument but fails to distinguish entity types. Our method can handle all statements in Figure 4.
As a baseline, we use Stanza to extract all named entities and compare its performance with that of our relation-based (RB) classifier. The ground truth is compiled using the same corpus described in Section 4.5, except that only acknowledgement entities (as opposed to all named entities) are labeled positive. The results (Table 3) show that Stanza achieves high recall but poor precision, indicating that a significant fraction (~40%) of named entities are not acknowledged. In contrast, our classifier (RB) achieves a precision of 0.94 with a small loss of recall, achieving an F1 = 0.92.
One limitation of the RB classifier is that it relies on text quality. Sentences are expected to follow
Figure 4: Sentence examples from which REVERB or OLLIE failed to extract acknowledgement entities (underlined in the original), which can be identified by our method. Examples: "To Aulio Costa Zambenedetti for the art work, Nilson Fidêncio, Silvio Marques, Tania Schepainski and Sibelli Tanjoni de Souza for technical assistance"; "The authors also thank the Program for Technological Development in Tools for Health-PDTIS FIOCRUZ for the use of its facilities (Platform RPT09H - Real Time PCR - Instituto Carlos Chagas/Fiocruz-PR)"; "The authors wish to thank Ms. Qi Yang, DNA sequencing lab, Morehouse School of Medicine, Atlanta, GA, and the Research Cores at Morehouse School of Medicine supported through the Research Centers at Minority Institutions (RCMI) Program NIH/NCRR/RCMI Grant G12-RR03034."
Figure 5: Merging the results given by Step 2 and Step 3 to obtain the final results for the sentence "The authors would like to thank Norbert Mencke, Lourdes Mottier and David McGahie and the kind support of Bayer Animal Health GmbH and Virbac Group": [Norbert Mencke, David McGahie, Bayer Animal Health, Virbac Group, Lourdes Mottier].
the editorial convention that entity names are clearly delimited by periods. If two sentences are not properly delimited, e.g., a period is missing, the classifier may make incorrect predictions.
5 Data Analysis
The CORD-19 dataset contains PMC and PDF folders. We found that almost all papers in the PMC folders are included in the PDF folders. For consistency, we only work on papers in the PDF folders, containing 37,612 unique PDF papers7. Using ACKEXTRACT, we extracted a total of 102,965 named entities from 21,469 CORD-19 papers. The fraction of papers with named entities (21,469/37,612 ≈ 57%) is roughly consistent with the fraction obtained from the 200 samples (120/200 ≈ 60%, Section 4.2). Among all named entities, 58,742 are acknowledgement entities. These numbers suggest that, by our model, only about 50% of named entities are acknowledged. Using only named entities, acknowledgement studies could significantly overestimate the

7We excluded 2,003 duplicate papers with exactly the same SHA-1 values.
Algorithm 1: Relation-Based Classifier

Function find_entity(sentence):
    pre-process punctuation and format
    find candidate entities entity_list by Stanza
    for x in entity_list:
        if x is in the subject part (differs based on predicate):
            x = none
    remove none from entity_list
    return entity_list

Input: sentence
Output: entity_list_all
find the subject part, the predicate, and the object part
if predicate matches reg_expression_1:
    if "and" in sentence:
        split into subsentences by "and"
        for each subsentence:
            entity_list <- find_entity(subsentence)
            if entity_list != none:
                entity_list_all <- append entity_list[0]
    else:
        entity_list <- find_entity(sentence)
        entity_list_all <- entity_list[0]
else if predicate matches reg_expression_2:
    do the same as the if-branch but focus on entities in the subject
for x in candidate_list by Stanza:
    if x in sentence and x.type = type:
        orig_list <- append x
        sent <- replace x by index(x)
if reg_format in sent:
    temp <- find reg_expression in sent
    numlist <- find numbers in temp
    list <- index orig_list by numlist
combine list and entity_list_all
number of acknowledgement entities by relying on entities recognized by NER software packages without further classification. Here, we analyze our results and study some trends of acknowledgement entities in the CORD-19 dataset.
The top 10 acknowledged organizations (Table 4) are all funding agencies. Overall, NIH (NIAID is an institute of NIH) is acknowledged the most, but funding agencies in other countries are also frequently acknowledged in CORD-19 papers.
Figure 6 shows the numbers of CORD-19 papers with vs. without acknowledgement (person or organization) recognized from 1970 to 2020. The huge leap around 2002 was due to the wave of coronavirus studies during the SARS period. The small drop in 2020 was due to data incompleteness. The plot indicates that the fraction of
Organization                                                    Count
National Institutes of Health or NIH                            1414
National Natural Science Foundation of China or NSFC             615
National Institute of Allergy & Infectious Diseases or NIAID     209
National Science Foundation or NSF                               183
Ministry of Health                                               160
Wellcome Trust                                                   146
Deutsche Forschungsgemeinschaft                                  119
Ministry of Education                                            119
National Science Council                                         104
Public Health Service                                             85

Table 4: Top 10 acknowledged organizations identified in our CORD-19 dataset.
Figure 6: Numbers of CORD-19 papers with vs. without acknowledgement recognized, from 1970 to 2020.
papers with acknowledgement has been increasing gradually over the last 20 years. Figure 7 shows the numbers of acknowledgements to the top 10 acknowledged organizations from 1983 to 2020. The figure indicates that the number of acknowledgements to NIH has been gradually decreasing over the past 10 years, while the number of acknowledgements to NSF is roughly constant. In contrast, the number of acknowledgements to NSFC (a Chinese funding agency) has been gradually increasing. Note that the distribution of acknowledged organizations has a long tail, and organizations outside the top 10 actually dominate the total count. However, the top 10 organizations are the biggest research agencies, and the trend, to some extent, reflects strategic shifts in funding support.
6 Conclusion and Future Work
Here we extended the work of Khabsa et al. (2012) and built an acknowledgement extraction framework, denoted ACKEXTRACT, for research articles. ACKEXTRACT is based on heuristic methods and state-of-the-art text mining libraries (e.g., GROBID and Stanza), but features a classifier that discriminates acknowledgement entities from named entities by analyzing the multiple subject-predicate-object relations in a sentence. Our approach successfully recognizes acknowledgement entities that cannot be recognized by OIE packages such as REVERB, OLLIE, and the AllenNLP SRL library.

Figure 7: Trend of acknowledgements to the top 10 acknowledged organizations from 1983 to 2020 (series: the ten organizations in Table 4).
This method is applied to the CORD-19 dataset released on April 10, 2020, processing one PDF document in 5 seconds on average. Our results indicate that only 50–60% of named entities are acknowledged; the rest are mentioned to provide additional information (e.g., affiliation or location) about acknowledgement entities. Working on clean data, our method achieves an overall F1 = 0.92 for person and organization entities. The trend analysis of the CORD-19 papers verifies that more and more papers include acknowledgement entities since 2002, when the SARS outbreak happened. The trend also reveals that the overall number of acknowledgements to NIH is gradually decreasing over the past 10 years, while more papers acknowledge NSFC, a Chinese funding agency. One caveat of our method is that organizations in different countries are not distinguished; for example, many countries have agencies called "Ministry of Health". In the future, we plan to build learning-based models for sentence classification and entity classification. The code and data of this project have been released on GitHub at https://github.com/lamps-lab/ackextract.
Acknowledgments
This work was partially supported by the Defense Advanced Research Projects Agency (DARPA) under cooperative agreement No. W911NF-19-2-0272. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

James Andreoni and Laura K. Gee. 2015. Gunning for efficiency with third party enforcement in threshold public goods. Experimental Economics, 18(1):154–171.

Hannah Bast and Claudius Korzen. 2017. A benchmark and evaluation for text extraction from PDF. In 2017 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19-23, 2017, pages 99–108. IEEE Computer Society.

Steven Bird. 2006. NLTK: the natural language toolkit. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006. The Association for Computer Linguistics.

Cronin Blaise. 2001. Acknowledgement trends in the research literature of information science. 57(3):427–433.

Isaac G. Councill, C. Lee Giles, Hui Han, and Eren Manavoglu. 2005. Automatic acknowledgement indexing: Expanding the semantics of contribution in the CiteSeer digital library. In Proceedings of the 3rd International Conference on Knowledge Capture, K-CAP '05, pages 19–26, New York, NY, USA. ACM.

Blaise Cronin, Gail McKenzie, Lourdes Rubio, and Sherrill Weaver-Wozniak. 1993. Accounting for influence: Acknowledgments in contemporary sociology. Journal of the American Society for Information Science, 44(7):406–412.

Blaise Cronin, Gail McKenzie, and Michael Stiffler. 1992. Patterns of acknowledgement. Journal of Documentation.

S. Dai, Y. Ding, Z. Zhang, W. Zuo, X. Huang, and S. Zhu. 2019. GrantExtractor: Accurate grant support information extraction from biomedical fulltext based on Bi-LSTM-CRF. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pages 1–1.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1535–1545.

C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, USA, pages 89–98.

C. Lee Giles and Isaac G. Councill. 2004. Who gets acknowledged? Measuring scientific contributions through automatic acknowledgment indexing. Proceedings of the National Academy of Sciences, 101(51):17599–17604.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Subhradeep Kayal, Zubair Afzal, George Tsatsaronis, Marius Doornenbal, Sophia Katrenko, and Michelle Gregory. 2019. A framework to automatically extract funding information from text. In Machine Learning, Optimization, and Data Science, pages 317–328, Cham. Springer International Publishing.

Madian Khabsa, Pucktada Treeratpituk, and C. Lee Giles. 2012. AckSeer: a repository and search engine for automatically extracted acknowledgments from digital libraries. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '12, Washington, DC, USA, June 10-14, 2012, pages 185–194. ACM.

Mario Lipinski, Kevin Yao, Corinna Breitinger, Joeran Beel, and Bela Gipp. 2013. Evaluation of header metadata extraction approaches and tools for scientific PDF documents. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, pages 385–386, New York, NY, USA. ACM.

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S. Weld. 2019. S2ORC: The Semantic Scholar open research corpus.

Patrice Lopez. 2009. GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries, ECDL'09, pages 473–474, Berlin, Heidelberg. Springer-Verlag.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 523–534, Stroudsburg, PA, USA. Association for Computational Linguistics.

Adele Paul-Hus, Adrian A. Diaz-Faes, Maxime Sainte-Marie, Nadine Desrochers, Rodrigo Costas, and Vincent Lariviere. 2017. Beyond funding: Acknowledgement patterns in biomedical, natural and social sciences. PLOS ONE, 12(10):1–14.

Adele Paul-Hus, Philippe Mongeon, Maxime Sainte-Marie, and Vincent Lariviere. 2020. Who are the acknowledgees? An analysis of gender and academic status. Quantitative Science Studies, 0(0):1–17.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. CoRR, abs/2003.07082.

Radim Rehurek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA.

Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. CoRR, abs/1904.05255.

Mayank Singh, Barnopriyo Barua, Priyank Palod, Manvi Garg, Sidhartha Satapathy, Samuel Bushi, Kumar Ayush, Krishna Sai Rohith, Tulasi Gamidi, Pawan Goyal, and Animesh Mukherjee. 2016. OCR++: A robust framework for information extraction from scholarly articles. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 3390–3400. ACL.

E. Smith, B. Williams-Jones, Z. Master, V. Lariviere, C. R. Sugimoto, A. Paul-Hus, M. Shi, and D. B. Resnik. 2019. Misconduct and misbehavior related to authorship disagreements in collaborative science. Sci Eng Ethics.

Min Song, Keun Young Kang, Tatsawan Timakum, and Xinyuan Zhang. 2020. Examining influential factors for acknowledgements classification using supervised learning. PLOS ONE, 15(2):1–21.

H. Stewart, K. Brown, A. M. Dinan, N. Irigoyen, E. J. Snijder, and A. E. Firth. 2018. Transcriptional and translational landscape of equine torovirus. J Virol, 92(17).

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The COVID-19 open research dataset. CoRR, abs/2004.10706.

Jian Wu, Jason Killian, Huaiyu Yang, Kyle Williams, Sagnik Ray Choudhury, Suppawong Tuarob, Cornelia Caragea, and C. Lee Giles. 2015. PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 131–138, New York, NY, USA. ACM.
Figure 1: Upper: acknowledgement statements appear in an isolated section (Stewart et al., 2018). Bottom: acknowledgement statements appear in a footnote (Andreoni and Gee, 2015).
as the Natural Language Toolkit (NLTK) (Bird, 2006), e.g., Khabsa et al. (2012) and Paul-Hus et al. (2020), followed by simple semi-manual data cleansing, resulting in a fraction of entities that are mentioned but not actually acknowledged. In this paper, we design an automatic AER system called ACKEXTRACT that further classifies extraction results from open-source NER packages that recognize people and organizations, and distinguishes entities that are actually acknowledged. The extractor finds acknowledgement statements in isolated sections and in other locations such as footnotes, which is common for papers in social and behavioral sciences. Our contributions are:
1. Develop ACKEXTRACT as open-source software to automatically extract acknowledgement entities from research papers.

2. Apply ACKEXTRACT on the CORD-19 dataset and supplement the dataset with a corpus of classified acknowledgement entities for further studies.

3. Use the CORD-19 dataset as a case study to demonstrate that acknowledgement studies without classifying named entities can significantly overestimate the number of entities that are actually acknowledged, because many people and organizations are mentioned but not explicitly acknowledged.
2 Related Works
Early work on acknowledgement extraction was manual, which was labor-intensive. Cronin et al. (1993) extracted a total of 9,561 peer interactive communication (PIC) names from a total of 4,200 research sociology articles; most were persons' names. They also defined the following six categories of acknowledgement: moral support, financial support, editorial support, presentational support, instrumental/technical support, and conceptual support or PIC (Cronin et al., 1992).
Councill et al. (2005) used a hybrid method for automatic AER from research papers and automatically created an acknowledgement index (Giles and Councill, 2004). The algorithm first used a heuristic method to identify acknowledgement passages. It then used an SVM model to identify lines containing acknowledgement sentences outside labeled acknowledgement sections. A regular expression was used to extract entity names from acknowledging text. This method achieved an overall precision of about 0.785 and a recall of 0.896 on CiteSeer papers (Giles et al., 1998). The algorithm does not distinguish entity types.
Khabsa et al. (2012) leveraged OpenCalais1 and AlchemyAPI2, free services at that time, to extract named entities from acknowledgement sections and built ACKSEER, a search engine for acknowledgement entities. They merged the outputs of both NER APIs into a list and disambiguated entity mentions using the longest common subsequence (LCS) algorithm. The ground truth contains 200 top-cited CiteSeerX papers, of which 130 had acknowledgement sections. They achieved 92.3% precision and 91.6% recall for acknowledgement section extraction, but did not evaluate entity extraction.
Recent studies of acknowledgements tend to use results from off-the-shelf NER packages with simple filters, assuming that named entities are acknowledged entities. For example, Paul-Hus et al. (2020) use the Stanford NER module in NLTK to extract persons; Song et al. (2020) also directly use people and organizations recognized by Stanford CoreNLP (Manning et al., 2014). These works achieved a high recall by recognizing most named entities in the acknowledgements, but ignored their relations to the papers where they appear, resulting in a fraction of entities that are mentioned but not actually acknowledged. Song et al. (2020) consider grammar structure, such as verb tense and voice, and sentence patterns when labeling sentences with their six categories; for example, "was funded" is followed by an "organization". However, they only label sentences and do not annotate them down to
1 https://web.archive.org/web/20081023021111/http://opencalais.com/calaisAPI
2 https://web.archive.org/web/20090921044923/http://www.alchemyapi.com
Figure 2: Architecture of ACKEXTRACT.
the entity level. Our system examines the relationship between entities and the current work with the purpose of discriminating acknowledgement entities from named entities. In this system, we focus on people and organizations.
Recently, Dai et al. (2019) proposed GrantExtractor, a pipeline system to extract grant support information from scientific papers. A model combining BiLSTM-CRF and pattern matching was used to extract grant numbers and agencies from funding sentences, which are identified using heuristic methods. The system achieves a micro-F1 of up to 0.90 in extracting grant pairs (agency, number).
Kayal et al. (2019) proposed an ensemble approach called FUNDINGFINDER for extracting funding information from text. The authors construct feature vectors for candidate entities based on whether the entities are recognized by four NER implementations: Stanford (Conditional Random Field model), LingPipe (Hidden Markov model), OpenNLP (Maximum Entropy model), and Elsevier's Fingerprint Engine. The F1-measure for funding body is only 0.68 ± 0.3.
Our method differs from existing methods in three ways: (1) it is built on top of state-of-the-art neural NER methods, which results in a relatively high recall; (2) it uses a heuristic method to filter out entities that are just mentioned but not acknowledged; (3) it extracts both organizations and people.
3 Dataset
On March 16th, 2020, the Allen Institute for Artificial Intelligence released the first version of the COVID-19 Open Research Dataset (CORD-19) (Wang et al., 2020) in collaboration with several other institutions. The dataset contains metadata and segmented full text of research articles selected by searching a list of keywords about coronavirus, SARS-CoV, MERS, and other related terms from four digital libraries, including WHO, PubMed Central, BioRxiv, and MedRxiv. The initial dataset contained about 28k papers and was updated weekly with papers from new sources and the latest publications. We used the dataset released on April 10, 2020, containing over 59,312 metadata records, among which 54,756 have the full text in JSON format. The CORD-19 papers were generated by processing PDFs using the S2ORC pipeline (Lo et al., 2019), in which GROBID (Lopez, 2009) was employed for document segmentation and metadata extraction.
The full text in JSON files is directly used for sentence segmentation. However, we observe that GROBID extraction results are not perfect. In particular, we estimate the fraction of acknowledgement sections omitted, as well as the number of acknowledgement entities omitted in the data release, due to GROBID extraction errors. To do this, we downloaded 45,916 full-text PDF papers collected by the Internet Archive (IA), because the CORD-19 dataset does not include PDF files.3 We found 13,103 CORD-19 papers in the IA dataset.
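As a minimal sketch of consuming the full-text JSON, the snippet below pulls acknowledgement paragraphs out of a CORD-19-style record. The field names ("body_text", "section", "text") follow the public CORD-19 schema but should be checked against the specific release; the helper name is ours.

```python
import json

def ack_paragraphs(record):
    """Return body-text paragraphs whose section header mentions
    acknowledgements (sketch; field names assume the CORD-19 schema)."""
    return [p["text"] for p in record.get("body_text", [])
            if "acknowledg" in p.get("section", "").lower()]

# Toy record mimicking the CORD-19 full-text JSON layout.
record = json.loads('''{"body_text": [
  {"section": "Acknowledgements", "text": "We thank A. B. for comments."},
  {"section": "Methods", "text": "We used PCR."}]}''')
```

In the real pipeline this filtering is only a first pass; statements outside labeled sections (e.g., footnotes) require the heuristics described in Section 4.2.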
4 Acknowledgement Extraction
4.1 Overview

The architecture of our acknowledgement entity extraction system is depicted in Figure 2. The system can be divided into the following modules:
1. Document segmentation: CORD-19 provides full text as JSON files, but in general most research articles are published in PDF, so our first step is converting a PDF document to text and segmenting it into sections. We use GROBID, which has shown superior performance over many other document extraction methods (Lipinski et al., 2013). The output is an XML file in the TEI schema.
2. Sentence segmentation: Paragraphs are segmented into sentences. We compare several sentence segmentation software packages and choose Stanza (Qi et al., 2020) because of its relatively high accuracy.
3. Sentence classification: Sentences are classified into acknowledgement and non-acknowledgement statements. The result is a set of acknowledgement statements inside or outside the acknowledgement sections.
3 https://archive.org/download/covid19_fatcat_20200410
4. Entity recognition: Named entities are extracted from acknowledgement statements. We compare four commonly used NER software packages and choose Stanza because of its relatively high performance. In this work, we focus on person and organization entities.
5. Entity classification: In this module, we classify named entities by analyzing sentence structures, aiming to discriminate named entities that are actually acknowledged from those that are merely mentioned. We demonstrate that triple extraction packages such as REVERB and OLLIE fail to handle acknowledgement statements with multiple entities in objects in our dataset. The results are acknowledgement entities, including people and organizations.
4.2 Document Segmentation

The majority of scholarly papers are published in PDF format, which is not readily readable by text processors. Several attempts have been made to convert PDF into text (Bast and Korzen, 2017) and segment the document into section and sub-section levels. GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents into TEI-encoded documents. Other similar methods have been developed recently, such as OCR++ (Singh et al., 2016) and Science Parse.4 Lipinski et al. (2013) compared 7 metadata extraction methods and found that GROBID (version 0.4.0) achieved superior performance over the others. GROBID trained a cascade of conditional random field (CRF) models on PubMed and computer science papers. The recent version (0.6.0) has a set of powerful functionalities, such as extracting and parsing headers and segmenting full text. GROBID supports a batch mode and an API service mode; the latter enables large-scale document processing on multi-core servers, such as in CiteSeerX (Wu et al., 2015). A benchmarking result for version 0.6.0 shows that section title parsing achieves an F1 = 0.70 under the strict matching criteria and F1 = 0.75 under the soft matching criteria.5 Singh et al. (2016) claim OCR++ achieves better performance than GROBID in several fields, evaluated on computer science papers. However, the lack of a service-mode API and multi-domain adaptability limits its usability. Science Parse only extracts key metadata such as
4 https://github.com/allenai/spv2
5 https://grobid.readthedocs.io/en/latest/Benchmarking-pmc
title, authors, year, and venue. Therefore, we adopt GROBID to convert PDF documents into XML files.
Depending on the structure and provenance of PDFs, GROBID may miss acknowledgements in certain papers. To estimate the fraction of papers in which acknowledgements were missed by GROBID, we visually inspected a random sample of 200 papers from the CORD-19 dataset and found that only 146 papers (73%) contain acknowledgement statements, out of which GROBID successfully extracted all acknowledgement statements from 120 papers (82%). Of the remaining 26 papers that GROBID failed to parse, 17 have acknowledgements in sections and 9 in footnotes. We developed a heuristic method that can extract acknowledgement statements from all 120 papers with acknowledgement statements output by GROBID.
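One way to realize such a heuristic is a pass over the extracted section headers (and, failing that, the leading text of footnotes). This is a sketch under our own assumptions, not the exact pattern set used by ACKEXTRACT:

```python
import re

# Illustrative header patterns; the actual ACKEXTRACT list may be larger.
ACK_HEADER = re.compile(r"^\s*(acknowledge?ments?|funding)\b", re.IGNORECASE)

def find_ack_passages(blocks):
    """blocks: (header, body) pairs from the TEI output. Footnotes can be
    passed with an empty header and are matched on the body text instead."""
    hits = []
    for header, body in blocks:
        if ACK_HEADER.match(header) or (not header and ACK_HEADER.match(body)):
            hits.append(body)
    return hits
```

The regex covers both spellings ("acknowledgment" and "acknowledgement", singular or plural), which is why the optional "e?" appears mid-pattern.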
4.3 Sentence Segmentation
The acknowledgement sections or statements extracted above are paragraphs, which need to be segmented (or tokenized) into sentences. We compared four software packages for sentence segmentation: NLTK (Bird, 2006), Stanza (Qi et al., 2020), Gensim (Rehurek and Sojka, 2010), and the Pragmatic Segmenter.6
NLTK includes a sentence tokenization method, sent_tokenize(), which uses an unsupervised algorithm to build a model for abbreviated words, collocations, and words that start sentences, and then uses that model to find sentence boundaries. Stanza is a Python natural language analysis package developed by the Stanford NLP group; it models sentence segmentation as a tagging problem over character sequences, where the neural model predicts whether a given character is the end of a sentence. The split_sentences() function in the Gensim package splits a text and returns a list of sentences using unsupervised pattern recognition. The Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
To compare the above methods, we created a ground truth corpus by randomly selecting acknowledgment sections or statements from 47 papers and manually segmenting them, resulting in 100 sentences. Table 1 shows the comparison results for the four methods. The precision is calculated
6 https://github.com/diasks2/pragmatic_segmenter
Method      Precision  Recall  F1
Gensim      0.65       0.64    0.65
NLTK        0.72       0.69    0.70
Pragmatic   0.86       0.76    0.81
Stanza      0.81       0.88    0.84

Table 1: Sentence segmentation performance.
by dividing the number of correctly segmented sentences by the total number of sentences segmented. The recall is calculated by dividing the number of correctly segmented sentences by the total number of manually segmented sentences. Stanza outperforms the other three, achieving an F1 = 0.84.
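These definitions amount to set-overlap precision and recall, where a predicted sentence counts as correct only if it exactly matches a manually segmented one. A small helper (hypothetical, not from the released code) makes the computation explicit:

```python
def seg_scores(predicted, gold):
    """Precision, recall, and F1 for sentence segmentation."""
    correct = len(set(predicted) & set(gold))
    p = correct / len(predicted)   # correct / total predicted
    r = correct / len(gold)        # correct / total manually segmented
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, if a segmenter merges two gold sentences into one, that merged string matches nothing in the gold set, lowering both precision and recall.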
4.4 Sentence Classification
Not all sentences in acknowledgement sections express acknowledgement, such as the following sentence: "The funders played no role in the study or preparation of the manuscript" (Song et al., 2020). In this module, we classify sentences into acknowledgement and non-acknowledgement statements. We developed a set of regular expressions that match verbs (e.g., thank, gratitude to, indebted to), adjectives (e.g., grateful to), and nouns (e.g., helpful comments, useful feedback) to cover as many cases as possible. To evaluate this method, we manually selected 100 sentences, including 50 positive and 50 negative samples, from the sentences obtained in Section 4.3. Our results show that 96 out of 100 sentences were classified correctly, resulting in an accuracy of 0.96.
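A minimal sketch of such a classifier follows; the cue list is illustrative (the paper's actual regular expressions are more extensive) and the function name is ours:

```python
import re

# Illustrative cues: verbs, adjectives, and noun phrases signaling gratitude.
ACK_CUES = re.compile(
    r"\b(thank(s|ed)?|gratitude to|indebted to|grateful to|acknowledges?"
    r"|helpful comments|useful feedback|(is|was|were) (supported|funded))\b",
    re.IGNORECASE,
)

def is_ack_sentence(sentence: str) -> bool:
    """True if the sentence contains at least one acknowledgement cue."""
    return bool(ACK_CUES.search(sentence))
```

Note that the negative example above ("The funders played no role ...") contains no cue phrase and is correctly rejected, even though it appears inside an acknowledgement section.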
4.5 Entity Recognition
In this step, named entities are extracted using state-of-the-art NER software packages, including NLTK (Bird, 2006), Stanford CoreNLP (Manning et al., 2014), spaCy (Honnibal and Montani, 2017), and Stanza (Qi et al., 2020). Stanza is a Python library offering fully neural pre-trained models that provided state-of-the-art performance on many raw text processing tasks when it was released. The NER model adopted the contextualized sequence tagger in Akbik et al. (2018); the architecture includes a Bi-LSTM character-level language model followed by a one-layer Bi-LSTM sequence tagger with a conditional random field (CRF) encoder. Although Stanza was developed based on Stanford CoreNLP, the two exhibit different performance in our NER task.
The ground truth is built by randomly selecting
Method            Entity  Precision  Recall  F1
NLTK              Person  0.45       0.68    0.55
NLTK              Org     0.59       0.77    0.67
spaCy             Person  0.74       0.88    0.80
spaCy             Org     0.63       0.74    0.68
Stanford CoreNLP  Person  0.88       0.87    0.87
Stanford CoreNLP  Org     0.68       0.80    0.73
Stanza            Person  0.89       0.93    0.91
Stanza            Org     0.60       0.89    0.72

Table 2: Comparison of NER software packages.
100 acknowledgement paragraphs from sections, footnotes, and body text, and manually annotating person and organization entities without discriminating whether they are acknowledged or not. This results in 146 person and 209 organization entities. The comparison results indicate that, overall, Stanza outperforms the other three, achieving F1 = 0.91 for person and F1 = 0.72 for organization. In particular, the recall of Stanza is 9% higher than that of Stanford CoreNLP (Table 2).
4.6 Entity Classification
As we showed, not all named entities are acknowledged, such as the "Utrecht University" in Figure 1. Therefore, it is necessary to build a classifier to discriminate acknowledgement entities (entities that are thanked by the paper or the authors) from named entities.
The majority of acknowledgement statements in academic articles have a relatively standard subject-predicate-object (SPO) structure. They use a limited number of words or phrases, such as "thank", "acknowledge", "are grateful", "is supported", and "is funded", as predicates. However, the object can contain multiple named entities, some of which are used as attributes of the others. In rare cases, certain sentences may not have subjects and predicates, such as the first sentence in Figure 4.
Our approach can be divided into three steps. Two representative examples are illustrated in Figure 3, and the pseudo-code is shown in Algorithm 1.
Step 1: We resolve the type of voice (active or passive), the subject, and the predicate of a sentence using dependency parsing by Stanza, because named entities can appear in the subjective or objective part. We then locate all named entities. The semantic meaning of a predicate and its type of voice can be used to determine whether the acknowledged entities are in the objective or the subjective part; in most cases, the target entities are objects.
Step 2: A sentence with multiple entities in the objective part is split into shorter sentences, called "subsentences", so that each subsentence is associated with at most one named entity. This is done by first splitting the sentence by "and". For each subsentence, if the subject and predicate are missing, we fill them in using the subject and predicate of the original sentence. The object in each subsentence does not necessarily contain an entity; for example, in the right panel of Figure 3, because "expertise" is not a named entity, it is replaced by "none" in the subsentence. The SPO relations that do not contain named entities are removed.
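A simplified sketch of this splitting, using plain string operations in place of Stanza's dependency parse (the function name and the string-based subject check are our simplifications):

```python
def split_subsentences(sentence, subject, predicate):
    """Step-2 sketch: split on 'and' and re-attach the original subject
    and predicate to fragments that lack their own."""
    parts = [p.strip(" ,") for p in sentence.split(" and ")]
    subsents = []
    for p in parts:
        # Fragments without the subject inherit subject + predicate.
        if not p.lower().startswith(subject.lower()):
            p = f"{subject} {predicate} {p}"
        subsents.append(p)
    return subsents
```

Each resulting subsentence then carries at most one candidate entity, so the SPO relation check from Step 1 can be applied to it independently.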
There are two scenarios. In the first, a sentence or a subsentence may contain a list of entities with only the first being acknowledged. In the third example of Figure 4, the named entities are ['Qi Yang', 'Morehouse School of Medicine', 'Atlanta', 'GA'], but only the first entity is acknowledged; the remaining entities, which are recognized as organizations or locations, are used to supplement more information. In this scenario, only the first named entity is extracted.
Step 3: In the second scenario, acknowledged entities are connected by commas or "and", such as in "We thank Shirley Hauta, Yurij Popowych, Elaine van Moorlehem and Yan Zhou for help in the virus production." In this scenario, entities in this structure have the same type, indicating that they play similar roles. This parallel pattern can be captured by regular expressions. The entities resolved in Steps 2 and 3 are merged to form the final set.
The method to find the parallel structure is as follows. First, check whether each entity's type is person. If so, the entities are substituted with integer indexes; the sentence becomes "The authors would like to thank 0, 1 and 2 and the kind support of Bayer Animal Health GmbH and Virbac Group." If there are 3 or more consecutive numbers in this form, this is the parallel pattern, which is captured by regular expressions; the pattern also allows text between names (Figure 5). Next, the numbers in this part are extracted and mapped to the corresponding entities. In the example above, the numbers [0, 1, 2] correspond to the indexes of the entities [Norbert Mencke, Lourdes Mottier, David McGahie]. Similar pattern recognition is performed for organizations. The process is depicted in Figure 5.
Figure 3: Process mapping multiple subject-predicate-object relations to acknowledgement entities, with two representative examples: "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab for technical assistance with the assays" and "We would also like to thank the Canadian Wildlife Health Cooperative at the University of Saskatchewan for their support and expertise". "None" means the object does not contain named entities.
Method               Precision  Recall  F1
Stanza               0.57       0.91    0.70
RB (without step 3)  0.94       0.78    0.85
RB                   0.94       0.90    0.92

Table 3: Performance of Stanza and our relation-based (RB) classifier. Precision and recall show overall results combining person and organization.
We investigated open information extraction methods such as REVERB (Fader et al., 2011) and OLLIE (Mausam et al., 2012). We found that they only work for sentences with relatively simple structures, such as "The authors wish to thank Ming-Wei Guo for reagents and technical assistance", but fail on more complicated sentences with a long objective part or parallel subsentences (Figure 4). We also investigated the semantic role labeling (SRL) library in the AllenNLP toolkit (Shi and Lin, 2019). For the third sentence in Figure 4, the AllenNLP SRL library resolves the entire string after "thank" as an argument but fails to distinguish entity types. Our method can handle all statements in Figure 4.
As a baseline, we use Stanza to extract all named entities and compare its performance with our relation-based (RB) classifier. The ground truth is compiled using the same corpus described in Section 4.5, except that only acknowledgement entities (as opposed to all named entities) are labeled positive. The results (Table 3) show that Stanza achieves high recall but poor precision, indicating that a significant fraction (about 40%) of named entities are not acknowledged. In contrast, our classifier (RB) achieves a precision of 0.94 with a small loss of recall, achieving an F1 = 0.92.
One limitation of the RB classifier is that it relies on text quality. Sentences are expected to follow
Figure 4: Sentence examples from which REVERB or OLLIE failed to extract acknowledgement entities, which can be identified by our method: (1) "To Aulio Costa Zambenedetti for the art work; Nilson Fidêncio, Silvio Marques, Tania Schepainski and Sibelli Tanjoni de Souza for technical assistance." (2) "The authors also thank the Program for Technological Development in Tools for Health-PDTIS FIOCRUZ for the use of its facilities (Platform RPT09H - Real Time PCR - Instituto Carlos Chagas/Fiocruz-PR)." (3) "The authors wish to thank Ms. Qi Yang, DNA sequencing lab, Morehouse School of Medicine, Atlanta, GA, and the Research Cores at Morehouse School of Medicine supported through the Research Centers at Minority Institutions (RCMI) Program NIH/NCRR/RCMI Grant G12-RR03034."
Figure 5: Merging the results given by Step 2 and Step 3 to obtain the final entity set, for the sentence "The authors would like to thank Norbert Mencke, Lourdes Mottier and David McGahie and the kind support of Bayer Animal Health GmbH and Virbac Group". Step 2 yields [Norbert Mencke, David McGahie, Bayer Animal Health, Virbac Group]; Step 3 yields [Norbert Mencke, Lourdes Mottier, David McGahie]; the merged result is [Norbert Mencke, David McGahie, Bayer Animal Health, Virbac Group, Lourdes Mottier].
the editorial convention, such that entity names are clearly segmented by periods. If two sentences are not properly delimited, e.g., missing a period, the classifier may make incorrect predictions.
5 Data Analysis
The CORD-19 dataset contains PMC and PDF folders. We found that almost all papers in the PMC folders are included in the PDF folders. For consistency, we only work on papers in the PDF folders, containing 37,612 unique PDF papers.7 Using ACKEXTRACT, we extracted a total of 102,965 named entities from 21,469 CORD-19 papers. The fraction of papers with named entities (21,469/37,612 ≈ 57%) is roughly consistent with the fraction obtained from the 200 samples (120/200 ≈ 60%, Section 4.2). Among all named entities, 58,742 are acknowledgement entities. These numbers suggest that, using our model, only about 50% of named entities are acknowledged. Using only named entities, acknowledgement studies could significantly overestimate the
7We excluded 2003 duplicate papers with exactly the sameSHA1 values
Algorithm 1 Relation-Based Classifier1 Function find entity()2 pre-process with punctuation and format3 find candidate entities entity list by Stanza4 for x isin entity list do5 if x isin
subject part (differs based on predicate)then
6 x = none
7 entity listremove(none)8 return entity list
9 Input sentence10 Output entity list all11 find subject part predicate and the object part12 if ldquopredicate valuerdquo isin reg expression 1 then13 if ldquoandrdquo isin sentence then14 split into subsentences by lsquoandrsquo15 for each subsentence do16 entity list
larr find entity(subsentence)17 if entity list 6= none then18 entity list all
larr appendentity list[0]
19 else if ldquoandrdquo isin sentence then20 entity listlarr find entity(subsentence)21 entity list alllarr entity list[0]
22 else if ldquopredicate valuerdquo isin reg expression 2 then23 do the same in if part but focus on entities in
subject24 for x isin candidate list by Stanza do25 if x isin sentenceampxtype = type then26 orig listlarr append x27 sentlarr replace x by index(x)
28 if reg format isin sent then29 templarr find reg expression in sent30 numlistlarr find numbers in temp31 listlarr index orig list by numlist
32 combine list and entity list all
number of acknowledgement entities by relyingon entities recognized by NER software packageswithout further classification Here we analyze ourresults and study some trends of acknowledgemententities in the CORD-19 dataset
The top 10 acknowledged organizations (Table 4) are all funding agencies. Overall, NIH (NIAID is an institute of NIH) is acknowledged the most, but funding agencies in other countries are also frequently acknowledged in CORD-19 papers.

Organization                                                  Count
National Institutes of Health (NIH)                            1414
National Natural Science Foundation of China (NSFC)             615
National Institute of Allergy & Infectious Diseases (NIAID)     209
National Science Foundation (NSF)                               183
Ministry of Health                                              160
Wellcome Trust                                                  146
Deutsche Forschungsgemeinschaft                                 119
Ministry of Education                                           119
National Science Council                                        104
Public Health Service                                            85

Table 4: Top 10 acknowledged organizations identified in our CORD-19 dataset.

Figure 6 shows the numbers of CORD-19 papers with vs. without acknowledgement (person or organization) recognized from 1970 to 2020. The large jump around 2002 was due to the wave of coronavirus studies during the SARS period. The small drop in 2020 was due to data incompleteness. The plot indicates that the fraction of papers with acknowledgement has been increasing gradually over the last 20 years.

Figure 6: Numbers of CORD-19 papers with vs. without acknowledgement recognized, from 1970 to 2020.

Figure 7 shows the numbers of acknowledgements to the top 10 acknowledged organizations from 1983 to 2020. The figure indicates that the number of acknowledgements to NIH has been gradually decreasing over the past 10 years, while the number for NSF is roughly constant. In contrast, the number for NSFC (a Chinese funding agency) has been gradually increasing. Note that the distribution of acknowledged organizations has a long tail, and organizations outside the top 10 actually dominate the total count. However, the top 10 organizations are the biggest research agencies, and the trend to some extent reflects strategic shifts in funding support.
6 Conclusion and Future Work
Here we extended the work of Khabsa et al. (2012) and built an acknowledgement extraction framework, denoted ACKEXTRACT, for research articles. ACKEXTRACT is based on heuristic methods and state-of-the-art text mining libraries (e.g., GROBID and Stanza), but features a classifier that discriminates acknowledgement entities from named entities by analyzing the multiple subject-predicate-object relations in a sentence. Our approach successfully recognizes acknowledgement entities that cannot be recognized by OIE packages such as REVERB, OLLIE, and the AllenNLP SRL library.

Figure 7: Trend of acknowledgements to the top 10 acknowledged organizations, from 1983 to 2020.
This method is applied to the CORD-19 dataset released on April 10, 2020, processing one PDF document in 5 seconds on average. Our results indicate that only 50–60% of named entities are acknowledged. The rest are mentioned to provide additional information (e.g., affiliation or location) about acknowledgement entities. Working on clean data, our method achieves an overall F1 = 0.92 for person and organization entities. The trend analysis of the CORD-19 papers verifies that more and more papers have included acknowledgement entities since 2002, when the SARS outbreak happened. The trend also reveals that the overall number of acknowledgements to NIH has been gradually decreasing over the past 10 years, while more papers acknowledge NSFC, a Chinese funding agency. One caveat of our method is that organizations in different countries are not distinguished; for example, many countries have agencies called "Ministry of Health". In the future, we plan to build learning-based models for sentence classification and entity classification. The code and data of this project have been released on GitHub at https://github.com/lamps-lab/ackextract
Acknowledgments
This work was partially supported by the Defense Advanced Research Projects Agency (DARPA) under cooperative agreement No. W911NF-19-2-0272. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

James Andreoni and Laura K. Gee. 2015. Gunning for efficiency with third party enforcement in threshold public goods. Experimental Economics, 18(1):154–171.

Hannah Bast and Claudius Korzen. 2017. A benchmark and evaluation for text extraction from PDF. In 2017 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19-23, 2017, pages 99–108. IEEE Computer Society.

Steven Bird. 2006. NLTK: the natural language toolkit. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006. The Association for Computer Linguistics.

Cronin Blaise. 2001. Acknowledgement trends in the research literature of information science. 57(3):427–433.

Isaac G. Councill, C. Lee Giles, Hui Han, and Eren Manavoglu. 2005. Automatic acknowledgement indexing: Expanding the semantics of contribution in the CiteSeer digital library. In Proceedings of the 3rd International Conference on Knowledge Capture, K-CAP '05, pages 19–26, New York, NY, USA. ACM.

Blaise Cronin, Gail McKenzie, Lourdes Rubio, and Sherrill Weaver-Wozniak. 1993. Accounting for influence: Acknowledgments in contemporary sociology. Journal of the American Society for Information Science, 44(7):406–412.

Blaise Cronin, Gail McKenzie, and Michael Stiffler. 1992. Patterns of acknowledgement. Journal of Documentation.

S. Dai, Y. Ding, Z. Zhang, W. Zuo, X. Huang, and S. Zhu. 2019. GrantExtractor: Accurate grant support information extraction from biomedical fulltext based on Bi-LSTM-CRF. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pages 1–1.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, pages 1535–1545.

C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, USA, pages 89–98.

C. Lee Giles and Isaac G. Councill. 2004. Who gets acknowledged: Measuring scientific contributions through automatic acknowledgment indexing. Proceedings of the National Academy of Sciences, 101(51):17599–17604.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Subhradeep Kayal, Zubair Afzal, George Tsatsaronis, Marius Doornenbal, Sophia Katrenko, and Michelle Gregory. 2019. A framework to automatically extract funding information from text. In Machine Learning, Optimization, and Data Science, pages 317–328, Cham. Springer International Publishing.

Madian Khabsa, Pucktada Treeratpituk, and C. Lee Giles. 2012. AckSeer: a repository and search engine for automatically extracted acknowledgments from digital libraries. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '12, Washington, DC, USA, June 10-14, 2012, pages 185–194. ACM.

Mario Lipinski, Kevin Yao, Corinna Breitinger, Joeran Beel, and Bela Gipp. 2013. Evaluation of header metadata extraction approaches and tools for scientific PDF documents. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, pages 385–386, New York, NY, USA. ACM.

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S. Weld. 2019. S2ORC: The Semantic Scholar open research corpus.

Patrice Lopez. 2009. GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries, ECDL'09, pages 473–474, Berlin, Heidelberg. Springer-Verlag.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 523–534, Stroudsburg, PA, USA. Association for Computational Linguistics.

Adèle Paul-Hus, Adrián A. Díaz-Faes, Maxime Sainte-Marie, Nadine Desrochers, Rodrigo Costas, and Vincent Larivière. 2017. Beyond funding: Acknowledgement patterns in biomedical, natural and social sciences. PLOS ONE, 12(10):1–14.

Adèle Paul-Hus, Philippe Mongeon, Maxime Sainte-Marie, and Vincent Larivière. 2020. Who are the acknowledgees? An analysis of gender and academic status. Quantitative Science Studies, 0(0):1–17.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. CoRR, abs/2003.07082.

Radim Rehurek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA.

Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. CoRR, abs/1904.05255.

Mayank Singh, Barnopriyo Barua, Priyank Palod, Manvi Garg, Sidhartha Satapathy, Samuel Bushi, Kumar Ayush, Krishna Sai Rohith, Tulasi Gamidi, Pawan Goyal, and Animesh Mukherjee. 2016. OCR++: A robust framework for information extraction from scholarly articles. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 3390–3400. ACL.

E. Smith, B. Williams-Jones, Z. Master, V. Larivière, C. R. Sugimoto, A. Paul-Hus, M. Shi, and D. B. Resnik. 2019. Misconduct and misbehavior related to authorship disagreements in collaborative science. Sci Eng Ethics.

Min Song, Keun Young Kang, Tatsawan Timakum, and Xinyuan Zhang. 2020. Examining influential factors for acknowledgements classification using supervised learning. PLOS ONE, 15(2):1–21.

H. Stewart, K. Brown, A. M. Dinan, N. Irigoyen, E. J. Snijder, and A. E. Firth. 2018. Transcriptional and translational landscape of equine torovirus. J Virol, 92(17).

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The COVID-19 open research dataset. CoRR, abs/2004.10706.

Jian Wu, Jason Killian, Huaiyu Yang, Kyle Williams, Sagnik Ray Choudhury, Suppawong Tuarob, Cornelia Caragea, and C. Lee Giles. 2015. PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 131–138, New York, NY, USA. ACM.
Figure 2: Architecture of ACKEXTRACT.
the entity level. Our system examines the relationship between entities and the current work, with the purpose of discriminating acknowledgement entities from named entities. In this system, we focus on people and organizations.
Recently, Dai et al. (2019) proposed GrantExtractor, a pipeline system to extract grant support information from scientific papers. A model combining BiLSTM-CRF and pattern matching was used to extract grant number and agency entities from funding sentences, which are identified using heuristic methods. The system achieves a micro-F1 of up to 0.90 in extracting grant pairs (agency, number).
Kayal et al. (2019) proposed an ensemble approach called FUNDINGFINDER for extracting funding information from text. The authors construct feature vectors for candidate entities based on whether the entities are recognized by four NER implementations: Stanford (conditional random field model), LingPipe (hidden Markov model), OpenNLP (maximum entropy model), and Elsevier's Fingerprint Engine. The F1-measure for funding body is only 0.68 ± 0.03.
Our method differs from existing methods in three ways: (1) it is built on top of state-of-the-art neural NER methods, which results in a relatively high recall; (2) it uses a heuristic method to filter out entities that are merely mentioned but not acknowledged; (3) it extracts both organizations and people.
3 Dataset
On March 16th, 2020, the Allen Institute of Artificial Intelligence released the first version of the COVID-19 Open Research Dataset (CORD-19) (Wang et al., 2020) in collaboration with several other institutions. The dataset contains metadata and segmented full text of research articles selected by searching a list of keywords about coronavirus, SARS-CoV, MERS, and other related terms from four digital libraries, including WHO, PubMed Central, BioRxiv, and MedRxiv. The initial dataset contained about 28k papers and was updated weekly with papers from new sources and the latest publications. We used the dataset released on April 10, 2020, containing over 59,312 metadata records, among which 54,756 have the full text in JSON format. The CORD-19 papers were generated by processing PDFs using the S2ORC pipeline (Lo et al., 2019), in which GROBID (Lopez, 2009) was employed for document segmentation and metadata extraction.

The full text in JSON files is directly used for sentence segmentation. However, we observe that GROBID extraction results are not perfect. In particular, we estimate the fraction of acknowledgement sections omitted. We also estimate the number of acknowledgement entities omitted in the data release due to GROBID extraction errors. To do this, we downloaded 45,916 full-text PDF papers collected by the Internet Archive (IA), because the CORD-19 dataset does not include PDF files.3 We found 13,103 CORD-19 papers in the IA dataset.
4 Acknowledgement Extraction
4.1 Overview

The architecture of our acknowledgement entity extraction system is depicted in Figure 2. The system can be divided into the following modules:
1. Document segmentation: CORD-19 provides full text as JSON files, but in general most research articles are published in PDF, so our first step is converting a PDF document to text and segmenting it into sections. We use GROBID, which has shown superior performance over many other document extraction methods (Lipinski et al., 2013). The output is an XML file in the TEI schema.

2. Sentence segmentation: Paragraphs are segmented into sentences. We compare several sentence segmentation software packages and choose Stanza (Qi et al., 2020) because of its relatively high accuracy.

3. Sentence classification: Sentences are classified into acknowledgement and non-acknowledgement statements. The result is a set of acknowledgement statements inside or outside the acknowledgement sections.
3 https://archive.org/download/covid19_fatcat_20200410
4. Entity recognition: Named entities are extracted from acknowledgement statements. We compare four commonly used NER software packages and choose Stanza because of its relatively high performance. In this work, we focus on person and organization entities.

5. Entity classification: In this module, we classify named entities by analyzing sentence structures, aiming at discriminating named entities that are actually acknowledged from those that are just mentioned. We demonstrate that triple extraction packages such as REVERB and OLLIE fail to handle acknowledgement statements with multiple entities in objects in our dataset. The results are acknowledgement entities, including people and organizations.
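Under the assumption that each module exposes a simple function, the data flow through the five modules can be sketched as follows. All names and the toy stand-in implementations below are ours for illustration, not ACKEXTRACT's actual API:

```python
import re

# Toy stand-ins for GROBID / Stanza / the RB classifier, to show the data flow.
def segment_document(pdf_text):           # module 1: PDF/JSON -> sections
    return {"acknowledgement": pdf_text}

def segment_sentences(paragraph):         # module 2: paragraph -> sentences
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def is_acknowledgement(sentence):         # module 3: keep ack statements
    return bool(re.search(r"\b(thank|grateful|acknowledge)", sentence, re.I))

def named_entities(sentence):             # module 4: NER (here: a toy lookup)
    return [n for n in ("Jane Doe", "Utrecht University") if n in sentence]

def acknowledged(sentence, entities):     # module 5: entity classification
    return entities[:1]                   # toy rule: keep only the first entity

def ackextract(pdf_text):
    sections = segment_document(pdf_text)
    results = []
    for sentence in segment_sentences(sections["acknowledgement"]):
        if is_acknowledgement(sentence):
            results.extend(acknowledged(sentence, named_entities(sentence)))
    return results

print(ackextract("We thank Jane Doe of Utrecht University. Funders played no role."))
```

Each stand-in would be replaced by the corresponding tool (GROBID for module 1, Stanza for modules 2 and 4, and the classifiers of Sections 4.4 and 4.6 for modules 3 and 5).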
4.2 Document Segmentation

The majority of scholarly papers are published in PDF format, which is not readily readable by text processors. Several attempts have been made to convert PDF into text (Bast and Korzen, 2017) and segment the document into section and sub-section levels. GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents into TEI-encoded documents. Other similar methods have been developed recently, such as OCR++ (Singh et al., 2016) and Science Parse.4 Lipinski et al. (2013) compared 7 metadata extraction methods and found that GROBID (version 0.4.0) achieved superior performance over the others. GROBID trained a cascade of conditional random field (CRF) models on PubMed and computer science papers. The recent version (0.6.0) has a set of powerful functionalities, such as extracting and parsing headers and segmenting the full text. GROBID supports a batch mode and an API service mode; the latter enables large-scale document processing on multi-core servers, such as in CiteSeerX (Wu et al., 2015). A benchmarking result for version 0.6.0 shows that section title parsing achieves F1 = 0.70 under the strict matching criteria and F1 = 0.75 under the soft matching criteria.5 Singh et al. (2016) claim that OCR++ achieves better performance than GROBID in several fields, evaluated on computer science papers; however, the lack of a service mode, an API, and multi-domain adaptability limits its usability. Science Parse only extracts key metadata, such as title, authors, year, and venue. Therefore, we adopt GROBID to convert PDF documents into XML files.

4 https://github.com/allenai/spv2
5 https://grobid.readthedocs.io/en/latest/Benchmarking-pmc
Depending on the structure and provenance of PDFs, GROBID may miss acknowledgements in certain papers. To estimate the fraction of papers in which acknowledgements were missed by GROBID, we visually inspected a random sample of 200 papers from the CORD-19 dataset and found that only 146 papers (73%) contain acknowledgement statements, out of which GROBID successfully extracted all acknowledgement statements from 120 papers (82%). For the remaining 26 papers that GROBID failed to parse, the acknowledgements of 17 papers are in sections and those of 9 papers are in footnotes. We developed a heuristic method that can extract acknowledgement statements from all 120 papers with acknowledgement statements output by GROBID.
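As an illustration of what this step consumes, the snippet below pulls acknowledgement paragraphs out of a TEI document using only the standard library. It assumes GROBID's convention of a `<div type="acknowledgement">` element; the sample string is a hand-made stand-in, not real GROBID output:

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def acknowledgement_text(tei_xml: str) -> str:
    """Concatenate all paragraphs found inside acknowledgement divs."""
    root = ET.fromstring(tei_xml)
    paragraphs = []
    for div in root.iter(TEI_NS + "div"):
        if div.get("type") == "acknowledgement":
            for p in div.iter(TEI_NS + "p"):
                paragraphs.append("".join(p.itertext()).strip())
    return " ".join(paragraphs)

# Hand-made sample mimicking the TEI layout (illustrative only).
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><back>
    <div type="acknowledgement">
      <p>We thank Jane Doe for technical assistance.</p>
    </div>
  </back></text>
</TEI>"""

print(acknowledgement_text(sample))
```

The heuristic fallback described above would then scan the remaining body and footnote divs for acknowledgement cues when no such div is present.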
4.3 Sentence Segmentation
The acknowledgement sections or statements extracted above are paragraphs, which need to be segmented (or tokenized) into sentences. We compared four software packages for sentence segmentation: NLTK (Bird, 2006), Stanza (Qi et al., 2020), Gensim (Rehurek and Sojka, 2010), and the Pragmatic Segmenter.6

NLTK includes a sentence tokenization method, sent_tokenize(), which uses an unsupervised algorithm to build a model for abbreviated words, collocations, and words that start sentences, and then uses that model to find sentence boundaries. Stanza is a Python natural language analysis package developed by the Stanford NLP group; sentence segmentation is modeled as a tagging problem over character sequences, where the neural model predicts whether a given character is the end of a sentence. The split_sentences() function in the Gensim package splits a text and returns a list of sentences using unsupervised pattern recognition. The Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.

To compare the above methods, we created a ground truth corpus by randomly selecting acknowledgment sections or statements from 47 papers and manually segmenting them, resulting in 100 sentences. Table 1 shows the comparison results for the four methods. The precision is calculated

6 https://github.com/diasks2/pragmatic_segmenter
Method      Precision  Recall  F1
Gensim      0.65       0.64    0.65
NLTK        0.72       0.69    0.70
Pragmatic   0.86       0.76    0.81
Stanza      0.81       0.88    0.84

Table 1: Sentence segmentation performance.
by dividing the number of correctly segmented sentences by the total number of sentences segmented. The recall is calculated by dividing the number of correctly segmented sentences by the total number of manually segmented sentences. Stanza outperforms the other three, achieving F1 = 0.84.
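These precision and recall definitions can be coded directly. The helper below is our illustrative sketch, using exact string match (after whitespace stripping) as the correctness criterion, which is our assumption:

```python
def segmentation_scores(predicted, gold):
    """Precision, recall, and F1 for sentence segmentation.

    A predicted sentence counts as correct only if it exactly matches
    a gold sentence after stripping surrounding whitespace.
    """
    pred = [s.strip() for s in predicted]
    ref = {s.strip() for s in gold}
    correct = sum(1 for s in pred if s in ref)
    precision = correct / len(pred)
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: the segmenter merged the last two gold sentences into one.
gold = ["We thank A.", "We thank B.", "C helped us.", "D funded us."]
predicted = ["We thank A.", "We thank B.", "C helped us. D funded us."]
p, r, f1 = segmentation_scores(predicted, gold)
```

Here 2 of the 3 predicted sentences are correct against 4 gold sentences, giving precision 2/3 and recall 1/2.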
4.4 Sentence Classification

Not all sentences in acknowledgement sections express acknowledgement, such as the following sentence: "The funders played no role in the study or preparation of the manuscript" (Song et al., 2020). In this module, we classify sentences into acknowledgement and non-acknowledgement statements. We developed a set of regular expressions that match verbs (e.g., thank, gratitude to, indebted to), adjectives (e.g., grateful to), and nouns (e.g., helpful comments, useful feedback) to cover as many cases as possible. To evaluate this method, we manually selected 100 sentences, including 50 positive and 50 negative samples, from the sentences obtained in Section 4.3. Our results show that 96 out of 100 sentences were classified correctly, resulting in an accuracy of 0.96.
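A minimal sketch of such a regular-expression classifier follows; the cue list here is a small sample of the patterns described above, not the full set used in ACKEXTRACT:

```python
import re

# Sample acknowledgement cues: verbs, adjectives, and noun phrases.
ACK_CUES = re.compile(
    r"\b(thanks?|gratitude to|indebted to|grateful to|"
    r"helpful comments|useful feedback|acknowledge[sd]?)\b",
    re.IGNORECASE,
)

def is_acknowledgement(sentence: str) -> bool:
    """Classify a sentence as an acknowledgement statement."""
    return ACK_CUES.search(sentence) is not None

print(is_acknowledgement("We thank Jane Doe for helpful comments."))  # True
print(is_acknowledgement("The funders played no role in the study."))  # False
```

A funder-disclaimer sentence like the second example contains no cue and is correctly rejected, which is exactly the distinction this module is meant to draw.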
4.5 Entity Recognition

In this step, named entities are extracted using state-of-the-art NER software packages, including NLTK (Bird, 2006), Stanford CoreNLP (Manning et al., 2014), spaCy (Honnibal and Montani, 2017), and Stanza (Qi et al., 2020). Stanza is a Python library offering fully neural pre-trained models that provided state-of-the-art performance on many raw text processing tasks when it was released. The NER model adopts the contextualized sequence tagger of Akbik et al. (2018); the architecture includes a Bi-LSTM character-level language model followed by a one-layer Bi-LSTM sequence tagger with a conditional random field (CRF) encoder. Although Stanza was developed based on Stanford CoreNLP, they exhibit different performances on our NER task.
The ground truth is built by randomly selecting
Method            Entity  Precision  Recall  F1
NLTK              Person  0.45       0.68    0.55
                  Org     0.59       0.77    0.67
spaCy             Person  0.74       0.88    0.80
                  Org     0.63       0.74    0.68
Stanford CoreNLP  Person  0.88       0.87    0.87
                  Org     0.68       0.80    0.73
Stanza            Person  0.89       0.93    0.91
                  Org     0.60       0.89    0.72

Table 2: Comparison of NER software packages.
100 acknowledgement paragraphs from sections, footnotes, and body text, and manually annotating person and organization entities, without discriminating whether they are acknowledged or not. This results in 146 person and 209 organization entities. The comparison results indicate that overall Stanza outperforms the other three, achieving F1 = 0.91 for person and F1 = 0.72 for organization. In particular, the recall of Stanza is 9% higher than that of Stanford CoreNLP (Table 2).
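In practice, this step amounts to keeping only the PERSON and ORG spans from the NER output. The sketch below simulates Stanza's entities as plain `(text, type)` pairs so it runs without model downloads; the commented-out lines show how the pairs would be obtained with Stanza installed (an assumption based on Stanza's documented `Document.ents` interface):

```python
# With Stanza installed and English models downloaded, the pairs would come from:
#   import stanza
#   nlp = stanza.Pipeline("en", processors="tokenize,ner")
#   entities = [(e.text, e.type) for e in nlp(paragraph).ents]

def people_and_orgs(entities):
    """Keep only person and organization entities from NER output."""
    keep = {"PERSON": "person", "ORG": "organization"}
    return [(text, keep[etype]) for text, etype in entities if etype in keep]

# Simulated NER output for:
# "We thank Jane Doe of Utrecht University, funded by an NIH grant."
entities = [("Jane Doe", "PERSON"), ("Utrecht University", "ORG"),
            ("NIH", "ORG")]
print(people_and_orgs(entities))
```

Note that all three entities survive this step; deciding which of them are actually acknowledged is deferred to the entity classification module below.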
4.6 Entity Classification

As we showed, not all named entities are acknowledged, such as the "Utrecht University" in Figure 1. Therefore, it is necessary to build a classifier to discriminate acknowledgement entities, i.e., entities that are thanked by the paper or the authors, from named entities.

The majority of acknowledgement statements in academic articles have a relatively standard subject-predicate-object (SPO) structure. They use a limited number of words or phrases, such as "thank", "acknowledge", "are grateful", "is supported", and "is funded", as predicates. However, the object can contain multiple named entities, some of which are used as attributes of the others. In rare cases, certain sentences may not have subjects and predicates, such as the first sentence in Figure 4.

Our approach can be divided into three steps. Two representative examples are illustrated in Figure 3. The pseudo-code is shown in Algorithm 1.

Step 1: We resolve the type of voice (active or passive), the subject, and the predicate of a sentence using dependency parsing by Stanza, because named entities can appear in either the subjective or the objective part. We then locate all named entities. The semantic meaning of a predicate and its type of voice can be used to determine whether the acknowledged entities are in the objective part or the subjective part; in most cases, the target entities are objects.
Step 2: A sentence with multiple entities in the objective part is split into shorter sentences, called "subsentences", so that each subsentence is associated with at most one named entity. This is done by first splitting the sentence by "and". For each subsentence, if the subject and predicate are missing, we fill them in using the subject and predicate of the original sentence. The object in each subsentence does not necessarily contain an entity; for example, in the right panel of Figure 3, because "expertise" is not a named entity, it is replaced by "none" in the subsentence. The SPO relations that do not contain named entities are removed.

There are two scenarios. In the first scenario, a sentence or a subsentence may contain a list of entities with only the first being acknowledged. In the third example of Figure 4, the named entities are [Qi Yang, Morehouse School of Medicine, Atlanta, GA], but only the first entity is acknowledged; the remaining entities, which are recognized as organizations or locations, supplement additional information. In this scenario, only the first named entity is extracted.

Step 3: In the second scenario, acknowledged entities are connected by commas or "and", such as in "We thank Shirley Hauta, Yurij Popowych, Elaine van Moorlehem and Yan Zhou for help in the virus production". In this scenario, entities in this structure have the same type, indicating that they play similar roles. This parallel pattern can be captured by regular expressions. The entities resolved in Steps 2 and 3 are merged to form the final set.

The method to find the parallel structure is as follows. First, check whether each entity's type is person. If so, the entities are substituted with integer indexes; the sentence becomes "The authors would like to thank 0, 1 and 2 and the kind support of Bayer Animal Health GmbH and Virbac Group". If there are 3 or more consecutive numbers in this form, this is the parallel pattern, which is captured by regular expressions. The pattern also allows text between names (Figure 5). Next, the numbers in this part are extracted and mapped to the corresponding entities. In the example above, the numbers [0, 1, 2] correspond to the indexes of the entities [Norbert Mencke, Lourdes Mottier, David McGahie]. Similar pattern recognition is performed for organizations. The process is depicted in Figure 5.
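The index-substitution trick of Step 3 can be sketched as follows. This is a simplified illustration with a hypothetical helper name; the real method also handles organization entities and more varied text between names:

```python
import re

def parallel_persons(sentence, persons):
    """Find person entities appearing in a parallel (comma/'and') list.

    Each person entity is replaced by its integer index; a run of three
    or more indexes joined by commas and/or 'and' signals the pattern.
    """
    indexed = sentence
    for i, name in enumerate(persons):
        indexed = indexed.replace(name, str(i))
    # Three or more indexes separated by commas and/or 'and'.
    match = re.search(r"\d+(?:\s*(?:,|and)\s*\d+){2,}", indexed)
    if not match:
        return []
    return [persons[int(n)] for n in re.findall(r"\d+", match.group())]

sentence = ("The authors would like to thank Norbert Mencke, Lourdes Mottier "
            "and David McGahie and the kind support of Bayer Animal Health "
            "GmbH and Virbac Group")
persons = ["Norbert Mencke", "Lourdes Mottier", "David McGahie"]
print(parallel_persons(sentence, persons))
```

On this sentence, the indexed form contains "0, 1 and 2", so all three person entities are returned, while a two-name sentence would not trigger the pattern.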
Figure 3: Process mapping multiple subject-predicate-object relations to acknowledgement entities, with two representative examples. "None" means the object does not contain named entities. In one example, "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab for technical assistance with the assays" is split into subsentences yielding the triples ['we', 'thank', 'Anne Broadwater'] and ['we', 'thank', 'Nicolas Olivarez'], merged into ['Anne Broadwater', 'Nicolas Olivarez']. In the other (right panel), "We would also like to thank the Canadian Wildlife Health Cooperative at the University of Saskatchewan for their support and expertise" yields ['we', 'thank', 'Canadian Wildlife Health Cooperative'] plus a "none" relation for "expertise", giving ['Canadian Wildlife Health Cooperative'].
Method               Precision  Recall  F1
Stanza               0.57       0.91    0.70
RB (without step 3)  0.94       0.78    0.85
RB                   0.94       0.90    0.92

Table 3: Performance of Stanza and our relation-based (RB) classifier. Precision and recall show overall results combining person and organization entities.
We investigated open information extraction methods such as REVERB (Fader et al., 2011) and OLLIE (Mausam et al., 2012). We found that they only work for sentences with relatively simple structures, such as "The authors wish to thank Ming-Wei Guo for reagents and technical assistance", but fail on more complicated sentences with a long objective part or parallel subsentences (Figure 4). We also investigated the semantic role labeling (SRL) library in the AllenNLP toolkit (Shi and Lin, 2019). For the third sentence in Figure 4, the AllenNLP SRL library resolves the entire string after "thank" as an argument but fails to distinguish entity types. Our method can handle all statements in Figure 4.

As a baseline, we use Stanza to extract all named entities and compare its performance with our relation-based (RB) classifier. The ground truth is compiled using the same corpus described in Section 4.5, except that only acknowledgement entities (as opposed to all named entities) are labeled positive. The results (Table 3) show that Stanza achieves high recall but poor precision, indicating that a significant fraction (~40%) of named entities are not acknowledged. In contrast, our classifier (RB) achieves a precision of 0.94 with a small loss of recall, achieving F1 = 0.92.

One limitation of the RB classifier is that it relies on text quality. Sentences are expected to follow
Figure 4: Sentence examples from which REVERB or OLLIE failed to extract acknowledgement entities (underlined), which can be identified by our method: (1) "To Aulio Costa Zambenedetti for the art work, Nilson Fidêncio, Silvio Marques, Tania Schepainski and Sibelli Tanjoni de Souza for technical assistance." (2) "The authors also thank the Program for Technological Development in Tools for Health-PDTIS FIOCRUZ for the use of its facilities (Platform RPT09H - Real Time PCR - Instituto Carlos Chagas/Fiocruz-PR)." (3) "The authors wish to thank Ms Qi Yang, DNA sequencing lab, Morehouse School of Medicine, Atlanta, GA, and the Research Cores at Morehouse School of Medicine, supported through the Research Centers at Minority Institutions (RCMI) Program, NIH/NCRR/RCMI Grant G12-RR03034."
Figure 5: Merge the results given by Step 2 and Step 3 and obtain the final results. For the sentence "The authors would like to thank Norbert Mencke, Lourdes Mottier and David McGahie and the kind support of Bayer Animal Health GmbH and Virbac Group", Step 2 yields the persons [Norbert Mencke, David McGahie] and the organizations [Bayer Animal Health, Virbac Group], Step 3 yields the parallel person list [Norbert Mencke, Lourdes Mottier, David McGahie], and the merged final set is [Norbert Mencke, David McGahie, Bayer Animal Health, Virbac Group, Lourdes Mottier].
the editorial convention such that entity names areclearly segmented by period If two sentences arenot properly delimited eg missing a period theclassifier may make incorrect predictions
5 Data Analysis
The CORD-19 dataset contains PMC and PDFfolders We found that almost all papers in thePMC folders are included in the PDF foldersFor consistency we only work on papers in thePDF folders containing 37612 unique PDF pa-pers7 Using ACKEXTRACT we extracted a totalof 102965 named entities from 21469 CORD-19 papers The fraction of papers with namedentities (2146937612asymp57) is roughly consis-tent with the fraction obtained from the 200 sam-ples (120200asymp60 in Section 42) Among allnamed entities 58742 are acknowledgement enti-ties These numbers suggest that using our modelonly about 50 of named entities are acknowl-edged Using only named entities acknowledge-ment studies could significantly overestimate the
7We excluded 2003 duplicate papers with exactly the sameSHA1 values
Algorithm 1 Relation-Based Classifier1 Function find entity()2 pre-process with punctuation and format3 find candidate entities entity list by Stanza4 for x isin entity list do5 if x isin
subject part (differs based on predicate)then
6 x = none
7 entity listremove(none)8 return entity list
9 Input sentence10 Output entity list all11 find subject part predicate and the object part12 if ldquopredicate valuerdquo isin reg expression 1 then13 if ldquoandrdquo isin sentence then14 split into subsentences by lsquoandrsquo15 for each subsentence do16 entity list
larr find entity(subsentence)17 if entity list 6= none then18 entity list all
larr appendentity list[0]
19 else if ldquoandrdquo isin sentence then20 entity listlarr find entity(subsentence)21 entity list alllarr entity list[0]
22 else if ldquopredicate valuerdquo isin reg expression 2 then23 do the same in if part but focus on entities in
subject24 for x isin candidate list by Stanza do25 if x isin sentenceampxtype = type then26 orig listlarr append x27 sentlarr replace x by index(x)
28 if reg format isin sent then29 templarr find reg expression in sent30 numlistlarr find numbers in temp31 listlarr index orig list by numlist
32 combine list and entity list all
number of acknowledgement entities by relyingon entities recognized by NER software packageswithout further classification Here we analyze ourresults and study some trends of acknowledgemententities in the CORD-19 dataset
The top 10 acknowledged organizations (Ta-ble 4) are all funding agencies Overall NIH (NI-AID is an institute of NIH) is acknowledged themost but funding agencies in other countries arealso acknowledged a lot in CORD-19 papers
Figure 6 shows the numbers of CORD-19 pa-pers with vs without acknowledgement (personor organization) recognized from 1970 to 2020The huge leap around 2002 was due to the wave ofcoronavirus studies during the SARS period Thesmall drop-down in 2020 was due to data incom-pleteness The plot indicates that the fraction of
17
Organization Count
National Institutes of Health or NIH 1414National Natural Science Foundation of China or NSFC 615National Institute of Allergy amp Infectious Diseases or NIAID 209National Science Foundation or NSF 183Ministry of Health 160Wellcome Trust 146Deutsche Forschungsgemeinschaft 119Ministry of Education 119National Science Council 104Public Health Service 85
Table 4 Top 10 acknowledged organizations identifiedin our CORD-19 dataset
0
500
1000
1500
2000
2500
3000
3500
4000
1970
1972
1974
1976
1978
1980
1982
1984
1986
1988
1990
1992
1994
1996
1998
2000
2002
2004
2006
2008
2010
2012
2014
2016
2018
2020
With Acknowledgement Without Acknowledgement
Figure 6 Numbers of CORD-19 papers with vs with-out acknowledgement recognized from 1970 to 2020
papers with acknowledgement has been increasinggradually over the last 20 years Figure 7 showsthe numbers of the top 10 acknowledged organiza-tions from 1983 to 2020 The figure indicates thatthe number of acknowledgements to NIH has beengradually decreasing over the past 10 years whilethe acknowledgements to NSF is roughly constantIn contrast the number has been gradually increas-ing from NSFC (a Chinese funding agency) Notethat the distribution of acknowledged organizationshas a long tail and organizations behind the top 10actually dominate the total number However thetop 10 organizations are the biggest research agen-cies and the trend to some extent reflects strategicshifts of funding support
6 Conclusion and Future Work
Here we extended the work of Khabsa et al. (2012) and built an acknowledgement extraction framework for research articles, denoted ACKEXTRACT. ACKEXTRACT is based on heuristic methods and state-of-the-art text mining libraries (e.g., GROBID and Stanza), but features a classifier that discriminates acknowledgement entities from named entities by analyzing the multiple subject-predicate-object relations in a sentence. Our approach successfully recognizes acknowledgement entities that cannot be recognized by OIE packages such as REVERB, OLLIE, and the AllenNLP SRL library.

Figure 7: Trend of the top 10 acknowledged organizations from 1983 to 2020.
This method is applied to the CORD-19 dataset released on April 10, 2020, processing one PDF document in 5 seconds on average. Our results indicate that only 50–60% of named entities are acknowledged; the rest are mentioned to provide additional information (e.g., affiliation or location) about acknowledgement entities. Working on clean data, our method achieves an overall F1 = 0.92 for person and organization entities. The trend analysis of the CORD-19 papers verifies that more and more papers have included acknowledgement entities since 2002, when the SARS outbreak happened. The trend also reveals that the overall number of acknowledgements to NIH has gradually decreased over the past 10 years, while more papers acknowledge NSFC, a Chinese funding agency. One caveat of our method is that organizations in different countries are not distinguished; for example, many countries have agencies called "Ministry of Health". In the future, we plan to build learning-based models for sentence classification and entity classification. The code and data of this project have been released on GitHub at https://github.com/lamps-lab/ackextract.
Acknowledgments
This work was partially supported by the Defense Advanced Research Projects Agency (DARPA) under cooperative agreement No. W911NF-19-2-0272. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

James Andreoni and Laura K. Gee. 2015. Gunning for efficiency with third party enforcement in threshold public goods. Experimental Economics, 18(1):154–171.

Hannah Bast and Claudius Korzen. 2017. A benchmark and evaluation for text extraction from PDF. In 2017 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19-23, 2017, pages 99–108. IEEE Computer Society.

Steven Bird. 2006. NLTK: the natural language toolkit. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006. The Association for Computer Linguistics.

Cronin Blaise. 2001. Acknowledgement trends in the research literature of information science. 57(3):427–433.

Isaac G. Councill, C. Lee Giles, Hui Han, and Eren Manavoglu. 2005. Automatic acknowledgement indexing: Expanding the semantics of contribution in the CiteSeer digital library. In Proceedings of the 3rd International Conference on Knowledge Capture, K-CAP '05, pages 19–26, New York, NY, USA. ACM.

Blaise Cronin, Gail McKenzie, Lourdes Rubio, and Sherrill Weaver-Wozniak. 1993. Accounting for influence: Acknowledgments in contemporary sociology. Journal of the American Society for Information Science, 44(7):406–412.

Blaise Cronin, Gail McKenzie, and Michael Stiffler. 1992. Patterns of acknowledgement. Journal of Documentation.

S. Dai, Y. Ding, Z. Zhang, W. Zuo, X. Huang, and S. Zhu. 2019. GrantExtractor: Accurate grant support information extraction from biomedical fulltext based on Bi-LSTM-CRF. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pages 1–1.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, pages 1535–1545.

C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, USA, pages 89–98.

C. Lee Giles and Isaac G. Councill. 2004. Who gets acknowledged? Measuring scientific contributions through automatic acknowledgment indexing. Proceedings of the National Academy of Sciences, 101(51):17599–17604.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Subhradeep Kayal, Zubair Afzal, George Tsatsaronis, Marius Doornenbal, Sophia Katrenko, and Michelle Gregory. 2019. A framework to automatically extract funding information from text. In Machine Learning, Optimization, and Data Science, pages 317–328, Cham. Springer International Publishing.

Madian Khabsa, Pucktada Treeratpituk, and C. Lee Giles. 2012. AckSeer: a repository and search engine for automatically extracted acknowledgments from digital libraries. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '12, Washington, DC, USA, June 10-14, 2012, pages 185–194. ACM.

Mario Lipinski, Kevin Yao, Corinna Breitinger, Joeran Beel, and Bela Gipp. 2013. Evaluation of header metadata extraction approaches and tools for scientific pdf documents. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, pages 385–386, New York, NY, USA. ACM.

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S. Weld. 2019. S2ORC: The Semantic Scholar open research corpus.

Patrice Lopez. 2009. GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries, ECDL'09, pages 473–474, Berlin, Heidelberg. Springer-Verlag.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 523–534, Stroudsburg, PA, USA. Association for Computational Linguistics.

Adèle Paul-Hus, Adrián A. Díaz-Faes, Maxime Sainte-Marie, Nadine Desrochers, Rodrigo Costas, and Vincent Larivière. 2017. Beyond funding: Acknowledgement patterns in biomedical, natural and social sciences. PLOS ONE, 12(10):1–14.

Adèle Paul-Hus, Philippe Mongeon, Maxime Sainte-Marie, and Vincent Larivière. 2020. Who are the acknowledgees? An analysis of gender and academic status. Quantitative Science Studies, 0(0):1–17.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. CoRR, abs/2003.07082.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA.

Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. CoRR, abs/1904.05255.

Mayank Singh, Barnopriyo Barua, Priyank Palod, Manvi Garg, Sidhartha Satapathy, Samuel Bushi, Kumar Ayush, Krishna Sai Rohith, Tulasi Gamidi, Pawan Goyal, and Animesh Mukherjee. 2016. OCR++: A robust framework for information extraction from scholarly articles. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 3390–3400. ACL.

E. Smith, B. Williams-Jones, Z. Master, V. Larivière, C. R. Sugimoto, A. Paul-Hus, M. Shi, and D. B. Resnik. 2019. Misconduct and misbehavior related to authorship disagreements in collaborative science. Sci Eng Ethics.

Min Song, Keun Young Kang, Tatsawan Timakum, and Xinyuan Zhang. 2020. Examining influential factors for acknowledgements classification using supervised learning. PLOS ONE, 15(2):1–21.

H. Stewart, K. Brown, A. M. Dinan, N. Irigoyen, E. J. Snijder, and A. E. Firth. 2018. Transcriptional and translational landscape of equine torovirus. J Virol, 92(17).

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The COVID-19 open research dataset. CoRR, abs/2004.10706.

Jian Wu, Jason Killian, Huaiyu Yang, Kyle Williams, Sagnik Ray Choudhury, Suppawong Tuarob, Cornelia Caragea, and C. Lee Giles. 2015. PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 131–138, New York, NY, USA. ACM.
4. Entity recognition: Named entities are extracted from acknowledgement statements. We compare four commonly used NER software packages and choose Stanza because of its relatively high performance. In this work, we focus on person and organization entities.

5. Entity classification: In this module, we classify named entities by analyzing sentence structures, aiming at discriminating named entities that are actually acknowledged rather than just mentioned. We demonstrate that triple extraction packages such as REVERB and OLLIE fail to handle acknowledgement statements with multiple entities in objects in our dataset. The results are acknowledgement entities, including people or organizations.
4.2 Document Segmentation

The majority of scholarly papers are published in PDF format, which is not readily readable by text processors. Several attempts have been made to convert PDF into text (Bast and Korzen, 2017) and segment the document into section and sub-section levels. GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents into TEI-encoded documents. Other similar methods have been developed recently, such as OCR++ (Singh et al., 2016) and Science Parse.4 Lipinski et al. (2013) compared 7 metadata extraction methods and found that GROBID (version 0.4.0) achieved superior performance over the others. GROBID trained a cascade of conditional random field (CRF) models on PubMed and computer science papers. The recent version (0.6.0) has a set of powerful functionalities, such as extracting and parsing headers and segmenting full text. GROBID supports a batch mode and an API service mode; the latter enables large-scale document processing on multi-core servers, such as in CiteSeerX (Wu et al., 2015). A benchmarking result for version 0.6.0 shows that section title parsing achieves F1 = 0.70 under the strict matching criteria and F1 = 0.75 under the soft matching criteria.5 Singh et al. (2016) claim that OCR++ achieves better performance than GROBID in several fields, evaluated on computer science papers; however, the lack of a service-mode API and multi-domain adaptability limits its usability. Science Parse only extracts key metadata such as
4 https://github.com/allenai/spv2
5 https://grobid.readthedocs.io/en/latest/Benchmarking-pmc
title, authors, year, and venue. Therefore, we adopt GROBID to convert PDF documents into XML files.
Depending on the structure and provenance of PDFs, GROBID may miss acknowledgements in certain papers. To estimate the fraction of papers in which acknowledgements were missed by GROBID, we visually inspected a random sample of 200 papers from the CORD-19 dataset and found that only 146 papers (73%) contain acknowledgement statements, out of which GROBID successfully extracted all acknowledgement statements from 120 papers (82%). For the remaining 26 papers that GROBID failed to parse, the acknowledgements of 17 papers are in sections and those of 9 papers are in footnotes. We developed a heuristic method that can extract acknowledgement statements from all 120 papers with acknowledgement statements output by GROBID.
4.3 Sentence Segmentation
The acknowledgement sections or statements extracted above are paragraphs, which need to be segmented (or tokenized) into sentences. We compared four software packages for sentence segmentation: NLTK (Bird, 2006), Stanza (Qi et al., 2020), Gensim (Řehůřek and Sojka, 2010), and the Pragmatic Segmenter.6
NLTK includes a sentence tokenization method, sent_tokenize(), which uses an unsupervised algorithm to build a model for abbreviated words, collocations, and words that start sentences, and then uses that model to find sentence boundaries. Stanza is a Python natural language analysis package developed by the Stanford NLP group; sentence segmentation is modeled as a tagging problem over character sequences, where the neural model predicts whether a given character is the end of a sentence. The split_sentences() function in the Gensim package splits a text and returns a list of sentences using unsupervised pattern recognition. The Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
To compare the above methods, we created a ground truth corpus by randomly selecting acknowledgment sections or statements from 47 papers and manually segmenting them, resulting in 100 sentences. Table 1 shows the comparison results for the four methods. The precision is calculated
6 https://github.com/diasks2/pragmatic_segmenter
Method     Precision  Recall  F1
Gensim     0.65       0.64    0.65
NLTK       0.72       0.69    0.70
Pragmatic  0.86       0.76    0.81
Stanza     0.81       0.88    0.84

Table 1: Sentence segmentation performance.
by dividing the number of correctly segmented sentences by the total number of sentences segmented. The recall is calculated by dividing the number of correctly segmented sentences by the total number of manually segmented sentences. Stanza outperforms the other three, achieving F1 = 0.84.
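The scoring protocol above can be expressed in a few lines of code. This is only a sketch of the evaluation described in the text; the function and variable names are ours, and a prediction is counted as correct only when it exactly matches a manually segmented sentence.

```python
def segmentation_scores(predicted, gold):
    """Precision/recall/F1 for sentence segmentation: a predicted
    sentence counts as correct only if it exactly matches a gold
    (manually segmented) sentence."""
    correct = len(set(predicted) & set(gold))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = ["We thank A. Smith.", "This work was funded by NIH."]
# A splitter that breaks on the abbreviation "A." over-segments:
predicted = ["We thank A.", "Smith.", "This work was funded by NIH."]
p, r, f1 = segmentation_scores(predicted, gold)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.33 0.5 0.4
```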
4.4 Sentence Classification
Not all sentences in acknowledgement sections express acknowledgement, such as the following sentence: "The funders played no role in the study or preparation of the manuscript" (Song et al., 2020). In this module, we classify sentences into acknowledgement and non-acknowledgement statements. We developed a set of regular expressions that match verbs (e.g., thank, gratitude to, indebted to), adjectives (e.g., grateful to), and nouns (e.g., helpful comments, useful feedback) to cover as many cases as possible. To evaluate this method, we manually selected 100 sentences, including 50 positive and 50 negative samples, from the sentences obtained in Section 4.3. Our results show that 96 out of 100 sentences were classified correctly, resulting in an accuracy of 0.96.
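A regular-expression filter of this kind can be sketched as follows. This is a minimal illustration, not the paper's actual pattern set: the cue list here is a small subset we chose for demonstration.

```python
import re

# Illustrative subset of cue patterns; the paper's full regular
# expression set covers many more verbs, adjectives, and nouns.
ACK_CUES = re.compile(
    r"\b(thanks?|acknowledges?|gratitude to|indebted to|grateful to|"
    r"helpful comments|useful feedback|supported by|funded by)\b",
    re.IGNORECASE,
)

def is_acknowledgement(sentence: str) -> bool:
    """Classify a sentence as acknowledgement vs. non-acknowledgement."""
    return ACK_CUES.search(sentence) is not None

print(is_acknowledgement("We thank Jane Doe for helpful comments."))   # True
print(is_acknowledgement("The funders played no role in the study."))  # False
```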
4.5 Entity Recognition
In this step, named entities are extracted using state-of-the-art NER software packages, including NLTK (Bird, 2006), Stanford CoreNLP (Manning et al., 2014), spaCy (Honnibal and Montani, 2017), and Stanza (Qi et al., 2020). Stanza is a Python library offering fully neural pre-trained models that provided state-of-the-art performance on many raw text processing tasks when it was released. The NER model adopted the contextualized sequence tagger in Akbik et al. (2018); the architecture includes a Bi-LSTM character-level language model followed by a one-layer Bi-LSTM sequence tagger with a conditional random field (CRF) encoder. Although Stanza was developed based on Stanford CoreNLP, they exhibit different performances in our NER task.
The ground truth is built by randomly selecting 100 acknowledgement paragraphs from sections, footnotes, and body text, and manually annotating person and organization entities without discriminating whether they are acknowledged or not. This results in 146 person and 209 organization entities. The comparison results indicate that overall Stanza outperforms the other three, achieving F1 = 0.91 for person and F1 = 0.72 for organization entities. In particular, the recall of Stanza is 9% higher than that of Stanford CoreNLP (Table 2).

Method            Entity  Precision  Recall  F1
NLTK              Person  0.45       0.68    0.55
                  Org     0.59       0.77    0.67
spaCy             Person  0.74       0.88    0.80
                  Org     0.63       0.74    0.68
Stanford CoreNLP  Person  0.88       0.87    0.87
                  Org     0.68       0.80    0.73
Stanza            Person  0.89       0.93    0.91
                  Org     0.60       0.89    0.72

Table 2: Comparison of NER software packages.
4.6 Entity Classification

As we showed, not all named entities are acknowledged, such as the "Utrecht University" in Figure 1. Therefore, it is necessary to build a classifier that discriminates acknowledgement entities, i.e., entities that are thanked by the paper or the authors, from named entities.
The majority of acknowledgement statements in academic articles have a relatively standard subject-predicate-object (SPO) structure. They use a limited number of words or phrases, such as "thank", "acknowledge", "are grateful", "is supported", and "is funded", as predicates. However, the object can contain multiple named entities, some of which are used as attributes of the others. In rare cases, certain sentences may not have subjects and predicates, such as the first sentence in Figure 4.
Our approach can be divided into three steps. Two representative examples are illustrated in Figure 3, and the pseudo-code is shown in Algorithm 1.
Step 1: We resolve the type of voice (active or passive), the subject, and the predicate of a sentence using dependency parsing by Stanza, because named entities can appear in the subjective or objective part. We then locate all named entities. The semantic meaning of a predicate and its type of voice can be used to determine whether the acknowledged entities are in the objective or subjective part. In most cases, the target entities are objects.
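The role of the voice decision can be illustrated with a simplified stand-in. The paper resolves voice from Stanza's dependency parse; the regex heuristic below is only ours, for illustration of why voice matters.

```python
import re

def voice_of(sentence: str) -> str:
    """Crude active/passive heuristic: a form of 'to be' followed by
    a past participle. (The paper uses Stanza's dependency parse;
    this regex stand-in is only illustrative.)"""
    if re.search(r"\b(is|are|was|were)\s+\w+ed\b", sentence):
        return "passive"
    return "active"

# In the passive case ("is supported/funded by ...") the acknowledged
# entity sits in the object of "by"; in the common active case
# ("We thank ...") it sits in the objective part after the predicate.
print(voice_of("This work was supported by NSF."))  # passive
print(voice_of("We thank John Smith."))             # active
```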
15
Step 2: A sentence with multiple entities in the objective part is split into shorter sentences, called "subsentences", so that each subsentence is associated with at most one named entity. This is done by first splitting the sentence by "and". For each subsentence, if the subject and predicate are missing, we fill them in using the subject and predicate of the original sentence. The object in each subsentence does not necessarily contain an entity; for example, in the right panel of Figure 3, because "expertise" is not a named entity, it is replaced by "none" in the subsentence. The SPO relations that do not contain named entities are removed.
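The splitting-and-refilling idea of this step can be sketched as follows. This is a simplified sketch: the real system takes the subject and predicate from the dependency parse of each sentence, whereas here they are passed in as parameters.

```python
def split_subsentences(sentence, subject="We", predicate="thank"):
    """Split a coordinated acknowledgement sentence on 'and' and
    re-attach the original subject/predicate to fragments that lack
    them (simplified sketch of step 2)."""
    parts = [p.strip(" ,") for p in sentence.split(" and ")]
    subsentences = []
    for part in parts:
        if not part.lower().startswith(subject.lower()):
            # Fill in the missing subject and predicate from the
            # original sentence.
            part = f"{subject} {predicate} {part}"
        subsentences.append(part)
    return subsentences

sent = "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab"
for s in split_subsentences(sent):
    print(s)
# We thank Anne Broadwater
# We thank Nicolas Olivarez of the de Silva lab
```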
There are two scenarios. In the first scenario, a sentence or a subsentence may contain a list of entities with only the first being acknowledged. In the third example of Figure 4, the named entities are [Qi Yang, Morehouse School of Medicine, Atlanta, GA], but only the first entity is acknowledged; the rest, which are recognized as organizations or locations, are used to supplement more information. In this scenario, only the first named entity is extracted.
Step 3: In the second scenario, acknowledged entities are connected by commas or "and", such as in "We thank Shirley Hauta, Yurij Popowych, Elaine van Moorlehem and Yan Zhou for help in the virus production." In this scenario, entities in this structure have the same type, indicating that they play similar roles. This parallel pattern can be captured by regular expressions. The entities resolved in Steps 2 and 3 are merged to form the final set.
The method to find the parallel structure is as follows. First, we check whether each entity's type is person. If so, the entities are substituted with integer indexes, and the sentence becomes "The authors would like to thank 0, 1 and 2 and the kind support of Bayer Animal Health GmbH and Virbac Group". If there are 3 or more consecutive numbers in this form, this is the parallel pattern, which is captured by regular expressions; the pattern also allows text between names (Figure 5). Next, the numbers in this part are extracted and mapped to the corresponding entities. In the example above, the numbers [0, 1, 2] correspond to the indexes of the entities [Norbert Mencke, Lourdes Mottier, David McGahie]. Similar pattern recognition is performed for organizations. The process is depicted in Figure 5.
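The index-substitution trick can be sketched in a few lines. This is a hedged simplification: the candidate entity list (all of one type) is supplied by hand here, whereas the paper obtains it from Stanza, and the paper's regular expressions are richer than the single pattern below.

```python
import re

def parallel_entities(sentence, candidates):
    """Step-3 sketch: replace same-type candidate entities by their
    indexes, then accept them only if three or more indexes appear
    in a comma/'and'-separated run."""
    indexed = sentence
    for i, ent in enumerate(candidates):
        indexed = indexed.replace(ent, str(i))
    # "0, 1 and 2" style run of at least three indexes.
    run = re.search(r"\d+(?:\s*,\s*\d+)+\s*,?\s*and\s+\d+", indexed)
    if not run:
        return []
    picked = [int(n) for n in re.findall(r"\d+", run.group())]
    return [candidates[i] for i in picked]

sent = ("The authors would like to thank Norbert Mencke, Lourdes Mottier "
        "and David McGahie and the kind support of Bayer Animal Health")
people = ["Norbert Mencke", "Lourdes Mottier", "David McGahie"]
print(parallel_entities(sent, people))
# ['Norbert Mencke', 'Lourdes Mottier', 'David McGahie']
```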
Figure 3: Process mapping multiple subject-predicate-object relations to acknowledgement entities, with two representative examples ("We would also like to thank the Canadian Wildlife Health Cooperative at the University of Saskatchewan for their support and expertise" and "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab for technical assistance with the assays"). "None" means the object does not contain named entities.
Method               Precision  Recall  F1
Stanza               0.57       0.91    0.70
RB (without step 3)  0.94       0.78    0.85
RB                   0.94       0.90    0.92

Table 3: Performance of Stanza and our relation-based (RB) classifier. Precision and recall show overall results combining person and organization.
We investigated open information extraction methods such as REVERB (Fader et al., 2011) and OLLIE (Mausam et al., 2012). We found that they only work for sentences with relatively simple structures, such as "The authors wish to thank Ming-Wei Guo for reagents and technical assistance", but fail on more complicated sentences with a long objective part or parallel subsentences (Figure 4). We also investigated the semantic role labeling (SRL) library in the AllenNLP toolkit (Shi and Lin, 2019). For the third sentence in Figure 4, the AllenNLP SRL library resolves the entire string after "thank" as an argument but fails to distinguish entity types. Our method can handle all statements in Figure 4.
As a baseline, we use Stanza to extract all named entities and compare its performance with our relation-based (RB) classifier. The ground truth is compiled using the same corpus described in Section 4.5, except that only acknowledgement entities (as opposed to all named entities) are labeled positive. The results (Table 3) show that Stanza achieves high recall but poor precision, indicating that a significant fraction (~40%) of named entities are not acknowledged. In contrast, our classifier (RB) achieves a precision of 0.94 with a small loss of recall, achieving F1 = 0.92.
One limitation of the RB classifier is that it relies on text quality. Sentences are expected to follow
Figure 4: Sentence examples from which REVERB or OLLIE failed to extract acknowledgement entities (underlined), which can be identified by our method: (1) "To Aulio Costa Zambenedetti for the art work; Nilson Fidêncio, Silvio Marques, Tania Schepainski and Sibelli Tanjoni de Souza for technical assistance." (2) "The authors also thank the Program for Technological Development in Tools for Health-PDTIS, FIOCRUZ for the use of its facilities (Platform RPT09H - Real Time PCR - Instituto Carlos Chagas/Fiocruz-PR)." (3) "The authors wish to thank Ms. Qi Yang, DNA sequencing lab, Morehouse School of Medicine, Atlanta, GA and the Research Cores at Morehouse School of Medicine, supported through the Research Centers at Minority Institutions (RCMI) Program NIH/NCRR/RCMI Grant G12-RR03034."
Figure 5: Merging the results given by step 2 and step 3 to obtain the final result. For "The authors would like to thank Norbert Mencke, Lourdes Mottier and David McGahie and the kind support of Bayer Animal Health GmbH and Virbac Group", step 2 yields [Norbert Mencke, David McGahie] and [Bayer Animal Health, Virbac Group], step 3 yields [Norbert Mencke, Lourdes Mottier, David McGahie], and the merged final set is [Norbert Mencke, David McGahie, Bayer Animal Health, Virbac Group, Lourdes Mottier].
the editorial convention such that entity names are clearly segmented by periods. If two sentences are not properly delimited, e.g., a period is missing, the classifier may make incorrect predictions.
5 Data Analysis
The CORD-19 dataset contains PMC and PDF folders. We found that almost all papers in the PMC folders are included in the PDF folders; for consistency, we only work on papers in the PDF folders, containing 37,612 unique PDF papers.7 Using ACKEXTRACT, we extracted a total of 102,965 named entities from 21,469 CORD-19 papers. The fraction of papers with named entities (21,469/37,612 ≈ 57%) is roughly consistent with the fraction obtained from the 200 samples (120/200 ≈ 60%, Section 4.2). Among all named entities, 58,742 are acknowledgement entities. These numbers suggest that, using our model, only about 50% of named entities are acknowledged. Using only named entities, acknowledgement studies could significantly overestimate the
7 We excluded 2,003 duplicate papers with exactly the same SHA-1 values.
Algorithm 1: Relation-Based Classifier

 1  Function find_entity(subsentence):
 2      pre-process with punctuation and format
 3      find candidate entities entity_list by Stanza
 4      for x in entity_list do
 5          if x in subject part (differs based on predicate) then
 6              x = none
 7      entity_list.remove(none)
 8      return entity_list

 9  Input: sentence
10  Output: entity_list_all
11  find subject part, predicate, and object part
12  if predicate value in reg_expression_1 then
13      if "and" in sentence then
14          split into subsentences by "and"
15          for each subsentence do
16              entity_list ← find_entity(subsentence)
17              if entity_list ≠ none then
18                  entity_list_all ← append entity_list[0]
19      else if "and" not in sentence then
20          entity_list ← find_entity(sentence)
21          entity_list_all ← entity_list[0]
22  else if predicate value in reg_expression_2 then
23      do the same as in the if part, but focus on entities in the subject
24  for x in candidate_list by Stanza do
25      if x in sentence and x.type = type then
26          orig_list ← append x
27          sent ← replace x by index(x)
28  if reg_format in sent then
29      temp ← find reg_expression in sent
30      numlist ← find numbers in temp
31      list ← index orig_list by numlist
32  combine list and entity_list_all
number of acknowledgement entities by relying on entities recognized by NER software packages without further classification. Here, we analyze our results and study some trends of acknowledgement entities in the CORD-19 dataset.
The top 10 acknowledged organizations (Table 4) are all funding agencies. Overall, NIH (NIAID is an institute of NIH) is acknowledged the most, but funding agencies in other countries are also frequently acknowledged in CORD-19 papers.
Figure 6 shows the numbers of CORD-19 papers with vs. without acknowledgement (person or organization) recognized, from 1970 to 2020. The huge leap around 2002 was due to the wave of coronavirus studies during the SARS period, and the small drop in 2020 was due to data incompleteness. The plot indicates that the fraction of
17
Organization Count
National Institutes of Health or NIH 1414National Natural Science Foundation of China or NSFC 615National Institute of Allergy amp Infectious Diseases or NIAID 209National Science Foundation or NSF 183Ministry of Health 160Wellcome Trust 146Deutsche Forschungsgemeinschaft 119Ministry of Education 119National Science Council 104Public Health Service 85
Table 4 Top 10 acknowledged organizations identifiedin our CORD-19 dataset
0
500
1000
1500
2000
2500
3000
3500
4000
1970
1972
1974
1976
1978
1980
1982
1984
1986
1988
1990
1992
1994
1996
1998
2000
2002
2004
2006
2008
2010
2012
2014
2016
2018
2020
With Acknowledgement Without Acknowledgement
Figure 6 Numbers of CORD-19 papers with vs with-out acknowledgement recognized from 1970 to 2020
papers with acknowledgement has been increasinggradually over the last 20 years Figure 7 showsthe numbers of the top 10 acknowledged organiza-tions from 1983 to 2020 The figure indicates thatthe number of acknowledgements to NIH has beengradually decreasing over the past 10 years whilethe acknowledgements to NSF is roughly constantIn contrast the number has been gradually increas-ing from NSFC (a Chinese funding agency) Notethat the distribution of acknowledged organizationshas a long tail and organizations behind the top 10actually dominate the total number However thetop 10 organizations are the biggest research agen-cies and the trend to some extent reflects strategicshifts of funding support
6 Conclusion and Future Work
Here we extended the work of Khabsa et al (2012)and built an acknowledgement extraction frame-work denoted as ACKEXTRACT for research arti-cles ACKEXTRACT is based on heuristic meth-ods and state-of-the-art text mining libraries (egGROBID and Stanza) but features a classifierthat discriminates acknowledgement entities fromnamed entities by analyzing the multiple subject-predicate-object relations in a sentence Our ap-
0
50
100
150
200
250
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
Public Health Service National Science CouncilMinistry of Education Deutsche ForschungsgemeinschaftWellcome Trust Ministry of HealthNational Institute of Allergy amp Infectious DiseasesNIAID National Science FoundationNSFNational Natural Science Foundation of ChinaNSFC National Institutes of HealthNIH
Figure 7 Trend of acknowledged organizations of top10 organizations from 1983 to 2020
proach successfully recognizes acknowledgemententities that cannot be recognized by OIE packagessuch as REVERB OLLIE and the AllenNLP SRLlibrary
This method is applied to the CORD-19 datasetreleased on April 10 2020 processing one PDFdocument in 5 seconds on average Our resultsindicate that only 50ndash60 named entities are ac-knowledged The rest are mentioned to provideadditional information (eg affiliation or location)about acknowledgement entities Working on cleandata our method achieves an overall F1 = 092for person and organization entities The trendanalysis of the CORD-19 papers verifies that moreand more papers include acknowledgement enti-ties since 2002 when the SARS outbreak hap-pened The trend also reveals that the overallnumber of acknowledgements to NIH is gradu-ally decreasing over the past 10 years while morepapers acknowledge NSFC a Chinese fundingagency One caveat of our method is that organi-zations in different countries are not distinguishedFor example many countries have agencies calledldquoMinistry of Healthrdquo In the future we plan tobuild learning-based models for sentence classifi-cation and entity classification The code and dataof this project have been released on GitHub athttpsgithubcomlamps-labackextract
Acknowledgments
This work was partially supported by the DefenseAdvanced Research Projects Agency (DARPA)under cooperative agreement No W911NF-19-2-0272 The content of the information does notnecessarily reflect the position or the policy of theGovernment and no official endorsement shouldbe inferred
18
ReferencesAlan Akbik Duncan Blythe and Roland Vollgraf
2018 Contextual string embeddings for sequencelabeling In Proceedings of the 27th InternationalConference on Computational Linguistics pages1638ndash1649 Santa Fe New Mexico USA Associ-ation for Computational Linguistics
James Andreoni and Laura K Gee 2015 Gunning forefficiency with third party enforcement in thresholdpublic goods Experimental Economics 18(1)154ndash171
Hannah Bast and Claudius Korzen 2017 A bench-mark and evaluation for text extraction from PDFIn 2017 ACMIEEE Joint Conference on Digital Li-braries JCDL 2017 Toronto ON Canada June 19-23 2017 pages 99ndash108 IEEE Computer Society
Steven Bird 2006 NLTK the natural language toolkitIn ACL 2006 21st International Conference on Com-putational Linguistics and 44th Annual Meeting ofthe Association for Computational Linguistics Pro-ceedings of the Conference Sydney Australia 17-21July 2006 The Association for Computer Linguis-tics
Cronin Blaise 2001 Acknowledgement trendsin the research literature of information science57(3)427ndash433
Isaac G Councill C Lee Giles Hui Han and ErenManavoglu 2005 Automatic acknowledgement in-dexing Expanding the semantics of contribution inthe citeseer digital library In Proceedings of the 3rdInternational Conference on Knowledge Capture K-CAP rsquo05 pages 19ndash26 New York NY USA ACM
Blaise Cronin Gail McKenzie Lourdes Rubio andSherrill Weaver-Wozniak 1993 Accounting for in-fluence Acknowledgments in contemporary sociol-ogy Journal of the American Society for Informa-tion Science 44(7)406ndash412
Blaise Cronin Gail McKenzie and Michael Stiffler1992 Patterns of acknowledgement Journal ofDocumentation
S Dai Y Ding Z Zhang W Zuo X Huang andS Zhu 2019 Grantextractor Accurate grant sup-port information extraction from biomedical fulltextbased on bi-lstm-crf IEEEACM Transactions onComputational Biology and Bioinformatics pages1ndash1
Anthony Fader Stephen Soderland and Oren Etzioni2011 Identifying relations for open information ex-traction In Proceedings of the 2011 Conference onEmpirical Methods in Natural Language ProcessingEMNLP 2011 27-31 July 2011 John McIntyre Con-ference Centre Edinburgh UK A meeting of SIG-DAT a Special Interest Group of the ACL pages1535ndash1545
C Lee Giles Kurt D Bollacker and Steve Lawrence1998 CiteSeer An automatic citation indexing sys-tem In Proceedings of the 3rd ACM InternationalConference on Digital Libraries June 23-26 1998Pittsburgh PA USA pages 89ndash98
C Lee Giles and Isaac G Councill 2004 Whogets acknowledged Measuring scientific contribu-tions through automatic acknowledgment indexingProceedings of the National Academy of Sciences101(51)17599ndash17604
Matthew Honnibal and Ines Montani 2017 spaCy 2Natural language understanding with Bloom embed-dings convolutional neural networks and incremen-tal parsing To appear
Subhradeep Kayal Zubair Afzal George TsatsaronisMarius Doornenbal Sophia Katrenko and MichelleGregory 2019 A framework to automatically ex-tract funding information from text In MachineLearning Optimization and Data Science pages317ndash328 Cham Springer International Publishing
Madian Khabsa Pucktada Treeratpituk and C LeeGiles 2012 Ackseer a repository and search en-gine for automatically extracted acknowledgmentsfrom digital libraries In Proceedings of the12th ACMIEEE-CS Joint Conference on Digital Li-braries JCDL rsquo12 Washington DC USA June 10-14 2012 pages 185ndash194 ACM
Mario Lipinski Kevin Yao Corinna Breitinger Jo-eran Beel and Bela Gipp 2013 Evaluation ofheader metadata extraction approaches and tools forscientific pdf documents In Proceedings of the13th ACMIEEE-CS Joint Conference on Digital Li-braries JCDL rsquo13 pages 385ndash386 New York NYUSA ACM
Kyle Lo Lucy Lu Wang Mark Neumann Rodney Kin-ney and Dan S Weld 2019 S2orc The semanticscholar open research corpus
Patrice Lopez 2009 Grobid Combining auto-matic bibliographic data recognition and term ex-traction for scholarship publications In Proceed-ings of the 13th European Conference on Re-search and Advanced Technology for Digital Li-braries ECDLrsquo09 pages 473ndash474 Berlin Heidel-berg Springer-Verlag
Christopher D Manning Mihai Surdeanu John BauerJenny Finkel Steven J Bethard and David Mc-Closky 2014 The Stanford CoreNLP natural lan-guage processing toolkit In Association for Compu-tational Linguistics (ACL) System Demonstrationspages 55ndash60
Mausam Michael Schmitz Robert Bart StephenSoderland and Oren Etzioni 2012 Open languagelearning for information extraction In Proceedingsof the 2012 Joint Conference on Empirical Methodsin Natural Language Processing and ComputationalNatural Language Learning EMNLP-CoNLL rsquo12
19
pages 523ndash534 Stroudsburg PA USA Associationfor Computational Linguistics
Adele Paul-Hus Adrian A Dıaz-Faes Maxime Sainte-Marie Nadine Desrochers Rodrigo Costas and Vin-cent Lariviere 2017 Beyond funding Acknowl-edgement patterns in biomedical natural and socialsciences PLOS ONE 12(10)1ndash14
Adele Paul-Hus Philippe Mongeon Maxime Sainte-Marie and Vincent Lariviere 2020 Who are theacknowledgees an analysis of gender and academicstatus Quantitative Science Studies 0(0)1ndash17
Peng Qi Yuhao Zhang Yuhui Zhang Jason Boltonand Christopher D Manning 2020 Stanza Apython natural language processing toolkit for manyhuman languages CoRR abs200307082
Radim Rehurek and Petr Sojka 2010 SoftwareFramework for Topic Modelling with Large Cor-pora In Proceedings of the LREC 2010 Workshopon New Challenges for NLP Frameworks pages 45ndash50 Valletta Malta ELRA httpismuniczpublication884893en
Peng Shi and Jimmy Lin 2019 Simple BERT mod-els for relation extraction and semantic role labelingCoRR abs190405255
Mayank Singh Barnopriyo Barua Priyank PalodManvi Garg Sidhartha Satapathy Samuel BushiKumar Ayush Krishna Sai Rohith Tulasi GamidiPawan Goyal and Animesh Mukherjee 2016OCR++ A robust framework for information extrac-tion from scholarly articles In COLING 2016 26thInternational Conference on Computational Linguis-tics Proceedings of the Conference Technical Pa-pers December 11-16 2016 Osaka Japan pages3390ndash3400 ACL
E Smith B Williams-Jones Z Master V LariviereC R Sugimoto A Paul-Hus M Shi and D BResnik 2019 Misconduct and misbehavior relatedto authorship disagreements in collaborative scienceSci Eng Ethics
Min Song Keun Young Kang Tatsawan Timakum andXinyuan Zhang 2020 Examining influential fac-tors for acknowledgements classification using su-pervised learning PLOS ONE 15(2)1ndash21
H Stewart K Brown A M Dinan N Irigoyen E JSnijder and A E Firth 2018 Transcriptional andtranslational landscape of equine torovirus J Virol92(17)
Lucy Lu Wang Kyle Lo Yoganand ChandrasekharRussell Reas Jiangjiang Yang Darrin Eide KathrynFunk Rodney Kinney Ziyang Liu William Mer-rill Paul Mooney Dewey Murdick Devvret RishiJerry Sheehan Zhihong Shen Brandon StilsonAlex D Wade Kuansan Wang Chris Wilhelm BoyaXie Douglas Raymond Daniel S Weld Oren Et-zioni and Sebastian Kohlmeier 2020 CORD-19 the covid-19 open research dataset CoRRabs200410706
Jian Wu Jason Killian Huaiyu Yang Kyle WilliamsSagnik Ray Choudhury Suppawong Tuarob Cor-nelia Caragea and C Lee Giles 2015 PdfmefA multi-entity knowledge extraction framework forscholarly documents and semantic search In Pro-ceedings of the 8th International Conference onKnowledge Capture K-CAP 2015 pages 131ndash138New York NY USA ACM
Method      Precision  Recall  F1
Gensim      0.65       0.64    0.65
NLTK        0.72       0.69    0.70
Pragmatic   0.86       0.76    0.81
Stanza      0.81       0.88    0.84

Table 1: Sentence segmentation performance.
by dividing the number of correctly segmented sentences by the total number of sentences segmented. The recall is calculated by dividing the number of correctly segmented sentences by the total number of manually segmented sentences. Stanza outperforms the other three, achieving an F1 = 0.84.
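The evaluation arithmetic described above can be sketched in a few lines; the counts in the usage line are hypothetical, not the paper's actual tallies.

```python
def seg_scores(correct: int, total_segmented: int, total_gold: int):
    """Precision, recall, and F1 for sentence segmentation.

    correct         -- sentences segmented exactly as in the gold data
    total_segmented -- sentences produced by the segmenter
    total_gold      -- manually segmented (gold) sentences
    """
    precision = correct / total_segmented
    recall = correct / total_gold
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 84 correctly segmented out of 100 produced, 100 gold.
p, r, f1 = seg_scores(84, 100, 100)  # all three scores equal 0.84 here
```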
4.4 Sentence Classification
Not all sentences in acknowledgement sections express acknowledgement, such as the following sentence: "The funders played no role in the study or preparation of the manuscript" (Song et al., 2020). In this module, we classify sentences into acknowledgement and non-acknowledgement statements. We developed a set of regular expressions that match verbs (e.g., thank, gratitude to, indebted to), adjectives (e.g., grateful to), and nouns (e.g., helpful comments, useful feedback) to cover as many cases as possible. To evaluate this method, we manually selected 100 sentences, 50 positive and 50 negative, from the sentences obtained in Section 4.3. Our results show that 96 out of 100 sentences were classified correctly, resulting in an accuracy of 0.96.
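A minimal sketch of such a regex-based sentence classifier; the cue list below is a small hypothetical subset of the patterns the module would need.

```python
import re

# Hypothetical cue patterns; the actual rule set covers many more verbs,
# adjectives, and noun phrases.
ACK_CUES = re.compile(
    r"\b(thank(s|ed)?|gratitude to|indebted to|grateful to|"
    r"helpful comments|useful feedback|supported by|funded by)\b",
    re.IGNORECASE,
)

def is_acknowledgement(sentence: str) -> bool:
    """Classify a sentence as an acknowledgement statement."""
    return ACK_CUES.search(sentence) is not None
```

For example, `is_acknowledgement("We thank Jane Doe for helpful comments.")` returns True (the name is a made-up placeholder), while the funder-role sentence above contains no cue and is rejected.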
4.5 Entity Recognition
In this step, named entities are extracted using state-of-the-art NER software packages, including NLTK (Bird, 2006), Stanford CoreNLP (Manning et al., 2014), spaCy (Honnibal and Montani, 2017), and Stanza (Qi et al., 2020). Stanza is a Python library offering fully neural pre-trained models that provided state-of-the-art performance on many raw text processing tasks when it was released. The NER model adopts the contextualized sequence tagger in Akbik et al. (2018). The architecture includes a Bi-LSTM character-level language model followed by a one-layer Bi-LSTM sequence tagger with a conditional random field (CRF) encoder. Although Stanza was developed based on Stanford CoreNLP, they exhibit different performances on our NER task.
The ground truth is built by randomly selecting
Method            Entity  Precision  Recall  F1
NLTK              Person  0.45       0.68    0.55
                  Org     0.59       0.77    0.67
spaCy             Person  0.74       0.88    0.80
                  Org     0.63       0.74    0.68
Stanford CoreNLP  Person  0.88       0.87    0.87
                  Org     0.68       0.80    0.73
Stanza            Person  0.89       0.93    0.91
                  Org     0.60       0.89    0.72

Table 2: Comparison of NER software packages.
100 acknowledgement paragraphs from sections, footnotes, and body text, and manually annotating person and organization entities without discriminating whether they are acknowledged or not. This results in 146 person and 209 organization entities. The comparison indicates that overall Stanza outperforms the other three, achieving F1 = 0.91 for person and F1 = 0.72 for organization. In particular, the recall of Stanza is 9% higher than that of Stanford CoreNLP (Table 2).
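The per-type scores in Table 2 amount to standard set-style precision/recall over annotated entities; a sketch of that computation follows (the gold/predicted lists are toy examples, not the actual annotations).

```python
from collections import Counter

def ner_prf(gold, pred):
    """P/R/F1 over (entity_text, entity_type) tuples, multiset-style."""
    gold_c, pred_c = Counter(gold), Counter(pred)
    tp = sum((gold_c & pred_c).values())  # true positives: matched tuples
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy example: one spurious ORG prediction lowers precision but not recall.
gold = [("Anne Broadwater", "PERSON"), ("NIH", "ORG")]
pred = [("Anne Broadwater", "PERSON"), ("NIH", "ORG"), ("Atlanta", "ORG")]
p, r, f1 = ner_prf(gold, pred)  # r == 1.0, p == 2/3
```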
4.6 Entity Classification
As we showed, not all named entities are acknowledged, such as the "Utrecht University" in Figure 1. Therefore, it is necessary to build a classifier to discriminate acknowledgement entities, i.e., entities that are thanked by the paper or the authors, from named entities.
The majority of acknowledgement statements in academic articles have a relatively standard subject-predicate-object (SPO) structure. They use a limited number of words or phrases, such as "thank", "acknowledge", "are grateful", "is supported", and "is funded", as predicates. However, the object can contain multiple named entities, some of which are used as attributes of the others. In rare cases, certain sentences may not have subjects and predicates, such as the first sentence in Figure 4.
Our approach can be divided into three steps. Two representative examples are illustrated in Figure 3, and the pseudo-code is shown in Algorithm 1.

Step 1: We resolve the type of voice (active or passive), the subject, and the predicate of a sentence using dependency parsing by Stanza, because named entities can appear in either the subjective or the objective part. We then locate all named entities. The semantic meaning of a predicate and its type of voice determine whether the acknowledged entities are in the objective or the subjective part. In most cases, the target entities are objects.
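A highly simplified sketch of this Step 1 decision; the two regular expressions below are hypothetical stand-ins for the paper's reg_expression_1 and reg_expression_2.

```python
import re

# Hypothetical stand-ins for the predicate patterns in Algorithm 1.
REG_EXPRESSION_1 = re.compile(r"\b(thank|acknowledge|are grateful)\b")      # active gratitude
REG_EXPRESSION_2 = re.compile(r"\b(is|was|were|are) (supported|funded)\b")  # passive support

def target_part(predicate_phrase: str) -> str:
    """Decide which part of the SPO relation holds the acknowledged entities."""
    if REG_EXPRESSION_1.search(predicate_phrase):
        return "object"   # e.g., "We thank X ..."
    if REG_EXPRESSION_2.search(predicate_phrase):
        return "subject"  # Algorithm 1's second branch focuses on the subject
    return "object"       # per the paper, target entities are objects in most cases
```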
Step 2: A sentence with multiple entities in the objective part is split into shorter sentences called "subsentences", so that each subsentence is associated with at most one named entity. This is done by first splitting the sentence by "and". For each subsentence, if the subject and predicate are missing, we fill them in using the subject and predicate of the original sentence. The object in each subsentence does not necessarily contain an entity. For example, in the right panel of Figure 3, because "expertise" is not a named entity, it is replaced by "none" in the subsentence. The SPO relations that do not contain named entities are removed.
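A toy sketch of this splitting step (string-based here for brevity; the paper recovers the missing subject and predicate from dependency parses rather than string prefixes):

```python
def split_subsentences(sentence: str, subject: str, predicate: str):
    """Step 2 sketch: split an acknowledgement sentence on ' and ' and
    re-attach the original subject/predicate to fragments that lack them."""
    parts = [p.strip() for p in sentence.split(" and ")]
    subsentences = []
    for part in parts:
        # A fragment that does not start with the subject inherits the
        # original subject and predicate.
        if not part.lower().startswith(subject.lower()):
            part = f"{subject} {predicate} {part}"
        subsentences.append(part)
    return subsentences

subs = split_subsentences(
    "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab",
    "We", "thank")
# subs == ["We thank Anne Broadwater",
#          "We thank Nicolas Olivarez of the de Silva lab"]
```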
There are two scenarios. In the first scenario, a sentence or a subsentence may contain a list of entities with only the first being acknowledged. In the third example of Figure 4, the named entities are [Qi Yang, Morehouse School of Medicine, Atlanta, GA], but only the first entity is acknowledged. The remaining entities, which are recognized as organizations or locations, supplement additional information. In this scenario, only the first named entity is extracted.
Step 3: In the second scenario, acknowledged entities are connected by commas or "and", such as in "We thank Shirley Hauta, Yurij Popowych, Elaine van Moorlehem and Yan Zhou for help in the virus production". Entities in this structure have the same type, indicating that they play similar roles. This parallel pattern can be captured by regular expressions. The entities resolved in Steps 2 and 3 are merged to form the final set.
The method to find the parallel structure is as follows. First, check whether each entity's type is person. If so, the entities are substituted with integer indexes; the sentence becomes "The authors would like to thank 0, 1 and 2 and the kind support of Bayer Animal Health GmbH and Virbac Group". If there are 3 or more consecutive numbers in this form, this is the parallel pattern, which is captured by regular expressions. The pattern also allows text between names (Figure 5). Next, the numbers in this part are extracted and mapped to the corresponding entities. In the example above, the numbers [0, 1, 2] correspond to the indexes of the entities [Norbert Mencke, Lourdes Mottier, David McGahie]. Similar pattern recognition is performed for organizations. The process is depicted in Figure 5.
Figure 3: Process mapping multiple subject-predicate-object relations to acknowledgement entities, with two representative examples: "We would also like to thank the Canadian Wildlife Health Cooperative at the University of Saskatchewan for their support and expertise" and "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab for technical assistance with the assays". "None" means the object does not contain named entities.
Method               Precision  Recall  F1
Stanza               0.57       0.91    0.70
RB (without step 3)  0.94       0.78    0.85
RB                   0.94       0.90    0.92

Table 3: Performance of Stanza and our relation-based (RB) classifier. Precision and recall show overall results, combining person and organization.
We investigated open information extraction methods such as REVERB (Fader et al., 2011) and OLLIE (Mausam et al., 2012). We found that they only work for sentences with relatively simple structures, such as "The authors wish to thank Ming-Wei Guo for reagents and technical assistance", but fail with more complicated sentences with a long objective part or parallel subsentences (Figure 4). We also investigated the semantic role labeling (SRL) library in the AllenNLP toolkit (Shi and Lin, 2019). For the third sentence in Figure 4, the AllenNLP SRL library resolves the entire string after "thank" as one argument but fails to distinguish entity types. Our method can handle all statements in Figure 4.
As a baseline, we use Stanza to extract all named entities and compare its performance with our relation-based (RB) classifier. The ground truth is compiled using the same corpus described in Section 4.5, except that only acknowledgement entities (as opposed to all named entities) are labeled positive. The results (Table 3) show that Stanza achieves high recall but poor precision, indicating that a significant fraction (~40%) of named entities are not acknowledged. In contrast, our classifier (RB) achieves a precision of 0.94 with a small loss of recall, achieving an F1 = 0.92.
One limitation of the RB classifier is that it relies on text quality. Sentences are expected to follow
(1) "To Aulio Costa Zambenedetti for the art work; Nilson Fidêncio, Silvio Marques, Tania Schepainski and Sibelli Tanjoni de Souza for technical assistance."
(2) "The authors also thank the Program for Technological Development in Tools for Health-PDTIS FIOCRUZ for the use of its facilities (Platform RPT09H - Real Time PCR - Instituto Carlos Chagas/Fiocruz-PR)."
(3) "The authors wish to thank Ms. Qi Yang, DNA sequencing lab, Morehouse School of Medicine, Atlanta, GA, and the Research Cores at Morehouse School of Medicine supported through the Research Centers at Minority Institutions (RCMI) Program NIH/NCRR/RCMI Grant G12-RR03034."

Figure 4: Sentence examples from which REVERB or OLLIE failed to extract acknowledgement entities (underlined in the original figure), which can be identified by our method.
Figure 5: Merging the results given by step 2 and step 3 to obtain the final results: for "The authors would like to thank Norbert Mencke, Lourdes Mottier and David McGahie and the kind support of Bayer Animal Health GmbH and Virbac Group", the merged set is [Norbert Mencke, David McGahie, Bayer Animal Health, Virbac Group, Lourdes Mottier].
the editorial convention such that entity names are clearly segmented by periods. If two sentences are not properly delimited, e.g., missing a period, the classifier may make incorrect predictions.
5 Data Analysis
The CORD-19 dataset contains PMC and PDF folders. We found that almost all papers in the PMC folders are included in the PDF folders. For consistency, we only work on papers in the PDF folders, comprising 37,612 unique PDF papers.7 Using ACKEXTRACT, we extracted a total of 102,965 named entities from 21,469 CORD-19 papers. The fraction of papers with named entities (21,469/37,612 ≈ 57%) is roughly consistent with the fraction obtained from the 200 samples (120/200 = 60%) in Section 4.2. Among all named entities, 58,742 are acknowledgement entities. These numbers suggest that, using our model, only about 50% of named entities are acknowledged. Relying on entities recognized by NER software packages without further classification, acknowledgement studies could thus significantly overestimate the number of acknowledgement entities. Here we analyze our results and study some trends of acknowledgement entities in the CORD-19 dataset.

7 We excluded 2,003 duplicate papers with exactly the same SHA-1 values.

Algorithm 1: Relation-Based Classifier

Function find_entity(subsentence):
    pre-process punctuation and format
    find candidate entities entity_list by Stanza
    for x in entity_list:
        if x is in the subject part (differs based on predicate):
            x = none
    remove none values from entity_list
    return entity_list

Input: sentence
Output: entity_list_all
find the subject part, the predicate, and the object part
if predicate value matches reg_expression_1:
    if "and" in sentence:
        split into subsentences by "and"
        for each subsentence:
            entity_list <- find_entity(subsentence)
            if entity_list != none:
                entity_list_all <- append entity_list[0]
    else:
        entity_list <- find_entity(sentence)
        entity_list_all <- entity_list[0]
else if predicate value matches reg_expression_2:
    do the same as in the if-branch, but focus on entities in the subject
for x in candidate_list by Stanza:
    if x in sentence and x.type == type:
        orig_list <- append x
        sent <- replace x by index(x)
if reg_format in sent:
    temp <- find reg_expression in sent
    numlist <- find numbers in temp
    list <- index orig_list by numlist
combine list and entity_list_all
The top 10 acknowledged organizations (Table 4) are all funding agencies. Overall, NIH (NIAID is an institute of NIH) is acknowledged the most, but funding agencies in other countries are also frequently acknowledged in CORD-19 papers.

Organization                                                 Count
National Institutes of Health (NIH)                           1414
National Natural Science Foundation of China (NSFC)            615
National Institute of Allergy & Infectious Diseases (NIAID)    209
National Science Foundation (NSF)                              183
Ministry of Health                                             160
Wellcome Trust                                                 146
Deutsche Forschungsgemeinschaft                                119
Ministry of Education                                          119
National Science Council                                       104
Public Health Service                                           85

Table 4: Top 10 acknowledged organizations identified in our CORD-19 dataset.

Figure 6 shows the numbers of CORD-19 papers with vs. without acknowledgement (person or organization) recognized from 1970 to 2020. The huge leap around 2002 was due to the wave of coronavirus studies during the SARS period; the small drop in 2020 is due to data incompleteness. The plot indicates that the fraction of papers with acknowledgement has been increasing gradually over the last 20 years.

Figure 6: Numbers of CORD-19 papers with vs. without acknowledgement recognized from 1970 to 2020.

Figure 7 shows the numbers of the top 10 acknowledged organizations from 1983 to 2020. The figure indicates that the number of acknowledgements to NIH has been gradually decreasing over the past 10 years, while the number to NSF is roughly constant. In contrast, acknowledgements to NSFC (a Chinese funding agency) have been gradually increasing. Note that the distribution of acknowledged organizations has a long tail, and organizations outside the top 10 actually dominate the total count. However, the top 10 organizations are the biggest research agencies, and the trend to some extent reflects strategic shifts of funding support.
6 Conclusion and Future Work
Here we extended the work of Khabsa et al. (2012) and built an acknowledgement extraction framework, denoted ACKEXTRACT, for research articles. ACKEXTRACT is based on heuristic methods and state-of-the-art text mining libraries (e.g., GROBID and Stanza), but features a classifier that discriminates acknowledgement entities from named entities by analyzing the multiple subject-predicate-object relations in a sentence. Our approach successfully recognizes acknowledgement entities that cannot be recognized by OIE packages such as REVERB, OLLIE, and the AllenNLP SRL library.

Figure 7: Trend of the top 10 acknowledged organizations from 1983 to 2020.
This method is applied to the CORD-19 dataset released on April 10, 2020, processing one PDF document in 5 seconds on average. Our results indicate that only 50–60% of named entities are acknowledged; the rest are mentioned to provide additional information (e.g., affiliation or location) about acknowledgement entities. Working on clean data, our method achieves an overall F1 = 0.92 for person and organization entities. The trend analysis of the CORD-19 papers verifies that more and more papers include acknowledgement entities since 2002, when the SARS outbreak happened. The trend also reveals that the overall number of acknowledgements to NIH has gradually decreased over the past 10 years, while more papers acknowledge NSFC, a Chinese funding agency. One caveat of our method is that organizations in different countries are not distinguished; for example, many countries have agencies called "Ministry of Health". In the future, we plan to build learning-based models for sentence classification and entity classification. The code and data of this project have been released on GitHub at https://github.com/lamps-lab/ackextract.
Acknowledgments
This work was partially supported by the Defense Advanced Research Projects Agency (DARPA) under cooperative agreement No. W911NF-19-2-0272. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

James Andreoni and Laura K. Gee. 2015. Gunning for efficiency with third party enforcement in threshold public goods. Experimental Economics, 18(1):154–171.

Hannah Bast and Claudius Korzen. 2017. A benchmark and evaluation for text extraction from PDF. In 2017 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19-23, 2017, pages 99–108. IEEE Computer Society.

Steven Bird. 2006. NLTK: the natural language toolkit. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006. The Association for Computer Linguistics.

Blaise Cronin. 2001. Acknowledgement trends in the research literature of information science. Journal of Documentation, 57(3):427–433.

Isaac G. Councill, C. Lee Giles, Hui Han, and Eren Manavoglu. 2005. Automatic acknowledgement indexing: Expanding the semantics of contribution in the CiteSeer digital library. In Proceedings of the 3rd International Conference on Knowledge Capture, K-CAP '05, pages 19–26, New York, NY, USA. ACM.

Blaise Cronin, Gail McKenzie, Lourdes Rubio, and Sherrill Weaver-Wozniak. 1993. Accounting for influence: Acknowledgments in contemporary sociology. Journal of the American Society for Information Science, 44(7):406–412.

Blaise Cronin, Gail McKenzie, and Michael Stiffler. 1992. Patterns of acknowledgement. Journal of Documentation.

S. Dai, Y. Ding, Z. Zhang, W. Zuo, X. Huang, and S. Zhu. 2019. GrantExtractor: Accurate grant support information extraction from biomedical fulltext based on Bi-LSTM-CRF. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pages 1–1.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, pages 1535–1545.

C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, USA, pages 89–98.

C. Lee Giles and Isaac G. Councill. 2004. Who gets acknowledged? Measuring scientific contributions through automatic acknowledgment indexing. Proceedings of the National Academy of Sciences, 101(51):17599–17604.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Subhradeep Kayal, Zubair Afzal, George Tsatsaronis, Marius Doornenbal, Sophia Katrenko, and Michelle Gregory. 2019. A framework to automatically extract funding information from text. In Machine Learning, Optimization, and Data Science, pages 317–328, Cham. Springer International Publishing.

Madian Khabsa, Pucktada Treeratpituk, and C. Lee Giles. 2012. AckSeer: a repository and search engine for automatically extracted acknowledgments from digital libraries. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '12, Washington, DC, USA, June 10-14, 2012, pages 185–194. ACM.

Mario Lipinski, Kevin Yao, Corinna Breitinger, Joeran Beel, and Bela Gipp. 2013. Evaluation of header metadata extraction approaches and tools for scientific PDF documents. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, pages 385–386, New York, NY, USA. ACM.

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S. Weld. 2019. S2ORC: The Semantic Scholar open research corpus.

Patrice Lopez. 2009. GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries, ECDL'09, pages 473–474, Berlin, Heidelberg. Springer-Verlag.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 523–534, Stroudsburg, PA, USA. Association for Computational Linguistics.

Adèle Paul-Hus, Adrián A. Díaz-Faes, Maxime Sainte-Marie, Nadine Desrochers, Rodrigo Costas, and Vincent Larivière. 2017. Beyond funding: Acknowledgement patterns in biomedical, natural and social sciences. PLOS ONE, 12(10):1–14.

Adèle Paul-Hus, Philippe Mongeon, Maxime Sainte-Marie, and Vincent Larivière. 2020. Who are the acknowledgees? An analysis of gender and academic status. Quantitative Science Studies, 0(0):1–17.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. CoRR, abs/2003.07082.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA.

Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. CoRR, abs/1904.05255.

Mayank Singh, Barnopriyo Barua, Priyank Palod, Manvi Garg, Sidhartha Satapathy, Samuel Bushi, Kumar Ayush, Krishna Sai Rohith, Tulasi Gamidi, Pawan Goyal, and Animesh Mukherjee. 2016. OCR++: A robust framework for information extraction from scholarly articles. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 3390–3400. ACL.

E. Smith, B. Williams-Jones, Z. Master, V. Larivière, C. R. Sugimoto, A. Paul-Hus, M. Shi, and D. B. Resnik. 2019. Misconduct and misbehavior related to authorship disagreements in collaborative science. Sci Eng Ethics.

Min Song, Keun Young Kang, Tatsawan Timakum, and Xinyuan Zhang. 2020. Examining influential factors for acknowledgements classification using supervised learning. PLOS ONE, 15(2):1–21.

H. Stewart, K. Brown, A. M. Dinan, N. Irigoyen, E. J. Snijder, and A. E. Firth. 2018. Transcriptional and translational landscape of equine torovirus. J Virol, 92(17).

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The COVID-19 open research dataset. CoRR, abs/2004.10706.

Jian Wu, Jason Killian, Huaiyu Yang, Kyle Williams, Sagnik Ray Choudhury, Suppawong Tuarob, Cornelia Caragea, and C. Lee Giles. 2015. PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 131–138, New York, NY, USA. ACM.
Peng Shi and Jimmy Lin 2019 Simple BERT mod-els for relation extraction and semantic role labelingCoRR abs190405255
Mayank Singh Barnopriyo Barua Priyank PalodManvi Garg Sidhartha Satapathy Samuel BushiKumar Ayush Krishna Sai Rohith Tulasi GamidiPawan Goyal and Animesh Mukherjee 2016OCR++ A robust framework for information extrac-tion from scholarly articles In COLING 2016 26thInternational Conference on Computational Linguis-tics Proceedings of the Conference Technical Pa-pers December 11-16 2016 Osaka Japan pages3390ndash3400 ACL
E Smith B Williams-Jones Z Master V LariviereC R Sugimoto A Paul-Hus M Shi and D BResnik 2019 Misconduct and misbehavior relatedto authorship disagreements in collaborative scienceSci Eng Ethics
Min Song Keun Young Kang Tatsawan Timakum andXinyuan Zhang 2020 Examining influential fac-tors for acknowledgements classification using su-pervised learning PLOS ONE 15(2)1ndash21
H Stewart K Brown A M Dinan N Irigoyen E JSnijder and A E Firth 2018 Transcriptional andtranslational landscape of equine torovirus J Virol92(17)
Lucy Lu Wang Kyle Lo Yoganand ChandrasekharRussell Reas Jiangjiang Yang Darrin Eide KathrynFunk Rodney Kinney Ziyang Liu William Mer-rill Paul Mooney Dewey Murdick Devvret RishiJerry Sheehan Zhihong Shen Brandon StilsonAlex D Wade Kuansan Wang Chris Wilhelm BoyaXie Douglas Raymond Daniel S Weld Oren Et-zioni and Sebastian Kohlmeier 2020 CORD-19 the covid-19 open research dataset CoRRabs200410706
Jian Wu Jason Killian Huaiyu Yang Kyle WilliamsSagnik Ray Choudhury Suppawong Tuarob Cor-nelia Caragea and C Lee Giles 2015 PdfmefA multi-entity knowledge extraction framework forscholarly documents and semantic search In Pro-ceedings of the 8th International Conference onKnowledge Capture K-CAP 2015 pages 131ndash138New York NY USA ACM
15
Step 2: A sentence with multiple entities in the objective part is split into shorter sentences called "subsentences", so that each subsentence is associated with at most one named entity. This is done by first splitting the sentence by "and". For each subsentence, if the subject and predicate are missing, we fill them in using the subject and predicate of the original sentence. The object in each subsentence does not necessarily contain an entity; for example, in the right panel of Figure 3, because "expertise" is not a named entity, it is replaced by "none" in the subsentence. The SPO relations that do not contain named entities are removed.
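A minimal sketch of this splitting step in Python. The function name and the string-based splitting are illustrative assumptions; the paper's actual implementation works on parsed subject-predicate-object relations rather than raw strings:

```python
import re

def split_subsentences(sentence, subject, predicate):
    # Hypothetical sketch of Step 2: split the sentence on "and" and
    # re-attach the original subject and predicate to fragments that
    # lack them. A real system would operate on a dependency parse.
    fragments = [f.strip(" ,") for f in re.split(r"\band\b", sentence)
                 if f.strip(" ,")]
    subsentences = []
    for frag in fragments:
        if not frag.lower().startswith(subject.lower()):
            # Fill in the missing subject and predicate from the
            # original sentence.
            frag = f"{subject} {predicate} {frag}"
        subsentences.append(frag)
    return subsentences

print(split_subsentences(
    "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab",
    subject="We", predicate="thank"))
# → ['We thank Anne Broadwater', 'We thank Nicolas Olivarez of the de Silva lab']
```

This mirrors the second example in Figure 3, where "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab..." yields two subsentences, each carrying the original subject and predicate.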
There are two scenarios. In the first, a sentence or subsentence may contain a list of entities of which only the first is acknowledged. In the third example of Figure 4, the named entities are ['Qi Yang', 'Morehouse School of Medicine', 'Atlanta', 'GA'], but only the first entity is acknowledged; the remaining entities, recognized as organizations or locations, merely supply additional information. In this scenario, only the first named entity is extracted.
Step 3: In the second scenario, acknowledged entities are connected by commas or "and", as in "We thank Shirley Hauta, Yurij Popowych, Elaine van Moorlehem and Yan Zhou for help in the virus production." Entities in this structure have the same type, indicating that they play similar roles. This parallel pattern can be captured by regular expressions. The entities resolved in Steps 2 and 3 are merged to form the final set.
The method to find the parallel structure is as follows. First, check whether each entity's type is person; if so, the entities are substituted with integer indexes. The sentence becomes "The authors would like to thank 0, 1 and 2 and the kind support of Bayer Animal Health GmbH and Virbac Group." If there are three or more consecutive numbers in this form, this is the parallel pattern, which is captured by a regular expression. The pattern also allows text between names (Figure 5). Next, the numbers in this part are extracted and mapped to the corresponding entities. In the example above, the numbers [0, 1, 2] correspond to the indexes of the entities [Norbert Mencke, Lourdes Mottier, David McGahie]. Similar pattern recognition is performed for organizations. The process is depicted in Figure 5.
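The index substitution and pattern matching described above can be sketched as follows. The function name and the exact regular expression are illustrative assumptions (the paper's fuller pattern also tolerates text between names, which this sketch keeps simple):

```python
import re

def extract_parallel_entities(sentence, candidates):
    # Hypothetical sketch of Step 3: substitute each candidate entity of
    # the same type with its integer index, then search for a run of
    # three or more indexes joined by commas and "and".
    indexed = sentence
    for i, name in enumerate(candidates):
        indexed = indexed.replace(name, str(i))
    # Parallel pattern: N, N, ..., and N
    match = re.search(r"\d+(?:\s*,\s*\d+)+\s*,?\s*and\s+\d+", indexed)
    if not match:
        return []
    nums = [int(n) for n in re.findall(r"\d+", match.group())]
    return [candidates[n] for n in nums]

persons = ["Norbert Mencke", "Lourdes Mottier", "David McGahie"]
sent = ("The authors would like to thank Norbert Mencke, Lourdes Mottier "
        "and David McGahie, and the kind support of Bayer Animal Health "
        "GmbH and Virbac Group")
print(extract_parallel_entities(sent, persons))
# → ['Norbert Mencke', 'Lourdes Mottier', 'David McGahie']
```

Requiring three or more consecutive indexes, as the paper does, keeps the pattern from firing on sentences that merely mention two coordinated names.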
Figure 3: Process mapping multiple subject-predicate-object relations to acknowledgement entities, with two representative examples ("We would also like to thank the Canadian Wildlife Health Cooperative at the University of Saskatchewan for their support and expertise" and "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab for technical assistance with the assays"). "None" means the object does not contain named entities.
Method                Precision  Recall  F1
Stanza                0.57       0.91    0.70
RB (without step 3)   0.94       0.78    0.85
RB                    0.94       0.90    0.92

Table 3: Performance of Stanza and our relation-based (RB) classifier. Precision and recall show overall results combining person and organization.
We investigated open information extraction methods such as REVERB (Fader et al., 2011) and OLLIE (Mausam et al., 2012). We found that they only work for sentences with relatively simple structures, such as "The authors wish to thank Ming-Wei Guo for reagents and technical assistance", but fail on more complicated sentences with a long objective part or parallel subsentences (Figure 4). We also investigated the semantic role labeling (SRL) library in the AllenNLP toolkit (Shi and Lin, 2019). For the third sentence in Figure 4, the AllenNLP SRL library resolves the entire string after "thank" as one argument but fails to distinguish entity types. Our method can handle all statements in Figure 4.
As a baseline, we use Stanza to extract all named entities and compare its performance with our relation-based (RB) classifier. The ground truth is compiled using the same corpus described in Section 4.5, except that only acknowledgement entities (as opposed to all named entities) are labeled positive. The results (Table 3) show that Stanza achieves high recall but poor precision, indicating that a significant fraction (~40%) of named entities are not acknowledged. In contrast, our RB classifier achieves a precision of 0.94 with a small loss of recall, yielding F1 = 0.92.
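The comparison in Table 3 amounts to set-based scoring over labeled (paper, entity) pairs. This toy sketch (with hypothetical data, not the paper's corpus) illustrates why a baseline that predicts every named entity scores high recall but low precision:

```python
def precision_recall_f1(predicted, gold):
    # Set-based scoring over (paper_id, entity) pairs.
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy data: the "predict every named entity" baseline finds both
# acknowledged persons but also two entities that merely add context.
gold = {("p1", "Anne Broadwater"), ("p1", "Nicolas Olivarez")}
predicted = gold | {("p1", "de Silva"), ("p1", "UNC")}
p, r, f1 = precision_recall_f1(predicted, gold)
print(round(p, 2), round(r, 2), round(f1, 2))
# → 0.5 1.0 0.67
```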
One limitation of the RB classifier is that it relies on text quality: sentences are expected to follow the editorial convention that entity names are clearly delimited by periods. If two sentences are not properly delimited, e.g., a period is missing, the classifier may make incorrect predictions.

Figure 4: Sentence examples from which REVERB or OLLIE failed to extract the acknowledgement entities (underlined in the original figure) that can be identified by our method: "To Aulio Costa Zambenedetti for the art work; Nilson Fidêncio, Silvio Marques, Tania Schepainski and Sibelli Tanjoni de Souza for technical assistance"; "The authors also thank the Program for Technological Development in Tools for Health-PDTIS FIOCRUZ for the use of its facilities (Platform RPT09H - Real Time PCR - Instituto Carlos Chagas/Fiocruz-PR)"; "The authors wish to thank Ms. Qi Yang, DNA sequencing lab, Morehouse School of Medicine, Atlanta, GA, and the Research Cores at Morehouse School of Medicine, supported through the Research Centers at Minority Institutions (RCMI) Program, NIH/NCRR/RCMI Grant G12-RR03034."

Figure 5: Merging the results given by step 2 ([Norbert Mencke, David McGahie]) and step 3 ([Norbert Mencke, Lourdes Mottier, David McGahie] and [Bayer Animal Health, Virbac Group]) to obtain the final set [Norbert Mencke, Lourdes Mottier, David McGahie, Bayer Animal Health, Virbac Group] for the sentence "The authors would like to thank Norbert Mencke, Lourdes Mottier and David McGahie, and the kind support of Bayer Animal Health GmbH and Virbac Group."
5 Data Analysis
The CORD-19 dataset contains PMC and PDF folders. We found that almost all papers in the PMC folders are included in the PDF folders. For consistency, we only work on papers in the PDF folders, which contain 37,612 unique PDF papers.7 Using ACKEXTRACT, we extracted a total of 102,965 named entities from 21,469 CORD-19 papers. The fraction of papers with named entities (21,469/37,612 ≈ 57%) is roughly consistent with the fraction obtained from the 200 samples (120/200 = 60%, Section 4.2). Among all named entities, 58,742 are acknowledgement entities. These numbers suggest that, under our model, only about 50% of named entities are acknowledged. By relying on entities recognized by NER software packages without further classification, acknowledgement studies could therefore significantly overestimate the number of acknowledgement entities. Here we analyze our results and study some trends of acknowledgement entities in the CORD-19 dataset.

7 We excluded 2,003 duplicate papers with exactly the same SHA-1 values.

Algorithm 1: Relation-Based Classifier
1: Function find_entity(subsentence):
2:   pre-process punctuation and format
3:   find candidate entities entity_list with Stanza
4:   for x ∈ entity_list do
5:     if x ∈ subject part (differs based on the predicate) then
6:       x ← none
7:   remove none values from entity_list
8:   return entity_list
9: Input: sentence
10: Output: entity_list_all
11: find the subject part, the predicate, and the object part
12: if predicate ∈ reg_expression_1 then
13:   if "and" ∈ sentence then
14:     split into subsentences by "and"
15:     for each subsentence do
16:       entity_list ← find_entity(subsentence)
17:       if entity_list ≠ none then
18:         entity_list_all ← append entity_list[0]
19:   else
20:     entity_list ← find_entity(sentence)
21:     entity_list_all ← entity_list[0]
22: else if predicate ∈ reg_expression_2 then
23:   do the same as above, but focus on entities in the subject
24: for x ∈ candidate_list from Stanza do
25:   if x ∈ sentence and x.type = type then
26:     orig_list ← append x
27:     sent ← replace x by index(x)
28: if reg_format ∈ sent then
29:   temp ← find reg_expression in sent
30:   numlist ← find numbers in temp
31:   list ← index orig_list by numlist
32: combine list and entity_list_all
The top 10 acknowledged organizations (Table 4) are all funding agencies. Overall, NIH (NIAID is an institute of NIH) is acknowledged the most, but funding agencies in other countries are also frequently acknowledged in CORD-19 papers.
Figure 6 shows the numbers of CORD-19 papers with vs. without acknowledgement (person or organization) recognized, from 1970 to 2020. The huge leap around 2002 was due to the wave of coronavirus studies during the SARS period. The small drop in 2020 was due to data incompleteness. The plot indicates that the fraction of papers with acknowledgement has been increasing gradually over the last 20 years. Figure 7 shows the counts for the top 10 acknowledged organizations from 1983 to 2020. The figure indicates that the number of acknowledgements to NIH has been gradually decreasing over the past 10 years, while the number for NSF is roughly constant. In contrast, the number for NSFC (a Chinese funding agency) has been gradually increasing. Note that the distribution of acknowledged organizations has a long tail, and the organizations outside the top 10 actually dominate the total count. However, the top 10 organizations are the biggest research agencies, and the trend to some extent reflects strategic shifts of funding support.

Organization                                                  Count
National Institutes of Health (NIH)                            1414
National Natural Science Foundation of China (NSFC)             615
National Institute of Allergy & Infectious Diseases (NIAID)     209
National Science Foundation (NSF)                               183
Ministry of Health                                              160
Wellcome Trust                                                  146
Deutsche Forschungsgemeinschaft                                 119
Ministry of Education                                           119
National Science Council                                        104
Public Health Service                                            85

Table 4: Top 10 acknowledged organizations identified in our CORD-19 dataset.

Figure 6: Numbers of CORD-19 papers with vs. without acknowledgement recognized, from 1970 to 2020.
6 Conclusion and Future Work
Here we extended the work of Khabsa et al. (2012) and built an acknowledgement extraction framework for research articles, denoted ACKEXTRACT. ACKEXTRACT is based on heuristic methods and state-of-the-art text mining libraries (e.g., GROBID and Stanza), but features a classifier that discriminates acknowledgement entities from named entities by analyzing the multiple subject-predicate-object relations in a sentence. Our approach successfully recognizes acknowledgement entities that cannot be recognized by OIE packages such as REVERB, OLLIE, and the AllenNLP SRL library.

Figure 7: Trends for the top 10 acknowledged organizations from 1983 to 2020 (Public Health Service, National Science Council, Ministry of Education, Deutsche Forschungsgemeinschaft, Wellcome Trust, Ministry of Health, NIAID, NSF, NSFC, and NIH).
This method is applied to the CORD-19 dataset released on April 10, 2020, processing one PDF document in 5 seconds on average. Our results indicate that only 50–60% of named entities are acknowledged; the rest are mentioned to provide additional information (e.g., affiliation or location) about acknowledgement entities. Working on clean data, our method achieves an overall F1 = 0.92 for person and organization entities. The trend analysis of the CORD-19 papers verifies that more and more papers include acknowledgement entities since 2002, when the SARS outbreak happened. The trend also reveals that the overall number of acknowledgements to NIH has been gradually decreasing over the past 10 years, while more papers acknowledge NSFC, a Chinese funding agency. One caveat of our method is that organizations in different countries are not distinguished; for example, many countries have agencies called "Ministry of Health". In the future, we plan to build learning-based models for sentence classification and entity classification. The code and data of this project have been released on GitHub at https://github.com/lamps-lab/ackextract.
Acknowledgments
This work was partially supported by the Defense Advanced Research Projects Agency (DARPA) under cooperative agreement No. W911NF-19-2-0272. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

James Andreoni and Laura K. Gee. 2015. Gunning for efficiency with third party enforcement in threshold public goods. Experimental Economics, 18(1):154–171.

Hannah Bast and Claudius Korzen. 2017. A benchmark and evaluation for text extraction from PDF. In 2017 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19-23, 2017, pages 99–108. IEEE Computer Society.

Steven Bird. 2006. NLTK: the natural language toolkit. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006. The Association for Computer Linguistics.

Blaise Cronin. 2001. Acknowledgement trends in the research literature of information science. Journal of Documentation, 57(3):427–433.

Isaac G. Councill, C. Lee Giles, Hui Han, and Eren Manavoglu. 2005. Automatic acknowledgement indexing: Expanding the semantics of contribution in the CiteSeer digital library. In Proceedings of the 3rd International Conference on Knowledge Capture, K-CAP '05, pages 19–26, New York, NY, USA. ACM.

Blaise Cronin, Gail McKenzie, Lourdes Rubio, and Sherrill Weaver-Wozniak. 1993. Accounting for influence: Acknowledgments in contemporary sociology. Journal of the American Society for Information Science, 44(7):406–412.

Blaise Cronin, Gail McKenzie, and Michael Stiffler. 1992. Patterns of acknowledgement. Journal of Documentation.

S. Dai, Y. Ding, Z. Zhang, W. Zuo, X. Huang, and S. Zhu. 2019. GrantExtractor: Accurate grant support information extraction from biomedical fulltext based on Bi-LSTM-CRF. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pages 1–1.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, pages 1535–1545.

C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, USA, pages 89–98.

C. Lee Giles and Isaac G. Councill. 2004. Who gets acknowledged: Measuring scientific contributions through automatic acknowledgment indexing. Proceedings of the National Academy of Sciences, 101(51):17599–17604.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Subhradeep Kayal, Zubair Afzal, George Tsatsaronis, Marius Doornenbal, Sophia Katrenko, and Michelle Gregory. 2019. A framework to automatically extract funding information from text. In Machine Learning, Optimization, and Data Science, pages 317–328, Cham. Springer International Publishing.

Madian Khabsa, Pucktada Treeratpituk, and C. Lee Giles. 2012. AckSeer: a repository and search engine for automatically extracted acknowledgments from digital libraries. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '12, Washington, DC, USA, June 10-14, 2012, pages 185–194. ACM.

Mario Lipinski, Kevin Yao, Corinna Breitinger, Joeran Beel, and Bela Gipp. 2013. Evaluation of header metadata extraction approaches and tools for scientific PDF documents. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, pages 385–386, New York, NY, USA. ACM.

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S. Weld. 2019. S2ORC: The Semantic Scholar open research corpus.

Patrice Lopez. 2009. GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries, ECDL'09, pages 473–474, Berlin, Heidelberg. Springer-Verlag.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 523–534, Stroudsburg, PA, USA. Association for Computational Linguistics.

Adèle Paul-Hus, Adrián A. Díaz-Faes, Maxime Sainte-Marie, Nadine Desrochers, Rodrigo Costas, and Vincent Larivière. 2017. Beyond funding: Acknowledgement patterns in biomedical, natural and social sciences. PLOS ONE, 12(10):1–14.

Adèle Paul-Hus, Philippe Mongeon, Maxime Sainte-Marie, and Vincent Larivière. 2020. Who are the acknowledgees? An analysis of gender and academic status. Quantitative Science Studies, 0(0):1–17.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. CoRR, abs/2003.07082.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA.

Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. CoRR, abs/1904.05255.

Mayank Singh, Barnopriyo Barua, Priyank Palod, Manvi Garg, Sidhartha Satapathy, Samuel Bushi, Kumar Ayush, Krishna Sai Rohith, Tulasi Gamidi, Pawan Goyal, and Animesh Mukherjee. 2016. OCR++: A robust framework for information extraction from scholarly articles. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 3390–3400. ACL.

E. Smith, B. Williams-Jones, Z. Master, V. Larivière, C. R. Sugimoto, A. Paul-Hus, M. Shi, and D. B. Resnik. 2019. Misconduct and misbehavior related to authorship disagreements in collaborative science. Sci Eng Ethics.

Min Song, Keun Young Kang, Tatsawan Timakum, and Xinyuan Zhang. 2020. Examining influential factors for acknowledgements classification using supervised learning. PLOS ONE, 15(2):1–21.

H. Stewart, K. Brown, A. M. Dinan, N. Irigoyen, E. J. Snijder, and A. E. Firth. 2018. Transcriptional and translational landscape of equine torovirus. J Virol, 92(17).

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: the COVID-19 open research dataset. CoRR, abs/2004.10706.

Jian Wu, Jason Killian, Huaiyu Yang, Kyle Williams, Sagnik Ray Choudhury, Suppawong Tuarob, Cornelia Caragea, and C. Lee Giles. 2015. PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 131–138, New York, NY, USA. ACM.
Anthony Fader Stephen Soderland and Oren Etzioni2011 Identifying relations for open information ex-traction In Proceedings of the 2011 Conference onEmpirical Methods in Natural Language ProcessingEMNLP 2011 27-31 July 2011 John McIntyre Con-ference Centre Edinburgh UK A meeting of SIG-DAT a Special Interest Group of the ACL pages1535ndash1545
C Lee Giles Kurt D Bollacker and Steve Lawrence1998 CiteSeer An automatic citation indexing sys-tem In Proceedings of the 3rd ACM InternationalConference on Digital Libraries June 23-26 1998Pittsburgh PA USA pages 89ndash98
C Lee Giles and Isaac G Councill 2004 Whogets acknowledged Measuring scientific contribu-tions through automatic acknowledgment indexingProceedings of the National Academy of Sciences101(51)17599ndash17604
Matthew Honnibal and Ines Montani 2017 spaCy 2Natural language understanding with Bloom embed-dings convolutional neural networks and incremen-tal parsing To appear
Subhradeep Kayal Zubair Afzal George TsatsaronisMarius Doornenbal Sophia Katrenko and MichelleGregory 2019 A framework to automatically ex-tract funding information from text In MachineLearning Optimization and Data Science pages317ndash328 Cham Springer International Publishing
Madian Khabsa Pucktada Treeratpituk and C LeeGiles 2012 Ackseer a repository and search en-gine for automatically extracted acknowledgmentsfrom digital libraries In Proceedings of the12th ACMIEEE-CS Joint Conference on Digital Li-braries JCDL rsquo12 Washington DC USA June 10-14 2012 pages 185ndash194 ACM
Mario Lipinski Kevin Yao Corinna Breitinger Jo-eran Beel and Bela Gipp 2013 Evaluation ofheader metadata extraction approaches and tools forscientific pdf documents In Proceedings of the13th ACMIEEE-CS Joint Conference on Digital Li-braries JCDL rsquo13 pages 385ndash386 New York NYUSA ACM
Kyle Lo Lucy Lu Wang Mark Neumann Rodney Kin-ney and Dan S Weld 2019 S2orc The semanticscholar open research corpus
Patrice Lopez 2009 Grobid Combining auto-matic bibliographic data recognition and term ex-traction for scholarship publications In Proceed-ings of the 13th European Conference on Re-search and Advanced Technology for Digital Li-braries ECDLrsquo09 pages 473ndash474 Berlin Heidel-berg Springer-Verlag
Christopher D Manning Mihai Surdeanu John BauerJenny Finkel Steven J Bethard and David Mc-Closky 2014 The Stanford CoreNLP natural lan-guage processing toolkit In Association for Compu-tational Linguistics (ACL) System Demonstrationspages 55ndash60
Mausam Michael Schmitz Robert Bart StephenSoderland and Oren Etzioni 2012 Open languagelearning for information extraction In Proceedingsof the 2012 Joint Conference on Empirical Methodsin Natural Language Processing and ComputationalNatural Language Learning EMNLP-CoNLL rsquo12
19
pages 523ndash534 Stroudsburg PA USA Associationfor Computational Linguistics
Adele Paul-Hus Adrian A Dıaz-Faes Maxime Sainte-Marie Nadine Desrochers Rodrigo Costas and Vin-cent Lariviere 2017 Beyond funding Acknowl-edgement patterns in biomedical natural and socialsciences PLOS ONE 12(10)1ndash14
Adele Paul-Hus Philippe Mongeon Maxime Sainte-Marie and Vincent Lariviere 2020 Who are theacknowledgees an analysis of gender and academicstatus Quantitative Science Studies 0(0)1ndash17
Peng Qi Yuhao Zhang Yuhui Zhang Jason Boltonand Christopher D Manning 2020 Stanza Apython natural language processing toolkit for manyhuman languages CoRR abs200307082
Radim Rehurek and Petr Sojka 2010 SoftwareFramework for Topic Modelling with Large Cor-pora In Proceedings of the LREC 2010 Workshopon New Challenges for NLP Frameworks pages 45ndash50 Valletta Malta ELRA httpismuniczpublication884893en
Peng Shi and Jimmy Lin 2019 Simple BERT mod-els for relation extraction and semantic role labelingCoRR abs190405255
Mayank Singh Barnopriyo Barua Priyank PalodManvi Garg Sidhartha Satapathy Samuel BushiKumar Ayush Krishna Sai Rohith Tulasi GamidiPawan Goyal and Animesh Mukherjee 2016OCR++ A robust framework for information extrac-tion from scholarly articles In COLING 2016 26thInternational Conference on Computational Linguis-tics Proceedings of the Conference Technical Pa-pers December 11-16 2016 Osaka Japan pages3390ndash3400 ACL
E Smith B Williams-Jones Z Master V LariviereC R Sugimoto A Paul-Hus M Shi and D BResnik 2019 Misconduct and misbehavior relatedto authorship disagreements in collaborative scienceSci Eng Ethics
Min Song Keun Young Kang Tatsawan Timakum andXinyuan Zhang 2020 Examining influential fac-tors for acknowledgements classification using su-pervised learning PLOS ONE 15(2)1ndash21
H Stewart K Brown A M Dinan N Irigoyen E JSnijder and A E Firth 2018 Transcriptional andtranslational landscape of equine torovirus J Virol92(17)
Lucy Lu Wang Kyle Lo Yoganand ChandrasekharRussell Reas Jiangjiang Yang Darrin Eide KathrynFunk Rodney Kinney Ziyang Liu William Mer-rill Paul Mooney Dewey Murdick Devvret RishiJerry Sheehan Zhihong Shen Brandon StilsonAlex D Wade Kuansan Wang Chris Wilhelm BoyaXie Douglas Raymond Daniel S Weld Oren Et-zioni and Sebastian Kohlmeier 2020 CORD-19 the covid-19 open research dataset CoRRabs200410706
Jian Wu Jason Killian Huaiyu Yang Kyle WilliamsSagnik Ray Choudhury Suppawong Tuarob Cor-nelia Caragea and C Lee Giles 2015 PdfmefA multi-entity knowledge extraction framework forscholarly documents and semantic search In Pro-ceedings of the 8th International Conference onKnowledge Capture K-CAP 2015 pages 131ndash138New York NY USA ACM
Organization                                                  Count
National Institutes of Health (NIH)                            1414
National Natural Science Foundation of China (NSFC)             615
National Institute of Allergy & Infectious Diseases (NIAID)     209
National Science Foundation (NSF)                               183
Ministry of Health                                              160
Wellcome Trust                                                  146
Deutsche Forschungsgemeinschaft                                 119
Ministry of Education                                           119
National Science Council                                        104
Public Health Service                                            85

Table 4: Top 10 acknowledged organizations identified in our CORD-19 dataset.
Figure 6: Numbers of CORD-19 papers with vs. without acknowledgement recognized from 1970 to 2020.
papers with acknowledgements has been increasing gradually over the last 20 years. Figure 7 shows the numbers of acknowledgements to the top 10 acknowledged organizations from 1983 to 2020. The figure indicates that the number of acknowledgements to NIH has been gradually decreasing over the past 10 years, while the number of acknowledgements to NSF is roughly constant. In contrast, the number of acknowledgements to NSFC (a Chinese funding agency) has been gradually increasing. Note that the distribution of acknowledged organizations has a long tail, and the organizations behind the top 10 actually dominate the total count. However, the top 10 organizations are the biggest research agencies, and the trend to some extent reflects strategic shifts of funding support.
6 Conclusion and Future Work
Figure 7: Trends of the top 10 acknowledged organizations from 1983 to 2020.

Here we extended the work of Khabsa et al. (2012) and built an acknowledgement extraction framework for research articles, denoted ACKEXTRACT. ACKEXTRACT is based on heuristic methods and state-of-the-art text mining libraries (e.g., GROBID and Stanza), but features a classifier that discriminates acknowledgement entities from named entities by analyzing the multiple subject-predicate-object relations in a sentence. Our approach successfully recognizes acknowledgement entities that cannot be recognized by OIE packages such as REVERB, OLLIE, and the AllenNLP SRL library.
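The core filtering idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the gratitude-predicate list, the triple representation, and the names "John Smith" and "Acme University" are assumptions made for the example. The point is that a named entity counts as acknowledged only when it appears as the object of a gratitude predicate; entities that merely provide context (e.g., an affiliation) are filtered out.

```python
# Illustrative sketch of acknowledgement-entity filtering via
# subject-predicate-object relations (hypothetical predicate list).
GRATITUDE_PREDICATES = {"thank", "acknowledge", "appreciate", "support", "fund"}

def acknowledged_entities(triples, named_entities):
    """triples: (subject, predicate_lemma, object) tuples from one sentence;
    named_entities: entity strings found by an NER tagger.
    Returns the subset of named entities that are actually acknowledged."""
    acked = set()
    for subj, pred, obj in triples:
        if pred in GRATITUDE_PREDICATES and obj in named_entities:
            acked.add(obj)
    return acked

# "We thank John Smith. John Smith is at Acme University."
triples = [
    ("We", "thank", "John Smith"),        # gratitude relation
    ("John Smith", "be", "Acme University"),  # contextual relation only
]
ents = {"John Smith", "Acme University"}
print(acknowledged_entities(triples, ents))  # {'John Smith'}
```

Only "John Smith" is returned: "Acme University" is a named entity but is not acknowledged, which is exactly the distinction that named-entity recognition alone cannot make.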
This method was applied to the CORD-19 dataset released on April 10, 2020, processing one PDF document in 5 seconds on average. Our results indicate that only 50–60% of named entities are acknowledged; the rest are mentioned to provide additional information (e.g., affiliation or location) about acknowledgement entities. Working on clean data, our method achieves an overall F1 = 0.92 for person and organization entities. The trend analysis of the CORD-19 papers verifies that more and more papers have included acknowledgement entities since 2002, when the SARS outbreak happened. The trend also reveals that the overall number of acknowledgements to NIH has been gradually decreasing over the past 10 years, while more papers acknowledge NSFC, a Chinese funding agency. One caveat of our method is that organizations in different countries are not distinguished; for example, many countries have agencies called "Ministry of Health". In the future, we plan to build learning-based models for sentence classification and entity classification. The code and data of this project have been released on GitHub at https://github.com/lamps-lab/ackextract.
Acknowledgments
This work was partially supported by the Defense Advanced Research Projects Agency (DARPA) under cooperative agreement No. W911NF-19-2-0272. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

James Andreoni and Laura K. Gee. 2015. Gunning for efficiency with third party enforcement in threshold public goods. Experimental Economics, 18(1):154–171.

Hannah Bast and Claudius Korzen. 2017. A benchmark and evaluation for text extraction from PDF. In 2017 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19-23, 2017, pages 99–108. IEEE Computer Society.

Steven Bird. 2006. NLTK: the natural language toolkit. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006. The Association for Computer Linguistics.

Cronin Blaise. 2001. Acknowledgement trends in the research literature of information science. 57(3):427–433.

Isaac G. Councill, C. Lee Giles, Hui Han, and Eren Manavoglu. 2005. Automatic acknowledgement indexing: Expanding the semantics of contribution in the CiteSeer digital library. In Proceedings of the 3rd International Conference on Knowledge Capture, K-CAP '05, pages 19–26, New York, NY, USA. ACM.

Blaise Cronin, Gail McKenzie, Lourdes Rubio, and Sherrill Weaver-Wozniak. 1993. Accounting for influence: Acknowledgments in contemporary sociology. Journal of the American Society for Information Science, 44(7):406–412.

Blaise Cronin, Gail McKenzie, and Michael Stiffler. 1992. Patterns of acknowledgement. Journal of Documentation.

S. Dai, Y. Ding, Z. Zhang, W. Zuo, X. Huang, and S. Zhu. 2019. GrantExtractor: Accurate grant support information extraction from biomedical fulltext based on Bi-LSTM-CRF. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pages 1–1.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, pages 1535–1545.

C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, USA, pages 89–98.

C. Lee Giles and Isaac G. Councill. 2004. Who gets acknowledged? Measuring scientific contributions through automatic acknowledgment indexing. Proceedings of the National Academy of Sciences, 101(51):17599–17604.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Subhradeep Kayal, Zubair Afzal, George Tsatsaronis, Marius Doornenbal, Sophia Katrenko, and Michelle Gregory. 2019. A framework to automatically extract funding information from text. In Machine Learning, Optimization, and Data Science, pages 317–328, Cham. Springer International Publishing.

Madian Khabsa, Pucktada Treeratpituk, and C. Lee Giles. 2012. AckSeer: a repository and search engine for automatically extracted acknowledgments from digital libraries. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '12, Washington, DC, USA, June 10-14, 2012, pages 185–194. ACM.

Mario Lipinski, Kevin Yao, Corinna Breitinger, Joeran Beel, and Bela Gipp. 2013. Evaluation of header metadata extraction approaches and tools for scientific PDF documents. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, pages 385–386, New York, NY, USA. ACM.

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S. Weld. 2019. S2ORC: The Semantic Scholar open research corpus.

Patrice Lopez. 2009. GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries, ECDL'09, pages 473–474, Berlin, Heidelberg. Springer-Verlag.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 523–534, Stroudsburg, PA, USA. Association for Computational Linguistics.

Adèle Paul-Hus, Adrián A. Díaz-Faes, Maxime Sainte-Marie, Nadine Desrochers, Rodrigo Costas, and Vincent Larivière. 2017. Beyond funding: Acknowledgement patterns in biomedical, natural and social sciences. PLOS ONE, 12(10):1–14.

Adèle Paul-Hus, Philippe Mongeon, Maxime Sainte-Marie, and Vincent Larivière. 2020. Who are the acknowledgees? An analysis of gender and academic status. Quantitative Science Studies, 0(0):1–17.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. CoRR, abs/2003.07082.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. CoRR, abs/1904.05255.

Mayank Singh, Barnopriyo Barua, Priyank Palod, Manvi Garg, Sidhartha Satapathy, Samuel Bushi, Kumar Ayush, Krishna Sai Rohith, Tulasi Gamidi, Pawan Goyal, and Animesh Mukherjee. 2016. OCR++: A robust framework for information extraction from scholarly articles. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 3390–3400. ACL.

E. Smith, B. Williams-Jones, Z. Master, V. Larivière, C. R. Sugimoto, A. Paul-Hus, M. Shi, and D. B. Resnik. 2019. Misconduct and misbehavior related to authorship disagreements in collaborative science. Sci Eng Ethics.

Min Song, Keun Young Kang, Tatsawan Timakum, and Xinyuan Zhang. 2020. Examining influential factors for acknowledgements classification using supervised learning. PLOS ONE, 15(2):1–21.

H. Stewart, K. Brown, A. M. Dinan, N. Irigoyen, E. J. Snijder, and A. E. Firth. 2018. Transcriptional and translational landscape of equine torovirus. J Virol, 92(17).

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The COVID-19 open research dataset. CoRR, abs/2004.10706.

Jian Wu, Jason Killian, Huaiyu Yang, Kyle Williams, Sagnik Ray Choudhury, Suppawong Tuarob, Cornelia Caragea, and C. Lee Giles. 2015. PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 131–138, New York, NY, USA. ACM.