Proceedings of the First Workshop on Scholarly Document Processing, pages 10–19, Online, November 19, 2020. ©2020 Association for Computational Linguistics
https://doi.org/10.18653/v1/P17
Acknowledgement Entity Recognition in CORD-19 Papers

Jian Wu, Computer Science, Old Dominion University, Norfolk, VA, USA, jwu@cs.odu.edu
Pei Wang, Xin Wei, Computer Science, Old Dominion University, Norfolk, VA, USA, {pwang001, xwei001}@odu.edu
Sarah Michele Rajtmajer, C. Lee Giles, Information Sciences and Technology, Pennsylvania State University, University Park, PA, USA
Christopher Griffin, Applied Research Laboratory, Pennsylvania State University, University Park, PA, USA
Abstract
Acknowledgements are ubiquitous in scholarly papers. Existing acknowledgement entity recognition methods assume all named entities are acknowledged. Here, we examine the nuances between acknowledged and named entities by analyzing sentence structure. We develop an acknowledgement extraction system, ACKEXTRACT, based on open-source text mining software, and evaluate our method using manually labeled data. ACKEXTRACT uses the PDF of a scholarly paper as input and outputs acknowledgement entities. Results show an overall performance of F1 = 0.92. We built a supplementary database by linking CORD-19 papers with acknowledgement entities extracted by ACKEXTRACT, including persons and organizations, and find that only up to 50–60% of named entities are actually acknowledged. We further analyze chronological trends of acknowledgement entities in CORD-19 papers. All code and labeled data are publicly available at https://github.com/lamps-lab/ackextract
1 Introduction
Acknowledgements have been an institutionalized part of research publications for some time (Blaise, 2001). Acknowledgement statements show the authors' public gratitude and recognition to individuals, organizations, and grants for various contributions. Acknowledged individuals and organizations have been under-represented in author ranking and citation impact analysis, mostly due to their presumed sub-authorship contribution. A recent survey found that discipline, academic rank, and gender have a significant effect on coauthorship disagreement rate (Smith et al., 2019), leading to non-author collaborators receiving less attention. A recent study of non-author collaborators in the biomedical and social sciences (Paul-Hus et al., 2017) showed that non-author collaborators are not rare and that their presence varies significantly by discipline.
Acknowledgements can be classified depending on the nature of the contribution. Song et al. (2020) classified sentences in acknowledgement sections into six categories: declaration, financial, peer interactive communication and technical support, presentation, general acknowledgement, and general statement. They can also be classified based on the type of entities, such as individual, organization, and grant. Since 2008, funding acknowledgements have been indexed by Web of Science. However, there is still no dedicated software to accurately recognize acknowledged people and organizations and generate a centralized acknowledgement database. Early works on acknowledgements were based on datasets manually extracted from specific journals, which was not scalable. Building such a large database can support further study of acknowledgements at a larger scale.
There are several scenarios that make the acknowledgement entity recognition (AER) task challenging. The upper panel of Figure 1 shows examples of sentences appearing in an isolated acknowledgement section. "Utrecht University" is mentioned but should not be counted as an acknowledgement entity, because it is just the affiliation of "Arno van Vliet", who is acknowledged. Acknowledgement statements can also appear in footnotes (Figure 1, Bottom), mixed with other footnote and/or body text. Author names may also appear in the statements, such as "Andreoni" in this example, and should be excluded.
Existing works on AER leverage off-the-shelf named entity recognition (NER) packages, such
Figure 1: Upper: acknowledgement statements appear in an isolated section (Stewart et al., 2018). Bottom: acknowledgement statements appear in a footnote (Andreoni and Gee, 2015).
as the Natural Language Toolkit (NLTK) (Bird, 2006), e.g., Khabsa et al. (2012) and Paul-Hus et al. (2020), followed by simple semi-manual data cleansing, resulting in a fraction of entities that are mentioned but not actually acknowledged. In this paper, we design an automatic AER system called ACKEXTRACT that further classifies extraction results from open-source NER packages that recognize people and organizations, and distinguishes entities that are actually acknowledged. The extractor finds acknowledgement statements in isolated sections and at other locations such as footnotes, which is common for papers in the social and behavioral sciences. Our contributions are:
1. Develop ACKEXTRACT as open-source software to automatically extract acknowledgement entities from research papers.
2. Apply ACKEXTRACT to the CORD-19 dataset and supplement the dataset with a corpus of classified acknowledgement entities for further studies.
3. Use the CORD-19 dataset as a case study to demonstrate that acknowledgement studies without classifying named entities can significantly overestimate the number of entities that are actually acknowledged, because many people and organizations are mentioned but not explicitly acknowledged.
2 Related Works
Early work on acknowledgement extraction was performed manually, which was labor-intensive. Cronin et al. (1993) extracted a total of 9,561 peer interactive communication (PIC) names from a total of 4,200 research sociology articles; most were persons' names. They also defined the following six categories of acknowledgement: moral support, financial support, editorial support, presentational support, instrumental/technical support, and conceptual support or PIC (Cronin et al., 1992).
Councill et al. (2005) used a hybrid method for automatic AER from research papers and automatically created an acknowledgement index (Giles and Councill, 2004). The algorithm first used a heuristic method for identifying acknowledgement passages. It then used an SVM model for identifying lines containing acknowledgement sentences outside labeled acknowledgement sections. A regular expression was used to extract entity names from acknowledging text. This method achieved an overall precision of about 0.785 and a recall of 0.896 on CiteSeer papers (Giles et al., 1998). The algorithm does not distinguish entity types.
Khabsa et al. (2012) leveraged OpenCalais1 and AlchemyAPI2, free services at that time, to extract named entities from acknowledgement sections and built ACKSEER, a search engine for acknowledgement entities. They merged the outputs of both NER APIs into a single list and disambiguated entity mentions using the longest common subsequence (LCS) algorithm. The ground truth contains 200 top-cited CiteSeerX papers, of which 130 had acknowledgement sections. They achieved 92.3% precision and 91.6% recall for acknowledgement section extraction but did not evaluate entity extraction.
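LCS-based mention disambiguation can be sketched in a few lines of Python. This is an illustrative reconstruction, not ACKSEER's actual code; the `same_entity` helper and its 0.8 coverage threshold are our own assumptions:

```python
def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings (classic DP)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            # Extend the LCS on a character match, otherwise carry the best so far.
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def same_entity(m1: str, m2: str, threshold: float = 0.8) -> bool:
    """Treat two mentions as one entity when the LCS covers most of the shorter mention."""
    return lcs_len(m1.lower(), m2.lower()) / min(len(m1), len(m2)) >= threshold

# Two spellings of the same name collapse; unrelated names do not.
print(same_entity("C. Lee Giles", "C Lee Giles"))   # True
print(same_entity("Wellcome Trust", "NSF"))          # False
```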
Recent studies of acknowledgements tend to use results from off-the-shelf NER packages with simple filters, assuming that named entities were acknowledged entities. For example, Paul-Hus et al. (2020) use the Stanford NER module in NLTK to extract persons. Song et al. (2020) also directly use people and organizations recognized by Stanford CoreNLP (Manning et al., 2014). These works achieved a high recall by recognizing most named entities in the acknowledgements but ignored their relations to the papers in which they appear, resulting in a fraction of entities that are mentioned but not actually acknowledged. Song et al. (2020) consider grammatical structure, such as verb tense and voice, and sentence patterns when labeling sentences with their six categories. For example, "was funded" is followed by an "organization". However, they only label sentences and do not annotate them down to
1https://web.archive.org/web/20081023021111/http://opencalais.com/calaisAPI
2https://web.archive.org/web/20090921044923/http://www.alchemyapi.com
Figure 2: Architecture of ACKEXTRACT.
the entity level. Our system examines the relationship between entities and the current work with the purpose of discriminating acknowledgement entities from named entities. In this system, we focus on people and organizations.
Recently, Dai et al. (2019) proposed GrantExtractor, a pipeline system to extract grant support information from scientific papers. A model combining BiLSTM-CRF and pattern matching was used to extract grant numbers and agencies from funding sentences, which are identified using heuristic methods. The system achieves a micro-F1 of up to 0.90 in extracting grant pairs (agency, number).
Kayal et al. (2019) proposed an ensemble approach called FUNDINGFINDER for extracting funding information from text. The authors construct feature vectors for candidate entities based on whether the entities are recognized by four NER implementations: Stanford (Conditional Random Field model), LingPipe (Hidden Markov model), OpenNLP (Maximum Entropy model), and Elsevier's Fingerprint Engine. The F1-measure for funding body is only 0.68 ± 0.03.
Our method differs from existing methods in three ways: (1) it is built on top of state-of-the-art neural NER methods, which results in a relatively high recall; (2) it uses a heuristic method to filter out entities that are merely mentioned but not acknowledged; (3) it extracts both organizations and people.
3 Dataset
On March 16, 2020, the Allen Institute for Artificial Intelligence released the first version of the COVID-19 Open Research Dataset (CORD-19) (Wang et al., 2020) in collaboration with several other institutions. The dataset contains metadata and segmented full text of research articles, selected by searching a list of keywords about coronavirus, SARS-CoV, MERS, and other related terms from four digital libraries: WHO, PubMed Central, BioRxiv, and MedRxiv. The initial dataset contained about 28k papers and was updated weekly with papers from new sources and the latest publications. We used the dataset released on April 10, 2020, containing 59,312 metadata records, among which 54,756 have full text in JSON format. The CORD-19 papers were generated by processing PDFs using the S2ORC pipeline (Lo et al., 2019), in which GROBID (Lopez, 2009) was employed for document segmentation and metadata extraction.

The full text in the JSON files is directly used for sentence segmentation. However, we observe that GROBID extraction results are not perfect. In particular, we estimate the fraction of acknowledgement sections omitted. We also estimate the number of acknowledgement entities omitted from the data release due to GROBID extraction errors. To do this, we downloaded 45,916 full-text PDF papers collected by the Internet Archive (IA), because the CORD-19 dataset does not include PDF files3. We found 13,103 CORD-19 papers in the IA dataset.
4 Acknowledgement Extraction
4.1 Overview

The architecture of our acknowledgement entity extraction system is depicted in Figure 2. The system can be divided into the following modules:
1. Document segmentation: CORD-19 provides full text as JSON files, but in general most research articles are published in PDF, so our first step is converting a PDF document to text and segmenting it into sections. We use GROBID, which has shown superior performance over many other document extraction methods (Lipinski et al., 2013). The output is an XML file in the TEI schema.
2. Sentence segmentation: Paragraphs are segmented into sentences. We compare several sentence segmentation software packages and choose Stanza (Qi et al., 2020) because of its relatively high accuracy.
3. Sentence classification: Sentences are classified into acknowledgement and non-acknowledgement statements. The result is a set of acknowledgement statements inside or outside the acknowledgement sections.
3https://archive.org/download/covid19_fatcat_20200410
4. Entity recognition: Named entities are extracted from acknowledgement statements. We compare four commonly used NER software packages and choose Stanza because of its relatively high performance. In this work, we focus on persons and organizations.
5. Entity classification: In this module, we classify named entities by analyzing sentence structures, aiming at discriminating named entities that are actually acknowledged from those that are merely mentioned. We demonstrate that triple extraction packages such as REVERB and OLLIE fail to handle acknowledgement statements with multiple entities in their objects in our dataset. The results are acknowledgement entities, including people and organizations.
4.2 Document Segmentation

The majority of scholarly papers are published in PDF format, which is not readily readable by text processors. Several attempts have been made to convert PDF into text (Bast and Korzen, 2017) and segment the document into section and sub-section levels. GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents into TEI-encoded documents. Other similar methods have been developed recently, such as OCR++ (Singh et al., 2016) and Science Parse4. Lipinski et al. (2013) compared 7 metadata extraction methods and found that GROBID (version 0.4.0) achieved superior performance over the others. GROBID trained a cascade of conditional random field (CRF) models on PubMed and computer science papers. The recent version (0.6.0) has a set of powerful functionalities, such as extracting and parsing headers and segmenting full text. GROBID supports a batch mode and an API service mode; the latter enables large-scale document processing on multi-core servers, such as in CiteSeerX (Wu et al., 2015). A benchmarking result for version 0.6.0 shows that section title parsing achieves F1 = 0.70 under the strict matching criteria and F1 = 0.75 under the soft matching criteria5. Singh et al. (2016) claim OCR++ achieves better performance than GROBID in several fields, evaluated on computer science papers. However, the lack of a service-mode API and multi-domain adaptability limits its usability. Science Parse only extracts key metadata such as
4https://github.com/allenai/spv2
5https://grobid.readthedocs.io/en/latest/Benchmarking-pmc
title, authors, year, and venue. Therefore, we adopt GROBID to convert PDF documents into XML files.
Depending on the structure and provenance of PDFs, GROBID may miss acknowledgements in certain papers. To estimate the fraction of papers in which acknowledgements were missed by GROBID, we visually inspected a random sample of 200 papers from the CORD-19 dataset and found that only 146 papers (73%) contain acknowledgement statements, out of which GROBID successfully extracted all acknowledgement statements from 120 papers (82%). For the remaining 26 papers that GROBID failed to parse, 17 papers have their acknowledgements in sections and 9 in footnotes. We developed a heuristic method that can extract acknowledgement statements from all 120 papers with acknowledgement statements output by GROBID.
4.3 Sentence Segmentation
The acknowledgement sections or statements extracted above are paragraphs, which need to be segmented (or tokenized) into sentences. We compared four software packages for sentence segmentation: NLTK (Bird, 2006), Stanza (Qi et al., 2020), Gensim (Rehurek and Sojka, 2010), and the Pragmatic Segmenter6.
NLTK includes a sentence tokenization method, sent_tokenize(), which uses an unsupervised algorithm to build a model for abbreviated words, collocations, and words that start sentences, and then uses that model to find sentence boundaries. Stanza is a Python natural language analysis package developed by the Stanford NLP group. Its sentence segmentation is modeled as a tagging problem over character sequences, where a neural model predicts whether a given character is the end of a sentence. The split_sentences() function in the Gensim package splits a text and returns a list of sentences from a given text string using unsupervised pattern recognition. The Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
To compare the above methods, we created a ground truth corpus by randomly selecting acknowledgement sections or statements from 47 papers and manually segmenting them, resulting in 100 sentences. Table 1 shows the comparison results for the four methods. The precision is calculated
6https://github.com/diasks2/pragmatic_segmenter
Method      Precision  Recall  F1
Gensim      0.65       0.64    0.65
NLTK        0.72       0.69    0.70
Pragmatic   0.86       0.76    0.81
Stanza      0.81       0.88    0.84

Table 1: Sentence segmentation performance.
by dividing the number of correctly segmented sentences by the total number of sentences segmented. The recall is calculated by dividing the number of correctly segmented sentences by the total number of manually segmented sentences. Stanza outperforms the other three, achieving an F1 = 0.84.
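These definitions translate directly into code. The counts below are made-up illustrative numbers, not the paper's evaluation data:

```python
def segmentation_scores(correct: int, predicted: int, gold: int):
    """Precision = correct/predicted, recall = correct/gold, F1 = harmonic mean."""
    p = correct / predicted
    r = correct / gold
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# Hypothetical run: 108 sentences predicted, 88 of them correct, 100 gold sentences.
p, r, f1 = segmentation_scores(correct=88, predicted=108, gold=100)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.81 0.88 0.85
```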
4.4 Sentence Classification
Not all sentences in acknowledgement sections express acknowledgement, such as the following sentence: "The funders played no role in the study or preparation of the manuscript" (Song et al., 2020). In this module, we classify sentences into acknowledgement and non-acknowledgement statements. We developed a set of regular expressions that match verbs (e.g., "thank", "gratitude to", "indebted to"), adjectives (e.g., "grateful to"), and nouns (e.g., "helpful comments", "useful feedback") to cover as many cases as possible. To evaluate this method, we manually selected 100 sentences, including 50 positive and 50 negative samples, from the sentences obtained in Section 4.3. Our results show that 96 out of 100 sentences were classified correctly, resulting in an accuracy of 0.96.
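A minimal sketch of such a rule-based classifier, with a deliberately small cue list; the actual expressions live in the released code, so the patterns below are our own assumptions:

```python
import re

# Hypothetical cue patterns in the spirit of the paper's rule set:
# verbs, adjectives, and noun phrases that signal acknowledgement.
ACK_CUES = re.compile(
    r"\b(thank(s|ed)?|gratitude to|indebted to|grateful to|"
    r"helpful comments|useful feedback|acknowledge[sd]?|"
    r"(was|were|is|are) (supported|funded)( by)?)\b",
    re.IGNORECASE,
)

def is_ack_statement(sentence: str) -> bool:
    """True when the sentence contains an acknowledgement cue."""
    return bool(ACK_CUES.search(sentence))

print(is_ack_statement("We thank Jane Doe for helpful comments."))  # True
# The paper's negative example carries no cue, so it is rejected:
print(is_ack_statement("The funders played no role in the study."))  # False
```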
4.5 Entity Recognition
In this step, named entities are extracted using state-of-the-art NER software packages, including NLTK (Bird, 2006), Stanford CoreNLP (Manning et al., 2014), spaCy (Honnibal and Montani, 2017), and Stanza (Qi et al., 2020). Stanza is a Python library offering fully neural pre-trained models that provided state-of-the-art performance on many raw text processing tasks when it was released. Its NER model adopted the contextualized sequence tagger in Akbik et al. (2018). The architecture includes a Bi-LSTM character-level language model followed by a one-layer Bi-LSTM sequence tagger with a conditional random field (CRF) encoder. Although Stanza was developed based on Stanford CoreNLP, they exhibit different performance on our NER task.
The ground truth is built by randomly selecting
Method            Entity  Precision  Recall  F1
NLTK              Person  0.45       0.68    0.55
                  Org     0.59       0.77    0.67
spaCy             Person  0.74       0.88    0.80
                  Org     0.63       0.74    0.68
Stanford CoreNLP  Person  0.88       0.87    0.87
                  Org     0.68       0.80    0.73
Stanza            Person  0.89       0.93    0.91
                  Org     0.60       0.89    0.72

Table 2: Comparison of NER software packages.
100 acknowledgement paragraphs from sections, footnotes, and body text, and manually annotating person and organization entities without discriminating whether they are acknowledged or not. This results in 146 person and 209 organization entities. The comparison results indicate that, overall, Stanza outperforms the other three, achieving F1 = 0.91 for person and F1 = 0.72 for organization. In particular, the recall of Stanza is 9% higher than that of Stanford CoreNLP (Table 2).
4.6 Entity Classification
As we showed, not all named entities are acknowledged, such as "Utrecht University" in Figure 1. Therefore, it is necessary to build a classifier to discriminate acknowledgement entities, i.e., entities that are thanked by the paper or the authors, from named entities.
The majority of acknowledgement statements in academic articles have a relatively standard subject-predicate-object (SPO) structure. They use a limited number of words or phrases, such as "thank", "acknowledge", "are grateful", "is supported", and "is funded", as predicates. However, the object can contain multiple named entities, some of which are used as attributes of the others. In rare cases, certain sentences may lack subjects and predicates, such as the first sentence in Figure 4.
Our approach can be divided into three steps. Two representative examples are illustrated in Figure 3. The pseudo-code is shown in Algorithm 1.
Step 1: We resolve the type of voice (active or passive), the subject, and the predicate of a sentence using dependency parsing by Stanza. This is because named entities can appear in either the subjective or the objective part. We then locate all named entities. The semantic meaning of a predicate and its type of voice can be used to determine whether the acknowledged entities are in the objective part or the subjective part. In most cases, the target entities are objects.
Step 2: A sentence with multiple entities in the objective part is split into shorter sentences, called "subsentences", so that each subsentence is associated with at most one named entity. This is done by first splitting the sentence by "and". For each subsentence, if the subject and predicate are missing, we fill them in using the subject and predicate of the original sentence. The object in each subsentence does not necessarily contain an entity. For example, in the right panel of Figure 3, because "expertise" is not a named entity, it is replaced by "none" in the subsentence. The SPO relations that do not contain named entities are removed.
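Step 2 can be sketched as follows. `split_subsentences` is a hypothetical helper that splits the object on commas and "and" and re-attaches the original subject and predicate to every fragment; the real system additionally uses dependency parsing, which is omitted here:

```python
import re

def split_subsentences(subject: str, predicate: str, obj: str):
    """Split a multi-entity object on commas/'and' and inherit the
    original subject and predicate in every resulting subsentence."""
    parts = [p.strip() for p in re.split(r",|\band\b", obj) if p.strip()]
    return [(subject, predicate, part) for part in parts]

# Example object from Figure 3 (left panel of the second example).
triples = split_subsentences(
    "we", "thank",
    "Anne Broadwater and Nicolas Olivarez of the de Silva lab")
print(triples)
# [('we', 'thank', 'Anne Broadwater'),
#  ('we', 'thank', 'Nicolas Olivarez of the de Silva lab')]
```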
There are two scenarios. In the first scenario, a sentence or a subsentence may contain a list of entities with only the first being acknowledged. In the third example of Figure 4, the named entities are ['Qi Yang', 'Morehouse School of Medicine', 'Atlanta', 'GA'], but only the first entity is acknowledged. The remaining entities, which are recognized as organizations or locations, are used to supplement additional information. In this scenario, only the first named entity is extracted.
Step 3: In the second scenario, acknowledged entities are connected by commas or "and", such as in "We thank Shirley Hauta, Yurij Popowych, Elaine van Moorlehem and Yan Zhou for help in the virus production." In this scenario, the entities in this structure have the same type, indicating that they play similar roles. This parallel pattern can be captured by regular expressions. The entities resolved in Steps 2 and 3 are merged to form the final set.
The method to find the parallel structure is as follows. First, check whether each entity's type is person. If so, the entities are substituted with integer indexes. The sentence becomes "The authors would like to thank 0, 1 and 2 and the kind support of Bayer Animal Health GmbH and Virbac Group." If there are 3 or more consecutive numbers in this form, this is the parallel pattern, which is captured by regular expressions. The pattern also allows text between names (Figure 5). Next, the numbers in this part are extracted and mapped to the corresponding entities. In the example above, the numbers [0, 1, 2] correspond to the indexes of the entities ['Norbert Mencke', 'Lourdes Mottier', 'David McGahie']. Similar pattern recognition is performed for organizations. The process is depicted in Figure 5.
Figure 3: Process mapping multiple subject-predicate-object relations to acknowledgement entities, with two representative examples: "We would also like to thank the Canadian Wildlife Health Cooperative at the University of Saskatchewan for their support and expertise" and "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab for technical assistance with the assays." "None" means the object does not contain named entities.
Method               Precision  Recall  F1
Stanza               0.57       0.91    0.70
RB (without Step 3)  0.94       0.78    0.85
RB                   0.94       0.90    0.92

Table 3: Performance of Stanza and our relation-based (RB) classifier. Precision and recall show overall results combining person and organization.
We investigated open information extraction methods such as REVERB (Fader et al., 2011) and OLLIE (Mausam et al., 2012). We found that they only work for sentences with relatively simple structures, such as "The authors wish to thank Ming-Wei Guo for reagents and technical assistance," but fail on more complicated sentences with long objective parts or parallel subsentences (Figure 4). We also investigated the semantic role labeling (SRL) library in the AllenNLP toolkit (Shi and Lin, 2019). For the third sentence in Figure 4, the AllenNLP SRL library resolves the entire string after "thank" as one argument but fails to distinguish entity types. Our method can handle all statements in Figure 4.
As a baseline, we use Stanza to extract all named entities and compare its performance with that of our relation-based (RB) classifier. The ground truth is compiled using the same corpus described in Section 4.5, except that only acknowledgement entities (as opposed to all named entities) are labeled positive. The results (Table 3) show that Stanza achieves high recall but poor precision, indicating that a significant fraction (~40%) of named entities are not acknowledged. In contrast, our classifier (RB) achieves a precision of 0.94 with a small loss of recall, achieving an F1 = 0.92.
One limitation of the RB classifier is that it relies on text quality. Sentences are expected to follow
Figure 4: Sentence examples from which REVERB or OLLIE failed to extract acknowledgement entities (underlined in the original), which can be identified by our method. Examples: "To Aulio Costa Zambenedetti for the art work, Nilson Fidêncio, Silvio Marques, Tania Schepainski and Sibelli Tanjoni de Souza for technical assistance"; "The authors also thank the Program for Technological Development in Tools for Health-PDTIS FIOCRUZ for the use of its facilities (Platform RPT09H - Real Time PCR - Instituto Carlos Chagas/Fiocruz-PR)"; "The authors wish to thank Ms. Qi Yang, DNA sequencing lab, Morehouse School of Medicine, Atlanta, GA, and the Research Cores at Morehouse School of Medicine supported through the Research Centers at Minority Institutions (RCMI) Program NIH/NCRR/RCMI Grant G12-RR03034."
Figure 5: Merging the results given by Step 2 and Step 3 to obtain the final results for the sentence "The authors would like to thank Norbert Mencke, Lourdes Mottier and David McGahie and the kind support of Bayer Animal Health GmbH and Virbac Group": [Norbert Mencke, David McGahie, Bayer Animal Health, Virbac Group, Lourdes Mottier].
the editorial convention that entity names are clearly delimited by periods. If two sentences are not properly delimited, e.g., a period is missing, the classifier may make incorrect predictions.
5 Data Analysis
The CORD-19 dataset contains PMC and PDF folders. We found that almost all papers in the PMC folders are included in the PDF folders. For consistency, we only work on papers in the PDF folders, containing 37,612 unique PDF papers7. Using ACKEXTRACT, we extracted a total of 102,965 named entities from 21,469 CORD-19 papers. The fraction of papers with named entities (21,469/37,612 ≈ 57%) is roughly consistent with the fraction obtained from the 200 samples (120/200 ≈ 60%, Section 4.2). Among all named entities, 58,742 are acknowledgement entities. These numbers suggest that, by our model, only about 50% of named entities are acknowledged. Using only named entities, acknowledgement studies could significantly overestimate the

7We excluded 2,003 duplicate papers with exactly the same SHA-1 values.
Algorithm 1: Relation-Based Classifier

Function find_entity(sentence):
    pre-process punctuation and format
    find candidate entities entity_list by Stanza
    for x in entity_list:
        if x is in the subject part (differs based on predicate):
            x = none
    remove none from entity_list
    return entity_list

Input: sentence
Output: entity_list_all
find the subject part, the predicate, and the object part
if predicate matches reg_expression_1:
    if "and" in sentence:
        split into subsentences by "and"
        for each subsentence:
            entity_list <- find_entity(subsentence)
            if entity_list != none:
                entity_list_all <- append entity_list[0]
    else:
        entity_list <- find_entity(sentence)
        entity_list_all <- entity_list[0]
else if predicate matches reg_expression_2:
    do the same as the if-branch but focus on entities in the subject
for x in candidate_list by Stanza:
    if x in sentence and x.type = type:
        orig_list <- append x
        sent <- replace x by index(x)
if reg_format in sent:
    temp <- find reg_expression in sent
    numlist <- find numbers in temp
    list <- index orig_list by numlist
combine list and entity_list_all
number of acknowledgement entities by relying on entities recognized by NER software packages without further classification. Here, we analyze our results and study some trends of acknowledgement entities in the CORD-19 dataset.
The top 10 acknowledged organizations (Table 4) are all funding agencies. Overall, NIH (NIAID is an institute of NIH) is acknowledged the most, but funding agencies in other countries are also frequently acknowledged in CORD-19 papers.
Figure 6 shows the numbers of CORD-19 papers with vs. without acknowledgement (person or organization) recognized from 1970 to 2020. The huge leap around 2002 was due to the wave of coronavirus studies during the SARS period. The small drop in 2020 was due to data incompleteness. The plot indicates that the fraction of
Organization                                                    Count
National Institutes of Health or NIH                            1414
National Natural Science Foundation of China or NSFC             615
National Institute of Allergy & Infectious Diseases or NIAID     209
National Science Foundation or NSF                               183
Ministry of Health                                               160
Wellcome Trust                                                   146
Deutsche Forschungsgemeinschaft                                  119
Ministry of Education                                            119
National Science Council                                         104
Public Health Service                                             85

Table 4: Top 10 acknowledged organizations identified in our CORD-19 dataset.
Figure 6: Numbers of CORD-19 papers with vs. without acknowledgement recognized, from 1970 to 2020.
papers with acknowledgement has been increasing gradually over the last 20 years. Figure 7 shows the numbers of acknowledgements to the top 10 acknowledged organizations from 1983 to 2020. The figure indicates that the number of acknowledgements to NIH has been gradually decreasing over the past 10 years, while the number of acknowledgements to NSF is roughly constant. In contrast, the number of acknowledgements to NSFC (a Chinese funding agency) has been gradually increasing. Note that the distribution of acknowledged organizations has a long tail, and organizations outside the top 10 actually dominate the total count. However, the top 10 organizations are the biggest research agencies, and the trend, to some extent, reflects strategic shifts in funding support.
6 Conclusion and Future Work
Here we extended the work of Khabsa et al. (2012) and built an acknowledgement extraction framework, denoted ACKEXTRACT, for research articles. ACKEXTRACT is based on heuristic methods and state-of-the-art text mining libraries (e.g., GROBID and Stanza), but features a classifier that discriminates acknowledgement entities from named entities by analyzing the multiple subject-predicate-object relations in a sentence. Our approach successfully recognizes acknowledgement entities that cannot be recognized by OIE packages such as REVERB, OLLIE, and the AllenNLP SRL library.

Figure 7: Trend of acknowledgements to the top 10 acknowledged organizations from 1983 to 2020 (series: the ten organizations in Table 4).
This method is applied to the CORD-19 dataset released on April 10, 2020, processing one PDF document in 5 seconds on average. Our results indicate that only 50–60% of named entities are acknowledged; the rest are mentioned to provide additional information (e.g., affiliation or location) about acknowledgement entities. Working on clean data, our method achieves an overall F1 = 0.92 for person and organization entities. The trend analysis of the CORD-19 papers verifies that more and more papers include acknowledgement entities since 2002, when the SARS outbreak happened. The trend also reveals that the overall number of acknowledgements to NIH is gradually decreasing over the past 10 years, while more papers acknowledge NSFC, a Chinese funding agency. One caveat of our method is that organizations in different countries are not distinguished; for example, many countries have agencies called "Ministry of Health". In the future, we plan to build learning-based models for sentence classification and entity classification. The code and data of this project have been released on GitHub at https://github.com/lamps-lab/ackextract.
Acknowledgments
This work was partially supported by the Defense Advanced Research Projects Agency (DARPA) under cooperative agreement No. W911NF-19-2-0272. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

James Andreoni and Laura K. Gee. 2015. Gunning for efficiency with third party enforcement in threshold public goods. Experimental Economics, 18(1):154–171.

Hannah Bast and Claudius Korzen. 2017. A benchmark and evaluation for text extraction from PDF. In 2017 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19-23, 2017, pages 99–108. IEEE Computer Society.

Steven Bird. 2006. NLTK: the natural language toolkit. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006. The Association for Computer Linguistics.

Cronin Blaise. 2001. Acknowledgement trends in the research literature of information science. 57(3):427–433.

Isaac G. Councill, C. Lee Giles, Hui Han, and Eren Manavoglu. 2005. Automatic acknowledgement indexing: Expanding the semantics of contribution in the CiteSeer digital library. In Proceedings of the 3rd International Conference on Knowledge Capture, K-CAP '05, pages 19–26, New York, NY, USA. ACM.

Blaise Cronin, Gail McKenzie, Lourdes Rubio, and Sherrill Weaver-Wozniak. 1993. Accounting for influence: Acknowledgments in contemporary sociology. Journal of the American Society for Information Science, 44(7):406–412.

Blaise Cronin, Gail McKenzie, and Michael Stiffler. 1992. Patterns of acknowledgement. Journal of Documentation.

S. Dai, Y. Ding, Z. Zhang, W. Zuo, X. Huang, and S. Zhu. 2019. GrantExtractor: Accurate grant support information extraction from biomedical fulltext based on Bi-LSTM-CRF. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pages 1–1.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1535–1545.

C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, USA, pages 89–98.

C. Lee Giles and Isaac G. Councill. 2004. Who gets acknowledged? Measuring scientific contributions through automatic acknowledgment indexing. Proceedings of the National Academy of Sciences, 101(51):17599–17604.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Subhradeep Kayal, Zubair Afzal, George Tsatsaronis, Marius Doornenbal, Sophia Katrenko, and Michelle Gregory. 2019. A framework to automatically extract funding information from text. In Machine Learning, Optimization, and Data Science, pages 317–328, Cham. Springer International Publishing.

Madian Khabsa, Pucktada Treeratpituk, and C. Lee Giles. 2012. AckSeer: a repository and search engine for automatically extracted acknowledgments from digital libraries. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '12, Washington, DC, USA, June 10-14, 2012, pages 185–194. ACM.

Mario Lipinski, Kevin Yao, Corinna Breitinger, Joeran Beel, and Bela Gipp. 2013. Evaluation of header metadata extraction approaches and tools for scientific PDF documents. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, pages 385–386, New York, NY, USA. ACM.

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S. Weld. 2019. S2ORC: The Semantic Scholar open research corpus.

Patrice Lopez. 2009. GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries, ECDL'09, pages 473–474, Berlin, Heidelberg. Springer-Verlag.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 523–534, Stroudsburg, PA, USA. Association for Computational Linguistics.

Adele Paul-Hus, Adrian A. Diaz-Faes, Maxime Sainte-Marie, Nadine Desrochers, Rodrigo Costas, and Vincent Lariviere. 2017. Beyond funding: Acknowledgement patterns in biomedical, natural and social sciences. PLOS ONE, 12(10):1–14.

Adele Paul-Hus, Philippe Mongeon, Maxime Sainte-Marie, and Vincent Lariviere. 2020. Who are the acknowledgees? An analysis of gender and academic status. Quantitative Science Studies, 0(0):1–17.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. CoRR, abs/2003.07082.

Radim Rehurek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA.

Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. CoRR, abs/1904.05255.

Mayank Singh, Barnopriyo Barua, Priyank Palod, Manvi Garg, Sidhartha Satapathy, Samuel Bushi, Kumar Ayush, Krishna Sai Rohith, Tulasi Gamidi, Pawan Goyal, and Animesh Mukherjee. 2016. OCR++: A robust framework for information extraction from scholarly articles. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 3390–3400. ACL.

E. Smith, B. Williams-Jones, Z. Master, V. Lariviere, C. R. Sugimoto, A. Paul-Hus, M. Shi, and D. B. Resnik. 2019. Misconduct and misbehavior related to authorship disagreements in collaborative science. Sci Eng Ethics.

Min Song, Keun Young Kang, Tatsawan Timakum, and Xinyuan Zhang. 2020. Examining influential factors for acknowledgements classification using supervised learning. PLOS ONE, 15(2):1–21.

H. Stewart, K. Brown, A. M. Dinan, N. Irigoyen, E. J. Snijder, and A. E. Firth. 2018. Transcriptional and translational landscape of equine torovirus. J Virol, 92(17).

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The COVID-19 open research dataset. CoRR, abs/2004.10706.

Jian Wu, Jason Killian, Huaiyu Yang, Kyle Williams, Sagnik Ray Choudhury, Suppawong Tuarob, Cornelia Caragea, and C. Lee Giles. 2015. PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 131–138, New York, NY, USA. ACM.
Figure 1: Upper: acknowledgement statements appear in an isolated section (Stewart et al., 2018). Bottom: acknowledgement statements appear in a footnote (Andreoni and Gee, 2015).
as the Natural Language Toolkit (NLTK) (Bird, 2006), e.g., Khabsa et al. (2012) and Paul-Hus et al. (2020), followed by simple semi-manual data cleansing, resulting in a fraction of entities that are mentioned but not actually acknowledged. In this paper, we design an automatic AER system called ACKEXTRACT that further classifies extraction results from open-source NER packages that recognize people and organizations, and distinguishes entities that are actually acknowledged. The extractor finds acknowledgement statements in isolated sections and in other locations such as footnotes, which is common for papers in social and behavioral sciences. Our contributions are:
1. Develop ACKEXTRACT as open-source software to automatically extract acknowledgement entities from research papers.

2. Apply ACKEXTRACT on the CORD-19 dataset and supplement the dataset with a corpus of classified acknowledgement entities for further studies.

3. Use the CORD-19 dataset as a case study to demonstrate that acknowledgement studies without classifying named entities can significantly overestimate the number of entities that are actually acknowledged, because many people and organizations are mentioned but not explicitly acknowledged.
2 Related Works
Early work on acknowledgement extraction was manual, which was labor-intensive. Cronin et al. (1993) extracted a total of 9,561 peer interactive communication (PIC) names from a total of 4,200 research sociology articles; most were persons' names. They also defined the following six categories of acknowledgement: moral support, financial support, editorial support, presentational support, instrumental/technical support, and conceptual support or PIC (Cronin et al., 1992).
Councill et al. (2005) used a hybrid method for automatic AER from research papers and automatically created an acknowledgement index (Giles and Councill, 2004). The algorithm first used a heuristic method to identify acknowledgement passages. It then used an SVM model to identify lines containing acknowledgement sentences outside labeled acknowledgement sections. A regular expression was used to extract entity names from acknowledging text. This method achieved an overall precision of about 0.785 and a recall of 0.896 on CiteSeer papers (Giles et al., 1998). The algorithm does not distinguish entity types.
Khabsa et al. (2012) leveraged OpenCalais1 and AlchemyAPI2, free services at that time, to extract named entities from acknowledgement sections and built ACKSEER, a search engine for acknowledgement entities. They merged the outputs of both NER APIs into a list and disambiguated entity mentions using the longest common subsequence (LCS) algorithm. The ground truth contains 200 top-cited CiteSeerX papers, of which 130 had acknowledgement sections. They achieved 92.3% precision and 91.6% recall for acknowledgement section extraction, but did not evaluate entity extraction.
Recent studies of acknowledgements tend to use results from off-the-shelf NER packages with simple filters, assuming that named entities are acknowledged entities. For example, Paul-Hus et al. (2020) use the Stanford NER module in NLTK to extract persons; Song et al. (2020) also directly use people and organizations recognized by Stanford CoreNLP (Manning et al., 2014). These works achieved a high recall by recognizing most named entities in the acknowledgements, but ignored their relations to the papers where they appear, resulting in a fraction of entities that are mentioned but not actually acknowledged. Song et al. (2020) consider grammar structure, such as verb tense and voice, and sentence patterns when labeling sentences with their six categories; for example, "was funded" is followed by an "organization". However, they only label sentences and do not annotate them down to
1 https://web.archive.org/web/20081023021111/http://opencalais.com/calaisAPI
2 https://web.archive.org/web/20090921044923/http://www.alchemyapi.com
Figure 2: Architecture of ACKEXTRACT.
the entity level. Our system examines the relationship between entities and the current work with the purpose of discriminating acknowledgement entities from named entities. In this system, we focus on people and organizations.
Recently, Dai et al. (2019) proposed GrantExtractor, a pipeline system to extract grant support information from scientific papers. A model combining BiLSTM-CRF and pattern matching was used to extract grant numbers and agencies from funding sentences, which are identified using heuristic methods. The system achieves a micro-F1 of up to 0.90 in extracting grant pairs (agency, number).
Kayal et al. (2019) proposed an ensemble approach called FUNDINGFINDER for extracting funding information from text. The authors construct feature vectors for candidate entities based on whether the entities are recognized by four NER implementations: Stanford (Conditional Random Field model), LingPipe (Hidden Markov model), OpenNLP (Maximum Entropy model), and Elsevier's Fingerprint Engine. The F1-measure for funding body is only 0.68 ± 0.3.
Our method differs from existing methods in three ways: (1) it is built on top of state-of-the-art neural NER methods, which results in a relatively high recall; (2) it uses a heuristic method to filter out entities that are just mentioned but not acknowledged; (3) it extracts both organizations and people.
3 Dataset
On March 16th, 2020, the Allen Institute for Artificial Intelligence released the first version of the COVID-19 Open Research Dataset (CORD-19) (Wang et al., 2020) in collaboration with several other institutions. The dataset contains metadata and segmented full text of research articles selected by searching a list of keywords about coronavirus, SARS-CoV, MERS, and other related terms from four digital libraries, including WHO, PubMed Central, BioRxiv, and MedRxiv. The initial dataset contained about 28k papers and was updated weekly with papers from new sources and the latest publications. We used the dataset released on April 10, 2020, containing over 59,312 metadata records, among which 54,756 have the full text in JSON format. The CORD-19 papers were generated by processing PDFs using the S2ORC pipeline (Lo et al., 2019), in which GROBID (Lopez, 2009) was employed for document segmentation and metadata extraction.
The full text in JSON files is directly used for sentence segmentation. However, we observe that GROBID extraction results are not perfect. In particular, we estimate the fraction of acknowledgement sections omitted, as well as the number of acknowledgement entities omitted in the data release, due to GROBID extraction errors. To do this, we downloaded 45,916 full-text PDF papers collected by the Internet Archive (IA), because the CORD-19 dataset does not include PDF files.3 We found 13,103 CORD-19 papers in the IA dataset.
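As a minimal sketch of consuming the full-text JSON, the snippet below pulls acknowledgement paragraphs out of a CORD-19-style record. The field names ("body_text", "section", "text") follow the public CORD-19 schema but should be checked against the specific release; the helper name is ours.

```python
import json

def ack_paragraphs(record):
    """Return body-text paragraphs whose section header mentions
    acknowledgements (sketch; field names assume the CORD-19 schema)."""
    return [p["text"] for p in record.get("body_text", [])
            if "acknowledg" in p.get("section", "").lower()]

# Toy record mimicking the CORD-19 full-text JSON layout.
record = json.loads('''{"body_text": [
  {"section": "Acknowledgements", "text": "We thank A. B. for comments."},
  {"section": "Methods", "text": "We used PCR."}]}''')
```

In the real pipeline this filtering is only a first pass; statements outside labeled sections (e.g., footnotes) require the heuristics described in Section 4.2.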
4 Acknowledgement Extraction
4.1 Overview

The architecture of our acknowledgement entity extraction system is depicted in Figure 2. The system can be divided into the following modules:
1. Document segmentation: CORD-19 provides full text as JSON files, but in general most research articles are published in PDF, so our first step is converting a PDF document to text and segmenting it into sections. We use GROBID, which has shown superior performance over many other document extraction methods (Lipinski et al., 2013). The output is an XML file in the TEI schema.
2. Sentence segmentation: Paragraphs are segmented into sentences. We compare several sentence segmentation software packages and choose Stanza (Qi et al., 2020) because of its relatively high accuracy.
3. Sentence classification: Sentences are classified into acknowledgement and non-acknowledgement statements. The result is a set of acknowledgement statements inside or outside the acknowledgement sections.
3 https://archive.org/download/covid19_fatcat_20200410
4. Entity recognition: Named entities are extracted from acknowledgement statements. We compare four commonly used NER software packages and choose Stanza because of its relatively high performance. In this work, we focus on person and organization entities.
5. Entity classification: In this module, we classify named entities by analyzing sentence structures, aiming to discriminate named entities that are actually acknowledged from those that are merely mentioned. We demonstrate that triple extraction packages such as REVERB and OLLIE fail to handle acknowledgement statements with multiple entities in objects in our dataset. The results are acknowledgement entities, including people and organizations.
4.2 Document Segmentation

The majority of scholarly papers are published in PDF format, which is not readily readable by text processors. Several attempts have been made to convert PDF into text (Bast and Korzen, 2017) and segment the document into section and sub-section levels. GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents into TEI-encoded documents. Other similar methods have been developed recently, such as OCR++ (Singh et al., 2016) and Science Parse.4 Lipinski et al. (2013) compared 7 metadata extraction methods and found that GROBID (version 0.4.0) achieved superior performance over the others. GROBID trained a cascade of conditional random field (CRF) models on PubMed and computer science papers. The recent version (0.6.0) has a set of powerful functionalities, such as extracting and parsing headers and segmenting full text. GROBID supports a batch mode and an API service mode; the latter enables large-scale document processing on multi-core servers, such as in CiteSeerX (Wu et al., 2015). A benchmarking result for version 0.6.0 shows that section title parsing achieves an F1 = 0.70 under the strict matching criteria and F1 = 0.75 under the soft matching criteria.5 Singh et al. (2016) claim OCR++ achieves better performance than GROBID in several fields, evaluated on computer science papers. However, the lack of a service-mode API and multi-domain adaptability limits its usability. Science Parse only extracts key metadata such as
4 https://github.com/allenai/spv2
5 https://grobid.readthedocs.io/en/latest/Benchmarking-pmc
title, authors, year, and venue. Therefore, we adopt GROBID to convert PDF documents into XML files.
Depending on the structure and provenance of PDFs, GROBID may miss acknowledgements in certain papers. To estimate the fraction of papers in which acknowledgements were missed by GROBID, we visually inspected a random sample of 200 papers from the CORD-19 dataset and found that only 146 papers (73%) contain acknowledgement statements, out of which GROBID successfully extracted all acknowledgement statements from 120 papers (82%). Of the remaining 26 papers that GROBID failed to parse, 17 have acknowledgements in sections and 9 in footnotes. We developed a heuristic method that can extract acknowledgement statements from all 120 papers with acknowledgement statements output by GROBID.
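One way to realize such a heuristic is a pass over the extracted section headers (and, failing that, the leading text of footnotes). This is a sketch under our own assumptions, not the exact pattern set used by ACKEXTRACT:

```python
import re

# Illustrative header patterns; the actual ACKEXTRACT list may be larger.
ACK_HEADER = re.compile(r"^\s*(acknowledge?ments?|funding)\b", re.IGNORECASE)

def find_ack_passages(blocks):
    """blocks: (header, body) pairs from the TEI output. Footnotes can be
    passed with an empty header and are matched on the body text instead."""
    hits = []
    for header, body in blocks:
        if ACK_HEADER.match(header) or (not header and ACK_HEADER.match(body)):
            hits.append(body)
    return hits
```

The regex covers both spellings ("acknowledgment" and "acknowledgement", singular or plural), which is why the optional "e?" appears mid-pattern.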
4.3 Sentence Segmentation
The acknowledgement sections or statements extracted above are paragraphs, which need to be segmented (or tokenized) into sentences. We compared four software packages for sentence segmentation: NLTK (Bird, 2006), Stanza (Qi et al., 2020), Gensim (Rehurek and Sojka, 2010), and the Pragmatic Segmenter.6
NLTK includes a sentence tokenization method, sent_tokenize(), which uses an unsupervised algorithm to build a model for abbreviated words, collocations, and words that start sentences, and then uses that model to find sentence boundaries. Stanza is a Python natural language analysis package developed by the Stanford NLP group; it models sentence segmentation as a tagging problem over character sequences, where the neural model predicts whether a given character is the end of a sentence. The split_sentences() function in the Gensim package splits a text and returns a list of sentences using unsupervised pattern recognition. The Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
To compare the above methods, we created a ground truth corpus by randomly selecting acknowledgment sections or statements from 47 papers and manually segmenting them, resulting in 100 sentences. Table 1 shows the comparison results for the four methods. The precision is calculated
6 https://github.com/diasks2/pragmatic_segmenter
Method      Precision  Recall  F1
Gensim      0.65       0.64    0.65
NLTK        0.72       0.69    0.70
Pragmatic   0.86       0.76    0.81
Stanza      0.81       0.88    0.84

Table 1: Sentence segmentation performance.
by dividing the number of correctly segmented sentences by the total number of sentences segmented. The recall is calculated by dividing the number of correctly segmented sentences by the total number of manually segmented sentences. Stanza outperforms the other three, achieving an F1 = 0.84.
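These definitions amount to set-overlap precision and recall, where a predicted sentence counts as correct only if it exactly matches a manually segmented one. A small helper (hypothetical, not from the released code) makes the computation explicit:

```python
def seg_scores(predicted, gold):
    """Precision, recall, and F1 for sentence segmentation."""
    correct = len(set(predicted) & set(gold))
    p = correct / len(predicted)   # correct / total predicted
    r = correct / len(gold)        # correct / total manually segmented
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, if a segmenter merges two gold sentences into one, that merged string matches nothing in the gold set, lowering both precision and recall.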
4.4 Sentence Classification
Not all sentences in acknowledgement sections express acknowledgement, such as the following sentence: "The funders played no role in the study or preparation of the manuscript" (Song et al., 2020). In this module, we classify sentences into acknowledgement and non-acknowledgement statements. We developed a set of regular expressions that match verbs (e.g., thank, gratitude to, indebted to), adjectives (e.g., grateful to), and nouns (e.g., helpful comments, useful feedback) to cover as many cases as possible. To evaluate this method, we manually selected 100 sentences, including 50 positive and 50 negative samples, from the sentences obtained in Section 4.3. Our results show that 96 out of 100 sentences were classified correctly, resulting in an accuracy of 0.96.
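A minimal sketch of such a classifier follows; the cue list is illustrative (the paper's actual regular expressions are more extensive) and the function name is ours:

```python
import re

# Illustrative cues: verbs, adjectives, and noun phrases signaling gratitude.
ACK_CUES = re.compile(
    r"\b(thank(s|ed)?|gratitude to|indebted to|grateful to|acknowledges?"
    r"|helpful comments|useful feedback|(is|was|were) (supported|funded))\b",
    re.IGNORECASE,
)

def is_ack_sentence(sentence: str) -> bool:
    """True if the sentence contains at least one acknowledgement cue."""
    return bool(ACK_CUES.search(sentence))
```

Note that the negative example above ("The funders played no role ...") contains no cue phrase and is correctly rejected, even though it appears inside an acknowledgement section.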
4.5 Entity Recognition
In this step, named entities are extracted using state-of-the-art NER software packages, including NLTK (Bird, 2006), Stanford CoreNLP (Manning et al., 2014), spaCy (Honnibal and Montani, 2017), and Stanza (Qi et al., 2020). Stanza is a Python library offering fully neural pre-trained models that provided state-of-the-art performance on many raw text processing tasks when it was released. The NER model adopted the contextualized sequence tagger in Akbik et al. (2018); the architecture includes a Bi-LSTM character-level language model followed by a one-layer Bi-LSTM sequence tagger with a conditional random field (CRF) encoder. Although Stanza was developed based on Stanford CoreNLP, the two exhibit different performance in our NER task.
The ground truth is built by randomly selecting
Method            Entity  Precision  Recall  F1
NLTK              Person  0.45       0.68    0.55
NLTK              Org     0.59       0.77    0.67
spaCy             Person  0.74       0.88    0.80
spaCy             Org     0.63       0.74    0.68
Stanford CoreNLP  Person  0.88       0.87    0.87
Stanford CoreNLP  Org     0.68       0.80    0.73
Stanza            Person  0.89       0.93    0.91
Stanza            Org     0.60       0.89    0.72

Table 2: Comparison of NER software packages.
100 acknowledgement paragraphs from sections, footnotes, and body text, and manually annotating person and organization entities without discriminating whether they are acknowledged or not. This results in 146 person and 209 organization entities. The comparison results indicate that, overall, Stanza outperforms the other three, achieving F1 = 0.91 for person and F1 = 0.72 for organization. In particular, the recall of Stanza is 9% higher than that of Stanford CoreNLP (Table 2).
4.6 Entity Classification
As we showed, not all named entities are acknowledged, such as the "Utrecht University" in Figure 1. Therefore, it is necessary to build a classifier to discriminate acknowledgement entities (entities that are thanked by the paper or the authors) from named entities.
The majority of acknowledgement statements in academic articles have a relatively standard subject-predicate-object (SPO) structure. They use a limited number of words or phrases, such as "thank", "acknowledge", "are grateful", "is supported", and "is funded", as predicates. However, the object can contain multiple named entities, some of which are used as attributes of the others. In rare cases, certain sentences may not have subjects and predicates, such as the first sentence in Figure 4.
Our approach can be divided into three steps. Two representative examples are illustrated in Figure 3, and the pseudo-code is shown in Algorithm 1.
Step 1: We resolve the type of voice (active or passive), the subject, and the predicate of a sentence using dependency parsing by Stanza, because named entities can appear in the subjective or objective part. We then locate all named entities. The semantic meaning of a predicate and its type of voice can be used to determine whether the acknowledged entities are in the objective or the subjective part; in most cases, the target entities are objects.
Step 2: A sentence with multiple entities in the objective part is split into shorter sentences, called "subsentences", so that each subsentence is associated with at most one named entity. This is done by first splitting the sentence by "and". For each subsentence, if the subject and predicate are missing, we fill them in using the subject and predicate of the original sentence. The object in each subsentence does not necessarily contain an entity; for example, in the right panel of Figure 3, because "expertise" is not a named entity, it is replaced by "none" in the subsentence. The SPO relations that do not contain named entities are removed.
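A simplified sketch of this splitting, using plain string operations in place of Stanza's dependency parse (the function name and the string-based subject check are our simplifications):

```python
def split_subsentences(sentence, subject, predicate):
    """Step-2 sketch: split on 'and' and re-attach the original subject
    and predicate to fragments that lack their own."""
    parts = [p.strip(" ,") for p in sentence.split(" and ")]
    subsents = []
    for p in parts:
        # Fragments without the subject inherit subject + predicate.
        if not p.lower().startswith(subject.lower()):
            p = f"{subject} {predicate} {p}"
        subsents.append(p)
    return subsents
```

Each resulting subsentence then carries at most one candidate entity, so the SPO relation check from Step 1 can be applied to it independently.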
There are two scenarios. In the first, a sentence or a subsentence may contain a list of entities with only the first being acknowledged. In the third example of Figure 4, the named entities are ['Qi Yang', 'Morehouse School of Medicine', 'Atlanta', 'GA'], but only the first entity is acknowledged; the remaining entities, which are recognized as organizations or locations, are used to supplement more information. In this scenario, only the first named entity is extracted.
Step 3: In the second scenario, acknowledged entities are connected by commas or "and", such as in "We thank Shirley Hauta, Yurij Popowych, Elaine van Moorlehem and Yan Zhou for help in the virus production." In this scenario, entities in this structure have the same type, indicating that they play similar roles. This parallel pattern can be captured by regular expressions. The entities resolved in Steps 2 and 3 are merged to form the final set.
The method to find the parallel structure is as follows. First, check whether each entity's type is person. If so, the entities are substituted with integer indexes; the sentence becomes "The authors would like to thank 0, 1 and 2 and the kind support of Bayer Animal Health GmbH and Virbac Group." If there are 3 or more consecutive numbers in this form, this is the parallel pattern, which is captured by regular expressions; the pattern also allows text between names (Figure 5). Next, the numbers in this part are extracted and mapped to the corresponding entities. In the example above, the numbers [0, 1, 2] correspond to the indexes of the entities [Norbert Mencke, Lourdes Mottier, David McGahie]. Similar pattern recognition is performed for organizations. The process is depicted in Figure 5.
Figure 3: Process mapping multiple subject-predicate-object relations to acknowledgement entities, with two representative examples: "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab for technical assistance with the assays" and "We would also like to thank the Canadian Wildlife Health Cooperative at the University of Saskatchewan for their support and expertise". "None" means the object does not contain named entities.
Method               Precision  Recall  F1
Stanza               0.57       0.91    0.70
RB (without step 3)  0.94       0.78    0.85
RB                   0.94       0.90    0.92

Table 3: Performance of Stanza and our relation-based (RB) classifier. Precision and recall show overall results combining person and organization.
We investigated open information extraction methods such as REVERB (Fader et al., 2011) and OLLIE (Mausam et al., 2012). We found that they only work for sentences with relatively simple structures, such as "The authors wish to thank Ming-Wei Guo for reagents and technical assistance", but fail on more complicated sentences with a long objective part or parallel subsentences (Figure 4). We also investigated the semantic role labeling (SRL) library in the AllenNLP toolkit (Shi and Lin, 2019). For the third sentence in Figure 4, the AllenNLP SRL library resolves the entire string after "thank" as an argument but fails to distinguish entity types. Our method can handle all statements in Figure 4.
As a baseline, we use Stanza to extract all named entities and compare its performance with our relation-based (RB) classifier. The ground truth is compiled using the same corpus described in Section 4.5, except that only acknowledgement entities (as opposed to all named entities) are labeled positive. The results (Table 3) show that Stanza achieves high recall but poor precision, indicating that a significant fraction (about 40%) of named entities are not acknowledged. In contrast, our classifier (RB) achieves a precision of 0.94 with a small loss of recall, achieving an F1 = 0.92.
One limitation of the RB classifier is that it relies on text quality. Sentences are expected to follow
Figure 4: Sentence examples from which REVERB or OLLIE failed to extract acknowledgement entities, which can be identified by our method: (1) "To Aulio Costa Zambenedetti for the art work; Nilson Fidêncio, Silvio Marques, Tania Schepainski and Sibelli Tanjoni de Souza for technical assistance." (2) "The authors also thank the Program for Technological Development in Tools for Health-PDTIS FIOCRUZ for the use of its facilities (Platform RPT09H - Real Time PCR - Instituto Carlos Chagas/Fiocruz-PR)." (3) "The authors wish to thank Ms. Qi Yang, DNA sequencing lab, Morehouse School of Medicine, Atlanta, GA, and the Research Cores at Morehouse School of Medicine supported through the Research Centers at Minority Institutions (RCMI) Program NIH/NCRR/RCMI Grant G12-RR03034."
Figure 5: Merging the results given by Step 2 and Step 3 to obtain the final entity set, for the sentence "The authors would like to thank Norbert Mencke, Lourdes Mottier and David McGahie and the kind support of Bayer Animal Health GmbH and Virbac Group". Step 2 yields [Norbert Mencke, David McGahie, Bayer Animal Health, Virbac Group]; Step 3 yields [Norbert Mencke, Lourdes Mottier, David McGahie]; the merged result is [Norbert Mencke, David McGahie, Bayer Animal Health, Virbac Group, Lourdes Mottier].
the editorial convention, such that entity names are clearly segmented by periods. If two sentences are not properly delimited, e.g., missing a period, the classifier may make incorrect predictions.
5 Data Analysis
The CORD-19 dataset contains PMC and PDF folders. We found that almost all papers in the PMC folders are included in the PDF folders. For consistency, we only work on papers in the PDF folders, containing 37,612 unique PDF papers.7 Using ACKEXTRACT, we extracted a total of 102,965 named entities from 21,469 CORD-19 papers. The fraction of papers with named entities (21,469/37,612 ≈ 57%) is roughly consistent with the fraction obtained from the 200 samples (120/200 ≈ 60%, Section 4.2). Among all named entities, 58,742 are acknowledgement entities. These numbers suggest that, using our model, only about 50% of named entities are acknowledged. Using only named entities, acknowledgement studies could significantly overestimate the
7We excluded 2003 duplicate papers with exactly the sameSHA1 values
Algorithm 1 Relation-Based Classifier1 Function find entity()2 pre-process with punctuation and format3 find candidate entities entity list by Stanza4 for x isin entity list do5 if x isin
subject part (differs based on predicate)then
6 x = none
7 entity listremove(none)8 return entity list
9 Input sentence10 Output entity list all11 find subject part predicate and the object part12 if ldquopredicate valuerdquo isin reg expression 1 then13 if ldquoandrdquo isin sentence then14 split into subsentences by lsquoandrsquo15 for each subsentence do16 entity list
larr find entity(subsentence)17 if entity list 6= none then18 entity list all
larr appendentity list[0]
19 else if ldquoandrdquo isin sentence then20 entity listlarr find entity(subsentence)21 entity list alllarr entity list[0]
22 else if ldquopredicate valuerdquo isin reg expression 2 then23 do the same in if part but focus on entities in
subject24 for x isin candidate list by Stanza do25 if x isin sentenceampxtype = type then26 orig listlarr append x27 sentlarr replace x by index(x)
28 if reg format isin sent then29 templarr find reg expression in sent30 numlistlarr find numbers in temp31 listlarr index orig list by numlist
32 combine list and entity list all
number of acknowledgement entities by relyingon entities recognized by NER software packageswithout further classification Here we analyze ourresults and study some trends of acknowledgemententities in the CORD-19 dataset
The top 10 acknowledged organizations (Table 4) are all funding agencies. Overall, NIH (NIAID is an institute of NIH) is acknowledged the most, but funding agencies in other countries are also frequently acknowledged in CORD-19 papers.

Organization                                                  Count
National Institutes of Health (NIH)                            1414
National Natural Science Foundation of China (NSFC)             615
National Institute of Allergy & Infectious Diseases (NIAID)     209
National Science Foundation (NSF)                               183
Ministry of Health                                              160
Wellcome Trust                                                  146
Deutsche Forschungsgemeinschaft                                 119
Ministry of Education                                           119
National Science Council                                        104
Public Health Service                                            85

Table 4: Top 10 acknowledged organizations identified in our CORD-19 dataset.

Figure 6 shows the numbers of CORD-19 papers with vs. without acknowledgement (person or organization) recognized from 1970 to 2020. The large jump around 2002 was due to the wave of coronavirus studies during the SARS period. The small drop in 2020 was due to data incompleteness. The plot indicates that the fraction of papers with acknowledgement has been increasing gradually over the last 20 years.

Figure 6: Numbers of CORD-19 papers with vs. without acknowledgement recognized, from 1970 to 2020.

Figure 7 shows the numbers of acknowledgements to the top 10 acknowledged organizations from 1983 to 2020. The figure indicates that the number of acknowledgements to NIH has been gradually decreasing over the past 10 years, while the number for NSF is roughly constant. In contrast, the number for NSFC (a Chinese funding agency) has been gradually increasing. Note that the distribution of acknowledged organizations has a long tail, and organizations outside the top 10 actually dominate the total count. However, the top 10 organizations are the biggest research agencies, and the trend to some extent reflects strategic shifts in funding support.
6 Conclusion and Future Work
Here we extended the work of Khabsa et al. (2012) and built an acknowledgement extraction framework, denoted ACKEXTRACT, for research articles. ACKEXTRACT is based on heuristic methods and state-of-the-art text mining libraries (e.g., GROBID and Stanza), but features a classifier that discriminates acknowledgement entities from named entities by analyzing the multiple subject-predicate-object relations in a sentence. Our approach successfully recognizes acknowledgement entities that cannot be recognized by OIE packages such as REVERB, OLLIE, and the AllenNLP SRL library.

Figure 7: Trend of acknowledgements to the top 10 acknowledged organizations, from 1983 to 2020.
This method is applied to the CORD-19 dataset released on April 10, 2020, processing one PDF document in 5 seconds on average. Our results indicate that only 50–60% of named entities are acknowledged. The rest are mentioned to provide additional information (e.g., affiliation or location) about acknowledgement entities. Working on clean data, our method achieves an overall F1 = 0.92 for person and organization entities. The trend analysis of the CORD-19 papers verifies that more and more papers have included acknowledgement entities since 2002, when the SARS outbreak happened. The trend also reveals that the overall number of acknowledgements to NIH has been gradually decreasing over the past 10 years, while more papers acknowledge NSFC, a Chinese funding agency. One caveat of our method is that organizations in different countries are not distinguished; for example, many countries have agencies called "Ministry of Health". In the future, we plan to build learning-based models for sentence classification and entity classification. The code and data of this project have been released on GitHub at https://github.com/lamps-lab/ackextract
Acknowledgments
This work was partially supported by the Defense Advanced Research Projects Agency (DARPA) under cooperative agreement No. W911NF-19-2-0272. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

James Andreoni and Laura K. Gee. 2015. Gunning for efficiency with third party enforcement in threshold public goods. Experimental Economics, 18(1):154–171.

Hannah Bast and Claudius Korzen. 2017. A benchmark and evaluation for text extraction from PDF. In 2017 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19-23, 2017, pages 99–108. IEEE Computer Society.

Steven Bird. 2006. NLTK: the natural language toolkit. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006. The Association for Computer Linguistics.

Cronin Blaise. 2001. Acknowledgement trends in the research literature of information science. 57(3):427–433.

Isaac G. Councill, C. Lee Giles, Hui Han, and Eren Manavoglu. 2005. Automatic acknowledgement indexing: Expanding the semantics of contribution in the CiteSeer digital library. In Proceedings of the 3rd International Conference on Knowledge Capture, K-CAP '05, pages 19–26, New York, NY, USA. ACM.

Blaise Cronin, Gail McKenzie, Lourdes Rubio, and Sherrill Weaver-Wozniak. 1993. Accounting for influence: Acknowledgments in contemporary sociology. Journal of the American Society for Information Science, 44(7):406–412.

Blaise Cronin, Gail McKenzie, and Michael Stiffler. 1992. Patterns of acknowledgement. Journal of Documentation.

S. Dai, Y. Ding, Z. Zhang, W. Zuo, X. Huang, and S. Zhu. 2019. GrantExtractor: Accurate grant support information extraction from biomedical fulltext based on Bi-LSTM-CRF. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pages 1–1.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, pages 1535–1545.

C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, USA, pages 89–98.

C. Lee Giles and Isaac G. Councill. 2004. Who gets acknowledged: Measuring scientific contributions through automatic acknowledgment indexing. Proceedings of the National Academy of Sciences, 101(51):17599–17604.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Subhradeep Kayal, Zubair Afzal, George Tsatsaronis, Marius Doornenbal, Sophia Katrenko, and Michelle Gregory. 2019. A framework to automatically extract funding information from text. In Machine Learning, Optimization, and Data Science, pages 317–328, Cham. Springer International Publishing.

Madian Khabsa, Pucktada Treeratpituk, and C. Lee Giles. 2012. AckSeer: a repository and search engine for automatically extracted acknowledgments from digital libraries. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '12, Washington, DC, USA, June 10-14, 2012, pages 185–194. ACM.

Mario Lipinski, Kevin Yao, Corinna Breitinger, Joeran Beel, and Bela Gipp. 2013. Evaluation of header metadata extraction approaches and tools for scientific PDF documents. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, pages 385–386, New York, NY, USA. ACM.

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S. Weld. 2019. S2ORC: The Semantic Scholar open research corpus.

Patrice Lopez. 2009. GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries, ECDL'09, pages 473–474, Berlin, Heidelberg. Springer-Verlag.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 523–534, Stroudsburg, PA, USA. Association for Computational Linguistics.

Adèle Paul-Hus, Adrián A. Díaz-Faes, Maxime Sainte-Marie, Nadine Desrochers, Rodrigo Costas, and Vincent Larivière. 2017. Beyond funding: Acknowledgement patterns in biomedical, natural and social sciences. PLOS ONE, 12(10):1–14.

Adèle Paul-Hus, Philippe Mongeon, Maxime Sainte-Marie, and Vincent Larivière. 2020. Who are the acknowledgees? An analysis of gender and academic status. Quantitative Science Studies, 0(0):1–17.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. CoRR, abs/2003.07082.

Radim Rehurek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA.

Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. CoRR, abs/1904.05255.

Mayank Singh, Barnopriyo Barua, Priyank Palod, Manvi Garg, Sidhartha Satapathy, Samuel Bushi, Kumar Ayush, Krishna Sai Rohith, Tulasi Gamidi, Pawan Goyal, and Animesh Mukherjee. 2016. OCR++: A robust framework for information extraction from scholarly articles. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 3390–3400. ACL.

E. Smith, B. Williams-Jones, Z. Master, V. Larivière, C. R. Sugimoto, A. Paul-Hus, M. Shi, and D. B. Resnik. 2019. Misconduct and misbehavior related to authorship disagreements in collaborative science. Sci Eng Ethics.

Min Song, Keun Young Kang, Tatsawan Timakum, and Xinyuan Zhang. 2020. Examining influential factors for acknowledgements classification using supervised learning. PLOS ONE, 15(2):1–21.

H. Stewart, K. Brown, A. M. Dinan, N. Irigoyen, E. J. Snijder, and A. E. Firth. 2018. Transcriptional and translational landscape of equine torovirus. J Virol, 92(17).

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The COVID-19 open research dataset. CoRR, abs/2004.10706.

Jian Wu, Jason Killian, Huaiyu Yang, Kyle Williams, Sagnik Ray Choudhury, Suppawong Tuarob, Cornelia Caragea, and C. Lee Giles. 2015. PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 131–138, New York, NY, USA. ACM.
Figure 2: Architecture of ACKEXTRACT.
the entity level. Our system examines the relationship between entities and the current work, with the purpose of discriminating acknowledgement entities from named entities. In this system, we focus on people and organizations.
Recently, Dai et al. (2019) proposed GrantExtractor, a pipeline system to extract grant support information from scientific papers. A model combining BiLSTM-CRF and pattern matching was used to extract grant number and agency entities from funding sentences, which are identified using heuristic methods. The system achieves a micro-F1 of up to 0.90 in extracting grant pairs (agency, number).
Kayal et al. (2019) proposed an ensemble approach called FUNDINGFINDER for extracting funding information from text. The authors construct feature vectors for candidate entities based on whether the entities are recognized by four NER implementations: Stanford (conditional random field model), LingPipe (hidden Markov model), OpenNLP (maximum entropy model), and Elsevier's Fingerprint Engine. The F1-measure for funding body is only 0.68 ± 0.03.
Our method differs from existing methods in three ways: (1) it is built on top of state-of-the-art neural NER methods, which results in a relatively high recall; (2) it uses a heuristic method to filter out entities that are merely mentioned but not acknowledged; (3) it extracts both organizations and people.
3 Dataset
On March 16th, 2020, the Allen Institute of Artificial Intelligence released the first version of the COVID-19 Open Research Dataset (CORD-19) (Wang et al., 2020) in collaboration with several other institutions. The dataset contains metadata and segmented full text of research articles selected by searching a list of keywords about coronavirus, SARS-CoV, MERS, and other related terms from four digital libraries, including WHO, PubMed Central, BioRxiv, and MedRxiv. The initial dataset contained about 28k papers and was updated weekly with papers from new sources and the latest publications. We used the dataset released on April 10, 2020, containing over 59,312 metadata records, among which 54,756 have the full text in JSON format. The CORD-19 papers were generated by processing PDFs using the S2ORC pipeline (Lo et al., 2019), in which GROBID (Lopez, 2009) was employed for document segmentation and metadata extraction.

The full text in JSON files is directly used for sentence segmentation. However, we observe that GROBID extraction results are not perfect. In particular, we estimate the fraction of acknowledgement sections omitted. We also estimate the number of acknowledgement entities omitted in the data release due to GROBID extraction errors. To do this, we downloaded 45,916 full-text PDF papers collected by the Internet Archive (IA), because the CORD-19 dataset does not include PDF files.3 We found 13,103 CORD-19 papers in the IA dataset.
4 Acknowledgement Extraction
4.1 Overview

The architecture of our acknowledgement entity extraction system is depicted in Figure 2. The system can be divided into the following modules:
1. Document segmentation: CORD-19 provides full text as JSON files, but in general most research articles are published in PDF, so our first step is converting a PDF document to text and segmenting it into sections. We use GROBID, which has shown superior performance over many other document extraction methods (Lipinski et al., 2013). The output is an XML file in the TEI schema.

2. Sentence segmentation: Paragraphs are segmented into sentences. We compare several sentence segmentation software packages and choose Stanza (Qi et al., 2020) because of its relatively high accuracy.

3. Sentence classification: Sentences are classified into acknowledgement and non-acknowledgement statements. The result is a set of acknowledgement statements inside or outside the acknowledgement sections.
3 https://archive.org/download/covid19_fatcat_20200410
4. Entity recognition: Named entities are extracted from acknowledgement statements. We compare four commonly used NER software packages and choose Stanza because of its relatively high performance. In this work, we focus on person and organization entities.

5. Entity classification: In this module, we classify named entities by analyzing sentence structures, aiming at discriminating named entities that are actually acknowledged from those that are just mentioned. We demonstrate that triple extraction packages such as REVERB and OLLIE fail to handle acknowledgement statements with multiple entities in objects in our dataset. The results are acknowledgement entities, including people and organizations.
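Under the assumption that each module exposes a simple function, the data flow through the five modules can be sketched as follows. All names and the toy stand-in implementations below are ours for illustration, not ACKEXTRACT's actual API:

```python
import re

# Toy stand-ins for GROBID / Stanza / the RB classifier, to show the data flow.
def segment_document(pdf_text):           # module 1: PDF/JSON -> sections
    return {"acknowledgement": pdf_text}

def segment_sentences(paragraph):         # module 2: paragraph -> sentences
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def is_acknowledgement(sentence):         # module 3: keep ack statements
    return bool(re.search(r"\b(thank|grateful|acknowledge)", sentence, re.I))

def named_entities(sentence):             # module 4: NER (here: a toy lookup)
    return [n for n in ("Jane Doe", "Utrecht University") if n in sentence]

def acknowledged(sentence, entities):     # module 5: entity classification
    return entities[:1]                   # toy rule: keep only the first entity

def ackextract(pdf_text):
    sections = segment_document(pdf_text)
    results = []
    for sentence in segment_sentences(sections["acknowledgement"]):
        if is_acknowledgement(sentence):
            results.extend(acknowledged(sentence, named_entities(sentence)))
    return results

print(ackextract("We thank Jane Doe of Utrecht University. Funders played no role."))
```

Each stand-in would be replaced by the corresponding tool (GROBID for module 1, Stanza for modules 2 and 4, and the classifiers of Sections 4.4 and 4.6 for modules 3 and 5).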
4.2 Document Segmentation

The majority of scholarly papers are published in PDF format, which is not readily readable by text processors. Several attempts have been made to convert PDF into text (Bast and Korzen, 2017) and segment the document into section and sub-section levels. GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents into TEI-encoded documents. Other similar methods have been developed recently, such as OCR++ (Singh et al., 2016) and Science Parse.4 Lipinski et al. (2013) compared 7 metadata extraction methods and found that GROBID (version 0.4.0) achieved superior performance over the others. GROBID trained a cascade of conditional random field (CRF) models on PubMed and computer science papers. The recent version (0.6.0) has a set of powerful functionalities, such as extracting and parsing headers and segmenting the full text. GROBID supports a batch mode and an API service mode; the latter enables large-scale document processing on multi-core servers, such as in CiteSeerX (Wu et al., 2015). A benchmarking result for version 0.6.0 shows that section title parsing achieves F1 = 0.70 under the strict matching criteria and F1 = 0.75 under the soft matching criteria.5 Singh et al. (2016) claim that OCR++ achieves better performance than GROBID in several fields, evaluated on computer science papers; however, the lack of a service mode, an API, and multi-domain adaptability limits its usability. Science Parse only extracts key metadata, such as title, authors, year, and venue. Therefore, we adopt GROBID to convert PDF documents into XML files.

4 https://github.com/allenai/spv2
5 https://grobid.readthedocs.io/en/latest/Benchmarking-pmc
Depending on the structure and provenance of PDFs, GROBID may miss acknowledgements in certain papers. To estimate the fraction of papers in which acknowledgements were missed by GROBID, we visually inspected a random sample of 200 papers from the CORD-19 dataset and found that only 146 papers (73%) contain acknowledgement statements, out of which GROBID successfully extracted all acknowledgement statements from 120 papers (82%). For the remaining 26 papers that GROBID failed to parse, the acknowledgements of 17 papers are in sections and those of 9 papers are in footnotes. We developed a heuristic method that can extract acknowledgement statements from all 120 papers with acknowledgement statements output by GROBID.
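As an illustration of what this step consumes, the snippet below pulls acknowledgement paragraphs out of a TEI document using only the standard library. It assumes GROBID's convention of a `<div type="acknowledgement">` element; the sample string is a hand-made stand-in, not real GROBID output:

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def acknowledgement_text(tei_xml: str) -> str:
    """Concatenate all paragraphs found inside acknowledgement divs."""
    root = ET.fromstring(tei_xml)
    paragraphs = []
    for div in root.iter(TEI_NS + "div"):
        if div.get("type") == "acknowledgement":
            for p in div.iter(TEI_NS + "p"):
                paragraphs.append("".join(p.itertext()).strip())
    return " ".join(paragraphs)

# Hand-made sample mimicking the TEI layout (illustrative only).
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><back>
    <div type="acknowledgement">
      <p>We thank Jane Doe for technical assistance.</p>
    </div>
  </back></text>
</TEI>"""

print(acknowledgement_text(sample))
```

The heuristic fallback described above would then scan the remaining body and footnote divs for acknowledgement cues when no such div is present.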
4.3 Sentence Segmentation
The acknowledgement sections or statements extracted above are paragraphs, which need to be segmented (or tokenized) into sentences. We compared four software packages for sentence segmentation: NLTK (Bird, 2006), Stanza (Qi et al., 2020), Gensim (Rehurek and Sojka, 2010), and the Pragmatic Segmenter.6

NLTK includes a sentence tokenization method, sent_tokenize(), which uses an unsupervised algorithm to build a model for abbreviated words, collocations, and words that start sentences, and then uses that model to find sentence boundaries. Stanza is a Python natural language analysis package developed by the Stanford NLP group; sentence segmentation is modeled as a tagging problem over character sequences, where the neural model predicts whether a given character is the end of a sentence. The split_sentences() function in the Gensim package splits a text and returns a list of sentences using unsupervised pattern recognition. The Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.

To compare the above methods, we created a ground truth corpus by randomly selecting acknowledgment sections or statements from 47 papers and manually segmenting them, resulting in 100 sentences. Table 1 shows the comparison results for the four methods. The precision is calculated

6 https://github.com/diasks2/pragmatic_segmenter
Method      Precision  Recall  F1
Gensim      0.65       0.64    0.65
NLTK        0.72       0.69    0.70
Pragmatic   0.86       0.76    0.81
Stanza      0.81       0.88    0.84

Table 1: Sentence segmentation performance.
by dividing the number of correctly segmented sentences by the total number of sentences segmented. The recall is calculated by dividing the number of correctly segmented sentences by the total number of manually segmented sentences. Stanza outperforms the other three, achieving F1 = 0.84.
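These precision and recall definitions can be coded directly. The helper below is our illustrative sketch, using exact string match (after whitespace stripping) as the correctness criterion, which is our assumption:

```python
def segmentation_scores(predicted, gold):
    """Precision, recall, and F1 for sentence segmentation.

    A predicted sentence counts as correct only if it exactly matches
    a gold sentence after stripping surrounding whitespace.
    """
    pred = [s.strip() for s in predicted]
    ref = {s.strip() for s in gold}
    correct = sum(1 for s in pred if s in ref)
    precision = correct / len(pred)
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: the segmenter merged the last two gold sentences into one.
gold = ["We thank A.", "We thank B.", "C helped us.", "D funded us."]
predicted = ["We thank A.", "We thank B.", "C helped us. D funded us."]
p, r, f1 = segmentation_scores(predicted, gold)
```

Here 2 of the 3 predicted sentences are correct against 4 gold sentences, giving precision 2/3 and recall 1/2.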
4.4 Sentence Classification

Not all sentences in acknowledgement sections express acknowledgement, such as the following sentence: "The funders played no role in the study or preparation of the manuscript" (Song et al., 2020). In this module, we classify sentences into acknowledgement and non-acknowledgement statements. We developed a set of regular expressions that match verbs (e.g., thank, gratitude to, indebted to), adjectives (e.g., grateful to), and nouns (e.g., helpful comments, useful feedback) to cover as many cases as possible. To evaluate this method, we manually selected 100 sentences, including 50 positive and 50 negative samples, from the sentences obtained in Section 4.3. Our results show that 96 out of 100 sentences were classified correctly, resulting in an accuracy of 0.96.
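A minimal sketch of such a regular-expression classifier follows; the cue list here is a small sample of the patterns described above, not the full set used in ACKEXTRACT:

```python
import re

# Sample acknowledgement cues: verbs, adjectives, and noun phrases.
ACK_CUES = re.compile(
    r"\b(thanks?|gratitude to|indebted to|grateful to|"
    r"helpful comments|useful feedback|acknowledge[sd]?)\b",
    re.IGNORECASE,
)

def is_acknowledgement(sentence: str) -> bool:
    """Classify a sentence as an acknowledgement statement."""
    return ACK_CUES.search(sentence) is not None

print(is_acknowledgement("We thank Jane Doe for helpful comments."))  # True
print(is_acknowledgement("The funders played no role in the study."))  # False
```

A funder-disclaimer sentence like the second example contains no cue and is correctly rejected, which is exactly the distinction this module is meant to draw.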
4.5 Entity Recognition

In this step, named entities are extracted using state-of-the-art NER software packages, including NLTK (Bird, 2006), Stanford CoreNLP (Manning et al., 2014), spaCy (Honnibal and Montani, 2017), and Stanza (Qi et al., 2020). Stanza is a Python library offering fully neural pre-trained models that provided state-of-the-art performance on many raw text processing tasks when it was released. The NER model adopts the contextualized sequence tagger of Akbik et al. (2018); the architecture includes a Bi-LSTM character-level language model followed by a one-layer Bi-LSTM sequence tagger with a conditional random field (CRF) encoder. Although Stanza was developed based on Stanford CoreNLP, they exhibit different performances on our NER task.
The ground truth is built by randomly selecting
Method            Entity  Precision  Recall  F1
NLTK              Person  0.45       0.68    0.55
                  Org     0.59       0.77    0.67
spaCy             Person  0.74       0.88    0.80
                  Org     0.63       0.74    0.68
Stanford CoreNLP  Person  0.88       0.87    0.87
                  Org     0.68       0.80    0.73
Stanza            Person  0.89       0.93    0.91
                  Org     0.60       0.89    0.72

Table 2: Comparison of NER software packages.
100 acknowledgement paragraphs from sections, footnotes, and body text, and manually annotating person and organization entities, without discriminating whether they are acknowledged or not. This results in 146 person and 209 organization entities. The comparison results indicate that overall Stanza outperforms the other three, achieving F1 = 0.91 for person and F1 = 0.72 for organization. In particular, the recall of Stanza is 9% higher than that of Stanford CoreNLP (Table 2).
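In practice, this step amounts to keeping only the PERSON and ORG spans from the NER output. The sketch below simulates Stanza's entities as plain `(text, type)` pairs so it runs without model downloads; the commented-out lines show how the pairs would be obtained with Stanza installed (an assumption based on Stanza's documented `Document.ents` interface):

```python
# With Stanza installed and English models downloaded, the pairs would come from:
#   import stanza
#   nlp = stanza.Pipeline("en", processors="tokenize,ner")
#   entities = [(e.text, e.type) for e in nlp(paragraph).ents]

def people_and_orgs(entities):
    """Keep only person and organization entities from NER output."""
    keep = {"PERSON": "person", "ORG": "organization"}
    return [(text, keep[etype]) for text, etype in entities if etype in keep]

# Simulated NER output for:
# "We thank Jane Doe of Utrecht University, funded by an NIH grant."
entities = [("Jane Doe", "PERSON"), ("Utrecht University", "ORG"),
            ("NIH", "ORG")]
print(people_and_orgs(entities))
```

Note that all three entities survive this step; deciding which of them are actually acknowledged is deferred to the entity classification module below.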
4.6 Entity Classification

As we showed, not all named entities are acknowledged, such as the "Utrecht University" in Figure 1. Therefore, it is necessary to build a classifier to discriminate acknowledgement entities, i.e., entities that are thanked by the paper or the authors, from named entities.

The majority of acknowledgement statements in academic articles have a relatively standard subject-predicate-object (SPO) structure. They use a limited number of words or phrases, such as "thank", "acknowledge", "are grateful", "is supported", and "is funded", as predicates. However, the object can contain multiple named entities, some of which are used as attributes of the others. In rare cases, certain sentences may not have subjects and predicates, such as the first sentence in Figure 4.

Our approach can be divided into three steps. Two representative examples are illustrated in Figure 3. The pseudo-code is shown in Algorithm 1.

Step 1: We resolve the type of voice (active or passive), the subject, and the predicate of a sentence using dependency parsing by Stanza, because named entities can appear in either the subjective or the objective part. We then locate all named entities. The semantic meaning of a predicate and its type of voice can be used to determine whether the acknowledged entities are in the objective part or the subjective part; in most cases, the target entities are objects.
Step 2: A sentence with multiple entities in the objective part is split into shorter sentences, called "subsentences", so that each subsentence is associated with at most one named entity. This is done by first splitting the sentence by "and". For each subsentence, if the subject and predicate are missing, we fill them in using the subject and predicate of the original sentence. The object in each subsentence does not necessarily contain an entity; for example, in the right panel of Figure 3, because "expertise" is not a named entity, it is replaced by "none" in the subsentence. The SPO relations that do not contain named entities are removed.

There are two scenarios. In the first scenario, a sentence or a subsentence may contain a list of entities with only the first being acknowledged. In the third example of Figure 4, the named entities are [Qi Yang, Morehouse School of Medicine, Atlanta, GA], but only the first entity is acknowledged; the remaining entities, which are recognized as organizations or locations, supplement additional information. In this scenario, only the first named entity is extracted.

Step 3: In the second scenario, acknowledged entities are connected by commas or "and", such as in "We thank Shirley Hauta, Yurij Popowych, Elaine van Moorlehem and Yan Zhou for help in the virus production". In this scenario, entities in this structure have the same type, indicating that they play similar roles. This parallel pattern can be captured by regular expressions. The entities resolved in Steps 2 and 3 are merged to form the final set.

The method to find the parallel structure is as follows. First, check whether each entity's type is person. If so, the entities are substituted with integer indexes; the sentence becomes "The authors would like to thank 0, 1 and 2 and the kind support of Bayer Animal Health GmbH and Virbac Group". If there are 3 or more consecutive numbers in this form, this is the parallel pattern, which is captured by regular expressions. The pattern also allows text between names (Figure 5). Next, the numbers in this part are extracted and mapped to the corresponding entities. In the example above, the numbers [0, 1, 2] correspond to the indexes of the entities [Norbert Mencke, Lourdes Mottier, David McGahie]. Similar pattern recognition is performed for organizations. The process is depicted in Figure 5.
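The index-substitution trick of Step 3 can be sketched as follows. This is a simplified illustration with a hypothetical helper name; the real method also handles organization entities and more varied text between names:

```python
import re

def parallel_persons(sentence, persons):
    """Find person entities appearing in a parallel (comma/'and') list.

    Each person entity is replaced by its integer index; a run of three
    or more indexes joined by commas and/or 'and' signals the pattern.
    """
    indexed = sentence
    for i, name in enumerate(persons):
        indexed = indexed.replace(name, str(i))
    # Three or more indexes separated by commas and/or 'and'.
    match = re.search(r"\d+(?:\s*(?:,|and)\s*\d+){2,}", indexed)
    if not match:
        return []
    return [persons[int(n)] for n in re.findall(r"\d+", match.group())]

sentence = ("The authors would like to thank Norbert Mencke, Lourdes Mottier "
            "and David McGahie and the kind support of Bayer Animal Health "
            "GmbH and Virbac Group")
persons = ["Norbert Mencke", "Lourdes Mottier", "David McGahie"]
print(parallel_persons(sentence, persons))
```

On this sentence, the indexed form contains "0, 1 and 2", so all three person entities are returned, while a two-name sentence would not trigger the pattern.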
Figure 3: Process mapping multiple subject-predicate-object relations to acknowledgement entities, with two representative examples. "None" means the object does not contain named entities. In one example, "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab for technical assistance with the assays" is split into subsentences yielding the triples ['we', 'thank', 'Anne Broadwater'] and ['we', 'thank', 'Nicolas Olivarez'], merged into ['Anne Broadwater', 'Nicolas Olivarez']. In the other (right panel), "We would also like to thank the Canadian Wildlife Health Cooperative at the University of Saskatchewan for their support and expertise" yields ['we', 'thank', 'Canadian Wildlife Health Cooperative'] plus a "none" relation for "expertise", giving ['Canadian Wildlife Health Cooperative'].
Method               Precision  Recall  F1
Stanza               0.57       0.91    0.70
RB (without step 3)  0.94       0.78    0.85
RB                   0.94       0.90    0.92

Table 3: Performance of Stanza and our relation-based (RB) classifier. Precision and recall show overall results combining person and organization entities.
We investigated open information extraction methods such as REVERB (Fader et al., 2011) and OLLIE (Mausam et al., 2012). We found that they only work for sentences with relatively simple structures, such as "The authors wish to thank Ming-Wei Guo for reagents and technical assistance", but fail on more complicated sentences with a long objective part or parallel subsentences (Figure 4). We also investigated the semantic role labeling (SRL) library in the AllenNLP toolkit (Shi and Lin, 2019). For the third sentence in Figure 4, the AllenNLP SRL library resolves the entire string after "thank" as an argument but fails to distinguish entity types. Our method can handle all statements in Figure 4.

As a baseline, we use Stanza to extract all named entities and compare its performance with our relation-based (RB) classifier. The ground truth is compiled using the same corpus described in Section 4.5, except that only acknowledgement entities (as opposed to all named entities) are labeled positive. The results (Table 3) show that Stanza achieves high recall but poor precision, indicating that a significant fraction (~40%) of named entities are not acknowledged. In contrast, our classifier (RB) achieves a precision of 0.94 with a small loss of recall, achieving F1 = 0.92.

One limitation of the RB classifier is that it relies on text quality. Sentences are expected to follow
Figure 4: Sentence examples from which REVERB or OLLIE failed to extract acknowledgement entities (underlined), which can be identified by our method: (1) "To Aulio Costa Zambenedetti for the art work, Nilson Fidêncio, Silvio Marques, Tania Schepainski and Sibelli Tanjoni de Souza for technical assistance." (2) "The authors also thank the Program for Technological Development in Tools for Health-PDTIS FIOCRUZ for the use of its facilities (Platform RPT09H - Real Time PCR - Instituto Carlos Chagas/Fiocruz-PR)." (3) "The authors wish to thank Ms Qi Yang, DNA sequencing lab, Morehouse School of Medicine, Atlanta, GA, and the Research Cores at Morehouse School of Medicine, supported through the Research Centers at Minority Institutions (RCMI) Program, NIH/NCRR/RCMI Grant G12-RR03034."
Figure 5: Merge the results given by Step 2 and Step 3 and obtain the final results. For the sentence "The authors would like to thank Norbert Mencke, Lourdes Mottier and David McGahie and the kind support of Bayer Animal Health GmbH and Virbac Group", Step 2 yields the persons [Norbert Mencke, David McGahie] and the organizations [Bayer Animal Health, Virbac Group], Step 3 yields the parallel person list [Norbert Mencke, Lourdes Mottier, David McGahie], and the merged final set is [Norbert Mencke, David McGahie, Bayer Animal Health, Virbac Group, Lourdes Mottier].
the editorial convention such that entity names areclearly segmented by period If two sentences arenot properly delimited eg missing a period theclassifier may make incorrect predictions
5 Data Analysis
The CORD-19 dataset contains PMC and PDFfolders We found that almost all papers in thePMC folders are included in the PDF foldersFor consistency we only work on papers in thePDF folders containing 37612 unique PDF pa-pers7 Using ACKEXTRACT we extracted a totalof 102965 named entities from 21469 CORD-19 papers The fraction of papers with namedentities (2146937612asymp57) is roughly consis-tent with the fraction obtained from the 200 sam-ples (120200asymp60 in Section 42) Among allnamed entities 58742 are acknowledgement enti-ties These numbers suggest that using our modelonly about 50 of named entities are acknowl-edged Using only named entities acknowledge-ment studies could significantly overestimate the
7We excluded 2003 duplicate papers with exactly the sameSHA1 values
Algorithm 1 Relation-Based Classifier1 Function find entity()2 pre-process with punctuation and format3 find candidate entities entity list by Stanza4 for x isin entity list do5 if x isin
subject part (differs based on predicate)then
6 x = none
7 entity listremove(none)8 return entity list
9 Input sentence10 Output entity list all11 find subject part predicate and the object part12 if ldquopredicate valuerdquo isin reg expression 1 then13 if ldquoandrdquo isin sentence then14 split into subsentences by lsquoandrsquo15 for each subsentence do16 entity list
larr find entity(subsentence)17 if entity list 6= none then18 entity list all
larr appendentity list[0]
19 else if ldquoandrdquo isin sentence then20 entity listlarr find entity(subsentence)21 entity list alllarr entity list[0]
22 else if ldquopredicate valuerdquo isin reg expression 2 then23 do the same in if part but focus on entities in
subject24 for x isin candidate list by Stanza do25 if x isin sentenceampxtype = type then26 orig listlarr append x27 sentlarr replace x by index(x)
28 if reg format isin sent then29 templarr find reg expression in sent30 numlistlarr find numbers in temp31 listlarr index orig list by numlist
32 combine list and entity list all
number of acknowledgement entities by relyingon entities recognized by NER software packageswithout further classification Here we analyze ourresults and study some trends of acknowledgemententities in the CORD-19 dataset
The top 10 acknowledged organizations (Ta-ble 4) are all funding agencies Overall NIH (NI-AID is an institute of NIH) is acknowledged themost but funding agencies in other countries arealso acknowledged a lot in CORD-19 papers
Figure 6 shows the numbers of CORD-19 pa-pers with vs without acknowledgement (personor organization) recognized from 1970 to 2020The huge leap around 2002 was due to the wave ofcoronavirus studies during the SARS period Thesmall drop-down in 2020 was due to data incom-pleteness The plot indicates that the fraction of
17
Organization Count
National Institutes of Health or NIH 1414National Natural Science Foundation of China or NSFC 615National Institute of Allergy amp Infectious Diseases or NIAID 209National Science Foundation or NSF 183Ministry of Health 160Wellcome Trust 146Deutsche Forschungsgemeinschaft 119Ministry of Education 119National Science Council 104Public Health Service 85
Table 4 Top 10 acknowledged organizations identifiedin our CORD-19 dataset
0
500
1000
1500
2000
2500
3000
3500
4000
1970
1972
1974
1976
1978
1980
1982
1984
1986
1988
1990
1992
1994
1996
1998
2000
2002
2004
2006
2008
2010
2012
2014
2016
2018
2020
With Acknowledgement Without Acknowledgement
Figure 6 Numbers of CORD-19 papers with vs with-out acknowledgement recognized from 1970 to 2020
papers with acknowledgement has been increasinggradually over the last 20 years Figure 7 showsthe numbers of the top 10 acknowledged organiza-tions from 1983 to 2020 The figure indicates thatthe number of acknowledgements to NIH has beengradually decreasing over the past 10 years whilethe acknowledgements to NSF is roughly constantIn contrast the number has been gradually increas-ing from NSFC (a Chinese funding agency) Notethat the distribution of acknowledged organizationshas a long tail and organizations behind the top 10actually dominate the total number However thetop 10 organizations are the biggest research agen-cies and the trend to some extent reflects strategicshifts of funding support
6 Conclusion and Future Work
Here we extended the work of Khabsa et al. (2012) and built an acknowledgement extraction framework for research articles, denoted ACKEXTRACT. ACKEXTRACT is based on heuristic methods and state-of-the-art text mining libraries (e.g., GROBID and Stanza), but features a classifier that discriminates acknowledgement entities from named entities by analyzing the multiple subject-predicate-object relations in a sentence. Our approach successfully recognizes acknowledgement entities that cannot be recognized by OIE packages such as REVERB, OLLIE, and the AllenNLP SRL library.

Figure 7: Trend of the top 10 acknowledged organizations from 1983 to 2020.
This method is applied to the CORD-19 dataset released on April 10, 2020, processing one PDF document in 5 seconds on average. Our results indicate that only 50–60% of named entities are acknowledged; the rest are mentioned to provide additional information (e.g., affiliation or location) about acknowledgement entities. Working on clean data, our method achieves an overall F1 = 0.92 for person and organization entities. The trend analysis of the CORD-19 papers verifies that more and more papers have included acknowledgement entities since 2002, when the SARS outbreak happened. The trend also reveals that the overall number of acknowledgements to NIH has gradually decreased over the past 10 years, while more papers acknowledge NSFC, a Chinese funding agency. One caveat of our method is that organizations in different countries are not distinguished; for example, many countries have agencies called "Ministry of Health". In the future, we plan to build learning-based models for sentence classification and entity classification. The code and data of this project have been released on GitHub at https://github.com/lamps-lab/ackextract.
Acknowledgments
This work was partially supported by the Defense Advanced Research Projects Agency (DARPA) under cooperative agreement No. W911NF-19-2-0272. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

James Andreoni and Laura K. Gee. 2015. Gunning for efficiency with third party enforcement in threshold public goods. Experimental Economics, 18(1):154–171.

Hannah Bast and Claudius Korzen. 2017. A benchmark and evaluation for text extraction from PDF. In 2017 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19-23, 2017, pages 99–108. IEEE Computer Society.

Steven Bird. 2006. NLTK: the natural language toolkit. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006. The Association for Computer Linguistics.

Cronin Blaise. 2001. Acknowledgement trends in the research literature of information science. 57(3):427–433.

Isaac G. Councill, C. Lee Giles, Hui Han, and Eren Manavoglu. 2005. Automatic acknowledgement indexing: Expanding the semantics of contribution in the CiteSeer digital library. In Proceedings of the 3rd International Conference on Knowledge Capture, K-CAP '05, pages 19–26, New York, NY, USA. ACM.

Blaise Cronin, Gail McKenzie, Lourdes Rubio, and Sherrill Weaver-Wozniak. 1993. Accounting for influence: Acknowledgments in contemporary sociology. Journal of the American Society for Information Science, 44(7):406–412.

Blaise Cronin, Gail McKenzie, and Michael Stiffler. 1992. Patterns of acknowledgement. Journal of Documentation.

S. Dai, Y. Ding, Z. Zhang, W. Zuo, X. Huang, and S. Zhu. 2019. GrantExtractor: Accurate grant support information extraction from biomedical fulltext based on Bi-LSTM-CRF. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pages 1–1.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, pages 1535–1545.

C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, USA, pages 89–98.

C. Lee Giles and Isaac G. Councill. 2004. Who gets acknowledged? Measuring scientific contributions through automatic acknowledgment indexing. Proceedings of the National Academy of Sciences, 101(51):17599–17604.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Subhradeep Kayal, Zubair Afzal, George Tsatsaronis, Marius Doornenbal, Sophia Katrenko, and Michelle Gregory. 2019. A framework to automatically extract funding information from text. In Machine Learning, Optimization, and Data Science, pages 317–328, Cham. Springer International Publishing.

Madian Khabsa, Pucktada Treeratpituk, and C. Lee Giles. 2012. AckSeer: a repository and search engine for automatically extracted acknowledgments from digital libraries. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '12, Washington, DC, USA, June 10-14, 2012, pages 185–194. ACM.

Mario Lipinski, Kevin Yao, Corinna Breitinger, Joeran Beel, and Bela Gipp. 2013. Evaluation of header metadata extraction approaches and tools for scientific pdf documents. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, pages 385–386, New York, NY, USA. ACM.

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S. Weld. 2019. S2ORC: The Semantic Scholar open research corpus.

Patrice Lopez. 2009. GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries, ECDL'09, pages 473–474, Berlin, Heidelberg. Springer-Verlag.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 523–534, Stroudsburg, PA, USA. Association for Computational Linguistics.

Adèle Paul-Hus, Adrián A. Díaz-Faes, Maxime Sainte-Marie, Nadine Desrochers, Rodrigo Costas, and Vincent Larivière. 2017. Beyond funding: Acknowledgement patterns in biomedical, natural and social sciences. PLOS ONE, 12(10):1–14.

Adèle Paul-Hus, Philippe Mongeon, Maxime Sainte-Marie, and Vincent Larivière. 2020. Who are the acknowledgees? An analysis of gender and academic status. Quantitative Science Studies, 0(0):1–17.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. CoRR, abs/2003.07082.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA.

Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. CoRR, abs/1904.05255.

Mayank Singh, Barnopriyo Barua, Priyank Palod, Manvi Garg, Sidhartha Satapathy, Samuel Bushi, Kumar Ayush, Krishna Sai Rohith, Tulasi Gamidi, Pawan Goyal, and Animesh Mukherjee. 2016. OCR++: A robust framework for information extraction from scholarly articles. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 3390–3400. ACL.

E. Smith, B. Williams-Jones, Z. Master, V. Larivière, C. R. Sugimoto, A. Paul-Hus, M. Shi, and D. B. Resnik. 2019. Misconduct and misbehavior related to authorship disagreements in collaborative science. Sci Eng Ethics.

Min Song, Keun Young Kang, Tatsawan Timakum, and Xinyuan Zhang. 2020. Examining influential factors for acknowledgements classification using supervised learning. PLOS ONE, 15(2):1–21.

H. Stewart, K. Brown, A. M. Dinan, N. Irigoyen, E. J. Snijder, and A. E. Firth. 2018. Transcriptional and translational landscape of equine torovirus. J Virol, 92(17).

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The COVID-19 open research dataset. CoRR, abs/2004.10706.

Jian Wu, Jason Killian, Huaiyu Yang, Kyle Williams, Sagnik Ray Choudhury, Suppawong Tuarob, Cornelia Caragea, and C. Lee Giles. 2015. PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 131–138, New York, NY, USA. ACM.
4. Entity recognition: Named entities are extracted from acknowledgement statements. We compare four commonly used NER software packages and choose Stanza because of its relatively high performance. In this work, we focus on person and organization entities.

5. Entity classification: In this module, we classify named entities by analyzing sentence structures, aiming at discriminating named entities that are actually acknowledged rather than just mentioned. We demonstrate that triple extraction packages such as REVERB and OLLIE fail to handle acknowledgement statements with multiple entities in objects in our dataset. The results are acknowledgement entities, including people or organizations.
4.2 Document Segmentation

The majority of scholarly papers are published in PDF format, which is not readily readable by text processors. Several attempts have been made to convert PDF into text (Bast and Korzen, 2017) and segment the document into section and sub-section levels. GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents into TEI-encoded documents. Other similar methods have been developed recently, such as OCR++ (Singh et al., 2016) and Science Parse.4 Lipinski et al. (2013) compared 7 metadata extraction methods and found that GROBID (version 0.4.0) achieved superior performance over the others. GROBID trained a cascade of conditional random field (CRF) models on PubMed and computer science papers. The recent version (0.6.0) has a set of powerful functionalities, such as extracting and parsing headers and segmenting full text. GROBID supports a batch mode and an API service mode; the latter enables large-scale document processing on multi-core servers, such as in CiteSeerX (Wu et al., 2015). A benchmarking result for version 0.6.0 shows that section title parsing achieves F1 = 0.70 under the strict matching criteria and F1 = 0.75 under the soft matching criteria.5 Singh et al. (2016) claim that OCR++ achieves better performance than GROBID in several fields, evaluated on computer science papers; however, the lack of a service-mode API and multi-domain adaptability limits its usability. Science Parse only extracts key metadata such as
4 https://github.com/allenai/spv2
5 https://grobid.readthedocs.io/en/latest/Benchmarking-pmc
title, authors, year, and venue. Therefore, we adopt GROBID to convert PDF documents into XML files.
Depending on the structure and provenance of PDFs, GROBID may miss acknowledgements in certain papers. To estimate the fraction of papers in which acknowledgements were missed by GROBID, we visually inspected a random sample of 200 papers from the CORD-19 dataset and found that only 146 papers (73%) contain acknowledgement statements, out of which GROBID successfully extracted all acknowledgement statements from 120 papers (82%). For the remaining 26 papers that GROBID failed to parse, the acknowledgements of 17 papers are in sections and those of 9 papers are in footnotes. We developed a heuristic method that can extract acknowledgement statements from all 120 papers with acknowledgement statements output by GROBID.
4.3 Sentence Segmentation
The acknowledgement sections or statements extracted above are paragraphs, which need to be segmented (or tokenized) into sentences. We compared four software packages for sentence segmentation: NLTK (Bird, 2006), Stanza (Qi et al., 2020), Gensim (Řehůřek and Sojka, 2010), and the Pragmatic Segmenter.6
NLTK includes a sentence tokenization method, sent_tokenize(), which uses an unsupervised algorithm to build a model for abbreviated words, collocations, and words that start sentences, and then uses that model to find sentence boundaries. Stanza is a Python natural language analysis package developed by the Stanford NLP group; sentence segmentation is modeled as a tagging problem over character sequences, where the neural model predicts whether a given character is the end of a sentence. The split_sentences() function in the Gensim package splits a text and returns a list of sentences using unsupervised pattern recognition. The Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
To compare the above methods, we created a ground truth corpus by randomly selecting acknowledgment sections or statements from 47 papers and manually segmenting them, resulting in 100 sentences. Table 1 shows the comparison results for the four methods. The precision is calculated
6 https://github.com/diasks2/pragmatic_segmenter
Method     Precision  Recall  F1
Gensim     0.65       0.64    0.65
NLTK       0.72       0.69    0.70
Pragmatic  0.86       0.76    0.81
Stanza     0.81       0.88    0.84

Table 1: Sentence segmentation performance.
by dividing the number of correctly segmented sentences by the total number of sentences segmented. The recall is calculated by dividing the number of correctly segmented sentences by the total number of manually segmented sentences. Stanza outperforms the other three, achieving F1 = 0.84.
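The scoring protocol above can be expressed in a few lines of code. This is only a sketch of the evaluation described in the text; the function and variable names are ours, and a prediction is counted as correct only when it exactly matches a manually segmented sentence.

```python
def segmentation_scores(predicted, gold):
    """Precision/recall/F1 for sentence segmentation: a predicted
    sentence counts as correct only if it exactly matches a gold
    (manually segmented) sentence."""
    correct = len(set(predicted) & set(gold))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = ["We thank A. Smith.", "This work was funded by NIH."]
# A splitter that breaks on the abbreviation "A." over-segments:
predicted = ["We thank A.", "Smith.", "This work was funded by NIH."]
p, r, f1 = segmentation_scores(predicted, gold)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.33 0.5 0.4
```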
4.4 Sentence Classification
Not all sentences in acknowledgement sections express acknowledgement, such as the following sentence: "The funders played no role in the study or preparation of the manuscript" (Song et al., 2020). In this module, we classify sentences into acknowledgement and non-acknowledgement statements. We developed a set of regular expressions that match verbs (e.g., thank, gratitude to, indebted to), adjectives (e.g., grateful to), and nouns (e.g., helpful comments, useful feedback) to cover as many cases as possible. To evaluate this method, we manually selected 100 sentences, including 50 positive and 50 negative samples, from the sentences obtained in Section 4.3. Our results show that 96 out of 100 sentences were classified correctly, resulting in an accuracy of 0.96.
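A regular-expression filter of this kind can be sketched as follows. This is a minimal illustration, not the paper's actual pattern set: the cue list here is a small subset we chose for demonstration.

```python
import re

# Illustrative subset of cue patterns; the paper's full regular
# expression set covers many more verbs, adjectives, and nouns.
ACK_CUES = re.compile(
    r"\b(thanks?|acknowledges?|gratitude to|indebted to|grateful to|"
    r"helpful comments|useful feedback|supported by|funded by)\b",
    re.IGNORECASE,
)

def is_acknowledgement(sentence: str) -> bool:
    """Classify a sentence as acknowledgement vs. non-acknowledgement."""
    return ACK_CUES.search(sentence) is not None

print(is_acknowledgement("We thank Jane Doe for helpful comments."))   # True
print(is_acknowledgement("The funders played no role in the study."))  # False
```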
4.5 Entity Recognition
In this step, named entities are extracted using state-of-the-art NER software packages, including NLTK (Bird, 2006), Stanford CoreNLP (Manning et al., 2014), spaCy (Honnibal and Montani, 2017), and Stanza (Qi et al., 2020). Stanza is a Python library offering fully neural pre-trained models that provided state-of-the-art performance on many raw text processing tasks when it was released. The NER model adopted the contextualized sequence tagger in Akbik et al. (2018); the architecture includes a Bi-LSTM character-level language model followed by a one-layer Bi-LSTM sequence tagger with a conditional random field (CRF) encoder. Although Stanza was developed based on Stanford CoreNLP, they exhibit different performances in our NER task.
The ground truth is built by randomly selecting 100 acknowledgement paragraphs from sections, footnotes, and body text, and manually annotating person and organization entities without discriminating whether they are acknowledged or not. This results in 146 person and 209 organization entities. The comparison results indicate that overall Stanza outperforms the other three, achieving F1 = 0.91 for person and F1 = 0.72 for organization entities. In particular, the recall of Stanza is 9% higher than that of Stanford CoreNLP (Table 2).

Method            Entity  Precision  Recall  F1
NLTK              Person  0.45       0.68    0.55
                  Org     0.59       0.77    0.67
spaCy             Person  0.74       0.88    0.80
                  Org     0.63       0.74    0.68
Stanford CoreNLP  Person  0.88       0.87    0.87
                  Org     0.68       0.80    0.73
Stanza            Person  0.89       0.93    0.91
                  Org     0.60       0.89    0.72

Table 2: Comparison of NER software packages.
4.6 Entity Classification

As we showed, not all named entities are acknowledged, such as the "Utrecht University" in Figure 1. Therefore, it is necessary to build a classifier that discriminates acknowledgement entities, i.e., entities that are thanked by the paper or the authors, from named entities.
The majority of acknowledgement statements in academic articles have a relatively standard subject-predicate-object (SPO) structure. They use a limited number of words or phrases, such as "thank", "acknowledge", "are grateful", "is supported", and "is funded", as predicates. However, the object can contain multiple named entities, some of which are used as attributes of the others. In rare cases, certain sentences may not have subjects and predicates, such as the first sentence in Figure 4.
Our approach can be divided into three steps. Two representative examples are illustrated in Figure 3, and the pseudo-code is shown in Algorithm 1.
Step 1: We resolve the type of voice (active or passive), the subject, and the predicate of a sentence using dependency parsing by Stanza, because named entities can appear in the subjective or objective part. We then locate all named entities. The semantic meaning of a predicate and its type of voice can be used to determine whether the acknowledged entities are in the objective or subjective part. In most cases, the target entities are objects.
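The role of the voice decision can be illustrated with a simplified stand-in. The paper resolves voice from Stanza's dependency parse; the regex heuristic below is only ours, for illustration of why voice matters.

```python
import re

def voice_of(sentence: str) -> str:
    """Crude active/passive heuristic: a form of 'to be' followed by
    a past participle. (The paper uses Stanza's dependency parse;
    this regex stand-in is only illustrative.)"""
    if re.search(r"\b(is|are|was|were)\s+\w+ed\b", sentence):
        return "passive"
    return "active"

# In the passive case ("is supported/funded by ...") the acknowledged
# entity sits in the object of "by"; in the common active case
# ("We thank ...") it sits in the objective part after the predicate.
print(voice_of("This work was supported by NSF."))  # passive
print(voice_of("We thank John Smith."))             # active
```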
15
Step 2: A sentence with multiple entities in the objective part is split into shorter sentences, called "subsentences", so that each subsentence is associated with at most one named entity. This is done by first splitting the sentence by "and". For each subsentence, if the subject and predicate are missing, we fill them in using the subject and predicate of the original sentence. The object in each subsentence does not necessarily contain an entity; for example, in the right panel of Figure 3, because "expertise" is not a named entity, it is replaced by "none" in the subsentence. The SPO relations that do not contain named entities are removed.
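The splitting-and-refilling idea of this step can be sketched as follows. This is a simplified sketch: the real system takes the subject and predicate from the dependency parse of each sentence, whereas here they are passed in as parameters.

```python
def split_subsentences(sentence, subject="We", predicate="thank"):
    """Split a coordinated acknowledgement sentence on 'and' and
    re-attach the original subject/predicate to fragments that lack
    them (simplified sketch of step 2)."""
    parts = [p.strip(" ,") for p in sentence.split(" and ")]
    subsentences = []
    for part in parts:
        if not part.lower().startswith(subject.lower()):
            # Fill in the missing subject and predicate from the
            # original sentence.
            part = f"{subject} {predicate} {part}"
        subsentences.append(part)
    return subsentences

sent = "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab"
for s in split_subsentences(sent):
    print(s)
# We thank Anne Broadwater
# We thank Nicolas Olivarez of the de Silva lab
```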
There are two scenarios. In the first scenario, a sentence or a subsentence may contain a list of entities with only the first being acknowledged. In the third example of Figure 4, the named entities are [Qi Yang, Morehouse School of Medicine, Atlanta, GA], but only the first entity is acknowledged; the rest, which are recognized as organizations or locations, are used to supplement more information. In this scenario, only the first named entity is extracted.
Step 3: In the second scenario, acknowledged entities are connected by commas or "and", such as in "We thank Shirley Hauta, Yurij Popowych, Elaine van Moorlehem and Yan Zhou for help in the virus production." In this scenario, entities in this structure have the same type, indicating that they play similar roles. This parallel pattern can be captured by regular expressions. The entities resolved in Steps 2 and 3 are merged to form the final set.
The method to find the parallel structure is as follows. First, we check whether each entity's type is person. If so, the entities are substituted with integer indexes, and the sentence becomes "The authors would like to thank 0, 1 and 2 and the kind support of Bayer Animal Health GmbH and Virbac Group". If there are 3 or more consecutive numbers in this form, this is the parallel pattern, which is captured by regular expressions; the pattern also allows text between names (Figure 5). Next, the numbers in this part are extracted and mapped to the corresponding entities. In the example above, the numbers [0, 1, 2] correspond to the indexes of the entities [Norbert Mencke, Lourdes Mottier, David McGahie]. Similar pattern recognition is performed for organizations. The process is depicted in Figure 5.
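The index-substitution trick can be sketched in a few lines. This is a hedged simplification: the candidate entity list (all of one type) is supplied by hand here, whereas the paper obtains it from Stanza, and the paper's regular expressions are richer than the single pattern below.

```python
import re

def parallel_entities(sentence, candidates):
    """Step-3 sketch: replace same-type candidate entities by their
    indexes, then accept them only if three or more indexes appear
    in a comma/'and'-separated run."""
    indexed = sentence
    for i, ent in enumerate(candidates):
        indexed = indexed.replace(ent, str(i))
    # "0, 1 and 2" style run of at least three indexes.
    run = re.search(r"\d+(?:\s*,\s*\d+)+\s*,?\s*and\s+\d+", indexed)
    if not run:
        return []
    picked = [int(n) for n in re.findall(r"\d+", run.group())]
    return [candidates[i] for i in picked]

sent = ("The authors would like to thank Norbert Mencke, Lourdes Mottier "
        "and David McGahie and the kind support of Bayer Animal Health")
people = ["Norbert Mencke", "Lourdes Mottier", "David McGahie"]
print(parallel_entities(sent, people))
# ['Norbert Mencke', 'Lourdes Mottier', 'David McGahie']
```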
Figure 3: Process mapping multiple subject-predicate-object relations to acknowledgement entities, with two representative examples ("We would also like to thank the Canadian Wildlife Health Cooperative at the University of Saskatchewan for their support and expertise" and "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab for technical assistance with the assays"). "None" means the object does not contain named entities.
Method               Precision  Recall  F1
Stanza               0.57       0.91    0.70
RB (without step 3)  0.94       0.78    0.85
RB                   0.94       0.90    0.92

Table 3: Performance of Stanza and our relation-based (RB) classifier. Precision and recall show overall results combining person and organization.
We investigated open information extraction methods such as REVERB (Fader et al., 2011) and OLLIE (Mausam et al., 2012). We found that they only work for sentences with relatively simple structures, such as "The authors wish to thank Ming-Wei Guo for reagents and technical assistance", but fail on more complicated sentences with a long objective part or parallel subsentences (Figure 4). We also investigated the semantic role labeling (SRL) library in the AllenNLP toolkit (Shi and Lin, 2019). For the third sentence in Figure 4, the AllenNLP SRL library resolves the entire string after "thank" as an argument but fails to distinguish entity types. Our method can handle all statements in Figure 4.
As a baseline, we use Stanza to extract all named entities and compare its performance with our relation-based (RB) classifier. The ground truth is compiled using the same corpus described in Section 4.5, except that only acknowledgement entities (as opposed to all named entities) are labeled positive. The results (Table 3) show that Stanza achieves high recall but poor precision, indicating that a significant fraction (~40%) of named entities are not acknowledged. In contrast, our classifier (RB) achieves a precision of 0.94 with a small loss of recall, achieving F1 = 0.92.
One limitation of the RB classifier is that it relies on text quality. Sentences are expected to follow
Figure 4: Sentence examples from which REVERB or OLLIE failed to extract acknowledgement entities (underlined), which can be identified by our method: (1) "To Aulio Costa Zambenedetti for the art work; Nilson Fidêncio, Silvio Marques, Tania Schepainski and Sibelli Tanjoni de Souza for technical assistance." (2) "The authors also thank the Program for Technological Development in Tools for Health-PDTIS, FIOCRUZ for the use of its facilities (Platform RPT09H - Real Time PCR - Instituto Carlos Chagas/Fiocruz-PR)." (3) "The authors wish to thank Ms. Qi Yang, DNA sequencing lab, Morehouse School of Medicine, Atlanta, GA and the Research Cores at Morehouse School of Medicine, supported through the Research Centers at Minority Institutions (RCMI) Program NIH/NCRR/RCMI Grant G12-RR03034."
Figure 5: Merging the results given by step 2 and step 3 to obtain the final result. For "The authors would like to thank Norbert Mencke, Lourdes Mottier and David McGahie and the kind support of Bayer Animal Health GmbH and Virbac Group", step 2 yields [Norbert Mencke, David McGahie] and [Bayer Animal Health, Virbac Group], step 3 yields [Norbert Mencke, Lourdes Mottier, David McGahie], and the merged final set is [Norbert Mencke, David McGahie, Bayer Animal Health, Virbac Group, Lourdes Mottier].
the editorial convention such that entity names are clearly segmented by periods. If two sentences are not properly delimited, e.g., a period is missing, the classifier may make incorrect predictions.
5 Data Analysis
The CORD-19 dataset contains PMC and PDF folders. We found that almost all papers in the PMC folders are included in the PDF folders; for consistency, we only work on papers in the PDF folders, containing 37,612 unique PDF papers.7 Using ACKEXTRACT, we extracted a total of 102,965 named entities from 21,469 CORD-19 papers. The fraction of papers with named entities (21,469/37,612 ≈ 57%) is roughly consistent with the fraction obtained from the 200 samples (120/200 ≈ 60%, Section 4.2). Among all named entities, 58,742 are acknowledgement entities. These numbers suggest that, using our model, only about 50% of named entities are acknowledged. Using only named entities, acknowledgement studies could significantly overestimate the
7 We excluded 2,003 duplicate papers with exactly the same SHA-1 values.
Algorithm 1: Relation-Based Classifier

 1  Function find_entity(subsentence):
 2      pre-process with punctuation and format
 3      find candidate entities entity_list by Stanza
 4      for x in entity_list do
 5          if x in subject part (differs based on predicate) then
 6              x = none
 7      entity_list.remove(none)
 8      return entity_list

 9  Input: sentence
10  Output: entity_list_all
11  find subject part, predicate, and object part
12  if predicate value in reg_expression_1 then
13      if "and" in sentence then
14          split into subsentences by "and"
15          for each subsentence do
16              entity_list ← find_entity(subsentence)
17              if entity_list ≠ none then
18                  entity_list_all ← append entity_list[0]
19      else if "and" not in sentence then
20          entity_list ← find_entity(sentence)
21          entity_list_all ← entity_list[0]
22  else if predicate value in reg_expression_2 then
23      do the same as in the if part, but focus on entities in the subject
24  for x in candidate_list by Stanza do
25      if x in sentence and x.type = type then
26          orig_list ← append x
27          sent ← replace x by index(x)
28  if reg_format in sent then
29      temp ← find reg_expression in sent
30      numlist ← find numbers in temp
31      list ← index orig_list by numlist
32  combine list and entity_list_all
number of acknowledgement entities by relying on entities recognized by NER software packages without further classification. Here, we analyze our results and study some trends of acknowledgement entities in the CORD-19 dataset.
The top 10 acknowledged organizations (Table 4) are all funding agencies. Overall, NIH (NIAID is an institute of NIH) is acknowledged the most, but funding agencies in other countries are also frequently acknowledged in CORD-19 papers.
Figure 6 shows the numbers of CORD-19 papers with vs. without acknowledgement (person or organization) recognized, from 1970 to 2020. The huge leap around 2002 was due to the wave of coronavirus studies during the SARS period, and the small drop in 2020 was due to data incompleteness. The plot indicates that the fraction of
17
Organization Count
National Institutes of Health or NIH 1414National Natural Science Foundation of China or NSFC 615National Institute of Allergy amp Infectious Diseases or NIAID 209National Science Foundation or NSF 183Ministry of Health 160Wellcome Trust 146Deutsche Forschungsgemeinschaft 119Ministry of Education 119National Science Council 104Public Health Service 85
Table 4 Top 10 acknowledged organizations identifiedin our CORD-19 dataset
0
500
1000
1500
2000
2500
3000
3500
4000
1970
1972
1974
1976
1978
1980
1982
1984
1986
1988
1990
1992
1994
1996
1998
2000
2002
2004
2006
2008
2010
2012
2014
2016
2018
2020
With Acknowledgement Without Acknowledgement
Figure 6 Numbers of CORD-19 papers with vs with-out acknowledgement recognized from 1970 to 2020
papers with acknowledgement has been increasinggradually over the last 20 years Figure 7 showsthe numbers of the top 10 acknowledged organiza-tions from 1983 to 2020 The figure indicates thatthe number of acknowledgements to NIH has beengradually decreasing over the past 10 years whilethe acknowledgements to NSF is roughly constantIn contrast the number has been gradually increas-ing from NSFC (a Chinese funding agency) Notethat the distribution of acknowledged organizationshas a long tail and organizations behind the top 10actually dominate the total number However thetop 10 organizations are the biggest research agen-cies and the trend to some extent reflects strategicshifts of funding support
6 Conclusion and Future Work
Here we extended the work of Khabsa et al (2012)and built an acknowledgement extraction frame-work denoted as ACKEXTRACT for research arti-cles ACKEXTRACT is based on heuristic meth-ods and state-of-the-art text mining libraries (egGROBID and Stanza) but features a classifierthat discriminates acknowledgement entities fromnamed entities by analyzing the multiple subject-predicate-object relations in a sentence Our ap-
0
50
100
150
200
250
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
Public Health Service National Science CouncilMinistry of Education Deutsche ForschungsgemeinschaftWellcome Trust Ministry of HealthNational Institute of Allergy amp Infectious DiseasesNIAID National Science FoundationNSFNational Natural Science Foundation of ChinaNSFC National Institutes of HealthNIH
Figure 7 Trend of acknowledged organizations of top10 organizations from 1983 to 2020
proach successfully recognizes acknowledgemententities that cannot be recognized by OIE packagessuch as REVERB OLLIE and the AllenNLP SRLlibrary
This method is applied to the CORD-19 datasetreleased on April 10 2020 processing one PDFdocument in 5 seconds on average Our resultsindicate that only 50ndash60 named entities are ac-knowledged The rest are mentioned to provideadditional information (eg affiliation or location)about acknowledgement entities Working on cleandata our method achieves an overall F1 = 092for person and organization entities The trendanalysis of the CORD-19 papers verifies that moreand more papers include acknowledgement enti-ties since 2002 when the SARS outbreak hap-pened The trend also reveals that the overallnumber of acknowledgements to NIH is gradu-ally decreasing over the past 10 years while morepapers acknowledge NSFC a Chinese fundingagency One caveat of our method is that organi-zations in different countries are not distinguishedFor example many countries have agencies calledldquoMinistry of Healthrdquo In the future we plan tobuild learning-based models for sentence classifi-cation and entity classification The code and dataof this project have been released on GitHub athttpsgithubcomlamps-labackextract
Acknowledgments
This work was partially supported by the DefenseAdvanced Research Projects Agency (DARPA)under cooperative agreement No W911NF-19-2-0272 The content of the information does notnecessarily reflect the position or the policy of theGovernment and no official endorsement shouldbe inferred
18
ReferencesAlan Akbik Duncan Blythe and Roland Vollgraf
2018 Contextual string embeddings for sequencelabeling In Proceedings of the 27th InternationalConference on Computational Linguistics pages1638ndash1649 Santa Fe New Mexico USA Associ-ation for Computational Linguistics
James Andreoni and Laura K Gee 2015 Gunning forefficiency with third party enforcement in thresholdpublic goods Experimental Economics 18(1)154ndash171
Hannah Bast and Claudius Korzen 2017 A bench-mark and evaluation for text extraction from PDFIn 2017 ACMIEEE Joint Conference on Digital Li-braries JCDL 2017 Toronto ON Canada June 19-23 2017 pages 99ndash108 IEEE Computer Society
Steven Bird 2006 NLTK the natural language toolkitIn ACL 2006 21st International Conference on Com-putational Linguistics and 44th Annual Meeting ofthe Association for Computational Linguistics Pro-ceedings of the Conference Sydney Australia 17-21July 2006 The Association for Computer Linguis-tics
Cronin Blaise 2001 Acknowledgement trendsin the research literature of information science57(3)427ndash433
Isaac G Councill C Lee Giles Hui Han and ErenManavoglu 2005 Automatic acknowledgement in-dexing Expanding the semantics of contribution inthe citeseer digital library In Proceedings of the 3rdInternational Conference on Knowledge Capture K-CAP rsquo05 pages 19ndash26 New York NY USA ACM
Blaise Cronin Gail McKenzie Lourdes Rubio andSherrill Weaver-Wozniak 1993 Accounting for in-fluence Acknowledgments in contemporary sociol-ogy Journal of the American Society for Informa-tion Science 44(7)406ndash412
Blaise Cronin Gail McKenzie and Michael Stiffler1992 Patterns of acknowledgement Journal ofDocumentation
S Dai Y Ding Z Zhang W Zuo X Huang andS Zhu 2019 Grantextractor Accurate grant sup-port information extraction from biomedical fulltextbased on bi-lstm-crf IEEEACM Transactions onComputational Biology and Bioinformatics pages1ndash1
Anthony Fader Stephen Soderland and Oren Etzioni2011 Identifying relations for open information ex-traction In Proceedings of the 2011 Conference onEmpirical Methods in Natural Language ProcessingEMNLP 2011 27-31 July 2011 John McIntyre Con-ference Centre Edinburgh UK A meeting of SIG-DAT a Special Interest Group of the ACL pages1535ndash1545
C Lee Giles Kurt D Bollacker and Steve Lawrence1998 CiteSeer An automatic citation indexing sys-tem In Proceedings of the 3rd ACM InternationalConference on Digital Libraries June 23-26 1998Pittsburgh PA USA pages 89ndash98
C Lee Giles and Isaac G Councill 2004 Whogets acknowledged Measuring scientific contribu-tions through automatic acknowledgment indexingProceedings of the National Academy of Sciences101(51)17599ndash17604
Matthew Honnibal and Ines Montani 2017 spaCy 2Natural language understanding with Bloom embed-dings convolutional neural networks and incremen-tal parsing To appear
Subhradeep Kayal Zubair Afzal George TsatsaronisMarius Doornenbal Sophia Katrenko and MichelleGregory 2019 A framework to automatically ex-tract funding information from text In MachineLearning Optimization and Data Science pages317ndash328 Cham Springer International Publishing
Madian Khabsa Pucktada Treeratpituk and C LeeGiles 2012 Ackseer a repository and search en-gine for automatically extracted acknowledgmentsfrom digital libraries In Proceedings of the12th ACMIEEE-CS Joint Conference on Digital Li-braries JCDL rsquo12 Washington DC USA June 10-14 2012 pages 185ndash194 ACM
Mario Lipinski Kevin Yao Corinna Breitinger Jo-eran Beel and Bela Gipp 2013 Evaluation ofheader metadata extraction approaches and tools forscientific pdf documents In Proceedings of the13th ACMIEEE-CS Joint Conference on Digital Li-braries JCDL rsquo13 pages 385ndash386 New York NYUSA ACM
Kyle Lo Lucy Lu Wang Mark Neumann Rodney Kin-ney and Dan S Weld 2019 S2orc The semanticscholar open research corpus
Patrice Lopez 2009 Grobid Combining auto-matic bibliographic data recognition and term ex-traction for scholarship publications In Proceed-ings of the 13th European Conference on Re-search and Advanced Technology for Digital Li-braries ECDLrsquo09 pages 473ndash474 Berlin Heidel-berg Springer-Verlag
Christopher D Manning Mihai Surdeanu John BauerJenny Finkel Steven J Bethard and David Mc-Closky 2014 The Stanford CoreNLP natural lan-guage processing toolkit In Association for Compu-tational Linguistics (ACL) System Demonstrationspages 55ndash60
Mausam Michael Schmitz Robert Bart StephenSoderland and Oren Etzioni 2012 Open languagelearning for information extraction In Proceedingsof the 2012 Joint Conference on Empirical Methodsin Natural Language Processing and ComputationalNatural Language Learning EMNLP-CoNLL rsquo12
19
pages 523ndash534 Stroudsburg PA USA Associationfor Computational Linguistics
Adele Paul-Hus Adrian A Dıaz-Faes Maxime Sainte-Marie Nadine Desrochers Rodrigo Costas and Vin-cent Lariviere 2017 Beyond funding Acknowl-edgement patterns in biomedical natural and socialsciences PLOS ONE 12(10)1ndash14
Adele Paul-Hus Philippe Mongeon Maxime Sainte-Marie and Vincent Lariviere 2020 Who are theacknowledgees an analysis of gender and academicstatus Quantitative Science Studies 0(0)1ndash17
Peng Qi Yuhao Zhang Yuhui Zhang Jason Boltonand Christopher D Manning 2020 Stanza Apython natural language processing toolkit for manyhuman languages CoRR abs200307082
Radim Rehurek and Petr Sojka 2010 SoftwareFramework for Topic Modelling with Large Cor-pora In Proceedings of the LREC 2010 Workshopon New Challenges for NLP Frameworks pages 45ndash50 Valletta Malta ELRA httpismuniczpublication884893en
Peng Shi and Jimmy Lin 2019 Simple BERT mod-els for relation extraction and semantic role labelingCoRR abs190405255
Mayank Singh Barnopriyo Barua Priyank PalodManvi Garg Sidhartha Satapathy Samuel BushiKumar Ayush Krishna Sai Rohith Tulasi GamidiPawan Goyal and Animesh Mukherjee 2016OCR++ A robust framework for information extrac-tion from scholarly articles In COLING 2016 26thInternational Conference on Computational Linguis-tics Proceedings of the Conference Technical Pa-pers December 11-16 2016 Osaka Japan pages3390ndash3400 ACL
E Smith B Williams-Jones Z Master V LariviereC R Sugimoto A Paul-Hus M Shi and D BResnik 2019 Misconduct and misbehavior relatedto authorship disagreements in collaborative scienceSci Eng Ethics
Min Song Keun Young Kang Tatsawan Timakum andXinyuan Zhang 2020 Examining influential fac-tors for acknowledgements classification using su-pervised learning PLOS ONE 15(2)1ndash21
H Stewart K Brown A M Dinan N Irigoyen E JSnijder and A E Firth 2018 Transcriptional andtranslational landscape of equine torovirus J Virol92(17)
Lucy Lu Wang Kyle Lo Yoganand ChandrasekharRussell Reas Jiangjiang Yang Darrin Eide KathrynFunk Rodney Kinney Ziyang Liu William Mer-rill Paul Mooney Dewey Murdick Devvret RishiJerry Sheehan Zhihong Shen Brandon StilsonAlex D Wade Kuansan Wang Chris Wilhelm BoyaXie Douglas Raymond Daniel S Weld Oren Et-zioni and Sebastian Kohlmeier 2020 CORD-19 the covid-19 open research dataset CoRRabs200410706
Jian Wu Jason Killian Huaiyu Yang Kyle WilliamsSagnik Ray Choudhury Suppawong Tuarob Cor-nelia Caragea and C Lee Giles 2015 PdfmefA multi-entity knowledge extraction framework forscholarly documents and semantic search In Pro-ceedings of the 8th International Conference onKnowledge Capture K-CAP 2015 pages 131ndash138New York NY USA ACM
Method      Precision  Recall  F1
Gensim      0.65       0.64    0.65
NLTK        0.72       0.69    0.70
Pragmatic   0.86       0.76    0.81
Stanza      0.81       0.88    0.84

Table 1: Sentence segmentation performance.
by dividing the number of correctly segmented sentences by the total number of sentences segmented. The recall is calculated by dividing the number of correctly segmented sentences by the total number of manually segmented sentences. Stanza outperforms the other three, achieving an F1 = 0.84.
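The evaluation arithmetic described above can be sketched in a few lines; the counts in the usage line are hypothetical, not the paper's actual tallies.

```python
def seg_scores(correct: int, total_segmented: int, total_gold: int):
    """Precision, recall, and F1 for sentence segmentation.

    correct         -- sentences segmented exactly as in the gold data
    total_segmented -- sentences produced by the segmenter
    total_gold      -- manually segmented (gold) sentences
    """
    precision = correct / total_segmented
    recall = correct / total_gold
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 84 correctly segmented out of 100 produced, 100 gold.
p, r, f1 = seg_scores(84, 100, 100)  # all three scores equal 0.84 here
```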
4.4 Sentence Classification
Not all sentences in acknowledgement sections express acknowledgement, such as the following sentence: "The funders played no role in the study or preparation of the manuscript" (Song et al., 2020). In this module, we classify sentences into acknowledgement and non-acknowledgement statements. We developed a set of regular expressions that match verbs (e.g., thank, gratitude to, indebted to), adjectives (e.g., grateful to), and nouns (e.g., helpful comments, useful feedback) to cover as many cases as possible. To evaluate this method, we manually selected 100 sentences, 50 positive and 50 negative, from the sentences obtained in Section 4.3. Our results show that 96 out of 100 sentences were classified correctly, resulting in an accuracy of 0.96.
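A minimal sketch of such a regex-based sentence classifier; the cue list below is a small hypothetical subset of the patterns the module would need.

```python
import re

# Hypothetical cue patterns; the actual rule set covers many more verbs,
# adjectives, and noun phrases.
ACK_CUES = re.compile(
    r"\b(thank(s|ed)?|gratitude to|indebted to|grateful to|"
    r"helpful comments|useful feedback|supported by|funded by)\b",
    re.IGNORECASE,
)

def is_acknowledgement(sentence: str) -> bool:
    """Classify a sentence as an acknowledgement statement."""
    return ACK_CUES.search(sentence) is not None
```

For example, `is_acknowledgement("We thank Jane Doe for helpful comments.")` returns True (the name is a made-up placeholder), while the funder-role sentence above contains no cue and is rejected.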
4.5 Entity Recognition
In this step, named entities are extracted using state-of-the-art NER software packages, including NLTK (Bird, 2006), Stanford CoreNLP (Manning et al., 2014), spaCy (Honnibal and Montani, 2017), and Stanza (Qi et al., 2020). Stanza is a Python library offering fully neural pre-trained models that provided state-of-the-art performance on many raw text processing tasks when it was released. The NER model adopts the contextualized sequence tagger in Akbik et al. (2018). The architecture includes a Bi-LSTM character-level language model followed by a one-layer Bi-LSTM sequence tagger with a conditional random field (CRF) encoder. Although Stanza was developed based on Stanford CoreNLP, they exhibit different performances on our NER task.
The ground truth is built by randomly selecting
Method            Entity  Precision  Recall  F1
NLTK              Person  0.45       0.68    0.55
                  Org     0.59       0.77    0.67
spaCy             Person  0.74       0.88    0.80
                  Org     0.63       0.74    0.68
Stanford CoreNLP  Person  0.88       0.87    0.87
                  Org     0.68       0.80    0.73
Stanza            Person  0.89       0.93    0.91
                  Org     0.60       0.89    0.72

Table 2: Comparison of NER software packages.
100 acknowledgement paragraphs from sections, footnotes, and body text, and manually annotating person and organization entities without discriminating whether they are acknowledged or not. This results in 146 person and 209 organization entities. The comparison indicates that overall Stanza outperforms the other three, achieving F1 = 0.91 for person and F1 = 0.72 for organization. In particular, the recall of Stanza is 9% higher than that of Stanford CoreNLP (Table 2).
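The per-type scores in Table 2 amount to standard set-style precision/recall over annotated entities; a sketch of that computation follows (the gold/predicted lists are toy examples, not the actual annotations).

```python
from collections import Counter

def ner_prf(gold, pred):
    """P/R/F1 over (entity_text, entity_type) tuples, multiset-style."""
    gold_c, pred_c = Counter(gold), Counter(pred)
    tp = sum((gold_c & pred_c).values())  # true positives: matched tuples
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy example: one spurious ORG prediction lowers precision but not recall.
gold = [("Anne Broadwater", "PERSON"), ("NIH", "ORG")]
pred = [("Anne Broadwater", "PERSON"), ("NIH", "ORG"), ("Atlanta", "ORG")]
p, r, f1 = ner_prf(gold, pred)  # r == 1.0, p == 2/3
```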
4.6 Entity Classification
As we showed, not all named entities are acknowledged, such as the "Utrecht University" in Figure 1. Therefore, it is necessary to build a classifier to discriminate acknowledgement entities, i.e., entities that are thanked by the paper or the authors, from named entities.
The majority of acknowledgement statements in academic articles have a relatively standard subject-predicate-object (SPO) structure. They use a limited number of words or phrases, such as "thank", "acknowledge", "are grateful", "is supported", and "is funded", as predicates. However, the object can contain multiple named entities, some of which are used as attributes of the others. In rare cases, certain sentences may not have subjects and predicates, such as the first sentence in Figure 4.
Our approach can be divided into three steps. Two representative examples are illustrated in Figure 3, and the pseudo-code is shown in Algorithm 1.

Step 1: We resolve the type of voice (active or passive), the subject, and the predicate of a sentence using dependency parsing by Stanza, because named entities can appear in either the subjective or the objective part. We then locate all named entities. The semantic meaning of a predicate and its type of voice determine whether the acknowledged entities are in the objective or the subjective part. In most cases, the target entities are objects.
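A highly simplified sketch of this Step 1 decision; the two regular expressions below are hypothetical stand-ins for the paper's reg_expression_1 and reg_expression_2.

```python
import re

# Hypothetical stand-ins for the predicate patterns in Algorithm 1.
REG_EXPRESSION_1 = re.compile(r"\b(thank|acknowledge|are grateful)\b")      # active gratitude
REG_EXPRESSION_2 = re.compile(r"\b(is|was|were|are) (supported|funded)\b")  # passive support

def target_part(predicate_phrase: str) -> str:
    """Decide which part of the SPO relation holds the acknowledged entities."""
    if REG_EXPRESSION_1.search(predicate_phrase):
        return "object"   # e.g., "We thank X ..."
    if REG_EXPRESSION_2.search(predicate_phrase):
        return "subject"  # Algorithm 1's second branch focuses on the subject
    return "object"       # per the paper, target entities are objects in most cases
```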
Step 2: A sentence with multiple entities in the objective part is split into shorter sentences called "subsentences", so that each subsentence is associated with at most one named entity. This is done by first splitting the sentence by "and". For each subsentence, if the subject and predicate are missing, we fill them in using the subject and predicate of the original sentence. The object in each subsentence does not necessarily contain an entity. For example, in the right panel of Figure 3, because "expertise" is not a named entity, it is replaced by "none" in the subsentence. The SPO relations that do not contain named entities are removed.
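A toy sketch of this splitting step (string-based here for brevity; the paper recovers the missing subject and predicate from dependency parses rather than string prefixes):

```python
def split_subsentences(sentence: str, subject: str, predicate: str):
    """Step 2 sketch: split an acknowledgement sentence on ' and ' and
    re-attach the original subject/predicate to fragments that lack them."""
    parts = [p.strip() for p in sentence.split(" and ")]
    subsentences = []
    for part in parts:
        # A fragment that does not start with the subject inherits the
        # original subject and predicate.
        if not part.lower().startswith(subject.lower()):
            part = f"{subject} {predicate} {part}"
        subsentences.append(part)
    return subsentences

subs = split_subsentences(
    "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab",
    "We", "thank")
# subs == ["We thank Anne Broadwater",
#          "We thank Nicolas Olivarez of the de Silva lab"]
```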
There are two scenarios. In the first scenario, a sentence or a subsentence may contain a list of entities with only the first being acknowledged. In the third example of Figure 4, the named entities are [Qi Yang, Morehouse School of Medicine, Atlanta, GA], but only the first entity is acknowledged. The remaining entities, which are recognized as organizations or locations, supplement additional information. In this scenario, only the first named entity is extracted.
Step 3: In the second scenario, acknowledged entities are connected by commas or "and", such as in "We thank Shirley Hauta, Yurij Popowych, Elaine van Moorlehem and Yan Zhou for help in the virus production". Entities in this structure have the same type, indicating that they play similar roles. This parallel pattern can be captured by regular expressions. The entities resolved in Steps 2 and 3 are merged to form the final set.
The method to find the parallel structure is as follows. First, check whether each entity's type is person. If so, the entities are substituted with integer indexes; the sentence becomes "The authors would like to thank 0, 1 and 2 and the kind support of Bayer Animal Health GmbH and Virbac Group". If there are 3 or more consecutive numbers in this form, this is the parallel pattern, which is captured by regular expressions. The pattern also allows text between names (Figure 5). Next, the numbers in this part are extracted and mapped to the corresponding entities. In the example above, the numbers [0, 1, 2] correspond to the indexes of the entities [Norbert Mencke, Lourdes Mottier, David McGahie]. Similar pattern recognition is performed for organizations. The process is depicted in Figure 5.
Figure 3: Process mapping multiple subject-predicate-object relations to acknowledgement entities, with two representative examples: "We would also like to thank the Canadian Wildlife Health Cooperative at the University of Saskatchewan for their support and expertise" and "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab for technical assistance with the assays". "None" means the object does not contain named entities.
Method               Precision  Recall  F1
Stanza               0.57       0.91    0.70
RB (without step 3)  0.94       0.78    0.85
RB                   0.94       0.90    0.92

Table 3: Performance of Stanza and our relation-based (RB) classifier. Precision and recall show overall results, combining person and organization.
We investigated open information extraction methods such as REVERB (Fader et al., 2011) and OLLIE (Mausam et al., 2012). We found that they only work for sentences with relatively simple structures, such as "The authors wish to thank Ming-Wei Guo for reagents and technical assistance", but fail with more complicated sentences with a long objective part or parallel subsentences (Figure 4). We also investigated the semantic role labeling (SRL) library in the AllenNLP toolkit (Shi and Lin, 2019). For the third sentence in Figure 4, the AllenNLP SRL library resolves the entire string after "thank" as one argument but fails to distinguish entity types. Our method can handle all statements in Figure 4.
As a baseline, we use Stanza to extract all named entities and compare its performance with our relation-based (RB) classifier. The ground truth is compiled using the same corpus described in Section 4.5, except that only acknowledgement entities (as opposed to all named entities) are labeled positive. The results (Table 3) show that Stanza achieves high recall but poor precision, indicating that a significant fraction (~40%) of named entities are not acknowledged. In contrast, our classifier (RB) achieves a precision of 0.94 with a small loss of recall, achieving an F1 = 0.92.
One limitation of the RB classifier is that it relies on text quality. Sentences are expected to follow
(1) "To Aulio Costa Zambenedetti for the art work; Nilson Fidêncio, Silvio Marques, Tania Schepainski and Sibelli Tanjoni de Souza for technical assistance."
(2) "The authors also thank the Program for Technological Development in Tools for Health-PDTIS FIOCRUZ for the use of its facilities (Platform RPT09H - Real Time PCR - Instituto Carlos Chagas/Fiocruz-PR)."
(3) "The authors wish to thank Ms. Qi Yang, DNA sequencing lab, Morehouse School of Medicine, Atlanta, GA, and the Research Cores at Morehouse School of Medicine supported through the Research Centers at Minority Institutions (RCMI) Program NIH/NCRR/RCMI Grant G12-RR03034."

Figure 4: Sentence examples from which REVERB or OLLIE failed to extract acknowledgement entities (underlined in the original figure), which can be identified by our method.
Figure 5: Merging the results given by step 2 and step 3 to obtain the final results: for "The authors would like to thank Norbert Mencke, Lourdes Mottier and David McGahie and the kind support of Bayer Animal Health GmbH and Virbac Group", the merged set is [Norbert Mencke, David McGahie, Bayer Animal Health, Virbac Group, Lourdes Mottier].
the editorial convention such that entity names are clearly segmented by periods. If two sentences are not properly delimited, e.g., missing a period, the classifier may make incorrect predictions.
5 Data Analysis
The CORD-19 dataset contains PMC and PDF folders. We found that almost all papers in the PMC folders are included in the PDF folders. For consistency, we only work on papers in the PDF folders, comprising 37,612 unique PDF papers.7 Using ACKEXTRACT, we extracted a total of 102,965 named entities from 21,469 CORD-19 papers. The fraction of papers with named entities (21,469/37,612 ≈ 57%) is roughly consistent with the fraction obtained from the 200 samples (120/200 = 60%) in Section 4.2. Among all named entities, 58,742 are acknowledgement entities. These numbers suggest that, using our model, only about 50% of named entities are acknowledged. Relying on entities recognized by NER software packages without further classification, acknowledgement studies could thus significantly overestimate the number of acknowledgement entities. Here we analyze our results and study some trends of acknowledgement entities in the CORD-19 dataset.

7 We excluded 2,003 duplicate papers with exactly the same SHA-1 values.

Algorithm 1: Relation-Based Classifier

Function find_entity(subsentence):
    pre-process punctuation and format
    find candidate entities entity_list by Stanza
    for x in entity_list:
        if x is in the subject part (differs based on predicate):
            x = none
    remove none values from entity_list
    return entity_list

Input: sentence
Output: entity_list_all
find the subject part, the predicate, and the object part
if predicate value matches reg_expression_1:
    if "and" in sentence:
        split into subsentences by "and"
        for each subsentence:
            entity_list <- find_entity(subsentence)
            if entity_list != none:
                entity_list_all <- append entity_list[0]
    else:
        entity_list <- find_entity(sentence)
        entity_list_all <- entity_list[0]
else if predicate value matches reg_expression_2:
    do the same as in the if-branch, but focus on entities in the subject
for x in candidate_list by Stanza:
    if x in sentence and x.type == type:
        orig_list <- append x
        sent <- replace x by index(x)
if reg_format in sent:
    temp <- find reg_expression in sent
    numlist <- find numbers in temp
    list <- index orig_list by numlist
combine list and entity_list_all
The top 10 acknowledged organizations (Table 4) are all funding agencies. Overall, NIH (NIAID is an institute of NIH) is acknowledged the most, but funding agencies in other countries are also frequently acknowledged in CORD-19 papers.

Organization                                                 Count
National Institutes of Health (NIH)                           1414
National Natural Science Foundation of China (NSFC)            615
National Institute of Allergy & Infectious Diseases (NIAID)    209
National Science Foundation (NSF)                              183
Ministry of Health                                             160
Wellcome Trust                                                 146
Deutsche Forschungsgemeinschaft                                119
Ministry of Education                                          119
National Science Council                                       104
Public Health Service                                           85

Table 4: Top 10 acknowledged organizations identified in our CORD-19 dataset.

Figure 6 shows the numbers of CORD-19 papers with vs. without acknowledgement (person or organization) recognized from 1970 to 2020. The huge leap around 2002 was due to the wave of coronavirus studies during the SARS period; the small drop in 2020 is due to data incompleteness. The plot indicates that the fraction of papers with acknowledgement has been increasing gradually over the last 20 years.

Figure 6: Numbers of CORD-19 papers with vs. without acknowledgement recognized from 1970 to 2020.

Figure 7 shows the numbers of the top 10 acknowledged organizations from 1983 to 2020. The figure indicates that the number of acknowledgements to NIH has been gradually decreasing over the past 10 years, while the number to NSF is roughly constant. In contrast, acknowledgements to NSFC (a Chinese funding agency) have been gradually increasing. Note that the distribution of acknowledged organizations has a long tail, and organizations outside the top 10 actually dominate the total count. However, the top 10 organizations are the biggest research agencies, and the trend to some extent reflects strategic shifts of funding support.
6 Conclusion and Future Work
Here we extended the work of Khabsa et al. (2012) and built an acknowledgement extraction framework, denoted ACKEXTRACT, for research articles. ACKEXTRACT is based on heuristic methods and state-of-the-art text mining libraries (e.g., GROBID and Stanza), but features a classifier that discriminates acknowledgement entities from named entities by analyzing the multiple subject-predicate-object relations in a sentence. Our approach successfully recognizes acknowledgement entities that cannot be recognized by OIE packages such as REVERB, OLLIE, and the AllenNLP SRL library.

Figure 7: Trend of the top 10 acknowledged organizations from 1983 to 2020.
This method is applied to the CORD-19 dataset released on April 10, 2020, processing one PDF document in 5 seconds on average. Our results indicate that only 50–60% of named entities are acknowledged; the rest are mentioned to provide additional information (e.g., affiliation or location) about acknowledgement entities. Working on clean data, our method achieves an overall F1 = 0.92 for person and organization entities. The trend analysis of the CORD-19 papers verifies that more and more papers include acknowledgement entities since 2002, when the SARS outbreak happened. The trend also reveals that the overall number of acknowledgements to NIH has gradually decreased over the past 10 years, while more papers acknowledge NSFC, a Chinese funding agency. One caveat of our method is that organizations in different countries are not distinguished; for example, many countries have agencies called "Ministry of Health". In the future, we plan to build learning-based models for sentence classification and entity classification. The code and data of this project have been released on GitHub at https://github.com/lamps-lab/ackextract.
Acknowledgments
This work was partially supported by the Defense Advanced Research Projects Agency (DARPA) under cooperative agreement No. W911NF-19-2-0272. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

James Andreoni and Laura K. Gee. 2015. Gunning for efficiency with third party enforcement in threshold public goods. Experimental Economics, 18(1):154–171.

Hannah Bast and Claudius Korzen. 2017. A benchmark and evaluation for text extraction from PDF. In 2017 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19-23, 2017, pages 99–108. IEEE Computer Society.

Steven Bird. 2006. NLTK: the natural language toolkit. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006. The Association for Computer Linguistics.

Blaise Cronin. 2001. Acknowledgement trends in the research literature of information science. Journal of Documentation, 57(3):427–433.

Isaac G. Councill, C. Lee Giles, Hui Han, and Eren Manavoglu. 2005. Automatic acknowledgement indexing: Expanding the semantics of contribution in the CiteSeer digital library. In Proceedings of the 3rd International Conference on Knowledge Capture, K-CAP '05, pages 19–26, New York, NY, USA. ACM.

Blaise Cronin, Gail McKenzie, Lourdes Rubio, and Sherrill Weaver-Wozniak. 1993. Accounting for influence: Acknowledgments in contemporary sociology. Journal of the American Society for Information Science, 44(7):406–412.

Blaise Cronin, Gail McKenzie, and Michael Stiffler. 1992. Patterns of acknowledgement. Journal of Documentation.

S. Dai, Y. Ding, Z. Zhang, W. Zuo, X. Huang, and S. Zhu. 2019. GrantExtractor: Accurate grant support information extraction from biomedical fulltext based on Bi-LSTM-CRF. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pages 1–1.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, pages 1535–1545.

C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, USA, pages 89–98.

C. Lee Giles and Isaac G. Councill. 2004. Who gets acknowledged? Measuring scientific contributions through automatic acknowledgment indexing. Proceedings of the National Academy of Sciences, 101(51):17599–17604.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Subhradeep Kayal, Zubair Afzal, George Tsatsaronis, Marius Doornenbal, Sophia Katrenko, and Michelle Gregory. 2019. A framework to automatically extract funding information from text. In Machine Learning, Optimization, and Data Science, pages 317–328, Cham. Springer International Publishing.

Madian Khabsa, Pucktada Treeratpituk, and C. Lee Giles. 2012. AckSeer: a repository and search engine for automatically extracted acknowledgments from digital libraries. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '12, Washington, DC, USA, June 10-14, 2012, pages 185–194. ACM.

Mario Lipinski, Kevin Yao, Corinna Breitinger, Joeran Beel, and Bela Gipp. 2013. Evaluation of header metadata extraction approaches and tools for scientific PDF documents. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, pages 385–386, New York, NY, USA. ACM.

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S. Weld. 2019. S2ORC: The Semantic Scholar open research corpus.

Patrice Lopez. 2009. GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries, ECDL'09, pages 473–474, Berlin, Heidelberg. Springer-Verlag.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 523–534, Stroudsburg, PA, USA. Association for Computational Linguistics.

Adèle Paul-Hus, Adrián A. Díaz-Faes, Maxime Sainte-Marie, Nadine Desrochers, Rodrigo Costas, and Vincent Larivière. 2017. Beyond funding: Acknowledgement patterns in biomedical, natural and social sciences. PLOS ONE, 12(10):1–14.

Adèle Paul-Hus, Philippe Mongeon, Maxime Sainte-Marie, and Vincent Larivière. 2020. Who are the acknowledgees? An analysis of gender and academic status. Quantitative Science Studies, 0(0):1–17.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. CoRR, abs/2003.07082.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA.

Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. CoRR, abs/1904.05255.

Mayank Singh, Barnopriyo Barua, Priyank Palod, Manvi Garg, Sidhartha Satapathy, Samuel Bushi, Kumar Ayush, Krishna Sai Rohith, Tulasi Gamidi, Pawan Goyal, and Animesh Mukherjee. 2016. OCR++: A robust framework for information extraction from scholarly articles. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 3390–3400. ACL.

E. Smith, B. Williams-Jones, Z. Master, V. Larivière, C. R. Sugimoto, A. Paul-Hus, M. Shi, and D. B. Resnik. 2019. Misconduct and misbehavior related to authorship disagreements in collaborative science. Sci Eng Ethics.

Min Song, Keun Young Kang, Tatsawan Timakum, and Xinyuan Zhang. 2020. Examining influential factors for acknowledgements classification using supervised learning. PLOS ONE, 15(2):1–21.

H. Stewart, K. Brown, A. M. Dinan, N. Irigoyen, E. J. Snijder, and A. E. Firth. 2018. Transcriptional and translational landscape of equine torovirus. J Virol, 92(17).

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The COVID-19 open research dataset. CoRR, abs/2004.10706.

Jian Wu, Jason Killian, Huaiyu Yang, Kyle Williams, Sagnik Ray Choudhury, Suppawong Tuarob, Cornelia Caragea, and C. Lee Giles. 2015. PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 131–138, New York, NY, USA. ACM.
Peng Shi and Jimmy Lin 2019 Simple BERT mod-els for relation extraction and semantic role labelingCoRR abs190405255
Mayank Singh Barnopriyo Barua Priyank PalodManvi Garg Sidhartha Satapathy Samuel BushiKumar Ayush Krishna Sai Rohith Tulasi GamidiPawan Goyal and Animesh Mukherjee 2016OCR++ A robust framework for information extrac-tion from scholarly articles In COLING 2016 26thInternational Conference on Computational Linguis-tics Proceedings of the Conference Technical Pa-pers December 11-16 2016 Osaka Japan pages3390ndash3400 ACL
E Smith B Williams-Jones Z Master V LariviereC R Sugimoto A Paul-Hus M Shi and D BResnik 2019 Misconduct and misbehavior relatedto authorship disagreements in collaborative scienceSci Eng Ethics
Min Song Keun Young Kang Tatsawan Timakum andXinyuan Zhang 2020 Examining influential fac-tors for acknowledgements classification using su-pervised learning PLOS ONE 15(2)1ndash21
H Stewart K Brown A M Dinan N Irigoyen E JSnijder and A E Firth 2018 Transcriptional andtranslational landscape of equine torovirus J Virol92(17)
Lucy Lu Wang Kyle Lo Yoganand ChandrasekharRussell Reas Jiangjiang Yang Darrin Eide KathrynFunk Rodney Kinney Ziyang Liu William Mer-rill Paul Mooney Dewey Murdick Devvret RishiJerry Sheehan Zhihong Shen Brandon StilsonAlex D Wade Kuansan Wang Chris Wilhelm BoyaXie Douglas Raymond Daniel S Weld Oren Et-zioni and Sebastian Kohlmeier 2020 CORD-19 the covid-19 open research dataset CoRRabs200410706
Jian Wu Jason Killian Huaiyu Yang Kyle WilliamsSagnik Ray Choudhury Suppawong Tuarob Cor-nelia Caragea and C Lee Giles 2015 PdfmefA multi-entity knowledge extraction framework forscholarly documents and semantic search In Pro-ceedings of the 8th International Conference onKnowledge Capture K-CAP 2015 pages 131ndash138New York NY USA ACM
15
Step 2: A sentence with multiple entities in the objective part is split into shorter sentences called "subsentences", so that each subsentence is associated with at most one named entity. This is done by first splitting the sentence by "and". For each subsentence, if the subject and predicate are missing, we fill them in using the subject and predicate of the original sentence. The object in each subsentence does not necessarily contain an entity; for example, in the right panel of Figure 3, because "expertise" is not a named entity, it is replaced by "none" in the subsentence. The SPO relations that do not contain named entities are removed.
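A minimal sketch of this splitting step in Python. The function name and the string-based splitting are illustrative assumptions; the paper's actual implementation works on parsed subject-predicate-object relations rather than raw strings:

```python
import re

def split_subsentences(sentence, subject, predicate):
    # Hypothetical sketch of Step 2: split the sentence on "and" and
    # re-attach the original subject and predicate to fragments that
    # lack them. A real system would operate on a dependency parse.
    fragments = [f.strip(" ,") for f in re.split(r"\band\b", sentence)
                 if f.strip(" ,")]
    subsentences = []
    for frag in fragments:
        if not frag.lower().startswith(subject.lower()):
            # Fill in the missing subject and predicate from the
            # original sentence.
            frag = f"{subject} {predicate} {frag}"
        subsentences.append(frag)
    return subsentences

print(split_subsentences(
    "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab",
    subject="We", predicate="thank"))
# → ['We thank Anne Broadwater', 'We thank Nicolas Olivarez of the de Silva lab']
```

This mirrors the second example in Figure 3, where "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab..." yields two subsentences, each carrying the original subject and predicate.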
There are two scenarios. In the first, a sentence or subsentence may contain a list of entities of which only the first is acknowledged. In the third example of Figure 4, the named entities are ['Qi Yang', 'Morehouse School of Medicine', 'Atlanta', 'GA'], but only the first entity is acknowledged; the remaining entities, recognized as organizations or locations, merely supply additional information. In this scenario, only the first named entity is extracted.
Step 3: In the second scenario, acknowledged entities are connected by commas or "and", as in "We thank Shirley Hauta, Yurij Popowych, Elaine van Moorlehem and Yan Zhou for help in the virus production." Entities in this structure have the same type, indicating that they play similar roles. This parallel pattern can be captured by regular expressions. The entities resolved in Steps 2 and 3 are merged to form the final set.
The method to find the parallel structure is as follows. First, check whether each entity's type is person; if so, the entities are substituted with integer indexes. The sentence becomes "The authors would like to thank 0, 1 and 2 and the kind support of Bayer Animal Health GmbH and Virbac Group." If there are three or more consecutive numbers in this form, this is the parallel pattern, which is captured by a regular expression. The pattern also allows text between names (Figure 5). Next, the numbers in this part are extracted and mapped to the corresponding entities. In the example above, the numbers [0, 1, 2] correspond to the indexes of the entities [Norbert Mencke, Lourdes Mottier, David McGahie]. Similar pattern recognition is performed for organizations. The process is depicted in Figure 5.
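The index substitution and pattern matching described above can be sketched as follows. The function name and the exact regular expression are illustrative assumptions (the paper's fuller pattern also tolerates text between names, which this sketch keeps simple):

```python
import re

def extract_parallel_entities(sentence, candidates):
    # Hypothetical sketch of Step 3: substitute each candidate entity of
    # the same type with its integer index, then search for a run of
    # three or more indexes joined by commas and "and".
    indexed = sentence
    for i, name in enumerate(candidates):
        indexed = indexed.replace(name, str(i))
    # Parallel pattern: N, N, ..., and N
    match = re.search(r"\d+(?:\s*,\s*\d+)+\s*,?\s*and\s+\d+", indexed)
    if not match:
        return []
    nums = [int(n) for n in re.findall(r"\d+", match.group())]
    return [candidates[n] for n in nums]

persons = ["Norbert Mencke", "Lourdes Mottier", "David McGahie"]
sent = ("The authors would like to thank Norbert Mencke, Lourdes Mottier "
        "and David McGahie, and the kind support of Bayer Animal Health "
        "GmbH and Virbac Group")
print(extract_parallel_entities(sent, persons))
# → ['Norbert Mencke', 'Lourdes Mottier', 'David McGahie']
```

Requiring three or more consecutive indexes, as the paper does, keeps the pattern from firing on sentences that merely mention two coordinated names.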
Figure 3: Process mapping multiple subject-predicate-object relations to acknowledgement entities, with two representative examples ("We would also like to thank the Canadian Wildlife Health Cooperative at the University of Saskatchewan for their support and expertise" and "We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab for technical assistance with the assays"). "None" means the object does not contain named entities.
Method                Precision  Recall  F1
Stanza                0.57       0.91    0.70
RB (without step 3)   0.94       0.78    0.85
RB                    0.94       0.90    0.92

Table 3: Performance of Stanza and our relation-based (RB) classifier. Precision and recall show overall results combining person and organization.
We investigated open information extraction methods such as REVERB (Fader et al., 2011) and OLLIE (Mausam et al., 2012). We found that they only work for sentences with relatively simple structures, such as "The authors wish to thank Ming-Wei Guo for reagents and technical assistance", but fail on more complicated sentences with a long objective part or parallel subsentences (Figure 4). We also investigated the semantic role labeling (SRL) library in the AllenNLP toolkit (Shi and Lin, 2019). For the third sentence in Figure 4, the AllenNLP SRL library resolves the entire string after "thank" as one argument but fails to distinguish entity types. Our method can handle all statements in Figure 4.
As a baseline, we use Stanza to extract all named entities and compare its performance with our relation-based (RB) classifier. The ground truth is compiled using the same corpus described in Section 4.5, except that only acknowledgement entities (as opposed to all named entities) are labeled positive. The results (Table 3) show that Stanza achieves high recall but poor precision, indicating that a significant fraction (~40%) of named entities are not acknowledged. In contrast, our RB classifier achieves a precision of 0.94 with a small loss of recall, yielding F1 = 0.92.
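The comparison in Table 3 amounts to set-based scoring over labeled (paper, entity) pairs. This toy sketch (with hypothetical data, not the paper's corpus) illustrates why a baseline that predicts every named entity scores high recall but low precision:

```python
def precision_recall_f1(predicted, gold):
    # Set-based scoring over (paper_id, entity) pairs.
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy data: the "predict every named entity" baseline finds both
# acknowledged persons but also two entities that merely add context.
gold = {("p1", "Anne Broadwater"), ("p1", "Nicolas Olivarez")}
predicted = gold | {("p1", "de Silva"), ("p1", "UNC")}
p, r, f1 = precision_recall_f1(predicted, gold)
print(round(p, 2), round(r, 2), round(f1, 2))
# → 0.5 1.0 0.67
```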
One limitation of the RB classifier is that it relies on text quality: sentences are expected to follow the editorial convention that entity names are clearly delimited by periods. If two sentences are not properly delimited, e.g., a period is missing, the classifier may make incorrect predictions.

Figure 4: Sentence examples from which REVERB or OLLIE failed to extract the acknowledgement entities (underlined in the original figure) that can be identified by our method: "To Aulio Costa Zambenedetti for the art work; Nilson Fidêncio, Silvio Marques, Tania Schepainski and Sibelli Tanjoni de Souza for technical assistance"; "The authors also thank the Program for Technological Development in Tools for Health-PDTIS FIOCRUZ for the use of its facilities (Platform RPT09H - Real Time PCR - Instituto Carlos Chagas/Fiocruz-PR)"; "The authors wish to thank Ms. Qi Yang, DNA sequencing lab, Morehouse School of Medicine, Atlanta, GA, and the Research Cores at Morehouse School of Medicine, supported through the Research Centers at Minority Institutions (RCMI) Program, NIH/NCRR/RCMI Grant G12-RR03034."

Figure 5: Merging the results given by step 2 ([Norbert Mencke, David McGahie]) and step 3 ([Norbert Mencke, Lourdes Mottier, David McGahie] and [Bayer Animal Health, Virbac Group]) to obtain the final set [Norbert Mencke, Lourdes Mottier, David McGahie, Bayer Animal Health, Virbac Group] for the sentence "The authors would like to thank Norbert Mencke, Lourdes Mottier and David McGahie, and the kind support of Bayer Animal Health GmbH and Virbac Group."
5 Data Analysis
The CORD-19 dataset contains PMC and PDF folders. We found that almost all papers in the PMC folders are included in the PDF folders. For consistency, we only work on papers in the PDF folders, which contain 37,612 unique PDF papers.7 Using ACKEXTRACT, we extracted a total of 102,965 named entities from 21,469 CORD-19 papers. The fraction of papers with named entities (21,469/37,612 ≈ 57%) is roughly consistent with the fraction obtained from the 200 samples (120/200 = 60%, Section 4.2). Among all named entities, 58,742 are acknowledgement entities. These numbers suggest that, under our model, only about 50% of named entities are acknowledged. By relying on entities recognized by NER software packages without further classification, acknowledgement studies could therefore significantly overestimate the number of acknowledgement entities. Here we analyze our results and study some trends of acknowledgement entities in the CORD-19 dataset.

7 We excluded 2,003 duplicate papers with exactly the same SHA-1 values.

Algorithm 1: Relation-Based Classifier
1: Function find_entity(subsentence):
2:   pre-process punctuation and format
3:   find candidate entities entity_list with Stanza
4:   for x ∈ entity_list do
5:     if x ∈ subject part (differs based on the predicate) then
6:       x ← none
7:   remove none values from entity_list
8:   return entity_list
9: Input: sentence
10: Output: entity_list_all
11: find the subject part, the predicate, and the object part
12: if predicate ∈ reg_expression_1 then
13:   if "and" ∈ sentence then
14:     split into subsentences by "and"
15:     for each subsentence do
16:       entity_list ← find_entity(subsentence)
17:       if entity_list ≠ none then
18:         entity_list_all ← append entity_list[0]
19:   else
20:     entity_list ← find_entity(sentence)
21:     entity_list_all ← entity_list[0]
22: else if predicate ∈ reg_expression_2 then
23:   do the same as above, but focus on entities in the subject
24: for x ∈ candidate_list from Stanza do
25:   if x ∈ sentence and x.type = type then
26:     orig_list ← append x
27:     sent ← replace x by index(x)
28: if reg_format ∈ sent then
29:   temp ← find reg_expression in sent
30:   numlist ← find numbers in temp
31:   list ← index orig_list by numlist
32: combine list and entity_list_all
The top 10 acknowledged organizations (Table 4) are all funding agencies. Overall, NIH (NIAID is an institute of NIH) is acknowledged the most, but funding agencies in other countries are also frequently acknowledged in CORD-19 papers.
Figure 6 shows the numbers of CORD-19 papers with vs. without acknowledgement (person or organization) recognized, from 1970 to 2020. The huge leap around 2002 was due to the wave of coronavirus studies during the SARS period. The small drop in 2020 was due to data incompleteness. The plot indicates that the fraction of papers with acknowledgement has been increasing gradually over the last 20 years. Figure 7 shows the counts for the top 10 acknowledged organizations from 1983 to 2020. The figure indicates that the number of acknowledgements to NIH has been gradually decreasing over the past 10 years, while the number for NSF is roughly constant. In contrast, the number for NSFC (a Chinese funding agency) has been gradually increasing. Note that the distribution of acknowledged organizations has a long tail, and the organizations outside the top 10 actually dominate the total count. However, the top 10 organizations are the biggest research agencies, and the trend to some extent reflects strategic shifts of funding support.

Organization                                                  Count
National Institutes of Health (NIH)                            1414
National Natural Science Foundation of China (NSFC)             615
National Institute of Allergy & Infectious Diseases (NIAID)     209
National Science Foundation (NSF)                               183
Ministry of Health                                              160
Wellcome Trust                                                  146
Deutsche Forschungsgemeinschaft                                 119
Ministry of Education                                           119
National Science Council                                        104
Public Health Service                                            85

Table 4: Top 10 acknowledged organizations identified in our CORD-19 dataset.

Figure 6: Numbers of CORD-19 papers with vs. without acknowledgement recognized, from 1970 to 2020.
6 Conclusion and Future Work
Here we extended the work of Khabsa et al. (2012) and built an acknowledgement extraction framework for research articles, denoted ACKEXTRACT. ACKEXTRACT is based on heuristic methods and state-of-the-art text mining libraries (e.g., GROBID and Stanza), but features a classifier that discriminates acknowledgement entities from named entities by analyzing the multiple subject-predicate-object relations in a sentence. Our approach successfully recognizes acknowledgement entities that cannot be recognized by OIE packages such as REVERB, OLLIE, and the AllenNLP SRL library.

Figure 7: Trends for the top 10 acknowledged organizations from 1983 to 2020 (Public Health Service, National Science Council, Ministry of Education, Deutsche Forschungsgemeinschaft, Wellcome Trust, Ministry of Health, NIAID, NSF, NSFC, and NIH).
This method is applied to the CORD-19 dataset released on April 10, 2020, processing one PDF document in 5 seconds on average. Our results indicate that only 50–60% of named entities are acknowledged; the rest are mentioned to provide additional information (e.g., affiliation or location) about acknowledgement entities. Working on clean data, our method achieves an overall F1 = 0.92 for person and organization entities. The trend analysis of the CORD-19 papers verifies that more and more papers include acknowledgement entities since 2002, when the SARS outbreak happened. The trend also reveals that the overall number of acknowledgements to NIH has been gradually decreasing over the past 10 years, while more papers acknowledge NSFC, a Chinese funding agency. One caveat of our method is that organizations in different countries are not distinguished; for example, many countries have agencies called "Ministry of Health". In the future, we plan to build learning-based models for sentence classification and entity classification. The code and data of this project have been released on GitHub at https://github.com/lamps-lab/ackextract.
Acknowledgments
This work was partially supported by the Defense Advanced Research Projects Agency (DARPA) under cooperative agreement No. W911NF-19-2-0272. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

James Andreoni and Laura K. Gee. 2015. Gunning for efficiency with third party enforcement in threshold public goods. Experimental Economics, 18(1):154–171.

Hannah Bast and Claudius Korzen. 2017. A benchmark and evaluation for text extraction from PDF. In 2017 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19-23, 2017, pages 99–108. IEEE Computer Society.

Steven Bird. 2006. NLTK: the natural language toolkit. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006. The Association for Computer Linguistics.

Blaise Cronin. 2001. Acknowledgement trends in the research literature of information science. Journal of Documentation, 57(3):427–433.

Isaac G. Councill, C. Lee Giles, Hui Han, and Eren Manavoglu. 2005. Automatic acknowledgement indexing: Expanding the semantics of contribution in the CiteSeer digital library. In Proceedings of the 3rd International Conference on Knowledge Capture, K-CAP '05, pages 19–26, New York, NY, USA. ACM.

Blaise Cronin, Gail McKenzie, Lourdes Rubio, and Sherrill Weaver-Wozniak. 1993. Accounting for influence: Acknowledgments in contemporary sociology. Journal of the American Society for Information Science, 44(7):406–412.

Blaise Cronin, Gail McKenzie, and Michael Stiffler. 1992. Patterns of acknowledgement. Journal of Documentation.

S. Dai, Y. Ding, Z. Zhang, W. Zuo, X. Huang, and S. Zhu. 2019. GrantExtractor: Accurate grant support information extraction from biomedical fulltext based on Bi-LSTM-CRF. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pages 1–1.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, pages 1535–1545.

C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, USA, pages 89–98.

C. Lee Giles and Isaac G. Councill. 2004. Who gets acknowledged: Measuring scientific contributions through automatic acknowledgment indexing. Proceedings of the National Academy of Sciences, 101(51):17599–17604.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Subhradeep Kayal, Zubair Afzal, George Tsatsaronis, Marius Doornenbal, Sophia Katrenko, and Michelle Gregory. 2019. A framework to automatically extract funding information from text. In Machine Learning, Optimization, and Data Science, pages 317–328, Cham. Springer International Publishing.

Madian Khabsa, Pucktada Treeratpituk, and C. Lee Giles. 2012. AckSeer: a repository and search engine for automatically extracted acknowledgments from digital libraries. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '12, Washington, DC, USA, June 10-14, 2012, pages 185–194. ACM.

Mario Lipinski, Kevin Yao, Corinna Breitinger, Joeran Beel, and Bela Gipp. 2013. Evaluation of header metadata extraction approaches and tools for scientific PDF documents. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, pages 385–386, New York, NY, USA. ACM.

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S. Weld. 2019. S2ORC: The Semantic Scholar open research corpus.

Patrice Lopez. 2009. GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries, ECDL'09, pages 473–474, Berlin, Heidelberg. Springer-Verlag.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 523–534, Stroudsburg, PA, USA. Association for Computational Linguistics.

Adèle Paul-Hus, Adrián A. Díaz-Faes, Maxime Sainte-Marie, Nadine Desrochers, Rodrigo Costas, and Vincent Larivière. 2017. Beyond funding: Acknowledgement patterns in biomedical, natural and social sciences. PLOS ONE, 12(10):1–14.

Adèle Paul-Hus, Philippe Mongeon, Maxime Sainte-Marie, and Vincent Larivière. 2020. Who are the acknowledgees? An analysis of gender and academic status. Quantitative Science Studies, 0(0):1–17.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. CoRR, abs/2003.07082.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA.

Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. CoRR, abs/1904.05255.

Mayank Singh, Barnopriyo Barua, Priyank Palod, Manvi Garg, Sidhartha Satapathy, Samuel Bushi, Kumar Ayush, Krishna Sai Rohith, Tulasi Gamidi, Pawan Goyal, and Animesh Mukherjee. 2016. OCR++: A robust framework for information extraction from scholarly articles. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 3390–3400. ACL.

E. Smith, B. Williams-Jones, Z. Master, V. Larivière, C. R. Sugimoto, A. Paul-Hus, M. Shi, and D. B. Resnik. 2019. Misconduct and misbehavior related to authorship disagreements in collaborative science. Sci Eng Ethics.

Min Song, Keun Young Kang, Tatsawan Timakum, and Xinyuan Zhang. 2020. Examining influential factors for acknowledgements classification using supervised learning. PLOS ONE, 15(2):1–21.

H. Stewart, K. Brown, A. M. Dinan, N. Irigoyen, E. J. Snijder, and A. E. Firth. 2018. Transcriptional and translational landscape of equine torovirus. J Virol, 92(17).

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: the COVID-19 open research dataset. CoRR, abs/2004.10706.

Jian Wu, Jason Killian, Huaiyu Yang, Kyle Williams, Sagnik Ray Choudhury, Suppawong Tuarob, Cornelia Caragea, and C. Lee Giles. 2015. PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 131–138, New York, NY, USA. ACM.
Anthony Fader Stephen Soderland and Oren Etzioni2011 Identifying relations for open information ex-traction In Proceedings of the 2011 Conference onEmpirical Methods in Natural Language ProcessingEMNLP 2011 27-31 July 2011 John McIntyre Con-ference Centre Edinburgh UK A meeting of SIG-DAT a Special Interest Group of the ACL pages1535ndash1545
C Lee Giles Kurt D Bollacker and Steve Lawrence1998 CiteSeer An automatic citation indexing sys-tem In Proceedings of the 3rd ACM InternationalConference on Digital Libraries June 23-26 1998Pittsburgh PA USA pages 89ndash98
C Lee Giles and Isaac G Councill 2004 Whogets acknowledged Measuring scientific contribu-tions through automatic acknowledgment indexingProceedings of the National Academy of Sciences101(51)17599ndash17604
Matthew Honnibal and Ines Montani 2017 spaCy 2Natural language understanding with Bloom embed-dings convolutional neural networks and incremen-tal parsing To appear
Subhradeep Kayal Zubair Afzal George TsatsaronisMarius Doornenbal Sophia Katrenko and MichelleGregory 2019 A framework to automatically ex-tract funding information from text In MachineLearning Optimization and Data Science pages317ndash328 Cham Springer International Publishing
Madian Khabsa Pucktada Treeratpituk and C LeeGiles 2012 Ackseer a repository and search en-gine for automatically extracted acknowledgmentsfrom digital libraries In Proceedings of the12th ACMIEEE-CS Joint Conference on Digital Li-braries JCDL rsquo12 Washington DC USA June 10-14 2012 pages 185ndash194 ACM
Mario Lipinski Kevin Yao Corinna Breitinger Jo-eran Beel and Bela Gipp 2013 Evaluation ofheader metadata extraction approaches and tools forscientific pdf documents In Proceedings of the13th ACMIEEE-CS Joint Conference on Digital Li-braries JCDL rsquo13 pages 385ndash386 New York NYUSA ACM
Kyle Lo Lucy Lu Wang Mark Neumann Rodney Kin-ney and Dan S Weld 2019 S2orc The semanticscholar open research corpus
Patrice Lopez 2009 Grobid Combining auto-matic bibliographic data recognition and term ex-traction for scholarship publications In Proceed-ings of the 13th European Conference on Re-search and Advanced Technology for Digital Li-braries ECDLrsquo09 pages 473ndash474 Berlin Heidel-berg Springer-Verlag
Christopher D Manning Mihai Surdeanu John BauerJenny Finkel Steven J Bethard and David Mc-Closky 2014 The Stanford CoreNLP natural lan-guage processing toolkit In Association for Compu-tational Linguistics (ACL) System Demonstrationspages 55ndash60
Mausam Michael Schmitz Robert Bart StephenSoderland and Oren Etzioni 2012 Open languagelearning for information extraction In Proceedingsof the 2012 Joint Conference on Empirical Methodsin Natural Language Processing and ComputationalNatural Language Learning EMNLP-CoNLL rsquo12
19
pages 523ndash534 Stroudsburg PA USA Associationfor Computational Linguistics
Adele Paul-Hus Adrian A Dıaz-Faes Maxime Sainte-Marie Nadine Desrochers Rodrigo Costas and Vin-cent Lariviere 2017 Beyond funding Acknowl-edgement patterns in biomedical natural and socialsciences PLOS ONE 12(10)1ndash14
Adele Paul-Hus Philippe Mongeon Maxime Sainte-Marie and Vincent Lariviere 2020 Who are theacknowledgees an analysis of gender and academicstatus Quantitative Science Studies 0(0)1ndash17
Peng Qi Yuhao Zhang Yuhui Zhang Jason Boltonand Christopher D Manning 2020 Stanza Apython natural language processing toolkit for manyhuman languages CoRR abs200307082
Radim Rehurek and Petr Sojka 2010 SoftwareFramework for Topic Modelling with Large Cor-pora In Proceedings of the LREC 2010 Workshopon New Challenges for NLP Frameworks pages 45ndash50 Valletta Malta ELRA httpismuniczpublication884893en
Peng Shi and Jimmy Lin 2019 Simple BERT mod-els for relation extraction and semantic role labelingCoRR abs190405255
Mayank Singh Barnopriyo Barua Priyank PalodManvi Garg Sidhartha Satapathy Samuel BushiKumar Ayush Krishna Sai Rohith Tulasi GamidiPawan Goyal and Animesh Mukherjee 2016OCR++ A robust framework for information extrac-tion from scholarly articles In COLING 2016 26thInternational Conference on Computational Linguis-tics Proceedings of the Conference Technical Pa-pers December 11-16 2016 Osaka Japan pages3390ndash3400 ACL
E Smith B Williams-Jones Z Master V LariviereC R Sugimoto A Paul-Hus M Shi and D BResnik 2019 Misconduct and misbehavior relatedto authorship disagreements in collaborative scienceSci Eng Ethics
Min Song Keun Young Kang Tatsawan Timakum andXinyuan Zhang 2020 Examining influential fac-tors for acknowledgements classification using su-pervised learning PLOS ONE 15(2)1ndash21
H Stewart K Brown A M Dinan N Irigoyen E JSnijder and A E Firth 2018 Transcriptional andtranslational landscape of equine torovirus J Virol92(17)
Lucy Lu Wang Kyle Lo Yoganand ChandrasekharRussell Reas Jiangjiang Yang Darrin Eide KathrynFunk Rodney Kinney Ziyang Liu William Mer-rill Paul Mooney Dewey Murdick Devvret RishiJerry Sheehan Zhihong Shen Brandon StilsonAlex D Wade Kuansan Wang Chris Wilhelm BoyaXie Douglas Raymond Daniel S Weld Oren Et-zioni and Sebastian Kohlmeier 2020 CORD-19 the covid-19 open research dataset CoRRabs200410706
Jian Wu Jason Killian Huaiyu Yang Kyle WilliamsSagnik Ray Choudhury Suppawong Tuarob Cor-nelia Caragea and C Lee Giles 2015 PdfmefA multi-entity knowledge extraction framework forscholarly documents and semantic search In Pro-ceedings of the 8th International Conference onKnowledge Capture K-CAP 2015 pages 131ndash138New York NY USA ACM
Organization                                                  Count
National Institutes of Health (NIH)                            1414
National Natural Science Foundation of China (NSFC)             615
National Institute of Allergy & Infectious Diseases (NIAID)     209
National Science Foundation (NSF)                               183
Ministry of Health                                              160
Wellcome Trust                                                  146
Deutsche Forschungsgemeinschaft                                 119
Ministry of Education                                           119
National Science Council                                        104
Public Health Service                                            85

Table 4: Top 10 acknowledged organizations identified in our CORD-19 dataset.
Figure 6: Numbers of CORD-19 papers with vs. without acknowledgement recognized from 1970 to 2020.
papers with acknowledgements has been increasing gradually over the last 20 years. Figure 7 shows the numbers of acknowledgements to the top 10 acknowledged organizations from 1983 to 2020. The figure indicates that the number of acknowledgements to NIH has been gradually decreasing over the past 10 years, while the number of acknowledgements to NSF is roughly constant. In contrast, the number of acknowledgements to NSFC (a Chinese funding agency) has been gradually increasing. Note that the distribution of acknowledged organizations has a long tail, and the organizations behind the top 10 actually dominate the total count. However, the top 10 organizations are the biggest research agencies, and the trend to some extent reflects strategic shifts of funding support.
6 Conclusion and Future Work
Figure 7: Trends of the top 10 acknowledged organizations from 1983 to 2020.

Here we extended the work of Khabsa et al. (2012) and built an acknowledgement extraction framework for research articles, denoted ACKEXTRACT. ACKEXTRACT is based on heuristic methods and state-of-the-art text mining libraries (e.g., GROBID and Stanza), but features a classifier that discriminates acknowledgement entities from named entities by analyzing the multiple subject-predicate-object relations in a sentence. Our approach successfully recognizes acknowledgement entities that cannot be recognized by OIE packages such as REVERB, OLLIE, and the AllenNLP SRL library.
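The core filtering idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the gratitude-predicate list, the triple representation, and the names "John Smith" and "Acme University" are assumptions made for the example. The point is that a named entity counts as acknowledged only when it appears as the object of a gratitude predicate; entities that merely provide context (e.g., an affiliation) are filtered out.

```python
# Illustrative sketch of acknowledgement-entity filtering via
# subject-predicate-object relations (hypothetical predicate list).
GRATITUDE_PREDICATES = {"thank", "acknowledge", "appreciate", "support", "fund"}

def acknowledged_entities(triples, named_entities):
    """triples: (subject, predicate_lemma, object) tuples from one sentence;
    named_entities: entity strings found by an NER tagger.
    Returns the subset of named entities that are actually acknowledged."""
    acked = set()
    for subj, pred, obj in triples:
        if pred in GRATITUDE_PREDICATES and obj in named_entities:
            acked.add(obj)
    return acked

# "We thank John Smith. John Smith is at Acme University."
triples = [
    ("We", "thank", "John Smith"),        # gratitude relation
    ("John Smith", "be", "Acme University"),  # contextual relation only
]
ents = {"John Smith", "Acme University"}
print(acknowledged_entities(triples, ents))  # {'John Smith'}
```

Only "John Smith" is returned: "Acme University" is a named entity but is not acknowledged, which is exactly the distinction that named-entity recognition alone cannot make.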
This method was applied to the CORD-19 dataset released on April 10, 2020, processing one PDF document in 5 seconds on average. Our results indicate that only 50–60% of named entities are acknowledged; the rest are mentioned to provide additional information (e.g., affiliation or location) about acknowledgement entities. Working on clean data, our method achieves an overall F1 = 0.92 for person and organization entities. The trend analysis of the CORD-19 papers verifies that more and more papers have included acknowledgement entities since 2002, when the SARS outbreak happened. The trend also reveals that the overall number of acknowledgements to NIH has been gradually decreasing over the past 10 years, while more papers acknowledge NSFC, a Chinese funding agency. One caveat of our method is that organizations in different countries are not distinguished; for example, many countries have agencies called "Ministry of Health". In the future, we plan to build learning-based models for sentence classification and entity classification. The code and data of this project have been released on GitHub at https://github.com/lamps-lab/ackextract.
Acknowledgments
This work was partially supported by the Defense Advanced Research Projects Agency (DARPA) under cooperative agreement No. W911NF-19-2-0272. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

James Andreoni and Laura K. Gee. 2015. Gunning for efficiency with third party enforcement in threshold public goods. Experimental Economics, 18(1):154–171.

Hannah Bast and Claudius Korzen. 2017. A benchmark and evaluation for text extraction from PDF. In 2017 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19-23, 2017, pages 99–108. IEEE Computer Society.

Steven Bird. 2006. NLTK: the natural language toolkit. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006. The Association for Computer Linguistics.

Cronin Blaise. 2001. Acknowledgement trends in the research literature of information science. 57(3):427–433.

Isaac G. Councill, C. Lee Giles, Hui Han, and Eren Manavoglu. 2005. Automatic acknowledgement indexing: Expanding the semantics of contribution in the CiteSeer digital library. In Proceedings of the 3rd International Conference on Knowledge Capture, K-CAP '05, pages 19–26, New York, NY, USA. ACM.

Blaise Cronin, Gail McKenzie, Lourdes Rubio, and Sherrill Weaver-Wozniak. 1993. Accounting for influence: Acknowledgments in contemporary sociology. Journal of the American Society for Information Science, 44(7):406–412.

Blaise Cronin, Gail McKenzie, and Michael Stiffler. 1992. Patterns of acknowledgement. Journal of Documentation.

S. Dai, Y. Ding, Z. Zhang, W. Zuo, X. Huang, and S. Zhu. 2019. GrantExtractor: Accurate grant support information extraction from biomedical fulltext based on Bi-LSTM-CRF. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pages 1–1.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, pages 1535–1545.

C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, USA, pages 89–98.

C. Lee Giles and Isaac G. Councill. 2004. Who gets acknowledged? Measuring scientific contributions through automatic acknowledgment indexing. Proceedings of the National Academy of Sciences, 101(51):17599–17604.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Subhradeep Kayal, Zubair Afzal, George Tsatsaronis, Marius Doornenbal, Sophia Katrenko, and Michelle Gregory. 2019. A framework to automatically extract funding information from text. In Machine Learning, Optimization, and Data Science, pages 317–328, Cham. Springer International Publishing.

Madian Khabsa, Pucktada Treeratpituk, and C. Lee Giles. 2012. AckSeer: a repository and search engine for automatically extracted acknowledgments from digital libraries. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '12, Washington, DC, USA, June 10-14, 2012, pages 185–194. ACM.

Mario Lipinski, Kevin Yao, Corinna Breitinger, Joeran Beel, and Bela Gipp. 2013. Evaluation of header metadata extraction approaches and tools for scientific PDF documents. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, pages 385–386, New York, NY, USA. ACM.

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S. Weld. 2019. S2ORC: The Semantic Scholar open research corpus.

Patrice Lopez. 2009. GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries, ECDL'09, pages 473–474, Berlin, Heidelberg. Springer-Verlag.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 523–534, Stroudsburg, PA, USA. Association for Computational Linguistics.

Adèle Paul-Hus, Adrián A. Díaz-Faes, Maxime Sainte-Marie, Nadine Desrochers, Rodrigo Costas, and Vincent Larivière. 2017. Beyond funding: Acknowledgement patterns in biomedical, natural and social sciences. PLOS ONE, 12(10):1–14.

Adèle Paul-Hus, Philippe Mongeon, Maxime Sainte-Marie, and Vincent Larivière. 2020. Who are the acknowledgees? An analysis of gender and academic status. Quantitative Science Studies, 0(0):1–17.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. CoRR, abs/2003.07082.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. CoRR, abs/1904.05255.

Mayank Singh, Barnopriyo Barua, Priyank Palod, Manvi Garg, Sidhartha Satapathy, Samuel Bushi, Kumar Ayush, Krishna Sai Rohith, Tulasi Gamidi, Pawan Goyal, and Animesh Mukherjee. 2016. OCR++: A robust framework for information extraction from scholarly articles. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 3390–3400. ACL.

E. Smith, B. Williams-Jones, Z. Master, V. Larivière, C. R. Sugimoto, A. Paul-Hus, M. Shi, and D. B. Resnik. 2019. Misconduct and misbehavior related to authorship disagreements in collaborative science. Sci Eng Ethics.

Min Song, Keun Young Kang, Tatsawan Timakum, and Xinyuan Zhang. 2020. Examining influential factors for acknowledgements classification using supervised learning. PLOS ONE, 15(2):1–21.

H. Stewart, K. Brown, A. M. Dinan, N. Irigoyen, E. J. Snijder, and A. E. Firth. 2018. Transcriptional and translational landscape of equine torovirus. J Virol, 92(17).

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The COVID-19 open research dataset. CoRR, abs/2004.10706.

Jian Wu, Jason Killian, Huaiyu Yang, Kyle Williams, Sagnik Ray Choudhury, Suppawong Tuarob, Cornelia Caragea, and C. Lee Giles. 2015. PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 131–138, New York, NY, USA. ACM.