
Automating Scholarly Article Data Collection with Action Science Explorer

Sehrish Amjad1, Hamid Mukhtar1, Cody Dunne2

1 National University of Sciences and Technology, Islamabad, Pakistan; 2 IBM Watson, Cambridge, MA, USA

{11msitsamjad, hamid.mukhtar}@seecs.edu.pk, [email protected]

Abstract— Keeping up with rapidly emerging research fronts in various inter-disciplinary fields requires significant effort from scholars and researchers. These users are concerned not only with finding relevant articles or websites, but also with gaining an understanding of key articles, authors, citation information, and current trends. Several tools such as Action Science Explorer (ASE) have been developed to evaluate the network of citations between articles, recognize important papers and their clusters, summarize them automatically, delve into the full text of papers to fetch context, generate reviews, create annotations, and finally export results in numerous document authoring formats. Although ASE is useful for researchers and scholars, as a research prototype it is limited and has been tested only on data from the ACL Anthology Network. ASE does not have the ability to automatically import and process scholarly articles from external repositories such as Google Scholar, IEEE Xplore, and the ACM Digital Library.

This paper contributes an enhanced ASE which automates the data import process: starting with a web search, then generating a citation network, statistics, text analytics, and cluster summaries. Our enhanced ASE gives researchers in many fields the ability to gain an understanding of their academic literature: the key papers, authors, research fronts, hypotheses, and state of the art.

Keywords—Information Visualization; Annotations; Text Analytics; Statistics; Interactive Data Exploration; In-Cite; Out-Cite; Citation Context

I. INTRODUCTION

Research can be described as a systematic process of gathering, evaluating and interpreting information to seek out solutions for a given problem. These problems can be theoretical or applied, but the general purpose in solving them is to increase the breadth of human knowledge. It is pivotal as part of this contributory process to effectively communicate newfound understanding of the studied phenomena. For advocates of open societies, this communication can be seen as the core responsibility of being a researcher.

Academic fields generally have established or understood rules and standards for conducting experiments and resolving ethical dilemmas. However, there are few guidelines for other key parts of the research process, which can be more of a subjective experience. At one time or another, almost all novice researchers and students face problems such as:

• Selection of new topic/lack of confidence to take up a new study;

• Poor organization of research material, which includes searching, storing and writing of articles; and, relatedly,

• Reference management.

Often researchers have difficulty with selecting a new topic or area of study and gaining the confidence to explore it deeply. Additionally, their progress can be hampered by poor organization of their research material. These difficulties may occur in different forms and at various stages of research. They can be alleviated, however, by conducting research in a systematic way. This can be accomplished both through adopting known effective methodologies as well as gaining experience in the field.

One of the most important problems faced by novice researchers and scholars is selecting a novel topic, studying it within available resources, and being able to do so without close supervision. Due to the tremendous pace of progress in rapidly developing research fields, keeping abreast of the state of the art is becoming increasingly challenging. Research in all disciplines and subjects must begin with a clearly defined jump-off point. Proper science requires this practice, yet it is a tough responsibility even for an experienced researcher, much less a novice.

Identification of key articles to study can be difficult, especially in the increasing number of inter-disciplinary fields where relevant references may be spread across several different domains. Yet this is a critical skill for many research roles. Researchers and scholars need to quickly gain understanding of an upcoming research area. A researcher who wishes to apply a state-of-the-art technique to his research will need to rapidly identify leading papers and recent breakthroughs. Review panel members who are reviewing grant proposals from unfamiliar fields may need to identify open questions, current trends, and emerging areas of study. Graduate students who need to become familiar with research in their chosen areas may search for historical papers, leading authors, appropriate publication venues, and current research methodologies [2].

Several approaches have been developed to integrate these disparate sources of academic literature, as well as to explore and summarize it. These include reference managers (RMs), search engines, research visualizers, recommender systems, and automatic summarizers. Tools developed for analyzing scholarly literature typically try to support a diverse range of tasks and different levels of users. With these, users in various roles are capable of searching academic literature to find answers to multiple, inter-related questions.

Tools based on search engines naturally center around searching for a known quantity. However, end-users may not have a clear objective in mind. Moreover, they may have difficulty expressing their goal as a query which a search engine can easily respond to, such as identifying the most important authors or prominent research communities in a field.

Existing reference managers and digital libraries support a diverse range of inter-related features, indicating that there is little consensus regarding which components are most significant for the exploration of academic literature.

Research exploration and visualization tools are quite helpful in rapidly generating accurate survey articles, providing readers with concise overviews tailored to their needs. But most of them support visualization only to a limited extent. Digital repositories continue to expand due to new scholarly literature being written and digitization efforts, but there is a dearth of tools integrating interactive visualizations, statistics, and text analytics.

The primary contribution of this paper is a modification of Action Science Explorer (ASE) [1], [2], an academic literature exploration and analysis tool, to incorporate the following key features:

• Web Search,

• Incorporating data from external resources and further processing for syntax matching,

• Organization of full-text research material, and

• Additional visualization techniques for focused searches using a Radial Tree Graph.

The motivation behind this project is to enable researchers to discover knowledge and highlight the latest trends in research more effectively. We aim to meet the needs of students, educators, scientists, and government decision makers for learning about scientific fields. Information visualization techniques can greatly aid in this task, as properly formatted visual information can be processed rapidly by users. Rather than requiring users to pore over textual data, visualizations immediately expose patterns, trends, and outliers by leveraging our perceptual abilities. Thus, we chose to leverage the capabilities of ASE, an existing open source software tool that provides a framework for integrating visualization, text analytics, and statistics for academic literature exploration. We augment these features by integrating web search, data import, full text organization, and additional network visualization layouts. Our hope is that researchers at all levels all over the world will now be able to use ASE and benefit from its real-time interactive data exploration and data analysis.

Section I of this paper briefly introduces the problems faced by researchers, the contribution of the paper, and the motivation behind this contribution. Section II describes current systems developed for searching and exploring academic literature. In Section III, our enhancement to ASE is discussed along with software architecture details. Section IV presents the implementation details and demonstrates the functionality of the extended ASE with screenshots. Conclusions and future work are discussed in Section V.

II. RELATED WORK

The research process has been altered tremendously by the development of a large number of tools for exploring, summarizing, visualizing, ranking, and filtering academic literature.

A. Academic Literature Management

Various tools exist to aid in the exploration and summarization of scholarly literature, and each has particular strengths and features. In this paper, we categorize these tools into four main categories: research explorers, search engines, reference managers, and summarizers/recommenders. A thorough analysis of these tools is performed to present the general functionality and realize the need for improvement. Below, we briefly describe each of these tools.

Action Science Explorer (ASE) [1], [2] is a new tool which incorporates many capabilities of already developed literature-centric information retrieval systems. These features include

• Search and Data Import

• Reference Management

• Citation Network Statistics and Visualization

• Citation Text

• Multi-Document Summarization

Additionally, it presents several capabilities unique among literature search and exploration tools which include filtering and ranking papers by statistics of the citation network, automatic detection and visualization of clusters of articles determined from the citation topology, and generation of computer-created summaries of individual clusters.

ASE is partially an integration of two powerful existing tools: the SocialAction network analysis tool and the JabRef reference manager. SocialAction provides us with powerful network analysis capabilities including force-directed citation network visualization, ranking and filtering papers by statistical measures, scatterplots of paper attributes and statistics, categorical and numerical range coloring, and automatic cluster detection. Using visualizations of the citation network we can easily find unexpected trends, clusters, gaps and outliers. Additionally, visualizations can immediately identify invalid data that is easily missed in tabular views.

JabRef [10] supplies all the features one would expect from a reference manager, including searching using simple regular expressions, automatic and manual grouping of papers, DOI and URL links, PDF full text with annotations, abstracts, user generated reviews and text annotations, and many ways of exporting. It integrates with Microsoft Word, OpenOffice.org, and LaTeX/BibTeX, which allows quick adding of citations to discovered articles when writing survey papers.

ASE integrates many components in order to provide multiple coordinated views of the data. When any node or cluster is selected, the In-Cite Text window displays the text of all incoming citations to the paper(s), i.e. the whole sentences from the citing papers that include the citation to the selected paper(s). These are displayed in a hyperlinked list that allows the user to select any one of them to show its surrounding context in the Out-Cite Text window. This window shows the full text of the paper citing one of the selected papers, with highlighting showing the selected citation sentence as well as any other sentences that include hyperlinked citations to other papers. The last view is the summary window, which can contain various multi-document summaries of a selected cluster. Using automatic summarization techniques, users can summarize all of the incoming citations to papers within that cluster, hopefully providing key insights into that research community.
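The In-Cite text described above consists of the whole sentences in a citing paper that contain a citation to the selected paper. The following is a minimal sketch of that idea, not the extraction method used by ASE: the function name `citation_contexts`, the naive sentence splitter, and the assumption that citations appear as numeric markers such as "[3]" or "[1, 3]" are all illustrative.

```python
import re

def citation_contexts(full_text, ref_number):
    """Return the sentences of full_text that cite reference [ref_number].

    Hypothetical helper: assumes numeric markers like "[3]" or "[1, 3]".
    """
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", full_text.strip())
    # Match the number alone or inside a comma-separated group, but not "[13]".
    marker = re.compile(
        r"\[(?:\d+,\s*)*" + str(ref_number) + r"(?:,\s*\d+)*\]"
    )
    return [s for s in sentences if marker.search(s)]

# Invented sample text, used only to exercise the function.
text = (
    "Cloud platforms scale elastically [3]. Costs remain hard to predict. "
    "Earlier surveys [1, 3] reached similar conclusions. See also [13]."
)
contexts = citation_contexts(text, 3)
for sentence in contexts:
    print(sentence)
```

Real papers use many citation styles (author-year, superscripts, ranges such as "[2-5]"), so a production extractor needs a dedicated citation parser rather than a single regular expression.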

Visual Understanding Environment (VUE) [14] is a concept and content mapping tool, developed to support learning, teaching, and research and to ease the organization, contextualization, and access of digital information. With the help of a simple set of applications and a basic visual grammar consisting of nodes and links, researchers, faculty members and students can easily draw relationships between ideas, concepts and digital content.

CiteSeerX [3], [4] is an evolving scientific literature digital library and search engine that incorporates citation indexing and linking, automatic metadata extraction, citation statistics, lists of citing papers, and at one time citation context as well. These statistics and summaries can help explain the impact of a paper and the contribution of a scholar.

Google Scholar [5] allows its users to search across a wide range of academic literature. Google Scholar applies the Google search interface to patents, articles, and court cases. Moreover, it provides the ability to explore related works, citations, authors, and publications. However, it provides less of a variety of metadata and statistics compared to CiteSeerX. CiteSeerX and Google Scholar are both useful tools for generating the summary of individual papers or authors, but may not be ideal for producing the summary of an entire corpus.

Another search engine, GoPubMed [6] uses a knowledge-based approach specifically for biomedical texts. It presents publications by country and year in a non-interactive visualization, as well as numeric results for keywords and journals. Moreover, it displays a co-author relationship network visualization for top authors. However, the lack of interaction substantially hampers the effectiveness and increases task time, by requiring users to reload the page to update the shown data. Moreover, these data are aggregated across journals, authors, and search terms rather than per article, which potentially restricts the analysis capabilities.

Broader in scope, Web of Knowledge [7] is an academic citation indexing and search service. It enables its users to acquire, analyze, and disseminate information about articles. It also provides visualizations of article citations which support limited dynamic interaction. However, several interactions, such as filtering by publication year, require the citation tree to be recreated, which slows down exploration and interaction. Web of Knowledge supports additional features such as lists of citing papers, document statistics, filtering, and ranking.

Specific publishers and educational societies provide digital libraries of academic literature, such as the ACM Portal [8] and IEEE Xplore [9]. They provide document search, metadata, full text, statistics, and often citation linking.

Reference managers such as JabRef [10], Zotero [11], Mendeley [12] and EndNote [13] also provide features for exploration, and some have limited summarization capabilities. For instance, users can import a set of articles into Mendeley or EndNote and then search the full text of documents, create notes on a document, and analyze limited statistics such as the number of publications per author. However, most of these features in reference managers are not much better for summarizing groups of papers than those of the source digital libraries, and in some ways are less powerful because reference managers do not tend to provide lists of citing papers or citation context.

In addition, recommender systems can support the exploration of academic literature, whereas summarizers and trimmers are used to readily generate a summary of a group of papers to give an overview of the details. Recommender systems recommend articles on the basis of a given input, but such systems do not necessarily give end-users a general idea of a domain, nor do they freely permit users to search the academic literature.

B. Tool Analysis

Table 1 summarizes the feature comparison of several of the above-mentioned search engines, reference managers, summarizers/recommenders, and research visualization tools. This is an expanded version of the table from Gove et al. [2].

From the analysis in Table 1, it can be observed that although ASE is a powerful tool for exploring relevant literature, it lacks web search, alternate visualization techniques, sharing and collaboration among research groups, and incorporation of literature from external repositories such as IEEE Xplore, the ACM Digital Library, Google Scholar, and Microsoft Academic Search (MAS). ASE was initially created as a research prototype, and was built to explore articles from the ACL Anthology Network [15]. When it was created, there was no easy way to extract the necessary citation and citation context information from general digital libraries.

As far as visualization is concerned, with a standard force-directed node-link citation visualization users are unable to egocentrically explore the paper citation network. For example, they cannot fix any node as a central point and iteratively expand outwards as they explore the incoming and outgoing citations.
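Egocentric exploration of this kind, fixing one paper as the center and expanding outward along incoming and outgoing citations, can be sketched as a bounded breadth-first search. This is only an illustration under stated assumptions: the function name `expand_from`, the `(citing, cited)` pair representation, and the sample identifiers are hypothetical and are not taken from ASE.

```python
from collections import deque

def expand_from(focus, cites, max_hops):
    """Return {paper: hop distance} for papers within max_hops of focus.

    cites is a list of (citing, cited) pairs; citations are followed in
    both directions so that in-cites and out-cites are both reached.
    """
    neighbours = {}
    for citing, cited in cites:
        neighbours.setdefault(citing, set()).add(cited)
        neighbours.setdefault(cited, set()).add(citing)

    hops = {focus: 0}
    queue = deque([focus])
    while queue:
        paper = queue.popleft()
        if hops[paper] == max_hops:
            continue  # do not expand beyond the requested ring
        for other in sorted(neighbours.get(paper, ())):
            if other not in hops:
                hops[other] = hops[paper] + 1
                queue.append(other)
    return hops

# Invented citation pairs: B and C cite A, D cites B, A cites E, F cites D.
cites = [("B", "A"), ("C", "A"), ("D", "B"), ("A", "E"), ("F", "D")]
print(expand_from("A", cites, 1))
print(expand_from("A", cites, 2))
```

Calling the function again with a larger `max_hops` corresponds to the user expanding the view one ring further from the focal paper.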

The above discussion shows that ASE contains features which clearly distinguish it from other literature search and exploration tools, and improving ASE should have a significant impact on its usage and popularity. Therefore, to overcome the shortcomings and limitations of ASE, this paper presents an enhancement to its built-in functionality based on import features which can be used to generate visualizations and statistics of various datasets, not just the ACL Anthology Network. With this enhancement, a user can easily get an overview of recent developments in his area of interest and does not have to wait for an updated release of specialized datasets to keep himself current in his field of research.

This feature not only provides the facility of connecting to external digital libraries but also supports downloading and parsing of articles, citation parsing and indexing, citation context generation, access to full-text articles, summary generation, and conversion of articles into multi-dimensional visualizations through a single click.

III. ENHANCEMENT TO ASE

A. Overview

An overview of the enhancement to the ASE prototype is presented in Fig. 1. It can conceptually be divided into three parts: 1) fetching data from digital libraries, 2) processing the data, and 3) incorporating the data into the existing database and linking the views of the different components. The proposed enhancement is not limited to a single component of ASE but also requires changes in other components. To fetch data from external repositories, JabRef was amended. Although recent versions of JabRef can connect to an external repository and fetch data, this functionality is limited to extracting BibTeX entry information and does not provide for viewing, saving, downloading, or parsing full-text articles. For automating the ASE data import process, BibTeX entry information alone is not sufficient to fetch In-Cite and Out-Cite information or to extract citation context. Because ASE integrates multiple components, each of which expects different input, data processing is a lengthy task: it involves processing the full-text article, extracting citation context, fetching In-Cite and Out-Cite information, generating summaries of multiple documents, keeping track of relationships between articles, and updating the data based on those relationships.

TABLE I: ANALYSIS TABLE OF REFERENCE MANAGERS

Tools compared (columns), by category:
Res. & Vis. Tool: ASE, VUE
Search Engine: CiteSeerX, Google Scholar, GoPubMed, Web of Knowledge, ACM Portal, MSFT Academic Search, IEEE Xplore
Ref. Manager: JabRef, Zotero, Mendeley, EndNote
Summ. & Rec.: XplorMed, Recommender Systems, NewsInEssence, Trimmer

Functionality (rows): Summary of Textual Excerpts; Importing Custom Database; Collaboration among Research Group; Ranking; Web Search; Research Material Organization; Document Statistics; Collection and Sub-Collection Creation; Search Excerpts; Corpus Statistics; Publication and Sharing of Research Material; Citation Visualization; Citation Context; Document Recommendations; Full-text Search; Create Notes; History Maintenance; Cited by “List”; Keyword Summary; Incorporating document from external source; Exporting Databases

[The per-tool check marks are not legible in this copy.]

Fig. 1 ASE Enhancement

B. Software Architecture

The software architecture of the prototype and the enhanced ASE is shown in Fig. 2. The main inspiration behind the design of ASE is to integrate statistical, visual, and text representations, each relevant to the task of scientific literature exploration. All of these modalities are linked together in multiple coordinated views, with brushing and linking such that any selection in one is reflected in the others. The main challenges in the design of ASE are arranging screen space to minimize distracting window manipulation and occluding overlaps, and using rich forms of brushing and linking to produce relevant highlights in related windows. Thus, in the proposed enhancement, pop-up windows are used to display and select the list of articles relevant to the user's query and to read full-text articles. In the prototype ASE, each component uses its own input files and there is no centralized storage of or access to data. In the proposed enhancement, by contrast, a centralized data storage approach is used and all components are directly or indirectly connected to the database. In ASE, each window presents a distinct view of the underlying scientific literature; therefore, after the data is fetched and the necessary input files are prepared, all windows are dynamically reloaded to provide an updated view of the database. Dynamic linking of views is performed so that interaction in one component's window is visually reflected in the windows of the others.

Fig. 2 Architecture of Extended ASE
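The brushing-and-linking behaviour described in this section, where a selection in one view is reflected in all others, is commonly built on a shared selection model that notifies its listeners. The sketch below illustrates that general pattern only; the class names `SelectionModel` and `View` are hypothetical and do not describe the actual ASE, SocialAction, or JabRef code.

```python
class SelectionModel:
    """Shared selection that notifies every registered view on change."""

    def __init__(self):
        self._listeners = []
        self.selected = frozenset()

    def subscribe(self, listener):
        self._listeners.append(listener)

    def select(self, paper_ids):
        self.selected = frozenset(paper_ids)
        for listener in self._listeners:
            listener(self.selected)


class View:
    """A window (e.g. network, bibliography list) that mirrors the selection."""

    def __init__(self, name, model):
        self.name = name
        self.highlighted = frozenset()
        model.subscribe(self.on_selection)

    def on_selection(self, selected):
        # A real view would repaint here; the sketch only records the state.
        self.highlighted = selected


model = SelectionModel()
views = [View(name, model) for name in ("network", "bibliography", "in-cite")]

# Selecting a node in any one view goes through the shared model...
model.select({"A01-0001"})
# ...so every view ends up highlighting the same paper.
for view in views:
    print(view.name, sorted(view.highlighted))
```

Routing every selection through one model keeps the views decoupled from one another, which matters when the windows belong to separately developed components.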

C. Workflow

The first step in the workflow of the proposed add-on is obtaining the scholarly article, which at that point may exist only in some external repository and needs to be downloaded and stored on the hard drive of the user's computer. Once the article is downloaded, JabRef parses the file and automatically extracts basic parts of its content, such as the abstract, title, introduction, citations, and the Digital Object Identifier on the first page of the manuscript, and retrieves metadata from external repositories. After successful extraction of the contents, the citations are parsed so that In-Cite and Out-Cite information can be queried from external repositories. The In-Cite and Out-Cite information for the user-selected article is then downloaded and linked. The graph is created from the linkage details of the main article and its In-Cite and Out-Cite articles. The fetched citation contexts are used to create the input file of the In-Cite window, and the parsed article files are used to create the input files of the Out-Cite Text window. In-Cite summary information is also generated from the citation contexts and their relevant articles. Clusters are automatically detected based on the relationships between articles, and afterward all the information is stored in the database.
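The graph-creation step of this workflow can be illustrated with a small sketch that links a main article to the papers it cites and the papers that cite it. Everything here is an assumption made for illustration: the function name `build_citation_graph`, the sequential identifier scheme (modelled loosely on the A01-0001 identifier mentioned later in the paper), and the placeholder titles.

```python
def build_citation_graph(main_title, out_cites, in_cites):
    """Return (ids, edges) for a one-hop citation network.

    out_cites: titles the main article cites.
    in_cites:  titles of articles that cite the main article.
    Edges are (citing_id, cited_id) pairs.
    """
    ids = {}

    def paper_id(title):
        # Hypothetical scheme: number papers in order of first appearance.
        if title not in ids:
            ids[title] = "A01-%04d" % (len(ids) + 1)
        return ids[title]

    main = paper_id(main_title)
    edges = []
    for title in out_cites:
        edges.append((main, paper_id(title)))
    for title in in_cites:
        edges.append((paper_id(title), main))
    return ids, edges


# Placeholder titles, not real bibliographic data.
ids, edges = build_citation_graph(
    "Main article",
    out_cites=["Referenced paper X"],
    in_cites=["Citing paper Y", "Citing paper Z"],
)
print(ids)
print(edges)
```

Keeping the title-to-identifier map separate from the edge list lets later steps (ranking, clustering, database storage) work on identifiers alone.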

IV. IMPLEMENTATION AND EVALUATION

Because ASE is developed in Java, the proposed automation of the import process is also implemented in Java to avoid compatibility issues. Several Java libraries are used for parsing, viewing, and storing information in the database. For viewing the downloaded article, the icepdf-viewer [16] library is used. After an article is successfully downloaded, the Apache Tika [17] Java library is used to extract its metadata and to store its full contents in a plain text file. This text file is then passed via a command to ParsCit [18], a Perl-based content extractor and citation parser, which separates the different parts of the article and provides its output as an XML file. Several other open source add-ons are available for citation parsing, but to avoid compatibility issues and to improve efficiency, ParsCit is also used to parse the citations. One of the changes to SocialAction is the use of a radial tree network visualization rather than a force-directed layout, which provides a visualization emphasizing a central focal node. To provide input to the different components of ASE, XML files are written using the Java DocumentBuilder and FileWriter classes.
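A radial tree layout places the focal node at the center and each deeper level of the tree on a larger concentric ring. The sketch below shows one simple way to compute such positions, giving each subtree an angular wedge proportional to its number of leaves. It is written in Python for brevity and is not the layout code used in SocialAction; the function name `radial_layout` and the `ring_gap` parameter are hypothetical.

```python
import math

def radial_layout(tree, root, ring_gap=1.0):
    """Return {node: (x, y)}: root at the origin, depth d on ring d * ring_gap.

    tree maps each node to a list of its children.
    """
    def leaf_count(node):
        kids = tree.get(node, [])
        return 1 if not kids else sum(leaf_count(k) for k in kids)

    positions = {}

    def place(node, depth, start, end):
        # Put the node in the middle of its angular wedge [start, end].
        angle = (start + end) / 2.0
        radius = depth * ring_gap
        positions[node] = (radius * math.cos(angle), radius * math.sin(angle))
        total = leaf_count(node)
        cursor = start
        for kid in tree.get(node, []):
            # Each child gets a share of the wedge proportional to its leaves.
            span = (end - start) * leaf_count(kid) / total
            place(kid, depth + 1, cursor, cursor + span)
            cursor += span

    place(root, 0, 0.0, 2.0 * math.pi)
    return positions


# Invented example: A is the focal paper, B and C are one hop away, D two hops.
tree = {"A": ["B", "C"], "B": ["D"]}
pos = radial_layout(tree, "A")
for node in sorted(pos):
    print(node, round(pos[node][0], 3), round(pos[node][1], 3))
```

Because distance from the center encodes the number of hops from the focal paper, this layout suits the egocentric exploration that a force-directed layout does not support well.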

An experiment is performed to give an overview of how effectively ASE works with the proposed enhancement. The example session starts with a keyword search for “Cloud Computing”. Google Scholar returns a subset of 20 articles, from which the article “A view of Cloud Computing” is selected for downloading. Once the main article has been downloaded, ASE displays the initial dataset; the screenshot in Fig. 4 shows all the In-Cite and Cited-By information of the main article, A01-0001. As in the prototype, a unique identifier termed an ACL-ID is used to represent articles in the graph.

The graph in Fig. 4 represents the In-Cite and Cited-By relationships of the main article (A01-0001) with the other articles. Fig. 4 shows that all of the fetched information is successfully parsed and incorporated by all the components. When the user clicks on the main node, the In-Cite window lists the citation contexts. Clicking on one of the citation contexts highlights it and displays the complete article, with the same text highlighted, in the Out-Cite Text window. When the user drags the mouse over one of the cited-by nodes, the sub-cited links of that node are displayed (as shown in Fig. 5). On selection of the main node (A01-0001), JabRef displays the BibTeX entry details of that node, and the In-Cite Summary displays the summary generated from all related articles. The detailed view of all the components of ASE after clicking on the main node (A01-0001) is provided in Fig. 6. Fig. 7 displays the rank list of the articles according to the generated sequence of the articles. Fig. 8 displays a scatterplot of the betweenness centrality of the articles against their publication year. Fig. 9 shows the clusters detected in the entire dataset.
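The ranking and scatterplot views rely on per-paper network statistics such as in-degree and betweenness centrality. As a hedged illustration of how these two measures can be computed on a citation edge list, the sketch below ranks papers by in-degree and computes unnormalised betweenness with Brandes' algorithm for unweighted directed graphs. The function names and the sample edges are invented; the paper does not state how SocialAction computes these values.

```python
from collections import deque

def in_degree_ranking(edges):
    """Return [(paper, in_degree)] sorted by descending in-degree, then name."""
    counts = {}
    for citing, cited in edges:
        counts.setdefault(citing, 0)
        counts[cited] = counts.get(cited, 0) + 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def betweenness(edges):
    """Unnormalised betweenness centrality (Brandes, directed, unweighted)."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        succ.setdefault(v, [])
    score = dict.fromkeys(succ, 0.0)
    for s in succ:
        order = []                       # nodes in order of discovery
        preds = {v: [] for v in succ}    # shortest-path predecessors
        sigma = dict.fromkeys(succ, 0)   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in succ[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(succ, 0.0)
        for w in reversed(order):        # accumulate dependencies backwards
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score

# Invented edges (citing, cited): C cites B, B cites A, D cites A.
sample_edges = [("C", "B"), ("B", "A"), ("D", "A")]
ranking = in_degree_ranking(sample_edges)
centrality = betweenness(sample_edges)
print(ranking)
print(centrality)
```

In this small example paper A has the highest in-degree, while B is the only paper lying on a shortest citation path between two others (C to A).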

Fig. 3 Workflow of Proposed Enhancement in ASE. Steps shown: Display and Downloading of Articles; Article Parsing; Content and Citation Extraction; Citation Parsing; In-Cite and Out-Cite Information Downloading; In-Cite and Out-Cite Information Linking; Fetching Article and Linking Information; Graph Creation; Updating Citation Listing; Updating Ranking; Summary Generation; Cluster Detection; Saving Information into Database.

Fig. 4 ASE loaded with a 59-paper Cloud Computing dataset: (1) bibliography list with the main paper downloaded by the user highlighted, (2) bibliography details of the highlighted entry, (3) user query, (4) full-text article with the linked citation highlighted, (5) In-Cite Text, (6) citation network visualization, (7) network overview

Fig. 5 Graph presenting relationship between cited-by and sub-cited by links

Fig. 6 ASE loaded with a 75-paper Cloud Computing dataset: (1) bibliography list with a paper that cites the main paper downloaded by the user highlighted, (2) bibliography details of the highlighted entry, (3) user query, (4) full-text article of the citing paper with the linked citation highlighted, (5) In-Cite Text of the sub-cited paper, (6) citation network visualization, (7) network overview

Fig. 7 Rank List of Cloud Computing paper by its In-degree


Fig. 8 Scatter Plot of Betweenness Centrality according to Year

Fig. 9 Cluster Detection

V. CONCLUSION AND FUTURE WORK

This paper proposes a significant improvement to ASE based on article fetching and parsing, metadata extraction, content and citation writing, graph generation, ranking and filtering, and cluster detection at runtime. Through this improvement, it is possible to construct an efficient and well-organized workflow for the management of scholarly content. The goal of the developed extension is to ease researchers' progress by simplifying the process of acquiring domain knowledge, extracting key information and articles, and viewing the mapping of incoming and outgoing citations in any research area. Through ASE, users are not restricted to the published contents of a single external repository; they can also link the contents of multiple repositories and analyze their contributions to research in a particular domain.

Future work includes refining, testing and extending other components of ASE and introducing other important features such as collaboration and social networking. Moreover, there is currently no undo feature or exploration history view to show or return to previously viewed states. History awareness is an important aspect of visual analytics systems, and integrating history views into the tool can improve task recall, result in more efficient search strategies, and enable asynchronous collaboration between users. In ASE there is no native way to access its bibliographic database from another computer. This can be frustrating when a user does not always work on the same computer. It is also recommended to improve the visualization and analytics of ASE by introducing other techniques.

ACKNOWLEDGEMENT

The authors would like to thank the JabRef team and the ASE developers for their discussion and guidance on this work.

REFERENCES

[1] C. Dunne, B. Shneiderman, R. Gove, J. Klavans, and B. Dorr, “Rapid understanding of scientific paper collections: integrating statistics, text analysis, and visualization,” University of Maryland, Human-Computer Interaction Lab Tech Report HCIL-2011, 2011.

[2] R. Gove, C. Dunne, B. Shneiderman, J. Klavans, and B. Dorr, “Evaluating visual and statistical exploration of scientific literature networks,” in Visual Languages and Human-Centric Computing (VL/HCC), 2011 IEEE Symposium on, Pittsburgh, PA, 2011, pp. 217-224.

[3] K. D. Bollacker, S. Lawrence, and C. L. Giles, “CiteSeer: an autonomous Web agent for automatic retrieval and identification of interesting publications,” in Proc. int. conf. Autonomous Agents, 1998, pp. 116–123.

[4] C. L. Giles, K. D. Bollacker, and S. Lawrence, “CiteSeer: an automatic citation indexing system,” in Proc. ACM conf. Digital Libraries, 1998, pp. 89–98.

[5] Google, “Google Scholar,” http://scholar.google.com/, July, 2014.

[6] Transinsight, “GoPubMed,” http://www.gopubmed.org/, July, 2014.

[7] Thomson Reuters, “ISI web of knowledge,” http://www.isiwebofknowledge.com/, July, 2014.

[8] ACM Portal, “ACM Digital Library,” http://dl.acm.org/, July, 2014.

[9] IEEE Xplore, “IEEE Xplore Digital Library,” http://www.ieee.org/, July, 2014.

[10] JabRef Development Team, JabRef, 2014. [Online]. Available: http://jabref.sourceforge.net

[11] Center for History and New Media, “Zotero,” http://www.zotero.org/, July, 2014.

[12] Mendeley Ltd, “Mendeley,” http://www.mendeley.com/, July, 2014.

[13] Thomson Reuters, “EndNote,” http://www.endnote.com/, July, 2014.

[14] Visual Understanding Environment, “Visual Understanding Environment,” http://vue.tufts.edu/, July, 2014.

[15] ACL Anthology Network, “ACL Anthology Network,” http://clair.eecs.umich.edu/aan/, July, 2014.

[16] ICE PDF, “ICEpdf Open Source Java PDF Viewer - ICEsoft Technologies,” http://www.icesoft.org/java/projects/ICEpdf/overview.jsf, July, 2014.

[17] Apache Tika User Library, “Apache Tika,” http://tika.apache.org/, July, 2014.

[18] ParsCit, “ParsCit,” http://aye.comp.nus.edu.sg/parsCit/, July, 2014.
