Nov 13, 2007 1 Master’s Thesis Defense Bibliographic Tools In The Context Of WWW And LaTeX Munushree Thummala Committee members Dr. Prabhaker Mateti (Advisor) Dr. Thomas Hartrum Dr. T.K. Prasad


Nov 13, 2007 1

Master’s Thesis Defense

Bibliographic Tools In The Context Of WWW And LaTeX

Munushree Thummala

Committee members: Dr. Prabhaker Mateti (Advisor), Dr. Thomas Hartrum, Dr. T.K. Prasad

Nov 13, 2007 2

Agenda Introduction BibTeX Primer Bibliographic Tool Survey Requirements for the BibTeX Tools Design Discussion Conclusion Future Work Questions & Answers Session Demonstration

Nov 13, 2007 3

Introduction Preparing academic papers Collecting bibliographic entries Tools used to prepare the papers Common problems

Nov 13, 2007 4

BibTeX Primer

What is BibTeX?
  Helps authors prepare the References section of their documents
  Defines entry types and required/optional fields
  Uses “style” files to define the format of references
  Standards for publications are specified in style files

Used with LaTeX
  LaTeX collects the \cite{} commands in the .tex file
  BibTeX extracts the corresponding references from the .bib file
  BibTeX formats and sorts them according to the .bst style
  The output of the BibTeX program is LaTeX-formatted text

Nov 13, 2007 5

Sample BibTeX entry

@mastersthesis{Thummala-2007,
  author = {Munushree Thummala},
  title = {Bibliographic tools in the context of WWW and \latex},
  month = {November},
  year = {2007},
  school = {Wright State University},
  OPTkey = {},
  OPTtype = {},
  OPTaddress = {},
  OPTnote = {},
  OPTannote = {},
  advisor = {Prabhaker Mateti}
}

Nov 13, 2007 6

Contribution Of Thesis
  Evaluation of Bibliographic tools
  BibTeX to Database Suite of Tools
    Database to store BibTeX entries
    LoadBiBTeX
    BibSearch
    Discovery of Duplicate BibTeX entries
    Normalization of BibTeX entries
  Text to BibTeX Translation
    TextToBiBTeX command line tool & API
    PDFrefsToBiBTeX command line tool
    Integration of TextToBiBTeX into Aigaion

Nov 13, 2007 7

Bibliographic Tools
  There are 100+ tools
  In this thesis, 87 are reviewed
  Tools were evaluated for the following:
    Formats supported
    Navigating, Searching and Sorting capabilities
    Ease of maintaining bibliographic entries
    Duplicate discovery
    Import/Export to other formats

Nov 13, 2007 8

Bibliographic Tools
  Web browser based tools
    Aigaion, Bibsonomy, CiteULike, Zotero, BibORB, Basilic, PubsOnline, etc.
  Desktop/Small scale tools
    JabRef, KBibTeX, TkBibTeX, BibDB, BibEdit, Open Office Bibliographic Manager, Tellico, etc.
  Commercial tools
    Scholar’s Aid, Bookends, NotaBene, ProCite, etc.
  Utilities
    Bib2html, Bibclean, Bp, Bibdup, Sixpack, etc.

Nov 13, 2007 9

A Few Notable Tools Aigaion Zotero Bibsonomy JabRef

Nov 13, 2007 10

Aigaion
  Web application, Open source
  Easy to use
  Supports basic editing features
  Supports Multiple Users
  Native format is BibTeX
  Organizes references by Topics & Sub Topics
  Maintains a list of authors to eliminate duplication
  Duplicate discovery present in import feature

Nov 13, 2007 11

Aigaion (Contd. 2)

Nov 13, 2007 12

Aigaion (Contd. 3)

Nov 13, 2007 13

Aigaion (Contd. 4) Author Profile

Nov 13, 2007 14

Zotero
  Firefox Browser Extension
  Easy to use
  Organizes entries in collections
  Captures bibliographic entries from websites automatically
  Some drawbacks
    Loses BibTeX citation keys and custom fields while importing
    Not well suited for managing BibTeX bibliographies
    Local storage

Nov 13, 2007 15

Zotero (Contd. 2)

Nov 13, 2007 16

Zotero (Contd. 3)

Nov 13, 2007 17

Zotero (Contd. 4)

Nov 13, 2007 18

Zotero (Contd. 5)

Nov 13, 2007 19

Bibsonomy
  Web browser based, hosted service
  Easy to use
  References
    Users upload refs and bookmarks to Bibsonomy
    Made available to other users
    Tagged with keywords for categorization and search
    Can be exported as BibTeX
  Browser shortcuts to capture entries from web

Nov 13, 2007 20

Bibsonomy (Contd. 2)

Nov 13, 2007 21

Bibsonomy (Contd. 3)

Nov 13, 2007 22

Bibsonomy (Contd. 4)

Nov 13, 2007 23

Bibsonomy (Contd. 5)

Nov 13, 2007 24

JabRef
  Desktop Application
  Easy to use
  Multiple bib files can be edited
  Search online: CiteSeer, Medline, IEEExplore, ArXiv.org
  Native format is BibTeX
  Auto generate BibTeX keys
  Imports/Exports multiple formats

Nov 13, 2007 25

JabRef (Contd. 2)

Nov 13, 2007 26

JabRef (Contd. 3)

Nov 13, 2007 27

JabRef (Contd. 4)

Nov 13, 2007 28

CiteULike
  Web browser based, hosted service
  Easy to use
  References
    Users upload refs to CiteULike
    Made available to other users
    Tagged with keywords for categorization and search
    Can be exported as BibTeX
  Browser shortcuts to capture entries from web
    cite the current article

Nov 13, 2007 29

CiteULike (Contd. 2)

Nov 13, 2007 30

CiteULike (Contd. 3)

Nov 13, 2007 31

CiteULike (Contd. 4)

Nov 13, 2007 32

Requirements for New Tools
  Text to BibTeX translation
    Translating free style text into BibTeX
    Customizing the translation
    Certainty of Recognition measure
    Extract references section from PDF papers
    Provide an API for other developers to integrate free style translation into their applications
    Command line invocation
    GUI also
    Normalized BibTeX output

Nov 13, 2007 33

Requirements (Contd. 2)
  Database of Bibliographic entries
    Database to store BibTeX files
    Tool to detect duplicates
    Command line invocation
    Normalized BibTeX output

Nov 13, 2007 34

Requirements (Contd. 3)
  Search and Generate BibTeX files
    Flexible searches
    Command line invocation
    Outputs BibTeX format
    Normalized BibTeX output
  Platform Independent

Nov 13, 2007 35

Database on Local Machine
  Tables to store
    BibTeX entries
    lookup data for text to BibTeX translation
    search index data for fast and flexible searching

Nov 13, 2007 36

Database Of BibTeX Entries
  A schema to store BibTeX entries, including string macros
  Ability to specify a tag for each entry
    Tag defaults to .bib filename

Nov 13, 2007 37

Database Of Lookup Data
  A database schema to store lookup tables
  Lookup Tables:
    Author Sub Names
    Journal Names
    Publishers
    Cities
    States
    Months
    Organizations

Nov 13, 2007 38

Database Of Search Indexes
  A database schema to store BibTeX search index data
  Stores data as a sequence of tokens
  Provides ability to search
    Any field(s)
    Any keyword(s)
  Citation key also stored as tokens

Nov 13, 2007 39

LoadBiBTeX Tool
  Loads BibTeX files into the database and updates the search index tables
  Loads the lookup tables used by the Text to BibTeX tool
  Detects duplicates

Nov 13, 2007 40

LoadBiBTeX - Loads BibTeX Files
  Program Usage
    LoadBiBTeX -loadentries -bibtag thesis2007 -bibfile thesis.bib
  Any entries that have errors are not loaded and are shown in the output
  Updates the index tables used by the BibSearch tool

Nov 13, 2007 41

LoadBiBTeX - Populate Lookup Tables
  Program Usage
    LoadBiBTeX -loadauthors -loadpublishers -loadjournals -bibfile thesis.bib
  Only new values are loaded
  The above command does not load the BibTeX entries

Nov 13, 2007 42

LoadBiBTeX - Duplicate Discovery
  Program Usage
    LoadBiBTeX -dupdisc -bibtag thesis2007 -bibfile thesis.bib
  The BibTeX entries in thesis.bib are read and compared to the entries in the database corresponding to the bibtag thesis2007
  Any entries considered to be duplicates are displayed for the user
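
The slides do not state which rules LoadBiBTeX uses to decide that two entries are duplicates. The following Python sketch only illustrates the general idea of comparing entries on normalized content rather than on raw strings. The `fingerprint` rule (title tokens, year, author surnames), the dict-based entry layout, and the assumption that author names are written "First Last" are all hypothetical.

```python
import re

def fingerprint(entry):
    """Reduce an entry to a key that ignores case, braces and punctuation.

    Hypothetical rule for illustration; the real tool's rules are not shown
    in the slides. Author names are assumed to be in "First Last" order.
    """
    def tokens(text):
        return tuple(re.findall(r"[a-z0-9]+", text.lower()))

    surnames = tuple(sorted(
        tokens(name)[-1]
        for name in re.split(r"\s+and\s+", entry.get("author", ""))
        if tokens(name)
    ))
    year = str(entry.get("year", "")).strip()
    return (tokens(entry.get("title", "")), year, surnames)

def find_duplicates(new_entries, stored_entries):
    """Return (new, stored) pairs whose fingerprints are equal."""
    seen = {}
    for stored in stored_entries:
        seen.setdefault(fingerprint(stored), []).append(stored)
    return [(new, old)
            for new in new_entries
            for old in seen.get(fingerprint(new), [])]
```

With this rule, an entry whose title is written `{Access} to {Scientific} {Literature}` matches one written without braces, which a plain string comparison would miss.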

Nov 13, 2007 43

BibSearch - Searching The Database
  Program Usage
    BibSearch -bibtag thesis2007 -fields author -keywords Donald Knuth
  The database is searched for entries with the tag “thesis2007” and the words “Donald” and “Knuth” in the “author” field
  The resulting BibTeX entries and any required @String constructs are normalized and written to the output
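
A token-level search of this kind can be sketched in Python with an in-memory index. This is an illustration only: the class name `TokenIndex`, its methods, and the rule that every keyword must occur in at least one of the requested fields are assumptions, and the real tool stores its index in database tables rather than in a dict.

```python
import re
from collections import defaultdict

class TokenIndex:
    """Hypothetical in-memory stand-in for the search index tables."""

    def __init__(self):
        # (tag, field, token) -> set of citation keys
        self._index = defaultdict(set)

    @staticmethod
    def _tokens(text):
        return re.findall(r"\w+", text.lower())

    def add(self, tag, key, fields):
        """Index every field of one entry, plus the citation key itself."""
        for field, value in fields.items():
            for token in self._tokens(value):
                self._index[(tag, field.lower(), token)].add(key)
        for token in self._tokens(key):
            self._index[(tag, "key", token)].add(key)

    def search(self, tag, fields, keywords):
        """Keys under `tag` where each keyword occurs in one of `fields`."""
        result = None
        for word in keywords:
            hits = set()
            for field in fields:
                hits |= self._index.get((tag, field.lower(), word.lower()), set())
            result = hits if result is None else result & hits
        return sorted(result or [])
```

Because matching is done per token, a query for "Donald" and "Knuth" also finds an author stored as "Donald E. Knuth", which a substring match on "Donald Knuth" would miss.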

Nov 13, 2007 44

Normalization
  Make BibTeX entries consistent
  Some of the rules
    Citation keys are consistent
    Fields are enclosed in {} to preserve formatting
    Month field abbreviations are expanded
    Missing required fields are indicated to the user appropriately
    Order of the fields in the output
  Where is it implemented?
    In whichever tool a particular rule makes sense
    Spread across TextToBiBTeX, LoadBiBTeX, BibSearch
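
Two of these rules can be sketched in a few lines of Python. The function names are hypothetical, and the key format (surnames joined by hyphens, followed by the year) is inferred from keys that appear elsewhere in the slides, such as Thummala-2007 and Damm-Josko-1983. It may not cover every case the real tools handle. Author names are assumed to be in "First Last" order.

```python
MONTHS = {
    "jan": "January", "feb": "February", "mar": "March", "apr": "April",
    "may": "May", "jun": "June", "jul": "July", "aug": "August",
    "sep": "September", "oct": "October", "nov": "November", "dec": "December",
}

def expand_month(value):
    """Expand a three-letter month abbreviation; leave other values as-is."""
    cleaned = value.strip()
    return MONTHS.get(cleaned.rstrip(".").lower(), cleaned)

def citation_key(authors, year):
    """Build a key such as 'Damm-Josko-1983' from author names and a year.

    Assumes each name is written "First Last"; the surname is the last word.
    """
    surnames = [name.split()[-1] for name in authors]
    return "-".join(surnames + [str(year)])
```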

Nov 13, 2007 45

Normalization (Example 2)

@mastersthesis{Thummala2007,
  title = "Bibliographic tools in the context of WWW and \latex",
  year = 2007,
  school = "Wright State University",
  month = "Nov",
  author = "Munushree Thummala",
  advisor = "Prabhaker Mateti",
}

@MASTERSTHESIS{Thummala-2007,
  AUTHOR = {{Munushree} {Thummala}},
  TITLE = {{Bibliographic} tools in the context of {WWW} and \latex},
  MONTH = {November},
  YEAR = {2007},
  SCHOOL = {{Wright} {State} {University}},
  ADVISOR = {{Prabhaker} {Mateti}},
}

Nov 13, 2007 46

Normalization (Example 3)

@InCollection{ lawrence01access,
  author = "Steve Lawrence",
  title = "Access to Scientific Literature",
  journal = "The {\it Nature} Yearbook of Science and Technology",
  editor = "Declan Butler",
  publisher = "Macmillan",
  address = "London, England",
  pages = "86-88",
  year = 2001
}

@INCOLLECTION{ Lawrence-2001,
  AUTHOR = {{Steve} {Lawrence}},
  TITLE = {{Access} to {Scientific} {Literature}},
  BOOKTITLE = {},
  YEAR = {2001},
  JOURNAL = {The {\it Nature} {Yearbook} of {Science} and {Technology}},
  EDITOR = {{Declan} {Butler}},
  PUBLISHER = {{Macmillan}},
  ADDRESS = {{London}, {England}},
  PAGES = {86-88},
}

Nov 13, 2007 47

Text to BibTeX Translation
  What are free style references and where would authors find these?
    References at the end of academic papers
    References on Internet sites like CiteSeer
    A jotted-down text description
  How do authors benefit from this translation?
    No need to manually convert to BibTeX
    Significantly better accuracy
    Speeds the process of translating multiple references

Nov 13, 2007 48

Text to BibTeX Translation (Contd. 2)
  Ways to translate free style text
    Write a routine to analyze the strings and guess the fields
    Develop Language Grammar
    Recursive Descent Parser
  Which method did we pick?
    Recursive Descent Parsing
    Tried other methods with varying degrees of success

Nov 13, 2007 49

Text to BibTeX Translation (Contd. 3)
  How does the parser work?
    Extent = a sequence of tokens
    Field type = an extent that matches the set of okTokens for that field and ends when a notOkToken (including a delimiting token) is hit
    Backtrack: if the current token in an extent does not match the field, the parser backtracks to the first token of the extent, which is then given a chance to match other field types
    Unrecognized: if the current token does not match any field type, it is appended to the unrecognized field list and the above process is repeated starting at the next token
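
The extent, backtrack and unrecognized rules above can be sketched as a small Python loop. This is a simplified illustration, not the thesis implementation: the helper names `extent_matcher`, `phrase_matcher` and `parse` are hypothetical, each field is assumed to occur at most once, and "backtracking" is modelled by a matcher returning `None` without consuming any tokens so that the next field type gets a chance at the same start token.

```python
def extent_matcher(is_ok):
    """Match the longest run of tokens accepted by is_ok (the okTokens test)."""
    def match(tokens, start):
        end = start
        while end < len(tokens) and is_ok(tokens[end]):
            end += 1
        return end if end > start else None  # None = no extent, nothing consumed
    return match

def phrase_matcher(phrases):
    """Match one of several multi-token phrases, e.g. from a lookup table."""
    ordered = sorted(phrases, key=len, reverse=True)  # prefer the longest phrase
    def match(tokens, start):
        for phrase in ordered:
            if tuple(tokens[start:start + len(phrase)]) == tuple(phrase):
                return start + len(phrase)
        return None
    return match

def parse(tokens, matchers):
    """Assign extents to field types; collect tokens that match nothing."""
    fields, unrecognized, pos = {}, [], 0
    while pos < len(tokens):
        for name, match in matchers:
            if name in fields:          # assumption: one extent per field
                continue
            end = match(tokens, pos)
            if end is not None:
                fields[name] = tokens[pos:end]
                pos = end
                break
        else:                           # no field type matched this token
            unrecognized.append(tokens[pos])
            pos += 1
    return fields, unrecognized

matchers = [
    ("place", phrase_matcher([("Ann", "Arbor")])),
    ("state", phrase_matcher([("MI",)])),
    ("year", extent_matcher(lambda t: t.isdigit() and len(t) == 4)),
]
fields, unrecognized = parse(
    ["Ann", "Arbor", ",", "MI", "(", "1975", ")", "."], matchers)
```

In this run the punctuation tokens end up in `unrecognized`, while `fields` holds the place, state and year extents.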

Nov 13, 2007 50

Text to BibTeX Translation (Contd. 4)
  How is a series of tokens recognized as a field?
    Author, Journal fields - lookup table and heuristics
    Title field - quoted strings or heuristics
    Pages field - [PAGES.|PP.|P.] <number [–][–number]>
    Year field - a four digit number between 1900 and 2100
    Volume field - [VOL. | VOLUME] <number>
    Number field - [NO. | NUMBER] <number>
    Abbrev field - <volume>(<number>):<startpage>–[-]<endpage>
    Edition field - EDITION<number> or <number> EDITION
    Publisher field, Place, State - lookup table
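
The pattern-based fields listed above translate naturally into regular expressions. The Python sketch below is one possible reading of the slide notation, not the tool's actual rules. It assumes that the keyword prefix (PAGES, VOL, NO, and so on) is required, that matching is case-insensitive, and that either one or two dashes may separate a page range. The function names are hypothetical.

```python
import re

DASH = r"\s*[-\u2013]{1,2}\s*"   # one or two hyphens or en dashes

PAGES = re.compile(r"(?:PAGES|PP|P)\.?\s*(\d+)(?:" + DASH + r"(\d+))?", re.IGNORECASE)
VOLUME = re.compile(r"(?:VOLUME|VOL\.?)\s*(\d+)", re.IGNORECASE)
NUMBER = re.compile(r"(?:NUMBER|NO\.?)\s*(\d+)", re.IGNORECASE)
ABBREV = re.compile(r"(\d+)\((\d+)\):(\d+)" + DASH + r"(\d+)")
EDITION = re.compile(r"EDITION\s*(\d+)|(\d+)\s*EDITION", re.IGNORECASE)
YEAR = re.compile(r"[0-9]{4}")

def parse_pages(text):
    """Return (first_page, last_page_or_None), or None if not a pages field."""
    m = PAGES.fullmatch(text.strip())
    return (m.group(1), m.group(2)) if m else None

def is_year(token):
    """A four digit number between 1900 and 2100."""
    return bool(YEAR.fullmatch(token)) and 1900 <= int(token) <= 2100

def parse_abbrev(text):
    """Split '<volume>(<number>):<startpage>-<endpage>' into its parts."""
    m = ABBREV.fullmatch(text.strip())
    if not m:
        return None
    return {"volume": m.group(1), "number": m.group(2),
            "start": m.group(3), "end": m.group(4)}

def parse_edition(text):
    """Return the edition number for 'EDITION n' or 'n EDITION'."""
    m = EDITION.fullmatch(text.strip())
    return (m.group(1) or m.group(2)) if m else None
```

Read literally, the Abbrev pattern requires the parenthesized issue number, so a shorter form such as `20:59-101` would need an additional rule.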

Nov 13, 2007 51

Text to BibTeX Translation (Contd. 5)
  A lexical analyzer tokenizes:
    Holland, J. H. Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor, MI (1975).

  Holland , J . H
  . Adaptation in Natural and
  Artificial Systems . The University
  of Michigan Press , Ann
  Arbor , MI ( 1975
  ) .
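
A tokenizer that produces this kind of token stream (words and numbers as one token each, every punctuation mark as its own token) can be sketched in Python with a single regular expression. The function name `tokenize` and the exact token classes are assumptions; the slides do not describe the lexical analyzer in more detail.

```python
import re

# A token is either a run of word characters or one punctuation character.
TOKEN = re.compile(r"\w+|[^\w\s]")

def tokenize(reference):
    """Split a free style reference into word and punctuation tokens."""
    return TOKEN.findall(reference)

tokens = tokenize(
    "Holland, J. H. Adaptation in Natural and Artificial Systems. "
    "The University of Michigan Press, Ann Arbor, MI (1975)."
)
```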

Nov 13, 2007 52

Text to BibTeX Translation (Contd. 6)
  Author Field Recognition
    “Holland” was present in the author lookup table
    “J.”, “H.” are initials and the author is recognized as present in the form lastname, firstname
    Author field is set to “J.H. Holland”
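
The "lastname, initials" case described here can be sketched in Python as follows. The function `recognize_author` and its return convention are hypothetical, and the sketch covers only this one name form. The slides say that the real tool combines a lookup table with further heuristics that are not shown.

```python
def recognize_author(tokens, start, surnames):
    """Try to read 'Lastname , I . I .' starting at tokens[start].

    Returns (formatted_author, next_index), or None when the pattern does
    not match, so the caller can try other field types at the same position.
    """
    if (start + 1 >= len(tokens)
            or tokens[start] not in surnames
            or tokens[start + 1] != ","):
        return None
    pos = start + 2
    initials = []
    # Collect pairs of (single uppercase letter, ".") as initials.
    while (pos + 1 < len(tokens)
           and len(tokens[pos]) == 1
           and tokens[pos].isalpha() and tokens[pos].isupper()
           and tokens[pos + 1] == "."):
        initials.append(tokens[pos] + ".")
        pos += 2
    if not initials:
        return None
    return "".join(initials) + " " + tokens[start], pos
```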


Nov 13, 2007 53

Text to BibTeX Translation (Contd. 7)
  Title Field Recognition
    Since “Adaptation” is not recognized as a possible starting token of any other field, tokens are gathered until the next punctuation as the title field


Nov 13, 2007 54

Text to BibTeX Translation (Contd. 8)
  Publisher Field Recognition
    The sequence of tokens “The”, “University”, “of”, “Michigan” and “Press” represents a valid publisher name in the publishers lookup table
    Thus “The University of Michigan Press” is the publisher field


Nov 13, 2007 55

Text to BibTeX Translation (Contd. 9)
  Place and State Field Recognition
    The sequence of tokens “Ann” and “Arbor” represents a valid place name in the cities lookup table
    The token “MI” represents a valid state name in the states lookup table


Nov 13, 2007 56

Text to BibTeX Translation (Contd. 10)
  Year Field Recognition
    The token “1975” is a valid year value in the range 1900 - 2100. As such it becomes the year field


Nov 13, 2007 57

Text to BibTeX Translation (Contd. 11)
  Citation Entry Type
    Since there are no distinguishing fields recognized, the entry type is defaulted to Misc
  CORN calculations
    Author field is fully recognized: a CORN of 100
    Title field follows Author field: a CORN of 100
    Publisher field is in lookup table: a CORN of 100
    There are no required fields for the Misc entry type, so the multiplier is 1
    Entry CORN = AVG(Author, Title, Publisher) * multiplier = 100

Nov 13, 2007 58

Text to BibTeX Translation (Contd. 12)

-- Entry CORN = 100 Author = 100 Title = 100
-- Publisher = 100
@MISC{Holland-1975,
  AUTHOR = {{J}. {H}. {Holland}},
  TITLE = {{Adaptation} in {Natural} and {Artificial} {Systems}},
  YEAR = {1975},
  PUBLISHER = {{The} {University} of {Michigan} {Press}},
  PLACE = {{Ann} {Arbor}},
  STATE = {MI}
}

Nov 13, 2007 59

Text to BibTeX Translation Example 1

Werner Damm and Bernhard Josko. A sound and relatively complete Hoare-logic for a language with higher type procedures. Acta Informatica, 20:59-101, 1983.

-- Entry CORN = 87 Author = 50 Title = 100 Journal = 100 Pages = 100
@ARTICLE{Damm-Josko-1983,
  AUTHOR = {{Werner} {Damm} and {Bernhard} {Josko}},
  TITLE = {{A} sound and relatively complete {Hoare}-logic for a language with higher type procedures},
  YEAR = {1983},
  JOURNAL = {{Acta} {Informatica}},
  PAGES = {59-101},
  VOLUME = {20},
}

Nov 13, 2007 60

Text to BibTeX Translation Example 2

Collins R. J. and Jefferson D. R. "AntFarm: towards simulated evolution." In: C. G. Langton, C. Taylor, J. D. Farmer, and S. Rasmussen (Eds.), Artificial Life II, Vol. X of SFI Studies in the Sciences of Complexity. Redwood City, CA: Addison-Wesley, 1991, pp.579-601.

@INPROCEEDINGS{J-R-1991,
  AUTHOR = {{Collins} {R}. {J.} and {Jefferson} {D}. {R.}},
  TITLE = {{AntFarm}: towards simulated evolution.},
  YEAR = {1991},
  EDITOR = {{G}. {Langton} and {C}. {Taylor} and {J}. {D}. {Farmer} and {S}. {Rasmussen}},
  PAGES = {579-601},
  PUBLISHER = {{Addison} - {Wesley}},
  JOURNAL = {{In}: {C}},
  PLACE = {{Redwood} {City}},
  STATE = {CA},
  OPTERRORFIELD0 = {{Artificial} {Life} {II}},
  OPTERRORFIELD1 = {{Vol}. {X} of {SFI} {Studies} in the {Sciences} of {Complexity}},
}

Nov 13, 2007 61

Correctness Of Recognition Number (CORN)
  CORN for the entire BibTeX entry is based on
    CORN for each field recognized
    Completeness of the entry (% of required fields present)
  CORN is calculated for:
    Author field
    Editor field
    Title field
    Journal field
    Publisher field
    Pages field
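
The entry-level calculation can be sketched in Python as the average of the per-field CORN values multiplied by the fraction of required fields that are present. The function name `entry_corn` and its parameters are hypothetical. Rounding down to a whole number is an assumption inferred from the worked examples in these slides (for instance 65.625 is shown as 65), not something the slides state.

```python
def entry_corn(field_corns, required_fields=(), present_fields=()):
    """Entry CORN = average field CORN * (required fields present / required).

    field_corns:     dict of scored field name -> CORN (0..100)
    required_fields: required fields of the entry type (empty for Misc)
    present_fields:  fields actually present in the translated entry
    The result is rounded down (an assumption based on the slide examples).
    """
    if not field_corns:
        return 0
    total = sum(field_corns.values())
    required = list(required_fields)
    if not required:                      # e.g. Misc: multiplier is 1
        return total // len(field_corns)
    present = sum(1 for name in required if name in present_fields)
    # Integer arithmetic avoids floating point rounding surprises.
    return (total * present) // (len(field_corns) * len(required))

# Example: four scored fields, one of four required fields missing.
INPROCEEDINGS_REQUIRED = ("author", "title", "booktitle", "year")
score = entry_corn(
    {"author": 100, "title": 100, "publisher": 100, "journal": 50},
    INPROCEEDINGS_REQUIRED,
    {"author", "title", "year"},
)
```

Here `score` corresponds to (100+100+100+50)/4 * 3/4 with the fractional part dropped.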

Nov 13, 2007 62

CORN – Example 1

@INPROCEEDINGS{Wegener-2002,
  AUTHOR = {{I}. {Wegener}},
  TITLE = {{Methods} for the {Analysis} of {Evolutionary} {Algorithms} on {PseudoBoolean} {Functions}},
  BOOKTITLE = {},
  YEAR = {2002},
  PUBLISHER = {{Kluwer} {Academic} {Publishers}},
  JOURNAL = {{In}: {Evolutionary} {Optimization}},
}

Nov 13, 2007 63

CORN – Example 1 (Contd.)

Author, Title and Publisher were correctly recognized and their field CORN is set to 100 each.

The journal field was recognized due to the presence of string “In:”. As such it is assigned a CORN of 50.

The required field “Booktitle” is not present so the multiplier is ¾.

This reduces the entry CORN to 65: (100+100+100+50)/4 * 3/4 = 65.625, shown as 65.

Nov 13, 2007 64

CORN – Example 2

@MISC{Luckham-1990,
  AUTHOR = {{David} {Luckham}},
  TITLE = {{Programming} with {Specifications}},
  YEAR = {1990},
  EDITION = {1},
  OPTERRORFIELD0 = {Springer},
  OPTERRORFIELD1 = {Berlin},
}

Nov 13, 2007 65

CORN – Example 2 (Contd.)

One of the Author names is not fully recognized and hence reduces the CORN for author field to 1/2*100 = 50

Title is correctly recognized and its field CORN is set to 100.

Year and Edition fields are correctly recognized but do not impact entry CORN.

Entry CORN = (100+50)/2 = 75. Since the entry type is MISC, the multiplier is 1.

Nov 13, 2007 66

CORN – Example 3

@INPROCEEDINGS{Collins-Jefferson-1990,
  AUTHOR = {{Robert} {J}. {Collins} and {David} {R}. {Jefferson}},
  TITLE = {{AntFarm}: {Towards} simulated evolution},
  BOOKTITLE = {},
  YEAR = {1990},
  PAGES = {579--601},
  MONTH = {February},
  PUBLISHER = {{Addison} - {Wesley}},
  JOURNAL = {{In} {Artificial} {Life} {II}: {Proceedings} of the {Workshop} on {Artificial} {Life}},
  PLACE = {{Santa} {Fe}},
  STATE = {NM},
}

Nov 13, 2007 67

CORN – Example 3 (Contd.)

Author names are fully recognized and hence CORN is set to 100.

Title is correctly recognized and its field CORN is set to 100.

Pages is recognized and the page range is valid so CORN is 100.

Journal is recognized with a heuristic, so CORN is set to 50.

Publisher is in the publishers lookup table, so CORN is set to 100.

Entry CORN = (100+100+50+100+100)/5 * (3/4) = 67.5, shown as 67. The multiplier ¾ is due to the missing required field booktitle.

Nov 13, 2007 68

TextToBiBTeX API
  SetupDbConnection
  setInputString
  setMarkupStream –re colorized HTML
  setBiBTeXStream –re BiBTeX entries
  textToBiBTeX – text to BiBTeX translation
  getEntriesCount
  getBibTeXEntryFieldCount
  getBibTeXEntryField

Nov 13, 2007 69

TextToBiBTeX API (Contd.)
  Java library jar
  Non-Java programs can invoke
    TextToBiBTeX
    PDFrefsToBiBTeX

Nov 13, 2007 70

TextToBiBTeX
  Command line tool
  Free style input in a file
  BibTeX output
  Marked up HTML output
  Uses TextToBiBTeX API
  Usage:
    TextToBiBTeX <txt file> [bib file]

Nov 13, 2007 71

PDFrefsToBiBTeX
  Command line tool
  PDF file as input
  BibTeX output
  Marked up HTML output
  Uses 3rd party tool PDFBox for parsing the PDF file
  Uses TextToBiBTeX API
  Usage:
    PDFrefsToBiBTeX [-clean] <pdf file> [bib file]

Nov 13, 2007 72

Integrating Into Aigaion
  Free style translation functionality integrated into Aigaion
  Free style recognition from PDF files
    Logic to clean the text recognized from PDF
  Synchronizing TextToBiBTeX lookup tables with entries from the Aigaion database

Nov 13, 2007 73

Integrating Into Aigaion (Contd. 2)

Nov 13, 2007 74

Integrating Into Aigaion (Contd. 3)

Nov 13, 2007 75

Integrating Into Aigaion (Contd. 4)

Nov 13, 2007 76

Integrating Into Aigaion (Contd. 5)

Nov 13, 2007 77

Sync Tables with Aigaion (Contd. 6)

Nov 13, 2007 78

Sync Tables with Aigaion (Contd. 7)

Nov 13, 2007 79

Conclusion
  Tool Survey
    Evaluated over 80 tools
    Tool Recommendations
  Database of BibTeX entries
    Store BibTeX files as database entries
    Searching is based on token level instead of string level, which yields good results
    Duplicates are detected logically instead of by string comparisons

Nov 13, 2007 80

Conclusion (Contd.)
  Text to BibTeX translation
    TextToBiBTeX saves scholars time and effort by relieving them of the burden of translating and maintaining BibTeX entries
    TextToBiBTeX API allows other tools to reuse free style functionality
    Integrated into Aigaion tool
    Converted PDF references into BibTeX format

Nov 13, 2007 81

Future Work
  Better duplicate detection by letting the users configure the base rules for detecting duplicates
  Recognizing more variations in free style text
  Recognizing more fields
  Optimizing the database loading speed for BibTeX entries

Nov 13, 2007 82

Demonstration
  Integration of free style into Aigaion
    Text file input
    PDF file input
  LoadBiBTeX - Duplicate Discovery
  BibSearch - Searching the database
  LoadBiBTeX - loading a BibTeX file
  LoadBiBTeX - updating lookup tables