Next steps for BHL and Linked Data
John Mignault
Technical Advisory Group, Biodiversity Heritage Library
Twitter: @jmignault
The Biodiversity Heritage Library
• BHL is a consortium of natural history and botanical libraries and research institutions
• An open access digital library for legacy biodiversity literature
• An open data repository of taxonomic names and bibliographic information
• An increasingly global effort
– US/UK, Europe, Egypt, China, Africa
How much text are we talking?
• Just hit the 40 million page mark
• Tens of thousands of titles
• 110,000 volumes
• Internet Archive is BHL's scanning partner
• In conjunction with local scanning efforts
Issues we’ve faced
• OCR is a *BIG* deal
• A lot of literature is pre-1923
• Expanding the range of material in BHL
OCR is a *BIG* deal
• All book / literature digitization projects affected, not just BHL
• Especially problematic in BHL
– More than 50 languages represented in BHL
– Dates of publication from the 1400s to the 2000s
– Irregular typeface / typesetting
– Multiple languages on one page
• Botanical descriptions in Latin
2007 Name Finding Study
>35% OCR error rate for names only
Top OCR errors
 1  Insert space     8  n->v
 2  Omit space       9  l->i
 3  e->c            10  r->i
 4  u->I            11  u->ii
 5  u->n            12  h->l
 6  i->l            13  h->ii
 7  c->e            14  e->o
35.16%
Of the 3,003 names, 1,056 were incorrectly transcribed by OCR.
Wei, et al. An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library. Proceedings of TDWG. 2008. http://www.tdwg.org/proceedings/article/view/380
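The figures on this slide can be checked with a few lines of Python, and the confusion table suggests one possible way to repair names. The sketch below is not the method used in the study: it assumes each pair in the table reads as "intended character -> what OCR produced", ignores the two spacing errors, and the helper names (`error_rate`, `single_edit_candidates`, `correct_name`) and the tiny name list are made up for illustration.

```python
# Character confusions from the "Top OCR errors" table, read as
# (intended, ocr_output). That reading is an assumption; the spacing
# errors (ranks 1 and 2) are not handled in this sketch.
CONFUSIONS = [
    ("e", "c"), ("u", "I"), ("u", "n"), ("i", "l"), ("c", "e"),
    ("n", "v"), ("l", "i"), ("r", "i"), ("u", "ii"), ("h", "l"),
    ("h", "ii"), ("e", "o"),
]


def error_rate(incorrect, total):
    """Percentage of names that OCR transcribed incorrectly."""
    return 100.0 * incorrect / total


def single_edit_candidates(ocr_text):
    """All strings obtained by undoing exactly one confusion at one position."""
    candidates = set()
    for intended, seen in CONFUSIONS:
        start = ocr_text.find(seen)
        while start != -1:
            candidates.add(ocr_text[:start] + intended + ocr_text[start + len(seen):])
            start = ocr_text.find(seen, start + 1)
    return candidates


def correct_name(ocr_text, known_names):
    """Return a known name that is one confusion away, or None if unsure."""
    if ocr_text in known_names:
        return ocr_text
    matches = sorted(single_edit_candidates(ocr_text) & set(known_names))
    # Only accept an unambiguous repair; anything else needs a human.
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    print(f"{error_rate(1056, 3003):.2f}%")  # prints 35.16%
    known = {"Quercus alba"}  # hypothetical lexicon of verified names
    print(correct_name("Qucrcus alba", known))  # prints Quercus alba
```

Returning `None` for ambiguous or unmatched strings keeps the sketch conservative, which matters when a wrong "correction" would silently create a false name occurrence.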
Abbild ungen und Beschreibungen der
Fische Syriens, nebst
einer neuen Classification und Characteristik sämmtlicher Gattungen
der i
JOH. JAKOB HECKEL, Inipectoi am k. k. Hof-Natur.-iUenkabinete in
Wien, mehr, yelelirt. UeHtllMeii. MIfglivd.
STUTTGART. E. Schweizerbart' sehe Verlagshandlung,
1843.
Older material
• Great deal of material is pre-1923
• Irregular fonts: blackletter
• Multiple languages on same page: English text with Latin scientific names
• Changes in geographic names
• Changes in scientific names
*E.xvi c piteI von c. cXx.WptdvonfnrWmn � �bu fbe;bcn.5 am cix bIa S &3rn~ 41X � �a m cv(f b1air 'o et ert oiensr ; � � � �
', : hlrfc c wa ff 4am.diug bist a� � � �6aiw~s ff oJrJtwt nof bL4ecImt& blfafra mem b t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tif vrmr Waff C * t6rmnli an `tn ciblatGteaM �w ?ffoaifrn w4wmeu nu weib e , wpiteI voE5teiri ct c ober gtUcr cit cm` 91 cLi biar J ' >bSciatl Oiff ;Bruet wacfttc n qmcx b1a bl: �bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :r trtuft *e t B Rn " trv W1Rt' ?Cm c blas � �waIwutr Ober ci ti 1V Ces ' wt �gbtiemwwajfu tpctt, afferain 9 c: b titbfof �
r f eran m rs bra wlg auig4;f aer m *mc vrt � �blatcabtfm wfru an'deg~m rt blas Iaum bwWt run f ncmai b14ianf tJobrrfan �ebrut4net vnber Brwt Ober awawi*m.crriii btafwfm uww c on$ 'it ttu wttkc 5,10 $ m~C fca trc* cx u W e &mcyfbq4 Mabtt mmw � �rc a iiu bc Jcn ncI.end.*, blat s. a\ u: rprd3 �rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbt enb c optiti bt -r9 ceDa ttDcn i34M sn Sem i
Expanding scope
• Manuscripts, field notebooks: mostly handwritten, often with drawings
• Global expansion means dealing with non-Western script systems and a whole new set of OCR problems
– Arabic materials from Bibliotheca Alexandrina in Egypt
Images
Some current initiatives
• Scientific name extraction
• “Parts”
• PDF Generator
Scientific Name Extraction
• TaxonFinder algorithm in production since 2008
– More than 100 million candidate name strings
– More than 1.5 million unique, verified names
– Available through UI, APIs, Data Exports & Internet Archive
• New collaboration with Global Names
– Improved algorithm, better precision & recall
– More data!
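The gap between "candidate name strings" and "verified names" is easier to see with a toy example. The sketch below is not TaxonFinder or the Global Names algorithm: it assumes a naive pattern (capitalised word followed by a lowercase word) for candidates and a small hypothetical set of verified names for the checking step.

```python
import re

# Naive pattern for a binomial: capitalised genus, then lowercase epithet.
# This is an illustrative assumption, not the production algorithm.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def find_candidates(text):
    """Return every string that merely looks like a scientific name."""
    return [match.group(0) for match in BINOMIAL.finditer(text)]


def verify(candidates, verified_names):
    """Keep only candidates present in a list of verified names."""
    return [name for name in candidates if name in verified_names]


if __name__ == "__main__":
    page_text = (
        "The specimen of Quercus alba was found near Poa annua stands. "
        "The tree was tall."
    )
    verified = {"Quercus alba", "Poa annua"}  # hypothetical verified list
    candidates = find_candidates(page_text)
    print(candidates)
    print(verify(candidates, verified))
```

On this sample the pattern also picks up ordinary phrases such as "The specimen", which is why the candidate count is so much larger than the verified count and why precision and recall are the measures to improve.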
Finding parts
• Disambiguating and locating structural boundaries in the corpus
• Done mainly by crowdsourced means
– Citebank
• Greatly increases usability and semantic value of the dataset
• Addressing is important: it makes data addressable and thus linkable
Articles in the BHL UI
Images
PDF Generator
What we’d like to do
Challenges framed as games: http://biodivlib.wikispaces.com/BHL+and+Gaming
• Correcting OCR
• Rekeying Tables of Contents
• Researching candidate Scientific Names
• Image identification & extraction
– http://biodivlib.wikispaces.com/Art+of+Life
– Currently funded by NEH
We need your help
• “When in doubt, use humans.” – @dpatil: http://radar.oreilly.com/2012/07/data-jujitsu.html
• Increase value of biodiversity domain through improved data integration
• Many similarities between specimen labels and literature
Need deep intertwingling
• Wider integration of biodiversity data
• Normalization through controlled vocabularies and authorities
• Linkages between
– Specimens
– Descriptions
– Articles
– Manuscripts
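As a rough illustration of how normalization through an authority enables linkages, the sketch below maps variant name strings to one identifier and groups records of different kinds under it. Everything here is hypothetical: the `taxon:` identifiers, the record fields, and the record ids are invented for the example and do not come from BHL or any real authority file. The only borrowed fact is that Felis concolor is an older name for Puma concolor.

```python
from collections import defaultdict

# Hypothetical authority file: normalized name string -> canonical identifier.
AUTHORITY = {
    "puma concolor": "taxon:0001",
    "felis concolor": "taxon:0001",  # older name for the same species
    "quercus alba": "taxon:0002",
}


def normalize(name):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def link_records(records):
    """Group (kind, id) pairs under the identifier their name resolves to."""
    links = defaultdict(list)
    for record in records:
        taxon_id = AUTHORITY.get(normalize(record["name"]))
        if taxon_id is not None:  # unresolved names are left unlinked
            links[taxon_id].append((record["kind"], record["id"]))
    return dict(links)


if __name__ == "__main__":
    records = [  # hypothetical records of the four kinds named on the slide
        {"kind": "specimen", "id": "spec-17", "name": "Felis  concolor"},
        {"kind": "article", "id": "art-42", "name": "Puma concolor"},
        {"kind": "manuscript", "id": "ms-3", "name": "Quercus alba"},
        {"kind": "description", "id": "desc-9", "name": "Unknownus namus"},
    ]
    for taxon_id, linked in link_records(records).items():
        print(taxon_id, linked)
```

The specimen and the article use different names yet end up under the same identifier, which is the kind of cross-collection link a shared authority makes possible.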
To sum up
• BHL is a massive dataset useful for multidisciplinary research
– Systematics
– Natural Language Processing
– Humanities
• BHL is open
– Free to use at http://biodiversitylibrary.org
– Open access data for scholarly use & reuse
• BHL has APIs and data exports to enable reuse
– BHL data can be incorporated into other virtual research environments
Get involved
• http://biodiversitylibrary.org
• http://biodivlib.wikispaces.com/Developer+Tools+and+API
• http://biodivlib.wikispaces.com/BHL+and+Gaming
• Thanks!