Expanding Access to Biodiversity Literature

Presented by: William Ulate, Center for Biodiversity Informatics, MBG

May 26, 2016

Biodiversity Heritage Library

• a consortium of botanical and natural history libraries

• stores digitised legacy literature on biodiversity

• currently holds 180,000 volumes = millions of pages (PDFs and OCR-generated text)

• open-access

4 5/31/2016 Mining Biodiversity

http://diggingintodata.org/awards/2013

The partners

Social Media Lab


Thanks to the sponsors:

Mining Biodiversity

• Transform BHL into a next-generation social digital library

• A multi-disciplinary approach

– Text Mining

– Machine learning

– History of Science

– Environmental History & Studies

– Library and Information Science

– Social Media


What have we done so far?

Social Media

Visualisation

Semantic Metadata


Making Biodiversity Digital Objects More Social and Shareable

Follow us on Twitter: @SMLabTO

We also partnered with Altmetric to better understand who shares BHL content across various social media platforms, and why.


“My Tweeps” app mytweeps.com

Helping BHL (and other organizations) to get daily insights about their Twitter followers (or Tweeps) and what they are interested in.

We call it a "reverse" Twitter because instead of seeing tweets from people whom you follow, the app shows you tweets from people who follow you.


MyTweeps.com

How are these tools in BHL being used?

What are we doing?

Social Media

Visualisation

Semantic Metadata


Current features

• supports keyword-based search

• species names annotated and linked to the Encyclopedia of Life

• integrates automatic taxonomic name finding tools (uBio Taxonfinder / GNRDS [1])

• data access through export functionalities and Web services

[1] Global Names Recognition and Discovery tools and Services (GNRDS). See http://gnrd.globalnames.org/

Keyword-based search and Browsing

Advanced search (also keyword-based)


Enhancements to BHL

What’s wrong with keyword-based search: Polysemy

• Ambiguity!

Boxwood

– historic place in Alabama?

– North American term for plants in the Buxaceae family?

Box

– container?

– term for boxwood in other English-speaking countries?

What’s wrong with keyword-based search: Polysemy

• Ambiguity!

California bay

– hardwood tree?

– location?

Semantic metadata generation

• Entity types

– species

– location

– habitat

– anatomical parts

– qualities

– persons

– temporal expressions

• Association types

– observation

– habitation

– nutrition

– trait
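The entity and association types listed above can be pictured as a small typed data model. This is a minimal sketch only: the class names (`Entity`, `Association`), the field names, the helper `annotate_mention`, and the sample sentence are all hypothetical and are not taken from the project's actual annotation schema.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    SPECIES = "species"
    LOCATION = "location"
    HABITAT = "habitat"
    ANATOMICAL_PART = "anatomical part"
    QUALITY = "quality"
    PERSON = "person"
    TEMPORAL_EXPRESSION = "temporal expression"


class AssociationType(Enum):
    OBSERVATION = "observation"
    HABITATION = "habitation"
    NUTRITION = "nutrition"
    TRAIT = "trait"


@dataclass(frozen=True)
class Entity:
    text: str
    type: EntityType
    start: int  # character offset of the mention in the source text
    end: int    # exclusive end offset


@dataclass(frozen=True)
class Association:
    type: AssociationType
    source: Entity
    target: Entity


def annotate_mention(sentence: str, mention: str, entity_type: EntityType) -> Entity:
    """Hypothetical helper: locate the first occurrence of `mention` and wrap it as an Entity."""
    start = sentence.index(mention)
    return Entity(mention, entity_type, start, start + len(mention))


# Illustrative sentence, not taken from the BHL corpus.
sentence = "Umbellularia californica grows in coastal forests."
species = annotate_mention(sentence, "Umbellularia californica", EntityType.SPECIES)
habitat = annotate_mention(sentence, "coastal forests", EntityType.HABITAT)
habitation = Association(AssociationType.HABITATION, species, habitat)
```

In this picture an association is simply a typed link between two entity mentions, so a "habitation" annotation connects a species mention to a habitat mention.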


Examples of semantic metadata (annotations)

• Observation

• Habitation

Examples of semantic metadata (annotations)

• Nutrition

• Trait

How does semantic information help?

SPECIES: California bay (hardwood tree)

LOCATION: California bay (location)

• Word sense disambiguation
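To illustrate how entity-type annotations support word sense disambiguation, here is a minimal sketch of a search index keyed on both the mention text and its entity type. The document ids, the `ANNOTATIONS` list, and the function names are hypothetical; the slides do not describe how the actual index is built.

```python
from collections import defaultdict

# Hypothetical annotated mentions: (document id, surface text, entity type).
ANNOTATIONS = [
    ("doc-1", "California bay", "SPECIES"),
    ("doc-2", "California bay", "LOCATION"),
    ("doc-3", "California bay", "SPECIES"),
]


def build_typed_index(annotations):
    """Map (lower-cased text, entity type) to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text, entity_type in annotations:
        index[(text.lower(), entity_type)].add(doc_id)
    return index


def search(index, text, entity_type=None):
    """Return sorted document ids; restrict to one sense when entity_type is given."""
    key_text = text.lower()
    hits = set()
    for (indexed_text, indexed_type), doc_ids in index.items():
        if indexed_text == key_text and entity_type in (None, indexed_type):
            hits |= doc_ids
    return sorted(hits)


index = build_typed_index(ANNOTATIONS)
keyword_hits = search(index, "California bay")             # both senses mixed together
species_hits = search(index, "California bay", "SPECIES")  # only the tree sense
```

A plain keyword query returns every document mentioning the string, while the typed query keeps only the documents in which the mention was annotated with the requested sense.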

What’s wrong with keyword-based search: Synonymy

Campanula portenschlagiana Schult.

– Campanula affinis Rchb. ex Nyman

– Campanula muralis Port. ex A. DC.

What’s wrong with keyword-based search: Synonymy

Clematis L.

– Clematopsis Bojer ex Hutch.

– Atragene L.

– Archiclematis Tamura

How does semantic information help?

Campanula portenschlagiana Schult.

– Campanula affinis Rchb. ex Nyman

– Campanula muralis Port. ex A. DC.

• Query expansion

Term Inventory

• compilation of species names (flowering plants, mammals, birds)

• acts as a thesaurus, as each name is linked to its synonyms as well as other semantically related names

• “semantic relatedness”: defined in terms of a contextual similarity measure, computed over the entire BHL corpus
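The slides do not say which contextual similarity measure was used. As one common possibility, the sketch below assumes that each name is represented by counts of the words appearing near its mentions, and that two names are compared by cosine similarity. The window size, the function names, and the three toy sentences are illustrative assumptions, not the project's method or data.

```python
import math
from collections import Counter


def context_vector(name, sentences, window=3):
    """Count the words appearing within `window` tokens of each mention of `name`."""
    target = name.lower().split()
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - len(target) + 1):
            if tokens[i:i + len(target)] == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + len(target):i + len(target) + window]
                counts.update(left + right)
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy sentences standing in for the corpus (illustrative only).
corpus = [
    "the wall bellflower campanula portenschlagiana flowers on old stone walls",
    "campanula muralis flowers on old stone walls in summer",
    "clematis vitalba climbs over tall hedges",
]

v_port = context_vector("Campanula portenschlagiana", corpus)
v_mur = context_vector("Campanula muralis", corpus)
v_clem = context_vector("Clematis vitalba", corpus)

sim_related = cosine_similarity(v_port, v_mur)
sim_unrelated = cosine_similarity(v_port, v_clem)
```

Names that occur in similar contexts receive a higher score, so ranking all candidate names by this score against a query name yields the kind of "semantically related" list that a thesaurus entry could store.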

Sources we leveraged

• Names

– Encyclopedia of Life (EOL)

– Catalogue of Life

– Global Biodiversity Information Facility (GBIF)

• Images

– Encyclopedia of Life (EOL)

Experiments

• Training data: all English texts from the BHL – about 26 million pages with a size of 49GB

• Evaluation data: synonymous terms from the Catalogue of Life

• Select 500 scientific names and their synonyms from the CoL

• Results at top-20

Category | Class | # terms in CoL | # terms in BHL | Average # synonyms in CoL
Birds | Aves | 1140 | 818 | 2.28
Mammals | Mammalia | 1131 | 726 | 2.26
Plants | Plantae | 1141 | 826 | 2.28

Category | Pre@20 (precision) | Re@20 (recall)
Birds | 69.41% | 63%
Mammals | 62.12% | 53.84%
Plants | 56.17% | 21.43%
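For readers unfamiliar with the Pre@20 and Re@20 columns, the sketch below shows the standard definitions of precision and recall at a cut-off k for a single query name, with the gold-standard synonyms playing the role of the relevant set. The slides do not state how scores were averaged across the selected names, so this covers only the per-query computation; the placeholder names are hypothetical.

```python
def precision_at_k(ranked, relevant, k=20):
    """Share of the k top-ranked slots filled by true synonyms (standard P@k: hits / k)."""
    if k <= 0:
        return 0.0
    hits = sum(1 for name in ranked[:k] if name in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k=20):
    """Share of the true synonyms that appear among the k top-ranked candidates."""
    if not relevant:
        return 0.0
    hits = sum(1 for name in set(ranked[:k]) if name in relevant)
    return hits / len(relevant)


# Placeholder candidates ranked by similarity to one query name (hypothetical).
ranked_candidates = ["name-a", "name-b", "name-c", "name-d"]
gold_synonyms = {"name-a", "name-c", "name-e"}

p_at_4 = precision_at_k(ranked_candidates, gold_synonyms, k=4)
r_at_4 = recall_at_k(ranked_candidates, gold_synonyms, k=4)
```

Here two of the four returned candidates are true synonyms (precision 0.5), and two of the three gold synonyms were found (recall about 0.67).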

Application to Query Expansion

• an interface for searching BHL documents using a species name as a query

• query is automatically expanded by retrieving synonyms/semantically related names from the term inventory

• documents mentioning all of the names in the expanded query are returned
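The expansion step can be sketched as a thesaurus lookup followed by a document filter. This is an illustration under stated assumptions, not the search interface's implementation: `TERM_INVENTORY`, `DOCUMENTS`, and both function names are hypothetical, and matching is done by naive substring search. By default a document matches when it mentions at least one name in the expanded query; the `require_all` flag covers the stricter reading in which every name must be present.

```python
# Hypothetical term inventory: query name -> synonyms / semantically related names.
TERM_INVENTORY = {
    "Campanula portenschlagiana": ["Campanula affinis", "Campanula muralis"],
}

# Toy documents standing in for BHL pages (illustrative only).
DOCUMENTS = {
    "doc-1": "Notes on Campanula muralis growing on garden walls.",
    "doc-2": "Campanula portenschlagiana and Campanula affinis compared.",
    "doc-3": "A survey of Clematis in cultivation.",
}


def expand_query(name, inventory):
    """Return the query name followed by its inventory entries, without duplicates."""
    expanded = [name]
    for related in inventory.get(name, []):
        if related not in expanded:
            expanded.append(related)
    return expanded


def search(name, inventory, documents, require_all=False):
    """Return sorted ids of documents matching the expanded query."""
    names = [n.lower() for n in expand_query(name, inventory)]
    combine = all if require_all else any
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if combine(n in text.lower() for n in names)
    )


plain_hits = search("Campanula portenschlagiana", {}, DOCUMENTS)
expanded_hits = search("Campanula portenschlagiana", TERM_INVENTORY, DOCUMENTS)
strict_hits = search("Campanula portenschlagiana", TERM_INVENTORY, DOCUMENTS, require_all=True)
```

Without expansion only the document using the queried name is found; with expansion the document that uses only a synonym is retrieved as well.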

https://www.youtube.com/watch?v=lF2ManWhljM

http://nactem10.mib.man.ac.uk/va/MiBio/Search/queryExpansion.html?prot=thumb

http://goo.gl/forms/3mO5fWd7Y4

Some Magnoliopsida species (common) names

CHOICE 1 | CHOICE 2 | CHOICE 3
Phaseolus multiflorus | Garden pea | Argemone alba
Citrus nobilis | Sweetheart | Arabis perfoliata
Spergularia marina | Aster pauciflorus | Mimosa
Canavalia ensiformis | Physic nut | Mung bean
Chrysanthemum inodorum | Guilandina bonducella | Tilia parvifolia
Fraxinus pubescens | Arabidopsis thaliana | Pulsatilla vulgaris
Symphoricarpos orbiculatus | Turritis glabra | Medick
Sorbus domestica | Lespedeza reticulata | Hypericum galioides
Haematoxylon campechianum | Scaevola lobelia | Alliaria petiolata
Kerria japonica | Clematis indivisa | Erythrina glauca
Petasites officinalis | Ptychotis ajowan | Aster multiflorus
Salix cinerea | Ribes vulgare | Sword bean

More Magnoliopsida species (common) names

CHOICE 1 | CHOICE 2 | CHOICE 3
Peucedanum ostruthium | Cynoglossum sylvaticum | Allamanda schottii
Windsor bean | Common haricot | Ranunculus pusillus
Sambucus canadensis | Field bean | Dwarf bean
Gaillardia bicolor | Ipomoea nil | Monniera monniera
Arnica alpina | Indian laburnum | Eastern redbud
Calocarpum mammosum | Ribes americanum | Lactuca alpina
Polygonum maritimum | Erica mediterranea | Paronychia canadensis
Imperatoria ostruthium | Rubus lasiocarpus | Melochia corchorifolia
Valerianella olitoria | Sonchus oleraceus | Vicia hirsuta
Mountain ebony | Carduus lanceolatus | Salix rubra
Ledum groenlandicum | Sida abutilon | Tecoma radicans
Gilia coronopifolia | Corydalis canadensis | Lacinaria spicata

And more Magnoliopsida species (common) names

CHOICE 1 | CHOICE 2 | CHOICE 3
Pyrus cydonia | Barbados pride | Prenanthes alba
Clianthus dampieri | White lupin | Yellow pea
Geum intermedium | Pyrus melanocarpa | Erigeron canadensis
Pyrola uniflora | Japanese pagoda tree | Epilobium hirsutum
Ampelopsis engelmanni | Soybean | Salix pentandra
Solanum nodiflorum | Exogonium purga | Lathyrus montanus
Ribes floridum | Impatiens biflora | Stellaria media
Orobus tuberosus | Cassia marilandica | Cnicus discolor
Medicago maculata | Melilotus indica | Apium nodiflorum
Glycine soja | Balsam of tolu | Juglans laciniosa
Stellaria longifolia | Salix arctica | Purging cassia
Echinospermum lappula | Umbrella tree | Potentilla pumila

Thank you

William Ulate, Missouri Botanical Garden

[email protected]

Photo: W.Ulate. Corcovado National Park, Costa Rica. 2013

This project was made possible in part by

[LG-00-14-04-0032-14]

Riza Batista-Navarro, NaCTeM, University of Manchester

[email protected]