Online tools and standards for Biodiversity data in the Semantic Web
Dr Dimitris Koureas
Biodiversity Informatics Group | Department of Life Sciences
The Natural History Museum London
What is the semantic web?
Slide adjusted from Page R. presentation in pro-iBiosphere
[Figure: two web resources (http://… and http://…) joined by a link. The link is itself identified by a URL (http://…) and reads "is an author of", connecting a person (Fred) to a book.]
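The statement on this slide (Fred, is an author of, a book) has the subject-predicate-object shape that RDF, the semantic web data model, calls a triple. The following is a minimal sketch of that idea in plain Python, not an RDF library. The slide elides its URLs, so every `example.org` identifier below is a hypothetical stand-in.

```python
# Hypothetical identifiers: the slide elides its URLs, so example.org stands in.
FRED = "http://example.org/person/fred"
BOOK = "http://example.org/book/1"
AUTHOR_OF = "http://example.org/terms/isAuthorOf"
TYPE = "http://example.org/terms/type"

# A tiny in-memory store: each fact is a (subject, predicate, object) triple.
triples = {
    (FRED, AUTHOR_OF, BOOK),
    (FRED, TYPE, "person"),
    (BOOK, TYPE, "book"),
}


def objects(subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def subjects(predicate, obj):
    """Return every subject linked to `obj` by `predicate`."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)


print(objects(FRED, AUTHOR_OF))   # what did Fred write?
print(subjects(AUTHOR_OF, BOOK))  # who wrote the book?
```

Because the link carries its own identifier, a machine can follow it in either direction without parsing any free text.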
The Semantic web:
“The future of the web…and always will be” – Peter Norvig (Google)
Biodiversity informatics
The study of the transformation and communication of information in Life and Earth sciences
provides the means (generating and enhancing the necessary infrastructure)
Research vs Infrastructure
• Research: Discovery, Ephemeral, Individualistic, Massive redundancy, Optional, Risk taking
• Infrastructure: Implementation, Communal / agreed, Essential, Persistent, Robust & reliable, Adaptable
Slide adapted from Patterson D. 2013, Tempe, Arizona
What are the current challenges in Biodiversity informatics?
Publications based on countless specimens, images, maps, keys and datasets
Current taxonomic data production
Typically generated by small communities for “local” research projects
Figure from Costello M.J. et al., 2013, doi: 10.1126/science.1230318
Our current taxonomic data production
• 15-20k new spp. described annually (2M total) [1]
• 30k nomenclatural acts (12M total) [1]
• 20k phylogenies (750k total) [2]
• 31k taxa sequenced (360k taxa total) [3]
• 800k BioMed papers (40M total pp. of taxonomy) [4]
• Countless specimens, images, maps, keys and datasets
Figures from [1] Zhang, Zootaxa 2011 4, 1-4; [2] Web-of-Science; [3] GenBank and [4] PubMed.
1.8M described spp. (17M names)
300M pages (over last 250 years)
1.5-3B specimens
Estimates of 7.5 million species still undescribed [1]
[1] How Many Species Are There on Earth and in the Ocean? Mora C. et al., doi:10.1371/journal.pbio.1001127
Now imagine that…
Biodiversity informatics landscape
Key problems
• Landscape is complex, fragmented & hard to navigate
• Many audiences (policy makers, scientists, amateurs, citizen scientists)
• Many scales (global solutions to local problems)
Figure adapted from Peterson et al., Syst. & Biodiv. 2010, doi: 10.1080/14772001003739369
Science is carried out “locally”
• By local scientists
• Being part of local infrastructures
• Having local funders
BUT
Science is global
• It needs global standards
• Global workflows
• Cooperation of global players
Expected volume of taxonomic and biodiversity data
Need to extract, aggregate and link data at a global level
Cyndy Parr, Rob Guralnick, Nico Cellinese and Rod Page. TREE doi:10.1016/j.tree.2011.11.001
This requires data, information & knowledge to be…
• Digital, not printed paper
• Openly accessible, not behind barriers (e.g. paywalls)
• Linked-up, not in silos
“Link together evolutionary data… by developing analytical tools and proper documentation and then use this framework to conduct comparative analyses, studies of evolutionary process and biodiversity analyses”
To achieve this…
Hour-glass motif for big data infrastructure
Data re-use
Data generation
Data pool
Slide adapted from Patterson D. 2013, Tempe, Arizona
Big data world with re-use data
Data re-use: Aggregation, Visualization, Analysis, Manipulation
Data generation: Models, Observations, Experiments, Processed
Data pool
• Re-use, Quality enhancement
• Distribute, Make discoverable and actionable, Atomize, Standardize (metadata, ontology), Use stable UUIDs to identify content, Preserve, Federate, Register
• Make accessible, Normalize data, Structure data, Make data digital
Nodes are the essence of infrastructure
Nodes interconnected
• Dynamically interconnected
• Nodes with sub-discipline specific responsibilities
• Standard Exchange formats
• Using UUIDs to identify content
• Ontologies
Slide adapted from Patterson D. 2013, Tempe, Arizona
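As a rough illustration of using UUIDs to identify content, the sketch below uses Python's standard `uuid` module. It contrasts a random UUID with a name-based one that stays the same each time the same record is processed. The slides do not say which kind is intended, so name-based UUIDs are only one possible reading of "stable". The namespace URL and catalogue numbers are hypothetical.

```python
import uuid

# Random identifier: unique, but a new value is produced on every call.
record_id = uuid.uuid4()

# Name-based identifier: the same namespace and name always give the same
# UUID, so re-importing a record does not mint a second identifier for it.
# The namespace URL below is a hypothetical placeholder.
COLLECTION_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.org/collection")


def stable_id(catalogue_number: str) -> uuid.UUID:
    """Derive a repeatable UUID from a (hypothetical) catalogue number."""
    return uuid.uuid5(COLLECTION_NAMESPACE, catalogue_number)


print(record_id)
print(stable_id("SPECIMEN-0001"))
print(stable_id("SPECIMEN-0001") == stable_id("SPECIMEN-0001"))
```

A name-based scheme only helps if the input string itself never changes; where catalogue numbers can be reassigned, a random UUID stored alongside the record may be the safer choice.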
But how many biodiversity informatics projects are out there?
At least 679!
Sources: EDIT, TDWG & ViBRANT 2013
Categories:
Data Aggregator - a web site that collates data from a variety of sources (digital and hardcopy) and presents it in one form
Data Indexer - a web site that provides lists or indexes of other sites that provide data
Data Provider - a web site that provides data directly from research or other studies
Data Standards - a web site that contributes to formulating or developing standards for data
Facilitator - a web site that facilitates the provision of data by other projects or web sites
GBIF: Our global leader in occurrence data
Aggregators
EU-NOMEN - PESI
http://www.eu-nomen.eu/portal/
Aggregators
Making taxonomy digital, open & linked
Aggregators
Scratchpads are an integrated system to enter, curate, mark up, link and publish data: the taxonomic workflow in a single virtual environment
A Scratchpad is a website that holds data for you and your community
The Scratchpads concept
Your data External data & services
65,000 unique visitors/month
[Figure: unique visitors per month to Scratchpads sites]
580 Scratchpads communities by 8,185 active registered users, covering 55,607 taxa in 653,274 pages.
In total more than 1,300,000 visitors
BOLD: Barcode of Life Data Systems
Researchers can assemble, test, and analyse their data records in BOLD before uploading them to the International Nucleotide Sequence Database Collaboration (DDBJ, ENA, GenBank)
Facilitators
Biodiversity literature openly available to the world as part of a global biodiversity community
Biodiversity Heritage Library (BHL)
http://www.biodiversitylibrary.org/
> 40 M pages of legacy literature
Providers
Standard Exchange formats
Darwin Core (DwC)
http://rs.tdwg.org/dwc/index.htm
Primarily used as a metadata standard for specimen records
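To make the idea of a specimen-record standard concrete, here is a small sketch that writes one record as CSV with Darwin Core term names as column headers. The column names are Darwin Core terms. The record values (identifier, institution code, catalogue number, date, coordinates) are invented for illustration and do not describe a real specimen.

```python
import csv
import io

# Column names are Darwin Core terms; the values further down are invented.
FIELDS = [
    "occurrenceID", "basisOfRecord", "institutionCode", "catalogNumber",
    "scientificName", "eventDate", "country",
    "decimalLatitude", "decimalLongitude",
]

record = {
    "occurrenceID": "urn:uuid:00000000-0000-0000-0000-000000000001",
    "basisOfRecord": "PreservedSpecimen",
    "institutionCode": "XYZ",
    "catalogNumber": "12345",
    "scientificName": "Thymus vulgaris L.",
    "eventDate": "2013-06-01",
    "country": "Greece",
    "decimalLatitude": "37.98",
    "decimalLongitude": "23.73",
}


def to_csv(records):
    """Serialise records to CSV text with one Darwin Core term per column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


text = to_csv([record])
print(text)
```

Two collections that agree on these column names can merge their records without a hand-written mapping, which is the interoperability benefit a shared exchange format is meant to provide.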
Access to Biological Collection Data (ABCD)
http://www.tdwg.org/standards/115/
Highly detailed; aims to provide a complete set of data elements for natural history collection items
Audubon Core Multimedia Resources Metadata Schema
http://www.tdwg.org/standards/638/
The Audubon Core metadata schema ("AC") is a representation-neutral metadata vocabulary for describing biodiversity-related multimedia resources and collections.
http://tdwg.napier.ac.uk/index.php?pagename=HomePage
Taxonomic Concept Transfer Schema (TCS)
Mechanism to exchange data concerning the names of organisms
Standards facilitate systems interoperability
UPIDs (Unique Persistent Identifiers) to identify content
Identifiers: a key to find something in a database.
We need Unique Identifiers
10.4289/0013-8797.115.1.75
http://hdl.handle.net/10.4289/0013-8797.115.1.75
http://dx.doi.org/10.4289/0013-8797.115.1.75
http://www.google.co.uk/search?q=10.4289/0013-8797.115.1.75
http://zoobank.org/10.4289/0013-8797.115.1.75
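The four URLs above all embed the same DOI, which is what makes it useful as an identifier: the string survives whichever service wraps it. The sketch below recovers the DOI from each URL with a regular expression. The pattern covers the common modern DOI shape only and is an assumption of this example, not an official DOI grammar.

```python
import re

# Common modern DOI shape: "10.", a 4-9 digit registrant code, "/", a suffix.
# This is a pragmatic pattern assumed for the example, not a full DOI grammar.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s?#&]+")


def extract_doi(text):
    """Return the first DOI-like string found in `text`, or None."""
    match = DOI_PATTERN.search(text)
    return match.group(0) if match else None


urls = [
    "http://hdl.handle.net/10.4289/0013-8797.115.1.75",
    "http://dx.doi.org/10.4289/0013-8797.115.1.75",
    "http://www.google.co.uk/search?q=10.4289/0013-8797.115.1.75",
    "http://zoobank.org/10.4289/0013-8797.115.1.75",
]

# Every URL reduces to one and the same identifier.
dois = {extract_doi(url) for url in urls}
print(dois)
```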
Can a taxonomic name be used as a UPID?
Is it Unique?
Is it Persistent?
Is it an Identifier?
Are taxonomic names enough for communication between Scientists? YES
Are taxonomic names enough for communication between machines? CAN BE IF
For example:
Page R., Brief Bioinform (2008) 9 (5): 345-354. doi: 10.1093/bib/bbn022
ONLY IF Name reconciliation
Patterson, D. J. et al. 2010. Names are key to the big new biology. TREE 25: 686-691 doi: 10.1016/j.tree.2010.09.004
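Name reconciliation means recognising that differently written name strings refer to the same name. The sketch below is a deliberately naive illustration: it normalises whitespace and case and keeps only the genus and species epithet, dropping anything after them (such as authorship). Real reconciliation services also have to deal with synonyms, homonyms, infraspecific ranks and misspellings, none of which is attempted here. The variant spellings are invented for the example.

```python
import re


def canonical_name(name_string):
    """Reduce a binomial name string to 'Genus epithet' (naive sketch)."""
    tokens = re.sub(r"\s+", " ", name_string.strip()).split(" ")
    if len(tokens) < 2:
        return tokens[0].capitalize()
    genus, epithet = tokens[0], tokens[1]
    return f"{genus.capitalize()} {epithet.lower()}"


def reconcile(name_strings):
    """Group raw name strings under their canonical form."""
    groups = {}
    for raw in name_strings:
        groups.setdefault(canonical_name(raw), []).append(raw)
    return groups


# Invented variant spellings of two names.
variants = [
    "Thymus vulgaris L.",
    "thymus  vulgaris",
    "THYMUS VULGARIS",
    "Thymus serpyllum L.",
]

groups = reconcile(variants)
for canonical, raw_strings in groups.items():
    print(canonical, raw_strings)
```

Once strings are grouped this way, the group key (or better, an identifier attached to it) can serve as the join point between databases that spell the name differently.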
The need for Controlled Vocabularies and Ontologies
Knowledge Organisation Systems
Google has done it: http://googleblog.blogspot.co.uk/2012/05/introducing-knowledge-graph-things-not.html
Ontologies
Plant anatomical and structural development Ontology
http://www.plantontology.org/
Deans A. et al., Time to change how we describe biodiversity, Trends in Ecology & Evolution 2012, doi:10.1016/j.tree.2011.11.007
Example of ontology usage
Examples of integrated projects
http://protectedplanet.net
http://thymus.myspecies.info
How is all this relevant to my work?
What should I take home?
Repositories
#bigdata
Providers
Data silos
Community
The four nodes of data workflow
1. We collect and generate data
2. We curate, link and structure data
3. We analyse data
4. We publish data
[Figure: the four nodes of the data workflow: data collection & generation, data curation, data analysis, data publishing]
What are the bottlenecks in the workflow?
What we need is… a seamless workflow
Old Joke
A drunk is crawling around a lamp post on his hands and knees.
A cop comes along …
Cop: What are you doing?
Drunk: Looking for my car keys.
Cop: Are you sure you dropped them here?
Drunk: No, I dropped them in the alley.
Cop: So why are you looking here?
Drunk: Because the light’s better.
Science is a ‘light’s better’ endeavor in that research effort is
not directed at areas where the work is technically infeasible.
Research is directed where real, interpretable results may be
obtained.
We do, in fact, conduct research where the light’s better.
But, when the light changes, so does science.
With better illumination, we look in new areas.
We find new things…
Addressing the challenges of biodiversity informatics
“…the field [of biodiversity informatics] appears to be growing in a void of overarching, motivating questions, effectively making it a set of technologies in search of questions to address.”
Peterson et al., Syst. & Biodiv. 2010, doi: 10.1080/14772001003739369