Taxonomic databases: The SEEK and VegBank experience



R.K. Peet

The University of North Carolina

Ecological Society of America Vegetation Panel

The SEEK development team

Biodiversity informatics depends on accurate and precise taxonomy

• Accurate identification and labelling of organisms is a critical part of collecting, recording and reporting biological data.

• Increasingly, research in biodiversity and ecology is based on the integration (and re-use) of multiple datasets.

• What was a minor annoyance for a few tens of records becomes intractable when looking at a million records.

• Some data types, such as organism identifications, are inherently more complex to define, with the consequence that few standards have been adopted.

Biodiversity data structure

[Diagram labels: Taxonomic database; Observation database; Occurrence database; Observation type database; Observation/Collection Event; Specimen or Object; Observation or Community Type]


• The ESA Vegetation Panel is developing VegBank as a public archive for vegetation plot observations (

• VegBank is expected to function for vegetation plot data in a manner analogous to GenBank.

• Primary data will be deposited for reference, novel synthesis, and reanalysis.

• The database architecture is generalizable to most types of species co-occurrence data.

What is SEEK?

Science Environment for Ecological Knowledge

Multidisciplinary project to create:

• Scientific-workflow system (Kepler)

– Design, reuse, and execute scientific analyses

• Distributed data network (EcoGrid)

– Environmental, ecological, and systematics data

• KR & Semantic Mediation

– Discover, integrate, and compose hard-to-relate data and services via

• Taxonomic concept services

– Resolve taxon ambiguities

Collaborators (the SEEK team)

• NCEAS, UNM, SDSC/UCSD, U Kansas

• Vermont, Napier, ASU, UNC

SEEK High-Level Approach

[Diagram labels, grouped as they appear on the slide:]

• Ecological data set providers: Ecological Data Sets; Ecological metadata language (EML), containing the collector's taxonomic concept(s); EML repository; Taxon coverage

• Taxonomic concept providers: Concept Provider 1 (e.g. Fishbase); Concept Provider 2 (e.g. ITIS); Concept Provider 3 (e.g. Prometheus); Taxonomy transfer schema (TML); Name/Concept Repository

• Semantic Mediation System: User's taxonomic concept + quality measure; Concept matching/expansion/…; Weighted concepts; Return list of Data Sets

Taxonomic database challenge: Standardizing organisms and communities

The problem: Integration of data potentially representing different times, places, investigators and taxonomic standards.

The traditional solution: A standard list of organisms /


Standard lists are available for taxa

Representative examples for higher plants in North America / US:

• USDA Plants

• ITIS

• NatureServe

• BONAP

• Flora North America

These are intended to be checklists wherein the taxa recognized perfectly partition all plants. The lists can be dynamic.

Three concepts of subalpine fir

[Diagram labels: Abies lasiocarpa sec. Little, sec. USDA PLANTS; Abies bifolia and Abies lasiocarpa sec. Flora North America]

Splitting one species into two illustrates the ambiguity often associated with scientific names.

USDA Plants & ITIS

Abies lasiocarpa

var. lasiocarpa

var. arizonica

One concept of Abies lasiocarpa

Flora North America

Abies lasiocarpa

Abies bifolia

A narrow concept of Abies lasiocarpa

Partnership with USDA plants to provide plant concepts for data integration

Andropogon virginicus complex in the Carolinas

9 elemental units; 17 base concepts

Standardized taxon lists fail to allow dataset integration

The reasons include:

• Taxonomic concepts are not defined (just lists),

• Relationships among concepts are not defined,

• The user cannot reconstruct the database as viewed at an arbitrary time in the past,

• Multiple party perspectives on taxonomic concepts and names cannot be supported or reconciled.

Taxonomic theory

[Diagram labels: Name, Reference, Concept]

A taxon concept represents a unique combination of a name and a reference.

Report a concept as: name sec. reference.
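The name-plus-reference definition can be sketched in code. This is a minimal illustration, not a schema from SEEK or VegBank: the class name `TaxonConcept`, its fields, and the `label` method are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonConcept:
    """A taxon concept: one name as circumscribed in one reference (illustrative)."""
    name: str        # scientific name as used by the reference
    reference: str   # publication that supplies the circumscription

    def label(self) -> str:
        # "sec." (secundum) reads as "in the sense of".
        return f"{self.name} sec. {self.reference}"


usda = TaxonConcept("Abies lasiocarpa", "USDA PLANTS")
fna = TaxonConcept("Abies lasiocarpa", "Flora North America")

# Same name, different references: two distinct concepts.
print(usda.label())
print(fna.label())
print(usda == fna)
```

Because equality is defined on the (name, reference) pair, two records labelled with the same name but different references are never silently merged.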


[Diagram labels: Name, Concept, Usage]

A usage represents an association of a concept with a name.

• The name used in defining the concept need not be the same name used in your work.

e.g. Carya alba = Carya tomentosa sec. Gleason & Cronquist 1991.

• Usage can be used to apply multiple name systems to a concept
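A usage table can be sketched as a mapping from concepts to the names applied to them. This is an assumed structure for illustration only; the function names, the `name_system` labels, and the vernacular entry are not from the slides.

```python
from collections import defaultdict

# A concept is identified here by a (name, reference) pair.
concept = ("Carya tomentosa", "Gleason & Cronquist 1991")

# usages[concept] holds (name, name_system) pairs applied to that concept.
usages = defaultdict(set)


def add_usage(concept, name, name_system="scientific"):
    """Record that `name` (in the given name system) is applied to `concept`."""
    usages[concept].add((name, name_system))


def concepts_for_name(name):
    """Return every concept to which `name` has been applied."""
    return sorted(c for c, names in usages.items()
                  if any(n == name for n, _ in names))


add_usage(concept, "Carya tomentosa")
add_usage(concept, "Carya alba")                       # the example from the slide
add_usage(concept, "mockernut hickory", "vernacular")  # illustrative second name system

print(concepts_for_name("Carya alba"))
```

Keeping names in a separate usage table means a name change never alters the concept a record points to.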

Relationships among concepts allow comparisons and conversions

• Congruent, equal (=)

• Includes (>)

• Included in (<)

• Overlaps (><)

• Disjunct (|)

• and others …
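The five named relationships correspond to the possible ways two sets can relate. The sketch below assumes each concept can be modelled as a set of elemental units (an assumption borrowed from the "elemental units" wording of the Andropogon slide); the names `Relationship`, `relate`, and `inverse`, and the unit labels, are hypothetical.

```python
from enum import Enum


class Relationship(Enum):
    CONGRUENT = "="
    INCLUDES = ">"
    INCLUDED_IN = "<"
    OVERLAPS = "><"
    DISJUNCT = "|"


def relate(a: frozenset, b: frozenset) -> Relationship:
    """Classify concept a against concept b by comparing their member units."""
    if a == b:
        return Relationship.CONGRUENT
    if a > b:          # a is a proper superset of b
        return Relationship.INCLUDES
    if a < b:          # a is a proper subset of b
        return Relationship.INCLUDED_IN
    if a & b:          # some, but not all, units are shared
        return Relationship.OVERLAPS
    return Relationship.DISJUNCT


def inverse(rel: Relationship) -> Relationship:
    """Relationship of b to a, given the relationship of a to b."""
    swap = {Relationship.INCLUDES: Relationship.INCLUDED_IN,
            Relationship.INCLUDED_IN: Relationship.INCLUDES}
    return swap.get(rel, rel)


broad = frozenset({"u1", "u2", "u3"})
narrow = frozenset({"u1", "u2"})
other = frozenset({"u3", "u4"})

print(relate(broad, narrow).value)
print(relate(narrow, other).value)
print(relate(broad, other).value)
```

In practice relationships are usually asserted by a taxonomist rather than computed, so a function like `relate` is mainly useful for checking asserted relationships for consistency.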

High-elevation fir trees of western US

[Diagram labels: Abies lasiocarpa with var. lasiocarpa and var. arizonica; Flora North America with Abies bifolia and Abies lasiocarpa]

A. lasiocarpa sec. USDA > A. lasiocarpa sec. FNA

A. lasiocarpa sec. USDA > A. bifolia sec. FNA

A. lasiocarpa var. lasiocarpa sec. USDA > A. lasiocarpa sec. FNA

A. lasiocarpa var. lasiocarpa sec. USDA | A. bifolia sec. FNA

A. lasiocarpa var. arizonica sec. USDA < A. bifolia sec. FNA
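The fir relationships above can drive a simple conversion check when moving data labelled with USDA concepts to Flora North America concepts. This is a sketch under assumptions: the relationship triples are transcribed from the slide with abbreviations spelled consistently, and the function names `candidates` and `safe_targets` are hypothetical.

```python
# (source concept, relationship symbol, target concept), as listed on the slide.
RELATIONS = [
    ("A. lasiocarpa sec. USDA", ">", "A. lasiocarpa sec. FNA"),
    ("A. lasiocarpa sec. USDA", ">", "A. bifolia sec. FNA"),
    ("A. lasiocarpa var. lasiocarpa sec. USDA", ">", "A. lasiocarpa sec. FNA"),
    ("A. lasiocarpa var. lasiocarpa sec. USDA", "|", "A. bifolia sec. FNA"),
    ("A. lasiocarpa var. arizonica sec. USDA", "<", "A. bifolia sec. FNA"),
]


def candidates(concept):
    """Target concepts that may share organisms with `concept` (anything not disjunct)."""
    return [(target, sym) for source, sym, target in RELATIONS
            if source == concept and sym != "|"]


def safe_targets(concept):
    """Targets that contain every organism of `concept`, so relabelling loses nothing."""
    return [target for source, sym, target in RELATIONS
            if source == concept and sym in ("=", "<")]


for source in ("A. lasiocarpa sec. USDA",
               "A. lasiocarpa var. arizonica sec. USDA"):
    print(source, "->", safe_targets(source) or "ambiguous", candidates(source))
```

A record identified only as the broad USDA concept has two candidate targets and no safe one, which is the integration ambiguity the slide illustrates.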

Party Perspective

The Party Perspective on a Concept includes:

• Status – Standard, Nonstandard, Undetermined

• Correlation with other concepts – Equal, Greater, Lesser, Overlap, Undetermined.

• Start & Stop dates.
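A party perspective with start and stop dates allows a party's view to be reconstructed for any past date. The sketch below is illustrative only: the class and function names, the party "Party A", and all dates are hypothetical and do not describe any real party's history.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PartyPerspective:
    party: str
    concept: str
    status: str                   # "Standard", "Nonstandard", or "Undetermined"
    start: date
    stop: Optional[date] = None   # None means the perspective is still in force

    def active_on(self, day: date) -> bool:
        return self.start <= day and (self.stop is None or day < self.stop)


def status_on(perspectives, party, concept, day):
    """Status the party assigned to the concept on the given day."""
    for p in perspectives:
        if p.party == party and p.concept == concept and p.active_on(day):
            return p.status
    return "Undetermined"


concept = "Abies bifolia sec. Flora North America"
history = [
    # Hypothetical history: superseded rows are closed with a stop date, never deleted.
    PartyPerspective("Party A", concept, "Nonstandard", date(1997, 1, 1), date(2004, 1, 1)),
    PartyPerspective("Party A", concept, "Standard", date(2004, 1, 1)),
]

print(status_on(history, "Party A", concept, date(2000, 6, 1)))
print(status_on(history, "Party A", concept, date(2005, 6, 1)))
```

Closing a perspective with a stop date instead of overwriting it is what keeps earlier views of the database recoverable.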

Intended functionality

• Organisms are labeled by reference to concept (name-reference combination),

• Party perspectives on concepts and names can be dynamic, but remain perfectly archived,

• User can select which party perspective to follow, and at which date,

• Different name systems are supported,

• Enhanced stability in recognized concepts by separating name assignment and rank from concept.

Best practice: Report taxa by reference to concepts.

When reporting the identity of organisms in publications, data, or on specimens, provide the full scientific name of each kind of organism and the reference that provided the taxonomic concept.

e.g., Abies lasiocarpa sec. Flora North America 1997.

Best practice: Choose high-quality concepts

• Reference high-quality sources for taxon concepts such as a major compendium that provides its own defined concepts, or a source that references the concepts of others.

• Avoid checklists as they typically lack true taxonomic descriptions or circumscriptions.

SEEK & GBIF are working to provide standards for concept data

• Several data models incorporate taxon concepts. The IOPI, VegBank, and Taxonomer models are optimized for different uses.

• SEEK, GBIF, and TDWG developed TCS, which was adopted by TDWG in August 2005 and is being implemented by GBIF and SEEK.

Concepts and identifications are distinct.

• A name in a publication could be either a concept or an identification.

• An annotation is an identification.

• Identifications should include linkage to at least one concept, but need not be limited to a single concept.

Documenting identifications

Relationships added for identification

= Indicates identification

~ (or aff.) Indicates similarity

≡ Indicates identity, or defined as

Example of complex identification

< Potentilla sec. Cronquist 1991 +

~ Potentilla simplex sec. Cronquist 1991 +

~ Potentilla canadensis sec. Cronquist 1991
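An identification that links to several concepts can be sketched as a record holding one (relationship, concept) pair per link. The class name, field names, and specimen identifier below are hypothetical; only the three Potentilla links come from the slide.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Identification:
    specimen: str                  # hypothetical identifier for the organism or record
    links: List[Tuple[str, str]]   # (relationship symbol, concept label)

    def __post_init__(self):
        # An identification should link to at least one concept.
        if not self.links:
            raise ValueError("an identification needs at least one concept link")

    def render(self) -> str:
        return " + ".join(f"{symbol} {concept}" for symbol, concept in self.links)


ident = Identification(
    specimen="plot-12/stem-3",  # hypothetical
    links=[
        ("<", "Potentilla sec. Cronquist 1991"),
        ("~", "Potentilla simplex sec. Cronquist 1991"),
        ("~", "Potentilla canadensis sec. Cronquist 1991"),
    ],
)

print(ident.render())
```

Storing the links as separate pairs, not as one free-text string, lets a query find every record that touches a given concept regardless of how uncertain the identification was.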

Fuzzy logic qualification

1 = Absolutely wrong

2 = Understandable but wrong

3 = Reasonable or acceptable

4 = Good answer

5 = Absolutely correct
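The slide does not say how these scores are applied. As one possible reading, the sketch below attaches a score to each candidate concept for an organism and keeps the candidates at or above a threshold. The names `Fit` and `acceptable` and the example scores are assumptions made for illustration.

```python
from enum import IntEnum


class Fit(IntEnum):
    ABSOLUTELY_WRONG = 1
    UNDERSTANDABLE_BUT_WRONG = 2
    REASONABLE = 3
    GOOD = 4
    ABSOLUTELY_CORRECT = 5


def acceptable(scored, minimum=Fit.REASONABLE):
    """Keep the concepts whose score is at least `minimum`, in input order."""
    return [concept for concept, fit in scored if fit >= minimum]


# Hypothetical scores for one organism against three candidate concepts.
scored = [
    ("Potentilla simplex sec. Cronquist 1991", Fit.GOOD),
    ("Potentilla canadensis sec. Cronquist 1991", Fit.REASONABLE),
    ("Potentilla recta sec. Cronquist 1991", Fit.UNDERSTANDABLE_BUT_WRONG),
]

print(acceptable(scored))
print(acceptable(scored, minimum=Fit.GOOD))
```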

Biodiversity informatics depends on standards and connectivity

• Names (Linnean Core)

• Taxonomic concepts (TCS)

• Publications (Alexandrian core, etc.)

• Observations (proposed TDWG standard)

• Identifications (proposed EML extension)

• GUIDs (under development by GBIF)

Tools to develop and map concepts

• Taxonomists need mapping and visualization tools for relating concepts of various authors. SEEK is building prototypes for review and possible adoption.

• Aggregators need tools for mapping relationships among concepts.

• Users need tools for entering legacy concepts. Several are in development.

Concept mapper

Demonstration Projects

Concept relationships of Southeastern US plants treated in different floras.

Based on > 50,000 mapped concepts

Step 1: Adoption of minimum standards and best practices by high-quality journals, funding agencies, and professional organizations.

Distributed information systems - and the way


Publishers, curators and data managers need to tag taxon interpretations with concepts

• Precedent exists in the tagging of literature citations and GenBank accessions

• Presses are linking scientific names in many ejournals to ITIS (e.g. Evolution, Ecology)

Step 2: Creation, availability, and maintenance of databases that document core sets of taxonomic concepts and the relationships of these concepts to each other.

The way ahead

True concept-based checklists

• Equivalent of ITIS but with concept documentation and including how other concepts map onto the concepts accepted by the party.

• Several are operative or in development including EuroMed, IOPI-GPC, Biotics, VegBank. Concept documentation planned for ITIS/USDA.

Registration system and standard identifiers for names, references, and concepts

• Essential for data exchange

• GBIF is hosting a set of international workshops to design the GUID infrastructure.

Step 3: Development and provision of tools to facilitate mark-up of data and manuscripts with taxonomic concepts

Step 4: Demonstration projects

