Biodiversity informatics: why aren’t we there yet?

@rdmpage

http://iphylo.blogspot.com

Talk given at CISA 2013, Barcelona, 26 September 2013

I’ve often said I want a Google for biodiversity data…

…turns out what I should have asked for was a NSA for biodiversity

• There are known knowns, things we know that we know

• There are known unknowns, things we now know we don’t know

• But there are also unknown unknowns, things we do not know we don't know

[Figure: grid labeled “known” / “unknown” against “knowns” / “unknowns”]

What do these diagrams tell us?

Implications

• Sequencing is cheap

• The flood of sequences is only going to increase

• How much of this is relevant to biodiversity?

Numbers of new animal names (chart annotations: 1923, WWI, WWII)

Implications

• Rate of new taxa being described is relatively constant

• Suggests taxonomists are working at capacity

• Most taxonomic work is in the past

• Compare this to exponential growth of sequencing

Mammals in GenBank

Proper Linnaean names

Aus sp.

Mammals

Proper Linnaean names

Aus sp.

“Invertebrates”

BOLD

Dark taxa

• Disconnect between taxonomy and genomics

• How much of this comprises taxa we already know about versus new diversity?

• Do we need taxonomic names?
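
One rough way to put a number on the split between proper Linnaean names and "Aus sp."-style labels is to test each organism name against a strict binomial pattern. The sketch below is only a heuristic under stated assumptions: the pattern, the function name `is_dark`, and the sample list are illustrative and not taken from the talk, and the "sp." label in the sample is made up.

```python
import re

# Assumption: a "proper" name is exactly a capitalised genus plus a
# lower-case species epithet. Anything else (sp., cf., voucher codes,
# subspecies, hybrids) is counted as "dark" by this simple heuristic.
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z][a-z-]+$")

def is_dark(name):
    """Return True when the name is not a plain Linnaean binomial."""
    return BINOMIAL.match(name.strip()) is None

# Illustrative names only; "Rattus sp. ABC-2012" is a made-up label.
names = [
    "Mus musculus",
    "Rattus sp. ABC-2012",
    "Crocidura cf. suaveolens",
    "Homo sapiens",
]

dark_fraction = sum(is_dark(n) for n in names) / len(names)
print(dark_fraction)  # 0.5 for the four names above
```

A pattern this strict also flags legitimate trinomials as dark, so any real count would need a more careful rule.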

100,000 articles from http://biostor.org (BHL)

1923 today

Scanned legacy

• BHL is more than pre-1923 literature

• The real gap is post-1923 to pre-open access (2003)

• Most of the 20th century taxonomic literature is “dark”

Size of Wikipedia articles on mammals

Few, large articles

Many, small articles “long tail”

Power law

• We know a lot about a few species

• For most species we know very little (even in well-known groups)

• For poorly known species we need to go to the legacy literature

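A power-law ("long tail") pattern can be checked by ranking items by size and fitting a straight line to log(size) against log(rank). The following is a minimal sketch, not the analysis behind the slide: `rank_size_slope` is a hypothetical helper, and the input is synthetic data generated to follow size = 100000 / rank, not real Wikipedia article sizes.

```python
import math

def rank_size_slope(sizes):
    """Least-squares slope of log(size) against log(rank).

    A slope near -1 on a log-log rank-size plot is the classic
    Zipf-like long tail. Needs at least two positive sizes.
    """
    ranked = sorted((s for s in sizes if s > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(size) for size in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic sizes (assumption, for illustration only): size = 100000 / rank.
synthetic = [100000 / rank for rank in range(1, 501)]
slope = rank_size_slope(synthetic)
print(round(slope, 6))
```

An ordinary least-squares fit on log-log axes is a quick diagnostic only; it is known to be a weak estimator for real power-law data.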

PanTHERIA (2009)

1923 2003

Legacy literature

• Legacy literature matters (even for well-studied taxa)

• Much of this will be in the digitally “dark” period

Publishers of taxonomy (# articles)

http://bionames.org

Publishers

• BioStor (BHL) is the single largest source of taxonomic literature

• Lots of tiny publishers (long tail)

• Commercial publishers important (Magnolia Press, Springer, Informa, Wiley, Elsevier, BioOne)

• Who do we talk to about data mining?

Taxonomic journals (articles/decade)

Implications

• Zootaxa is indeed a “mega journal”

• If we had to pick one journal to data mine, it would be Zootaxa

GBIF

• The Global Biodiversity Information Facility is not evenly “global”

• Tells us as much about sampling as about the distribution of diversity

Flickr EOL group

Crowd sourcing

• Where is the “crowd”?

• It’s where the iPhones are…

GenBank animal sequences

GenBank host records

Implications

• GenBank is about more than genes

• GenBank has a wealth of information on location and ecological interactions
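
The location and host information sits in qualifiers on the source feature of a GenBank flat-file record. As a minimal sketch of pulling these out, assuming the record is already available as text and that the `/host`, `/country`, and `/lat_lon` qualifiers are the ones of interest: the function name `source_qualifiers` and the sample record are hypothetical, and the organism and host names in it are placeholders.

```python
import re

# Matches /host="...", /country="..." and /lat_lon="..." qualifiers.
# Values may wrap over several lines in a flat file, so [^"]* is
# allowed to span newlines and whitespace is normalised afterwards.
QUALIFIER = re.compile(r'/(host|country|lat_lon)="([^"]*)"')

def source_qualifiers(record_text):
    """Return a dict of the host/country/lat_lon qualifiers found."""
    return {
        key: " ".join(value.split())
        for key, value in QUALIFIER.findall(record_text)
    }

# Hypothetical record fragment for illustration; not a real accession.
SAMPLE = '''
FEATURES             Location/Qualifiers
     source          1..658
                     /organism="Aus bus"
                     /host="Cus dus"
                     /country="Spain: Barcelona"
                     /lat_lon="41.39 N 2.17 E"
'''

print(source_qualifiers(SAMPLE))
```

Pairing `/organism` with `/host` across many records gives candidate ecological interactions, and `/lat_lon` gives points for a map. For anything beyond a quick look, a full flat-file parser would be a safer choice than a regular expression.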

Implications

• Phylogenetic data is not being archived (why not?)

• Makes it hard to reproduce studies

• Does data matter?

• What level of granularity should be citable?

What do these diagrams tell us?