Vincent S. Smith The Virtual Taxonomist Scholarly communication for the facebook generation

The Virtual Taxonomist


Page 1: The Virtual Taxonomist

Vincent S. Smith

The Virtual Taxonomist
Scholarly communication for the facebook generation

Page 2: The Virtual Taxonomist

Taxonomy
The foundation of biology

Goal…
• Inventory the Earth’s species
• Document their relationships
• “Publish” these data

Data set…
• 1.8M described species (10M names)
• 300M pages (over last 250 years)
• 1.5-3B specimens

People…
• 4-6,000 scientists
• 30-40,000 amateurs
• Many more citizen scientists?

Page 3: The Virtual Taxonomist

Taxonomy is parochial
Information sits in the “long tail” of a power distribution

1.8 million species
• Animals: 1.18M spp.
• Plants: 260k spp.
• Other: 193k spp.
• Fungi: 101k
• Bacteria: 9,021 spp.
• Archaebacteria: 259 spp.

Page 4: The Virtual Taxonomist

Taxonomy is parochial
Information sits in the “long tail” of a power distribution

1.8 million species
• Animals: 1.18M spp.
  - Insects: 0.82M spp.
  - Molluscs: 117k
  - Crustaceans: 39k
  - Fish: 25k
  - Flatworms: 13.7k
  - Birds: 10k
  - Sponges: 10k
  - Cnidarians: 9k
  - Reptiles: 7.1k
  - Mammals: 5k
  - Amphibians: 5k
  - Rotifers: 1.8k
• Plants: 260k spp.
• Other: 193k spp.
• Fungi: 101k
• Bacteria: 9,021 spp.
• Archaebacteria: 259 spp.

Page 5: The Virtual Taxonomist

Taxonomy is parochial
Information sits in the “long tail” of a power distribution

1.8 million species
• Animals: 1.18M spp.
  - Insects: 0.82M spp. (Beetles 370k spp.; Bees, wasps & ants 198k spp.; Butterflies & moths 165k spp.; Flies 85k spp.)
  - Molluscs: 117k
  - Crustaceans: 39k
  - Fish: 25k
  - Flatworms: 13.7k
  - Birds: 10k
  - Sponges: 10k
  - Cnidarians: 9k
  - Reptiles: 7.1k
  - Mammals: 5k
  - Amphibians: 5k
  - Rotifers: 1.8k
• Plants: 260k spp.
• Other: 193k spp.
• Fungi: 101k
• Bacteria: 9,021 spp.
• Archaebacteria: 259 spp.

• 0.01 papers per species per year, i.e. 1 paper every 100 years
• Birds: 1 paper per species per yr.
• Mammals: 2 papers per species per yr.
• Elephants: 47 papers per species per yr.
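The publication rates quoted on this slide can be turned into waiting times with one division. The sketch below is only an illustration of that arithmetic; the function name and the group labels are assumptions, and the slide does not say which group the 0.01 figure refers to.

```python
def years_per_paper(papers_per_species_per_year):
    """Average wait, in years, between papers on a single species."""
    return 1.0 / papers_per_species_per_year


# Rates as quoted on the slide; the "long-tail" label is an assumption.
rates = {
    "long-tail rate": 0.01,
    "birds": 1,
    "mammals": 2,
    "elephants": 47,
}

for group, rate in rates.items():
    print(f"{group}: one paper every {years_per_paper(rate):.2f} years")
```

Read this way, the gap between the long tail and the elephants is a factor of 4,700 in attention per species.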

Page 6: The Virtual Taxonomist

Taxonomy is slow
Most life on earth is still undescribed

Timeline: 1758 to 2008 (250 yrs); 2008 to 3008 (1000 yrs!!!)?

250 years and counting!
• Animals: 1.18M spp.
• Plants: 260k spp.
• Other: 193k spp.
• Fungi: 101k
• Bacteria: 9,021 spp.
• Archaebacteria: 259 spp.

The story so far…
• Estimates range from 5-100 million species (prob. 80% undescribed)
• At present rates most species will be extinct before we get to describe them
• Most descriptions are formulaic, publication process is slow, involves paper archival

Most biodiversity (data) is hidden
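The "1000 yrs" on the timeline is consistent with a simple extrapolation from numbers given elsewhere in the talk: 1.8M species described in 250 years, and roughly 80% of species still undescribed. The sketch below reconstructs that arithmetic under the assumption of a constant description rate; the function and parameter names are hypothetical, and the talk does not state that this is how the figure was derived.

```python
def years_to_describe_rest(described, years_so_far, fraction_undescribed):
    """Years needed to describe the remaining species at the historical rate.

    Assumes a constant rate of description, which is a simplification.
    """
    rate_per_year = described / years_so_far           # 1.8M / 250 = 7,200 per year
    total = described / (1.0 - fraction_undescribed)   # implied total species
    remaining = total - described
    return remaining / rate_per_year


estimate = years_to_describe_rest(
    described=1_800_000, years_so_far=250, fraction_undescribed=0.8
)
print(f"About {estimate:.0f} more years at the historical rate")
```

With 80% undescribed the implied total is 9M species, which sits inside the 5-100 million range quoted on the slide.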

Page 7: The Virtual Taxonomist

Taxonomy is hard to find
People & data distributed & highly fragmented

• Small communities working on biodiversity
  - Just 4-6,000 taxonomists worldwide
• So is the data we use & publish
  - 1.5-3B specimens worldwide (type specimens)
  - 300M pages spanning 250 yrs. (all still relevant)
• We use different methods of citation (pp.)

Page 8: The Virtual Taxonomist

Taxonomy is hard to find
People & data distributed & highly fragmented

• Small communities working on biodiversity
  - Just 4-6,000 taxonomists worldwide
• So is the data we use & publish
  - 1.5-3B specimens worldwide (type specimens)
  - 300M pages spanning 250 yrs. (all still relevant)
• We use different methods of citation (pp.)
• Publications are data rich
  - Mol. Phyl. Evol.: 21,964 pp. since 2000
  - Taxon names shown on the slide: Menopon gallinae, Numidicola antennatus, Amyrsidea ventralis, Somaphantus lusius, Menacanthus stramineus, Colimenopon urocolius, Trinoton anserinum, Meromenopon meropis, Gruimenopon longum, Hoazineus armiferus, Copocephalum zebra, Comatomenopon elbeli/elongatum, Psittacomenopon poicephalus, Odoriphila clayae/phoeniculi, Ardeiphilus trochioxus, Cuculiphilus fasciatus, Ciconiphilus quadripustulatus, Eomenopon denticulatum, Piagetiella bursaepelecani, Osborniella crotophagae, Hohorstiella lata, Neomenopon pteroclurus, Machaerilaemus laticorpus/latifrons, Austromenopon crocatum, Eidmanniella pellucida, Holomenopon brevithoracicum, Dennyus hirundinis, Myrsidea victrix, Ancistrona vagelli, Pseudomenopon pilosum, Bonomiella columbae, Chapinia robusta, Plegadiphilus threskiornis, Actornithophilus uniseriatus, MEGAMENOPON, Rediella mirabilis, Latumcephalum lesouefi/macropus, Paraboopia flava, Paraheterodoxus insignis, Boopia tarsata, Therodoxus oweni, Laemobothrion maximum, Ricinus fringillae, Trochiliphagus abdominalis, Trochiloecetes rupununi, Liposcelis bostrychophilus

Page 9: The Virtual Taxonomist

Taxonomy is hard to find
People & data distributed & highly fragmented

• Small communities working on biodiversity
  - Just 4-6,000 taxonomists worldwide
• So is the data we use & publish
  - 1.5-3B specimens worldwide (type specimens)
  - 300M pages spanning 250 yrs. (all still relevant)
• We use different methods of citation (pp.)
• Publications are data rich

DATA
• Linked by taxonomic names

Page 10: The Virtual Taxonomist

Taxonomy is hard to find
People & data distributed & highly fragmented

• Small communities working on biodiversity
  - Just 4-6,000 taxonomists worldwide
• So is the data we use & publish
  - 1.5-3B specimens worldwide (type specimens)
  - 300M pages spanning 250 yrs. (all still relevant)
• We use different methods of citation (pp.)
• Publications are data rich

DATA
• Linked by taxonomic names

What does this all mean…
• Taxonomy is an information science (formulaic, data rich, parochial, underfunded)
• Taxonomy lends itself to the Web

Page 11: The Virtual Taxonomist

Getting taxonomy on the Web
Tackling the problems of the taxonomic community

Scratchpads
• Web publishing for taxonomists

Biodiversity Heritage Library
• Digitising heritage literature

Encyclopedia of Life
• A web page for every species

Plazi.org & iPhylo
• Data mining contemporary literature

Page 12: The Virtual Taxonomist

Getting taxonomy on the Web
Tackling the problems of the taxonomic community

Scratchpads
• Web publishing for taxonomists

Biodiversity Heritage Library
• Digitising heritage literature

Encyclopedia of Life
• A web page for every species

Plazi.org & iPhylo
• Data mining contemporary literature

Page 13: The Virtual Taxonomist

Biodiversity Heritage Library (BHL)
“Digitising biodiversity literature”

• Biodiversity publications since 1469
  - 5.4 million books
  - 800,000 monographs
  - 40,000 periodicals
• Held by Natural History libraries
  E.g., NHM holds more than 1M books, 250k monographs & periodicals, 0.5M artworks
• BHL partnership of 10 Nat. Hist. libraries
• Sharing the digitisation of contents
• Focus on out of copyright materials
• Partnership with “Internet Archive”
• Make the contents “findable”

Page 14: The Virtual Taxonomist

Biodiversity Heritage Library (BHL)
“Digitising biodiversity literature”

1. Scan (photograph)
   1 scribe machine, 3,500 pages per shift per day
   34 scribe machines now in operation
2. Extract text (OCR)
3. Find keywords
   - Taxonomic names
   - Author names
   - Citations
   - Collection data
   - Morphological data
   - Descriptions
   - Identification keys
   - Illustrations
   - Photographs
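Step 3 depends on recognising taxonomic names in OCR text. The sketch below shows one naive approach: a regular expression for a capitalised genus followed by a lowercase epithet, filtered against a list of known genera. The function name, the sample sentence, and the genus list are hypothetical; this is not BHL's name-finding method, and real OCR output needs far more tolerant matching.

```python
import re

# Capitalised word (candidate genus) followed by a lowercase word (candidate epithet).
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def find_candidate_names(text, known_genera):
    """Return 'Genus epithet' strings whose genus appears in known_genera."""
    names = []
    for genus, epithet in BINOMIAL.findall(text):
        if genus in known_genera:  # drops false positives such as "The lice"
            names.append(f"{genus} {epithet}")
    return names


# Hypothetical OCR sentence using two louse names that appear in this talk.
ocr_text = (
    "Specimens of Menopon gallinae and Menacanthus stramineus "
    "were taken from the host. The lice were common."
)
known_genera = {"Menopon", "Menacanthus"}

names = find_candidate_names(ocr_text, known_genera)
print(names)
```

The genus filter is what keeps ordinary sentence openings out of the results; without it the pattern alone would also return "The lice".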

Page 15: The Virtual Taxonomist

Biodiversity Heritage Library (BHL)
“Digitising biodiversity literature”

1. Scan
2. Extract text (OCR)
3. Find keywords
   - Taxonomic names
   - Author names
   - Citations
   - Collection data
   - Morphological data
   - Descriptions
   - Identification keys
   - Illustrations
   - Photographs

Palma, R.L., and R.L.C. Pilgrim. 2002. A revision of the genus Naubates (Insecta: Phthiraptera: Philopteridae). J. R. Soc. N.Z. 32:7-60.

Page 16: The Virtual Taxonomist

Biodiversity Heritage Library (BHL)
“Digitising biodiversity literature”

1. Scan
2. Extract text (OCR)
3. Find keywords
   - Taxonomic names
   - Author names
   - Citations
   - Collection data
   - Morphological data
   - Descriptions
   - Identification keys
   - Illustrations
   - Photographs
4. Index
5. Put on the web

Palma, R.L., and R.L.C. Pilgrim. 2002. A revision of the genus Naubates (Insecta: Phthiraptera: Philopteridae). J. R. Soc. N.Z. 32:7-60.

Page 17: The Virtual Taxonomist

Biodiversity Heritage Library (BHL)
“Digitising biodiversity literature”

http://www.biodiversitylibrary.org/

• NHM, London
  - 1 scribe machine
  - >500k pages
  - Focus on exceptionally rare text
• Completed to date:
  - 3,802 periodicals (journals)
  - 9,181 books
  - 5.5 million pages (2% of total)
• Challenges
  - Copyright (1923 USA)
  - OCR quality (old fonts)
  - Better indexing
  - Foreign language content
  - Needs a critical mass of content to be useful
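The scanning figures quoted in this talk (3,500 pages per Scribe machine per shift per day, 34 machines, roughly 300M pages of literature, 5.5M pages completed) allow a rough scale check. The sketch below assumes one shift per day, every day of the year, with all machines running at the quoted rate; those assumptions and the function names are illustrative and do not come from the talk.

```python
def scanning_years(total_pages, machines, pages_per_machine_per_day, days_per_year=365):
    """Years to scan total_pages at a constant daily throughput (assumed)."""
    pages_per_year = machines * pages_per_machine_per_day * days_per_year
    return total_pages / pages_per_year


def percent_done(pages_done, total_pages):
    """Share of the literature already scanned, as a percentage."""
    return 100.0 * pages_done / total_pages


years = scanning_years(total_pages=300_000_000, machines=34, pages_per_machine_per_day=3_500)
done = percent_done(pages_done=5_500_000, total_pages=300_000_000)
print(f"Roughly {years:.1f} years to scan 300M pages; {done:.1f}% done so far")
```

The percentage agrees with the "2% of total" on the slide once rounded, which suggests the 300M-page figure is the total the slide has in mind.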

Page 18: The Virtual Taxonomist

Getting taxonomy on the Web
Tackling the problems of the taxonomic community

Scratchpads
• Web publishing for taxonomists

Biodiversity Heritage Library
• Digitising heritage literature

Encyclopedia of Life
• A web page for every species

Plazi.org & iPhylo
• Data mining contemporary literature

Page 19: The Virtual Taxonomist

Data mining taxonomic publications
“Extracting factual information”

- Taxonomic names
- Author names
- Citations
- Collection data
- Morphological data
- Descriptions
- Identification keys
- Illustrations
- Photographs

Palma, R.L., and R.L.C. Pilgrim. 2002. A revision of the genus Naubates (Insecta: Phthiraptera: Philopteridae). J. R. Soc. N.Z. 32:7-60.

Page 20: The Virtual Taxonomist

Data mining taxonomic publications
“Extracting factual information”

- Taxonomic names
- Author names
- Citations
- Collection data
- Morphological data
- Descriptions
- Identification keys
- Illustrations
- Photographs

Palma, R.L., and R.L.C. Pilgrim. 2002. A revision of the genus Naubates (Insecta: Phthiraptera: Philopteridae). J. R. Soc. N.Z. 32:7-60.

Page 21: The Virtual Taxonomist

Data mining taxonomic publications
Experimental extraction of factual information

Plazi.org (D. Agosti et al.) (Manual, slow but accurate)
- Article (Hand selected)
- GoldenGate (Manual Software)
- Approx. 26 nested fields (TaxonX-XML)
- Repository (DSpace)

iPhylo (R. Page) (Automatic, fast but dirty)
- “Library” (Legal & minable)
- Crawler scripts & web services
- Approx. 12? data objects
- Entity-Attribute-Value Model (Database)
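An Entity-Attribute-Value model stores every extracted fact as one row of (entity, attribute, value), so new kinds of fact need no schema change and an attribute can hold several values. The sketch below illustrates the idea with an in-memory SQLite table and facts from the Palma & Pilgrim citation used in this talk. The table layout, function names, and the identifier `palma_pilgrim_2002` are hypothetical; this is not the actual iPhylo schema.

```python
import sqlite3


def build_store(triples):
    """Create an in-memory EAV table and load (entity, attribute, value) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE eav (entity TEXT, attribute TEXT, value TEXT)")
    conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", triples)
    conn.commit()
    return conn


def values_of(conn, entity, attribute):
    """Return all values of one attribute for one entity, in insertion order."""
    cur = conn.execute(
        "SELECT value FROM eav WHERE entity = ? AND attribute = ? ORDER BY rowid",
        (entity, attribute),
    )
    return [row[0] for row in cur]


PAPER = "palma_pilgrim_2002"  # hypothetical identifier for the cited article
triples = [
    (PAPER, "author", "Palma, R.L."),
    (PAPER, "author", "Pilgrim, R.L.C."),  # multi-valued attribute: just another row
    (PAPER, "year", "2002"),
    (PAPER, "journal", "J. R. Soc. N.Z."),
    (PAPER, "pages", "7-60"),
    (PAPER, "taxon", "Naubates"),
]

conn = build_store(triples)
print(values_of(conn, PAPER, "author"))
print(values_of(conn, PAPER, "taxon"))
```

The trade-off is that every value is stored as text and queries across attributes need self-joins, which fits the "fast but dirty" description better than a hand-curated nested XML record.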

Page 22: The Virtual Taxonomist

Data mining taxonomic publications
Experimental extraction of factual information

Plazi.org (D. Agosti et al.) (Manual, slow but accurate)
- Repository (DSpace)
- RSS + TAPIR

iPhylo (R. Page) (Automatic, fast but dirty)
- Entity-Attribute-Value Model (Database)
- RDF + RSS
- Data visualizations

“A database of everything!”

Page 23: The Virtual Taxonomist

Getting taxonomy on the Web
Tackling the problems of the taxonomic community

Scratchpads
• Web publishing for taxonomists

Biodiversity Heritage Library
• Digitising heritage literature

Encyclopedia of Life
• A web page for every species

Plazi.org & iPhylo
• Data mining contemporary literature

Page 24: The Virtual Taxonomist

Encyclopedia of Life (EOL)
“A web page for every species”

http://www.eol.org/

• A web page for all 1.8M species
• Multi-institution collaboration
• $50m funding (5 years)
  - MacArthur and Sloan Foundations
• Megascience mashup
  - Aggregating data from the web
• Multiple audiences
  - Science & outreach
• 10 years to complete
  - First draft 2008, “finished” 2017!

Page 25: The Virtual Taxonomist
Page 26: The Virtual Taxonomist

Encyclopedia of Life (EOL)
“A web page for every species”

• First draft 27 Feb. 2008
  - 24 “exemplar” pages
  - 30,000 detailed pages (fish & amphib.)
  - 1 million “stubs” (names & links)
• Huge interest
  - 11.5 million hits in first 5 hours
  - 500+ press articles
  - Pages unavailable for first two days!
• Much praise but some criticism
  - Growth (needs 1,000 spp. per day)
  - Quality vs. quantity of information
  - Authoritative “vetting” process
  - Credit for “authors”
• Nine more years to go
  - Get more content online
  - Better tools to engage more people

Page 27: The Virtual Taxonomist

Getting taxonomy on the Web
Tackling the problems of the taxonomic community

Scratchpads
• Web publishing for taxonomists

Biodiversity Heritage Library
• Digitising heritage literature

Encyclopedia of Life
• A web page for every species

Plazi.org & iPhylo
• Data mining contemporary literature

Page 28: The Virtual Taxonomist

What is a Scratchpad?
“A Website & publishing platform for taxonomic communities”

1. Your data
2. Uploaded & tagged
3. Published & reviewed on your site

Page 29: The Virtual Taxonomist

What is a Scratchpad?
“A Website & publishing platform for taxonomic communities”

1. Your data
2. Uploaded & tagged
3. Published & reviewed on your site

Fast, intuitive, fit for use

Page 30: The Virtual Taxonomist

What can Scratchpads do?
Import, manage, search & browse:

• DNA & Phylogenies
• Specimens
• Literature
• Images

Page 31: The Virtual Taxonomist

What can Scratchpads do?
Integration & connectivity within & between sites

• Taxonomy
• DNA & Phylogenies
• Specimens
• Literature
• Images

Page 32: The Virtual Taxonomist

Current Scratchpads

Ants, Bees, Beetles, Big-headed flies, Birds, Blackflies, Ciliates, Cockroaches, Dragon Trees, Dung Beetles, False Buttonweed, Flat worms, Flies, Foraminifera, Fossil Insects, Fungus Gnats, Holometabola, Leaf-miner Flies, Lice, Lichens of Bermuda, Malvaceae, Megalastrum ferns, Milichiid flies, Mosquitoes, Mosses, Nannotax fossils, Nepticuloid moths, Palms, Pearl oysters, Polychaete worms, Scaleworms, Stick insects, Sulawesi Ferns, Termites, Triticid grasses, Weevils, Wood Ferns

Sites: 61
Users: 665
Pages: 130k
Since March 2007

Page 33: The Virtual Taxonomist

Scratchpad applications
A multipurpose, flexible technology

eBooks
4th Edition Howard & Moore, Birds of the world (fact checking, data compilation, 2010, funding)

Page 34: The Virtual Taxonomist

Scratchpad applications
A multipurpose, flexible technology

eJournals
European Mosquito Bulletin (ISSN 1460-6127), Phasmid Studies (ISSN 0966-0011) (submission, review, & dissemination of articles)

Page 35: The Virtual Taxonomist

Scratchpad applications
A multipurpose, flexible technology

Image galleries
Nanno fossils, Cockroaches, Stick insects, Flatworms, Grasses, Lichens & many more… (rapid upload, annotation, & display of images)

Page 36: The Virtual Taxonomist

Scratchpad usage
Content & contributors in the first 15 months

129,896 pages, 665 contributors (June 24 2008)

Pages:
- Across 61 sites
- In detail:
  • Definitions (41%)
  • References (26%)
  • Associations (8.5%)
  • DNA sequences (6%)
  • Images (4.5%)
  • Maps (2.8%)
  • Specimens (2.1%)
  • Others (1.3%)

Page 37: The Virtual Taxonomist

Scratchpad usage
Content & contributors in the first 15 months

129,896 pages, 665 contributors (June 24 2008)

Contributors:
- No more than 10% significantly active
- Contributors in more than 30 countries
- In detail [Jan. 08]:
  • Europe (55%)
  • Unknown (29%)
  • North America (9%)
  • Asia (3%)
  • Australasia (2.5%)
  • South America (2%)
  • Russia (0.8%)
  • Middle East (0.4%)

Page 38: The Virtual Taxonomist

Scratchpad visitors
Tracking visitors across sites

March 2008

Page 39: The Virtual Taxonomist

Scratchpad visitors
Popular content: what visitors are looking at

The “long tail” of taxonomy

Visitors want less of more, i.e. everyone wants something different

Page 40: The Virtual Taxonomist

Scratchpad overview

Page 41: The Virtual Taxonomist

Scratchpads are integrating taxonomy
“Small pieces loosely joined”

Scratchpads
• Web publishing for taxonomists

Biodiversity Heritage Library
• Digitising heritage literature

Encyclopedia of Life
• A web page for every species

Plazi.org & iPhylo
• Data mining contemporary literature

Page 42: The Virtual Taxonomist

Integrating taxonomy

Page 43: The Virtual Taxonomist

Questions?

Page 44: The Virtual Taxonomist
Page 45: The Virtual Taxonomist

Scratchpad management
Scalable & sustainable technology

Hardware, software & user support
Virtual machine, open-source software, self-archiving, backed-up, multi-site configuration
(easy to move & upgrade, secure & reliable, citable, screencasts, low admin., low marginal costs)

Page 46: The Virtual Taxonomist

Scratchpad bibliometrics
Metrics of output and use

Content
• 130,000 pages
• 665 contributors

Usage

Impact
(Web equivalent to journal impact factor & personal H-index)
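The Impact box on this slide likens the intended web metric to a personal H-index. As a reminder of what that index computes, here is a minimal sketch of the standard h-index definition applied to per-page visit counts. The visit counts, and the idea of counting visits rather than citations, are assumptions for illustration; the talk does not specify how a Scratchpad impact metric would be calculated.

```python
def h_index(counts):
    """Largest h such that at least h items each have a count of h or more."""
    ranked = sorted(counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical visit counts for five pages written by one contributor.
page_visits = [10, 8, 5, 4, 3]
print(h_index(page_visits))
```

Because the index ignores how far the top items exceed h, a single very popular page moves it little, which may or may not suit content with a long tail of rarely visited pages.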

Page 47: The Virtual Taxonomist