EXTRACTING OBSERVATIONS FROM CENTURY-OLD FIELD NOTEBOOKS: What Henderson Saw
Andrea Thomer (UIUC), Gaurav Vaidya (CU-B), Robert Guralnick (CU-B), David Bloom (UC-B) & Laura Russell (KU)

From documents to datasets -- mining the Junius Henderson Field Notes for species occurrence records

DESCRIPTION

Slides from SPNHC 2012 presentation in the Archives and Special Collections session -- titled alternately "What Henderson Saw" or "From Documents to Datasets" depending on which author you ask. See http://soyouthinkyoucandigitize.wordpress.com/category/henderson-project/ for more detail. Contact: @an_dre_a_, @mrvaidya, @robgural, @dabblepop, @pagodarose

Page 1: From documents to datasets -- mining the Junius Henderson Field Notes for species occurrence records

EXTRACTING OBSERVATIONS FROM CENTURY-OLD FIELD NOTEBOOKS

What Henderson Saw

Andrea Thomer (UIUC), Gaurav Vaidya (CU-B), Robert Guralnick (CU-B), David Bloom (UC-B) & Laura Russell (KU)

Page 2

or

Page 3

MINING THE JUNIUS HENDERSON FIELD NOTES FOR SPECIES OCCURRENCE RECORDS

From documents to datasets

Andrea Thomer (UIUC), Gaurav Vaidya (CU-B), Robert Guralnick (CU-B), David Bloom (UC-B) & Laura Russell (KU)

Page 4

Field notes and Biodiversity science

• Field work is central to biodiversity work
• Field notes:
  • Are central to field work
  • Are typically stored in archives
  • But contain data
• Data wants to be free!

Page 5

Biodiversity science and “first person precision”

• We often forget that field notes store data

• Value of field notes is in the combination of qualitative/quantitative data (Kramer, 2011)

• Grinnell: “first person precision” (1912)

• How do we free the data, while also preserving the record of its context of production?

Page 6

Junius Henderson

• A typical natural history “old-timer”
  • Had a mustache
  • Wore suspenders
  • Wrote snarky comments in his field notes about young whippersnappers and trains
• Studied clams

Page 7

Influential in small but lasting ways, but not well-known beyond Boulder

Page 8

Henderson’s field notes

• 13 notebooks, 1 locality notebook
• 1672 pages of notes total
• Prolific collector
• Numerous photographs
• 1905: Began field work for CU Museum
• 2000-2002: Transcribed by Dr. Peter Robinson
• 2006: NSIDC scanned the Henderson notebooks
• 2011-2012: Annotation and data extraction

Page 9

The Henderson Field Note Project

• We were looking for a low-tech digitization project
• Rob knew of the existence of the transcribed notes
• “What can we accomplish with five hours of work each?”
• Goals:
  • Make notes freely available
  • Try to engage volunteers on the internet
  • Produce one “neat thing” (a visualization, a map, etc.)

Page 10

Challenges in making notes available

• No time!
• No resources!
• No time!
• No repository!
• No platform!
• No time!

Page 11

Solutions to challenges (ver. 1)

• No sleeping!
• Use free resources!
• Guerrilla takeover of Wikisource!
• Profit!

Page 12

Wikisource

• Part of the Wikimedia Foundation, as is Wikipedia
• Has its own “collections” or “accessions” policies
  • All docs from before 1923
  • Post-1922: Documentary sources, peer-reviewed scientific research, analytical & artistic works
• Support for “adding value” via transcription, translation, annotation, and more

Page 13

Basic Project Steps

• Upload notebooks to Wikisource
• Match transcriptions to scans by hand
• Create templates to support annotation
• Advertise project; attract volunteers
• Write simple script to extract annotations
• Publish those via IPT installation as a DwC-A
• Sleep

Page 17

Annotation Templates

• Anyone can annotate the transcribed text to tag elements
• Ex. “I saw a white-tailed jack rabbit” becomes:

“I saw a {{taxon|Lepus townsendii|white tailed jack rabbit}}.”

Page 18

Annotation Templates

{{taxon|Lepus townsendii|white tailed jack rabbit}}

• taxon: type of annotation
• Lepus townsendii: Wikipedia link
• white tailed jack rabbit: verbatim text

Note: “white tailed jack rabbit” would work as the Wikipedia link as well.
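
A minimal sketch of how a script might pull these templates back out of the wikitext. The regular expression and the function names are illustrative assumptions, not the project's actual extraction script; the only input taken from the slides is the jack rabbit example.

```python
import re

# Matches templates shaped like {{type|link|verbatim text}}.
# Treating the third field as optional is an assumption of this sketch.
ANNOTATION = re.compile(r"\{\{(\w+)\|([^|{}]+)(?:\|([^|{}]*))?\}\}")


def extract_annotations(wikitext):
    """Return (annotation_type, wikipedia_link, verbatim_text) tuples."""
    results = []
    for match in ANNOTATION.finditer(wikitext):
        kind, link, verbatim = match.groups()
        # Fall back to the link when no verbatim text was supplied.
        results.append((kind, link.strip(), (verbatim or link).strip()))
    return results


def strip_annotations(wikitext):
    """Replace each template with its verbatim text to recover the transcription."""
    return ANNOTATION.sub(lambda m: (m.group(3) or m.group(2)).strip(), wikitext)


line = "I saw a {{taxon|Lepus townsendii|white tailed jack rabbit}}."
print(extract_annotations(line))
print(strip_annotations(line))
```

Keeping the verbatim text alongside the link means the original wording of the notes can always be recovered from the annotated page.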

Page 22

Basic Project Steps

• Upload notebooks to Wikisource
• Match transcriptions to scans by hand
• Create templates to support annotation
• Advertise project; attract volunteers
• Write simple script to extract annotations
• Write complex scripts to extract annotations and compile them into occurrences
• Extensively review occurrences
• Taxonomic referencing
• Publish those via IPT installation as a DwC-A
• Sleep
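
One way the "compile them into occurrences" step could look, as a rough sketch only. It assumes `date` and `place` templates shaped like the `taxon` template, and it assumes each taxon annotation inherits the most recent date and place in the same entry; the real scripts may have used different rules. The column names are Darwin Core terms, and the sample entry is made up for illustration.

```python
import csv
import io
import re

ANNOTATION = re.compile(r"\{\{(\w+)\|([^|{}]+)(?:\|([^|{}]*))?\}\}")

# Darwin Core terms used as column names in this sketch.
FIELDS = ["scientificName", "occurrenceRemarks", "verbatimLocality", "verbatimEventDate"]


def compile_occurrences(entry_text):
    """Turn one annotated notebook entry into a list of occurrence dicts."""
    date = place = ""
    occurrences = []
    for kind, link, verbatim in ANNOTATION.findall(entry_text):
        text = verbatim or link
        if kind == "date":
            date = text
        elif kind == "place":
            place = text
        elif kind == "taxon":
            # Assumption: a sighting belongs to the latest date and place seen so far.
            occurrences.append({
                "scientificName": link,
                "occurrenceRemarks": text,
                "verbatimLocality": place,
                "verbatimEventDate": date,
            })
    return occurrences


def to_csv(occurrences):
    """Serialize occurrences as a CSV table with Darwin Core column names."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(occurrences)
    return buffer.getvalue()


# Hypothetical entry, not a quotation from the notebooks.
entry = ("{{date|1905-07-15|July 15, 1905}}. Near {{place|Boulder, Colorado|Boulder}} "
         "I saw a {{taxon|Lepus townsendii|white tailed jack rabbit}}.")
records = compile_occurrences(entry)
print(to_csv(records))
```

A table like this is only the data file; a Darwin Core Archive also carries a descriptor file mapping columns to terms, which is the part the IPT takes care of.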

Page 24

Taxonomic Referencing

• Remember that “Wikipedia link”?
• We want to check if that is a valid taxonomic name
• How?
• Easy, right? Just check against a resolver!
• Hard! Which resolver? How to verify?

1) Check name against ITIS and EOL.
2) Possible outcomes:
  a) Both concordant! YAY!
  b) No results from either. Boo!
  c) Discordant results. Need HUMANS!
3) This was LOTS of work (thanks, Gaurav!)
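
The three outcomes can be sketched as a small classifier. The two lookup tables below stand in for ITIS and EOL responses and are entirely hypothetical (including the placeholder name "Genus oldname"); no real resolver API is called. Treating a name found by only one resolver as discordant is also an assumption of this sketch.

```python
def classify(name, itis_lookup, eol_lookup):
    """Compare what two resolvers return for one name."""
    itis_match = itis_lookup.get(name)
    eol_match = eol_lookup.get(name)
    if itis_match is None and eol_match is None:
        return "no results"
    if itis_match == eol_match:
        return "concordant"
    # Covers differing answers and names found by only one resolver.
    return "discordant"


def names_needing_humans(names, itis_lookup, eol_lookup):
    """List the names a person has to check by hand."""
    return [n for n in names if classify(n, itis_lookup, eol_lookup) == "discordant"]


# Hypothetical resolver answers: submitted name -> accepted name.
itis = {"Lepus townsendii": "Lepus townsendii", "Genus oldname": "Genus newname"}
eol = {"Lepus townsendii": "Lepus townsendii", "Genus oldname": "Genus oldname"}

names = ["Lepus townsendii", "Genus oldname", "white tailed jack rabbit"]
for name in names:
    print(name, "->", classify(name, itis, eol))
print(names_needing_humans(names, itis, eol))
```

Sorting names into these buckets is the cheap part; the discordant pile is where the human hours go.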

Page 26

Results!

• 3 Notebooks posted and fully annotated

                        Notebook 1                Notebook 2                 Notebook 3
Downloaded on           March 27, 2012            March 27, 2012             March 27, 2012
Pages processed         112 of 114                120 of 123                 120 of 122
Number of entries       62 of 64                  62 of 63                   98 of 99
Number of annotations   632                       703                        1007
Taxon annotations       349 (201 unique)          224 (125 unique)           514 (248 unique)
Place annotations       219 (115 unique)          419 (154 unique)           401 (139 unique)
Date annotations        64 (63 unique)            60 (59 unique)             92 (90 unique)
Dates in range          July 1905 to April 1907   May 1907 to October 1908   January 1909 to September 1909
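
Counts like "349 (201 unique)" could be tallied along these lines. This is a sketch under two assumptions: that "unique" means distinct link targets, and that `place` templates share the shape of the `taxon` template. The sample text is invented for illustration and does not come from the notebooks.

```python
import re
from collections import defaultdict

ANNOTATION = re.compile(r"\{\{(\w+)\|([^|{}]+)(?:\|([^|{}]*))?\}\}")


def tally(wikitext):
    """Return {annotation_type: (total_count, unique_link_count)}."""
    totals = defaultdict(int)
    unique_links = defaultdict(set)
    for kind, link, _verbatim in ANNOTATION.findall(wikitext):
        totals[kind] += 1
        unique_links[kind].add(link.strip())
    return {kind: (totals[kind], len(unique_links[kind])) for kind in totals}


# Hypothetical text: the same taxon tagged twice with different wording.
sample = ("Saw a {{taxon|Lepus townsendii|white tailed jack rabbit}} near "
          "{{place|Boulder, Colorado|Boulder}}; another "
          "{{taxon|Lepus townsendii|jack rabbit}} later.")
print(tally(sample))
```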

Page 27

Results!... With caveats

• 3 Notebooks posted and mostly annotated
• 1076 occurrences extracted
• A published Darwin Core Archive!
  • Most of our project’s Skype calls were about DwC term use
• A ZooKeys paper (hopefully)
• A lot more questions…

Page 28

What challenges remain?

• How do we georeference these occurrences?

• How do we maintain ties between DwC records and field notes?

• How do we assign unique identifiers to wiki tags?

• Is Wikisource the best place for this data?

Page 32

Why this could work for you too:

• Wikimedia projects really are community driven
• We can all be a part of this community, if we do the work
• Your lab, archive or library has as many or more potential contributors as our project
• There are many flexible transcription platforms in addition to Wikipedia

Page 33

This entire project was only possible because people had been making small steps towards digitization over the last 10 years.

Page 34

Questions?

• References:
  • Grinnell J (1912) An Afternoon’s Field Notes. The Condor, 14(3), 104-107. Retrieved from http://www.jstor.org/stable/1362226.
  • Kramer KL (2011) The spoken and the unspoken. In M. R. Canfield (Ed.), Field Notes on Science & Nature. Cambridge, Massachusetts: Harvard University Press.
• For more about Henderson, see our blog! http://soyouthinkyoucandigitize.wordpress.com/category/henderson-project/