Slides from SPNHC 2012 presentation in the Archives and Special Collections session -- titled alternately "What Henderson Saw" or "From Documents to Datasets" depending on which author you ask. See http://soyouthinkyoucandigitize.wordpress.com/category/henderson-project/ for more detail. Contact: @an_dre_a_, @mrvaidya, @robgural, @dabblepop, @pagodarose
EXTRACTING OBSERVATIONS FROM CENTURY-OLD FIELD NOTEBOOKS
What Henderson Saw
Andrea Thomer (UIUC), Gaurav Vaidya (CU-B), Robert Guralnick (CU-B), David Bloom (UC-B) & Laura Russell (KU)
or
MINING THE JUNIUS HENDERSON FIELD NOTES FOR SPECIES OCCURRENCE RECORDS
From documents to datasets
Field notes and Biodiversity science
• Field work is central to biodiversity work
• Field notes:
  • Are central to field work
  • Are typically stored in archives
  • But contain data
• Data wants to be free!
Biodiversity science and “first person precision”
• We often forget that field notes store data
• Value of field notes is in the combination of qualitative/quantitative data (Kramer, 2011)
• Grinnell: “first person precision” (1912)
• How do we free the data, while also preserving the record of its context of production?
Junius Henderson
• A typical natural history “old-timer”
  • Had a mustache
  • Wore suspenders
  • Wrote snarky comments in his field notes about young whippersnappers and trains
• Studied clams
Influential in small but lasting ways, but not well-known beyond Boulder
Henderson’s field notes
• 13 notebooks, 1 locality notebook
• 1672 pages of notes total
• Prolific collector
• Numerous photographs
• 1905: Began field work for CU Museum
• 2000-2002: Transcribed by Dr. Peter Robinson
• 2006: NSIDC scanned the Henderson notebooks
• 2011-2012: Annotation and data extraction
The Henderson Field Note Project
• We were looking for a low-tech digitization project
• Rob knew of the existence of the transcribed notes
• “What can we accomplish with five hours of work each?”
• Goals:
  • Make notes freely available
  • Try to engage volunteers on the internet
  • Produce one “neat thing” (a visualization, a map, etc.)
Challenges in making notes available
• No time!
• No resources!
• No time!
• No repository!
• No platform!
• No time!
Solutions to challenges (ver. 1)
• No sleeping!
• Use free resources!
• Guerrilla takeover of Wikisource!
• Profit!
Wikisource
• Part of the Wikimedia Foundation, as is Wikipedia
• Has its own “collections” or “accessions” policies
  • All docs from before 1923
  • Post-1922: Documentary sources, peer-reviewed scientific research, analytical & artistic works
• Support for “adding value” via transcription, translation, annotation, and more
Basic Project Steps
• Upload notebooks to Wikisource
• Match transcriptions to scans by hand
• Create templates to support annotation
• Advertise project; attract volunteers
• Write simple script to extract annotations
• Publish those via IPT installation as a DwC-A
• Sleep
Annotation Templates
• Anyone can annotate the transcribed text to tag elements
• Ex. “I saw a white-tailed jack rabbit” becomes
  “I saw a {{taxon|Lepus townsendii|white tailed jack rabbit}}.”
Annotation Templates
{{taxon|Lepus townsendii|white tailed jack rabbit}}
• taxon: type of annotation
• Lepus townsendii: Wikipedia link
• white tailed jack rabbit: verbatim text
Note: “white tailed jack rabbit” would work as the Wikipedia link as well.
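The "simple script to extract annotations" is not shown in the slides. As a minimal sketch of what such a script could look like, the Python below pulls templates out of a page's wikitext with a regular expression. The template names `place` and `date` and the one-argument form are assumptions based on the annotation types reported in the results; only `{{taxon|link|verbatim}}` appears in the slides, and the function name is hypothetical.

```python
import re

# Matches {{type|link|verbatim}} or {{type|link}}.
# Only the taxon template is shown in the slides; "place" and "date"
# are assumed names for the other annotation types.
TEMPLATE_RE = re.compile(
    r"\{\{(taxon|place|date)\|([^{}|]*)(?:\|([^{}|]*))?\}\}"
)


def extract_annotations(wikitext):
    """Return one dict per annotation template found in the text."""
    annotations = []
    for match in TEMPLATE_RE.finditer(wikitext):
        kind, link, verbatim = match.groups()
        annotations.append({
            "type": kind,
            "link": link.strip(),
            # Assumption: with a single argument, the link text is also
            # the verbatim text from the notebook.
            "verbatim": (verbatim if verbatim is not None else link).strip(),
        })
    return annotations


sample = "I saw a {{taxon|Lepus townsendii|white tailed jack rabbit}}."
print(extract_annotations(sample))
```

A regular expression is enough for flat templates like these; nested templates would need a real wikitext parser.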
Basic Project Steps
• Upload notebooks to Wikisource
• Match transcriptions to scans by hand
• Create templates to support annotation
• Advertise project; attract volunteers
• Write simple script to extract annotations
• Write complex scripts to extract annotations and compile them into occurrences
• Extensively review occurrences
• Taxonomic referencing
• Publish those via IPT installation as a DwC-A
• Sleep
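The slides do not say how annotations were compiled into occurrences. One plausible rule, sketched below purely as an assumption, is to pair every taxon annotation in a notebook entry with that entry's date and places. The entry structure, field names, and sample values are all invented for illustration.

```python
def compile_occurrences(entries):
    """Pair every taxon annotation with the date and places of its entry."""
    occurrences = []
    for entry in entries:
        by_type = {}
        for annotation in entry["annotations"]:
            by_type.setdefault(annotation["type"], []).append(annotation)
        dates = by_type.get("date", [])
        # Assumption: the first date annotation dates the whole entry.
        date = dates[0]["verbatim"] if dates else None
        places = [p["verbatim"] for p in by_type.get("place", [])]
        for taxon in by_type.get("taxon", []):
            occurrences.append({
                "page": entry["page"],
                "name": taxon["link"],
                "verbatim_name": taxon["verbatim"],
                "date": date,
                "places": places,
            })
    return occurrences


# Invented sample entry, not taken from the notebooks.
example_entry = {
    "page": "hypothetical page title",
    "annotations": [
        {"type": "date", "link": "1905-07-01", "verbatim": "July 1, 1905"},
        {"type": "place", "link": "Boulder", "verbatim": "Boulder"},
        {"type": "taxon", "link": "Lepus townsendii",
         "verbatim": "white tailed jack rabbit"},
    ],
}
print(compile_occurrences([example_entry]))
```

A rule this simple will over-assign places when an entry mentions several localities, which is one reason an extensive human review step would still be needed.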
Taxonomic Referencing
• Remember that “Wikipedia link”?
• We want to check if that is a valid taxonomic name
• How?
• Easy, right? Just check against a resolver!
• Hard! Which resolver? How to verify?
1) Check name against ITIS and EOL.
2) Possible outcomes:
   a) Both concordant! YAY!
   b) No results from both. Boo!
   c) Discordant results. Need HUMANS!
3) This was LOTS of work (thanks, Gaurav!)
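The three outcomes above can be sketched as a small classification function. This is not the project's code: the two lookups are passed in as plain callables, and the toy dictionaries stand in for ITIS and EOL responses (no real web service is called, and "Examplea" is an invented name). Treating "only one resolver answered" as discordant is also an assumption.

```python
def normalize(name):
    """Collapse whitespace and ignore case when comparing names."""
    return " ".join(name.split()).casefold()


def classify_name(name, itis_lookup, eol_lookup):
    """Classify a name as 'concordant', 'no results', or 'discordant'."""
    itis = itis_lookup(name)
    eol = eol_lookup(name)
    if itis is None and eol is None:
        return "no results"
    if itis is not None and eol is not None and normalize(itis) == normalize(eol):
        return "concordant"
    # Different answers, or an answer from only one resolver (assumption):
    # send to a human reviewer.
    return "discordant"


# Toy stand-ins for resolver responses; the values are invented.
toy_itis = {"Lepus townsendii": "Lepus townsendii",
            "Examplea alpha": "Examplea alpha"}
toy_eol = {"Lepus townsendii": "Lepus townsendii",
           "Examplea alpha": "Examplea beta"}

for candidate in ["Lepus townsendii", "Examplea alpha", "Not a name"]:
    print(candidate, "->", classify_name(candidate, toy_itis.get, toy_eol.get))
```

Passing the lookups as arguments keeps the comparison logic separate from whichever resolver services are actually queried.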
Results!
• 3 notebooks posted and fully annotated

| | Notebook 1 | Notebook 2 | Notebook 3 |
|---|---|---|---|
| Downloaded on | March 27, 2012 | March 27, 2012 | March 27, 2012 |
| Pages processed | 112 of 114 | 120 of 123 | 120 of 122 |
| Number of entries | 62 of 64 | 62 of 63 | 98 of 99 |
| Number of annotations | 632 | 703 | 1007 |
| Taxon annotations | 349 (201 unique) | 224 (125 unique) | 514 (248 unique) |
| Place annotations | 219 (115 unique) | 419 (154 unique) | 401 (139 unique) |
| Date annotations | 64 (63 unique) | 60 (59 unique) | 92 (90 unique) |
| Dates in range | July 1905 to April 1907 | May 1907 to October 1908 | January 1909 to September 1909 |
Results!... With caveats
• 3 notebooks posted and mostly (rather than fully) annotated
• 1076 occurrences extracted
• A published Darwin Core Archive!
  • Most of our project’s Skype calls were about DwC term use
• A ZooKeys paper (hopefully)
• A lot more questions…
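To make the "DwC term use" discussion concrete, here is a minimal sketch of mapping an extracted occurrence onto Darwin Core terms and writing the occurrence table that would sit inside a Darwin Core Archive. The terms used are standard Darwin Core terms, but which terms the project actually chose is not stated in the slides; the mapping, the identifier scheme, the `HumanObservation` value, and the sample record are all assumptions. A full archive would also need a `meta.xml` descriptor, which the IPT generates.

```python
import csv
import io

# Assumed selection of Darwin Core terms; the project's real choice is unknown.
DWC_FIELDS = [
    "occurrenceID", "basisOfRecord", "scientificName",
    "verbatimEventDate", "verbatimLocality", "recordedBy",
]


def to_dwc_rows(occurrences, id_prefix="henderson-example"):
    """Map simple occurrence dicts onto Darwin Core column names."""
    rows = []
    for index, occ in enumerate(occurrences, start=1):
        rows.append({
            # Hypothetical identifier scheme, for illustration only.
            "occurrenceID": f"{id_prefix}-{index}",
            # Assumption: every record is treated as a sighting.
            "basisOfRecord": "HumanObservation",
            "scientificName": occ["name"],
            "verbatimEventDate": occ["date"] or "",
            "verbatimLocality": "; ".join(occ["places"]),
            "recordedBy": "Junius Henderson",
        })
    return rows


def write_occurrence_csv(rows):
    """Serialize the rows as CSV text with a Darwin Core header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DWC_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample occurrence, not taken from the published dataset.
sample_occurrences = [
    {"name": "Lepus townsendii", "date": "July 1, 1905", "places": ["Boulder"]},
]
print(write_occurrence_csv(to_dwc_rows(sample_occurrences)))
```

Keeping the verbatim date and locality strings, rather than normalizing them at this stage, preserves what the notebook actually says and leaves interpretation to later review.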
What challenges remain?
• How do we georeference these occurrences?
• How do we maintain ties between DwC records and field notes?
• How do we assign unique identifiers to wiki tags?
• Is Wikisource the best place for this data?
Why this could work for you too:
• Wikimedia projects really are community driven
• We can all be a part of this community – if we do the work
• Your lab, archive or library has as many or more potential contributors as our project
• There are many flexible transcription platforms in addition to Wikisource
This entire project was only possible because people had been making small steps towards digitization over the last 10 years
Questions?
• References:
  • Grinnell J (1912) An Afternoon’s Field Notes. The Condor, 14(3), 104-107. Retrieved from http://www.jstor.org/stable/1362226.
  • Kramer KL (2011) The spoken and the unspoken. In M. R. Canfield (Ed.), Field Notes on Science & Nature. Cambridge, Massachusetts: Harvard University Press.
• For more about Henderson, see our blog! http://soyouthinkyoucandigitize.wordpress.com/category/henderson-project/