The Phylogeny of a Dataset Andrea K Thomer & Nicholas M. Weber Center for Informatics Research in Science and Scholarship Graduate School of Library and Information Science University of Illinois at Urbana-Champaign


The Phylogeny of a Dataset

Andrea K Thomer & Nicholas M. Weber

Center for Informatics Research in Science and Scholarship

Graduate School of Library and Information Science

University of Illinois at Urbana-Champaign


How do we understand the evolution of digital objects?

Time


How do we understand the evolution of digital objects when they are complexly interrelated?

c/o Steve Worley, NCAR


Evolution as a tree

From http://tolweb.org/tree/home.pages/aboutoverview.html


tl;dr

1) Biologists construct evolutionary trees by comparing animals’ traits and inferring how they may have evolved.

2) And there’s lots of free, open source software available for this work.


Why not datasets? (which, like organisms, also often lack explicit documentation…)

Cornets (Tëmkin & Eldredge, 2007)

“Little Red Riding Hood” (Tehrani, 2013)

Non-biological evolution


A phylogenetic approach helps us:

• Study evolution of digital objects more rigorously

• Model how digital objects are reworked into new “species”

• Understand what properties of a digital object must be preserved or expressed to facilitate modeling

• We ask: In a digital object, what properties lead to evolutionary fitness?


Dataset of datasets: COADS, ICOADS and its derivatives

• (I)COADS = (International) Comprehensive Ocean-Atmosphere Data Set

• Community project bringing together 1000s of marine surface measurements from buoys, ships’ logs, more

– First release: 1987

– New releases as new datasets are added; now at release 2.5

• Enormously modified & reused by others in climate science


Towards a more rigorous view of the evolutionary process: anagenesis and phylogenesis

• ICOADS documentation largely describes anagenesis (versioning)

• GCMD* = 1 of many potential sources of data on phylogenesis (branching)

– Found 99 metadata records for versions/derivatives of ICOADS (“specimens”) through keyword search

– Metadata includes scientific parameters, geographic scope, instruments used, more

*known problems in metadata quality, but value in GCMD is breadth rather than depth
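As a rough illustration of how one downloaded record might be reduced to fields and terms, here is a minimal Python sketch. The tag names (`record`, `id`, `parameter`, `location`, `instrument`) and the sample values are hypothetical stand-ins, not the actual GCMD schema or any real record.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: tag names and values are illustrative only,
# not the real GCMD metadata schema.
SAMPLE_RECORD = """
<record>
  <id>EXAMPLE_SPECIMEN_1</id>
  <parameter>SEA SURFACE TEMPERATURE</parameter>
  <parameter>SURFACE WINDS</parameter>
  <location>GLOBAL OCEAN</location>
  <instrument>SHIPS</instrument>
  <instrument>BUOYS</instrument>
</record>
"""


def parse_record(xml_text):
    """Return (record id, {field name: sorted list of terms}) for one record."""
    root = ET.fromstring(xml_text.strip())
    record_id = None
    fields = {}
    for child in root:
        text = (child.text or "").strip()
        if child.tag == "id":
            record_id = text
        elif text:
            # A field may repeat (e.g. several parameters), so collect a list.
            fields.setdefault(child.tag, []).append(text)
    return record_id, {name: sorted(terms) for name, terms in fields.items()}


record_id, fields = parse_record(SAMPLE_RECORD)
```

Each parsed record then plays the role of one “specimen” whose fields can be scored later.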


Workflow

1) Download XML records

2) Create character matrix

3) Create a NEXUS file

4) Assess the tree!
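The “Create a NEXUS file” step can be sketched as follows. This is a minimal, assumed layout of a NEXUS DATA block for standard (non-molecular) characters, not the file the authors actually produced; the function name `to_nexus` and the specimen names are hypothetical.

```python
def to_nexus(matrix, symbols="01"):
    """Render {specimen name: string of character states} as a NEXUS DATA block.

    Assumes every row has the same length and that names contain no
    whitespace (NEXUS would otherwise require quoting).
    """
    names = sorted(matrix)
    nchar = len(matrix[names[0]])
    if any(len(matrix[name]) != nchar for name in names):
        raise ValueError("all rows must have the same number of characters")
    width = max(len(name) for name in names)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(names)} NCHAR={nchar};",
        f'  FORMAT DATATYPE=STANDARD MISSING=? GAP=- SYMBOLS="{symbols}";',
        "  MATRIX",
    ]
    for name in names:
        # Pad names so the state strings line up in a column.
        lines.append(f"    {name.ljust(width)}  {matrix[name]}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


# Made-up two-specimen matrix; "?" marks a missing value.
nexus_text = to_nexus({"spec_A": "0110", "spec_B": "01?1"})
```

The resulting text can be saved with a `.nex` extension and opened in phylogenetics software that reads NEXUS.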


Identifying “characters”

• In phylogenetics: characters are morphological features, DNA, other measurable qualities

• In ICOADS datasets: we treated each metadata field as a character, and each term as a character state
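To make the field-as-character, term-as-state idea concrete, here is a small Python sketch that codes one single-valued metadata field as a multistate character. The field name `data_center`, the specimen names, and the values are invented for illustration; the authors’ actual coding scheme may differ.

```python
STATE_SYMBOLS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def code_multistate(records, field):
    """Code one metadata field as a multistate character.

    Each distinct term becomes one state symbol; a specimen with no
    value for the field is scored "?" (missing).
    """
    terms = sorted({rec[field] for rec in records.values() if rec.get(field)})
    if len(terms) > len(STATE_SYMBOLS):
        raise ValueError("too many distinct terms for one character")
    state_of = {term: STATE_SYMBOLS[i] for i, term in enumerate(terms)}
    column = {
        name: state_of.get(rec.get(field), "?") for name, rec in records.items()
    }
    return column, state_of


# Made-up specimens for illustration only.
records = {
    "spec_A": {"data_center": "CENTER_X"},
    "spec_B": {"data_center": "CENTER_Y"},
    "spec_C": {"data_center": "CENTER_X"},
    "spec_D": {},
}
column, state_of = code_multistate(records, "data_center")
```

Here `spec_A` and `spec_C` share a state, `spec_B` gets a different one, and `spec_D` is scored as missing.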


Dates, times, and resolution are “binned” into categories

Parameters are split into individual categories, and presence/absence is noted in binary
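A minimal Python sketch of these two coding steps follows. The bin edges (1900 and 1950) are arbitrary assumptions chosen for illustration, and the parameter names and specimen names are made up; neither reflects the bins or terms used in the actual analysis.

```python
def bin_start_year(year):
    """Bin a coverage start year into a coarse category.

    The cut points are assumptions for this sketch, not the authors' bins.
    """
    if year is None:
        return "?"  # missing value
    if year < 1900:
        return "0"
    if year < 1950:
        return "1"
    return "2"


def presence_absence(records):
    """Split parameter sets into one binary character per parameter.

    `records` maps specimen name -> set of parameter terms. Returns the
    ordered list of parameters and a row of "0"/"1" states per specimen.
    """
    all_params = sorted({p for params in records.values() for p in params})
    rows = {
        name: "".join("1" if p in params else "0" for p in all_params)
        for name, params in records.items()
    }
    return all_params, rows


# Made-up specimens for illustration only.
params, rows = presence_absence({
    "spec_A": {"SST", "WIND"},
    "spec_B": {"SST"},
})
```

Each position in a row corresponds to the parameter at the same index in `params`, so rows can be concatenated with other coded characters to form the full matrix.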


https://github.com/akthom/phylomemetics


Method

• Software: PAUP* (Phylogenetic Analysis Using Parsimony *and other methods)

• Maximum Likelihood algorithm (we can talk about that more if people are interested).

• Result:


Phylogeny of ICOADS datasets

• Each fork = a “speciation event”

• Each group joined at a node = a “clade”

– We annotated primary clades
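To show what “each group joined at a node” means in practice, the sketch below reads a tree written in Newick notation and lists the set of leaves under every fork. The parser handles only the simplest case (leaf names, no branch lengths or internal labels), and the example tree with specimens A to E is invented, not the ICOADS result.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested tuples of leaf names.

    Supports names and parentheses only (no branch lengths or labels).
    """
    text = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            children = [node()]
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(node())
            pos += 1  # skip ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return node()


def clades(tree):
    """Return one frozenset of leaf names per internal node (fork)."""
    found = []

    def leaves(n):
        if isinstance(n, str):
            return frozenset([n])
        members = frozenset().union(*(leaves(child) for child in n))
        found.append(members)
        return members

    leaves(tree)
    return found


# Made-up tree: each pair of parentheses is one fork.
tree = parse_newick("((A,B),(C,(D,E)));")
groups = clades(tree)
```

In this toy tree there are four forks, so `groups` holds four clades, the largest being the whole tree.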


Related datasets cluster; some clades show up as derived from “ancestral” forms

– Clade 1: original COADS datasets

– Clade 2: ICOADS input datasets

– Clade 3: Sea surface flux calculations

– Clade 4: later COADS data products

– Clade 5: COADS derivatives


Why does it matter that digital objects evolve? Or how?

• Digital preservation implications

– A way to understand the history and contents of a collection

– Could be used to browse repositories?

– Could be used to complement citation analysis?

• Offers a lens into cooperative processes that create objects

– A way to “read” interplay of different scientific cultures


Challenges and areas for future work

• What existing statistical models of evolution are most appropriate for this? Or do we need to develop a new one?

• How can existing software be modified for this work?

• How do we show reticulating relationships?


Future work: Phylogenies showing hybridization & ‘spontaneous generation’


Future work: what makes a dataset “fit”?

• Part of ICOADS’s success and proliferation is surely due to low levels of “competition”

– But is some of it due to its open availability?

– How do we test the effects of openness on a dataset’s fitness-for-purpose?


Acknowledgements

• Thanks to Julie Allen, Peter Fox and Steve Worley for feedback, and our reviewers for excellent comments.

• Thanks to CIRSS and the DCERC program for funding


References & Additional Reading

• Datasets mentioned in this talk: https://github.com/akthom/phylomemetics

• Howe, C. J., & Windram, H. F. (2011). Phylomemetics--evolutionary analysis beyond the gene. PLoS Biology, 9(5), e1001069. doi:10.1371/journal.pbio.1001069

• O’Brien, M. J., Darwent, J., & Lyman, R. L. (2001). Cladistics Is Useful for Reconstructing Archaeological Phylogenies: Palaeoindian Points from the Southeastern United States. Journal of Archaeological Science, 28(10), 1115–1136. doi:10.1006/jasc.2001.0681

• Tehrani, J. J. (2013). The Phylogeny of Little Red Riding Hood. PLoS ONE, 8(11), e78871. doi:10.1371/journal.pone.0078871

• Tëmkin, I., & Eldredge, N. (2007). Phylogenetics and Material Cultural Evolution. Current Anthropology, 48(1), 146–154.


Homology
