Data as research output; Data as part of the scholarly record Todd Vision University of North Carolina at Chapel Hill Dryad Digital Repository SciELO15 24 October 2013 São Paulo


DESCRIPTION

Talk given at SciELO15, 24 October 2013, São Paulo, Brazil. Video followed by slides.


Page 1: Data as research output, data as part of the scholarly record

Data as research output; Data as part of the scholarly record

Todd Vision University of North Carolina at Chapel Hill Dryad Digital Repository

SciELO15, 24 October 2013, São Paulo

Page 2: Data as research output, data as part of the scholarly record

CC-BY-NC-SA nic221 http://www.flickr.com/photos/nic221/391536867/

Page 3: Data as research output, data as part of the scholarly record

Source: IFEX http://www.ifex.org/united_states/2013/09/05/cipa_libraries/

Page 4: Data as research output, data as part of the scholarly record
Page 5: Data as research output, data as part of the scholarly record

2011 BURLEIGH ET AL.—INFERRING THE PLANT TREE OF LIFE FROM GENE TREES 121

TABLE 1. Summary of supertree bootstrap support from the GTP analysis

                                   Clades With Bootstrap Support (%)
                  Number of Taxa   100     ≥90     ≥70     ≥50
136-Taxon Cons.   136              9.8     30.8    56.4    74.4
Reduced Cons.     82               50.6    70.9    89.9    98.7
82-Taxon Cons.    82               53.1    72.2    84.8    96.2

Notes: This displays the percentage of total clades at or above a given level of bootstrap support for 1) the majority rule consensus of all bootstrap trees from the NJ/PP analysis of 136 taxa (136-Taxon Cons.), 2) the reduced consensus of all bootstrap trees for the 82 taxa present in at least 1300 of the gene trees (Reduced Cons.), and 3) the majority rule consensus of all bootstrap trees from the NJ/PP analysis of the same 82 taxa as above (82-Taxon Cons.).

plants (99% support), gymnosperms (100% support), angiosperms (99% support), eudicots (99% support), core eudicots (99% support), and asterids (100% support; Fig. 3). Within gymnosperms, Gnetales were sister to the conifers (100% support; Fig. 3). Amborella was sister to all other angiosperms, and Nuphar (Nymphaeales) was sister to all angiosperms except Amborella (Fig. 3). Magnoliids were sister to a monocot + eudicot clade (Fig. 3). Within monocots, the Poaceae (grass family) had 100% support, and within the grasses, the Panicoideae clade had 100% bootstrap support (Fig. 3). In the core eudicot clade, the Caryophyllales (100% support) were sister to the rosids (99% support) and the asterids (100% support) (Fig. 3).

There were several differences in the species tree obtained using ML gene trees versus NJ/PP gene trees. For example, the relationships among eurosid lineages differed slightly; however, in both analyses, Malpighiales (eurosid I) were nested in a clade with eurosid II taxa (Figs. 1 and 3). The BEP-clade (Bambusoideae, Ehrhartoideae, and Pooideae) was not supported in the analysis using NJ/PP gene trees, but it was when using ML gene trees (Fig. 3). Acorus americanus was not placed with other monocots in the NJ/PP analysis, but it was in a monocot clade when using ML gene trees (Fig. 3).

FIGURE 2. Average quartet similarity for each taxon among bootstrap trees. Each point in the graph represents a single taxon. The x-axis shows the number of gene family trees that have data from the taxon. The y-axis shows the average percentage frequency of quartets (four taxon statements) containing the taxon that are identical between two bootstrap trees. The shaded area in the graph contains all taxa that are present in fewer than 1300 gene trees.

DISCUSSION

Frequent gene and whole-genome duplications have, in the past, limited the use of nuclear genes for deep-level phylogenetic analyses in plants and other clades with highly duplicated genomes. GTP provides a way to exploit the phylogenetic information inherent not only in the relationships among orthologous genes but also the rare gene duplications that produce paralogous gene family members. Rather than treating gene tree discordance as a nuisance, it seeks the species tree that provides the best reconciliation among the many discordant gene trees.

In this study, we used GTP to find species trees that minimize the total number of duplications across a collection of nearly 18,896 plant gene trees. The sequence sampling includes extensive collections of existing EST data that have rarely before been used for plant phylogenetics (but see de la Torre et al. 2006; Sanderson and McMahon 2007; de la Torre-Barcena et al. 2009). Thus, this study provides a new nuclear genomic perspective on the plant tree of life.

Overall, the phylogenetic relationships inferred from gene duplications are largely consistent with previous large-scale molecular studies of plant phylogeny (e.g., Soltis et al. 2000; Hilu et al. 2003; Jansen et al. 2007). Yet the GTP analysis also provides support for some relationships that are unresolved or conflicting in previous analyses. For example, the results support the placement of magnoliids sister to monocots + eudicots, making eudicots (possibly with Ceratophyllum, which was not included in this study) sister to monocots (Figs. 1 and 3). The relationships among these major clades are unclear from analyses using few genes (e.g., Soltis et al. 2000, 2007; Hilu et al. 2003), but our result is consistent with recent analyses using 81 plastid genes (Jansen et al. 2007). The placement of Malpighiales within a eurosid I clade (Figs. 1 and 3) generally conflicts with previous large-scale angiosperm analyses (e.g., Soltis et al. 2000; Hilu et al. 2003; Jansen et al. 2007). Given the novelty of the result, it should be interpreted with great caution.

Our results indicate that data from many gene trees may be required to produce a well-supported phylogeny using GTP (Table 1; Figs. 2 and 3), suggesting that GTP may not use data as efficiently as more traditional phylogenetic analyses of concatenated multigene data sets. For example, in plants, recent analyses of up to 83 plastid genes have apparently resolved enigmatic relationships in the backbone angiosperm phylogeny, whereas our analyses appear to require data from >1000 genes (Jansen et al. 2007; Moore et al. 2007, 2010). Like


or sum

$$\mathrm{score}_{\mathrm{int3},i} = \sum_{j=0}^{n} z_{ij}$$

of the transformed scores for gene i. The fourth method, referred to as int4, differs from the other three by considering both the score of a gene within a data source as well as the number of genes returned for that data source. First, a transformed score $s_{ij}$ is obtained.

$$s_{ij} = \frac{r_{ij}}{\sum_{i=0}^{n} r_{ij}}$$

The transformed gene scores are then summed together to provide a final score for each gene.

$$\mathrm{score}_{\mathrm{int4},i} = \sum_{j=1}^{J} s_{ij}\,\frac{g_j}{G}$$

where $g_j$ is the number of genes returned for source j and

$$G = \sum_{j=1}^{J} g_j$$
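As a rough illustration of the int4 description above, the following Python sketch normalises each gene's raw score within a data source, weights it by that source's share of all returned genes, and sums across sources. It is not the CAESAR implementation (which the excerpt says was written in Perl and Java); the function name `int4_scores`, the nested-dictionary input, and the reading of the weight as g_j / G are assumptions made for this example.

```python
def int4_scores(raw_scores):
    """Combine per-source gene scores in the style of the int4 description.

    raw_scores is a hypothetical structure: {source: {gene: raw score r_ij}}.
    """
    # g_j: number of genes returned by source j; G: their sum over all sources
    genes_per_source = {src: len(genes) for src, genes in raw_scores.items()}
    total_genes = sum(genes_per_source.values())

    final = {}
    for src, genes in raw_scores.items():
        source_total = sum(genes.values())
        if source_total == 0:
            continue  # nothing to normalise for this source
        weight = genes_per_source[src] / total_genes  # assumed g_j / G
        for gene, r in genes.items():
            s = r / source_total  # transformed score s_ij
            final[gene] = final.get(gene, 0.0) + s * weight
    return final


# Toy input with made-up gene and source names
example = {"sourceA": {"gene1": 1.0, "gene2": 3.0}, "sourceB": {"gene1": 2.0}}
print(int4_scores(example))
```

Under this reading, a source that returns many genes contributes more to the final score than one that returns few, which is one way to account for "the number of genes returned for that data source".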

2.4 Implementation

The CAESAR algorithms were written using Perl version 5.8.1 and Java version 1.4.2. The vector space similarity searches were performed using a modified version of the Perl module Search::VectorSpace by Maciej Ceglowski (http://www.perl.com/pub/a/2003/02/19/engine.html). Databases and ontology schemas were downloaded and parsed into XML under a custom XML schema. Intermediate text and data-mining results were also stored as XML under the same schema.

2.5 Selection of the tests for complex traits

To assess the ability of CAESAR to choose valid candidates, 18 test genes were selected from recently published reports providing strong evidence of statistical association with known complex human disorders. The test genes included CTLA4 (Ueda et al., 2003), PTPN22 (Bottini et al., 2004), PTPN22 (Begovich et al., 2004), SUMO4 (Guo et al., 2004), FCRL3 (Kochi et al., 2005), ENTH (Pimm et al., 2005), EN2 (Gharani et al., 2004), TCF7L2 (Grant et al., 2006), CFH (Klein et al., 2005), LOC387715 (Rivera et al., 2005), LTA4H (Helgadottir et al., 2006), C2 (Gold et al., 2006), CFB (Gold et al., 2006), NPSR1 (Laitinen et al., 2004), MYO9B (Monsuur et al., 2005), IL2RA (Vella et al., 2005), SEMA5A (Maraganore et al., 2005) and LOC439999 (Grupe et al., 2006).

Each disorder required a custom corpus, either an OMIM record or one or more review articles describing the biology of the disorder (Table 2). Review articles were selected by searching PubMed (Wheeler et al., 2006) for articles published before the year of discovery of each gene association. Where multiple suitable review articles were available, the texts were concatenated to produce the corpus. We removed any direct reference to the testing gene in the input text. In addition, entries in the GAD containing the test genes were removed. Thus, the input data closely mimicked the state of knowledge prior

Table 2. Tests using susceptibility genes for complex human traits

                                                                           Reviews                          OMIM
Complex trait                     OMIM    Review(s)^a           Gene^b     Rank   Total    Percent  Enrich  Rank   Total    Percent  Enrich
Age-related macular degeneration  603075  15094132; 15350892    CFH        7263   13771    47.3     2       10450  12608    17.1     1
                                                                LOC387715  –      13771    –        –       –      12608    –        –
ARMD (second run)                 603075  N/A^c                 C2         –      –        –        –       766    12875    94.1     17
                                                                CFB        –      –        –        –       44     12875    99.7     293
Alzheimer’s disease               104300  15225164              LOC439999  –      13550    –        –       –      13709    –        –
Asthma                            600807  12810182; 14551038    NPSR1      1117   13881    92.0     12      2835   13120    78.4     5
Autism                            209850  11733747; 12142938    EN2        98     13610    99.3     139     98     13213    99.2     135
Celiac disease                    212750  12907013; 12699968;   MYO9B      234    13039    98.2     56      168    12703    98.7     76
                                          14592529
Myocardial infarction             608446  15861005; 16041318    LTA4H      122    14043    99.1     115     –^d    –        –        –
Parkinson’s disease               168600  16026116; 16278972    SEMA5A     4548   13477    66.2     3       879    13329    93.4     15
Rheumatoid arthritis              180300  15478157; 12915205    PTPN22     333    13279    97.5     40      2156   13038    83.5     6
                                                                FCRL3      3743   13279    71.8     3       2230   13038    82.9     6
Schizophrenia                     181500  15340352; 16033310    ENTH       10013  14603    31.4     1       8065   13572    40.6     2
Type 1 diabetes mellitus          222100  12270944; 11921414    SUMO4      12123  14272    15.1     1       7675   13130    41.5     2
                                          11237226; 11899083    PTPN22     165    14272    98.8     86      833    13130    93.7     16
                                                                IL2RA      130    14272    99.1     110     528    13130    96.0     25
                                                                CTLA4      78     14272    99.5     183     324    13130    97.5     40
Type 2 diabetes mellitus          125853  15662000; 15662001;   TCF7L2     2911   13922    79.1     5       4013   13586    70.5     3
                                          15662002; 15662003
Totals                                                                     725^e  13826^e  94.7^e   54^f    879^e  13130^e  93.4^e   43^f

^a PubMed ID(s) of review articles used in corpus.
^b For references see Methods section. HUGO approved gene symbols used to identify genes.
^c No suitable review corpus available (see Methods section).
^d The OMIM record is insufficiently detailed and was not used.
^e Median result.
^f Mean result.

A computational system to select candidate genes for complex human traits

1135

Substitution Patterns in RRE

We examined 3 possible explanations for the surprising result that $\kappa_p < \kappa_u$ in RRE. First, because both the RRE and CRE secondary structures occur within coding regions, we examined the possibility that the difference between $\kappa_p$ and $\kappa_u$ is diminished by selection on the protein sequence. We recalculated $\kappa_p$ and $\kappa_u$ for both molecules using only data from 4-fold degenerate sites in paired and unpaired regions. In CRE, the presence of codons affects the estimates in the predicted direction (4-fold degenerate sites: $\kappa_p = \kappa_u^{2.89}$; all sites: $\kappa_p = \kappa_u^{1.45}$), though the 4-fold sites overshoot the predicted pattern. We had less power to compare 4-fold degenerate sites at the paired and unpaired sites of RRE because there were too few 4-fold degenerate unpaired sites, and there was insufficient sequence variability at these sites. However, the 4-fold degenerate paired sites did show a higher $\kappa_p$ ($\kappa_p = 7.61$ with 95% CI [4.79–18.48]) than the paired sites as a whole ($\kappa_p = 4.21$ with 95% CI [3.51–5.28]). This suggests that the presence of protein-coding constraints does impede compensatory evolution at paired sites in RNA secondary structures, although it does not explain why $\kappa_u$ would be "greater" than $\kappa_p$ in RRE.

Second, we examined the possibility that we had used a nonrepresentative sample of RRE sequences. To confirm that the observed substitution patterns in RRE were not specific to the particular set of HIV sequences we examined (which were all derived from subtype B), we estimated $\kappa_p$ and $\kappa_u$ from 2 additional RRE alignments of sequences drawn from higher taxonomic levels: sequences from different subtypes (1 sequence each from A, B, C, F, G, H, J, and K) and sequences from different groups (1–2 sequences each from M, N, and O) of HIV. In both these alignments, the results were qualitatively similar to those for subtype B: $\kappa_u$ was significantly higher than $\kappa_p$ (table 4).

Third, we considered whether the RRE estimates were disproportionately influenced by a portion of the molecule that experiences a type of selection that differs from the molecule as a whole. We systematically removed each stem-loop of RRE and reestimated $\kappa_p$ and $\kappa_u$ for the resulting partial structures. The $\kappa_p$ and $\kappa_u$ estimates were qualitatively similar for all these partial structures (table 5).

FIG. 2. Best-fit nucleotide substitution models for each alignment. Shown is a cartoon illustration of the rate categories of the best-fit nucleotide substitution models for each molecule. Within a molecule, rates were scaled to the maximum rate (black). Diagonal lines depict transitions; the edges of the square depict transversions. The HKY85 model, which was used for the rate ratios reported throughout this article, is shown for comparison on the right.

FIG. 3. Transition–transversion rate ratios ($\kappa$) for each alignment. The dotted line represents a 1:1 relationship between $\kappa_p$ and $\kappa_u$. The solid line represents the predicted relationship $\kappa_p = \kappa_u^{2}$. Note that the CRE data point is from the analysis of 4-fold degenerate sites in paired and unpaired regions.

Table 3
Transition–Transversion Rate Ratios ($\kappa_p$)

Structure   $\kappa$    $\kappa_p$  $\kappa_u$  $\lambda$
RRE         5.19        4.21        9.01        546.05^a
IRES        6.50        15.34       3.60        73.46^a
CRE         12.52       22.36       2.93        177.32^a
5S rRNA     3.70        4.44        2.82        35.05^a
16S rRNA    3.24        3.79        2.02        665.64^a
23S rRNA    2.57        3.06        1.71        1281.71^a
A tRNA      6.04        9.48        3.30        204.73^a
M tRNA      11.98       18.78       9.65        122.24^a
12S rRNA    3.90        6.69        2.83        131.93^a
RNase P     2.98        4.86        1.30        59.21^a

^a LRT value significant at P < 0.0001
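The Fig. 3 caption describes a predicted relationship in which the paired-site rate ratio equals the square of the unpaired-site rate ratio. As a hedged illustration, the Python sketch below computes, for each structure, the exponent x that would satisfy paired = unpaired ** x, using the paired and unpaired values copied from Table 3. The names `TABLE3` and `exponent` are made up for this example and do not come from the article, and the calculation is a reader's check rather than the authors' analysis.

```python
import math

# (paired ratio, unpaired ratio) per structure, copied from Table 3
TABLE3 = {
    "RRE": (4.21, 9.01),
    "IRES": (15.34, 3.60),
    "CRE": (22.36, 2.93),
    "5S rRNA": (4.44, 2.82),
    "16S rRNA": (3.79, 2.02),
    "23S rRNA": (3.06, 1.71),
    "A tRNA": (9.48, 3.30),
    "M tRNA": (18.78, 9.65),
    "12S rRNA": (6.69, 2.83),
    "RNase P": (4.86, 1.30),
}


def exponent(kappa_p, kappa_u):
    """Return x such that kappa_p == kappa_u ** x (requires kappa_u > 0, != 1)."""
    return math.log(kappa_p) / math.log(kappa_u)


for name, (kappa_p, kappa_u) in TABLE3.items():
    print(f"{name:10s} x = {exponent(kappa_p, kappa_u):.2f}")
```

An exponent near 2 matches the predicted relationship, and an exponent below 1 means the paired ratio is smaller than the unpaired one, as the text reports for RRE.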

1784 Knies et al.

Page 6: Data as research output, data as part of the scholarly record

Published tables & figures

Analysed data

Raw data

Relatively little data is published within articles

Page 7: Data as research output, data as part of the scholarly record

Reuse of open data boosts citations to the original article

Piwowar and Vision (2013) doi:10.7717/peerj.175

Page 8: Data as research output, data as part of the scholarly record

Most analyzed data is in the ‘long tail’, for which there is no specialized repository

Volume

Rank frequency of datatype

Structured data (e.g. GenBank, GBIF)

Long-tail data

After Heidorn (2008) doi:10.1353/lib.0.0036

Page 9: Data as research output, data as part of the scholarly record

Peer-to-peer data sharing does not work

Wicherts and colleagues requested data from 141 articles in American Psychological Association journals.

“6 months later, after … 400 emails, [sending] detailed descriptions of our study aims, approvals of our ethical committee, signed assurances not to share data with others, and even our full resumes…” only 27% of authors complied

Wicherts JM, Borsboom D, Kats J, Molenaar D (2006) doi:10.1037/0003-066X.61.7.726

Page 10: Data as research output, data as part of the scholarly record

Information content

Time

Time of publication

Specific details

General details

Accident

Retirement or career change

Death

(Michener et al. 1997)

Data is best captured at the time of publication

Page 11: Data as research output, data as part of the scholarly record

Bumpus HC (1898) The Elimination of the Unfit as Illustrated by the Introduced Sparrow, Passer domesticus. Biological Lectures from the Marine Biological Laboratory: 209-226.

CC-BY Adamo http://www.piqs.de/fotos/121272.html

Page 12: Data as research output, data as part of the scholarly record

Joint Data Archiving Policy (JDAP)

Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future.

As a condition for publication, data supporting the results in the article should be deposited in an appropriate public archive.

Authors may elect to embargo access to the data for a period up to a year after publication.

Exceptions may be granted at the discretion of the editor, especially for sensitive information.

http://datadryad.org/pages/jdap

Page 13: Data as research output, data as part of the scholarly record

Piwowar HA, Chapman WW (2008) hdl:10101/npre.2008.1700.1

High impact factor journals have stronger data archiving policies

n=70

IF=3.6

IF=4.5

IF=6.0

Page 14: Data as research output, data as part of the scholarly record
Page 15: Data as research output, data as part of the scholarly record

published data (with article citation)

published article (with data citation)

DRYAD

JOURNAL

prepare manuscript and related data files

submit manuscript

editor

manuscript review

curation

send article description

Dryad data package

accepted?

yes / no

send data identifier (DOI)

author

accepted?

data curator

upload data

Page 16: Data as research output, data as part of the scholarly record

When using this data, please cite the original article:
Chave J, Coomes D, Jansen S, Lewis SL, Swenson NG, Zanne AE (2009) Towards a worldwide wood economics spectrum. Ecology Letters 12: 351-366. doi:10.1111/j.1461-0248.2009.01285.x

Additionally, please cite the Dryad data package:
Zanne AE, Lopez-Gonzalez G, Coomes DA, Ilic J, Jansen S, Lewis SL, Miller RB, Swenson NG, Wiemann MC, Chave J (2009) Data from: Towards a worldwide wood economics spectrum. Dryad Digital Repository. doi:10.5061/dryad.234

Page 17: Data as research output, data as part of the scholarly record
Page 18: Data as research output, data as part of the scholarly record

No fees for submission from low and lower middle income countries

Page 19: Data as research output, data as part of the scholarly record

Dryad by the numbers

Data packages: 4,172
Authors: 15,581
Data files: 11,912
Integrated journals: 37
All journals: 268
File downloads: 4,629,256

Stats as of 23 Oct 2013

Page 20: Data as research output, data as part of the scholarly record
Page 21: Data as research output, data as part of the scholarly record

To learn more

•  Repository home: http://datadryad.org
•  News: http://blog.datadryad.org
•  Project documentation: http://wiki.datadryad.org
•  Twitter: @datadryad
•  Code: http://code.google.com/p/dryad

or contact us:

•  http://datadryad.org/feedback
•  Todd Vision, Director, [email protected]
•  Laura Wendell, Dryad Executive Director, [email protected]