Public data archiving: Who does? Who doesn't? What can we do about it?

DESCRIPTION

Presentation at UBC Biodiversity Internal Seminar Series (BLISS) http://www.zoology.ubc.ca/~biodiv/BLISS/BLISS.htm

Transcript

  • 1. Public data archiving: Who shares? Who doesn't? What can we do about it? Heather Piwowar. Presented at UBC BLISS, Sept 2010. DataONE postdoc with Dryad and NESCent, @UBC. PhD in Dept of Biomedical Informatics, U of Pittsburgh
  • 2. http://www.metmuseum.org/toah/ho/09/euwf/ho_24.45.1.htm
  • 3. http://www.flickr.com/photos/jsmjr/62443357/
  • 4. http://www.flickr.com/photos/camilleharrington/3587294608/
  • 5. http://www.flickr.com/photos/rkuhnau/3318245976/
  • 6. http://www.flickr.com/photos/conformpdx/1796399674/
  • 7. http://www.flickr.com/photos/rkuhnau/3317418699/
  • 8. http://www.flickr.com/photos/zemlinki/261617721/
  • 9. http://www.flickr.com/photos/tracenmatt/3020786491/
  • 10. http://www.flickr.com/photos/the-o/2078239333/
  • 11. http://www.flickr.com/photos/ryanr/142455033/
  • 12. http://www.flickr.com/photos/75166820@N00/5318468/
  • 13. Find Organize Document Deidentify Format Decide Ask Submit Answer questions Worry about mistakes being found Worry about data being misinterpreted Worry about being scooped Forgo money and IP and prestige???
  • 14. not very motivating.
  • 15. As a result, policy makers have spent lots of time and money.... http://www.flickr.com/photos/johnnyvulkan/381941233/ http://www.flickr.com/photos/tonivc/2283676770/
  • 16. building databases, developing standards, articulating best practices to support public archiving of research datasets
  • 17. lots of data sharing! http://www.genome.jp/en/db_growth.html
  • 18. but how much isn't shared? what isn't shared? who isn't sharing it? why not? how much does it matter? what can we do about it?
  • 19. you cannot manage what you do not measure quote: Lord Kelvin http://www.flickr.com/photos/archeon/2941655917/
  • 20. As we seek to embrace and encourage data sharing, understanding patterns of adoption will allow us to make informed decisions about tools, policies, and best practices. Measuring adoption over time will allow us to note progress and identify best practices and opportunities for improvement.
  • 21. research questions 1. Is there benefit for those who share? 2. How can we study data sharing behaviour in a scalable, systematic way? 3. What factors are correlated with sharing and withholding data?
  • 22. http://www.flickr.com/photos/paulhami/1020538523//
  • 23. Which data? http://www.flickr.com/photos/paulhami/1020538523//
  • 24. Where? http://www.flickr.com/photos/paulhami/1020538523//
  • 25. With whom? http://www.flickr.com/photos/paulhami/1020538523//
  • 26. When? http://www.flickr.com/photos/paulhami/1020538523//
  • 27. Under what terms? http://www.flickr.com/photos/paulhami/1020538523//
  • 28. http://www.flickr.com/photos/paulhami/1020538523//
  • 29. http://www.flickr.com/photos/paulhami/1020538523//
  • 30. gene expression microarray data raw intensity data upon publication publicly on the internet (centralized databases) http://www.flickr.com/photos/paulhami/1020538523//
  • 31. microarray data http://en.wikipedia.org/wiki/DNA_microarray http://en.wikipedia.org/wiki/Image:Heatmap.png http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG
  • 32. microarray data
  • 33. 1. Is there benefit for those who share? http://www.flickr.com/photos/sunrise/35819369/
  • 34. currency of value? Citations.
  • 35. currency of value? Citations. $50! Diamond, Arthur M. What is a Citation Worth? The Journal of Human Resources (1986) vol. 21 (2) pp. 200-215
  • 36. dataset: 85 cancer microarray trials published in 1999-2003, as identified by Ntzani and Ioannidis (2003). citations: ISI Web of Science Citation index, citations from 2004-2005. data sharing locations: Publisher and lab websites, microarray databases, WayBack Internet Archive, Oncomine. statistics: Multivariate linear regression
  • 37. Note: log scale
  • 38. ~70%
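The statistics named on the dataset slide (multivariate linear regression, with citations viewed on a log scale) can be illustrated with a small sketch. Everything in it is an assumption for illustration: the data is synthetic, `shared_data` and `impact_factor` are hypothetical predictor names, and the plain-Python solver stands in for whatever statistics package was actually used. It only shows how a coefficient fitted on log citations reads as a multiplicative change.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ols(rows, y):
    """Ordinary least squares via the normal equations, with an intercept."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Synthetic, noise-free data: (shared_data flag, impact_factor). Hypothetical.
predictors = [(0, 2.0), (1, 2.0), (0, 5.0), (1, 5.0), (0, 8.0), (1, 9.0)]
log_citations = [1.0 + 0.5 * s + 0.1 * jif for s, jif in predictors]

intercept, beta_shared, beta_impact = fit_ols(predictors, log_citations)

# On a log outcome, a 0/1 predictor with coefficient b multiplies the
# expected outcome by exp(b), i.e. a relative change of exp(b) - 1.
relative_change = math.exp(beta_shared) - 1
```

Because the outcome is on a log scale, the coefficient on a yes/no predictor is read as a ratio of citation counts and not as a difference, holding the other predictors fixed.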
  • 39. 2. Need automated methods to: a) Identify studies that create datasets b) Determine which of these have in fact been shared c) Extract attributes about the environment
  • 40. a) Identify studies that create datasets http://www.flickr.com/photos/lofaesofa/248546821/
  • 41. Look for wet lab methods in article full text: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1522022&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1590031&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1482311&tool=pmcentrez#id331936 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2082469&tool=pmcentrez http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=126870&tool=pmcentrez#id442745
  • 42. Combined, these full-text portals reach 85% of the articles available through U of Pittsburgh library subscriptions.
  • 43. But how to generate an effective query? Use open access articles.
  • 44. text analysis: automatically catalogued single words and word-pairs from full text; assessed precision and recall; combined the high performers:
  • 45. Derived query: ("gene expression" AND microarray AND cell AND rna) AND (rneasy OR trizol OR "real-time pcr") NOT (tissue microarray* OR cpg island*)
  • 46. Evaluation: Ochsner et al. Nature Methods (2008) 400 studies across 20 journals Precision: 90% (conf int: 86% to 93%) Recall: 56% (conf int: 52% to 61%)
  • 47. a) Identify studies that create datasets b) Determine which of these have in fact been shared c) Extract attri