Open Data - Where Do We Stand from a Researcher's Perspective?

  • View

  • Download

Embed Size (px)


Presented at Kansas State University as Part of Open Access Week, October 23, 2012.


  • 1.Open Data Where Do WeStand from A Researchers Perspective? Philip E. BourneUniversity of California San

2. My Perspective Mine is a biomedical sciences perspective My lab. distributes for free data equivalent to theLibrary of Congress every month I am a supporter of open access (provided there is abusiness/sustainability model) and founding editor inchief of PLOS Computational Biology I am Co-founder of SciVee Inc. and believeinnovation comes from open access to knowledge Recently became UCSDs AVC of Innovation which isgiving me a more institutional perspective I Readily Acknowledge Each Discipline is Different 3. My General Opinion:Where Does the Open Access DebateStand Today? Its not a question of if but a question ofwhen and how for most disciplines We are at the tip of the iceberg in ourability to use OA content OA will gain momentum in an increasinglyknowledge-based economy 4. The State of Play: UC Open Access Policy Debate: Opt Out vs Opt in For Against Publically funded Cost to someresearch should bedisciplinespublic Impact on societies Institutional Journal quality rePerspective: The open promotionprovision of data and Extra workknowledge derived Administrationfrom these dataappears to be an UC as Big Brotherunidentified asset atthis time 5. We will come back to this, but first let us explore why openknowledge is so important (tome at least) 6. Open Data May* Save Lives?Structure Summary page activity forH1N1 Influenza related structures Jan. 2008Jul. 2008Jan. 2009 Jul. 2009 Jan. 2010 Jul. 2010 3B7E: Neuraminidase of A/Brevig Mission/1/1918 H1N1 strain in complex with zanamivir 1RUZ: 1918 H1 Hemagglutinin * 7. Open Science Can Accelerate the Scientific ProcessFor some people the change maybe too slow to save their life 8. Josh Sommer A Remarkable Young ManCo-founder & Executive Director the Chordoma Foundation 9. Chordoma A rare form of braincancer No known drugs Treatment surgicalresection followed byintense radiationtherapy 10. 11. 12. 13. If I have seen further it is only bystanding on the shoulders of giants Isaac Isaac NewtonFrom Joshs point of view the climbup just takes too long> 15 years and > $850M to bemore preciseAdapted: 14. 15. 16. 17. The Story of Meredith 18. What Does Meredith Tell Us? The Wikipedia / Kahn Academy /YouTubegeneration knows no bounds Bounds are too often imposed by traditionrather than what makes the most sense Another example of an underexploitedasset at this time? 19. Another Way of Thinking About the Implications of What Joshand Meredith Represent Is theNeed for New Forms of Knowledge Management andAccessLets Explore this Notion with An Emphasis on Data 20. The Silos of Data & Knowledge Are Starting to CoalesceIs a Biological Database Really Different than a Biological Journal?PLoS Comp. Biol. 2005 1(3) e34 21. The Silos of Data & Knowledge Are Starting to Coalesce Supplemental information Databases are nowhas exploded knowledgebases Data journals are Science can be done onemerging the fly The use of rich media is Biocuration is a respectfulincreasing career Software and otherprocesses are becomingavailablePLoS Comp. Biol. 2008. 4(7): e1000136 22. Where Does That Take Us? A paper is an artifact of a previous era It is not the logical end product of eScience,hence: Work is omitted Article vs supplement is a mess Visualization may be limited Interaction and enquiry are non-existent Rich media can help, but barriers remain 23. Where Does That Take Us? Data Sharing Policies From the NSF: Investigators are expected to share with other researchers, atno more than incremental cost and within a reasonable time,the primary data, samples, physical collections and othersupporting materials created or gathered in the course ofwork under NSF grants. Grantees are expected to encourageand facilitate such sharing. See Award & Administration Guide(AAG) Chapter VI.D.4. 24. Big Data is Off March 2012 OSTPcommits $200M toBig Data NSF, DOD, NIH allannounce programs GBMF think tankleads to soon-to-be-announcedinstitutional awards 25. Where Does That Take Us? Add into the Mix: Reproducibility It really is a myth! Maintainability DNA doubles in 5 months Usability Go ahead and try! Reward Tenure for data no wayNotwithstanding dreams do emerge Here is mine 26. Here is What The Knowledge and Data Cycle0. Full text of PLoS papers stored4. The composite view hasI Want in a database links to pertinent blocks of literature text and back to the PDB 1. User clicks on thumbnail 4.2. Metadata and awebservices call providea renderable image that1.can be annotated 3. A composite view of 1. A link brings up figuresfrom the paperjournal and database 3. Selecting a features content results 3. provides adatabase/literaturemashup 4. That leads to newpapers 2.2. Clicking the paper figure retrieves data from the PDB which is analyzed PLoS Comp. Biol. 2005 1(3) e34 27. The Knowledge Economy BeginsCardiac DiseaseLiterature Immunology Literature 28. Simultaneously DiscoveryInformatics Emerges Google with notsuffice as a scientificknowledge discoverytool Google is broad butshallow Science is cross-disciplinary narrowerand deeper 29. NSF Discovery Informatics Workshop Discoveries surpass an individuals ability - need intelligent tools Need to increase connections between knowledge and data Need to combine diverse human abilitiesDiscovery informatics - computer scientists, domain scientists,social scientists - 30. This is Just the Beginning of Discovery Informatics Each evening the labs Evernotenotebooks are scanned for commonalitiesfrom the days activities. These are seedsin a deep search of the web for knowledgeand data that has become available sincelast searched. Results are ranked andpresented for consideration over coffeethe next morning 31. Unimaginable Connections Made AutomaticallyThrough RDF Descriptions 32. Before We Get Too Heady LetsLook at the Realities of theSituation from My Perspective Data repositories are broken There is a high noon effect NCBI has been a wonderful model todate 33. Data/Institutional Repositories Build it and they will come fails most of thetime Institutional repository is an oxymoron NCBI works because: It is an act of the US congress It has strong leadership It has a monopoly on the literature It has IT thought out over many years Innkeeper at the Roach Motel D. Salo 2008 34. Data/Institutional Repositories High Noon Effect Publishers make knowledge in very difficult,but at least knowledge out, albeit limited isconsistent, intuitive and easy to use Data repositories make data in and data outvery difficult they strive to be different whenin fact users want them to be the same 35. Data and Journals That journals are thinking about data isgood Dryad etc. are welcome but a stop gapmeasure Fully functional data journals will not occurwithout a change to the reward system Data papers can help shift the rewardsystem Are PLoS Topic Pages a sign? 36. Interim Solution:Use the Traditional Reward SystemThe Wikipedia Experiment Topic Pages Identify areas of Wikipedia thatrelate to the journal that aremissing of stubs Develop a Wikipedia page in thesandbox Have a Topic Page Editor Reviewthe page Publish the copy of record withassociated rewards Release the living version intoWikipedia 37. Think Globally Act Locally:What Can Our Institutions DoNow To Move Us in The RightDirection? 38. Institutional Response Have repositories that are useful Use common standards Are vetted by the community Are fully open and searchable Reward all forms of scholarship Leverage the asset 39. Most Laboratories We are the long tail Goodbye to the student is goodbye to the data Very few of us have complied (or will comply with the data management plans we write into grants) 40. UCSD Dropbox Simple!!!! Can drop large files easily Asks for limited metadata and permissions todiscover Has guaranteed quality of service andsecurity not available in the cloud Is the data management plan and chargedagainst grants Is a rich campus corpus open to discoveryinformatics 41. The UCSD Dropbox Discovery Environment Scenarios: Fosters known collaborations throughsimplified data exchange Discovers new collaborators through thesame or related data elements A corpus whose intrinsic value is as yetunknown 42. What Do I Want by 2020 orEarlier as a Researcher? Answer biological questions not justretrieve data Understand all there is to know about theavailability and quality of a unit ofbiological data Operate on data in a way that is simpler,more productive, and reproducible 43. What Do We Need to Do to GetThere? A Data Registry? Individual repositories register theirmetadata which includes accessstatistics, commentary etc. DataCiteis a beginning Identify identical data objects and theirrespective metadata for comparativeanalysis Funders support registration Publishers support registration 44. What Do We Need to Do toGet There? An App+ Store? The App model Think of it operating on a content baserather than a mobile device Simple and consistent user interface Needs to pass some quality control Has a reward The App+ Model Apps interoperate through a genericworkflow