Open Notebook Science

DESCRIPTION

A keynote talk I gave at OSU Research Week on the importance of Open Science, especially Open Notebook Science, illustrated by practical examples. Talk inspired by Jean-Claude Bradley. Slides inspired by Cameron Neylon.

Citation preview

1. Andrew Lang, Professor of Mathematics, Oral Roberts University. February 17, 2014. OSU Research Week

2. - Cameron Neylon
3. Eight committees investigated the allegations and published reports, finding no evidence of fraud or scientific misconduct. However, the reports* called on the scientists to avoid any such allegations in the future by taking steps to regain public confidence in their work, for example by opening up access to their supporting data, processing methods and software, and by promptly honouring freedom of information requests. * Archana Venkatraman, "Data Without the Doubts". Information World Review
4. Andrew Wakefield's study linked the measles, mumps and rubella vaccine to autism. Vaccination rates in the developed world plummeted after the study's publication and a heated anti-vaccination movement persists today.
5. http://www.cfr.org/interactives/GH_Vaccine_Map/#map
6. ?
7. Science has lost its way, at a big cost to humanity. Researchers are rewarded for splashy findings, not for double-checking accuracy. So many scientists looking for cures to diseases have been building on ideas that aren't even true. A few years ago, scientists at the Thousand Oaks biotech firm Amgen set out to double-check the results of 53 landmark papers in their fields of cancer research and blood biology. The idea was to make sure that research on which Amgen was spending millions of development dollars still held up. They figured that a few of the studies would fail the test, that the original results couldn't be reproduced because the findings were especially novel or described fresh therapeutic approaches. But what they found was startling: Of the 53 landmark papers, only six could be proved valid. http://www.latimes.com/business/la-fi-hiltzik-20131027,0,1228881.column#axzz2ix1w9zGf
8. A special challenge for science writers covering research today arises from science's growing credibility problem. It stems from the cumulative effect of errors and exaggerations that has fueled a recent rise in retractions, misconduct, and fraud among peer-reviewed researchers.
For reporters covering major scientific developments, from the search for alien life and genomics to particle physics, climate change and cancer, it can be difficult to distinguish error from fraud, sloppiness from deception, eagerness from greed or, increasingly, scientific conviction from partisan passion. Findings in fields from climate change to vaccines can also be deceptively cherry-picked in service of a political cause.
9. trust evidence
10. trust documentation
11. trust confidence
12. trust reproducibility
13. Anything produced is released under a CC0 license: Open Data, Open Access, Open Source.
14-18. Faster Science; failed experiments discoverable; unexpected collaborations; real-time data and results
19-23. no insider information; reusability; reproducibility; transparency
24. Open Drug Discovery for Neglected Diseases: malaria; schistosomiasis; gram positive bacteria; breast cancer
25. Drugs for neglected diseases need to be
26. cheap and
27. easy to make.
28. docking combinatorial library synthesis solvent selection recrystallization biological assay solubility models solubility data melting point models melting point data The big picture
29.
docking combinatorial library synthesis solvent selection recrystallization biological assay solubility models solubility data melting point models melting point data Let's focus
30. Early models, before 2005, were
31. specialized
    1979  Martin      disubstituted benzenes
    1987  Hanson      normal alkanes
    1988  Needham     normal and branched alkanes
    1990  Abramowitz  non-hydrogen bonded benzenes
    1991  Dearden     anilines
    1993  Katritzky   aldehydes, amines, and ketones
    1994  Simamora    rigid aromatic
    1996  Charlton    alkanes
    1996  Katritzky   pyridines
    1999  Zhao        aliphatic
    2001  Chickos     homologous series
    2003  Bergstrom   druglike (N = 277, r² = 0.54)
32. In 2005 everything changed
33. MDPI - cheminformatics.org: Karthikeyan 2005, N = 4173, r² = 0.65
34. PHYSPROP: Clark 2005, N = 6257, r² = 0.61
35. Recent melting point models use these datasets, never reproducing r² = 0.65 (0.47 to 0.56)
36. Even though [a] melting point can be measured accurately, its prediction has been a notoriously difficult problem.
37. We began measuring, collecting, and curating melting points in the Fall of 2010
38. Jean-Claude Bradley's Chemical Information Retrieval Course at Drexel: 567 curated and referenced measurements from the Fall 2010 Chemical Information Retrieval course
39. Most popular data sources: chemical vendors
40. Alfa Aesar donates ~13,000 melting points to the public domain
41. ONS melting point workflow: collection, curation, modeling, validation, measurement
42. Collection: Open Data
    source            data points  curated values  source year  data type
    Bell                     2483            1631         1995  donated-CC0
    Bergstrom                 277             277         2003  open
    MDPI-Karthikeyan         4450            4084         2005  open
    Hughes                    287             262         2008  open
    Oxford-MSDS              3217            1481         2010  open
    Drugbank                  875             875         2011  open
    Griffiths                3757             278         2011  donated-CC0
    Alfa Aesar              12986            8739         2011  donated-CC0
    PHYSPROP                11645            9694         2011  donated-CC0
    ONS                       471             471         2012  open
    27792 curated measurements for 19515 compounds
43. Curation is lots of hard, tedious work (Jean-Claude Bradley and Antony Williams). Antony Williams, RSC ChemSpider
44.
Inconsistencies and SMILES problems within the high trust level MDPI dataset
45. PHYSPROP Structure Errors (Incorrect Valence): 2315 out of 43543 contained pentavalent nitrogens
46. PHYSPROP Errors: The structure displayed is for the neutral compound dopamine, but the associated CAS Number and chemical name in the file are for the hydrobromide salt.
47. unit errors: Kelvin/Celsius, Fahrenheit/Celsius; bad SMILES (non-rendering, hypervalency); salts associated with SMILES for free base; using boiling point for melting point
48. Some melting points can't be resolved with literature alone: 4-benzyltoluene
49. Open lab notebook page measuring the melting point of 4-benzyltoluene
50. Melting Point Model CDK descriptor calculator R statistical computing melting point data
51. use this model
52. compounds doubleplusgood single CDK descriptor calculator R statistical computing Melting Point Model
53. Comparison of model with double+ validated measurements: straight chain carboxylic acids from 1 to 10 carbons; straight chain alcohols from 1 to 10 carbons
54. Cyclic primary amines from 3 to 6 carbons: cyclobutylamine flagged for measurement; only a single source available
55. Publication of double+ validated melting point dataset as a preprint
56. Publication of double+ validated melting point dataset as a book
57. Data and model deployed on the web: web service
58. in Google spreadsheets
59. as an app
60. Can the solvents used to recrystallize compounds in organic teaching labs be improved? Trans-dibenzalacetone: aldol condensation between two molecules of benzaldehyde and one molecule of acetone. [Matthew McBride: Undergraduate Research Assistant - Drexel]
61. First recrystallized in ethyl acetate in 1906: Straus and Ecker, Ber. 39, 2988 (1906). Recrystallized in ethyl acetate in Organic Syntheses
62. Recommended recrystallization solvent: ethyl acetate. (http://classes.kvcc.edu/chm230/mixed%20aldol%20condensation.pdf) (http://www.xula.edu/chemistry/documents/orgleclab/Aldol_notes.pdf)
63.
Enter compound identification and desired parameters
64. How does it work?
    1. Look up the solvent boiling point
    2. Look up the room temperature solubility or predict it via measured or predicted Abraham descriptors
    3. Look up the solute melting point or predict it via a model
    4. Use the melting point and the solubility at room temperature to predict the solubility at boiling
    5. Calculate the predicted recrystallization yield
65. Lists solvents and their predicted recrystallization yield. The prediction is generated from temperature-dependent solubility curves.
66. ethyl acetate (predicted yield of 72%) vs ethanol (predicted yield of 93%) ethyl acetate ethanol 0.09M 1.1M 0.62M 2.06M
67. Dibenzalacetone derivatives docking against tubulin (paclitaxel site)
68. Derivatives of dibenzalacetone may be synthesized by altering the aldehyde used. From a library of derivatives, the following compound was the top hit for the docking site of Taxol. Uses phenanthrene-9-carboxaldehyde
69. Perform a Reaxys search to determine availability of synthesis procedures: no results. [Matthew McBride: Undergraduate Research Assistant - Drexel]
70. Used methanol and benzene. Melting point: 264-265 °C (http://usefulchem.wikispaces.com/EXP286) [Matthew McBride: Undergraduate Research Assistant - Drexel]
71. trust reproducibility open notebook science
72. Acknowledgements: Jean-Claude Bradley (Drexel); Cameron Neylon (Advocacy Director at PLOS); Antony Williams (RSC ChemSpider); Drexel research assistants: Evan Curtin and Matthew McBride; ORU research assistants: David Bulger, Daryl Charron, Lizzie Clark, Lacey Condron, Samantha Gaines, Alejandro Hernandez, Maria Hernandez, Jesse Patsolic, and Matthew Wilson
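
Note on slide 64: the five "How does it work?" steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the web service shown in the talk. It assumes that the logarithm of the mole-fraction solubility is linear in 1/T, and that solute and solvent are fully miscible (mole fraction 1) at the solute's melting point; the slides only say that temperature-dependent solubility curves are used. The function names and the numeric inputs are hypothetical.

```python
import math


def solubility_at(temp_k, x_room, melting_k, room_k=298.15):
    """Estimate mole-fraction solubility at temp_k.

    Assumption (not from the slides): ln(x) is linear in 1/T and passes
    through (room_k, x_room) and (melting_k, 1.0).
    """
    if not 0.0 < x_room < 1.0:
        raise ValueError("x_room must be a mole fraction between 0 and 1")
    if melting_k <= room_k:
        raise ValueError("solute must be solid at room temperature")
    if temp_k >= melting_k:
        return 1.0  # at or above the melting point: treated as miscible
    slope = math.log(x_room) / (1.0 / room_k - 1.0 / melting_k)
    return math.exp(slope * (1.0 / temp_k - 1.0 / melting_k))


def recrystallization_yield(x_room, x_boil):
    """Fraction of solute recovered when a solution saturated at the
    boiling point is cooled to room temperature (fixed amount of solvent)."""
    if x_boil >= 1.0:
        raise ValueError("solvent boils above the solute melting point")
    # Moles of solute dissolved per mole of solvent at each temperature.
    ratio_room = x_room / (1.0 - x_room)
    ratio_boil = x_boil / (1.0 - x_boil)
    return 1.0 - ratio_room / ratio_boil


if __name__ == "__main__":
    # Hypothetical inputs for illustration only; these are not the
    # values behind the yields quoted on slide 66.
    x_room = 0.01       # step 2: room-temperature solubility (mole fraction)
    melting_k = 384.0   # step 3: solute melting point in kelvin
    boiling_k = 351.5   # step 1: solvent boiling point in kelvin
    x_boil = solubility_at(boiling_k, x_room, melting_k)    # step 4
    print(recrystallization_yield(x_room, x_boil))          # step 5
```

Ranking candidate solvents then amounts to repeating the calculation for each solvent's boiling point and room-temperature solubility and sorting by predicted yield.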