125
The Possibilities and Pitfalls of Internet- Based Chemical Data Antony Williams Royal Society of Chemistry

The Possibilities and Pitfalls of Internet-Based Chemical Data

Embed Size (px)

DESCRIPTION

In less than a decade the internet has provided us access to enormous quantities of chemistry data. Chemists have embraced the web as a rich source of data and knowledge. However, all that glisters is not gold and while online searches can now provide us access to information associated with many tens of millions of chemicals, can allow us to traverse patents, publications and public domain databases the promise of high quality data on the web needs to be tempered with caution. In recent years the crowdsourcing approach to developing curated content has been growing. Can such approaches allow us to bring to bear the collective wisdom of the crowd to validate and enhance the availability of trusted chemistry data online or are algorithms likely to be more powerful in terms of validating data? While it is now possible to search the web using a query language form natural to chemists – that of “structure searching the web” - increasingly scientists are likely going to have to accept joint responsibility for the quality of data online for the foreseeable future. Their participation is likely to come through engaging in open science, the provision of data under open licenses and by offering their skills to the community. This presentation will provide an overview of the present state of chemistry data online, the challenges and risks of managing and accessing data in the wild and how an internet for chemistry continues to expand in scope and possibilities.

Citation preview

Page 1: The Possibilities and Pitfalls of Internet-Based Chemical Data

The Possibilities and Pitfalls of Internet-

Based Chemical Data

Antony WilliamsRoyal Society of Chemistry

Page 2: The Possibilities and Pitfalls of Internet-Based Chemical Data

I’ve performed a few dozen chemical syntheses

I’ve run thousands of analytical spectra I’ve generated thousands of NMR

assignments I’ve probably published <5% of all work But things can be different today….

About Me…as a Chemist

Page 3: The Possibilities and Pitfalls of Internet-Based Chemical Data

My Early Scientific Computing

Page 4: The Possibilities and Pitfalls of Internet-Based Chemical Data

If it was not just about me…

Page 5: The Possibilities and Pitfalls of Internet-Based Chemical Data

If it was not just about me…

Together we might: build an encyclopedia …and rate restaurants …provide book reviews to each other …or movie reviews …or reviews of service providers …organize sit-ins and social action …and more data might just be Open

Page 6: The Possibilities and Pitfalls of Internet-Based Chemical Data

If it was not just about me…

Together we might: build an encyclopedia …and rate restaurants …provide book reviews to each other …or movie reviews …or reviews of service providers …organize sit-ins and social action …and more data might just be Open …more Chemists might share rather than

just take!

Page 7: The Possibilities and Pitfalls of Internet-Based Chemical Data

A hobby-project to connect chemistry data on the web Three servers – one purchased, two hand-built Software begged and borrowed – and thanks to Microsoft! Some late nights – 10pm to 2am for over a year Some survival of the naysayers in the community …and taking advantage of a changing world of data availability

and the crowdsourcing of willing participants

NO formal funding. Simply passion and abilities lining up.

A story of a hobby gone wild…

Years 1 and 2

Page 8: The Possibilities and Pitfalls of Internet-Based Chemical Data

ChemSpider(Year 2-present)

Building a Free Chemical Database

A central hub for chemists to source information >28 million unique chemical records Aggregated from >400 data sources Chemicals, analytical data, movies, images,

podcasts, links to patents, publications, predictions Web services for integration Daily updates of data

Page 9: The Possibilities and Pitfalls of Internet-Based Chemical Data

Answer Questions for Chemists

Questions a chemist might ask… What is the melting point of n-heptanol? What is the chemical structure of Xanax? Chemically, what is phenolphthalein? What are the stereocenters of cholesterol? Where can I find publications about xylene? What are the different trade names for

Ketoconazole? What is the NMR spectrum of Aspirin? What are the safety handling issues for Thymol

Blue?

Page 10: The Possibilities and Pitfalls of Internet-Based Chemical Data

A LITTLE Chemistry First

Page 11: The Possibilities and Pitfalls of Internet-Based Chemical Data

Structural Diagrams

Page 12: The Possibilities and Pitfalls of Internet-Based Chemical Data

Structural Diagrams

Page 13: The Possibilities and Pitfalls of Internet-Based Chemical Data

Analytical Data

Page 14: The Possibilities and Pitfalls of Internet-Based Chemical Data

Does Stereochemistry Matter?

Page 15: The Possibilities and Pitfalls of Internet-Based Chemical Data

Does one stereocenter matter?

Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide

Page 16: The Possibilities and Pitfalls of Internet-Based Chemical Data

Structural Representations

Page 17: The Possibilities and Pitfalls of Internet-Based Chemical Data

The InChI Standard

Page 18: The Possibilities and Pitfalls of Internet-Based Chemical Data

InChIKeysSearch the Web by

Structure

Page 19: The Possibilities and Pitfalls of Internet-Based Chemical Data

I want to know about “Vincristine”

Page 20: The Possibilities and Pitfalls of Internet-Based Chemical Data
Page 21: The Possibilities and Pitfalls of Internet-Based Chemical Data

Vincristine: Identifiers and Properties

Page 22: The Possibilities and Pitfalls of Internet-Based Chemical Data

Vincristine: Vendors and Sources

Page 23: The Possibilities and Pitfalls of Internet-Based Chemical Data

Vincristine: Patents

Page 24: The Possibilities and Pitfalls of Internet-Based Chemical Data

Chemical Names and Synonyms

VALIDATION OF NAMES

Page 25: The Possibilities and Pitfalls of Internet-Based Chemical Data

Validated Names for Searching…

Page 26: The Possibilities and Pitfalls of Internet-Based Chemical Data

Information System Architecture

Input FilteringCuratio

nArchival

StorageIndexin

g

Processing

Search Browse

Presentation

API

Page 27: The Possibilities and Pitfalls of Internet-Based Chemical Data

The Quality of Chemical Data Online

What is the Structure of Vitamin K?

A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K1 (phytomenadione) derived from plants, VITAMIN K2 (menaquinone) from bacteria & synthetic naphthoquinone provitamins, VITAMIN K3 (menadione).

Page 28: The Possibilities and Pitfalls of Internet-Based Chemical Data

What is the Structure of Vitamin K1?

Page 29: The Possibilities and Pitfalls of Internet-Based Chemical Data

What is the Structure of Vitamin K1?

Page 30: The Possibilities and Pitfalls of Internet-Based Chemical Data

CAS’s Common Chemistry

Page 31: The Possibilities and Pitfalls of Internet-Based Chemical Data

Wikipedia

Page 32: The Possibilities and Pitfalls of Internet-Based Chemical Data

Wolfram Alpha

Page 33: The Possibilities and Pitfalls of Internet-Based Chemical Data

DailyMed

Page 34: The Possibilities and Pitfalls of Internet-Based Chemical Data
Page 35: The Possibilities and Pitfalls of Internet-Based Chemical Data

People Use Trusted Resources…

Page 36: The Possibilities and Pitfalls of Internet-Based Chemical Data

Just Yesterday…

Page 37: The Possibilities and Pitfalls of Internet-Based Chemical Data

How will it improve?

Participation and

contribution

Page 38: The Possibilities and Pitfalls of Internet-Based Chemical Data

ALL Different, ALL “Domoic Acids”

Page 39: The Possibilities and Pitfalls of Internet-Based Chemical Data

ALL Different, ALL “Domoic Acids”

Page 40: The Possibilities and Pitfalls of Internet-Based Chemical Data

The EXPERTS must get it right?!

Page 41: The Possibilities and Pitfalls of Internet-Based Chemical Data

Question Everything Online:

www.dhmo.org

Page 42: The Possibilities and Pitfalls of Internet-Based Chemical Data

ANYBODY can annotate a record on ChemSpider

Registered users can deposit new data

Registered users can validate existing data

Deposition, Annotation and Validation

Page 43: The Possibilities and Pitfalls of Internet-Based Chemical Data

CURATION Search “Vitamin H”

Page 44: The Possibilities and Pitfalls of Internet-Based Chemical Data

“Curate” Identifiers

Page 45: The Possibilities and Pitfalls of Internet-Based Chemical Data

“Curate” Identifiers

Page 46: The Possibilities and Pitfalls of Internet-Based Chemical Data

ChemSpider Web Services

Page 47: The Possibilities and Pitfalls of Internet-Based Chemical Data

ChemSpider via web service access For structure identification for mass

spectrometry For name and structure resolution For structure and substructure searching For an “innovative medicines initiative”

semantic web project…

Open APIs for Science

Page 48: The Possibilities and Pitfalls of Internet-Based Chemical Data

Open PHACTS Project Develop a set of robust standards Integrate Chemistry and Biology data by implementing

the standards in a semantic integration hub Deliver services to support drug discovery programs in

pharma and public domain INITIALLY 22 partners, 8 pharmaceutical companies, 3

biotechs 36 months project – first public release version is

imminentGuiding principle is open access, open usage, open source

- Key to standards adoption -

Page 49: The Possibilities and Pitfalls of Internet-Based Chemical Data

Using RDF permalinks http://www.chemspider.com/Chemical-Structure

.7787.rdf

Using a Search Term http://www.chemspider.com/rdf.ashx?q=cyclohe

xane http://rdf.chemspider.com/cyclohexane

RDF and the semantic web

Page 50: The Possibilities and Pitfalls of Internet-Based Chemical Data

RDF and the semantic web

Page 51: The Possibilities and Pitfalls of Internet-Based Chemical Data

www.SpectralGame.comhttp://www.jcheminf.com/content/1/1/9

Page 52: The Possibilities and Pitfalls of Internet-Based Chemical Data

Times have changed Immediacy of social networks Commenting on articles/data is here The “participating scientist” has high

profile And who can be a scientist now???

The World of Contribution

Page 53: The Possibilities and Pitfalls of Internet-Based Chemical Data

A Ten Year Old Scientist

Page 54: The Possibilities and Pitfalls of Internet-Based Chemical Data
Page 55: The Possibilities and Pitfalls of Internet-Based Chemical Data

Challenging a Publication

Page 56: The Possibilities and Pitfalls of Internet-Based Chemical Data
Page 57: The Possibilities and Pitfalls of Internet-Based Chemical Data

Oops…

Page 58: The Possibilities and Pitfalls of Internet-Based Chemical Data

>2 Years to Resolution

Page 59: The Possibilities and Pitfalls of Internet-Based Chemical Data

What of Hexacyclinol?

Page 60: The Possibilities and Pitfalls of Internet-Based Chemical Data

The Blogosphere “Discusses”…

Page 61: The Possibilities and Pitfalls of Internet-Based Chemical Data

Oxidation by Sodium Hydride?

Page 62: The Possibilities and Pitfalls of Internet-Based Chemical Data

The Blogosphere Analyzes…

Page 63: The Possibilities and Pitfalls of Internet-Based Chemical Data

The Blogosphere Analyzes…

Page 64: The Possibilities and Pitfalls of Internet-Based Chemical Data

How much is in the archives?

Page 65: The Possibilities and Pitfalls of Internet-Based Chemical Data

Open Notebook Science Analysis

Page 66: The Possibilities and Pitfalls of Internet-Based Chemical Data

Motivation Faster Science, Better

Science

Page 67: The Possibilities and Pitfalls of Internet-Based Chemical Data

Openness – Still Carries Licensing

Openness may be hard..

Open Access flavors Open Source licenses Open Data licenses Open Notebook

Science

Page 68: The Possibilities and Pitfalls of Internet-Based Chemical Data

License data based on GOALS: scientific, commercial, or mixed

Explore the benefits of open licensing and drawbacks of enclosure

Provide simple explanations terms of use If you can't make the data public domain,

make the metadata public domain.

We SuggestRules for Licensing Data

Page 69: The Possibilities and Pitfalls of Internet-Based Chemical Data

We SuggestRules for Licensing Data

Page 70: The Possibilities and Pitfalls of Internet-Based Chemical Data

Challenged in the Twittersphere

Page 71: The Possibilities and Pitfalls of Internet-Based Chemical Data

Annotating Articles Today…

Page 72: The Possibilities and Pitfalls of Internet-Based Chemical Data

Attribution to me…

Page 73: The Possibilities and Pitfalls of Internet-Based Chemical Data

Other Publications to Annotate…

Page 74: The Possibilities and Pitfalls of Internet-Based Chemical Data

Other Publications to Annotate…

Page 75: The Possibilities and Pitfalls of Internet-Based Chemical Data

Publications to Annotate…

“We then established a collaboration with professor Sum Ting Wong, a fugitive from the North Korean University Hu Yu Hai Ding”

“..identified as the new protein Wai So Dim”

Page 76: The Possibilities and Pitfalls of Internet-Based Chemical Data

A New World for Publishing?

Page 77: The Possibilities and Pitfalls of Internet-Based Chemical Data

An Adventure into the World of Small

but significant contribution..

Page 78: The Possibilities and Pitfalls of Internet-Based Chemical Data

ChemSpider SyntheticPages

Page 79: The Possibilities and Pitfalls of Internet-Based Chemical Data

Micropublishing with Peer Review

(a chemical synthesis blog?)

Page 80: The Possibilities and Pitfalls of Internet-Based Chemical Data

Multi-Step Synthesis

Page 81: The Possibilities and Pitfalls of Internet-Based Chemical Data

Interactive Data

Page 82: The Possibilities and Pitfalls of Internet-Based Chemical Data

A New Route for Scientific Recognition?

Page 83: The Possibilities and Pitfalls of Internet-Based Chemical Data

How do “we” measure a scientist? The funding bodies, department heads etc. use

Publication profile Impact factors An index – h, m, g, i10, c, s … Grants brought in

Scientists are notable in different ways – technology can help measure different types of “impact”

The Measure of a Scientist?

Page 84: The Possibilities and Pitfalls of Internet-Based Chemical Data

What makes a Scientist Notable?

Page 85: The Possibilities and Pitfalls of Internet-Based Chemical Data

Online tools track activities of scientistsSome are totally opt-in, an increasing

number are about you and need checking!Take responsibility for your profile online Actively BUILD your online profile

Public Profiles of Scientists

Page 86: The Possibilities and Pitfalls of Internet-Based Chemical Data

Microsoft Academic Search

Page 87: The Possibilities and Pitfalls of Internet-Based Chemical Data

My Academic Search Profile

Page 88: The Possibilities and Pitfalls of Internet-Based Chemical Data

My Co-author Graph

Page 89: The Possibilities and Pitfalls of Internet-Based Chemical Data

How many times do you see errors where: 1) You have not been able to annotate

or curate 2) You have chosen not to annotate or

curate

Q: How Often Do You Contribute?

Annotation and Validation

Page 90: The Possibilities and Pitfalls of Internet-Based Chemical Data

My Co-author Graph

Page 91: The Possibilities and Pitfalls of Internet-Based Chemical Data

Contribute when you can!

Page 92: The Possibilities and Pitfalls of Internet-Based Chemical Data

Contribute when you can!

Page 93: The Possibilities and Pitfalls of Internet-Based Chemical Data
Page 94: The Possibilities and Pitfalls of Internet-Based Chemical Data

Scientists and Orcids?

Page 95: The Possibilities and Pitfalls of Internet-Based Chemical Data

A unique identifier for a scientist – a Scientists InChI !

Will enable aggregation of a scientists activities

ORCIDs associated with publications, data, blog comments, other contributions (Wikipedia, reviews etc.) will be a way to measure their impact

Page 96: The Possibilities and Pitfalls of Internet-Based Chemical Data

The Alt-Metrics Manifesto

http://altmetrics.org/manifesto/

Page 97: The Possibilities and Pitfalls of Internet-Based Chemical Data

ImpactStory

Page 98: The Possibilities and Pitfalls of Internet-Based Chemical Data

ImpactStory

Page 99: The Possibilities and Pitfalls of Internet-Based Chemical Data

SlideShare

Page 100: The Possibilities and Pitfalls of Internet-Based Chemical Data

SlideShare via ImpactStory

Page 101: The Possibilities and Pitfalls of Internet-Based Chemical Data

ImpactStory

Page 102: The Possibilities and Pitfalls of Internet-Based Chemical Data

Where do I contribute? How might I be measured?

Page 103: The Possibilities and Pitfalls of Internet-Based Chemical Data

Article Level Metrics

Page 104: The Possibilities and Pitfalls of Internet-Based Chemical Data

Article Level Metrics

Page 105: The Possibilities and Pitfalls of Internet-Based Chemical Data

Impact will be an aggregate measure of Publications – classic measures and article level metrics Data, algorithms and code – and its distribution and

reuse Contributions as comments, annotation and curation

activities

New “impact factors” will develop with time

New Measures of Impact

Page 106: The Possibilities and Pitfalls of Internet-Based Chemical Data

Some challenges are technology based The growth in data – storage and compute speed Ontologies, dictionaries and trusted sources

Many challenges are “about us” Licenses and rights Rewards and recognition Participation, contribution and collaboration

The Challenges

Page 107: The Possibilities and Pitfalls of Internet-Based Chemical Data

There are many government institutions building public compound databases that should collaborate more: National Cancer Institute (NCI) National Institutes of Health (NIH) Environmental Protection Agency (EPA) Food and Drug Administration (FDA) National Library of Medicine (NLM)

Tear Down Walls between Government Labs

Page 108: The Possibilities and Pitfalls of Internet-Based Chemical Data
Page 109: The Possibilities and Pitfalls of Internet-Based Chemical Data

Release STRUCTURES Please!

Page 110: The Possibilities and Pitfalls of Internet-Based Chemical Data

What Does the Future Hold?

Page 111: The Possibilities and Pitfalls of Internet-Based Chemical Data

The Linked Network Will Grow

Page 112: The Possibilities and Pitfalls of Internet-Based Chemical Data

The Data Deluge Will Not Go Away

Page 113: The Possibilities and Pitfalls of Internet-Based Chemical Data

RSC Activities in Development

Deliver a Global Chemistry Hub “Data enable” the RSC archive back to 1841:

Extract chemistry – chemicals, reactions, experimental data points, complex data

Enrich the articles for interactive viewing and crowdsourced annotation and curation

Enhance queries possible across the archive

Page 114: The Possibilities and Pitfalls of Internet-Based Chemical Data

Federated Data Segregation

Page 115: The Possibilities and Pitfalls of Internet-Based Chemical Data

Future System Architecture

Input FilteringCuratio

nArchival

StorageIndexin

g

Processing

Search Bro

PresentationNo more complex

APIComplexity is hidden

InputInput

CurationCuratio

n

StorageStorageElastic,

distributedIndexing

Indexing

New algorithms

Processing

Processing

Distributed

SearchSearchOver

federated systems

ArchivalArchival

FilteringFilteringSmarter

algorithms

BrowseOver

federated systems

Page 116: The Possibilities and Pitfalls of Internet-Based Chemical Data

Data Validation is Exacting Work

Page 117: The Possibilities and Pitfalls of Internet-Based Chemical Data

“Challenge” the Community

Page 118: The Possibilities and Pitfalls of Internet-Based Chemical Data

Chemistry is NOT just small molecules!Data in RSC publications will be “enabled” Data available for validation and curationThe delivery of the “Datument”Data will be fed to models for validation, to

retrain the models, full provenance retainedAlgorithms will be provided to the community

Chemistry Data at RSC

Page 119: The Possibilities and Pitfalls of Internet-Based Chemical Data

Enhanced Mark-Up?

Page 120: The Possibilities and Pitfalls of Internet-Based Chemical Data

An Error in my Abstract?

Page 121: The Possibilities and Pitfalls of Internet-Based Chemical Data

An Error in my Abstract?

Chemists have embraced the web as a rich source of data and knowledge. However, all that glisters is not gold

Page 122: The Possibilities and Pitfalls of Internet-Based Chemical Data

Thanks Shakespeare

Page 123: The Possibilities and Pitfalls of Internet-Based Chemical Data

Acknowledgments

RSC and RSC|Cheminformatics team All data source providers, curators and

annotators All software providers: commercial and open

source Contributors, curators, collaborators

Trusted Advisors: Jean-Claude Bradley, Sean Ekins, Lee Harland, Gary Martin, Martin Walker and…

Page 124: The Possibilities and Pitfalls of Internet-Based Chemical Data

Meet Valery…We’d love to chat…

Page 125: The Possibilities and Pitfalls of Internet-Based Chemical Data

Thank you

Email: [email protected] Twitter: ChemConnectorPersonal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams