Towards a Gold Standard: Improving the Quality of Public Domain Chemistry Databases

Antony J. Williams (1), Sean Ekins (2)

(1) Royal Society of Chemistry, Wake Forest, NC 27587
(2) Collaborations in Chemistry, Fuquay Varina, NC 27526

The future: crowdsourced drug discovery

Williams et al., Drug Discovery World, Winter 2009

Safety data

Toxicity data

Blogs and Wikis

Property databases

Experimental results

Scientific publications

Compound aggregators

Open Notebook Science

Metabolic pathway databases

Encyclopedic articles (Wikipedia)

Chemistry structures are proliferating on the web

Users take them at face value

They SHOULD NOT!!!

Immense quantities of scientific information are contained in thousands of databases.

Progress can, however, be inhibited by errors in these databases, with downstream effects when the data are reused.

http://bit.ly/zWGaps

What is the Structure of Vitamin K1?

What Mechanisms Do We Have to Alert the Community?

Email database owner and hope for a response

Blog it

Tony has been blogging about database quality for years and nobody was listening, other than the people at PubChem

For some databases, when he blogged they listened and would edit!

Tweet it

Dec 2010 - We felt something had to be said definitively about structure quality

Publish it – wrote to Science, Nature and then PLoS Computational Biology

http://bit.ly/qtJF2f

Perhaps the phone?

April 27, 2011 - Then came the NPC Browser

Science Translational Medicine 2011

But wait, hold on – did anyone peer review the database??

Database released and within days...

A quick analysis of structure quality revealed...

Hundreds of errors found in structures

Williams and Ekins, DDT, 16: 747-750 (2011)

NPC Browser http://tripod.nih.gov/npc/

Neomycin in NPC Browser http://tripod.nih.gov/npc/

Neomycin In ChemSpider

How many contribute to clean-up?

Fewer than a dozen contributors to data

The majority are project members

The crowd is small...

This is the same for all cheminformatics crowd-based efforts

What Mechanisms Do we Have to Alert the Community – Publishing is too slow

Williams and Ekins, DDT, 16: 747-750 (2011)

Tony blogged April 28th, 1 day after release: http://bit.ly/jn8wLC

I blogged April 29th, suggesting the need for a gold standard database: http://bit.ly/lXHInG

After more extensive analysis we sent a manuscript to Science Translational Medicine - Rejected

Drug Discovery Today... accepted... 8 months after we pointed out the issue, even before the NPC Browser release

Responses from Community and NCGC

Comments on initial blog

NCGC added a disclaimer, which I blogged about on May 23rd: http://bit.ly/m4Tx2b

Sept 8th 2011: Email from Tudor Oprea (cc’ed to 60 others)

He has also been pointing out database errors for years

Followed by one from Chris Austin offering to meet us

Several individuals thanked us for the alert

Antony J. Williams, Sean Ekins and Valery Tkachenko, "Towards a Gold Standard: Regarding Quality in Public Domain Chemistry Databases and Approaches to Improving the Situation", Drug Discovery Today, In Press 2012

More Extensive Analysis and solutions

More analysis of NPC browser errors

“analysis of the NPC browser ‘HTS amenable compounds’ subset of data for 7600 compounds identified fundamental errors in stereochemistry, valency issues and charge imbalances in a few minutes work using a rudimentary software tool”

Analysis of other chemistry databases and errors

Other types of databases and errors

Offered solutions

Data Errors in the NPC Browser: Analysis of Steroids

Substructure    # of Hits   # of Correct Hits   No stereochemistry   Incomplete stereochemistry   Complete but incorrect stereochemistry
Gonane          34          5                   8                    21                           0
Gon-4-ene       55          12                  3                    33                           7
Gon-1,4-diene   60          17                  10                   23                           10

Antony J. Williams, Sean Ekins and Valery Tkachenko, "Towards a Gold Standard: Regarding Quality in Public Domain Chemistry Databases and Approaches to Improving the Situation", Drug Discovery Today, In Press 2012
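
To put the steroid counts above in proportion, the short Python sketch below computes the share of hits with correct stereochemistry per substructure and overall. The counts are transcribed from the table. The function and variable names are illustrative only and are not part of the original analysis.

```python
# Counts transcribed from the steroid table:
# (hits, correct, no stereo, incomplete stereo, complete but incorrect stereo)
steroid_hits = {
    "Gonane": (34, 5, 8, 21, 0),
    "Gon-4-ene": (55, 12, 3, 33, 7),
    "Gon-1,4-diene": (60, 17, 10, 23, 10),
}


def summarize(counts):
    """Return (hits, correct, percent correct) for one table row."""
    hits, correct, no_stereo, incomplete, incorrect = counts
    # Sanity check: the four categories should account for every hit.
    assert correct + no_stereo + incomplete + incorrect == hits
    return hits, correct, 100.0 * correct / hits


def overall(table):
    """Return (total hits, total correct, percent correct) across all rows."""
    total_hits = sum(row[0] for row in table.values())
    total_correct = sum(row[1] for row in table.values())
    return total_hits, total_correct, 100.0 * total_correct / total_hits


for name, counts in steroid_hits.items():
    hits, correct, pct = summarize(counts)
    print(f"{name}: {correct}/{hits} correct ({pct:.1f}%)")

total_hits, total_correct, total_pct = overall(steroid_hits)
print(f"Overall: {total_correct}/{total_hits} correct ({total_pct:.1f}%)")
```

On these counts, 34 of the 149 steroid hits (about 23%) carry correct stereochemistry.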

Why this matters to us and YOU the CROWD?

What You Might Not Know About Chemistry Databases On The Internet

Data-sharing between open databases is cyclic

This can proliferate errors in the “Linked Data”
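
As a rough illustration of why cyclic data-sharing spreads errors, the Python sketch below treats databases as nodes in a directed graph and walks the sharing links from the database where an error first appears. The database names and links are entirely hypothetical and do not describe any real data flow.

```python
from collections import deque

# Hypothetical sharing links: each database deposits records into those listed.
shares_with = {
    "DB_A": ["DB_B"],
    "DB_B": ["DB_C"],
    "DB_C": ["DB_A", "DB_D"],  # DB_C feeds back into DB_A, closing the cycle
    "DB_D": [],
}


def databases_reached(origin, links):
    """Breadth-first walk: every database a record from `origin` can end up in."""
    seen = {origin}
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        for target in links.get(current, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# An erroneous structure first deposited in DB_B reaches every database,
# including DB_A, which sits "upstream" of DB_B in the cycle.
print(sorted(databases_reached("DB_B", shares_with)))
```

Because of the cycle, fixing the record in one database is not enough: unless every copy is corrected, the error can be re-imported on the next exchange.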

Public Domain Databases

Our databases are a mess…

Non-curated databases are proliferating errors

We source and deposit data between databases

Original sources of errors hard to determine

Curation is time-consuming and challenging

Molecule Data Quality Impacts

in silico drug discovery

vast ligand and protein–protein interaction databases

develop computational models

global mapping of pharmacological space

drug-target networks of approved drugs

prediction of off-target effects

Different types of databases and errors

Bayer paper on target validation: 2/3 of papers did not live up to claims

MDL Drug Data Report (MDDR), errors

Errors in clinical research databases vary from 2.3% to 26.9%

Multicenter analysis by MS-based proteomics identified generic problems in databases when characterizing proteins: search engines could not distinguish different identifiers, and many algorithms calculated molecular weight incorrectly

One database had between 2.1% and 13.6% of annotated Pfam hits unjustified

Ligand–protein X-ray structures can also have errors, with far-reaching consequences

Solutions

Structure Validation and Standardization

Curation

Annotation

Structure filters

Incorrect valency, atom labels, aromatic bonds, stereochemistry, salts, duplication

Structure standardization guidelines

Provided by the FDA (Substance Registration System Unique Ingredient Identifier (UNII): http://www.fda.gov/ForIndustry/DataStandards/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII/default.htm)

Need a record of molecule provenance

Can we track databases and quality? - www.scidbs.com
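
A few of the structure filters listed above (valency, atom labels, charge balance, duplication) can be sketched in plain Python. This is a simplified illustration, not the tool used in the analysis. The record layout, the `key` field and the small valence table are assumptions made for the example. Stereochemistry, aromatic bonds and salt handling are not covered and would need a cheminformatics toolkit.

```python
from collections import Counter

# Deliberately small table of maximum valences for neutral atoms (illustrative only).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}


def check_record(record):
    """Return a list of issues for one hypothetical structure record.

    Each atom is (element, total bond order including hydrogens, formal charge).
    """
    issues = []
    for index, (element, bond_order_sum, charge) in enumerate(record["atoms"]):
        if element not in MAX_VALENCE:
            issues.append(f"atom {index}: unknown atom label {element!r}")
        elif charge == 0 and bond_order_sum > MAX_VALENCE[element]:
            # Only neutral atoms are checked; charged atoms are left to the charge test.
            issues.append(f"atom {index}: {element} exceeds valence {MAX_VALENCE[element]}")
    net_charge = sum(charge for _, _, charge in record["atoms"])
    if net_charge != 0:
        issues.append(f"net charge {net_charge:+d} is not balanced")
    return issues


def find_duplicates(records):
    """Return keys that occur more than once (assumes `key` is a canonical identifier)."""
    counts = Counter(record["key"] for record in records)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical records: one valid, one pentavalent carbon, one cation without
# a counterion, and one duplicate of the first.
records = [
    {"key": "methane", "atoms": [("C", 4, 0)]},
    {"key": "bad-carbon", "atoms": [("C", 5, 0)]},
    {"key": "ammonium", "atoms": [("N", 4, 1)]},
    {"key": "methane", "atoms": [("C", 4, 0)]},
]

for record in records:
    print(record["key"], check_record(record))
print("duplicates:", find_duplicates(records))
```

Even checks this crude flag a hypervalent carbon, an unbalanced charge and a repeated entry, which is the kind of quick screen the slide argues databases should pass before release.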

Scidbs.com Default Body

Scidbs.com

Default Body

DB logo

Type of DB

Contact

Owner

Website

License

Curation etc

Data should be:

Free from structure errors

Free from data errors

Free from experimental errors

Are we asking too much? Is it even possible??

Yet when we alert others:

When we raise our hands we are ignored

Our scientific community needs to wake up

Today NPC browser has fewer errors... so do ALL databases!

More people aware of molecule quality online. Trust is earned, not just granted!

Tomorrow

The future database user is more informed

Peer reviewers test the databases that are in manuscripts

NIH checks databases before release!

COLLABORATION between government DBs. PLEASE!!!

We need minimal compound database standards (MCDS)

Acknowledgement

We thank the paper reviewers and blog commenters for their constructive comments

Chris Lipinski

This work was unfunded

(but was the right thing to do!)

www.scidbs.com