ACS: Towards a Gold Standard Database


Towards a Gold Standard: Improving the Quality of Public Domain Chemistry Databases

Antony J. Williams (1), Sean Ekins (2)

(1) Royal Society of Chemistry, Wake Forest, NC 27587; (2) Collaborations in Chemistry, Fuquay Varina, NC 27526.

The future: crowdsourced drug discovery

Williams et al., Drug Discovery World, Winter 2009

Safety data

Toxicity data

Blogs and Wikis

Property databases

Experimental results

Scientific publications

Compound aggregators

Open Notebook Science

Metabolic pathway databases

Encyclopedic articles (Wikipedia)

Chemistry structures are proliferating on the web

Users take them at face value

They SHOULD NOT!!!

Immense quantities of scientific information are contained in thousands of databases

Progress can, however, be inhibited by errors in these databases, which have downstream effects when the data are reused.

http://bit.ly/zWGaps

What is the Structure of Vitamin K1?

What Mechanisms Do We Have to Alert the Community?

Email database owner and hope for a response

Blog it

Tony has been blogging about database quality for years and nobody was listening, other than the people at PubChem

For some databases, when he blogged, they listened and edited their records!

Tweet it

Dec 2010: We felt something had to be said definitively about structure quality

Publish it: we wrote to Science, Nature and then PLoS Computational Biology

http://bit.ly/qtJF2f

Perhaps the phone?

April 27, 2011: Then came the NPC Browser

Science Translational Medicine 2011

But wait, hold on – did anyone peer review the database??

Database released and within days...

A quick analysis of structure quality revealed...

Hundreds of errors found in structures

Williams and Ekins, DDT, 16: 747-750 (2011)

NPC Browser http://tripod.nih.gov/npc/

Neomycin in NPC Browser http://tripod.nih.gov/npc/

Neomycin In ChemSpider

How many contribute to clean-up?

Fewer than a dozen contributors to data

The majority are project members

The crowd is small…

This is the same for all cheminformatics crowd-based efforts

What Mechanisms Do we Have to Alert the Community – Publishing is too slow

Williams and Ekins, DDT, 16: 747-750 (2011)

Tony blogged on April 28th, 1 day after release: http://bit.ly/jn8wLC

I blogged on April 29th, suggesting the need for a gold standard database: http://bit.ly/lXHInG

After more extensive analysis we sent a manuscript to Science Translational Medicine: rejected

Drug Discovery Today: accepted, 8 months after we first pointed out the issue, which was even before the NPC Browser release

Responses from Community and NCGC

Comments on initial blog

NCGC added a disclaimer which I blogged about May 23rd

http://bit.ly/m4Tx2b

Sept 8th 2011: Email from Tudor Oprea (cc'ed to 60 others)

He has also been pointing out database errors for years

Followed by one from Chris Austin offering to meet us

Several individuals thanked us for the alert

Towards a Gold Standard: Regarding Quality in Public Domain Chemistry Databases and Approaches to Improving the Situation. Antony J. Williams, Sean Ekins and Valery Tkachenko, Drug Discovery Today, In Press 2012

More Extensive Analysis and solutions

More analysis of NPC browser errors

“analysis of the NPC browser ‘HTS amenable compounds’ subset of data for 7600 compounds identified fundamental errors in stereochemistry, valency issues and charge imbalances in a few minutes work using a rudimentary software tool”

Analysis of other chemistry databases and errors

Other types of databases and errors

Offered solutions

Substructure     # of Hits   # of Correct Hits   No stereochemistry   Incomplete stereochemistry   Complete but incorrect stereochemistry
Gonane           34          5                   8                    21                           0
Gon-4-ene        55          12                  3                    33                           7
Gon-1,4-diene    60          17                  10                   23                           10


Data Errors in the NPC Browser: Analysis of Steroids
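The counts in the steroid table can be turned into error rates with a few lines of arithmetic. The sketch below is only an illustration: the names `STEROID_COUNTS` and `summarize` are assumptions introduced here, and the numbers are copied from the table as presented.

```python
# Counts copied from the NPC Browser steroid table, in column order:
# (hits, correct, no stereo, incomplete stereo, complete but incorrect stereo)
STEROID_COUNTS = {
    "Gonane": (34, 5, 8, 21, 0),
    "Gon-4-ene": (55, 12, 3, 33, 7),
    "Gon-1,4-diene": (60, 17, 10, 23, 10),
}


def summarize(counts):
    """Return {substructure: fraction of hits that are correct}."""
    summary = {}
    for name, (hits, correct, none, incomplete, wrong) in counts.items():
        # Each hit falls into exactly one category, so the row must add up.
        if correct + none + incomplete + wrong != hits:
            raise ValueError(f"row for {name} does not add up")
        summary[name] = correct / hits
    return summary


if __name__ == "__main__":
    for name, fraction in summarize(STEROID_COUNTS).items():
        print(f"{name}: {fraction:.1%} of hits correct")
```

With these counts, fewer than a third of the hits in each steroid class are correct.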

Why this matters to us and to YOU, the CROWD?

What You Might Not Know About Chemistry Databases On The Internet

Data-sharing between open databases is cyclic

This can proliferate errors in the “Linked Data”

Public Domain Databases
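Cyclic data-sharing can be made visible by treating "deposits into" relationships as a directed graph and looking for a loop. The following is a minimal sketch under stated assumptions: the function name `find_cycle` and the database names `DB_A`, `DB_B`, `DB_C` are hypothetical placeholders, not real resources or a description of how any database exchanges data.

```python
def find_cycle(deposits):
    """Return one cycle as a list of database names, or None if none exists.

    `deposits` maps each database name to the databases it deposits data into.
    """
    unvisited, in_progress, done = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = in_progress
        path.append(node)
        for target in deposits.get(node, ()):
            target_state = state.get(target, unvisited)
            if target_state == in_progress:
                # We are back at a database already on the current path.
                return path[path.index(target):] + [target]
            if target_state == unvisited:
                found = visit(target)
                if found:
                    return found
        path.pop()
        state[node] = done
        return None

    for start in deposits:
        if state.get(start, unvisited) == unvisited:
            found = visit(start)
            if found:
                return found
    return None


if __name__ == "__main__":
    # Hypothetical sharing arrangement, for illustration only.
    sharing = {"DB_A": ["DB_B"], "DB_B": ["DB_C"], "DB_C": ["DB_A"]}
    print(find_cycle(sharing))
```

An error entering any database on such a loop can return to its origin looking like independent confirmation, which is why the original source becomes hard to determine.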

Our databases are a mess…

Non-curated databases are proliferating errors

We source and deposit data between databases

Original sources of errors hard to determine

Curation is time-consuming and challenging

Molecule Data Quality Impacts

in silico drug discovery

vast ligand and protein–protein interaction databases

develop computational models

global mapping of pharmacological space

drug-target networks of approved drugs

prediction of off-target effects

Different types of databases and errors

Bayer paper on target validation: 2/3 of papers did not live up to claims

Errors in the MDL Drug Data Report (MDDR)

Errors in clinical research databases vary from 2.3% to 26.9%

Multicenter analysis by MS-based proteomics identified generic problems in databases when characterizing proteins: search engines could not distinguish different identifiers, and many algorithms calculated molecular weight incorrectly

One database had between 2.1% and 13.6% of annotated Pfam hits unjustified

Ligand–protein X-ray structures can also have errors with far-reaching consequences

Solutions

Structure Validation and Standardization

Curation

Annotation

Structure filters

Incorrect valency, atom labels, aromatic bonds, stereochemistry, salts, duplication
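Two of the filters listed above, valency and charge balance, can be sketched in a few lines. This is a toy illustration, not the tool used in the analysis: the `Atom` class, the `ALLOWED_VALENCE` table and the `find_problems` function are all assumptions made for this example, and a real filter would rely on a cheminformatics toolkit's valence model rather than a hand-written table.

```python
from dataclasses import dataclass

# Allowed total bond orders per (element, formal charge).
# Deliberately tiny and illustrative; NOT a complete valence model.
ALLOWED_VALENCE = {
    ("H", 0): {1},
    ("C", 0): {4},
    ("N", 0): {3},
    ("N", 1): {4},
    ("O", 0): {2},
    ("O", -1): {1},
    ("Cl", 0): {1},
    ("Cl", -1): {0},
    ("Na", 1): {0},
}


@dataclass(frozen=True)
class Atom:
    element: str
    charge: int
    bond_order_sum: int  # includes bonds to implicit hydrogens


def find_problems(atoms, expected_net_charge=0):
    """Return human-readable descriptions of valence and charge problems."""
    problems = []
    for index, atom in enumerate(atoms):
        allowed = ALLOWED_VALENCE.get((atom.element, atom.charge))
        if allowed is None:
            problems.append(
                f"atom {index}: no valence rule for {atom.element} "
                f"with charge {atom.charge:+d}"
            )
        elif atom.bond_order_sum not in allowed:
            problems.append(
                f"atom {index}: unusual valence {atom.bond_order_sum} "
                f"for {atom.element}"
            )
    net_charge = sum(atom.charge for atom in atoms)
    if net_charge != expected_net_charge:
        problems.append(
            f"charge imbalance: net charge {net_charge:+d}, "
            f"expected {expected_net_charge:+d}"
        )
    return problems


if __name__ == "__main__":
    # A carbon drawn with five bonds is flagged as a valence problem.
    print(find_problems([Atom("C", 0, 5)]))
```

Even checks this crude flag a five-bonded carbon or a cation with no counterion; stereochemistry errors need a comparison against a trusted reference structure.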

Structure standardization guidelines

Provided by the FDA Substance Registration System, Unique Ingredient Identifier (UNII): http://www.fda.gov/ForIndustry/DataStandards/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII/default.htm

Need a record of molecule provenance

Can we track databases and their quality? www.scidbs.com


Scidbs.com


DB logo

Type of DB

Contact

Owner

Website

License

Curation etc
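The fields listed above suggest a simple per-database record. The sketch below is one hypothetical way to hold them and to report which are still blank; the class name `DatabaseRecord`, its attribute names and the `missing_fields` method are assumptions for illustration and do not describe the actual Scidbs.com schema.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DatabaseRecord:
    """Hypothetical metadata record for one chemistry database."""
    name: str
    logo_url: Optional[str] = None
    db_type: Optional[str] = None
    contact: Optional[str] = None
    owner: Optional[str] = None
    website: Optional[str] = None
    license: Optional[str] = None
    curation: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """List the fields that have not been filled in yet."""
        return [
            f.name for f in fields(self)
            if getattr(self, f.name) in (None, "")
        ]


if __name__ == "__main__":
    # Placeholder values, not a real database entry.
    record = DatabaseRecord(
        name="ExampleDB",
        website="https://example.org",
        license="CC0",
    )
    print(record.missing_fields())
```

Recording which fields are missing, especially license and curation, gives users a first rough indication of how much is known about a database's provenance.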

Data should be: Free from structure errors

Free from data errors

Free from experimental errors

Are we asking too much? Is it even possible??

When we raise our hands we are ignored

Our scientific community needs to wake up

Yet when we alert others:

Today the NPC Browser has fewer errors... so do ALL databases!

More people are aware of molecule quality online. Trust is earned, not just granted!

The future database user is more informed

Peer reviewers test the databases that are in manuscripts

NIH checks databases before release!

COLLABORATION between government DBs. PLEASE!!!

We need minimal compound database standards (MCDS)

Tomorrow

Acknowledgement

We thank the paper reviewers and blog commenters for their constructive comments

Chris Lipinski

This work was unfunded (but was the right thing to do!)

www.scidbs.com
