Delivering Curated Chemistry to the World via Crowdsourced Deposition and Annotation on ChemSpider

Antony Williams

University of Chicago, January 27th 2012

The World of Online Chemistry

• Property databases
• Compound aggregators
• Screening assay results
• Scientific publications
• Encyclopedic articles (Wikipedia)
• Metabolic pathway databases
• ADME/Tox data – eTOX for example
• Blogs/Wikis and Open Notebook Science
• Contributing Open Source code to projects

We Have… Too Much Data!!!

e-Science and Primary Data

How much data generated in a lab, that COULD go public, is lost forever?

TotallySynthetic.com

Public Domain reference databases of value?

• Syntheses
• Properties
• Spectra
• CIFs
• Images

PubChem

ChEMBL

Collaborative Knowledge Management

Public Domain reference databases of value?

• Syntheses
• Properties
• Spectra
• CIFs
• Images

Much of chemistry is chemical structure-based – where and how could we host these data?

RSC’s ChemSpider

Available Information…

Linked to vendors, safety data, toxicity, metabolism

Available Information…

Crowdsourced “Annotations”

Users can add:

• Descriptions/Syntheses/Commentaries
• Links to PubMed articles
• Links to articles via DOIs
• Spectral data
• Crystallographic Information Files
• Photos
• MP3 files
• Videos

Spectra

Data on the Web

Chemistry Data online is messy

We have inherited errors

All public compound databases, including ours, have errors:

• “Incorrect” structures – assertions, timelines etc.
• “Incorrect” names associated with structures
• Properties
• Links
• Publications

ENORMOUS CHALLENGE

The Structure of Vitamin K?

A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K

The Structure of Vitamin K1?

What is the Structure of Vitamin K1?

CAS’s Common Chemistry

Wikipedia

ChEBI – Manual Curation

“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”

Variants of systematic names on PubChem

2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E)-3,7,11,15-tetramethyl 2-methyl-3-(3,7,11,15-tetramethyl 2-methyl-3-[(E)-3,7,11,15-tetramethyl

Question Everything online: www.dhmo.org

It’s all on Wikipedia…

Chemistry on The Internet Is Messy

It’s Methane…

What’s Methane?

What ELSE is Methane???

NLM’s DailyMed

PHYSPROP Database

The freely downloadable database under the EPI Suite prediction software

Very Basic filters suggest data quality issues

The Stereochemistry challenge

12500 chemicals with “missed” stereo
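One way a "very basic filter" for missed stereo could look is sketched below. It is an illustration only, not the filter used for the PHYSPROP analysis: it flags records whose name carries a stereodescriptor such as (E) or (7R,11R) while the InChI has no stereo layer (/b, /t, /m, /s). The record layout, the function names and the sample lactic acid records are assumptions made for this sketch.

```python
import re

# Parenthesised stereodescriptors such as (E), (S) or (E,7R,11R).
STEREO_IN_NAME = re.compile(r"\(\d*[RSEZ](?:,\d*[RSEZ])*\)")

# InChI layer prefixes that carry stereochemistry:
# b = double bond, t = tetrahedral, m and s = stereo type flags.
STEREO_LAYER_PREFIXES = ("b", "t", "m", "s")


def name_claims_stereo(name: str) -> bool:
    """True if the chemical name contains a stereodescriptor."""
    return STEREO_IN_NAME.search(name) is not None


def inchi_has_stereo(inchi: str) -> bool:
    """True if any layer after the formula layer is a stereo layer."""
    layers = inchi.split("/")[2:]  # skip the "InChI=1S" prefix and the formula
    return any(layer.startswith(STEREO_LAYER_PREFIXES) for layer in layers)


def flag_missed_stereo(records):
    """Return records whose name implies stereo that the structure lacks."""
    return [
        r for r in records
        if name_claims_stereo(r["name"]) and not inchi_has_stereo(r["inchi"])
    ]


# Hypothetical sample records (ids and layout are made up for illustration).
FLAT = "InChI=1S/C3H6O3/c1-2(4)3(5)6/h2,4H,1H3,(H,5,6)"
SAMPLE = [
    {"id": 1, "name": "(S)-lactic acid", "inchi": FLAT + "/t2-/m0/s1"},
    {"id": 2, "name": "(S)-lactic acid", "inchi": FLAT},
    {"id": 3, "name": "lactic acid", "inchi": FLAT},
]

if __name__ == "__main__":
    for record in flag_missed_stereo(SAMPLE):
        print("missed stereo:", record["id"], record["name"])
```

A name-based check like this will miss stereo expressed in other ways (for example trivial names or cis/trans prefixes), so it can only suggest candidates for a curator to review.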

With Great Fanfare…

NPC Browser http://tripod.nih.gov/npc/

Openness and Quality Issues

Williams and Ekins, DDT, 16: 747-750 (2011)

Science Translational Medicine 2011

Public Domain Databases

Our databases are a mess…

Non-curated databases are proliferating errors

We source and deposit data between databases

Original sources of errors hard to determine

Curation is time-consuming and challenging

Stop Whining – Fix it

Crowdsourced Curation

Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate

Search “Vitamin H”

“Curate” Identifiers

Standards: Structure Standardization

What needs to happen?

Standards

• Standardization of structures
• ChEBI/PubChem sharing
• InChI adoption

The InChI Identifier

Multiple Layers

InChI Strings Hash to InChIKeys

Vancomycin – Search the Internet

Vancomycin

Search Molecular SKELETON

Search Full Molecule

Full Skeleton Search: 104 Hits

Full Molecule Search: 4 Hits
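The difference between the skeleton and full-molecule searches can be sketched with InChIKeys. A standard InChIKey has three hyphen-separated blocks of 14, 10 and 1 characters; the first block hashes the connectivity (the skeleton), and the second covers the remaining layers such as stereochemistry. The sketch below is an assumption about how such a search might be expressed, not ChemSpider's implementation, and the keys in `KEYS` are made-up placeholders rather than real compounds.

```python
def split_inchikey(key: str):
    """Split an InChIKey into (skeleton block, second block, protonation flag)."""
    parts = key.strip().upper().split("-")
    if len(parts) != 3 or [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return parts[0], parts[1], parts[2]


def skeleton_search(query: str, keys):
    """Match on the first block only: same connectivity, any stereo."""
    skeleton = split_inchikey(query)[0]
    return [k for k in keys if split_inchikey(k)[0] == skeleton]


def full_molecule_search(query: str, keys):
    """Match on the whole key: same connectivity and same stereo."""
    wanted = split_inchikey(query)
    return [k for k in keys if split_inchikey(k) == wanted]


# Placeholder keys (NOT real compounds): three share a skeleton block.
KEYS = [
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "AAAAAAAAAAAAAA-CCCCCCCCSA-N",
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "ZZZZZZZZZZZZZZ-UHFFFAOYSA-N",
]

if __name__ == "__main__":
    query = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
    print("skeleton hits:", len(skeleton_search(query, KEYS)))
    print("full-molecule hits:", len(full_molecule_search(query, KEYS)))
```

A large gap between the two hit counts, as in the Vancomycin example, is a hint that many records share the skeleton but disagree on stereochemistry.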

Crowdsourcing Works

>130 people have deposited data and participated in data curation

Curators at different levels check each other’s work

More curators and depositors are encouraged!

What needs to happen?

Standards

• Standardization of structures
• ChEBI/PubChem sharing
• InChI adoption

Collaboration

• Stop reinventing the wheel
• Share data, share efforts and speed the process

Antony Williams vs Identifiers

Passport ID

Dad, Tony, others

Green Card

License

5 email addresses

ChemSpiderman (blog, Twitter account, Facebook, Friendfeed)

OpenID…

Aspirin names and synonyms

• Text searches depend on correct association

• 335 suggested identifiers for Aspirin just on PubChem!

• Disambiguation dictionaries are necessary, not just for authors!

The Final Search Strategy

All Those Names, One Structure

Ambiguity in Identifiers

Curated Dictionaries Matter

Success Depends on Dictionaries

Validated Name-Structure Dictionaries

Chemical name dictionaries are used for:

• Text-mining (publications, patents): used to index PubMed and link to Google Patents
• Linking to other databases (think Biology!): when structures are not available, drug names link
• Searching the web: names link to structures, which link to InChIs
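A name-structure dictionary of this kind can be sketched as a mapping from normalized names to structure identifiers, with ambiguous names (one name, several structures) surfaced for curation. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, the aspirin key is the standard InChIKey for aspirin, and the two "vitamin k" keys are made-up placeholders.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so trivial variants match."""
    return " ".join(name.lower().split())


class NameStructureDictionary:
    """Maps chemical names to the InChIKeys they have been asserted for."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, name: str, inchikey: str) -> None:
        self._index[normalize(name)].add(inchikey)

    def resolve(self, name: str):
        """Return all structures recorded for a name (empty list if none)."""
        return sorted(self._index.get(normalize(name), set()))

    def ambiguous_names(self):
        """Names pointing at more than one structure: candidates for curation."""
        return sorted(n for n, keys in self._index.items() if len(keys) > 1)


ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"


def build_example() -> NameStructureDictionary:
    d = NameStructureDictionary()
    for synonym in ("Aspirin", "Acetylsalicylic acid", "2-Acetoxybenzoic acid"):
        d.add(synonym, ASPIRIN)
    # Placeholder keys (NOT real compounds) to show an ambiguous name.
    d.add("Vitamin K", "AAAAAAAAAAAAAA-BBBBBBBBSA-N")
    d.add("Vitamin K", "CCCCCCCCCCCCCC-UHFFFAOYSA-N")
    return d


if __name__ == "__main__":
    d = build_example()
    print(d.resolve("  acetylsalicylic   ACID "))
    print(d.ambiguous_names())
```

Text-mining and web search then depend on the quality of the entries: an incorrect name-structure pair in the dictionary propagates to every article or patent indexed with it.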

I want to know about “Vincristine”

If all algorithms work then everything on the page is correct by default except the name-structure relationship!

Vincristine: Identifiers and Properties

Vincristine: Vendors and Sources, Linked by Structure

Vincristine: Patents, Linked by Name

Vincristine: Articles, Linked by Name

Challenges of Complex Molecules: Yohimbine

Originally 15 compounds “called” Yohimbine

54 Skeletons for Yohimbine

Pharma Information Tombs

• Internal and external content
• Built to meet primary use-case
• Tailored indexes and GUIs
• Internal unique language & metadata
• Poor interoperability/integration
• PowerPoint, Documents, Excel
• Many suppliers of systems and content in a single workflow

Literature, Patents, News, Pipeline, SAR, CSRs, Safety, In vivo, etc.

What could create change?

Harvard Business Review (2010)

“One change would make a substantial difference [to drug R&D]: the creation of agreed-upon standards for digitally representing drug assets.”

It is so difficult to navigate…

• What’s the structure?
• Are they in our file?
• What’s similar?
• What’s the target?
• Pharmacology data?
• Known Pathways?
• Working On Now?
• Connections to disease?
• Expressed in right cell type?
• Competitors?
• IP?

Open PHACTS Project

• Develop a set of robust standards…
• Implement the standards in a semantic integration hub
• Deliver services to support drug discovery programs in pharma and public domain
• 22 partners, 8 pharmaceutical companies, 3 biotechs
• 36-month project

Guiding principle is open access, open usage, open source: key to standards adoption

ChemSpider Resources for Chemistry

Internet Data

The Future

• Commercial Software
• Pre-competitive Data
• Open Science
• Open Data
• Publishers
• Educators
• Open Databases
• Chemical Vendors

• Small organic molecules
• Undefined materials
• Organometallics
• Nanomaterials
• Polymers
• Minerals
• Particle bound
• Links to Biologicals

The Future of Chemistry on the Web?

• Public compound databases federate & build a linked environment of validated data!
• Data validation needs are not ignored
• Publishers layer on information to make publications discoverable
• Public-Private databases can be linked
• Open Data proliferate
• The “Semantic Web” in action

Acknowledgments

The ChemSpider team

Our data providers, depositors, collaborators and curators

Software providers – OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel)

Sean Ekins @collabchem

Thank you

Email: williamsa@rsc.org
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
