Importance of data standards for large scale data integration in chemistry

Antony Williams, Valery Tkachenko, Alexey Pshenichnov, Ken Karapetyan, Stuart Chalk, Daniel Lowe and Carlos Cobas

ACS Denver, March 2015

Free and Easy

• To make it easy to “take notes” these slides will be available at:

www.slideshare.net/AntonyWilliams/

Charles Holland Duell

• 1898-1901: US Commissioner of Patents

• "Everything that can be invented has been invented."

Antony John Williams (et al)

• “We don’t need more standards!”

• “Of COURSE we can build a spectral database!”

• “The standards we have are good enough”

A Pragmatic View to Progress

• Let’s consider progressing an NMR Spectral database for the community!

• MUST HAVES– spectra (1D/2D), associated structures, assignments

• WANTS – predict NMR spectra, spectral searching, privacy/embargos

• What would we need in terms of standards?

• Molfiles and JCAMP

Standards without adoption..

Standards

2D NMR

Progress in standards

Standards without adoption are limited in value

• If the instrument vendors don’t support or adopt the standards success is limited

• YESTERDAY discussion about publishing NMR – JCAMP

• But what is already available will work – Jeol, Bruker, Thermo, Anasazi, Agilent/Varian - imperfect but useful

www.ChemSpider.com

9400 Spectra and growing

http://www.chemspider.com/spectra.aspx

JCAMP NMR Spectra

Data on ChemSpider

JCAMP file downloads

• When NMR spectra are stored as JCAMP then downloads into offline packages are feasible – MestreLabs, ACD/Labs etc

• Open Data – download versus view

• Store spectra locally and reuse

• Java is increasingly a pain!

• Need to move to HTML5 viewing on ChemSpider, especially for Mobile Viewing

Challenges with Spectra

• JCAMP is good for a lot of spectral data – IR, Raman, 1D NMR

• MS data is rarely made available in JCAMP

• We would love a ratified JCAMP 6.0 for 2D data exchange – allows third parties to build support for download

• ASSIGNED JCAMP spectra supported
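Part of the reason third parties can build JCAMP support is that a JCAMP-DX file is plain text made of "##LABEL=value" records. The block below is a minimal sketch of reading those header records, not a full reader: the function name parse_jcamp_headers and the toy sample file are assumptions made for illustration, and compressed data forms, multi-line values and multi-block files are ignored.

```python
def parse_jcamp_headers(text):
    """Collect ##LABEL=value records from JCAMP-DX text into a dict.

    Sketch only: labels are simply upper-cased, and data lines that do
    not start with ## are skipped.
    """
    headers = {}
    for raw in text.splitlines():
        line = raw.split("$$", 1)[0].strip()  # "$$" starts a comment
        if not line.startswith("##") or "=" not in line:
            continue
        label, value = line[2:].split("=", 1)
        headers[label.strip().upper()] = value.strip()
    return headers


# Toy file written for this example (not real instrument output).
sample = """##TITLE=ethanol 1H
##JCAMP-DX=5.01
##DATA TYPE=NMR SPECTRUM
##.OBSERVE NUCLEUS=^1H
##XUNITS=HZ
##NPOINTS=4 $$ tiny toy example
##XYDATA=(X++(Y..Y))
0 1 2 3 4
##END=
"""

headers = parse_jcamp_headers(sample)
print(headers["DATA TYPE"], headers[".OBSERVE NUCLEUS"])
```

A reader like this is enough to index spectra by nucleus or data type before handing the file to a full viewer.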

Proper Verification

03/25/15, Advanced Chemistry Development, Inc. (ACD/Labs)

Jmol - JSpecView

ChemDoodle Components

Spectral Display in the hand

New Repository Architecture

doi: 10.1007/s10822-014-9784-5

Compounds

Reactions

Analytical data

Deposition of Data

1,000,000 Spectra Online?

ESI – Text Spectra

Developing Proof-of-Concept

• Extract from 1976-2014 USPTO applications

Nucleus    Count
H          975543
C          56536
unknown*   44306
F          9429
P          3241
B          91
Si         62
Sn         22
Se         11
N          8

*unknown – starts off with "NMR:" followed by a peak list (no nucleus)

We want to find text spectra?

• We can find and index text spectra:

13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)

• What would be better are spectral figures – and include assignments where possible!

MestreLabs Mnova NMR

1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)

13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
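Text spectra like the two above follow a loose but recognisable pattern, which is what makes them findable. The block below is a minimal sketch of pulling the nucleus, solvent, frequency and shifts out of such a string. It is not the tooling used for the work described in these slides; the names HEADER, strip_parens and parse_text_spectrum are assumptions, and the header pattern only covers the "13C NMR (CDCl3, 100 MHz):" style shown here.

```python
import re

# Matches headers of the form "13C NMR (CDCl3, 100 MHz):" (assumed layout).
HEADER = re.compile(r"(\d+)([A-Z][a-z]?) NMR \(([^,]+), ([\d.]+) MHz\):")


def strip_parens(text):
    """Drop everything inside (possibly nested) parentheses."""
    out, depth = [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def parse_text_spectrum(text):
    """Return nucleus, solvent, frequency and shifts, or None if no header."""
    m = HEADER.search(text)
    if m is None:
        return None
    # Assignments, multiplicities and J values sit in parentheses, so
    # removing them leaves only the chemical shifts.
    body = strip_parens(text[m.end():])
    shifts = [float(x) for x in re.findall(r"\d+\.\d+", body)]
    return {
        "nucleus": m.group(1) + m.group(2),
        "solvent": m.group(3),
        "mhz": float(m.group(4)),
        "shifts_ppm": shifts,
    }


carbon = parse_text_spectrum(
    "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), "
    "66.12 (CH2), 117.72, 118.19 (ArCH)"
)
proton = parse_text_spectrum(
    "1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), "
    "4.24 (d, 1H, J = 4.8 Hz, C(11b)H)"
)
print(carbon["nucleus"], carbon["shifts_ppm"])
print(proton["nucleus"], proton["shifts_ppm"])
```

A range such as 7.18–7.94 would come back as two separate shifts in this sketch, so multiplet ranges need extra handling.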

ESI Data also contains figures

Publications & “Real Spectra”

• We are turning text into spectra

• We are turning figures into spectra
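One simple way to picture "text into spectra" is to place a line shape at each reported shift. The block below is a minimal sketch under stated assumptions: unit-height Lorentzian lines, a fixed half-width of 0.02 ppm, no multiplet structure and no intensities. The function name lorentzian_spectrum and all parameter values are illustrative, and this is not the method used by the software shown in these slides.

```python
def lorentzian_spectrum(shifts_ppm, ppm_min, ppm_max, n_points=1000, half_width=0.02):
    """Return (xs, ys): a synthetic spectrum with one Lorentzian per shift.

    Each line has height 1 at its centre; half_width is the half-width at
    half-maximum in ppm (an assumed value, not taken from real data).
    """
    step = (ppm_max - ppm_min) / (n_points - 1)
    xs = [ppm_min + i * step for i in range(n_points)]
    ys = [
        sum(half_width ** 2 / ((x - s) ** 2 + half_width ** 2) for s in shifts_ppm)
        for x in xs
    ]
    return xs, ys


# Two made-up peaks at 1.0 and 3.0 ppm on a 0-4 ppm axis.
xs, ys = lorentzian_spectrum([1.0, 3.0], 0.0, 4.0, n_points=401)
print(len(xs), round(max(ys), 2))
```

The resulting x/y arrays could then be written out in a spectral exchange format rather than left as text.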

Early Test Experiments

Input: 74 supplementary data documents, 3444 pages

Output: Plot2Txt extracted content from 1069 pages, 1151 spectra total; >80% of peaks extracted to within 1-2 decimal places (ppm)
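A figure such as ">80% of peaks extracted" implies comparing extracted peak positions with reference values. The block below is a minimal sketch of one way such a fraction could be computed. The matching rule (each reference peak claims its nearest unused extracted peak within a tolerance), the 0.05 ppm tolerance and the sample numbers are all assumptions, and this is not how the Plot2Txt test was actually scored.

```python
def fraction_matched(reference_ppm, extracted_ppm, tol=0.05):
    """Fraction of reference peaks with an extracted peak within tol ppm.

    Each extracted peak can be used at most once.
    """
    if not reference_ppm:
        return 0.0
    unused = sorted(extracted_ppm)
    hits = 0
    for ref in sorted(reference_ppm):
        best = min(unused, key=lambda x: abs(x - ref), default=None)
        if best is not None and abs(best - ref) <= tol:
            hits += 1
            unused.remove(best)
    return hits / len(reference_ppm)


# Made-up numbers: three of the five reference peaks are recovered.
score = fraction_matched([1.0, 2.0, 3.0, 4.0, 5.0], [1.01, 2.02, 2.98, 4.5])
print(score)
```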

“Where is the real data please?”

FIGURE

DATA

Manual Curation Layer

• ALL SPECTRA WILL BE STORED AS JCAMP

• ChemSpider has had a manual curation layer for >8 years

• Users can annotate data on ChemSpider

• We do receive useful feedback from the community on the data and are optimistic!

Extraction is the WRONG WAY

• We should NOT have to mine data out – it should already be in digital form!

• Structures should be submitted “correctly”

• Spectra should be in digital spectral formats, not images

• ESI should be RICH and interactive

• Data should be open and available, with metadata and provenance

We can solve for Authors here

Will it be used though??? YES!

Supplementary Info Data now..

Data mining – it’s MINE!!!

What should we be doing?

• Settle on a short-term format – JCAMP-JMOL?

But there ARE solutions!

What should we be doing?

• Settle on a short-term format – JCAMP-JMOL?

• Convince the instrument vendors to export in this format

• Push button depositions into “containers” – ChemSpider, NMRShiftDB, Institutional Repositories

• Encourage format support in software (read and write) – Mestre, ACD/Labs, Bruker TopSpin, etc.

NMRShiftDB anyone?

Standards in Large Scale Data Integration

• ALL of these are imperfect standards

• Molfiles

• SDF

• InChI

• JCAMP

• But what can be done with them?
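As one small answer to "what can be done with them", an SDF file is just a series of molfiles with data items, each record closed by a "$$$$" line, so a few lines of code can already index a deposition. The block below is a minimal sketch for V2000 records only. The function names split_sdf and parse_record and the two-atom sample are assumptions; atom symbols are read by whitespace splitting for brevity, whereas real files use fixed-width columns, and property lines and V3000 are not handled.

```python
import re


def split_sdf(text):
    """Split SDF text into lists of lines, one list per record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":  # record terminator
            records.append(current)
            current = []
        else:
            current.append(line)
    return records


def parse_record(lines):
    """Read name, atom symbols, bond count and data items from one record."""
    counts = lines[3]  # 4th line: counts line, fixed-width fields
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = [ln.split()[3] for ln in lines[4:4 + n_atoms]]
    data, i = {}, 4 + n_atoms + n_bonds
    while i < len(lines) - 1:
        tag = re.match(r">\s*<([^>]+)>", lines[i])  # e.g. "> <NAME>"
        if tag:
            data[tag.group(1)] = lines[i + 1]
            i += 2
        else:
            i += 1
    return {"name": lines[0].strip(), "atoms": atoms, "n_bonds": n_bonds, "data": data}


# Hand-written two-atom record (heavy atoms of methanol), for illustration only.
sample = (
    "methanol\n"
    "  sketch\n"
    "\n"
    "  2  1  0  0  0  0  0  0  0  0999 V2000\n"
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "    1.4300    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "  1  2  1  0\n"
    "M  END\n"
    "> <NAME>\n"
    "methanol\n"
    "\n"
    "$$$$\n"
)

parsed = [parse_record(r) for r in split_sdf(sample)]
print(parsed[0]["name"], parsed[0]["atoms"], parsed[0]["data"])
```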

Compound Data

• The standards of chemical structure handling are primarily molfile, SDfile, SMILES, InChI

• We primarily depend on molfiles and SDF files for data deposition and interchange

• We use InChI a lot – especially for integrated searching across the web
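Searching across the web by structure usually relies on the InChIKey, the fixed-length hashed form of an InChI, because it behaves like an ordinary search term. The block below is a minimal sketch of checking that a string has the standard InChIKey layout and of merging records from different sources on that key. The names INCHIKEY_RE, is_inchikey and merge_by_inchikey and the source labels are assumptions; generating an InChIKey from a structure requires the InChI software and is not shown.

```python
import re

# InChIKey layout: 14 letters, hyphen, 8 letters + S/N flag + version
# letter, hyphen, 1 protonation letter (27 characters in total).
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{8}[SN][A-Z]-[A-Z]$")


def is_inchikey(text):
    """True if text has the InChIKey layout (format check only)."""
    return bool(INCHIKEY_RE.match(text.strip()))


def merge_by_inchikey(records):
    """Group (source, key) pairs by key, skipping malformed keys."""
    merged = {}
    for source, key in records:
        key = key.strip()
        if is_inchikey(key):
            merged.setdefault(key, []).append(source)
    return merged


# Hypothetical sources; the key is the InChIKey of aspirin.
records = [
    ("source_a", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
    ("source_b", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
    ("source_c", "not-a-key"),
]
merged = merge_by_inchikey(records)
print(merged)
```

The format check catches truncated or mistyped keys but cannot confirm that a key corresponds to the intended structure.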

Searching the Entire Web?

Searching Internet by Structure

Compound Data

• There ARE data interchange problems associated with structures….

USE and TEACH Standards

• Too few people are aware of the existing standards and their capabilities

• Part of the CINF mission activities should be to teach standards and this is being done

• Still too few people have heard of InChI and JCAMP for example

• Still too little is known about the importance of correct structure representations – kudos to people like Leah et al who TEACH THIS!

USE and TEACH Standards!

CVSP: Validate and Standardize

CVSP Rules Sets

CVSP Filtering of DrugBank

Compounds

Reactions

Use Ontologies

Contribute to PUBLIC Ontologies

• Yes there are “company” ontologies – but for the good of the community contribute to public ontologies and standards

• For data interchange and meshing this is soooooo beneficial!

ChAMP – Stuart Chalk

Use standards in APIs, endpoints and widgets

Semanticize content: RDF

Actions

• Support and encourage new standards

• In the meantime, reawaken and modernize the JCAMP standard

• Show up and listen to Bob Hanson today

• Encourage scientists to provide data

Charles Holland Duell in 1902

“…all previous advances in the various lines of invention will appear totally insignificant when compared with those which the present century will witness.

I almost wish that I might live my life over again to see the wonders which are at the threshold”

“Git-r-Done”

Acknowledgments

• Daniel Lowe – NextMove, Reactions and Spectra

• Bill Brouwer – Plot2Txt Development

• Carlos Cobas and Stan Sykora – MestreLabs

• The ChemSpider team – led by Richard Kidd

• The RSC Data Repository team

Thank you

Email: williamsa@rsc.org

ORCID: 0000-0002-2668-4821

Twitter: @ChemConnector

Personal Blog: www.chemconnector.com

SLIDES: www.slideshare.net/AntonyWilliams
