Hosting public domain chemicals data online for the community – the challenges of handling materials
Antony Williams
NIST Diffusion/CALPHAD Data Informatics and Tools Workshop
May 14th, 2015
ORCID ID: 0000-0002-2668-4821
Many challenges are the same
• The challenges I will discuss for publishers, public domain databases and curated chemistry are the same…
• Need capable tools to handle the data
• Need standards for data exchange
• Meshing data without review is dangerous!
• Quality costs: time, effort and money
• Algorithms can help clean data
Where is chemistry online?
• Encyclopedic articles (Wikipedia)
• Chemical vendor databases
• Metabolic pathway databases
• Property databases
• Patents with chemical structures
• Drug Discovery data
• Scientific publications
• Compound aggregators
• Blogs/Wikis and Open Notebook Science
Chemistry on the Internet…
• Most searching for chemistry on the internet…
• Name searching Google/Bing/Yahoo
• Name searching Wikipedia
• Name searching Wolfram Alpha
• Name, name, name, name…searching
Why CAS Numbers are not great
• There is no free service…like DOIs
• The resolver is a “Google Search”
• Maybe we need another “identifier”?
• And thanks to IUPAC/NIST….
InChI
• SINGLE code base managed by IUPAC – integrated into drawing packages and used by MANY databases. No variability as with SMILES
Vendor-dependent SMILES
ACD/Labs: CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O
OpenEye: CC1=C(C(=O)c2ccccc2C1=O)C/C=C(\C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C
ChEMBL: CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC\C(=C\CC1=C(C)C(=O)c2ccccc2C1=O)\C
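The three vendor strings above differ as text, so plain string equality cannot match them as one compound. The following is a minimal, standard-library-only sketch of that point. The `heavy_atom_counts` helper is hypothetical and deliberately crude: it only counts atoms and is not a SMILES parser or canonicalizer. It shows that the strings are all distinct yet agree on heavy-atom composition; a real comparison would canonicalize with a cheminformatics toolkit or compare InChIs.

```python
import re
from collections import Counter

# The three vendor SMILES quoted on the slide, copied verbatim.
SMILES = {
    "ACD/Labs": r"CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O",
    "OpenEye": r"CC1=C(C(=O)c2ccccc2C1=O)C/C=C(\C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C",
    "ChEMBL": r"CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC\C(=C\CC1=C(C)C(=O)c2ccccc2C1=O)\C",
}

# Bracket atoms first, then two-letter halogens, then one-letter
# organic-subset atoms (aliphatic upper case, aromatic lower case).
ATOM = re.compile(r"\[([^\]]+)\]|(Cl|Br)|([BCNOPSFI])|([bcnops])")


def heavy_atom_counts(smiles):
    """Crude (hypothetical) heavy-atom counter; not a full SMILES parser."""
    counts = Counter()
    for bracket, halogen, upper, aromatic in ATOM.findall(smiles):
        if bracket:
            # Element symbol follows an optional isotope number, e.g. C in [C@@H].
            symbol = re.match(r"\d*([A-Z][a-z]?|[a-z])", bracket).group(1)
        else:
            symbol = halogen or upper or aromatic
        if symbol != "H":
            counts[symbol.capitalize()] += 1
    return counts


distinct_strings = len(set(SMILES.values()))
compositions = {vendor: heavy_atom_counts(s) for vendor, s in SMILES.items()}

print("distinct strings:", distinct_strings)
for vendor, counts in compositions.items():
    print(vendor, dict(counts))
```

Agreeing atom counts are necessary but not sufficient for identity (isomers share them), which is why a single canonical identifier is attractive.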
InChI
• InChI Strings can be reversed to structures – same problem as with SMILES – no layout
• Adopted by the community (databases, blogs, Wikipedia) – good for searching the internet
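An InChI is a layered text string: a version, a formula, then layers separated by "/" and identified by a one-letter prefix (for example "c" for connectivity and "h" for hydrogens). As a small illustration, the sketch below splits an InChI into those parts using only the standard library. The `split_inchi` helper is hypothetical, and the ethanol InChI is a standard example that does not appear in the slides. It does no chemistry and does not validate the string.

```python
def split_inchi(inchi):
    """Split an InChI into (version, formula, [(layer_prefix, layer_body), ...])."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string: %r" % inchi)
    version, *parts = inchi[len(prefix):].split("/")
    formula = ""
    # The formula layer has no lower-case prefix letter; other layers do.
    if parts and parts[0] and not parts[0][0].islower():
        formula = parts.pop(0)
    # Keep layers as ordered pairs, since a prefix can occur more than once.
    layers = [(p[0], p[1:]) for p in parts if p]
    return version, formula, layers


ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
version, formula, layers = split_inchi(ethanol)
print(version, formula, layers)
```

Because the string is plain text with a fixed prefix, it can be pasted into any search engine, which is what makes it useful for finding structures across the internet.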
InChIs for small molecules…
• InChIs are good for “small molecules”
• Read here: http://www.jcheminf.com/series/InChI
ChemSpider strengths
• Serves over 40,000 unique users per day
• Advanced searching of >34 million chemicals
Data Quality/Standardization
• MANY structures meant to be something online are MISREPRESENTED.
• Commonly you will have better success finding information by name searches than structure – with many caveats of course…
• Validating chemical structure representations is laborious work – and it’s shocking to review data…
Can we MAKE Quality Data?
• Systems for everyone to validate and standardize their data would be useful
• Would improve structure data in publications, databases etc. and make searching across resources better
• Collaboration to establish community rules would be good!
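One example of a rule that can be checked automatically is the check digit built into CAS Registry Numbers. The sketch below is an illustration only, with a hypothetical `is_valid_cas` helper; it is not a description of any ChemSpider validation pipeline. It confirms that a registry number is well formed, not that it belongs to the compound it is attached to.

```python
import re

# Two to seven digits, two digits, one check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(rn):
    """Return True if rn has CAS Registry Number format and a correct check digit."""
    match = CAS_PATTERN.match(rn.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight digits 1, 2, 3, ... from right to left; the sum mod 10 is the check digit.
    total = sum(i * int(d) for i, d in enumerate(reversed(digits), start=1))
    return total % 10 == check


for rn in ["7732-18-5", "58-08-2", "7732-18-4", "not-a-cas"]:
    print(rn, is_valid_cas(rn))
```

A check like this catches transcription errors cheaply, but linking the number to the right structure still needs curation.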
ChemSpider limitations
• Supports “small molecules” only: without an InChI, a compound cannot be registered
• SO MUCH of chemistry is “materials”
• Severe limitation in chemistry coverage:
• Monomers but no polymers
• Inorganic and organometallic handling
• Ambiguous structures – “Markush”
• Nanomaterials
• Minerals
• Bound to beads, surfaces etc.
ORGANICS vs. Materials
• Comment – you don’t know all of the challenges until you start to work in the area!
• We, and cheminformatics companies, have solved MANY, but not all of the issues regarding organic chemistry management
• The majority of our approaches do not map to materials
• No standard ways to represent compounds
• No InChI for materials
Questions to consider…
• Organics are hard enough!
• What are your best dictionaries of materials?
• We have chemical ontologies. Status for materials?
• Is open annotation of your databases possible?
• What standards do you have for materials data exchange?
Known Challenges
• Many materials are non-stoichiometric
• How to represent composite materials (e.g. supported catalysts)?
• Methods to distinguish novelty in materials (equivalent to diversity in organic structures)?
• Lots of challenges ahead… a curated “community dictionary” would be of value…
Mapped DICTIONARIES…
• Structure IDs
• Systematic name(s)
• Trivial Name(s)
• SMILES
• InChI Strings
• InChIKeys
• Database IDs
• Registry Number
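As a rough sketch of what one entry in such a mapped dictionary could look like, the record below carries the fields listed above, and a small index lets any of them resolve to the same entry. All class and function names, the structure ID "S-0001" and the "EXAMPLE-DB" identifier are hypothetical; ethanol is used only as a familiar example and is not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class SubstanceRecord:
    structure_id: str
    systematic_names: list
    trivial_names: list
    smiles: str
    inchi: str
    inchikey: str
    database_ids: dict = field(default_factory=dict)
    registry_number: str = ""


def build_index(records):
    """Map every identifier of every record to that record."""
    index = {}
    for rec in records:
        # Structure strings and IDs are matched exactly (SMILES is case-sensitive).
        exact = [rec.structure_id, rec.smiles, rec.inchi, rec.inchikey,
                 rec.registry_number, *rec.database_ids.values()]
        # Names are matched case-insensitively.
        names = [n.casefold() for n in rec.systematic_names + rec.trivial_names]
        for key in exact + names:
            if key:
                # First record wins; a collision is a case for manual curation.
                index.setdefault(key, rec)
    return index


def lookup(index, query):
    return index.get(query) or index.get(query.casefold())


ethanol = SubstanceRecord(
    structure_id="S-0001",                      # hypothetical internal ID
    systematic_names=["ethanol"],
    trivial_names=["ethyl alcohol"],
    smiles="CCO",
    inchi="InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    inchikey="LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
    database_ids={"EXAMPLE-DB": "X-0001"},      # hypothetical database ID
    registry_number="64-17-5",
)
index = build_index([ethanol])
print(lookup(index, "Ethyl alcohol").structure_id)
```

For materials, the structure fields (SMILES, InChI) may simply be empty, so names and registry numbers carry more of the load; that is one reason a curated community dictionary matters.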
Thank you
Email: [email protected]
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams