Transcript
Page 1: Code camp 2014 Talk Scientific Thinking

Geek Meets Science: ChemIDplus, an Example of Scientific Thinking

Mitch Miller
Scientific Thinking

Page 2: Code camp 2014 Talk Scientific Thinking

Overview

➲Introduce myself
➲My definition of a scientific geek consultant
➲A fast overview of cheminformatics
➲Overview of the ChemIDplus project
➲The scientific geek's role in ChemIDplus

Page 3: Code camp 2014 Talk Scientific Thinking

Introduction: who am I?

➲Ph.D. chemist with 20+ years of experience in scientific information management
➲Currently independent consultant
➲Application developer, database person, requirements analyst, application first-aid
➲Main areas of focus:
  ●Chemical structure database management
  ●Managing data from high-throughput research

Page 4: Code camp 2014 Talk Scientific Thinking

The perspective of the scientific-geek-consultant

➲Is the scientific-geek-consultant's perspective on technology different from other geeks'?
➲Learn new technologies/frameworks/paradigms and take them in stride
➲What gets me excited is seeing a user able to do something that the user could not do yesterday
➲This talk is about one project in scientific information management and what I've done to give users access to what they could not do before

Page 5: Code camp 2014 Talk Scientific Thinking

Quick Introduction to Chemical Databases

Page 6: Code camp 2014 Talk Scientific Thinking

Representing Chemical Structures

➲This discussion is restricted to 2-dimensional (2D) structures, which establish identity
➲Chemical structures can be represented graphically in a variety of ways
➲To make structures searchable, you need a mathematical representation of the atoms and bonds: a connection table
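
As a rough illustration of what a connection table holds, here is a minimal Python sketch. The `Bond` class, the `neighbors` helper, and the choice of acetic acid as the example are assumptions made for illustration; real connection table formats carry more detail (charges, stereochemistry, coordinates).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bond:
    a: int      # index of the first atom in the atom list
    b: int      # index of the second atom in the atom list
    order: int  # 1 = single, 2 = double, 3 = triple


# Acetic acid, CH3-C(=O)-OH, heavy atoms only (hydrogens are implied).
atoms = ["C", "C", "O", "O"]
bonds = [Bond(0, 1, 1), Bond(1, 2, 2), Bond(1, 3, 1)]


def neighbors(atom_index, bonds):
    """Return the sorted indices of atoms bonded to the given atom."""
    result = []
    for bond in bonds:
        if bond.a == atom_index:
            result.append(bond.b)
        elif bond.b == atom_index:
            result.append(bond.a)
    return sorted(result)


print(neighbors(1, bonds))  # [0, 2, 3]
```

The atom list plus the bond list is the whole "mathematical representation": a graph that search software can walk, regardless of how the structure is drawn.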

Page 7: Code camp 2014 Talk Scientific Thinking

Searching for structures

➲Search for matches based on a graphic chemical system
  ●Start with a chemical of interest
  ●Find others like it
➲Several definitions of what makes one structure like another
  ●Exact match: find the same molecule the user entered
  ●Substructure
  ●'Similarity': fuzzy match
➲Analogy: Word search for 'store'

Page 8: Code camp 2014 Talk Scientific Thinking

Substructure matches for Aspirin

➲Each of these structures contains the query structure
➲Word analogy results:
  ●store
  ●drugstore
  ●stores
  ●stored
  ●restore

Page 9: Code camp 2014 Talk Scientific Thinking

Non-matching structure

➲4-(acetyloxy)-benzoic acid is not a substructure match for aspirin because it does not contain the same arrangement of atoms and bonds
➲Non-hits for Word search analogy:
  ●story
  ●storm
  ●'stoor'
➲Can be found using similarity search
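
The word-search analogy for 'store' can be played out in a few lines of Python. This is only a sketch of the analogy, not of chemical searching: the function names and the 0.75 threshold are assumptions, and `difflib.SequenceMatcher` stands in for a chemical similarity measure.

```python
from difflib import SequenceMatcher

WORDS = ["store", "drugstore", "stores", "stored", "restore",
         "story", "storm", "stoor"]


def substring_hits(query, words):
    """Analogue of a substructure search: the word must contain the query."""
    return [w for w in words if query in w.lower()]


def similar_words(query, words, threshold=0.75):
    """Analogue of a similarity search: a fuzzy score must reach the threshold."""
    return [w for w in words
            if SequenceMatcher(None, query, w.lower()).ratio() >= threshold]


print(substring_hits("store", WORDS))
print(similar_words("store", WORDS))
```

The substring search returns only the words that contain 'store' intact, while the fuzzy search also picks up near misses such as 'story', 'storm' and 'stoor'.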

Page 10: Code camp 2014 Talk Scientific Thinking

Structure search software

➲Standalone programs
  ●Ran on server or desktop
➲Client-server architectures
➲Database cartridges
  ●Provide chemical structure searching within a relational database
  ●Commercially available
  ●Add operators to store, search, retrieve and transform chemical structures within SQL
  ●e.g. SELECT ID, MOLDEPICTION(STRUCT) FROM OUR_STRUCTURE_TABLE WHERE SUBSTRUCT(STRUCT, 'CC(=O)Oc1ccccc1C(=O)O') = 1
  ●Client application must have a tool that can display connection tables as graphic chemical structures

Page 11: Code camp 2014 Talk Scientific Thinking

Structure database operations

➲Data stored in tables
➲Data loading typically requires specialized software
➲Indexing is non-typical
➲Search operators are specific to the cartridge

Page 12: Code camp 2014 Talk Scientific Thinking

How can you search a million chemical structures in seconds?

➲Chemical databases contain hundreds of thousands or millions of structures
➲Comparing atoms and bonds takes time!
➲Users want answers quickly
➲Solution: rapid screen-out step before looking at atoms and bonds
  ●Based on structure 'fingerprints'
  ●Analyze input structures for features such as rings, atoms, connection patterns (O-X-X-N)
  ●Create a bit string
  ●Compare bit string of query structure with bit strings in database
  ●Bit string comparisons are very fast
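
The fingerprint screen can be sketched in Python as follows. The feature list, the function names, and the tiny three-entry database are all hypothetical; real cartridges use much longer fingerprints and detect the features from the connection table, which is assumed here to have happened already.

```python
# Hypothetical feature dictionary: one bit position per structural feature.
FEATURES = ["ring", "N", "O", "S", "C=O", "O-X-X-N"]


def fingerprint(features_present):
    """Build a bit string (stored as an int) with one bit per feature found."""
    bits = 0
    for position, feature in enumerate(FEATURES):
        if feature in features_present:
            bits |= 1 << position
    return bits


def passes_screen(query_fp, candidate_fp):
    """A candidate can only contain the query if it has every query bit set."""
    return (query_fp & candidate_fp) == query_fp


def screen(query_fp, database):
    """Return the IDs that survive the screen, in database order."""
    return [mol_id for mol_id, fp in database.items()
            if passes_screen(query_fp, fp)]


database = {
    "A": fingerprint({"ring", "O", "C=O"}),
    "B": fingerprint({"N", "O"}),
    "C": fingerprint({"ring", "O", "C=O", "N"}),
}
query = fingerprint({"ring", "C=O"})
print(screen(query, database))  # ['A', 'C']
```

Entry B is rejected with a single AND and compare. Only the survivors (A and C) would go on to the slower atom-by-atom comparison, since a matching fingerprint does not by itself prove a substructure match.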

Page 13: Code camp 2014 Talk Scientific Thinking

The ChemIDplus project

Page 14: Code camp 2014 Talk Scientific Thinking

ChemIDplus

➲“Dictionary of over 400,000 chemicals (names, synonyms, and structures) … (with) links to NLM and other databases and resources”
➲Maintained by the Division of Specialized Information Services within the National Library of Medicine within the National Institutes of Health
➲Used by people in industry, academia and government who handle drugs and chemicals and access environmental and safety data plus other biomedical information

Page 15: Code camp 2014 Talk Scientific Thinking

ChemIDplus

➲Part of a system of databases called 'Toxnet' at the National Library of Medicine: http://toxnet.nlm.nih.gov/
➲Focus:
  ●Chemical Information
  ●Environmental Health and Toxicology
  ●HIV / AIDS
  ●Disaster Information
➲Available on the web in 3 'flavors':
  ●Full: http://chem.sis.nlm.nih.gov/chemidplus/
  ●'Lite': http://chem.sis.nlm.nih.gov/chemidplus/chemidlite.jsp
  ●Ultralite: http://druginfo.nlm.nih.gov/drugportal/drugportal.jsp

Page 16: Code camp 2014 Talk Scientific Thinking

ChemIDplus Team

➲George (Mike) Hazard - team leader
➲Shannon Jordan
➲Michael Chambers - developer
➲Chuchu Lan - system administrator/DBA
➲Jenny Fang
➲Stefanie Publicker
➲Larry Callahan, Frank Switzer - FDA liaisons

Page 17: Code camp 2014 Talk Scientific Thinking

Historical Note

➲ChemIDplus was one of the first structure-searchable databases on the World Wide Web
➲Started in 1998
➲Original developer

Page 18: Code camp 2014 Talk Scientific Thinking

Server Architecture

[Architecture diagram; labels recovered from the slide:]

➲Tomcat Server
  ●Servlets, JSPs, JS libraries
➲Database Server
  ●Database (Oracle): Structures, Names, Links, Properties
  ●Chemical Data Cartridge

Page 19: Code camp 2014 Talk Scientific Thinking

And now, the demo...

Page 20: Code camp 2014 Talk Scientific Thinking

Scientific Geek's role in ChemIDplus

➲Developer of the original system in 1998-9 in a since-retired technology
➲Database administrator for structures
  ●Upgrade between versions of the chemical search software
  ●Periodic reindexing of the structures for performance
  ●Batch updates
  ●Help clean up invalid data
➲Tester
  ●Performed load testing when the application was migrated to Java servlets
➲Liaison with other governmental agencies
  ●Share structures with NCI, PubChem
➲Structure orientation application
  ●Tool to help ensure that series of chemical compounds look similar

Page 21: Code camp 2014 Talk Scientific Thinking

Structure table synchronization: the old way

➲Monthly manual process
  ●Queried structures recently added or changed
  ●Extracted them to disk files
  ●Generated data based on structure: InChI, SMILES, 3D coordinates
  ●Registered each item separately
➲Took a couple of hours each month
➲This was repetitious work

Page 22: Code camp 2014 Talk Scientific Thinking

New system

➲Database trigger detects a change when a value is inserted into or updated in a chemical structure field
➲Computes and stores InChI and SMILES immediately
➲Submits a batch job (DBMS_JOB package) for 3D
  ●Deletes old 3D structure
  ●Writes 2D structure to disk
  ●Invokes Corina (Molecular Networks) program to generate 3D structure
  ●Reads 3D structure into separate table
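
The shape of this workflow (cheap identifiers computed at once, the slow 3D step deferred to a job) can be sketched in Python. This is not the PL/SQL trigger or the DBMS_JOB code: the `StructureStore` class and its methods are hypothetical, and the two generator callables are placeholders for the InChI/SMILES computation and for the Corina run.

```python
from collections import deque


class StructureStore:
    """In-memory stand-in for the structure tables (hypothetical)."""

    def __init__(self, make_identifiers, make_3d):
        self.structures = {}    # id -> 2D structure
        self.identifiers = {}   # id -> identifiers derived from the structure
        self.coords_3d = {}     # id -> 3D structure (the separate table)
        self.pending = deque()  # queued 3D jobs
        self._make_identifiers = make_identifiers
        self._make_3d = make_3d

    def upsert(self, struct_id, structure_2d):
        """Plays the role of the trigger on insert or update."""
        self.structures[struct_id] = structure_2d
        # Fast step: done immediately, as part of the change itself.
        self.identifiers[struct_id] = self._make_identifiers(structure_2d)
        # Slow step: only queued here, run later by the batch job.
        self.pending.append(struct_id)

    def run_pending_jobs(self):
        """Plays the role of the batch job that regenerates 3D structures."""
        while self.pending:
            struct_id = self.pending.popleft()
            self.coords_3d.pop(struct_id, None)  # delete the old 3D structure
            self.coords_3d[struct_id] = self._make_3d(self.structures[struct_id])


# Placeholder generators; the real ones would call chemistry software.
store = StructureStore(lambda s: "ident:" + s, lambda s: "3d:" + s)
store.upsert(1, "CCO")
print(store.identifiers[1])  # ident:CCO
print(1 in store.coords_3d)  # False
store.run_pending_jobs()
print(store.coords_3d[1])    # 3d:CCO
```

Splitting the work this way keeps the insert or update fast for the person editing a structure, while the expensive 3D generation catches up in the background.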

Page 23: Code camp 2014 Talk Scientific Thinking

Orienting Structures Consistently

➲Databases often contain 'families' of related compounds
➲Example molecule and hits
➲Manually manipulating structures takes time!

Page 24: Code camp 2014 Talk Scientific Thinking

Solution: 'StructClean' Utility

➲Accepts a template structure + molecular weight cutoff
  ●Locates all molecules in the DB that contain the template under the molecular weight cutoff
  ●Without the cutoff, you might get huge molecules that contain a small template
➲All hits are oriented to match the template
➲User reviews hits
  ●Selects/deselects items
  ●Commits changes
➲Utility is a Java servlet
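
The selection step (template match plus molecular weight cutoff) might look like the Python sketch below. It is not the StructClean servlet code: the `Molecule` class, `find_candidates`, and the sample records are hypothetical, and a plain text containment test stands in for a real substructure match, which would be done by the chemical cartridge. Whether the real cutoff is inclusive is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Molecule:
    mol_id: str
    mol_weight: float
    structure: str  # stand-in text; a real system stores a connection table


def find_candidates(molecules, template, max_weight, contains_template):
    """Return molecules that contain the template and are at or under the cutoff."""
    return [m for m in molecules
            if m.mol_weight <= max_weight
            and contains_template(m.structure, template)]


def text_contains(structure, template):
    """Placeholder for a substructure test (NOT real chemistry)."""
    return template in structure


# Made-up records for illustration only.
molecules = [
    Molecule("small-hit", 180.2, "ring-ester-acid"),
    Molecule("huge-hit", 2450.0, "ring-ester-acid-polymer-chain"),
    Molecule("no-hit", 150.1, "ring-amine"),
]

hits = find_candidates(molecules, "ester-acid", 500.0, text_contains)
print([m.mol_id for m in hits])  # ['small-hit']
```

The very large molecule contains the template but is dropped by the cutoff, so the reviewer only sees compounds that plausibly belong to the same family as the template.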

Page 25: Code camp 2014 Talk Scientific Thinking

Conclusion

➲ChemIDplus is a valuable resource to those looking for chemical information on the web
➲Scientific-geek-consultants use a variety of technologies to provide service to research clients
➲We are similar to regular geeks in many ways
➲The differences are interesting!