Geek Meets Science: ChemIDplus, an Example of Scientific Thinking Mitch Miller Scientific Thinking

Code camp 2014 Talk Scientific Thinking

Talk delivered at Vermont Code Camp 2014 by Mitch Miller

Page 1: Code camp 2014 Talk Scientific Thinking

Geek Meets Science: ChemIDplus, an Example of Scientific Thinking

Mitch Miller, Scientific Thinking

Page 2: Code camp 2014 Talk Scientific Thinking

Overview

➲Introduce myself
➲My definition of a scientific geek consultant
➲A quick overview of cheminformatics
➲Overview of the ChemIDplus project
➲The scientific geek's role in ChemIDplus

Page 3: Code camp 2014 Talk Scientific Thinking

Introduction: who am I?

➲Ph.D. chemist with 20+ years of experience in scientific information management
➲Currently an independent consultant
➲Application developer, database person, requirements analyst, application first-aid
➲Main areas of focus:
●Chemical structure database management
●Managing data from high-throughput research

Page 4: Code camp 2014 Talk Scientific Thinking

The perspective of the scientific-geek-consultant

➲Is the scientific-geek-consultant's perspective on technology different from other geeks'?
➲Learn new technologies/frameworks/paradigms and take them in stride
➲What gets me excited is seeing a user able to do something that the user could not do yesterday
➲This talk is about one project in scientific information management and what I've done to let users do things they could not do before

Page 5: Code camp 2014 Talk Scientific Thinking

Quick Introduction to Chemical Databases

Page 6: Code camp 2014 Talk Scientific Thinking

Representing Chemical Structures

➲This discussion is restricted to two-dimensional (2D) structures, which establish identity
➲Chemical structures can be represented graphically in a variety of ways

➲To make structures searchable, you need a mathematical representation of the atoms and bonds: a connection table
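To make the idea of a connection table concrete, here is a minimal sketch in Python. It is not the format used by any particular chemistry package; the `Bond` class and the `neighbors` helper are hypothetical names chosen for illustration. Atoms are a list of element symbols, and bonds are pairs of atom indices plus a bond order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bond:
    a: int      # index of the first atom in the atoms list
    b: int      # index of the second atom in the atoms list
    order: int  # 1 = single, 2 = double, 3 = triple


# Ethanol, CH3-CH2-OH, heavy atoms only (hydrogens are implied).
atoms = ["C", "C", "O"]
bonds = [Bond(0, 1, 1), Bond(1, 2, 1)]


def neighbors(atom_index, bonds):
    """Return the indices of the atoms bonded to the given atom."""
    result = []
    for bond in bonds:
        if bond.a == atom_index:
            result.append(bond.b)
        elif bond.b == atom_index:
            result.append(bond.a)
    return result


# The middle carbon is bonded to the other carbon and to the oxygen.
print(neighbors(1, bonds))
```

Once a structure is stored this way, a search becomes a graph-matching problem over atoms and bonds rather than a comparison of pictures.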

Page 7: Code camp 2014 Talk Scientific Thinking

Searching for structures

➲Search for matches based on a graphic chemical system

●Start with a chemical of interest
●Find others like it

➲Several definitions of what makes one structure like another

●Exact match: find the same molecule the user entered
●Substructure: find molecules that contain the query structure
●'Similarity': fuzzy match

➲Analogy: Word search for 'store'

Page 8: Code camp 2014 Talk Scientific Thinking

Substructure matches for Aspirin

➲Each of these structures contains the query structure

➲Word analogy results:
●store
●drugstore
●stores
●stored
●restore

Page 9: Code camp 2014 Talk Scientific Thinking

Non-matching structure

➲4-(acetyloxy)-benzoic acid is not a substructure match for aspirin because it does not contain the same arrangement of atoms and bonds

➲Non-hits for the word search analogy:
●story
●storm
●'stoor'

➲Can be found using similarity search
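The word analogy from the last few slides can be run directly. The sketch below uses the word list from the slides and treats substring containment as the stand-in for substructure search. For the 'similarity' case it uses `difflib.SequenceMatcher` from the Python standard library, which is only an analogy: real chemical similarity is computed on structures, not on text.

```python
from difflib import SequenceMatcher

words = ["store", "drugstore", "stores", "stored", "restore",
         "story", "storm", "stoor"]
query = "store"

# Exact match: the same word the user entered.
exact_hits = [w for w in words if w == query]

# "Substructure" analogy: the word contains the query.
substring_hits = [w for w in words if query in w]

# Words that the substring search misses.
non_hits = [w for w in words if query not in w]


def similarity(a, b):
    """Fuzzy text similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


print("substring hits:", substring_hits)
for word in non_hits:
    print(word, round(similarity(query, word), 2))
```

The non-hits still score well on the fuzzy comparison, which is the point of the slide: a similarity search can surface near misses that a substructure search rejects.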

Page 10: Code camp 2014 Talk Scientific Thinking

Structure search software

➲Standalone programs
●Ran on server or desktop
➲Client-server architectures
➲Database cartridges
●Provide chemical structure searching within a relational database
●Commercially available
●Add operators to store, search, retrieve and transform chemical structures within SQL
●e.g. SELECT ID, MOLDEPICTION(STRUCT) FROM OUR_STRUCTURE_TABLE WHERE SUBSTRUCT(STRUCT, 'CC(=O)Oc1ccccc1C(=O)O') = 1

●Client application must have a tool that can display connection tables as graphic chemical structures
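To show what "adding an operator to SQL" means in practice, here is a small sketch using Python's built-in `sqlite3` module rather than a commercial cartridge. It registers a function named SUBSTRUCT so that a query can call it, as in the example above. The matching logic (`toy_substruct`) is a deliberately naive assumption: it only checks whether the query text appears in the stored SMILES string, which is not real substructure matching.

```python
import sqlite3


def toy_substruct(structure, query):
    # Stand-in only: plain text containment on SMILES strings is NOT
    # chemical substructure matching, which compares atoms and bonds.
    return 1 if query in structure else 0


conn = sqlite3.connect(":memory:")

# Make the Python function callable from SQL under the name SUBSTRUCT.
conn.create_function("SUBSTRUCT", 2, toy_substruct)

conn.execute(
    "CREATE TABLE OUR_STRUCTURE_TABLE (ID INTEGER PRIMARY KEY, STRUCT TEXT)"
)
conn.executemany(
    "INSERT INTO OUR_STRUCTURE_TABLE (ID, STRUCT) VALUES (?, ?)",
    [
        (1, "CC(=O)Oc1ccccc1C(=O)O"),  # aspirin
        (2, "c1ccccc1"),               # benzene
    ],
)

# The SQL reads like the cartridge example: the operator sits in the WHERE clause.
rows = conn.execute(
    "SELECT ID FROM OUR_STRUCTURE_TABLE "
    "WHERE SUBSTRUCT(STRUCT, ?) = 1 ORDER BY ID",
    ("C(=O)O",),
).fetchall()
print(rows)
```

A real cartridge does the same thing at the database level, but with proper atom-and-bond matching and its own index type behind the operator.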

Page 11: Code camp 2014 Talk Scientific Thinking

Structure database operations

➲Data stored in tables
➲Data loading typically requires specialized software
➲Indexing is non-typical
➲Search operators are specific to the cartridge

Page 12: Code camp 2014 Talk Scientific Thinking

How can you search a million chemical structures in seconds?

➲Chemical databases contain hundreds of thousands or millions of structures
➲Comparing atoms and bonds takes time!
➲Users want answers quickly
➲Solution: a rapid screen-out step before looking at atoms and bonds
●Based on structure 'fingerprints'
●Analyze input structures for features such as rings, atoms, connection patterns (O-X-X-N)
●Create a bit string
●Compare the bit string of the query structure with the bit strings in the database
●Bit string comparisons are very fast
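The screen-out step above can be sketched in a few lines of Python. The feature list and the feature assignments per compound are simplified assumptions for illustration; real fingerprints use hundreds or thousands of bits generated from the connection table. The key test is that a candidate can only contain the query if every bit set in the query is also set in the candidate.

```python
# Hypothetical feature list; each feature owns one bit position.
FEATURES = ["benzene_ring", "carboxylic_acid", "ester", "amine", "halogen"]


def fingerprint(features_present):
    """Build an integer bit string with one bit per feature present."""
    bits = 0
    for position, feature in enumerate(FEATURES):
        if feature in features_present:
            bits |= 1 << position
    return bits


def passes_screen(query_fp, candidate_fp):
    """True if the candidate has every bit that the query has."""
    return (query_fp & candidate_fp) == query_fp


# Fingerprints are computed once, when structures are loaded.
database = {
    "aspirin": fingerprint({"benzene_ring", "carboxylic_acid", "ester"}),
    "benzoic acid": fingerprint({"benzene_ring", "carboxylic_acid"}),
    "aniline": fingerprint({"benzene_ring", "amine"}),
}

query_fp = fingerprint({"benzene_ring", "carboxylic_acid"})
candidates = [name for name, fp in database.items()
              if passes_screen(query_fp, fp)]
print(candidates)
```

Passing the screen does not prove a match; it only means the candidate cannot be ruled out. The survivors still go through the slower atom-by-atom comparison, but that list is usually a small fraction of the database.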

Page 13: Code camp 2014 Talk Scientific Thinking

The ChemIDplus project

Page 14: Code camp 2014 Talk Scientific Thinking

ChemIDplus

➲“Dictionary of over 400,000 chemicals (names, synonyms, and structures) … (with) links to NLM and other databases and resources”
➲Maintained by the Division of Specialized Information Services within the National Library of Medicine, part of the National Institutes of Health
➲Used by people in industry, academia and government who handle drugs and chemicals and access environmental and safety data plus other biomedical information

Page 15: Code camp 2014 Talk Scientific Thinking

ChemIDplus

➲Part of a system of databases called 'Toxnet' at the National Library of Medicine: http://toxnet.nlm.nih.gov/
➲Focus:
●Chemical Information
●Environmental Health and Toxicology
●HIV / AIDS
●Disaster Information
➲Available on the web in 3 'flavors':
●Full: http://chem.sis.nlm.nih.gov/chemidplus/
●'Lite': http://chem.sis.nlm.nih.gov/chemidplus/chemidlite.jsp
●Ultralite: http://druginfo.nlm.nih.gov/drugportal/drugportal.jsp

Page 16: Code camp 2014 Talk Scientific Thinking

ChemIDplus Team

➲George (Mike) Hazard – team leader
➲Shannon Jordan
➲Michael Chambers – developer
➲Chuchu Lan – system administrator/DBA
➲Jenny Fang
➲Stefanie Publicker
➲Larry Callahan, Frank Switzer – FDA liaisons

Page 17: Code camp 2014 Talk Scientific Thinking

Historical Note

➲ChemIDplus was one of the first structure-searchable databases on the World Wide Web
➲Started in 1998
➲Original developer

Page 18: Code camp 2014 Talk Scientific Thinking

Server Architecture

➲Tomcat Server
●Servlets, JSPs, JS libraries
➲Database Server
●Database (Oracle): Structures, Names, Links, Properties
●Chemical Data Cartridge

Page 19: Code camp 2014 Talk Scientific Thinking

And now, the demo...

Page 20: Code camp 2014 Talk Scientific Thinking

Scientific Geek's role in ChemIDplus

➲Developer of the original system in 1998-1999 in a since-retired technology
➲Database administrator for structures
●Upgrade between versions of the chemical search software
●Periodic reindexing of the structures for performance
●Batch updates
●Help clean up invalid data
➲Tester
●Performed load testing when the application was migrated to Java servlets
➲Liaison with other governmental agencies
●Share structures with NCI, PubChem
➲Structure orientation application
●Tool to help ensure that series of chemical compounds look similar

Page 21: Code camp 2014 Talk Scientific Thinking

Structure table synchronization: The old way

➲Monthly manual process
●Queried structures recently added or changed
●Extracted them to disk files
●Generated data based on structure: InChI, SMILES, 3D coordinates
●Registered each item separately
➲Took a couple of hours each month
➲This was repetitious work

Page 22: Code camp 2014 Talk Scientific Thinking

New system

➲Database trigger detects a change when a value is inserted into or updated in a chemical structure field
➲Computes and stores InChI and SMILES immediately
➲Submits a batch job (DBMS_JOB package) for 3D
●Deletes old 3D structure
●Writes 2D structure to disk
●Invokes Corina (Molecular Networks) program to generate 3D structure
●Reads 3D structure into separate table
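The trigger-plus-batch-job pattern can be sketched with Python's built-in `sqlite3` module. This is not the Oracle implementation: the table names, trigger names, and the `run_pending_jobs` helper are assumptions for illustration, and a plain job table stands in for the DBMS_JOB package. The callback passed as `generate_3d` takes the place of the external 3D generation step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE structures (id INTEGER PRIMARY KEY, molfile TEXT);
CREATE TABLE jobs_3d (structure_id INTEGER, status TEXT DEFAULT 'pending');

-- Queue a 3D job whenever a structure is inserted.
CREATE TRIGGER structure_inserted AFTER INSERT ON structures
BEGIN
    INSERT INTO jobs_3d (structure_id) VALUES (NEW.id);
END;

-- Queue a 3D job whenever the structure field is updated.
CREATE TRIGGER structure_updated AFTER UPDATE OF molfile ON structures
BEGIN
    INSERT INTO jobs_3d (structure_id) VALUES (NEW.id);
END;
""")

conn.execute("INSERT INTO structures (id, molfile) VALUES (1, 'original 2D structure')")
conn.execute("UPDATE structures SET molfile = 'corrected 2D structure' WHERE id = 1")


def run_pending_jobs(conn, generate_3d):
    """Process queued jobs; generate_3d stands in for the external program."""
    pending = conn.execute(
        "SELECT rowid, structure_id FROM jobs_3d WHERE status = 'pending'"
    ).fetchall()
    for job_rowid, structure_id in pending:
        generate_3d(structure_id)
        conn.execute(
            "UPDATE jobs_3d SET status = 'done' WHERE rowid = ?", (job_rowid,)
        )
    return len(pending)


processed = []
count = run_pending_jobs(conn, processed.append)
print(count, processed)
```

Keeping the slow 3D step out of the trigger itself is the design point: the insert or update returns quickly, and the expensive work happens later in the background.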

Page 23: Code camp 2014 Talk Scientific Thinking

Orienting Structures Consistently

➲Databases often contain 'families' of related compounds
➲Example molecule and hits

➲Manually manipulating structures takes time!

Page 24: Code camp 2014 Talk Scientific Thinking

Solution: 'StructClean' Utility

➲Accepts a template structure + molecular weight
●Locates all molecules in the DB that contain the template and fall under the molecular weight cutoff
●Without the cutoff, you might get huge molecules that contain a small template
➲All hits are oriented to match the template
➲User reviews hits
●Selects/deselects items
●Commits changes

➲Utility is a Java servlet
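The hit-selection step can be illustrated with a short Python sketch. It does not reproduce the servlet or the orientation logic. The `Molecule` class and `select_hits` function are hypothetical, and the `contains_template` flag is assumed to come from a substructure search against the template (here, a benzene ring). The molecular weights are approximate.

```python
from dataclasses import dataclass


@dataclass
class Molecule:
    name: str
    mol_weight: float
    contains_template: bool  # assumed result of a substructure search


def select_hits(molecules, max_mol_weight):
    """Keep molecules that contain the template and are under the cutoff."""
    return [m for m in molecules
            if m.contains_template and m.mol_weight <= max_mol_weight]


molecules = [
    Molecule("aspirin", 180.16, True),
    Molecule("salicylic acid", 138.12, True),
    Molecule("vancomycin", 1449.3, True),   # contains the ring, but is huge
    Molecule("cyclohexane", 84.16, False),  # no benzene ring
]

hits = select_hits(molecules, max_mol_weight=500.0)
print([m.name for m in hits])
```

The cutoff keeps the review list focused on the family of small related compounds instead of every large molecule that happens to include the template.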

Page 25: Code camp 2014 Talk Scientific Thinking

Conclusion

➲ChemIDplus is a valuable resource to those looking for chemical information on the web
➲Scientific-geek-consultants use a variety of technologies to provide service to research clients
➲We are similar to regular geeks in many ways
➲The differences are interesting!