ChemReader Testing and Improvement: Making Chemical Structural Data More Accessible
Bethany Harris
Project supervisor: Ye Li
ULA supervisor: Whitney Townsend
Overview
Project basics
Need for ChemReader tool
ChemReader team
Library involvement
Work process
Reflections
Problem
Information is trapped inside images
How can structural data be searched?
ChemReader is an automated tool to:
  Extract chemical structure diagrams from digital images
  Convert the graphics to searchable chemical file formats
Chemical structure recognition
Chemical structure defines properties
  Ex. reactivity, toxicity, flammability
Current practice
  Text mining (complementary) and manual indexing
  Problem of synonyms, chemical formulas, and arbitrary indexing systems
Structure
Chemical names: common names (brand and generic), IUPAC name
Chemical formula: Ex. C3H6O
Machine-readable codes: InChI and SMILES
  Ex. Oc1ccc(cc1Br)C2(O[S](=O)(=O)c3ccccc23)c4ccc(O)c(Br)c4
Arbitrary index number: CAS registry number
  Ex. 134523-00-5
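Although a CAS registry number is an arbitrary index, its last digit is a checksum, so a hand-transcribed number can at least be sanity-checked. The following is a minimal sketch of that check; the function name `cas_check_digit_ok` is illustrative and not part of ChemReader.

```python
def cas_check_digit_ok(cas: str) -> bool:
    """Return True if a CAS registry number has a valid check digit."""
    parts = cas.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    # Format: 2 to 7 digits, then 2 digits, then a single check digit.
    if not (2 <= len(parts[0]) <= 7 and len(parts[1]) == 2 and len(parts[2]) == 1):
        return False
    body = parts[0] + parts[1]
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == int(parts[2])


print(cas_check_digit_ok("134523-00-5"))  # True
```

A passing check only shows the number is well formed; it does not confirm that the number belongs to the intended compound.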
Image-based tools
Digital raster images in the literature (PDFs) are not understood by machines as structures
How to access all that material when you cannot search by structure?
ChemReader
Graphical representation -> machine-readable code
Ultimate goal:
  Automated recognition
  Annotate with related articles
Oc1ccc(cc1Br)C2(O[S](=O)(=O)c3ccccc23)c4ccc(O)c(Br)c4
ChemReader
[Workflow diagram. Labels: Scientific literature; Document Segmentation; Digital Images; Chemical Structure Diagram Extractor; Structure Recognizer; Molfile, SMILES, etc.; Chemical Database; End user; Query; Annotated Information; Related Information]
ChemReader Team
Dr. Gustavo Rosania - Pharmaceutical Sciences
Dr. Kazu Saitou - Mechanical Engineering
Jungkap Park - Mechanical Engineering
Ye Li - Shapiro Science Library
Caroline Yee, Sarah Hughes, Kelli Herm - Shapiro
Cristof Smith - Chemistry
Myself - Taubman Health Sciences Library
Library Involvement
Testing ChemReader accuracy
Human vs. computer
Build test database (articles)
Manual image & caption extraction
Manual molecule identification
Sample reference set
Searched SciFinder for “diabetes and small molecule” and limited to ‘journal articles’
Deduplication
Articles containing compounds only referred to once rationale
Full-text PDFs downloaded and inserted into database
346 articles
Extraction and Linking
Extraction process
Hand-linking of chemical structures to machine-readable codes
[Flow diagram: Mystery structure in test database -> Identify SciFinder structure -> Retrieve CAS registry number -> Link structure to machine-readable codes]
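The hand-linking steps above could be recorded roughly as follows. This is only a sketch under assumptions: the `StructureRecord` schema, its field names, and the file names are hypothetical, and the example values (acetone, CAS 67-64-1, SMILES `CC(=O)C`) are illustrative rather than taken from the test database.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class StructureRecord:
    """One hand-extracted structure image from the test database (hypothetical schema)."""
    article_id: str
    image_file: str
    caption: str = ""
    cas_number: Optional[str] = None  # filled in after the SciFinder lookup
    codes: Dict[str, str] = field(default_factory=dict)  # e.g. {"SMILES": "..."}

    def link(self, cas_number: str, **codes: str) -> None:
        """Attach the CAS registry number and machine-readable codes found by hand."""
        self.cas_number = cas_number
        self.codes.update(codes)

    @property
    def is_linked(self) -> bool:
        """True once both a CAS number and at least one code are attached."""
        return self.cas_number is not None and bool(self.codes)


# Illustrative values only: acetone, not a compound from the reference set.
record = StructureRecord(article_id="article-001", image_file="fig2_compound3.png")
record.link("67-64-1", SMILES="CC(=O)C")
print(record.is_linked)
```

Keeping the CAS number and the codes on the same record makes it easy to list which extracted images are still unidentified.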
Current and future work
Currently outperforming major competitors (Park 2009)
With library:
  Verification process
  Further test accuracy
Without library:
  Test hypotheses
  Compare with text-based and image-based tools
[Bar chart: % of Correct and Avg. similarity (axis 0.0% to 90.0%) for ChemReader, OSRA, and CLiDE]
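The two measures in the tool comparison, percent correct and average similarity, could be computed along these lines. The slides do not say how similarity was measured, so this sketch assumes a Tanimoto coefficient over sets of structural features; the function names and the integer feature sets are hypothetical.

```python
from typing import List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient: shared features divided by all distinct features."""
    if not a and not b:
        return 1.0  # two empty feature sets are treated as identical
    return len(a & b) / len(a | b)


def score(pairs: List[Tuple[Set[int], Set[int]]]) -> Tuple[float, float]:
    """Return (% correct, average similarity) over (reference, predicted) pairs."""
    if not pairs:
        raise ValueError("need at least one structure to score")
    # A prediction counts as correct only if it matches the reference exactly.
    correct = sum(1 for ref, pred in pairs if ref == pred)
    avg_sim = sum(tanimoto(ref, pred) for ref, pred in pairs) / len(pairs)
    return 100.0 * correct / len(pairs), avg_sim


# Hypothetical data: one exact match and one half-overlapping prediction.
pairs = [({1, 2, 3}, {1, 2, 3}), ({1, 2, 3, 4}, {1, 2})]
print(score(pairs))  # (50.0, 0.75)
```

Reporting both numbers separates structures recognized perfectly from those that are only partly right.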
Reflections
Surprises:
  Adapting to faculty’s changing priorities
  ULA project proposal changes
  Nature of the task
Bigger picture:
  Widening the landscape of library services
  Outreach to faculty and departments
Questions?
Slide 5:
Chen, Tracy, Kablaoui, Natasha, & Little, Jeremy. (2009). Identification of small-molecule inhibitors of the JIP–JNK interaction. Biochemical Journal, 420, 283–294.
Macauley, Matthew, & Vocadlo, David. (2010). Increasing O-GlcNAc levels: an overview of small-molecule inhibitors. Biochimica et Biophysica Acta, 1800, 107–121.
Slide 8:
Banville, Debra. (2009). Chemical information mining: facilitating literature-based discovery. Boca Raton: CRC Press.
Slides 10, 14, & 25:
Park, J, Rosania, GR, Shedden, KA, Mandee, N, & Saitou, K. (2009). Automated extraction of chemical structure information from digital raster images. Chemistry Central Journal, 3(4), 1-16.
Slides 23 & 24:
About.com chemistry. (2011, March 16). Retrieved from http://chemistry.about.com/od/factsstructures/ig/Chemical-Structures---X/Xylenol-Orange.htm