ChemReader Testing and Improvement: Making Chemical Structural Data More Accessible
Bethany Harris
Project supervisor: Ye Li
ULA supervisor: Whitney Townsend

ChemReader chemical informatics tool



Overview

Project basics
Need for ChemReader tool
ChemReader team
Library involvement
Work process
Reflections

Project basics

Interest in science
Research
Health sciences

Problem

Information trapped inside images
How to search for structural data?
ChemReader is an automated tool to:
  Extract chemical structure diagrams from digital images
  Convert graphic to searchable chemical file formats

Problem

Chemical structure recognition
Chemical structure defines properties
  Ex. reactivity, toxicity, flammability
Current practice
  Text mining (complementary) and manual indexing
  Problem of synonyms, chemical formulas, and arbitrary indexing systems

Structure

Chemical names: Common names (brand and generic), IUPAC name
Chemical formula: Ex. C3H6O
Machine-readable codes: InChI and SMILES
  Ex. Oc1ccc(cc1Br)C2(O[S](=O)(=O)c3ccccc23)c4ccc(O)c(Br)c4
Arbitrary index number: CAS registry number
  Ex. 134523-00-5
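Although a CAS registry number says nothing about structure, it does carry a check digit, which helps catch transcription errors when numbers are copied by hand. A minimal sketch of that check, assuming the usual NNNNNNN-NN-N layout (the function name is illustrative and not part of ChemReader):

```python
import re

def cas_check_digit_ok(cas: str) -> bool:
    """Return True if a CAS registry number has a valid check digit."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(w * int(d) for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check

print(cas_check_digit_ok("134523-00-5"))  # the example number from the slide
print(cas_check_digit_ok("134523-00-6"))  # same number with a wrong check digit
```

A passing check only shows the number is well formed; it does not confirm that the number belongs to the intended compound.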

Image-based tools

Digital raster images in literature (PDFs) not understood by machine as structures

How to access all that material when you cannot search by structure?

ChemReader

Graphical representation -> machine-readable code

Ultimate goal:
  Automated recognition
  Annotate with related articles

Oc1ccc(cc1Br)C2(O[S](=O)(=O)c3ccccc23)c4ccc(O)c(Br)c4
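To illustrate why a machine-readable code is searchable where an image is not, the sketch below counts the non-hydrogen atoms in the SMILES string above using only a regular expression. It is a rough illustration, not a real SMILES parser: it assumes simple bracket atoms such as [S] and ignores isotopes, charges, and hydrogens. A cheminformatics toolkit would be the right tool for real work.

```python
import re
from collections import Counter

# Bracket atoms like [S], two-letter halogens, then one-letter atoms
# (uppercase = aliphatic, lowercase = aromatic). Order matters: "Br" and
# "Cl" must be tried before "B" and "C".
ATOM = re.compile(r"\[([A-Z][a-z]?)[^\]]*\]|Br|Cl|[BCNOPSFI]|[bcnops]")

def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen atoms by element symbol (simplified)."""
    counts = Counter()
    for m in ATOM.finditer(smiles):
        symbol = m.group(1) or m.group(0)
        counts[symbol.capitalize()] += 1
    return counts

smiles = "Oc1ccc(cc1Br)C2(O[S](=O)(=O)c3ccccc23)c4ccc(O)c(Br)c4"
print(dict(heavy_atom_counts(smiles)))
# → {'O': 5, 'C': 19, 'Br': 2, 'S': 1}
```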

ChemReader

[Diagram: ChemReader workflow. Labels: Scientific literature; Document Segmentation; Digital Images; Chemical Structure Diagram Extractor; Structure Recognizer; Molfile, SMILES, etc.; Chemical Database; End user; Query; Annotated Information; Related Information]

ChemReader Team

Dr. Gustavo Rosania - Pharmaceutical Sciences
Dr. Kazu Saitou - Mechanical Engineering
Jungkap Park - Mechanical Engineering
Ye Li - Shapiro Science Library
Caroline Yee, Sarah Hughes, Kelli Herm - Shapiro
Cristof Smith - Chemistry
Myself - Taubman Health Sciences Library

Library Involvement

Testing ChemReader accuracy
Human vs. computer
Build test database (articles)
Manual image & caption extraction
Manual molecule identification

Sample reference set

Searched SciFinder “diabetes and small molecule” and limited to ‘journal articles’
Deduplication
Articles containing compounds only referred to once (rationale)
Fulltext PDFs downloaded and inserted into database
346 articles
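The deduplication step can be sketched as follows. This is a hypothetical illustration, not the procedure used in the project: it assumes each search result is a dictionary with "title" and "year" fields (invented names) and treats two records as duplicates when their normalized title and year match.

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase a title and collapse punctuation and spacing."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def deduplicate(records):
    """Keep the first record seen for each (normalized title, year) pair."""
    seen = set()
    unique = []
    for rec in records:
        key = (normalize_title(rec["title"]), rec["year"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Made-up records for illustration only.
records = [
    {"title": "Small-Molecule Inhibitors: An Overview", "year": 2010},
    {"title": "Small molecule inhibitors - an overview", "year": 2010},
    {"title": "A different article", "year": 2009},
]
print(len(deduplicate(records)))  # → 2
```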

Extraction and Linking

Extraction process
Hand-linking of chemical structures to machine-readable codes
  Mystery structure in test database
  Identify SciFinder structure
  Retrieve CAS registry number
  Link structure to machine-readable codes
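One way to picture the result of the hand-linking steps above is a record per extracted image that fills in as each step is completed. The sketch below is an assumption about how such a record might look; the class and field names are invented and do not describe the actual test database. Acetone is used as the example because its CAS registry number and SMILES are well known.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StructureLink:
    """One extracted structure image and the identifiers linked to it."""
    image_id: str                      # hypothetical key for the extracted image
    caption: str                       # caption copied from the article
    cas_number: Optional[str] = None   # filled in after the SciFinder lookup
    smiles: Optional[str] = None       # machine-readable code, if found
    inchi: Optional[str] = None

    def is_linked(self) -> bool:
        """True once a CAS number and at least one code are recorded."""
        return bool(self.cas_number and (self.smiles or self.inchi))

link = StructureLink(image_id="article001_fig2", caption="Compound 1")
print(link.is_linked())   # → False (still a "mystery structure")

link.cas_number = "67-64-1"   # acetone
link.smiles = "CC(C)=O"
print(link.is_linked())   # → True
```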

[Screenshots: Testing Database; SciFinder]

Linking issues

?

Linking issues

Current and future work

Currently outperforming major competitors (Park 2009)
With library:
  Verification process
  Further test accuracy
Without library:
  Test hypotheses
  Compare with text-based and image-based tools

[Bar chart: % of Correct and Avg. similarity (axis 0.0% to 90.0%) for ChemReader, OSRA, and CLiDE]
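The two measures in the chart can be sketched as below. The slides do not say how similarity was calculated, so the Tanimoto coefficient over structural feature sets is an assumption here (it is a common choice for comparing structures). The feature sets are made-up integers standing in for real fingerprints, and all names are illustrative.

```python
from typing import NamedTuple, FrozenSet, List

class Result(NamedTuple):
    predicted_code: str                 # code produced by the tool
    reference_code: str                 # hand-linked reference code
    predicted_features: FrozenSet[int]  # stand-in fingerprint of the prediction
    reference_features: FrozenSet[int]  # stand-in fingerprint of the reference

def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Shared features divided by all features seen in either set."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def evaluate(results: List[Result]):
    """Return (% of exactly correct codes, average Tanimoto similarity)."""
    n = len(results)
    correct = sum(r.predicted_code == r.reference_code for r in results)
    similarity = sum(
        tanimoto(r.predicted_features, r.reference_features) for r in results
    )
    return 100.0 * correct / n, similarity / n

# Made-up results: one exact match, one partial match.
results = [
    Result("CC(C)=O", "CC(C)=O", frozenset({1, 2, 3}), frozenset({1, 2, 3})),
    Result("CCO", "CC(C)O", frozenset({1, 2, 3, 4}), frozenset({1, 2, 5, 6})),
]
pct_correct, avg_similarity = evaluate(results)
print(pct_correct, round(avg_similarity, 3))  # → 50.0 0.667
```

Comparing codes as plain strings only works if both sides use the same canonical form; a real evaluation would canonicalize the codes or compare the structures directly.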

Reflections

Surprises:
  Adapting to faculty’s changing priorities
  ULA project proposal changes
  Nature of the task
Bigger picture:
  Widening the landscape of library services
  Outreach to faculty and departments

Questions?

Slide 5:
Chen, Tracy, Kablaoui, Natasha, & Little, Jeremy. (2009). Identification of small-molecule inhibitors of the JIP–JNK interaction. Biochemical Journal, 420, 283–294.
Macauley, Matthew, & Vocadlo, David. (2010). Increasing O-GlcNAc levels: an overview of small-molecule inhibitors. Biochimica et Biophysica Acta, 1800, 107–121.

Slide 8:
Banville, Debra. (2009). Chemical information mining: facilitating literature-based discovery. Boca Raton: CRC Press.

Slides 10, 14, & 25:
Park, J, Rosania, GR, Shedden, KA, Mandee, N, & Saitou, K. (2009). Automated extraction of chemical structure information from digital raster images. Chemistry Central Journal, 3(4), 1-16.

Slides 23 & 24:
About.com chemistry. (2011, March 16). Retrieved from http://chemistry.about.com/od/factsstructures/ig/Chemical-Structures---X/Xylenol-Orange.htm