250th ACS National Meeting, Boston MA, USA, 17th August 2015
Unlocking chemical information from tables and legacy articles
Daniel Lowe and Roger Sayle, NextMove Software
Aileen Day and Antony Williams, Royal Society of Chemistry
Topics
• Chemical property extraction
• Application of chemical property extraction to tables
• RSC back-archive mining
Chemical property extraction
• Melting points
• Boiling points
• Mass spectrum
• Textual NMR spectra
• Specific rotation
• Chromatography retention times
• IR/UV spectra
• Activity data e.g. IC50, EC50, Ki
• Etc.
Simple grammar and corresponding state machine
Isotope: ‘1H’ | ‘13C’ | ‘19F’
Nmr: ‘-NMR’
NmrPrelog: Isotope Nmr
[Figure: state machine with transitions on the characters 1, 3, 9, H, C, F, -, N, M, R]
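The toy grammar on this slide is small enough to write out by hand as a deterministic state machine. The sketch below is only an illustration of the idea: the state names and the transition table are assumptions made for this example, not the machine generated by the authors' tooling.

```python
# Hand-built deterministic state machine for the toy grammar:
#   Isotope:   '1H' | '13C' | '19F'
#   Nmr:       '-NMR'
#   NmrPrelog: Isotope Nmr
# State names are hypothetical; only the accepted strings come from the slide.
TRANSITIONS = {
    ("start", "1"): "seen_1",
    ("seen_1", "H"): "isotope",
    ("seen_1", "3"): "seen_13",
    ("seen_1", "9"): "seen_19",
    ("seen_13", "C"): "isotope",
    ("seen_19", "F"): "isotope",
    ("isotope", "-"): "dash",
    ("dash", "N"): "n",
    ("n", "M"): "nm",
    ("nm", "R"): "accept",
}


def matches_nmr_prelog(text: str) -> bool:
    """Return True if the whole string is accepted by the state machine."""
    state = "start"
    for char in text:
        state = TRANSITIONS.get((state, char))
        if state is None:  # no transition defined for this character: reject
            return False
    return state == "accept"
```

Note that the isotopes share the leading "1", so the machine only commits to an isotope after reading the second character.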
Melting point recognition
Term               Examples of text matched
FromLiterature     “lit.”
MeltingPoint       “mpt”, “melting point”, “m.p.”
Qualifier          “>”, “approximately”
Value              “75° C”, “200° F”, “one hundred degrees Celsius”
Range              “184-186° C”, “191.5 to 192.4° C”
MeasurementError   “50±° C”
OutcomeQualifier   “decomp.”, “with decomposition”, “subl.”
FromLiterature? MeltingPoint Qualifier? (Value | Range | MeasurementError) OutcomeQualifier?
M.p.: 230°C (dec.)
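The rule above can be approximated with a regular expression that has one named group per term. This is a simplified sketch, not the grammar used in the talk: it uses Python's `re` module in place of a grammar-compiled state machine, handles only numeric values and ranges (no MeasurementError, no spelled-out numbers), and the group names and the helper `parse_melting_point` are invented for the example.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"

# FromLiterature? MeltingPoint Qualifier? (Value | Range) OutcomeQualifier?
MELTING_POINT = re.compile(
    rf"""
    \b(?P<from_literature>lit\.\s*)?                # FromLiterature?
    (?P<quantity>m\.?\s?p\.?t?|melting\ point)      # MeltingPoint
    \s*:?\s*
    (?P<qualifier>>|<|approximately|ca\.)?\s*       # Qualifier?
    (?P<low>{NUMBER})                               # Value, or start of a Range
    (?:\s*(?:-|to)\s*(?P<high>{NUMBER}))?           # remainder of a Range
    \s*°\s*(?P<unit>[CF])                           # temperature unit
    (?:\s*\(?(?P<outcome>dec(?:omp)?\.|subl\.|with\ decomposition)\)?)?
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_melting_point(text):
    """Return the recognised parts as a dict, or None if nothing matches."""
    match = MELTING_POINT.search(text)
    if match is None:
        return None
    # Drop the optional terms that were absent from this particular match.
    return {k: v.strip() for k, v in match.groupdict().items() if v is not None}
```

For the slide's example, `parse_melting_point("M.p.: 230°C (dec.)")` yields the quantity, a single value of 230, the unit C and the outcome qualifier "dec.".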
NMR recognition
Term             Examples of text matched
Isotope          “1H”, “13C”, “19F”
NMR              “NMR”, “RMN”
NmrMethod        “400 MHz, CDCl3”
Peak             “3.7”
PeakAnnotation   “s, 3H”
Isotope NMR NmrMethod? Peak PeakAnnotation? (Delimiter Peak PeakAnnotation?)*
1H NMR (300 MHz, DMSO): 7.5-7.8 (m, 5H), 7.9 (d, J=8Hz, 2H), 8.33 (d, J=5Hz, 2H)
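As a rough illustration of the rule above, the following sketch splits a textual spectrum into a header (Isotope, NMR, optional NmrMethod) and a list of peaks with optional annotations. It relies on two regular expressions rather than a compiled grammar, and the names `HEADER`, `PEAK` and `parse_nmr` are assumptions made for this example.

```python
import re

# Isotope NMR NmrMethod?
HEADER = re.compile(
    r"(?P<isotope>1H|13C|19F)[ -]?(?P<nmr>NMR|RMN)\s*"
    r"(?:\((?P<method>[^)]*)\))?\s*:?\s*"
)
# Peak PeakAnnotation?  (a peak is a single shift or a shift range)
PEAK = re.compile(
    r"(?P<shift>\d+(?:\.\d+)?(?:\s*-\s*\d+(?:\.\d+)?)?)"
    r"(?:\s*\((?P<annotation>[^)]*)\))?"
)


def parse_nmr(text):
    """Return isotope, method and (shift, annotation) pairs, or None."""
    header = HEADER.match(text)
    if header is None:
        return None
    # Scan for peaks only after the header; each annotation is consumed
    # together with its peak, so numbers inside it are not read as peaks.
    peaks = [
        (m.group("shift"), m.group("annotation"))
        for m in PEAK.finditer(text, header.end())
    ]
    return {
        "isotope": header.group("isotope"),
        "method": header.group("method"),
        "peaks": peaks,
    }
```

Applied to the example line on this slide, the sketch returns three peaks, the first being the range 7.5-7.8 with annotation "m, 5H".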
Recognition and parsing
• Grammar distinguishes parts of an entity of interest e.g. 25°C → 25 (value) °C (unit)
• Can group constructs together e.g. 25 to 30 (range)
Example parse tree serialised to XML
Mp: 131.9-132.6 °C
<parse>
  <quantityType quantityType="MeltingPoint">Mp</quantityType>
  <measurement>
    <range>
      <valueOptUnit>
        <decimalValue>131.9</decimalValue>
      </valueOptUnit>
      <rangeDelimiter>-</rangeDelimiter>
      <valueOptUnit>
        <decimalValue>132.6</decimalValue>
        <unitContainer>
          <unit unitType="Temperature" normalizationFactor="1">°C</unit>
        </unitContainer>
      </valueOptUnit>
    </range>
  </measurement>
</parse>
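A serialised parse tree like this one can be consumed with any XML library. The sketch below uses Python's standard `xml.etree.ElementTree` to pull out the quantity type, the two ends of the range and the unit; the element and attribute names are copied from the slide, while the function name `summarise` and the output dictionary are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

PARSE_XML = """<parse>
  <quantityType quantityType="MeltingPoint">Mp</quantityType>
  <measurement>
    <range>
      <valueOptUnit><decimalValue>131.9</decimalValue></valueOptUnit>
      <rangeDelimiter>-</rangeDelimiter>
      <valueOptUnit>
        <decimalValue>132.6</decimalValue>
        <unitContainer>
          <unit unitType="Temperature" normalizationFactor="1">°C</unit>
        </unitContainer>
      </valueOptUnit>
    </range>
  </measurement>
</parse>"""


def summarise(xml_text):
    """Collect quantity type, numeric bounds and unit from a parse tree."""
    root = ET.fromstring(xml_text)
    values = [float(e.text) for e in root.iter("decimalValue")]
    unit = root.find(".//unit")
    return {
        "quantity": root.find("quantityType").get("quantityType"),
        "low": min(values),
        "high": max(values),
        "unit": unit.text if unit is not None else None,
    }
```

The unit is attached only to the second value in the tree, so the sketch looks it up once for the whole measurement rather than per value.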
Recognition and parsing
• However, grouping constructs introduces non-determinism e.g. after seeing “25”, both the possibility of being in a range and of not being in a range need to be considered
• Same grammar can be used to generate:
  – Single state machine representation
    • Parts of entity not distinguished
    • Extremely fast recognition
    • Allows spelling correction of input that is close to being a match
  – Multi state machine parser representation
    • Slower… but only needs to be run on a small amount of text
    • Distinguishes parts of entity
    • Can group parts into a parse tree
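The two-representation idea can be mimicked with two regular expressions derived from one pattern: a group-free recogniser that scans the whole document, and a version with named groups that is run only on the spans the recogniser found. This is an analogy under stated assumptions, not the authors' implementation; Python's `re` engine stands in for both state machines, and the names `RECOGNISER`, `PARSER` and `extract_temperatures` are hypothetical.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"

# Stage 1: recognition only. No capture groups, so the parts of the
# entity (value, range, unit) are not distinguished.
RECOGNISER = re.compile(rf"{NUMBER}(?:\s*(?:-|to)\s*{NUMBER})?\s*°\s*[CF]")

# Stage 2: the same pattern with named groups, run only on recognised text.
PARSER = re.compile(
    rf"(?P<low>{NUMBER})(?:\s*(?:-|to)\s*(?P<high>{NUMBER}))?\s*°\s*(?P<unit>[CF])"
)


def extract_temperatures(document):
    """Recognise candidate spans quickly, then parse each span in detail."""
    results = []
    for hit in RECOGNISER.finditer(document):
        parsed = PARSER.fullmatch(hit.group())
        results.append(parsed.groupdict())
    return results
```

For the input "heated at 25 °C then 25 to 30 °C" the first hit parses as a single value and the second as a range, which is the choice a parser has to keep open after reading "25".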
Grammar implementation details
Table Extraction
Melting point table
NMR table
More difficult…
Against what?
Needs to be looked up elsewhere in the document. Could be in text, might be in images
Even more difficult…
Tables in USPTO patents
…and the XML provided
<row>
  <entry>1</entry>
  <entry>N<sup>1</sup>-hydroxy-N<sup>2</sup>-{[4-(phenyloxy)phenyl]sulfonyl}-</entry>
  <entry>H-NMR; δ (CD3OD): 7.79 (d, 2H),</entry>
</row>
<row>
  <entry/>
  <entry>D-lysinamide</entry>
  <entry>7.42 (t, 2H), 7.22 (t, 1H), 7.09 (d, 2H),</entry>
</row>
<row>
  <entry/>
  <entry/>
  <entry>7.05 (d, 2H), 3.63 (t, 1H), 2.87 (t, 2H),</entry>
</row>
<row>
  <entry/>
  <entry/>
  <entry>1.57-1.68 (m, 4H), 1.44 (m, 1H),</entry>
</row>
<row>
  <entry/>
  <entry/>
  <entry>1.37 (m, 1H)</entry>
</row>
<row>
Naïve interpretation (Google patents)
Green: chemical substituent Purple: chemical molecule Blue: NMR
SureChEMBL
After heuristically detecting which rows are the same row
Purple: chemical molecule Blue: NMR
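One plausible heuristic for deciding which physical rows belong to the same logical row is to treat any row whose first cell is empty as a continuation of the row before it. The sketch below applies that rule to the USPTO fragment shown earlier in the talk. It is an assumption about how such a heuristic might look, not the method actually used; the `<tbody>` wrapper and the helper names are invented so that the fragment parses as a single document.

```python
import xml.etree.ElementTree as ET

# The <tbody> wrapper is hypothetical; the rows are taken from the slide.
TABLE_XML = """<tbody>
<row><entry>1</entry><entry>N<sup>1</sup>-hydroxy-N<sup>2</sup>-{[4-(phenyloxy)phenyl]sulfonyl}-</entry><entry>H-NMR; δ (CD3OD): 7.79 (d, 2H),</entry></row>
<row><entry/><entry>D-lysinamide</entry><entry>7.42 (t, 2H), 7.22 (t, 1H), 7.09 (d, 2H),</entry></row>
<row><entry/><entry/><entry>7.05 (d, 2H), 3.63 (t, 1H), 2.87 (t, 2H),</entry></row>
<row><entry/><entry/><entry>1.57-1.68 (m, 4H), 1.44 (m, 1H),</entry></row>
<row><entry/><entry/><entry>1.37 (m, 1H)</entry></row>
</tbody>"""


def cell_text(entry):
    """Flatten a cell to plain text (superscript markup is discarded)."""
    return "".join(entry.itertext()).strip()


def join_cells(previous, continuation):
    """Append a continuation cell to the cell above it."""
    if not continuation:
        return previous
    if not previous:
        return continuation
    # A trailing hyphen suggests a chemical name that was split mid-word.
    separator = "" if previous.endswith("-") else " "
    return previous + separator + continuation


def merge_rows(xml_text):
    """Merge rows with an empty first cell into the preceding row."""
    logical_rows = []
    for row in ET.fromstring(xml_text).iter("row"):
        cells = [cell_text(e) for e in row.findall("entry")]
        if logical_rows and cells and cells[0] == "":
            last = logical_rows[-1]
            logical_rows[-1] = [join_cells(a, b) for a, b in zip(last, cells)]
        else:
            logical_rows.append(cells)
    return logical_rows
```

On the fragment above the five physical rows collapse into one logical row whose name cell reads as a single compound name and whose NMR cell holds the complete peak list. Real tables would need more signals than an empty first cell, for example rows whose identifier column is legitimately blank.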
What could be extracted?
[Figure: bar chart of name/identifier to property relationships; y-axis from 0 to 600000]
Compound number determination
<heading level="2" id="h-0055">EXAMPLE-14 </heading>
<heading level="2" id="h-0056">2-(2,4-difluorophenoxy)-5...</heading>
<parse>
  <referenceType type="Example">EXAMPLE</referenceType>
  <referenceId>14</referenceId>
</parse>
<heading level="2" id="h-0008">3. (4aS,8aR)-2-(1-Acetyl-pipe..</heading>
2-Chloro-5-iodo-1H-benzo[d]imidazole (1)
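The three heading styles on this slide (an "EXAMPLE-14" heading, a leading "3." before a name, and a trailing "(1)" after a name) can each be caught by a small pattern. The following is a minimal sketch under that assumption; the type labels "LeadingNumber" and "TrailingLabel" and the function name are hypothetical, and only "Example" appears in the slide's XML.

```python
import re

# Patterns are tried in order; the first one that matches wins.
PATTERNS = [
    ("Example", re.compile(r"^\s*EXAMPLE[\s-]*(?P<id>\d+[a-z]?)\s*$", re.IGNORECASE)),
    ("LeadingNumber", re.compile(r"^\s*(?P<id>\d+[a-z]?)\.\s+\S")),
    ("TrailingLabel", re.compile(r"\s\((?P<id>\d+[a-z]?)\)\s*$")),
]


def find_compound_reference(heading):
    """Return (reference type, identifier) for a heading, or None."""
    for reference_type, pattern in PATTERNS:
        match = pattern.search(heading)
        if match:
            return reference_type, match.group("id")
    return None
```

A heading that is just a chemical name, such as the second heading on the slide, returns None: its leading "2" is a locant followed by a hyphen, not a compound number followed by a full stop.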
RSC back-archive mining
RSC back archive
• 1841-1999, 211k articles (available as XML derived from OCR and PDF)
• 2000 onwards, 230k articles (available as born-digital XML and PDF)
• Also over 150k electronic supporting information files (mostly PDF, but also Word docs, Excel files, videos etc.)
Legacy document handling
• Chemical properties are often implicitly associated with a compound by being in the same experimental section
• This requires section detection e.g. a heading and/or a paragraph where a compound is being synthesised
• In the XML for pre-2000 papers all sections on a page run together (including page numbers!), and the text position information is lost
• …so back to the source PDF
Heading/Paragraph detection workflow
Results (Melting points)
                                                        1841-1999 RSC journal articles   2000-2015 RSC journal articles   2001-2015 USPTO patent applications
Compound-value associations                             2,155                            29,996                           172,886
Suspicious values (typically mistake in the document)   70 (3.2%)                        39 (0.13%)                       426 (0.25%)
Unique compounds (StdInChI)                             1,830 (84.9%)                    27,956 (93.2%)                   95,140 (55.0%)
SDF output
[Figure: chemical structure with atom labels F, B–, O, NH, H3C, O+]
[Figure: chemical structure with atom labels Cl, Te, F]
Results (NMR)
                                                        1841-1999 RSC journal articles   2000-2015 RSC journal articles   2001-2015 USPTO patent applications
Compound-value associations                             4,972                            94,610                           1,295,325
Suspicious values (typically mistake in the document)   561 (11.3%)                      2,001 (2.11%)                    29,775 (2.30%)
Unique compounds (StdInChI)                             2,899                            48,137                           655,295
Legacy text issues
• OCR errors in important compound names or data
  – chemical names in italics are problematic… key compounds are often in italics!
  – ° is more often than not misinterpreted e.g. as ' or o
• Tools prefer experimental sections where one compound is being synthesised; qualitatively, older documents are less formalised
4-ChZoro-6-hydroxy-2-methyZamino~yrimidine.-4-Chloro-6-methoxy-2-methylaminopyrim-idine (10g.) was heated on the steam-bath for 30 min. with concentrated hydrochloric acid (60 c.c.). The hydvoxy-cmfiound which separated on cooling was collected and purified by dis- solution in alkali,etc. as above and had m. p. 266" (decornp.) (6.6 g.) (Found c 38.3 ; H 4.1; N 26-2. C,H,0N3C1 requires C 37.6; H 3.8; N 26.3%).
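The OCR excerpt contains "(decornp.)" where the printed text presumably read "(decomp.)". Earlier in the talk, spelling correction of near-matches is listed as a benefit of the single state machine representation. As a stand-in for that mechanism, the sketch below uses a plain Levenshtein edit distance against a short list of known terms. The term list, the threshold of 2 and the function names are assumptions for illustration only.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


# Hypothetical vocabulary of outcome qualifiers, taken from the slides.
KNOWN_TERMS = ["decomp.", "dec.", "subl."]


def correct_term(token, max_distance=2):
    """Return the closest known term, or None if nothing is close enough."""
    best = min(KNOWN_TERMS, key=lambda term: edit_distance(token, term))
    return best if edit_distance(token, best) <= max_distance else None
```

Here "decornp." is two edits away from "decomp." (replace "r" with "m", delete "n"), so it is corrected, while an unrelated word such as "hydrochloric" is left alone. With such a short vocabulary a fixed threshold can over-correct short tokens, so a real system would likely scale the threshold with token length.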
Conclusions
• Grammars facilitate rapid extraction and interpretation of chemical properties
• Table extraction is vital to extracting large quantities of certain data e.g. activity data
• Large amounts of high quality data can be extracted from journal articles
• …but extraction from older documents remains very challenging, although over time these documents represent a smaller and smaller percentage of the scientific literature
Acknowledgements
• Igor Tetko (Melting point quality feedback)
• Carlos Cobas (NMR quality feedback)
Funding provided by:
Sci-Mix 8:00pm – 10:00pm Today, Hall C – Boston Convention & Exhibition Center
Chemistry Enabling Chinese, Japanese and Korean Patents
6-aminopyrimidine-2,4,5-triol
Chinese (Hanzi used for each morpheme): 6-氨基嘧啶-2,4,5-三醇 (ammonia radical pyrimidine three alcohol)
Japanese (Phonetic translation to Katakana): 6-アミノピリミジン-2,4,5-トリオール (amino pyrimidine tri ol)
Korean (Phonetic translation to Hangul): 6-아미노피리미딘-2,4,5-트리올 (amino pyrimidine tri ol)
Thank you for your time!
http://nextmovesoftware.com
http://nextmovesoftware.com/blog