Exploring Structure Databases
222nd American Chemical Society National Meeting
Herman Skolnik Award Symposium
August 28, 2001
Robert W. Snyder
MDL Information Systems
Agenda
Comparing Reaction Databases
Building a Reaction Knowledge Base
Top 10 Reaction Types
Measuring Unique Transformations in Reaction Databases
An Interlude with Guenter…
Before Joining MDL
After Joining MDL
Comparing Reaction Databases
Complementary abstracting guidelines?
    solution-phase: ChemInform Reaction Library
    solid-phase: SPORE
    heterocyclic chemistry: CHC
Price?
Literature references?
Overlap of reactions?
“Duplications Among Reaction Databases”
Paper by James Hendrickson and Ling Zhang (JCICS, 2000, 40, 380-383)
Analyzed 16 reaction databases
    including RefLib, CHC, RX-JSM, ORGSYN, CSM
    total of 1,075,484 reactions
Converted reactions to common database format
Performed pair-wise duplication checks
Results: 2.7% duplication
Limitations of Exact Reaction Comparison
Requiring an exact reaction structure match may be too stringent a condition
Two reactions may share the same transformation but differ in side groups that play no significant role in the reaction
What if we could compare reaction databases by degree of similarity?
Molecule Similarity
[Figure: three molecule structures, labeled Diagnostic Agent, Antiarthritic, and Anesthetic]
Reaction Similarity
[Figure: reaction structure diagrams]
Tanimoto Coefficient Measures Similarity
$$T(A,B) = \frac{\sum_J N[A,J] \cdot N[B,J]}{\sum_J N[A,J]^2 + \sum_J N[B,J]^2 - \sum_J N[A,J] \cdot N[B,J]}$$
where A, B are the two structures, and N[A,J] and N[B,J] are the number of occurrences of the Jth fragment in structures A and B
Tanimoto Coefficient Measures Similarity
Molecule and reaction keys are used to compute the Tanimoto coefficient
Structure keys are binary
Can we extend this concept to compare similarity of reaction databases?
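When the keys are binary, each N[A,J] is 0 or 1, so the formula reduces to the number of shared keys divided by the number of keys present in either structure. A minimal sketch of that special case follows; the function name and the key labels (`k1`, `k2`, ...) are hypothetical and are not actual MDL keys.

```python
def tanimoto_binary(keys_a, keys_b):
    """Tanimoto coefficient for binary keys, each structure given as a set of key IDs."""
    a, b = set(keys_a), set(keys_b)
    shared = len(a & b)                 # sum of N[A,J] * N[B,J] when N is 0 or 1
    denom = len(a) + len(b) - shared    # sum N[A,J]^2 + sum N[B,J]^2 - shared
    # Convention chosen here (an assumption): two empty key sets score 0.0.
    return shared / denom if denom else 0.0


# Made-up key sets for two structures.
mol_a = {"k1", "k2", "k3", "k4"}
mol_b = {"k2", "k3", "k4", "k5", "k6"}

print(tanimoto_binary(mol_a, mol_b))  # 3 shared / (4 + 5 - 3) = 0.5
```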
Reaction Classification
Consistent assignment of a numerical index (15-digit integer) to a reaction center topology
Technology developed by InfoChem GmbH
Based on structural environment around the reaction center(s)
Can be used as an indicator for reaction type
Three Levels of Classification
[Figure: reaction center environments at the three classification levels: Broad, Medium, Narrow]
Reaction Classification Code Examples
[Figure: three example reactions shown with their classification codes 324005931589888, 325741082969498, and 294560435478524; one is annotated 69%]
Reaction Database Similarity Measure
Treat reaction classification codes as synthetic methodology keys of the database
The more classification codes two databases share, the more similar they are
Existence of classification codes is nonbinary
    there can be many reaction examples in a database with the same classification code
We can compute a Tanimoto coefficient between databases using the classification codes as the methodology fingerprint
Tanimoto Coefficient Measures Similarity
$$T(A,B) = \frac{\sum_J N[A,J] \cdot N[B,J]}{\sum_J N[A,J]^2 + \sum_J N[B,J]^2 - \sum_J N[A,J] \cdot N[B,J]}$$
where A, B are the two databases, and N[A,J] and N[B,J] are the number of occurrences of the Jth classcode in databases A and B
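The nonbinary form of the coefficient can be sketched with a count per classcode for each database. This is an illustrative sketch only: the function name, the codes `code_1` to `code_3`, and the counts are made up, not taken from any of the databases discussed.

```python
from collections import Counter


def tanimoto_counts(counts_a, counts_b):
    """Tanimoto coefficient over occurrence counts (classcode -> number of reactions)."""
    dot = sum(n * counts_b.get(code, 0) for code, n in counts_a.items())
    sq_a = sum(n * n for n in counts_a.values())
    sq_b = sum(n * n for n in counts_b.values())
    denom = sq_a + sq_b - dot
    # Convention chosen here (an assumption): two empty fingerprints score 0.0.
    return dot / denom if denom else 0.0


# Hypothetical methodology fingerprints for two small databases.
db_a = Counter({"code_1": 3, "code_2": 1})
db_b = Counter({"code_1": 1, "code_3": 2})

# dot = 3*1 = 3, sq_a = 9 + 1 = 10, sq_b = 1 + 4 = 5, so T = 3 / (10 + 5 - 3)
print(tanimoto_counts(db_a, db_b))  # 0.25
```

Because the counts enter squared, a classcode with many examples in one database and few in the other lowers the score more than a binary shared/not-shared comparison would.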
Reaction Database Similarity Matrices
[Table: schematic 7 x 7 similarity matrix with Rxn DB 1 through Rxn DB 7 as both rows and columns and the classification level named in the corner cell; each diagonal entry is 1.00, and a sample off-diagonal entry A appears in the Rxn DB 2 and Rxn DB 4 rows]
Reaction Database Similarity Measures
BROAD   CIRX    REFLIB  SPORE   CHC     ORGSYN  RX-JSM  THEIL
CIRX    1.00    0.30    0.03    0.04    0.01    0.08    0.06
REFLIB  0.30    1.00    0.09    0.13    0.03    0.32    0.26
SPORE   0.03    0.09    1.00    0.19    0.09    0.31    0.38
CHC     0.04    0.13    0.19    1.00    0.14    0.41    0.50
ORGSYN  0.01    0.03    0.09    0.14    1.00    0.11    0.15
RX-JSM  0.08    0.32    0.31    0.41    0.11    1.00    0.72
THEIL   0.06    0.26    0.38    0.50    0.15    0.72    1.00
Reaction Database Similarity Measures
MEDIUM  CIRX    REFLIB  SPORE   CHC     ORGSYN  RX-JSM  THEIL
CIRX    1.00    0.29    0.02    0.03    0.01    0.06    0.05
REFLIB  0.29    1.00    0.04    0.11    0.03    0.23    0.23
SPORE   0.02    0.04    1.00    0.04    0.04    0.14    0.13
CHC     0.03    0.11    0.04    1.00    0.09    0.26    0.30
ORGSYN  0.01    0.03    0.04    0.09    1.00    0.11    0.17
RX-JSM  0.06    0.23    0.14    0.26    0.11    1.00    0.46
THEIL   0.05    0.23    0.13    0.30    0.17    0.46    1.00
Reaction Database Similarity Measures
NARROW  CIRX    REFLIB  SPORE   CHC     ORGSYN  RX-JSM  THEIL
CIRX    1.00    0.21    0.02    0.03    0.01    0.06    0.05
REFLIB  0.21    1.00    0.05    0.09    0.04    0.21    0.28
SPORE   0.02    0.05    1.00    0.03    0.03    0.09    0.10
CHC     0.03    0.09    0.03    1.00    0.06    0.12    0.16
ORGSYN  0.01    0.04    0.03    0.06    1.00    0.07    0.14
RX-JSM  0.06    0.21    0.09    0.12    0.07    1.00    0.24
THEIL   0.05    0.28    0.10    0.16    0.14    0.24    1.00
Reaction Center Environment vs. Average Similarity
[Figure: Average Database Similarity (y-axis, 0.00 to 0.25) plotted against Size of Reaction Center Environment (x-axis, 0 to 8); data labels in extracted order: 0.21, 0.14, 0.10, 0.07, 0.05, 0.03, 0.02]
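One way to read the plotted averages is as the mean of the 21 off-diagonal pairs in each 7 x 7 similarity matrix. That reading is an assumption, but a short check using the BROAD, MEDIUM, and NARROW values from the matrices above gives 0.21, 0.14, and 0.10, which agree with the first three data labels on the chart. The database order used below (CIRX, REFLIB, SPORE, CHC, ORGSYN, RX-JSM, THEIL) is only a bookkeeping choice.

```python
from itertools import combinations

NAMES = ["CIRX", "REFLIB", "SPORE", "CHC", "ORGSYN", "RX-JSM", "THEIL"]

# Upper triangle of each matrix, row by row, diagonal omitted.
UPPER = {
    "BROAD": [0.30, 0.03, 0.04, 0.01, 0.08, 0.06,
              0.09, 0.13, 0.03, 0.32, 0.26,
              0.19, 0.09, 0.31, 0.38,
              0.14, 0.41, 0.50,
              0.11, 0.15,
              0.72],
    "MEDIUM": [0.29, 0.02, 0.03, 0.01, 0.06, 0.05,
               0.04, 0.11, 0.03, 0.23, 0.23,
               0.04, 0.04, 0.14, 0.13,
               0.09, 0.26, 0.30,
               0.11, 0.17,
               0.46],
    "NARROW": [0.21, 0.02, 0.03, 0.01, 0.06, 0.05,
               0.05, 0.09, 0.04, 0.21, 0.28,
               0.03, 0.03, 0.09, 0.10,
               0.06, 0.12, 0.16,
               0.07, 0.14,
               0.24],
}

# Seven databases give 21 unordered pairs.
N_PAIRS = len(list(combinations(NAMES, 2)))


def average_similarity(upper):
    """Mean similarity over all distinct database pairs."""
    return sum(upper) / len(upper)


for level, values in UPPER.items():
    assert len(values) == N_PAIRS
    print(f"{level}: {average_similarity(values):.2f}")
# BROAD: 0.21
# MEDIUM: 0.14
# NARROW: 0.10
```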
An Interlude with Guenter…
Building a Reaction Knowledge Base
A knowledge base should:
    provide a single point of entry
    be based on reaction type
    link to individual data sources for the full reaction
Can we build a reaction knowledge base on InfoChem reaction classification codes?
Reaction Knowledge Base
[Diagram: a box labeled Reaction Knowledge Base and three boxes labeled ReactionDB]
Reaction Knowledge Base – the Source
ChemInform = 644,947 rxns
The Reference Library = 171,110 rxns
SPORE = 9,193 rxns
CHC = 42,375 rxns
ORGSYN = 5,392 rxns
RX-JSM = 68,803 rxns
THEILHEIMER = 46,467 rxns
988,287 total rxns
Reaction Knowledge Base - Classcodes
Broad classcodes: 176,702 (5.6:1)
Medium classcodes: 373,923 (2.6:1)
Narrow classcodes: 447,562 (2.2:1)
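The ratios in parentheses appear to be the total number of reactions divided by the number of distinct classcodes at each level, that is, the average number of reaction examples per classcode. That interpretation is an assumption, but the arithmetic below reproduces the quoted values from the source counts.

```python
# Reaction counts per source database, as listed for the knowledge base.
reactions = {
    "ChemInform": 644_947,
    "The Reference Library": 171_110,
    "SPORE": 9_193,
    "CHC": 42_375,
    "ORGSYN": 5_392,
    "RX-JSM": 68_803,
    "THEILHEIMER": 46_467,
}

# Number of distinct classcodes at each classification level.
classcodes = {"Broad": 176_702, "Medium": 373_923, "Narrow": 447_562}

total = sum(reactions.values())
print(f"total reactions: {total:,}")  # total reactions: 988,287

# Average number of reactions per classcode at each level.
ratios = {level: total / n for level, n in classcodes.items()}
for level, ratio in ratios.items():
    print(f"{level}: {ratio:.1f}:1")
# Broad: 5.6:1
# Medium: 2.6:1
# Narrow: 2.2:1
```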
Reaction Knowledge Base - Questions
What are the top reported reaction types?
Which transformations have not been reported on solid-phase?
Are there any transformations that are unique to a database? Can this be used as a database selection criterion?
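The first two questions can be illustrated with a toy knowledge base that maps each classcode to the reactions carrying it, grouped by source database. This is a sketch under stated assumptions, not the structure used in the talk: the record layout, the function names, the reaction IDs, and the small integer classcodes are all hypothetical.

```python
from collections import defaultdict

# (database, reaction_id, classcode) records; every value here is made up.
records = [
    ("CIRX", "r1", 101), ("CIRX", "r2", 101), ("CIRX", "r3", 102),
    ("SPORE", "s1", 101),
    ("REFLIB", "f1", 102), ("REFLIB", "f2", 103),
]


def build_knowledge_base(records):
    """Index reactions by classcode, then by source database."""
    kb = defaultdict(lambda: defaultdict(list))
    for db, rxn_id, code in records:
        kb[code][db].append(rxn_id)
    return kb


def top_types(kb, n=10):
    """Classcodes ranked by total number of reaction examples (ties broken by code)."""
    freq = {code: sum(len(ids) for ids in dbs.values()) for code, dbs in kb.items()}
    return sorted(freq.items(), key=lambda item: (-item[1], item[0]))[:n]


def not_reported_in(kb, db):
    """Classcodes that have no example in the given database."""
    return sorted(code for code, dbs in kb.items() if db not in dbs)


kb = build_knowledge_base(records)
print(top_types(kb, n=2))            # [(101, 3), (102, 2)]
print(not_reported_in(kb, "SPORE"))  # [102, 103]
print(kb[102]["REFLIB"])             # ['f1'] links back to the source reaction
```

Keeping the reaction IDs per database is what lets the knowledge base act as a single point of entry while still linking to the individual data sources.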
Top 10 Reaction Types (BROAD)
BROAD classcode   Freq.    THEIL RX-JSM ORGSYN CHC SPORE REFLIB CIRX (unseparated)
267586484050778   11,035   37046332121262,4806,647
228413385171318   9,324    453427591565771,4215,655
261039242542204   8,030    345340331936938075,293
242815105629999   6,751    3213222322561,6953,702
259123750963135   6,428    7732713152191,1714,238
248372910785489   6,331    1752723270671,2364,095
283117507256719   6,245    1183691264131,1833,954
248507929234704   5,889    82210137489793,982
247321184998100   5,184    16121223261635233,633
257407228603162   4,709    18615527176345843,394
Top 10 Reaction Types (MEDIUM)
MEDIUM classcode  Freq.    THEIL RX-JSM ORGSYN CHC SPORE REFLIB CIRX (unseparated)
313125515671298   4,564    136172153171,1842,688
312632986028511   2,717    110962117274331,827
312444602259924   2,477    1071442921313361,429
324501134592978   2,332    54699781761,777
318547512250093   2,185    5310923753151,442
323130024518137   1,463    6648192200184841
298065273835854   1,441    10673206440254817
313146019762509   1,433    757814153299864
255624291255456   1,423    8039121011249950
294529079813567   1,400    10533252845193913
Top 10 Reaction Types (NARROW)
NARROW classcode  Freq.    THEIL RX-JSM ORGSYN CHC SPORE REFLIB CIRX (unseparated)
325741082969498   1,350    9964185340237775
324005931589888   1,070    427819410127642
336777852836899   1,055    23511173140702
294560435478524   944      5130728172625
329002963942591   921      11300042746
318972626250200   836      39226340161520
329038361228014   826      1039431248459
313549094109654   706      63141455108465
336340987165931   691      29251199076405
332262032163809   663      113323068485
Unique Transformations (Broad)
ChemInform = 84,266 (69%)
CHC = 9,589 (50%)
The Reference Library = 20,536 (39%)
SPORE = 619 (33%)
RX-JSM = 9,483 (32%)
ORGSYN = 401 (16%)
THEILHEIMER = 2 (<1%)
Unique Transformations (Medium)
ChemInform = 184,391 (73%)
CHC = 20,357 (65%)
The Reference Library = 47,107 (46%)
SPORE = 1,932 (52%)
RX-JSM = 20,433 (40%)
ORGSYN = 985 (26%)
THEILHEIMER = 2 (<1%)
Unique Transformations (Narrow)
ChemInform = 237,744 (77%)
CHC = 25,485 (75%)
SPORE = 3,180 (67%)
The Reference Library = 54,227 (50%)
RX-JSM = 22,311 (45%)
ORGSYN = 1,066 (32%)
THEILHEIMER = 2 (<1%)
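Counting unique transformations amounts to a set difference: the classcodes found in one database and in no other. The sketch below also reports a percentage, assuming the denominator is that database's own number of distinct classcodes; the slides do not state the denominator, so this is a guess. The database names `A`, `B`, `C` and the integer codes are made up.

```python
def unique_transformations(codes_by_db):
    """For each database, return (count, percent) of classcodes found nowhere else.

    Percent is relative to the database's own distinct classcodes (an assumption).
    """
    result = {}
    for db, codes in codes_by_db.items():
        # Union of the classcodes of every other database.
        others = set().union(*(c for name, c in codes_by_db.items() if name != db))
        unique = codes - others
        percent = 100.0 * len(unique) / len(codes) if codes else 0.0
        result[db] = (len(unique), percent)
    return result


# Hypothetical sets of distinct classcodes per database.
codes_by_db = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {4},
}

for db, (count, percent) in unique_transformations(codes_by_db).items():
    print(f"{db} = {count} ({percent:.0f}%)")
# A = 2 (50%)
# B = 1 (33%)
# C = 0 (0%)
```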
Summary
InfoChem reaction classification codes can be used to measure similarity between databases
A reaction knowledge base can be built using reaction classification and mined for:
    ranking and linking of most common reaction types
    identification of reaction types not reported on solid phase
    contribution of each database to the overall knowledge
Thank You
Congratulations Guenter!
Thank you for your attention.