1
Source – J Kreulen
This is where I work: the IBM Almaden Research Center
2
Why am I here?
(To explain)
- what we are doing with computer curation (text and image analytics)
- why it matters to the scientific community
- how it can affect your work and give you a competitive advantage
3
Computer Curation of Patents & Scientific Literature
[Information Analytics] [Turning Information into Value]
Stephen K. Boyer, [email protected]
408-858-5544
4
The Problem
All content and no discovery?
5
The Question
Can we use computers to "read" documents, identify critical entities, and make meaningful associations that can help us with our work?
6
As text
Chemical names in the document text
Bitmap images
Chemical figures found in the document
For example:
Patents and scientific papers contain molecular data in various forms
7
a) (2R/4S)-4-[4-Amino-5-(4-benzyloxy-phenyl)pyrrolo[2,3-d]pyrimidin-7-yl]-2-hydroxymethyl-pyrrolidine-1-carboxylic acid tert-butyl ester prepared analogously to Example 18 starting from (2R/4S)-4-[4-amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidine-1,2-dicarboxylic acid 1-tert-butyl ester 2-ethyl ester (Example 20a). 1H-NMR (CDCl3, ppm): 8.52 (s, 1H), 7.52-7.32 (m, 7H), 7.1 (d, 2H), 6.95 (d, 1H), 5.50 (m, 1H), 5.13 (s, 2H), 4.62-4.42 (m, 2H), 4.28 (m, 2H), 4.10 (m, 1H), 3.95-3.70 (m, 1H), 2.75 (m, 1H), 2.50 (m, 1H), 1.49 (s, 9H).
b) (2R/4S)-{4-[4-Amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidin-2-yl}-methanol: 0.100 g of (2R/4S)-4-[4-amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidine-1,2-dicarboxylic acid 1-tert-butyl ester is dissolved in 4 ml of tetrahydrofuran; 10 ml of 4M hydrogen chloride in diethyl ether are added, and stirring is carried out for 1 hour at room temperature. The product is filtered off and dried under a high vacuum. The dihydrochloride of the title compound is obtained. 1H-NMR (CD3OD, ppm): 8.4 (s, 1H); 7.60 (s, 1H), 7.5-7.10 (m, 9H), 5.65 (m, 1H), 5.18 (s, 2H), 4.32 (m, 1H), 4.00-3.65 (m, 4H), 2.60 (m, 2H).
EXAMPLE 24
(2R/4S)-4-(4-Amino-5-phenyl-pyrrolo[2,3-d]pyrimidin-7-yl)-1-(2,2-dimethyl-propionyl)-pyrrolidine-2-carboxylic acid ethyl ester
0.130 g of (2R/4S)-4-(4-benzyloxycarbonylamino-5-phenyl-pyrrolo[2,3-d]pyrimidin-7-yl)-1-(2,2-dimethyl-propionyl)-pyrrolidine-2-carboxylic acid ethyl ester is dissolved in 8 ml of methanol, and the solution is hydrogenated over 0.030 g of palladium-on-carbon (10%) for 1 hour at normal pressure. The catalyst is removed by filtration, the filtrate is concentrated by
Can you find the key molecules in this Novartis patent?
[Chemical nomenclature can be intimidating]
8
What compound is this?
[Figure: 2D chemical structure drawing]
Do you know what this chemical is?
entity identification
9
Valium (Trade Name)
= Diazepam (Generic Name)
= CAS # 439-14-5 (Chemical ID #)
ALBORAL, ALISEUM, ALUPRAM, AMIPROL, ANSIOLIN, ANSIOLISINA, APAURIN, APOZEPAM, ASSIVAL, ATENSINE, ATILEN, BIALZEPAM, CALMOCITENE, CALMPOSE, CERCINE, CEREGULART, CONDITION, DAP, DIACEPAN, DIAPAM, DIAZEMULS, DIAZEPAN, DIAZETARD, DIENPAX, DIPAM, DIPEZONA, DOMALIUM, DUKSEN, DUXEN, E-PAM, ERIDAN, EVACALM, FAUSTAN, FREUDAL, FRUSTAN, GIHITAN, HORIZON, KIATRIUM, LA-III, LEMBROL, LEVIUM, LIBERETAS, METHYL DIAZEPINONE, MOROSAN, NEUROLYTRIL, NOAN, NSC-77518, PACITRAN, PARANTEN, PAXATE, PAXEL, PLIDAN, QUETINIL, QUIATRIL, QUIEVITA, RELAMINAL, RELANIUM, RELAX, RENBORIN, RO 5-2807, S.A. R.L., SAROMET, SEDAPAM, SEDIPAM, SEDUKSEN, SEDUXEN, SERENACK, SERENAMIN, SERENZIN, SETONIL, SIBAZON, SONACON, STESOLID, STESOLIN, TENSOPAM, TRANIMUL, TRANQDYN, TRANQUASE, TRANQUIRIT, TRANQUO-TABLINEN, UMBRIUM, UNISEDIL, USEMPAX AP, VALEO, VALITRAN, VALRELEASE, VATRAN, VELIUM, VIVAL, VIVOL, WY-3467
=
Valium has > 149 “names”
Problem: I need to find information about Valium
nomenclature issues
10
There are many different chemical names for Valium
Valium = Diazepam =
7-CHLORO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE
7-CHLORO-1-METHYL-5-PHENYL-3H-1,4-BENZODIAZEPIN-2(1H)-ONE
7-CHLORO-1-METHYL-5-PHENYL-1,3-DIHYDRO-2H-1,4-BENZODIAZEPIN-2-ONE
7-CHLORO-1-METHYL-2-OXO-5-PHENYL-3H-1,4-BENZODIAZEPINE
1-METHYL-5-PHENYL-7-CHLORO-1,3-DIHYDRO-2H-1,4-BENZODIAZEPIN-2-ONE
7-CHLORO-1,3-DIHYDRO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE
7-CHLORO-1-METHYL-5-3H-1,4-BENZIODIAZEPIN-2(1H)-ONE
CAS # 439-14-5 =
entity identification
11
Problems of taxonomies and name normalization
Valium Taxonomies &
Dictionaries
Multiple documents contain Information about Valium
Diazepam
Sedapam
DIAPAM
Medline In-house database
Choose keywords
439-14-5(Chemical ID)
Chem. Abstracts
Pereira notebook 23a
7-CHLORO-1,3-DIHYDRO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE
Patent database
7-CHLORO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE
The scientist simply wants information about Valium
13
Considerations for searching documents (or web pages) for chemical substances
Chemicals have a wide variety of trivial and official names.
A plain text search for one name cannot find chemicals that a document refers to by one of their alternative names.
Synonym expansion is insufficient.
Searching by structure could be helpful.
Source J Cooper / IBM
Name normalization is important
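To make the name-normalization idea concrete, here is a minimal sketch that maps several of the Valium names shown on the earlier slides to the single CAS number 439-14-5. The dictionary contents come from those slides; the function name `normalize_name` and the lookup-table design are assumptions for illustration, not the actual IBM implementation.

```python
# Minimal sketch (assumed design): normalize alternative chemical names
# to one canonical identifier, here the CAS number quoted on the slides.

VALIUM_CAS = "439-14-5"

# A few of the names listed on the slides; a real dictionary would be far larger.
SYNONYM_TO_CAS = {
    "valium": VALIUM_CAS,
    "diazepam": VALIUM_CAS,
    "sedapam": VALIUM_CAS,
    "diapam": VALIUM_CAS,
    "7-chloro-1,3-dihydro-1-methyl-5-phenyl-2h-1,4-benzodiazepin-2-one": VALIUM_CAS,
}


def normalize_name(name):
    """Return the canonical CAS number for a name, or None if unknown."""
    key = " ".join(name.split()).casefold()  # collapse whitespace, ignore case
    return SYNONYM_TO_CAS.get(key)


if __name__ == "__main__":
    for raw in ["Valium", "DIAPAM", "  Diazepam ", "aspirin"]:
        print(raw.strip(), "->", normalize_name(raw))
```

Once every document mention is reduced to the same identifier, a search for any one name can retrieve documents that use any of the others.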
14
Finding similar structures, not just similar text!
In addition, we would like to find compounds that are supersets of the given structure.
For example: toluene and methylnaphthalene
Source J Cooper / IBM
Find documents with similar structures
Text searches will not find documents with similar structures
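The toluene / methylnaphthalene example can be sketched as a substructure test on molecular graphs. The code below is an illustrative assumption, not the system described in the talk: it uses hand-written heavy-atom graphs (hydrogens and bond orders ignored) and a simple backtracking search, where production systems would use a cheminformatics toolkit.

```python
# Minimal sketch (assumed approach): backtracking substructure search on
# heavy-atom graphs. A molecule is (list of element symbols, list of bonds).

def adjacency(n_atoms, bonds):
    """Build a list of neighbour sets from (i, j) bond pairs."""
    adj = [set() for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i].add(j)
        adj[j].add(i)
    return adj


def is_substructure(query, target):
    """True if every atom and bond of `query` can be mapped into `target`."""
    q_atoms, q_bonds = query
    t_atoms, t_bonds = target
    if len(q_atoms) > len(t_atoms):
        return False
    q_adj = adjacency(len(q_atoms), q_bonds)
    t_adj = adjacency(len(t_atoms), t_bonds)
    mapping = {}  # query atom index -> target atom index

    def extend(i):
        if i == len(q_atoms):
            return True
        for cand in range(len(t_atoms)):
            if cand in mapping.values() or t_atoms[cand] != q_atoms[i]:
                continue
            # Every already-mapped neighbour of i must be bonded to cand.
            if all(mapping[n] in t_adj[cand] for n in q_adj[i] if n in mapping):
                mapping[i] = cand
                if extend(i + 1):
                    return True
                del mapping[i]
        return False

    return extend(0)


RING = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
BENZENE = (["C"] * 6, RING)
TOLUENE = (["C"] * 7, RING + [(0, 6)])
# 1-methylnaphthalene: second ring fused on atoms 4 and 5, methyl on atom 0.
METHYLNAPHTHALENE = (
    ["C"] * 11,
    RING + [(4, 6), (6, 7), (7, 8), (8, 9), (9, 5), (0, 10)],
)

if __name__ == "__main__":
    print(is_substructure(TOLUENE, METHYLNAPHTHALENE))
    print(is_substructure(METHYLNAPHTHALENE, TOLUENE))
```

A text search for "toluene" would never return a document that mentions only methylnaphthalene, whereas the graph comparison finds the shared skeleton.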
15
Computer curation now involves multiple types of analysis
• Analysis of text
• Analysis of image
• Analysis of XML files
Derived Meta data
Internal data
IBM + Collaborator input
Output db to Collaborators
• Analysis of (CWU’s )
NIH
16
Paper Words
Chemical Names
Dictionary of the English Language minus the Dictionary of Desired Entities
toluene
[CC1=CC=CC=C1]
CH3
Name=Structure SMILES String
2D Structure
methyl benzene
Computational Resources
Blue Gene enabled
Summary of the whole text analytics operation for chemistry
Options to compute 300 properties per molecule
- Flowchart of the whole text analysis process
(HMM, CRF, CFG)
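One possible reading of the "Dictionary of the English Language minus the Dictionary of Desired Entities" step is a filter: words that remain in the reduced English dictionary are discarded, and everything else becomes a candidate chemical name. The sketch below illustrates that reading with tiny, made-up word lists; the names `ENGLISH_WORDS`, `DESIRED_ENTITIES` and `candidate_chemical_terms` are hypothetical.

```python
import re

# Minimal sketch (assumed interpretation of the slide): subtract the desired
# entities from an English dictionary, then keep every token of the paper
# that is NOT in the remaining dictionary.

# Toy stand-ins for the real dictionaries.
ENGLISH_WORDS = {
    "the", "residue", "was", "dissolved", "in", "toluene",
    "and", "treated", "with",
}
DESIRED_ENTITIES = {"toluene", "benzene", "hydrazine"}

# English dictionary minus the dictionary of desired entities.
STOP_WORDS = ENGLISH_WORDS - DESIRED_ENTITIES


def candidate_chemical_terms(text):
    """Return tokens that survive the dictionary subtraction, in order."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t for t in tokens if t.lower() not in STOP_WORDS]


if __name__ == "__main__":
    sentence = "The residue was dissolved in toluene and treated with hydrazine."
    print(candidate_chemical_terms(sentence))
```

The surviving candidates would then be passed to the statistical or grammar-based recognizers (HMM, CRF, CFG) named on the slide; that hand-off is not shown here.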
17
5-chloro-N-methyl-N-phthalimidoacetylanthranilic acid
N-aminoacetyl-5-chloro-N-methylanthranilic acid
Phosphorus pentachloride
aluminum chloride
hydrazine
7-chloro-1,3-dihydro-1-methyl-5-phenyl-2H-1,4-benzodiazepin-2-one
benzene
Chemical Entities Extracted from page
Step 1: Identify the chemical entities
Step 2: Extract the chemical names
Entity extraction
18
Name Structure Program
7-CHLORO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE
language-free entities
SMILES strings:
c1ccccc1
Connection tables:
  6  6  0  0  0  0  0  0  0  0999 V2000
    6.7092    5.6087    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.7076    4.5056    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.6607    3.9551    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.6160    4.5062    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.6121    5.6136    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.6583    6.1591    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  2  3  1  0  0  0  0
  3  4  2  0  0  0  0
  4  5  1  0  0  0  0
  5  6  2  0  0  0  0
  6  1  1  0  0  0  0
M  END
InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H
Step 3: Convert chemical names into chemical structures
Convert the chemicals into machine-readable formats!
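To show why a connection table counts as "machine readable", the sketch below reads the benzene table from this slide and recovers the atoms and bonds. It is a simplified illustration: the function name `parse_ctab` is hypothetical, only the counts line is read by fixed columns, and the header lines of a full MOL file are assumed to have been stripped already.

```python
from collections import Counter

# Minimal sketch: parse the V2000 connection table for benzene shown above.
BENZENE_CTAB = [
    "  6  6  0  0  0  0  0  0  0  0999 V2000",
    "    6.7092    5.6087    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    6.7076    4.5056    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    7.6607    3.9551    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    8.6160    4.5062    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    8.6121    5.6136    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    7.6583    6.1591    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  2  0  0  0  0",
    "  2  3  1  0  0  0  0",
    "  3  4  2  0  0  0  0",
    "  4  5  1  0  0  0  0",
    "  5  6  2  0  0  0  0",
    "  6  1  1  0  0  0  0",
    "M  END",
]


def parse_ctab(lines):
    """Return (atoms, bonds) from a counts line followed by atom/bond blocks."""
    counts = lines[0]
    # The first two 3-character fields hold the atom and bond counts.
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[1:1 + n_atoms]:
        x, y, _z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y)))
    bonds = []
    for line in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))
    return atoms, bonds


if __name__ == "__main__":
    atoms, bonds = parse_ctab(BENZENE_CTAB)
    print(Counter(symbol for symbol, _, _ in atoms))
    print(bonds)
```

The alternating bond orders (2, 1, 2, 1, 2, 1) are the Kekulé form of the aromatic ring that the SMILES string `c1ccccc1` writes in aromatic notation.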
19
Background information on InChIs
Source: Prof Peter Murray-Rust
20
IBM Servers
Medline
Patents
Web Pages
Any text
HealthCare Life Science Data warehouse
Valium
Benzene
11 Million patent documents
18 Million Medline abstracts
100 Million chemical structures
>12 Million unique
Step 4: Automate the process
Scale up and automate the process
21
Examples
Chemicals derived from text analytics
22
Large Computational Environment
Find and compute the 3D structures
Identify each disease
Identify every Medline MeSH code
Identify the occurrence of each biomarker
Equivalent to 240 K simultaneous Google searches
Data warehouse
Compute properties, & find relationships,
Chemical & Biological information derived from text analytics
23
Current Activities …
24
= Chemical
= Target
= Disease
= Assay data
Text [Text Annotation]
Annotated Text
Identify every chemical name
Convert all chemical names into their chemical structures [SMILES], then convert those SMILES into InChIs and InChIKeys (a unique identifier for the chemical)
Annotate / augment all chemical names with the term “inchikey” and the unique InChIKey for that chemical. The InChIKeys are now indexed as if they were words (text) in the document
Re-index the augmented text [InChIKeys] with SOLR
= aspirin = inchikey = BSYNRYMUTXBXSQ-UHFFFAOYSA-N
= aspirin = SMILES string = CC(=O)OC1=CC=CC=C1C(=O)O
dB SOLR index
Current activity: “in line” entity tagging & classification for chemical names
Text Index + Annotation Index
Add the derived annotations (& meta data) to our master database
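The in-line tagging step can be sketched as follows: each recognized chemical name is followed by the word "inchikey" and its InChIKey, so that a text indexer treats the key as an ordinary searchable token. This is an assumed illustration, not the production annotator; the function name `augment` and the small lookup table are hypothetical, and the name-to-structure conversion is replaced by a hard-coded entry for aspirin taken from the slide.

```python
import re

# Minimal sketch (assumed design): augment text with InChIKey tokens.
# In the real pipeline the key would come from name -> SMILES -> InChIKey
# conversion; here it is looked up in a tiny hand-made table.
NAME_TO_INCHIKEY = {
    "aspirin": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    # Synonym added for illustration; it resolves to the same compound.
    "acetylsalicylic acid": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
}


def augment(text, name_to_key=NAME_TO_INCHIKEY):
    """Insert 'inchikey <KEY>' after every known chemical name in the text."""
    # Longest names first so multi-word names win over shorter ones.
    names = sorted(name_to_key, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )

    def tag(match):
        key = name_to_key[match.group(1).lower()]
        return f"{match.group(1)} inchikey {key}"

    return pattern.sub(tag, text)


if __name__ == "__main__":
    print(augment("Aspirin reduces fever."))
```

Because both names receive the same key, a query on the InChIKey retrieves documents regardless of which synonym the author used.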
25
Aspirin
InChI = 1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)
InChI Key = BSYNRYMUTXBXSQ-UHFFFAOYSA-N
SMILES = O=C(Oc1ccccc1C(=O)O)C
MOL File
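As a small illustration of how these identifiers can be processed, the sketch below pulls the molecular formula out of the aspirin InChI, derives an approximate molecular weight, and checks that the InChIKey has the standard 14-10-1 block layout. The helper names and the short atomic-mass table are assumptions made for this example.

```python
import re

# Minimal sketch: read the formula layer of an InChI and sanity-check an InChIKey.
ASPIRIN_INCHI = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
ASPIRIN_INCHIKEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"

# Standard atomic masses for the few elements used here (rounded).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

# 14 letters, 10 letters, 1 letter, separated by hyphens.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def formula_from_inchi(inchi):
    """The formula is the layer right after the version prefix."""
    return inchi.split("/")[1]


def atom_counts(formula):
    """Turn 'C9H8O4' into {'C': 9, 'H': 8, 'O': 4}."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def molecular_weight(formula):
    return sum(ATOMIC_MASS[s] * n for s, n in atom_counts(formula).items())


if __name__ == "__main__":
    formula = formula_from_inchi(ASPIRIN_INCHI)
    print(formula, atom_counts(formula), round(molecular_weight(formula), 2))
    print(bool(INCHIKEY_PATTERN.match(ASPIRIN_INCHIKEY)))
```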
26
Mrv0541 03191312032D
 13 13  0  0  0  0            999 V2000
    1.4289    3.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.4289    2.4750    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7145    2.0625    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.7145    1.2375    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4289    0.8250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4289   -0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7145   -0.4125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.8250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7145    1.2375    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7145    2.0625    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.4289    0.8250    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.1434    2.0625    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  2  3  1  0  0  0  0
  3  4  1  0  0  0  0
  4  5  4  0  0  0  0
  5  6  4  0  0  0  0
  6  7  4  0  0  0  0
  7  8  4  0  0  0  0
  8  9  4  0  0  0  0
  4  9  4  0  0  0  0
  9 10  1  0  0  0  0
 10 11  2  0  0  0  0
 10 12  1  0  0  0  0
  2 13  1  0  0  0  0
M  END
Aspirin MOL file
27
= Chemical
= Target
= Disease
= Assay data
Text [Text + Annotations]
Identify all targets [gene names & their synonyms]
Augment all target names with a “tag = geneid” & the NCBI unique identifier # for that target
Re-index the augmented text + geneid identifiers with SOLR
= JAK3 + Aliases = geneid = NCBI ID # = 3718
dB SOLR index
Current activity: “in line” entity tagging & classification for targets (= geneids)
Annotated Text Index
Add the derived annotations (& meta data) to our master database
28
= Chemical
= Target
= Disease
= Assay data
Text [Text + Annotation]
Identify all known MeSH terms [for example, diseases (C01) or signs & symptoms (C23)]
Identify & augment every occurrence of every MeSH term with a “tag = MeSH” & the specific MeSH code identifier
Re-index the augmented text + the MeSH tags with SOLR
= Headache = MeSH term = C23 sign or symptom
dB SOLR index
Current activity: “in line” entity tagging & classification for MeSH terms
Text Index + Annotation Index
Text = Headache. New index of the original text plus all of its associated annotated information
Add the derived annotations (& meta data) to our master database
29
A Sample of Augmented Text
“Interactions of ibogaine and D-amphetamine:[ibmentity type="drug" name="amphetamine" value="amphetamine" chebitype="neurotoxin,toxin"] in vivo microdialysis and motor behavior in rats Ibogaine, an indolalkylamine, has been proposed for use in treating stimulant addiction. In the present study we sought to determine if ibogaine had any effects on the neurochemical and motor changes induced by D-amphetamine[ibmentity type="drug" name="amphetamine" value="amphetamine" chebitype="neurotoxin,toxin"] that would substantiate the anti-addictive claim. Ibogaine (40 mg/kg, i.p.) injected 19 h prior to a D-amphetamine[ibmentity type="drug" name="amphetamine" value="amphetamine" chebitype="neurotoxin,toxin"] challenge (1.25 mg/kg, i.p.) potentiated the expected rise in extracellular dopamine[ibmentity type="drug" name="dopamine" value="dopamine" chebitype="pharmacological role,neurotransmitter agent"] levels in the striatum[ibmentity type="target" name="striatum" value="striatum" targettype="tissue"] and in the nucleus accumbens, as measured by microdialysis in freely moving rats. Using …”
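The bracketed `ibmentity` annotations in the sample can be read back out with a couple of regular expressions. The sketch below is an assumption about how a consumer of this augmented text might parse it; the function names are hypothetical, and the attribute syntax is inferred only from the sample shown above.

```python
import re

# Minimal sketch: extract [ibmentity ...] annotations from augmented text
# and recover the plain text by stripping them.
TAG_PATTERN = re.compile(r"\[ibmentity ([^\]]*)\]")
ATTR_PATTERN = re.compile(r'(\w+)="([^"]*)"')


def extract_entities(augmented):
    """Return one dict of attributes per annotation, in document order."""
    return [dict(ATTR_PATTERN.findall(body)) for body in TAG_PATTERN.findall(augmented)]


def strip_annotations(augmented):
    """Remove the annotations, leaving the original sentence."""
    return TAG_PATTERN.sub("", augmented)


SAMPLE = (
    'a D-amphetamine[ibmentity type="drug" name="amphetamine" '
    'value="amphetamine" chebitype="neurotoxin,toxin"] challenge raised '
    'dopamine levels in the striatum[ibmentity type="target" name="striatum" '
    'value="striatum" targettype="tissue"]'
)

if __name__ == "__main__":
    for entity in extract_entities(SAMPLE):
        print(entity["type"], entity["name"])
    print(strip_annotations(SAMPLE))
```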
30
= Chemical “inchikey BSYNRYMUTXBXSQ-UHFFFAOYSA-N”
= Target
= Disease
= Assay data
Text
= Chemical
compound
Target 1
Target 2
Target 3
Target 1
Target 2
Target 3
= [target_gene name]
Target 4
Target 5
Compound – Target associations known from the literature
Compound – Target associations known from the SEA or other computations
dB
Overall Objective: Integrate [Compound – Target] associations derived from literature + computations + additional experimental efforts (HTS)
In-line text tagging (classification) coupled with computational & experimental data
NIH HTS Assay data
Compound – Target associations known from NIH or other experimental sources
31
Computer Curation Process Overview & integration with our collaborators
Data Sources
U.S. Patents (1976-2009)
U.S. Pre-Grants (All)
PCT & EPO Apps
Medline Abstracts (>18 M)
Selected Internet Content
In-House Content
View selected Documents & Reports
User Applications
Knime or Pipeline Pilot
BIW
SIMPLE
Chem Axon Search
Cognos/DDQB/Other Apps
Parse & Extract data
Annotator 1
Annotator 2
Annotation Factory
Database + computed Meta Data
eClassifier & Other Data Associations
Computational Analytics (Semantic Associations)
IP Database (e.g. DB2)
ADU*
* ADU = Automated Data Update
ChemVerse db
ChemVerse
Services Hosted at IBM Almaden
32
Examples:
why this is important, and what it lets us do that we could not easily do before
33
Batch Analysis
For example: you are about to file a patent application that contains ~300-400 chemical compounds. How do you know whether any of these (400+) compounds has been patented before?
34
Paste a list of InChIKeys to be batch searched here!
35
Input list of InChIKeys to be batch searched here!
1
2
Click run search!
36
Results from batch search of InChIKeys!
Diavan Glipazol Ibuprofen Aspirin Lotensin Imitrex Nabumetone Tessalon Sulfamethoxazole Trimethoprim Cyclobenzaprine Guaifenesin Oxymetazoline Anvitoff Dextromethorphan Lyrica Celexa
One can readily search hundreds or even thousands of compounds at a time to see whether any of them have already been patented, by whom, and for what purpose
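The batch search can be pictured as a lookup of each submitted InChIKey in an index built from the annotated patents. The sketch below is a toy illustration under that assumption: the index contents, the patent and assignee labels, and the function name `batch_search` are all hypothetical placeholders, and a real deployment would query the SOLR index instead of a Python dictionary.

```python
# Minimal sketch (assumed design): batch lookup of InChIKeys in a patent index.

# Hypothetical index: InChIKey -> records of patents mentioning the compound.
PATENT_INDEX = {
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N": [
        {"patent": "EXAMPLE-0001", "assignee": "Example Corp", "use": "analgesic"},
    ],
}


def batch_search(query_keys, patent_index):
    """Split the query keys into those found in the index and those not found."""
    hits, misses = {}, []
    for raw in query_keys:
        key = raw.strip().upper()  # tolerate pasted whitespace and lower case
        if not key:
            continue
        if key in patent_index:
            hits[key] = patent_index[key]
        else:
            misses.append(key)
    return hits, misses


if __name__ == "__main__":
    pasted = """
    BSYNRYMUTXBXSQ-UHFFFAOYSA-N
    AAAAAAAAAAAAAA-UHFFFAOYSA-N
    """.splitlines()  # second key is a made-up placeholder
    found, not_found = batch_search(pasted, PATENT_INDEX)
    print("already patented:", list(found))
    print("no hit:", not_found)
```

Since each key is a single dictionary lookup, the cost grows linearly with the number of compounds, which is what makes checking hundreds or thousands of compounds in one pass practical.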