26
Extracting Semantic Predication from Medline Citations for Pharm acogenomics C.B. Ahlers 1 , M. Fiszman 2 , D.D. Fushman 1 , F.M. Lang 1 and T.C. Rindflesch 1 1 National Center for Biomedical Communications, National Library of Medicine 2 University of Tennessee, USA (PSB 2007 12:209-220)

Extracting Semantic Predication from Medline Citations for Pharmacogenomics

  • Upload
    aurora

  • View
    54

  • Download
    0

Embed Size (px)

DESCRIPTION

Extracting Semantic Predication from Medline Citations for Pharmacogenomics. C.B. Ahlers 1 , M. Fiszman 2 , D.D. Fushman 1 , F.M. Lang 1 and T.C. Rindflesch 1 1 National Center for Biomedical Communications, National Library of Medicine 2 University of Tennessee, USA (PSB 2007 12:209-220). - PowerPoint PPT Presentation

Citation preview

Page 1: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

Extracting Semantic Predication from Medline Citations for Pharmacogenomics

C.B. Ahlers1, M. Fiszman2, D.D. Fushman1, F.M. Lang1 and T.C.

Rindflesch1 1National Center for Biomedical

Communications, National Library of Medicine

2University of Tennessee, USA(PSB 2007 12:209-220)

Page 2: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

2/26

Abstract This paper describes a NLP system (Enh

anced SemRep) to identify core assertions on pharmacogenomics ( 基因藥理學 ) in Medline.

The development of the system is based on the adaptation of an existing system and depends on UMLS.

Preliminary evaluation: 55% recall and 73% precision.

Page 3: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

3/26

1. Introduction (1/3) Core research in pharmacogenomics investig

ates the interaction of genes/proteins with therapeutic substances. E.g. treatment of oncology( 腫瘤學 ).

Current NLP for pharmacogenomics concentrates on co-occurrence information without specifying exact relations.

Enhanced SemRep complements that approach by representing assertions in text as semantic predications.

Page 4: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

4/26

1. Introduction (2/3) Example

These findings therefore demonstrate that dexamethasone ( 皮質類固醇 ) is a potent inducer of multidrug resistance-associated protein ( 多抗藥性蛋白質 ) expression in rat hepatocytes ( 肝細胞 ) through a mechanism that seems not to involve the classical glucocorticoid receptor ( 糖皮質激素受體 ) pathway. 1. Dexamethasone STIMULATES Multidrug Resistance-Ass

ociated Proteins 2. Dexamethasone NEG_INTERACTS_WITH Glucocorticoid

receptor 3. Multidrug Resistance-Associated Proteins PART_OF Rat

s 4. Hepatocytes PART_OF Rats

Page 5: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

5/26

1. Introduction (3/3) Based on two existing systems

SemRep: extract semantic predications from clinical text.

SemGen: developed from SemRep to identify etiologic ( 病因的 ) relations between genetic phenomena and diseases.

Relations Genes, drugs, diseases, and population groups. At the gene level, no more specific genetic pheno

mena ( e.g. mutations, single nucleotide polymorphisms, and haplotype information).

Page 6: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

6/26

2. Background NLP for Biomedicine The Unified Medical Language System SemRep and SemGen

Page 7: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

7/26

2.1 NLP for Biomedicine (1/2)

Co-occurrence of entities in text (gene-disease relations, Yen et al., 2006; drug-gene, Rindflesch et al., 2000).

Machine learning techniques (gene-disease relations, Chun et al., 2006; drug-gene, Chang et al., 2004).

Syntactic templates and shallow parsing (protein interactions, Blaschke et al., 1999)

Page 8: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

8/26

2.1 NLP for Biomedicine (2/2)

Enhanced SemRep addresses a wide range of syntactic structures and specific semantic relations pertinent to pharmacogenomics.

Example STIMULATES DISRUPTS CAUSES

Page 9: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

9/26

2.2 UMLS Metathesaurus (more than 106 concepts)

Concept: fever; Synonyms: pyrexia, febrile, hyperthermia; Semantic Type: ‘Finding’

Semantic types represent allowable relationships between concepts ‘Gene or Genome’ PART_OF ‘Cell’ ‘Pharmacologic Substance’ INTERACTS_WITH

‘Enzyme’ ‘Disease or Syndrome’ CO-OCCURS_WITH ‘Ne

oplastic Process’ ( 腫瘤突起 )

Page 10: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

10/26

2.3 SemRep and SemGen (1/2)

SemRep: a rule-based symbolic NLP system. Example

Phenytoin ( 二苯妥因 ) induced gingival hyperplasia ( 齒齦增生 )

[[head(noun(phenytoin)), metaconc(‘Phenytoin’:[orch,phsu]))], [verb(induced)], [head(noun([‘gingival hyperplasia’)), metaconc(‘Gingival Hyperplasia’:[dsyn]))]]

‘Pharmacological Substance’ CAUSES ‘Disease or Syndrome’

Phenytoin CAUSES Gingival Hyperplasia

Pharmacological Substance

Disease or Syndrome

Semantic Network relation/ argument identification

Page 11: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

11/26

2.3 SemRep and SemGen (2/2)

SemGen: identify semantic predications on the genetic etiology of disease.

Gene and protein name: ABGene. Since UMLS Semantic Network does not cover

molecular genetics, semantic relations are created: Gene-disease interactions: (ASSOCIATE_WITH, PRE

DISPOSE( 易感染的 ), and CAUSE) Gene-gene interactions: (INHIBIT, STIMULATE, and

INTERACTS_WITH)

Page 12: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

12/26

3. Methods (1/2) Scrutiny of the pharmacogenomics literature t

o identify relevant predications not identified by either SemRep or SemGen. 1000 Medline were retrieved containing drug and g

ene names. 400 sentences were selected, including genetic (ge

ne-disease), genomic (gene-gene), and pharmacogenomic (drug-gene, drug-genome) relations; in addition relations between genes and population groups; disease and population groups; and pharmacological relations (drug-disease, drug-pharmacological effect, drug-drug) were scrutinized.

Page 13: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

13/26

3. Methods (2/2) After processing these 400 sentences with Se

mRep, errors were analyzed and categorized for etiology. The majority of errors

The Semantic Network Errors in argument identification due to “empty” heads Gene name identification

Extensive modifications for Enhanced SemRep. Gene name identification was addressed by addin

g ABGene to the machinery.

Page 14: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

14/26

3.1 Modification of Semantic Network for Enhanced SemRep (1/4)

Grouping semantic types: Five broader semantic groups (Substance, Anatomy, Living Being, Process, and Pathology) were defined to permit predications relevant to pharmacogenomics. Substance: ‘Amino Acid, Peptide, or Protein’, ‘Antib

iotic’( 抗生素 ), ‘Carbohydrate’( 碳水化合物 ), ... Anatomy: ‘Anatomical Structure’( 解剖學構造 ), ‘Bod

y Part, Organ, or Organ Component’, ‘Cell’, ‘Gene or Genome’, ‘Neoplastic Process’, ‘Tissue’ …

Page 15: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

15/26

3.1 Modification of Semantic Network for Enhanced SemRep (2/4)

Living Being: ‘Animal’, ‘Archaeon’( 第三類有機體 ), ‘Bacterium’, ‘Fungus’( 真菌 ), ‘Human’, ‘Invertebrate’( 無脊椎動物 ), ‘Mammal’, ‘Organism’, ‘Vertebrate’, ‘Virus’

Process: ‘Acquired Abnormality’( 後天異常 ), ‘Anatomical Abnormality’, ‘Cell Function’, ‘Cell or Molecular Dysfunction’( 機能障礙 ), ‘Congenital Abnormality’( 先天性異常 ), ‘Laboratory Test Result’…

Pathology: ‘Acquired Abnormality’, ‘Anatomical Abnormality’, ‘Cell or Molecular Dysfunction’, ‘Congenital Abnormality’, ‘Disease or Syndrome’, ‘Injury or Poisoning’, Mental or Behavioral Disorder’( 心理及行為障礙 ), …

Page 16: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

16/26

3.1 Modification of Semantic Network for Enhanced SemRep (3/4)

Define predications: categories 1-6 1: Genetic Etiology ( 基因病理學 )

{Substance} ASSOCIATED_WITH OR PREDISPOSES OR CAUSES {Pathology}

2: Substance Relations {Substance} INTERACTS_WITH OR INHIBITS OR

STIMULATES {Substance} 3: Pharmacological Effects

{Substance} AFFECTS OR DISRUPTS OR AUGMENTS {Anatomy OR Process}

Page 17: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

17/26

3.1 Modification of Semantic Network for Enhanced SemRep (4/4)

4: Clinical Actions {Substance} ADMINISTERED_TO {Living Being} {Process} MANIFESTATION_OF {Process} {Substance} TREATS {Living Being OR Pathology}

5: Organism Characteristics {Anatomy OR Living Being} LOCATION_OF

{Substance} {Anatomy} PART_OF {Anatomy OR Living Being} {Process} PROCESS_OF {Living Being}

6: Co-existence {Substance} CO-EXISTS_WITH {Substance} {Process} CO-EXISTS_WITH {Process}

Page 18: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

18/26

3.2 Empty Heads Example:

We saw differential activation of CYP2C9 variants by dapsone( 藥:氨苯 ).

“Variant” is ‘Qualitative Concept’. We want CYP2C9 variant be a member of the Subst

ance group. Enumerate several categories of terms as sem

antically empty heads, e.g. allele ( 等位基因 ), mutation, variant, levels, expression…

Words from these lists that have been labeled as heads are hidden and the word to their left is relabeled as heads.

Page 19: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

19/26

3.3 Evaluation Test 300 sentences which are randomly gener

ated from the set of 36,577 sentences containing drug and gene co-occurrences found on the Web-site. (bionlp.stanford.edu/genedrug)

These sentences were annotated by three physicians (CBA, DD-F, MF).

They did not mark up all assertions in the sentences, only those representing a predication defined in Enhanced SemRep.

A total of 850 predications were assigned by the annotators.

Page 20: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

20/26

4. ResultsCategory Recall Precisio

nGenetic Etilogy (associated_with, cause, predispose)

74% 74%

Substance Relations (interact_with, inhibit, stimulate)

50% 73%

Pharmacological Effects (affect, disrupt, augment)

41% 68%

Clinical Actions (administered_to, manifestataion_of, treat)

54% 84%

Organism Characteristics (location_of, part_of, process_of)

63% 71%

Total 55% 73%

Page 21: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

21/26

5.1 Discussion: Error Analysis (1/2)

Word sense ambiguity (28%) Ticlopidine ( 血小板抑制劑 ) inhibition of ph

enytoin ( 二苯妥因 ) metabolism mediated by potent inhibition of CYP2C19 ( 基因 ).

Inhibition wrongly mapped to ‘Psychological Inhibition’.

CYP2C19 AFFECTS Psychological Inhibition.

Page 22: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

22/26

5.1 Discussion: Error Analysis (2/2)

Process coordinate structures (35%) The cytotoxic ( 細胞毒素 ) activities of merca

ptopurine ( 藥:胇基嘌呤 ) and fluorouracil ( 抗腫瘤代謝藥物 ) are regulated by thiopurine methyltransferase (TPMT) and dihydropyrimidine dehydrogenase (DPD), respectively.

Fluorouracil INTERACTS_WITH DPD gene. (○) mercaptopurine INTERACTS_WITH thiopurine m

ethyltransferase. (X)

Page 23: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

23/26

5.2 Process Medline Citations on CYP2D6 (1/3)

2849 Medline citations contain variant forms of CYP2D6.

5219 predications containing CYP2D6 as an argument were analyzed according to two predication categories (Genetic Etiology and Substance Relations).

Compare with relations listed for this gene on the PharmGKB Web site (PharmacoGenetic Knowledge Base).

Page 24: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

24/26

5.2 Process Medline Citations on CYP2D6 (2/3)

Genetic Etiology 267 total predications represented CYP2D6 as an et

iologic agent for a disease. Parkinson’s disease ( 帕金森氏症 ) (35), carcinoma

of the lung ( 肺癌 ) (21), tardive dyskinesia ( 遲發性不自主運動 ) (15), Alzheimer’s disease ( 阿茲海默症 ) (9), bladder carcinoma ( 膀胱癌 ) (8).

169 TP, and 4 FP, two were found not to contain the disease name in the referenced citation.

Only carcinoma of the lung occurs in PharmGKB.

Page 25: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

25/26

5.2 Process Medline Citations on CYP2D6 (3/3)

Substance Relations 1128 total predications involve CYP2D6 and a drug. 69 drugs occurred 3 or more times in those predications wher

e 41 drugs were in PharmGKB and 28 were not. 68 were true positives. Inhibit CYP2D6: quinidine (45), paroxetine (34), fluoxetine (2

7), fluvoxamine (8), sertraline (8). Quinidine and sertraline are not in PharmGKB.

Interact_with CYP2D6: bufuralol (27), antipsychotic agents (25) dextromethorphan (21), venlafaxine (19), debrisoquin (18).

Bufuralol is not in PharmGKB. SemRep failed to capture: cocaine, levomepromazine, mapro

tiline, trazodone, and yohimbine.

Page 26: Extracting Semantic Predication from Medline Citations for Pharmacogenomics

26/26

6. Conclusion This paper applies an existing NLP system in the phar

macogenomics domain. The major changes for developing Enhanced SemRep

from SemRep involved modifying the semantic space stipulated by the UMLS Semantic Network.

The outputs are semantic predications that represent assertions from Medline citations expressing a range of specific relations in pharmacogenomics.

The information can support advanced information management applications for pharmacogenomics research and clinical care.

In the future, authors intend to adapt the summarization and visualization techniques developed for clinical text.