RLIMS-P: A Rule-Based Literature Mining System for Protein Phosphorylation

Hu ZZ (1), Yuan X (1), Torii M (2), Vijay-Shanker K (3), and Wu CH (1)

(1) Protein Information Resource, (2) Department of Biostatistics, Bioinformatics, and Biomathematics, (4) Department of Computational Linguistics, Georgetown University, Washington, DC 20007; (3) University of Delaware, DE 19716

http://pir.georgetown.edu/iprolink/rlimsp

Contact: [email protected]

[Figure: Phosphorylation of a substrate (e.g., cPLA2) by an enzyme (e.g., MAP kinase) adds a P-group at the P-site (e.g., Ser505), giving phosphorylated-cPLA2 (Ser-P).]

RLIMS-P extracts 3 objects:

• <AGENT> Enzyme (kinase catalyzing the phosphorylation)

• <THEME> Substrate (protein being phosphorylated)

• <SITE> P-Site (amino acid residue being phosphorylated)

Evidence attribution

Manual tagging assisted with computational extraction: training and testing sets of positive and negative samples for RLIMS-P development

Annotation-tagged literature sets for PTMs from the iProLINK literature mining resource
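The three extracted objects can be pictured as one small record per phosphorylation event. The following is a minimal sketch only: the class name `PhosphorylationAnnotation` and its fields are hypothetical and are not taken from RLIMS-P itself. Every field is optional because the poster states that kinase, substrate and site are extracted "when available".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhosphorylationAnnotation:
    """Hypothetical container for the three objects (not the RLIMS-P data model)."""
    agent: Optional[str] = None  # <AGENT>: kinase catalyzing the phosphorylation
    theme: Optional[str] = None  # <THEME>: protein being phosphorylated
    site: Optional[str] = None   # <SITE>: amino acid residue being phosphorylated

    def is_complete(self) -> bool:
        # True only when all three objects were found in the text.
        return None not in (self.agent, self.theme, self.site)


# The example from the figure: MAP kinase phosphorylates cPLA2 at Ser505.
example = PhosphorylationAnnotation(agent="MAP kinase", theme="cPLA2", site="Ser505")
print(example.is_complete())
```

A partially filled record, such as one with only a substrate, stays representable, which matches abstracts that mention a phosphorylated protein without naming the kinase or residue.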

Introduction: RLIMS-P is a rule-based text-mining program specifically designed to extract protein phosphorylation information (protein kinase, substrate and phosphorylation sites) from abstracts (Hu et al., 2005). The program was originally developed by Narayanaswamy, Ravikumar, and Vijay-Shanker (2005), and was tested and benchmarked by PIR using iProLINK annotated datasets (Hu et al., 2004). RLIMS-P has now been adopted at PIR and is being developed into an online text-mining tool for extracting protein phosphorylation information from PubMed literature (Yuan et al., 2006). The online RLIMS-P currently provides the following functions: 1) determine whether a MEDLINE abstract contains protein phosphorylation information, and extract the protein kinase, protein substrate and phosphorylation site/residue when available; 2) tag the extracted phosphorylation objects in the abstract in different colors; 3) map the protein substrate to UniProtKB protein entries based on PMID; 4) map protein names to UniProtKB protein entries based on BioThesaurus. Coupled with BioThesaurus, RLIMS-P can facilitate UniProtKB protein phosphorylation feature annotation.

RLIMS-P System Design

[Figure: RLIMS-P processing pipeline]

Input: Abstracts, Full-Length Texts

• Preprocessing: sentence extraction, part-of-speech tagging

• Entity Recognition: acronym detection, term recognition

• Phrase Detection: noun and verb group detection, other syntactic structure detection

• Semantic Type Classification

• Relation Identification: nominal level relation, verbal level relation

• Post-Processing

Output: Extracted Annotations, Tagged Abstracts

Pattern 1: <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?

ATR/FRP-1 also phosphorylated p53 in Ser 15
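To make Pattern 1 concrete, here is a minimal sketch that approximates it with a regular expression. This is an assumption-laden simplification, not the RLIMS-P rule engine: the real system matches over detected noun and verb groups, whereas this sketch treats agent and theme as single tokens, allows one optional adverb before the verb, and recognizes only Ser/Thr/Tyr sites. The names `PATTERN_1` and `match_pattern_1` are hypothetical.

```python
import re
from typing import Dict, Optional

# Simplified stand-in for:
#   <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
PATTERN_1 = re.compile(
    r"(?P<agent>[\w/-]+)\s+"                  # <AGENT>: one token (assumption)
    r"(?:\w+\s+)?phosphorylate[sd]?\s+"       # active verb group, optional adverb
    r"(?P<theme>[\w/-]+)"                     # <THEME>: one token (assumption)
    r"(?:\s+(?:in|at)\s+(?P<site>(?:Ser|Thr|Tyr)\s?\d+))?"  # optional <SITE>
)


def match_pattern_1(sentence: str) -> Optional[Dict[str, Optional[str]]]:
    """Return agent/theme/site for the first match, or None if nothing matches."""
    m = PATTERN_1.search(sentence)
    return m.groupdict() if m else None


print(match_pattern_1("ATR/FRP-1 also phosphorylated p53 in Ser 15"))
# {'agent': 'ATR/FRP-1', 'theme': 'p53', 'site': 'Ser 15'}
```

Because agent and theme are limited to one token here, a multi-word kinase name such as "MAP kinase" would be truncated; handling that is what the phrase detection stage of a real pipeline is for.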

Training/benchmarking data sets and pattern rules can be downloaded.

Bioinformatics. 21:2759-65, 2005

Benchmarking of RLIMS-P: high recall for paper retrieval and high precision for information extraction
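For reference, recall and precision are the standard ratios computed from true positives, false positives and false negatives. The sketch below shows the arithmetic only; the function name `precision_recall` is hypothetical, and the counts in the usage line are made-up illustrative numbers, not RLIMS-P benchmark results.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Compute (precision, recall) from raw counts.

    precision = tp / (tp + fp): fraction of extracted items that are correct.
    recall    = tp / (tp + fn): fraction of true items that were extracted.
    Returns 0.0 for a ratio whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts only (NOT from the poster or the benchmark paper).
print(precision_recall(tp=8, fp=2, fn=2))
# (0.8, 0.8)
```

For paper retrieval the counted items would be abstracts; for information extraction they would be individual kinase, substrate or site annotations.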

Web-based RLIMS-P

[Figure: screenshots of the web-based RLIMS-P. Panels A and B: information retrieval and extraction. Panels C and D: protein entity mapping.]

The online RLIMS-P text-mining results: (A) The summary table lists PMIDs with top-ranking phosphorylation annotation. (B) The full report provides detailed annotation results with evidence tagging and automatic mapping to the UniProtKB entry containing the citation (e.g., KPB1_RABIT).

Name mapping of the phosphorylated protein in the RLIMS-P report (C) to a UniProtKB entry using BioThesaurus (D). Name mapping includes options to search the online BioThesaurus with names appearing in the abstract or with user-specified names. Here, “PBPA” retrieves 10 entries sharing the same name, including PBPA of Mycobacterium tuberculosis (P71586_MYCTU), the phosphorylated protein discussed in the abstract.

A preliminary case study – Using RLIMS-P to facilitate the UniProtKB feature annotation

Nuclear receptor (NR) phosphorylation was under-annotated in databases. Text-mining of 2170 PubMed abstracts (retrieved with query of NR phosphorylation) with RLIMS-P found significantly more phosphorylation sites to add to UniProt feature annotation.

Future development of RLIMS-P program:

• Extend to mine full-length articles

• Mine in vivo protein phosphorylation and its cellular context, such as cell types and pathways

References:

Hu ZZ, et al., Comp Biol Chem. 28:409-16, 2004.

Hu ZZ, et al., Bioinformatics. 21:2759-65, 2005.

Narayanaswamy M, et al., Bioinformatics. 21 Suppl 1:i319-i327, 2005.

Yuan X, et al., Bioinformatics, April 27, 2006.

Acknowledgements: NIH (UniProt), NSF (Entity Tagging). PIR team: Wu HT, Fang C, Huang H, Arminski L. Collaborators: Liu H, Narayanaswamy M, Ravikumar KE.