RLIMS-P: A Rule-Based Literature Mining System for Protein Phosphorylation

Hu ZZ (1), Yuan X (1), Torii M (2), Vijay-Shanker K (3), and Wu CH (1)

(1) Protein Information Resource, (2) Department of Biostatistics, Bioinformatics, and Biomathematics, (4) Department of Computational Linguistics, Georgetown University, Washington, DC 20007; (3) University of Delaware, DE 19716

http://pir.georgetown.edu/iprolink/rlimsp

Contact: [email protected]

Protein phosphorylation and the objects extracted by RLIMS-P (diagram): an enzyme (e.g., MAP kinase) adds a phosphate group (P-group) to a P-site (e.g., Ser505) of a substrate (e.g., cPLA2), producing phosphorylated-cPLA2 (Ser-P).

<AGENT> Enzyme (kinase catalyzing the phosphorylation)

<THEME> Substrate (protein being phosphorylated)

<SITE> P-Site (amino acid residue being phosphorylated)

Annotation-tagged literature sets for PTMs from the iProLINK literature mining resource

Manual tagging assisted with computational extraction (evidence attribution for the 3 objects): training and testing sets of positive and negative samples for RLIMS-P development

Introduction: RLIMS-P is a rule-based text-mining program specifically designed to extract protein phosphorylation information (protein kinase, substrate, and phosphorylation site) from abstracts (Hu et al., 2005). The program was originally developed by Narayanaswamy, Ravikumar, and Vijay-Shanker (2005), and was tested and benchmarked by PIR using iProLINK annotated datasets (Hu et al., 2004). RLIMS-P has now been adopted at PIR and is being developed into an online text-mining tool for extracting protein phosphorylation information from PubMed literature (Yuan et al., 2006). The online RLIMS-P currently provides the following functions: 1) determine whether a MEDLINE abstract contains protein phosphorylation information and extract the protein kinase, protein substrate, and phosphorylation site/residue when available; 2) tag the extracted phosphorylation objects in the abstract in different colors; 3) map the protein substrate to UniProtKB protein entries based on PMID; 4) map protein names to UniProtKB protein entries based on BioThesaurus. Coupled with BioThesaurus, RLIMS-P can facilitate UniProtKB protein phosphorylation feature annotation.

RLIMS-P System Design

Input: Abstracts, Full-Length Texts
Preprocessing: Sentence extraction; Part of speech tagging
Entity Recognition: Acronym detection; Term recognition
Phrase Detection: Noun and verb group detection; Other syntactic structure detection
Semantic Type Classification
Relation Identification: Nominal level relation; Verbal level relation
Post-Processing
Output: Extracted Annotations, Tagged Abstracts

Pattern 1: <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?

ATR/FRP-1 also phosphorylated p53 in Ser 15
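
As a rough illustration of how a rule such as Pattern 1 might be matched, the sketch below approximates it with a single regular expression. It is a simplification under stated assumptions and not the RLIMS-P implementation: the <AGENT> and <THEME> slots are assumed to be single tokens, whereas a full system would operate on detected noun and verb groups. The function name `match_pattern1`, the adverb list, and the site format (Ser/Thr/Tyr followed by a number) are hypothetical choices made for this example.

```python
import re

# Simplified stand-in for Pattern 1 (assumption: one token per slot):
#   <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
PATTERN1 = re.compile(
    r"\b(?!(?:is|are|was|were|be|been|being)\b)"   # skip auxiliaries (passive voice)
    r"(?P<agent>[\w/-]+)\s+"                       # <AGENT>, e.g. ATR/FRP-1
    r"(?:(?:also|directly|specifically)\s+)?"      # optional adverb in the verb group
    r"phosphorylate[sd]?\s+"                       # active form of "phosphorylate"
    r"(?P<theme>[\w/-]+)"                          # <THEME>, e.g. p53
    r"(?:\s+(?:in|at)\s+"                          # optional "in/at <SITE>"
    r"(?P<site>(?:Ser|Thr|Tyr)[\s-]?\d+))?"        # <SITE>, e.g. Ser 15
)


def match_pattern1(sentence):
    """Return the AGENT/THEME/SITE slots of the first match, or None."""
    m = PATTERN1.search(sentence)
    if m is None:
        return None
    return {
        "agent": m.group("agent"),
        "theme": m.group("theme"),
        "site": m.group("site"),  # None when no site is stated
    }


if __name__ == "__main__":
    print(match_pattern1("ATR/FRP-1 also phosphorylated p53 in Ser 15"))
```

Because each slot is limited to one token, a multi-word kinase name such as "MAP kinase" would be truncated to its last word. That limitation is the reason a rule-based system would run entity recognition and noun group detection before applying such patterns.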

Training/benchmarking data sets and pattern rules can be downloaded.

Bioinformatics. 21:2759-65, 2005

Benchmarking of RLIMS-P: High recall for paper retrieval and high precision for information extraction

Web-based RLIMS-P

Information retrieval and extraction (panels A and B)

Protein entity mapping (panels C and D)

The online RLIMS-P text-mining results: (A) The summary table lists PMIDs with top-ranking phosphorylation annotation. (B) The full report provides detailed annotation results with evidence tagging and automatic mapping to UniProtKB entry containing the citation (e.g., KPB1_RABIT).

Name mapping of phosphorylated protein in RLIMS-P report (C) to UniProtKB entry using BioThesaurus (D). Name mapping includes options to use names appearing in the abstract or user-specified names to search online BioThesaurus. Here, “PBPA” retrieves 10 entries sharing the same name, including PBPA of Mycobacterium tuberculosis (P71586_MYCTU), the phosphorylated protein discussed in the abstract.

A preliminary case study – Using RLIMS-P to facilitate the UniProtKB feature annotation

Nuclear receptor (NR) phosphorylation was under-annotated in databases. Text mining of 2170 PubMed abstracts (retrieved with a query for NR phosphorylation) with RLIMS-P found significantly more phosphorylation sites that could be added to UniProtKB feature annotation.

Future development of RLIMS-P program:
• Extend to mine full-length articles
• Mine in vivo protein phosphorylation and its cellular context, such as cell types and pathways

References:
Hu ZZ, et al., Comp Biol Chem. 28:409-16, 2004.
Hu ZZ, et al., Bioinformatics. 21:2759-65, 2005.
Narayanaswamy M, et al., Bioinformatics. 21 Suppl. 1:i319-i327, 2005.
Yuan X, et al., Bioinformatics, April 27, 2006.

Acknowledgements: NIH (UniProt), NSF (Entity Tagging). PIR team: Wu HT, Fang C, Huang H, Arminski L. Collaborators: Liu H, Narayanaswamy M, Ravikumar KE.
