RLIMS-P: A Rule-Based Literature Mining System for Protein Phosphorylation
Hu ZZ1, Yuan X1, Torii M2, Vijay-Shanker K3, and Wu CH1
1Protein Information Resource, 2Department of Biostatistics, Bioinformatics, and Biomathematics, 4Department of Computational Linguistics, Georgetown University, Washington, DC 20007; 3University of Delaware, DE 19716
http://pir.georgetown.edu/iprolink/rlimsp
Contact:
[Figure: Phosphorylation objects extracted by RLIMS-P. <AGENT>: Enzyme (kinase catalyzing the phosphorylation; e.g., MAP kinase). <THEME>: Substrate (protein being phosphorylated; e.g., cPLA2, yielding phosphorylated-cPLA2). <SITE>: P-Site (amino acid residue being phosphorylated; e.g., Ser505).]
[Figure: RLIMS-P evidence attribution (3 objects). Manual tagging assisted with computational extraction: training and testing sets of positive and negative samples for RLIMS-P development. Annotation-tagged literature sets for PTMs come from the iProLINK literature mining resource.]
Introduction: RLIMS-P is a rule-based text-mining program designed specifically to extract protein phosphorylation information (protein kinase, substrate, and phosphorylation site) from abstracts (Hu et al., 2005). The program was originally developed by Narayanaswamy, Ravikumar, and Vijay-Shanker (2005) and was tested and benchmarked at PIR using iProLINK annotated datasets (Hu et al., 2004). RLIMS-P has now been adopted at PIR and is being developed into an online text-mining tool for extracting protein phosphorylation information from PubMed literature (Yuan et al., 2006). The online RLIMS-P currently provides the following functions:
1) determine whether a MEDLINE abstract contains protein phosphorylation information, and extract the protein kinase, protein substrate, and phosphorylation site/residue when available;
2) tag the extracted phosphorylation objects in the abstract in different colors;
3) map the protein substrate to UniProtKB protein entries based on PMID;
4) map protein names to UniProtKB protein entries based on BioThesaurus.
Coupled with BioThesaurus, RLIMS-P can facilitate UniProtKB protein phosphorylation feature annotation.
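As a rough illustration of function 2 (color tagging), the sketch below wraps extracted objects in colored HTML spans. It is not the RLIMS-P implementation: the `PhosphoObject` type, the `tag_abstract` function, and the color assignments are all hypothetical, and the spans are assumed not to overlap.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class PhosphoObject:
    """Hypothetical record for one extracted object (not RLIMS-P's own type)."""
    role: str   # "AGENT", "THEME", or "SITE"
    start: int  # character offset where the object starts in the abstract
    end: int    # character offset just past the end of the object


# Assumed color scheme; the colors used by the online tool are not stated here.
COLORS = {"AGENT": "green", "THEME": "blue", "SITE": "red"}


def tag_abstract(text, objects):
    """Return the abstract as HTML with each object wrapped in a colored span.

    Assumes the objects do not overlap.
    """
    pieces = []
    pos = 0
    for obj in sorted(objects, key=lambda o: o.start):
        pieces.append(escape(text[pos:obj.start]))  # untagged text before the object
        pieces.append(
            f'<span style="color:{COLORS[obj.role]}">'
            f"{escape(text[obj.start:obj.end])}</span>"
        )
        pos = obj.end
    pieces.append(escape(text[pos:]))  # remaining untagged text
    return "".join(pieces)
```

Offsets rather than strings identify each object, so the same protein name can be tagged differently at different positions in the abstract.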
RLIMS-P System Design
[Figure: processing pipeline. Input: abstracts or full-length texts. Stages: Preprocessing (sentence extraction, part-of-speech tagging); Entity Recognition (acronym detection, term recognition); Phrase Detection (noun and verb group detection, other syntactic structure detection); Semantic Type Classification; Relation Identification (nominal-level relation, verbal-level relation); Post-Processing. Output: extracted annotations and tagged abstracts.]
Pattern 1: <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
ATR/FRP-1 also phosphorylated p53 in Ser 15
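Pattern 1 can be approximated with a regular expression. This is a deliberately simplified stand-in: RLIMS-P matches over noun and verb groups produced by shallow parsing, whereas this sketch matches raw tokens. The names `PATTERN_1` and `match_pattern_1`, the single-token <AGENT> and <THEME>, and the Ser/Thr/Tyr site format are assumptions made for illustration.

```python
import re

# Hypothetical simplification of Pattern 1:
#   <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
PATTERN_1 = re.compile(
    r"(?P<agent>\S+)\s+"        # <AGENT>: a single token in this sketch
    r"(?:\w+\s+)*?"             # optional words inside the verb group ("also")
    r"phosphorylate[sd]?\s+"    # verb: phosphorylate / phosphorylates / phosphorylated
    r"(?P<theme>\S+)"           # <THEME>: a single token in this sketch
    r"(?:\s+(?:in|at)\s+(?P<site>(?:Ser|Thr|Tyr)[\s-]?\d+))?"  # optional <SITE>
)


def match_pattern_1(sentence):
    """Return the AGENT/THEME/SITE slots for the first match, or None."""
    m = PATTERN_1.search(sentence)
    if m is None:
        return None
    return {
        "AGENT": m.group("agent"),
        "THEME": m.group("theme").rstrip(".,;"),  # drop trailing punctuation
        "SITE": m.group("site"),                  # None when no site is given
    }


print(match_pattern_1("ATR/FRP-1 also phosphorylated p53 in Ser 15"))
```

The sketch does not check voice, so a passive sentence such as "p53 was phosphorylated by ATR" would fill the slots incorrectly, and multi-word names such as "MAP kinase" are truncated to one token. Handling both cases is what the noun and verb group detection step is for.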
Training/benchmarking data sets and pattern rules can be downloaded.
Bioinformatics. 21:2759-65, 2005
Benchmarking of RLIMS-P: high recall for paper retrieval and high precision for information extraction
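For reference, recall and precision follow the standard definitions, sketched below. The function name `precision_recall` is hypothetical, and the counts in the usage line are made up for illustration; they are not the RLIMS-P benchmark results.

```python
def precision_recall(tp, fp, fn):
    """Compute (precision, recall) from true/false positive and false negative counts.

    precision = TP / (TP + FP): fraction of extracted items that are correct.
    recall    = TP / (TP + FN): fraction of correct items that were extracted.
    Returns 0.0 for a measure whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts only, not taken from the benchmark.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
print(f"precision={precision:.2f} recall={recall:.2f}")
```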
Web-based RLIMS-P
[Figure panels: (A, B) information retrieval and extraction; (C, D) protein entity mapping.]
The online RLIMS-P text-mining results: (A) The summary table lists PMIDs with top-ranking phosphorylation annotation. (B) The full report provides detailed annotation results with evidence tagging and automatic mapping to UniProtKB entry containing the citation (e.g., KPB1_RABIT).
Name mapping of phosphorylated protein in RLIMS-P report (C) to UniProtKB entry using BioThesaurus (D). Name mapping includes options to use names appearing in the abstract or user-specified names to search online BioThesaurus. Here, “PBPA” retrieves 10 entries sharing the same name, including PBPA of Mycobacterium tuberculosis (P71586_MYCTU), the phosphorylated protein discussed in the abstract.
A preliminary case study – Using RLIMS-P to facilitate the UniProtKB feature annotation
Nuclear receptor (NR) phosphorylation was under-annotated in databases. Text-mining of 2170 PubMed abstracts (retrieved with query of NR phosphorylation) with RLIMS-P found significantly more phosphorylation sites to add to UniProt feature annotation.
Future development of RLIMS-P program:
• Extend to mine full-length articles
• Mine in vivo protein phosphorylation and its cellular context, such as cell types and pathways
References:
Hu ZZ, et al., Comp Biol Chem. 28:409-16, 2004.
Hu ZZ, et al., Bioinformatics. 21:2759-65, 2005.
Narayanaswamy M, et al., Bioinformatics. 21(Suppl. 1):i319-i327, 2005.
Yuan X, et al., Bioinformatics, April 27, 2006.
Acknowledgements: NIH (UniProt), NSF (Entity Tagging). PIR team: Wu HT, Fang C, Huang H, Arminski L. Collaborators: Liu H, Narayanaswamy M, Ravikumar KE.