ROC 2008 meeting, 7/27/2008

Center for Computational Intelligence, Learning, and Discovery

Bioinformatics and Computational Biology Program

A Computational Method to Identify Amino Acid Residues in RNA-protein Interactions

Michael Terribilini & Jae-Hyung Lee

Cornelia Caragea, Deepak Reyon, Ben Lewis, Jeffry Sander, Robert Jernigan, Vasant Honavar and Drena Dobbs

L.H. Baker Center for Bioinformatics & Biological Statistics

BCB, NSF IGERT


PROBLEM: Given the sequence of a protein (& possibly its structure), predict which amino acids participate in protein-RNA interactions

APPROACH: Generate datasets of known complexes from PDB to train & test machine learning algorithms (Naïve Bayes, SVM, etc.)

GOAL: Classify each amino acid in target protein as either interface or non-interface residue

Guiding hypothesis: Principal determinants of protein binding sites are reflected in local sequence features

Observation: Binding site residues are often clustered within the primary amino acid sequence


Sequence-Based Classifier:
• RB181 non-redundant dataset: 181 protein-RNA complexes from the PDB
• Input: window of amino acid identities centered on target & contiguous in protein sequence
• Classifier: Naïve Bayes
• Leave-one-out cross validation

Example input window for target Ser 28: QSVSTSSFRYM
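The window input and Naïve Bayes classification described for the sequence-based classifier can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 11-residue window width is inferred from the example QSVSTSSFRYM, while the padding symbol, the add-one smoothing, the toy training labels, and the names `sequence_windows` and `WindowNaiveBayes` are hypothetical.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus an assumed padding symbol


def sequence_windows(sequence, half_width=5, pad="-"):
    """Return one window of 2*half_width + 1 residues centered on each position."""
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]


class WindowNaiveBayes:
    """Naive Bayes over window positions, with add-one smoothing (an assumption)."""

    def fit(self, windows, labels):
        self.class_counts = Counter(labels)
        # counts[label][position][residue] = number of training windows
        self.counts = defaultdict(lambda: defaultdict(Counter))
        for window, label in zip(windows, labels):
            for position, residue in enumerate(window):
                self.counts[label][position][residue] += 1
        return self

    def log_score(self, window, label):
        n_label = self.class_counts[label]
        score = math.log(n_label / sum(self.class_counts.values()))
        for position, residue in enumerate(window):
            seen = self.counts[label][position][residue]
            score += math.log((seen + 1) / (n_label + len(ALPHABET)))
        return score

    def predict(self, window):
        return max(self.class_counts, key=lambda label: self.log_score(window, label))


windows = sequence_windows("QSVSTSSFRYM")
print(windows[5])  # window centered on the sixth residue

# Toy training set with hypothetical labels (1 = interface, 0 = non-interface).
toy_model = WindowNaiveBayes().fit(["KRK", "RRK", "AGA", "GLA"], [1, 1, 0, 0])
print(toy_model.predict("KRR"))
```

Each residue in a protein gets its own window, so a protein of length L yields L classification examples.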

Structure-Based Classifier:
• Calculate distance between each pair of residues in known structure
• Input: identities of the nearest n spatial neighbors
• Classifier: Naïve Bayes
• Leave-one-out cross validation

Example input for target Ser 28: SSFRLNKSGRT
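Selecting the nearest n spatial neighbors of a target residue might look like the sketch below. It assumes one representative coordinate per residue (the slides do not say whether distances use Cα atoms, centroids, or something else) and excludes the target itself from the neighbor list; the function name `nearest_spatial_neighbors` and the toy coordinates are hypothetical.

```python
import math


def nearest_spatial_neighbors(coords, residues, target, n):
    """Return the identities of the n residues closest in space to `target`."""
    others = [i for i in range(len(coords)) if i != target]
    others.sort(key=lambda i: math.dist(coords[target], coords[i]))
    return "".join(residues[i] for i in others[:n])


# Toy coordinates (hypothetical): one representative point per residue, in Å.
toy_residues = "SFKGR"
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (9.0, 1.0, 0.0),
              (2.0, 2.0, 0.0), (0.0, 0.0, 6.0)]
print(nearest_spatial_neighbors(toy_coords, toy_residues, target=0, n=3))
```

Unlike the sequence window, the resulting string can bring together residues that are far apart in the primary sequence but close in the folded structure.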

PSSM-Based Classifier:
• PSI-BLAST against NCBI nr database to generate PSSMs
• Input: PSSM vectors for residues contiguous in sequence
• Classifier: Support Vector Machine (SVM)
• 10-fold cross validation

Example input for target Ser 28: one 20-element PSSM vector per residue in the window QSVSTSSFRYM: -3,7,8,… 5,-4,-6, … … … …,5,9,-1,…
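Building the PSSM-based input can be sketched as concatenating the PSSM rows for a window of contiguous residues into one flat feature vector. The window half-width, the zero-padding at the protein ends, the function name `pssm_window_features`, and the toy scores are assumptions for illustration; the slides do not specify them, nor which SVM implementation was used.

```python
def pssm_window_features(pssm, center, half_width=5):
    """Concatenate the PSSM rows in a window around `center` into one flat vector."""
    n_columns = len(pssm[0])
    features = []
    for i in range(center - half_width, center + half_width + 1):
        # Positions beyond either end of the protein are zero-padded (an assumption).
        row = pssm[i] if 0 <= i < len(pssm) else [0] * n_columns
        features.extend(row)
    return features


# Toy PSSM (hypothetical scores): 4 residues x 20 amino acid columns.
toy_pssm = [[r - c % 3 for c in range(20)] for r in range(4)]
vector = pssm_window_features(toy_pssm, center=1, half_width=1)
print(len(vector))  # 3 rows x 20 columns
```

With an 11-residue window and 20 columns per residue, each example would be a 220-element vector.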



Dataset of RNA-protein Interface Residues

• PDB: extract all protein-RNA complexes; select high-resolution structures (< 3.5 Å resolution): 503 complexes
• Filter using PISCES (< 30% pairwise sequence identity): 181 chains, 48,791 residues
• Identify interface residues using a distance cutoff of 5 Å:
– 7,456 interface residues (positive examples)
– 41,335 non-interface residues (negative examples)

PISCES: Wang and Dunbrack, 2003 Bioinformatics, 19:1589
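The distance-cutoff labeling step can be sketched as below. It assumes that a residue counts as an interface residue when any of its atoms lies within 5 Å of any RNA atom, with an inclusive comparison; the slides give only the cutoff value, so both choices, the function name `interface_residues`, and the toy coordinates are hypothetical. The last lines check the class balance implied by the dataset counts.

```python
import math


def interface_residues(protein_atoms, rna_atoms, cutoff=5.0):
    """Return ids of residues with any atom within `cutoff` Å of any RNA atom."""
    interface = set()
    for residue_id, xyz in protein_atoms:
        if residue_id in interface:
            continue  # already labeled by an earlier atom of the same residue
        if any(math.dist(xyz, rna_xyz) <= cutoff for rna_xyz in rna_atoms):
            interface.add(residue_id)
    return interface


# Toy coordinates (hypothetical), in Å: (residue id, atom position).
toy_protein = [(27, (0.0, 0.0, 0.0)), (27, (1.5, 0.0, 0.0)),
               (28, (4.0, 0.0, 0.0)), (29, (15.0, 0.0, 0.0))]
toy_rna = [(8.0, 0.0, 0.0)]
print(sorted(interface_residues(toy_protein, toy_rna)))

# Class balance implied by the dataset counts.
positives, negatives = 7456, 41335
print(positives + negatives, round(positives / (positives + negatives), 3))
```

The two classes sum to the 48,791 residues in the dataset, and interface residues make up roughly 15% of the examples.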


Performance in predicting interface residues, using only protein sequence as input

Complex          Classifier                              Accuracy  Specificity  Sensitivity  Correlation coefficient
Protein-Protein  2-stage classifier (SVM + Naïve Bayes)  72 %      58 %         39 %         0.30
Protein-DNA      Naïve Bayes                             77 %      37 %         43 %         0.25
Protein-RNA      Naïve Bayes                             85 %      51 %         38 %         0.35

Reference:
• Protein-Protein: Yan et al., 2004, Bioinformatics
• Protein-DNA: Yan et al., 2006, BMC Bioinformatics
• Protein-RNA: Terribilini et al., 2006, RNA

Related work:
• Protein-Protein: Jones & Thornton; Ofran & Rost; many others
• Protein-DNA: Jones et al.; Thornton et al.; Ahmad & Sarai
• Protein-RNA: Jeong et al.; Miyano et al.; Go et al.


A few "good" predictions mapped onto structures, using only protein sequence as input

• Protein-Protein (2-stage classifier, SVM + Naïve Bayes): Ab Fab N10, Acc = 87%, CC = 0.65
• Protein-DNA (Naïve Bayes): Repressor, Acc = 88%, CC = 0.66
• Protein-RNA (Naïve Bayes): dsRNA Binding Protein, Acc = 86%, CC = 0.59

Yan Bioinformatics 2004; Yan BMC Bioinformatics 2006; Terribilini RNA 2006


Combining Sequence, Structure & PSSM-Based Classifiers Improves Prediction of RNA-Binding Residues

Metric                    ID-Seq   ID-Struct   ID-PSSM   Combined
Specificity (1)           0.55     0.63        0.51      0.53
Sensitivity (2)           0.30     0.32        0.43      0.49
Accuracy                  0.86     0.87        0.85      0.86
Correlation Coefficient   0.33     0.38        0.38      0.43
AUC of ROC (3)            0.73     0.77        0.79      0.81

(1) Specificity: precision for the positive (RNA-binding) class
(2) Sensitivity: recall for the positive (RNA-binding) class
(3) AUC: area under the Receiver Operating Characteristic (ROC) curve

Predictions illustrated on 3D structures: 30S ribosomal protein S17 (PDB ID 1FJG:Q)
Panels: Sequence-Based, Structure-Based, PSSM-Based, Combined (for clarity, bound RNA is not shown)

• TP = True Positive = interface residues predicted as such
• FP = False Positive = non-interface residues predicted as interface residues
• TN = True Negative = non-interface residues predicted as such
• FN = False Negative = interface residues predicted as non-interface

Combined results for 1FJG:Q: Spec+ = 0.89, Sens+ = 0.96, Accuracy = 0.91, Correlation Coefficient = 0.83
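The reported measures can be computed from TP, FP, TN and FN counts as sketched below. Specificity and sensitivity follow the definitions given on this slide (precision and recall for the positive class). The slides do not define the correlation coefficient, so the Matthews correlation coefficient is assumed here; the function names and the toy counts and scores are hypothetical, not taken from the results.

```python
import math


def interface_metrics(tp, fp, tn, fn):
    """Accuracy, specificity (precision), sensitivity (recall) and correlation coefficient."""
    total = tp + fp + tn + fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "specificity": tp / (tp + fp) if tp + fp else 0.0,  # precision, positive class
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # recall, positive class
        # Matthews correlation coefficient (assumed definition of "CC").
        "cc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    correct = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return correct / (len(positives) * len(negatives))


# Toy confusion counts and scores (hypothetical, not taken from the slides).
print(interface_metrics(tp=40, fp=10, tn=940, fn=10))
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Accuracy alone is a weak summary for this dataset: labeling every one of the 48,791 residues as non-interface would already score 41,335 / 48,791 ≈ 84.7% accuracy while finding no interface residues, which is one reason to read it alongside specificity, sensitivity and the correlation coefficient.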


Predictions for Signal Recognition Particle 19kDa protein (PDB ID 1JID_A)

• ID-Seq predictions: Accuracy = 80%, Specificity = 56%, Sensitivity = 21%, CC = 0.25
• Combined predictions: Accuracy = 82%, Specificity = 55%, Sensitivity = 75%, CC = 0.52


RNABindR: An RNA Binding Site Prediction Server


Applications

• Lentiviral Rev proteins

• Telomerase Reverse Transcriptase (TERT)

http://telomerase.asu.edu/


Rev - a potential target for novel HIV therapies

• Rev is a multifunctional regulatory protein that plays an essential role in the production of infectious virus
• A small nucleocytoplasmic shuttling protein (HIV Rev 115 aa; EIAV Rev 165 aa)
• Recognizes a specific binding site on viral RNA: the Rev Responsive Element (RRE)
• Contains specific domains that mediate nuclear localization, RNA binding and nuclear export
• Rev's critical role in lentiviral replication makes it an attractive target for antiviral (AIDS) therapy


Problem: no high resolution Rev structure! - not even for HIV Rev, despite intense effort

• Why?
– Rev aggregates at concentrations needed for NMR or X-ray crystallography
– The only high resolution information available is for short peptide fragments of HIV-1 Rev: a 22 amino acid fragment of Rev bound to a 34 nucleotide RRE RNA fragment

• What about insights from sequence comparisons?
– HIV Rev sequence has low sequence identity with proteins with known structure
– Very little sequence similarity among different Rev family members (e.g., EIAV vs HIV < 10%)


HIV-1 Rev: Predictions vs Experiments

Prediction on RNA-binding protein HIV-1 Rev

Position labels 33, 43, 53: DTRQARRNRR RRWRERQRAA AA
Interface residue marks (row labels: Actual IR, Predicted): ++++++++++ ++++++++

Sequence-based prediction on HIV-1 Rev (not included in the training set) identified every interface residue, plus 3 false positives

Structure panels: Predicted, Actual
NMR structure (1ETF:B): 22 aa Rev peptide bound to RNA. Battiste et al., 1996, Science 273:1547

Interface residues = red; non-interface residues = grey; RNA = green


EIAV Rev: Predictions vs Experiments

PREDICTED: structure; protein binding residues; RNA binding residues
Motifs labeled: KRRRK, RRDRW

Predicted residues ("+") along the EIAV Rev sequence:
• Position labels 41, 51: GPLESDQWCRVLRQSLPEEKISSQTCI / ++++++++ ++
• Position labels 61, 71, 81, 91: ARRHLGPGPTQHTPSRRDRWIREQILQAEVLQERLEWRI / +++++++++++++++ ++++++++++++++++
• Position labels 131, 141, 151, 161: QRGDFSAWGDYQQAQERRWGEQSSPRVLRPGDSKRRRKHL / ++++++++++ ++ +++ ++++++ + ++++++++++++++++++++

Domain map: positions 31, 57, 125, 145, 165; motifs RRDRW, ERLE, KRRRK; NES; NLS

VALIDATED: protein binding residues; RNA binding residues
Panel labels: MBP, WT, 31-165, 57-165, 31-145, 145-165

Ihm, Ho, Carpenter

Lee J Virol 2006; Terribilini RNA 2006


Mutations in EIAV Rev: Experimental evaluation of RNA binding sites

Domain map: positions 31, 57, 125, 145, 165; motifs RRDRW, ERLE, KRRRK; NES; NLS
Mutant motifs: AADAA, AALA, ERDE, KAAAK
Panel labels: KAAAK, AADAA, AALA, ERDE, WT

Lee J Virol 2006; Terribilini RNA 2006


Summary

Results show predicted protein & RNA binding sites in Rev proteins of HIV-1 & EIAV agree with available experimental data

Figure panels: HIV-1 Rev; EIAV Rev (motifs KRRRK, RRDRW)


Telomerase Reverse Transcriptase (TERT)

Functions:
– “Cap” ends of chromosomes to prevent:
   – Recombination
   – End-to-end fusion
   – Degradation
– Allow complete replication of chromosomes

Interactions:

Protein-DNA
– Binds linear chromosome ends (& extends them)

Protein-RNA
– Telomerase comprises the reverse transcriptase (TERT) subunit and an essential RNA component (TR)

Protein-Protein
– Dyskerin - component of active human telomerase complex
– Many other interacting proteins: e.g., PPI1, RAP1, TEP1, HSP90

Lingner (1997) Science 276: 561-567

Adapted from P. J. Mason


Human TERT: Preliminary docking of 3 modeled domains

Preliminary model (lacking TEN domain)

Kurcinski, Kolinski, Kloczkowski


Predicted vs Actual RNA-Binding Residues in Human TRBD

Predicted Actual


Current & future work

Progress towards our goals?
√ Model TERT domains from human
√ Dock domains to generate a complete model for TERT protein
• Generate a working model for TERT-TR complex
• Predict TR RNA tertiary structure, then dock with protein
Underway…

Future:
– Experimentally interrogate protein-RNA interfaces suggested by this work
– Investigate these interfaces as potential therapeutic targets


Conclusions

• A combined classifier that uses the query sequence plus information derived from the known structure & a PSSM generated from PSI-BLAST sequence homologs (trained and tested on RB181, a dataset of diverse protein-RNA interfaces) predicts interface residues with ~86% overall accuracy, CC = 0.43

• Combining structure prediction with machine learning has potential to provide valuable insights into structure & function of important large RNP complexes, especially those for which high-resolution experimental structural information is not yet available

• Computational methods can provide insight into protein-RNA interfaces, even for "recalcitrant" proteins whose structures are not yet available


Acknowledgements

Dobbs Lab @ Iowa State University
http://ddobbs.public.iastate.edu/
Drena Dobbs, BCB & GDCB
– Michael Terribilini
– Jeffry Sander
– Peter Zaback
– Deepak Reyon
– Ben Lewis

Honavar Lab @ Iowa State University
http://www.cs.iastate.edu/~honavar/aigroup.htm
Vasant Honavar, BCB & Computer Science
– Cornelia Caragea

Kolinski Lab @ University of Warsaw
http://biocomp.chem.uw.edu.pl/
Andrzej Kolinski, Chemistry
– Mateusz Kurcinski

@ Iowa State University
Andrzej Kloczkowski, BBMB
Robert Jernigan, BBMB
Kai-Ming Ho, Physics

@ Washington State University
Susan Carpenter, Vet Micro & Patho

@ UCLA
Yungok Ihm, Biochemistry

Supported by:
NSF IGERT Computational Molecular Biology
USDA MGET Animal Genomics
Iowa State University:
– Bioinformatics & Computational Biology Program (BCB)
– LH Baker Center for Bioinformatics & Biological Statistics
– Center for Integrated Animal Genomics (CIAG)
– Center for Computational Intelligence, Learning & Discovery (CILD)
