Genome Wide Searches for RNA Secondary Structure Motifs Russell S. Hamilton Davis Lab Wellcome Trust Centre for Cell Biology Drosophila melanogaster

Page 1: Genome Wide Searches for RNA Secondary Structure Motifs

Genome Wide Searches for RNA Secondary Structure Motifs

Russell S. Hamilton

Davis Lab, Wellcome Trust Centre for Cell Biology

Drosophila melanogaster

Page 2: Genome Wide Searches for RNA Secondary Structure Motifs

Introduction: RNA Localization

[Figure labels: microtubules (+ and - ends), mRNA cis-acting signal, trans-acting factors, dynein]

RNA Localization is a mode of targeting various proteins to their site of function

Cis-acting signals in the mRNA are recognised by trans-acting factors bound to the dynein motor

Translation of the mRNA into protein is blocked during transport

The mRNA is anchored at the site of function before being translated to protein (Delanoue & Davis, 2005, Cell, in press)

Page 3: Genome Wide Searches for RNA Secondary Structure Motifs

gurken is localized to the dorso/anterior corner, where it forms a cap around the oocyte nucleus and establishes the dorso/ventral axis

gurken localization has been shown to be dynein dependent (MacDougall et al, 2003, Dev. Cell, 4, 307-19)

The gurken localization signal has been mapped to a 64 nt region that is necessary and sufficient for localization (Van De Bor & Davis, 2004, Curr. Opin. Cell Biol. 16, 300-7)

gurken also localizes in the embryo

Introduction: gurken

[Figure: Localizing mRNA in oocyte. Labels: osk, bcd, grk; axes A/P and D/V]

gurken encodes a TGFα homologue

Page 4: Genome Wide Searches for RNA Secondary Structure Motifs

Introduction: I Factor

[Figure labels: Localized I Factor, nucleus]

I Factor is a retrotransposon (or transposable element), which inserts itself into the genome of an organism

I Factor has been found to localize in a similar manner to gurken (Van De Bor, Hartswood, Jones, Finnegan & Davis)

The I Factor localization signal has been mapped to a 58 nt region that is necessary and sufficient for localization.

Van De Bor

Page 5: Genome Wide Searches for RNA Secondary Structure Motifs

Sequence Similarity: %ID = 34%

gurken  AAGTAATTTTCGTGCTCTCAACAATTGTCGCCGTCACAGATTGTTGTTCGAGCCGAATCTTACT 64
Ifactor ---TGCACACCTCCCTCGTCACTCTTGATTTT-TCAAGAGCCTTCGATCGAGTAGGTGTGCA-- 58
           *      *   ***   **  ***      ***       * * *****  *      *
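The percent identity quoted above can be checked directly from the two aligned rows. The sketch below is illustrative only: the function name is hypothetical, and it assumes identity is counted over all alignment columns (including gap columns), which is one of several common conventions and may not be the one used for the slide.

```python
# Aligned rows copied from the slide; '-' marks a gap.
GURKEN = "AAGTAATTTTCGTGCTCTCAACAATTGTCGCCGTCACAGATTGTTGTTCGAGCCGAATCTTACT"
IFACTOR = "---TGCACACCTCCCTCGTCACTCTTGATTTT-TCAAGAGCCTTCGATCGAGTAGGTGTGCA--"


def percent_identity(row_a, row_b, gap="-"):
    """Return (identical columns, percent identity over alignment length).

    Assumption: the denominator is the full alignment length.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # A column counts as identical only if both rows hold the same base.
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return matches, 100.0 * matches / len(row_a)


matches, pct = percent_identity(GURKEN, IFACTOR)
print(matches, round(pct))  # 22 34
```

Under this convention, 22 identical columns out of 64 give about 34%, which agrees with the figure on the slide.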

Structural Similarity

V. Van De Bor, D. Finnegan, E. Hartswood and C. Jones

[Figure: gurken 64 nt stem loop and I Factor 58 nt stem loop, each annotated with the elements H, St, I1, B and I2]

Are there more examples in the Drosophila genome using a similar mechanism of localization?

Search by secondary structure not sequence

Introduction: gurken and I Factor

Page 6: Genome Wide Searches for RNA Secondary Structure Motifs

Genome sequences -> Folded genome sequences -> Comparison with grk & I Factor structures -> Database

Method Outline

Page 7: Genome Wide Searches for RNA Secondary Structure Motifs

RNALfold

Folds large genomic sequences, outputting stable structures of a given size

Similar to mfold, but optimised for folding on a genome wide scale

[Figure: 2L chromosome arm genomic sequence -> stable structures]

RNALfold Hofacker et al (2004) Bioinformatics 20, 191-198

Method: RNALfold

Window length is user defined
• Use 64 and 58 (grk & I Factor LEs)
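As a rough illustration of running the folding step at the two window lengths, the sketch below wraps the RNALfold command line from the ViennaRNA package. The helper names are hypothetical, and the parser assumes each hit is printed as a dot-bracket string, an energy in parentheses and a start position; check this against the installed version before relying on it.

```python
import re
import shutil
import subprocess

# Assumed shape of an RNALfold hit line: "<dot-bracket> (<energy>) <start>"
HIT_RE = re.compile(r"^([.()]+)\s+\(\s*(-?\d+(?:\.\d+)?)\)\s+(\d+)\s*$")


def rnalfold_command(window):
    """Command line limiting the base pair span to `window` nt (-L option)."""
    return ["RNALfold", "-L", str(window)]


def parse_hits(text):
    """Extract (structure, energy, start) tuples from RNALfold-style output."""
    hits = []
    for line in text.splitlines():
        m = HIT_RE.match(line.strip())
        if m:  # lines such as the echoed sequence are skipped
            hits.append((m.group(1), float(m.group(2)), int(m.group(3))))
    return hits


def fold(sequence, window):
    """Fold one sequence; returns [] if RNALfold is not installed."""
    if shutil.which("RNALfold") is None:
        return []
    result = subprocess.run(rnalfold_command(window), input=sequence + "\n",
                            capture_output=True, text=True, check=True)
    return parse_hits(result.stdout)


for window in (64, 58):  # gurken and I Factor LE lengths
    print(" ".join(rnalfold_command(window)))
```

Keeping the parser separate from the subprocess call means it can be tested on saved output without the binary being present.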

Page 8: Genome Wide Searches for RNA Secondary Structure Motifs

RNAdistance & RNAforester

• Structures are represented in bracket format, a minimal representation maintaining all structural characteristics

• Structures then aligned (not by sequence) with the query structure e.g. gurken LE

• Scores can be weighted by sequence length and total number of base pairs

..(((((.....))))). Matches = + score

.-(-(((-....))))-. Mismatches = - score

( = base pair, . = unpaired base, - = gap

RNAdistance: Global Structure Comparison. Hofacker (1994) Monatsh. Chem. 125, 167-188

RNAforester: Local Structure Comparison. Hochsmann (2003) Proc. Comp. Sys. Bioinf. (CSB 2003)
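To make the bracket format concrete, the sketch below converts a dot-bracket string into its set of base pairs and compares two structures by counting the pairs they do not share. This base pair distance is a deliberately simpler stand-in, not the alignment-based scores that RNAdistance and RNAforester compute, and the second structure is a made-up example.

```python
def base_pairs(structure):
    """Return the set of (i, j) index pairs encoded by a dot-bracket string.

    Any character other than '(' and ')' (for example '.' or a gap '-')
    is treated as unpaired.
    """
    stack, pairs = [], set()
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: unexpected ')'")
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure: unclosed '('")
    return pairs


def base_pair_distance(a, b):
    """Number of base pairs present in exactly one of two equal-length structures."""
    if len(a) != len(b):
        raise ValueError("structures must have the same length")
    return len(base_pairs(a) ^ base_pairs(b))


query = "..(((((.....)))))."   # example structure from the slide
other = "...((((.....)))).."   # hypothetical: same hairpin, one pair shorter
print(base_pair_distance(query, other))  # 1
```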

Method: RNAdistance & RNAforester

Page 9: Genome Wide Searches for RNA Secondary Structure Motifs

Flexible secondary structure definition and searching algorithm

Two step process
Step 1. Create a structure description
Step 2. Use the description to find matching structures in a sequence database

Uses Mfold (and pknots) for secondary structure predictions
Output can be ranked by thermodynamic stability

User Defined Scoring
Based on if/then/else statements
e.g. if loop has 6-8 bases then score += 10 else score -= 10

Algorithm Summary
The description is converted to a tree structure
The secondary structure of the sequence being matched is also converted to a tree structure
The two trees are then matched
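The user defined scoring rule on this slide can be mimicked outside RNAMotif for illustration. The sketch below is plain Python rather than RNAMotif's own descriptor language; the function names are hypothetical, and the loop length is read from a dot-bracket string on the assumption that the structure contains a simple hairpin.

```python
def hairpin_loop_length(structure):
    """Length of the first hairpin loop in a dot-bracket string.

    The loop is the unpaired run between the innermost '(' and the
    first ')' that follows it.
    """
    last_open = None
    for i, ch in enumerate(structure):
        if ch == "(":
            last_open = i
        elif ch == ")":
            if last_open is None:
                raise ValueError("unbalanced structure")
            return i - last_open - 1
    raise ValueError("structure contains no hairpin")


def loop_score(structure, score=0):
    """Apply the slide's example rule: loop of 6-8 bases earns +10, otherwise -10."""
    if 6 <= hairpin_loop_length(structure) <= 8:
        score += 10
    else:
        score -= 10
    return score


print(loop_score("((((......))))"))      # 6 nt loop
print(loop_score("..(((((.....)))))."))  # 5 nt loop
```

A rule like this leaves the set of hits unchanged and only reorders them, which is consistent with the scored and unscored descriptions returning the same number of hits.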

Method: RNAMotif

Macke, T.J. et al (2001) Nucl. Acids Res., 29, 4724-4735

Page 10: Genome Wide Searches for RNA Secondary Structure Motifs

Define base pairings allowed (in addition to Watson-Crick)

Define stems, loops, and bulges
• Including number of nucleotides
• Setting a range 0-N means the element can either be present or absent

Can also put in sequence constraints, including tolerated mismatches

Can search for pseudoknots, triplexes & quadruplexes

Very flexible method of describing secondary structures

Method: RNAMotif

Page 11: Genome Wide Searches for RNA Secondary Structure Motifs

4 description files so far…

1. Basic: 2900 hits
Matches both gurken and I factor LEs

2. Basic + score: 2900 hits
Scores nearer gurken as positive
Scores nearer I factor as negative

3. Basic + score + seq constraint UU: 394 hits
UU in bulge present in both gurken and I factor

4. Basic + score + seq constraint UU + CAA/AAC: 151+ hits
CAA/AAC in stem1 present in both gurken and I factor

[Figure: elements of the structure description and their allowed sizes: loop 4-12 nt; stem 7-8 nt; stem 2-4 nt; stem 4 nt; bulge 3-5 nt; bulge 1-2 nt; bulge 0-1 nt; stem 5-9 nt]
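The size ranges in the description can be written down as data and checked against a candidate structure. The sketch below is only an illustration of range-based matching: the element order follows the order in which the labels are listed on the slide (not necessarily 5' to 3'), the candidate lengths are hypothetical, and the real matching is done by RNAMotif itself.

```python
# (element type, minimum length, maximum length) in nt, as listed on the slide.
DESCRIPTION = [
    ("loop", 4, 12),
    ("stem", 7, 8),
    ("stem", 2, 4),
    ("stem", 4, 4),
    ("bulge", 3, 5),
    ("bulge", 1, 2),
    ("bulge", 0, 1),   # a 0-N range means the element may be absent
    ("stem", 5, 9),
]


def matches_description(elements, description=DESCRIPTION):
    """True if every (type, length) element falls inside its allowed range."""
    if len(elements) != len(description):
        return False
    return all(
        kind == wanted and low <= length <= high
        for (kind, length), (wanted, low, high) in zip(elements, description)
    )


# Hypothetical candidate, one length per element in the same order.
candidate = [("loop", 6), ("stem", 8), ("stem", 3), ("stem", 4),
             ("bulge", 4), ("bulge", 1), ("bulge", 0), ("stem", 7)]
print(matches_description(candidate))  # True
```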

Method: RNAMotif

Page 12: Genome Wide Searches for RNA Secondary Structure Motifs

Sequence Databases

PC: 1.3 × 10^7
CDS: 3.0 × 10^7
TS: 4.0 × 10^7
GN: 1.2 × 10^8
TE: 4.5 × 10^6
NC: 4.8 × 10^6
3'-UTR: 6.3 × 10^6
5'-UTR: 3.9 × 10^6

RNALfold [3]: folds 2.8 × 10^8 nt at window lengths of 58 and 64 nt

RNAdistance [4]: each stable structure is compared to the gurken and I Factor LEs

RNAMotif [5]: stable structures are filtered by rule based pattern matching

Matches Database: web based database interface

Candidates for experimental validation


Take all available sequence databases

Predict all stable secondary structures

Calculate similarity between grk/Ifactor and stable structures

Pattern match structures against an RNAMotif description

Results put in database and accessed via web interface

Method: Overview

Page 13: Genome Wide Searches for RNA Secondary Structure Motifs

Processing: 6 processing nodes
• Pentium 4 HT, 1 GB RAM

Data Storage: RAID array file server, tape backup robot

Computational requirements are beyond desktop PCs

Main requirements are for processing power and enough storage space for the sequences being searched and the database of matching structures

Computational Infrastructure

Web Server: linked to database

Development Platform

Page 14: Genome Wide Searches for RNA Secondary Structure Motifs

http://wcbweb.icmb.ed.ac.uk/~ilan/bioinformatics.html

To stop your browser crashing, you can limit the number of hits displayed

Filter by percentage of the sequence deemed to have low complexity

Select the RNAMotif structure description used in the searches

Narrow down the search by CG, TE, CR or individual identifiers

Web Interface: Searching

Page 15: Genome Wide Searches for RNA Secondary Structure Motifs

RNAMotif raw output showing how sequence matches the structure description

Indicates if the sequence has regions of low complexity/repeat regions (option to filter these out)

RNAdistance scores displayed

Custom RNAMotif score

Web Interface: Search Results

Page 16: Genome Wide Searches for RNA Secondary Structure Motifs

Web Interface: Gene Mapping

Page 17: Genome Wide Searches for RNA Secondary Structure Motifs

Web Interface: Conservation Assessment

Page 18: Genome Wide Searches for RNA Secondary Structure Motifs

Results: Candidate Injections

We are currently in the process of injecting candidates from the database into oocytes and embryos to determine if the RNA is localized.

There have been suggestions that up to 20% of Drosophila genes may localize in the oocyte and/or embryo

So we want to show that our method is able to enrich for localizing genes

Results of candidate injections are stored in the database
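One way to express whether the method enriches for localizing genes is to compare the fraction of injected candidates that localize with the suggested background of up to 20%. The sketch below is a generic calculation, not the analysis used in this work; the function names are hypothetical and the counts in the example are invented purely to show the arithmetic.

```python
from math import comb


def fold_enrichment(localized, injected, background=0.20):
    """Observed localizing fraction divided by the background fraction."""
    return (localized / injected) / background


def binomial_tail(localized, injected, background=0.20):
    """P(X >= localized) if each candidate localized with the background rate."""
    return sum(
        comb(injected, k) * background ** k * (1 - background) ** (injected - k)
        for k in range(localized, injected + 1)
    )


# Hypothetical counts for illustration only: 10 of 20 injected candidates localize.
print(fold_enrichment(10, 20))
print(binomial_tail(10, 20))
```

A small tail probability would suggest that the observed rate is unlikely under the background rate alone; with few injections the estimate remains very uncertain.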

Page 19: Genome Wide Searches for RNA Secondary Structure Motifs

Depending on the success of the experimental localization assays…

Expand the searches to:
• Other Drosophilid genomes
  12 will be sequenced in the near future
• Mammalian genomes (particularly human)
  Will require considerable computational power
  Search for LINE/SINE elements in human (transposon equivalents)

Develop the web interface to enable real time searches to be performed on genes/genomes of interest

• Requires massive computational power…

Future Work: Expanding Searches

Page 20: Genome Wide Searches for RNA Secondary Structure Motifs

Squid Protein

gurken mRNA is known to bind Squid protein

Used homology modelling to predict Squid tertiary structure (~2.5 Å) (Hamilton & Soares)

RNA tertiary structure prediction

Secondary structure alone may not be sufficient for finding similar structures

Experimental Structure Determination

RNA + Protein: X-Ray and/or NMR

RNA only: NMR

Future Work: Tertiary Structure

[Figure: Squid homology model. Labels: RNA binding sites, flexible linker region]

[Figure: RNA + protein 3D structure: Staufen + RNA (Ramos et al, 2000, EMBO J., 19, 997-1009)]

Page 21: Genome Wide Searches for RNA Secondary Structure Motifs

Long Term Future…

Support Vector Machines (SVMs)

Take sequence & structure for localizing and non-localizing matches (+ other data)

Algorithm learns how to separate localizing from non-localizing
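Before any classifier can be trained, each candidate has to be turned into a numeric feature vector. The sketch below shows one possible encoding from sequence and dot-bracket structure; the choice of features is an assumption made for illustration, and the resulting vectors (with labels such as +1 for localizing and -1 for non-localizing) could be passed to any SVM implementation.

```python
def features(sequence, structure):
    """Encode one candidate as [length, GC fraction, paired fraction].

    These three features are illustrative assumptions, not the feature
    set planned for this work.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    n = len(sequence)
    seq = sequence.upper()
    gc_fraction = (seq.count("G") + seq.count("C")) / n
    paired_fraction = (structure.count("(") + structure.count(")")) / n
    return [n, gc_fraction, paired_fraction]


# Hypothetical toy hairpin, used only to show the output shape.
print(features("GGGAAACCC", "(((...)))"))
```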

Future Work: Machine Learning

The problem is that we don’t have enough data at the moment

However, with all the candidate injections we will hopefully generate enough data for localizing and non-localizing genes

Page 22: Genome Wide Searches for RNA Secondary Structure Motifs

Funding

Davis Lab: Ilan Davis, Veronique Van De Bor, Georgia Vendra, Hille Tekotte, Renald Delanoue, Carine Meignin, Alejandra Clark, Isabelle Kos, Richard Parton

Software

Acknowledgements

Finnegan Lab: David Finnegan, Eve Hartswood, Cheryl Jones

Bioinformatics Discussions: Alastair Kerr

Systems Administration: Paul Taylor

Homology Modelling: Dinesh Soares