March 2006 Vineet Bafna ncRNA detection w/ multiple alignments

Comparative detection of ncRNA

• Given a pairwise alignment, QRNA decides whether it is RNA, coding, or other.

• The key to detecting RNA is covarying mutations.

• Multiple alignment should provide more information on covarying mutations.

RNAz

• Computes the probability that a multiple alignment contains an ncRNA.

• RNAz computes two ‘novel’ statistics:
  – Minimum free energy of the sequences (MFE)
  – Conserved secondary structure (SCI)

• Train an SVM using the following features:
  – MFE
  – SCI
  – Mean pairwise identity
  – Number of sequences in the input

SCI

• Apply minimum-energy folding to a multiple alignment.

• The score of a pair of columns depends on base-pairing as well as compensatory mutations.

• Let E_A denote the consensus fold energy.

• Let E denote the average MFE of all sequences.
  – SCI = E_A / E
  – Claim: low SCI is bad, high is good.
  – Q: What is the SCI for diverged (random) sequences?
  – Q: What is the SCI for identical sequences?
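
As a rough numerical illustration of the two questions above, the sketch below evaluates SCI = E_A / E for made-up energies. The function name `sci` and all energy values are assumptions chosen for illustration; they are not RNAz output. For identical sequences the consensus fold is simply the fold of each sequence, so E_A equals E and the SCI is 1. For diverged sequences with no common structure, the consensus energy is close to 0, and so is the SCI.

```python
from statistics import mean

def sci(consensus_energy, single_energies):
    """Structure conservation index: consensus fold energy over the mean single-sequence MFE."""
    avg = mean(single_energies)
    if avg == 0:
        raise ValueError("average MFE is zero; SCI is undefined")
    return consensus_energy / avg

# Hypothetical energies in kcal/mol (illustration only).
identical = sci(-30.0, [-30.0, -30.0, -30.0])   # every sequence folds like the consensus
diverged = sci(-1.5, [-28.0, -31.0, -31.0])     # almost no common structure

print(identical, diverged)
```

Because identical sequences score 1 whether or not they encode an RNA, the SCI has to be read together with the mean pairwise identity, which is one of the SVM features.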

MFE

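• Compute a z-score for a sequence with MFE = m:

• Z = (m − μ)/σ, where μ and σ are the mean and standard deviation of the MFE of shuffled versions of the sequence.

• Instead of estimating μ and σ by shuffling and refolding (slow), use regression to predict μ and σ from sequence length and base composition.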
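
The slow, shuffling-based estimate can be sketched as follows. This is a minimal sketch under stated assumptions: `toy_energy` is a stand-in for a real MFE calculation (it counts nested base pairs with a Nussinov-style recurrence rather than using a thermodynamic model), and the names `toy_energy`, `shuffle_z_score`, `samples`, and `seed` are hypothetical. A real pipeline would call a folding program for the energy.

```python
import random
from statistics import mean, pstdev

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def toy_energy(seq, min_loop=3):
    """Toy stand-in for MFE: minus the maximum number of nested base pairs."""
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # base i left unpaired
            for k in range(i + min_loop + 1, j + 1):
                if (seq[i], seq[k]) in PAIRS:
                    right = best[k + 1][j] if k + 1 <= j else 0
                    score = max(score, 1 + best[i + 1][k - 1] + right)
            best[i][j] = score
    return -best[0][n - 1]

def shuffle_z_score(seq, energy=toy_energy, samples=200, seed=0):
    """Z = (m - mu) / sigma, with mu and sigma estimated from shuffled copies of seq."""
    rng = random.Random(seed)
    m = energy(seq)
    shuffled = []
    for _ in range(samples):
        letters = list(seq)
        rng.shuffle(letters)  # mononucleotide shuffle keeps base composition
        shuffled.append(energy("".join(letters)))
    mu, sigma = mean(shuffled), pstdev(shuffled)
    return (m - mu) / sigma if sigma > 0 else 0.0

z = shuffle_z_score("GGGGGGAAAACCCCCC")
print(z)
```

Each z-score costs `samples` extra folds, which is why predicting μ and σ by regression is attractive.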

Non-linear classification

• The z-statistic and SCI capture different properties.

• Green is good (native), red is bad (shuffled).

• Is SCI a good statistic, given different levels of sequence identity?

Using RNAz to predict ncRNA

• Applying RNAz to conserved regions yields 30k putative RNAs.

• Is this list complete? Is it valid?

Structural Alignment

X07545   ..ACCCGGC.CAUA...GUGGCCG.GGCAA.CAC.CCGG.U.C..UCGUU
M21086   ..ACCCGGC.CAUA...GCGGCCG.GGCAA.CAC.CCGG.A.C..UCAUG
X05870   ..ACCCGGC.CACA...GUGAGCG.GGCAA.CAC.CCGG.A.C..UCAUU
U05019   ..ACCCGGU.CAUA...GUGAGCG.GGUAA.CAC.CCGG.A.C..UCGUU
M16530   ..ACCCGGC.AAUA...GGCGCCGGUGCUA.CGC.CCGG.U.C..UCUUC
X01588   ..ACCCGGU.CACA...GUGAGCG.GGCAA.CAC.CCGG.A.C..UCAUU
AF034619 ...GGCGGC.CACA...GCGGUGG.GGUUGCCUC.CCGU.A.C..CCAUC
L27170   AGUGGUGGC.CAUA...UCGGCGG.GGUUC.CUCCCCGU.A.C..CCAUC
X05532   AGGAACGGC.CAUA...CCACGUC.GAUCG.CAC.CACA.U.C..CCGUC
#=GC     <<<<<<<<<........<<.<<<<.<...<.<...<<<<.<.<.......

Conserved sequences and conserved structure are more apparent in multiple alignments.

RNA multiple alignments

• Detection of RNA depends upon reliable prediction of covarying mutations, as well as regions of conserved sequence

• Precomputing multiple alignments based on sequence considerations is probably not sufficient (should be tested).

• How can structural alignments be computed?

Computing Structural Alignments

• Analogy: In sequence alignment, the score for aligning a column is position independent.

• In profiles, or HMMs, position specific scoring is used to distinguish conserved positions from non-conserved positions

• Similar ideas can be used for RNA.

G U G G C C G
G C G G C C G
G U G A G C G
G U G A G C G
G G C G C C G
G U G A G C G
G C G G U G G
U C G G C G G
C C A C G U C

[Figure: alignment columns annotated with position labels 1, 2, 3, 4]

Pr(G|1) = 0.8
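
The position-specific probability on this slide can be recomputed from the columns. The sketch below is illustrative only: the rows are the seven-column block taken from the structural alignment shown earlier, and the function name `column_profile` and the `pseudocount` parameter are assumptions, not part of any tool named in the slides.

```python
from collections import Counter

# Seven alignment columns, one row per sequence (from the structural alignment slide).
rows = [
    "GUGGCCG", "GCGGCCG", "GUGAGCG", "GUGAGCG", "GGCGCCG",
    "GUGAGCG", "GCGGUGG", "UCGGCGG", "CCACGUC",
]

def column_profile(rows, pseudocount=0.0, alphabet="ACGU"):
    """Return one {base: probability} dict per alignment column."""
    profile = []
    total = len(rows) + pseudocount * len(alphabet)
    for col in range(len(rows[0])):
        counts = Counter(row[col] for row in rows)
        profile.append({b: (counts[b] + pseudocount) / total for b in alphabet})
    return profile

profile = column_profile(rows)
print(round(profile[0]["G"], 2))  # 7 of 9 rows have G in column 1
```

Seven of the nine rows carry G in the first column, giving 7/9 ≈ 0.78, which matches the rounded value Pr(G|1) = 0.8.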

Covariance models=RNA profiles

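[Figure: an alignment with rows AAAAU, UUU-A, AAAU-, ---AU, drawn next to a tree of grammar nodes S, W1, W2, W3, W4, ..., W’2 whose right-hand sides read "a W2", "W3 b", "a W4 b", ..., "a W’2 b"]

Terminal symbols correspond to columns.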

Aligning a sequence to a covariance model

• We align each node of the covariance model (it is tree like, but may be a graph).

• The alignment score follows the same recurrence as in Lecture 7, but with position specific probabilities.

• Example:
  – A[Wi, (i,j)] = −log(Pr[Wi → s[i] Wj s[j]]) + A[Wj, (i+1, j−1)]

• If we wish to compute the probability that a sequence belongs to a family, we compute the total likelihood (sum over all probabilities)

• If we wish to compute the structure of an unknown sequence by comparison to a covariance model, we compute the max likelihood parse in this graph.
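
The difference between the total likelihood and the maximum-likelihood parse can be seen on a toy model. This is a minimal sketch, not a real covariance model: bifurcation nodes are omitted, and the rule format, the state names, and every probability in `rules` are hypothetical. It multiplies probabilities where the recurrence above adds −log costs; the two are equivalent.

```python
import math
from functools import lru_cache

def cm_score(seq, rules, start, combine):
    """Probability that `start` derives seq; `combine` merges alternative parses."""
    @lru_cache(maxsize=None)
    def prob(state, i, j):  # state derives seq[i:j]
        terms = []
        for kind, nxt, emit in rules[state]:
            if kind == "end":
                if i == j:
                    terms.append(emit)
            elif kind == "pair":
                if j - i >= 2:
                    terms.append(emit.get(seq[i] + seq[j - 1], 0.0) * prob(nxt, i + 1, j - 1))
            elif kind == "left":
                if j - i >= 1:
                    terms.append(emit.get(seq[i], 0.0) * prob(nxt, i + 1, j))
            elif kind == "right":
                if j - i >= 1:
                    terms.append(emit.get(seq[j - 1], 0.0) * prob(nxt, i, j - 1))
        return combine(terms) if terms else 0.0
    return prob(start, 0, len(seq))

# Hypothetical three-node model: a base pair, then one unpaired base, then end.
rules = {
    "W1": [("pair", "W2", {"GC": 0.5, "AU": 0.3, "CG": 0.2})],
    "W2": [("left", "W3", {"A": 0.3, "U": 0.2}),
           ("right", "W3", {"A": 0.2, "U": 0.3})],
    "W3": [("end", None, 1.0)],
}

total = cm_score("GAC", rules, "W1", sum)   # family membership: sum over all parses
best = cm_score("GAC", rules, "W1", max)    # structure prediction: best single parse
cost = -math.log(best)                      # the additive score used in the recurrence
print(total, best, cost)
```

Passing `sum` gives the likelihood used to decide family membership, while passing `max` gives the score of the single best structure.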

Covariance models and ncRNA discovery

• Given a family of ncRNA sequences, scan a genomic sequence with a covariance model and retrieve all high scoring sub-sequences.

• This is the most common method, but it is expensive.

• Assume covariance model has m states, and the substring has at most n symbols, and the database has L symbols.

• Alignment cost = O(n^2 m1 + n^3 m2)

• Total time = ?

Computing covariance models

• If we are given a CM, a multiple structural alignment is ‘easy’.
  – In turn, align each sequence to the CM.

• If we are given a multiple alignment, computing the covariance model is easy.

• For simultaneous prediction, a Bayesian iterative approach is used:
  – Compute a seed alignment.
  – Use the alignment to compute a CM.
  – Use the CM to compute a new alignment.
  – Iterate.

Open

• Compute a structural multiple alignment.

• Existing methods do not work well without a good seed alignment, and require excessive hand curation.

• Here, we solve a simpler problem:
  – Predict conserved structure in unaligned sequences.

Motivation to a new approach

– Base pairs appear in ‘clusters’, which we call stacks; stacking is energetically favorable.

– Most of the stability of the RNA secondary structure is determined by stacks.

ACCUU AAGGA

Chance of a specific 5-base-pair stack in random sequence: p = (1/4)^5 < 0.001.
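
The arithmetic behind p can be checked directly. The sketch assumes independent, uniformly distributed bases and Watson-Crick pairs only, so that a given base finds its complement with probability 1/4; the function name `random_stack_probability` and the `pair_prob` parameter are illustrative.

```python
def random_stack_probability(length, pair_prob=0.25):
    """Chance that `length` consecutive positions all pair, assuming independence."""
    return pair_prob ** length

p = random_stack_probability(5)
print(p, p < 0.001)
```

If G-U wobble pairs were also allowed, 6 of the 16 ordered base combinations would pair, and `pair_prob=6/16` would give a noticeably larger chance of a spurious stack.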

Statistics of the stacks in Rfam database

• Most base-pairs are stacked up

[Figure: fraction of true stacks missed (y-axis, 0 to 1) versus length of stacks (x-axis, 1 to 10)]

Using stacks as anchors for predictions

• The idea of anchors as constraints has been used in multiple genomic sequence alignment:
  – MAVID (Bray and Pachter, 2004)
  – TBA (Blanchette et al., 2004)

• Several heuristic methods have been developed that find anchored stacks:
  – Waterman (1989) used a statistical approach to choose conserved stacks within fixed-size windows.
  – Ji and Stormo (2004) and Perriquet et al. (2003) use primary sequence conservation of the stacks and the length of loop regions to reduce the search space.

• Difficulties:
  – Stack anchors have low sequence similarity.
  – It is hard to find the correct anchors.

Problem

• Selecting one stack at a time may lead to wrongly matched stacks.

A global approach: configuration of stacks

• RNA secondary structure can be viewed as stacks plus unpaired loops (no individual base pairs).

• The energy of the structure is the sum of the energies of stacks and loops.

• Stack configuration:
  – Nested stacks
  – Parallel stacks
  – Crossing stacks (pseudoknots)

• More generalized stacks can include mismatches.
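
The three configurations can be told apart from the coordinates of the two stacks alone. The following sketch assumes a stack is described by the first base of its 5' arm, the last base of its 3' arm, and its length; the `Stack` type and the `relation` function are hypothetical names used only for illustration.

```python
from typing import NamedTuple

class Stack(NamedTuple):
    start: int   # first base of the 5' arm
    end: int     # last base of the 3' arm
    length: int  # number of stacked base pairs

    @property
    def left_arm(self):
        return (self.start, self.start + self.length - 1)

    @property
    def right_arm(self):
        return (self.end - self.length + 1, self.end)

def relation(a, b):
    """Classify two stacks as 'nested', 'parallel', 'crossing', or 'overlapping'."""
    if a.start > b.start:
        a, b = b, a  # make a the stack that starts first
    a_l, a_r, b_l, b_r = a.left_arm, a.right_arm, b.left_arm, b.right_arm
    if a_r[1] < b_l[0]:
        return "parallel"      # b lies entirely after a
    if a_l[1] < b_l[0] and b_r[1] < a_r[0]:
        return "nested"        # b lies inside the loop closed by a
    if a_l[1] < b_l[0] and b_l[1] < a_r[0] and a_r[1] < b_r[0]:
        return "crossing"      # arms interleave: a5' b5' a3' b3'
    return "overlapping"       # the stacks share positions

outer = Stack(0, 30, 4)
print(relation(outer, Stack(5, 20, 4)))    # inside the loop of outer
print(relation(outer, Stack(40, 60, 4)))   # after outer
print(relation(outer, Stack(10, 45, 4)))   # interleaved with outer
```

A configuration without pseudoknots is then a set of stacks in which every pair is either nested or parallel.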

RNA Stack-based Consensus Folding (RNAscf) problem

• Find conserved stack configurations for a set of unaligned RNA sequences.

• Optimize both stability (free energy) of the structure and sequence similarity computed based on these common stacks as anchors.

RNA stack-based consensus folding for pairwise sequences

Matching stack configurations on two sequences

[Equation: cost function with the following labeled terms]
  – Weights of the different costs
  – Energy of the consensus structure
  – Sequence similarity of stacks
  – Sequence similarity of unpaired regions

RNA Stack-based Consensus Folding for multiple sequences

Cost function for multiple sequences

A1,1 A1,2 A1,3 A1,4 A1,5 A1,6 ... A1,k-2 A1,k-1 A1,k
A2,1 A2,2 A2,3 A2,4 A2,5 A2,6 ... A2,k-2 A2,k-1 A2,k
...
As,1 As,2 As,3 As,4 As,5 As,6 ... As,k-2 As,k-1 As,k

Compute an optimal stack configuration for two sequences

• A dynamic programming algorithm is used to align RNA sequences and find an optimal configuration at the same time.
  – The algorithm is similar to prior work (Sankoff 1985, Bafna et al. 1995).
  – Differences:
      • We use stacks as the basic structural elements.
      • Prior work used individual base pairs.
  – The computational time is O(n^4), where n is the number of stacks.
      • Sankoff’s algorithm is O(m^6), where m is the length of the sequences.
      • The number of possible stacks (size >= 4) is much smaller than the length of the sequence.
      • It is much faster.
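
Candidate stacks of size >= 4 can be enumerated before the dynamic program runs. The sketch below is one plausible way to do that, assuming Watson-Crick pairs only and a minimum hairpin loop of three bases; `find_stacks`, `min_len`, and `min_loop` are hypothetical names and are not claimed to be how RNAscf builds its stack list.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def find_stacks(seq, min_len=4, min_loop=3):
    """Return maximal stacks as (start, end, length): start+t pairs with end-t."""
    n = len(seq)

    def pairs(i, j):
        return 0 <= i < j < n and (seq[i], seq[j]) in PAIRS

    stacks = []
    for i in range(n):
        for j in range(i + 1, n):
            # Keep only outermost pairs, so each maximal stack is reported once.
            if not pairs(i, j) or pairs(i - 1, j + 1):
                continue
            length = 0
            while (pairs(i + length, j - length)
                   and (j - length) - (i + length) - 1 >= min_loop):
                length += 1
            if length >= min_len:
                stacks.append((i, j, length))
    return stacks

print(find_stacks("GGGGAAAACCCC"))
```

On the toy hairpin above, only the single four-pair stack survives the length filter, which illustrates why the DP over stacks works on far fewer elements than a DP over individual bases.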

For any pair of stacks, there are three choices:

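[Figure: the three cases for a matched pair of stacks PA and PB]
  – hairpin loop: PA and PB close the loops Loop(PA) and Loop(PB)
  – interior loop/bulge: PA and PB enclose a further matched pair of stacks PX and PY
  – multi-loop: PA and PB enclose several stacks, P1A ... PiA and P1B ... PjB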

The score of matching stacks:

[Figure: matched stacks PA and PB]

The score of matching hairpin loops:

[Figure: stacks PA and PB closing the hairpin loops Loop(PA) and Loop(PB)]

The score of matching interior loops or bulges:

[Figure: stacks PA and PB enclosing stacks PX and PY, with loop regions Loop(PX,PA) and Loop(PY,PA)]

The score of matching two multi-loops:

[Figure: stacks PA and PB enclosing stacks P1A ... PiA and P1B ... PjB, with loop regions Loop(Pi,PA) and Loop(Pi,PB)]

Consensus folding for multiple sequences

• We use a heuristic method based on the notion of star-alignment.
  – Compute an optimal configuration from a random seed pair.
  – Align all individual sequences to this configuration.
  – Choose the conserved stack configuration in all sequences.
  – Allow some stacks to be partially conserved (appearing in at least a certain fraction of the sequences).
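
The last step, keeping stacks that are only partially conserved, amounts to a frequency filter. In this sketch the stack identifiers, the `matches` data, and the `min_fraction` parameter are all hypothetical; the slides do not say how the fraction is chosen.

```python
from collections import Counter

def conserved_stacks(matches, min_fraction):
    """matches: one set per sequence holding the seed stacks found in that sequence."""
    counts = Counter(stack for found in matches for stack in found)
    needed = min_fraction * len(matches)
    return sorted(stack for stack, count in counts.items() if count >= needed)

# Hypothetical result of aligning four sequences to a seed configuration s1, s2, s3.
matches = [{"s1", "s2", "s3"}, {"s1", "s2"}, {"s1", "s2"}, {"s1"}]

print(conserved_stacks(matches, 1.0))    # stacks present in every sequence
print(conserved_stacks(matches, 0.75))   # stacks present in at least 3 of 4 sequences
```

Lowering the fraction admits stacks such as `s2` that one sequence lacks, at the risk of also admitting stacks that are not part of the true consensus.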

Compute the stack configuration for multiple sequences: RNAscf(k,h,f)

Iterative procedure for RNAscf

1. P = RNAscf(k, h, f).
2. In each sequence, extract the unpaired regions according to the loop regions in P.
3. Predict additional putative stacks that do not cross P, using smaller k’ and h’.
4. Recompute the alignment with the additional putative stacks using RNAscf(k’, h’, f).
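
Step 2 of the procedure can be sketched as follows, assuming each stack in P is given for one sequence as a (start, end, length) triple in which position start+t pairs with end−t. The function name `unpaired_regions` and this coordinate convention are assumptions for illustration.

```python
def unpaired_regions(seq_len, stacks):
    """Return inclusive (lo, hi) intervals of positions not covered by any stack arm."""
    paired = set()
    for start, end, length in stacks:
        paired.update(range(start, start + length))      # 5' arm
        paired.update(range(end - length + 1, end + 1))  # 3' arm
    regions, lo = [], None
    for pos in range(seq_len + 1):  # one extra step closes a trailing region
        free = pos < seq_len and pos not in paired
        if free and lo is None:
            lo = pos
        elif not free and lo is not None:
            regions.append((lo, pos - 1))
            lo = None
    return regions

print(unpaired_regions(20, [(2, 15, 3)]))
```

Each returned interval is a place where step 3 could search for additional, shorter stacks.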

Test dataset

• We choose a set of 12 RNA families from the Rfam database:
  – 20 sequences with annotated structures are chosen from each family (10 sequences for CRE and glmS).
  – There are 953 stacks.
  – We compare RNAscf with 3 other programs that are available online for RNA folding:
      • RNAfold (energy-based minimization) (Hofacker 2003)
      • COVE (covariance model) (Eddy and Durbin 1994). COVE needs a starting seed alignment, which is produced by ClustalW.
      • comRNA (computing anchors in multiple sequences) (Ji, Xu and Stormo 2004)
  – Sensitivity: the fraction of true stacks that overlap with predicted stacks.
  – Accuracy: the fraction of predicted stacks that overlap with true stacks.
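
The two measures can be computed as below. The slides do not define "overlap" precisely, so this sketch assumes two stacks overlap when they share at least one base pair; stacks are (start, end, length) triples and all function names and example stacks are hypothetical.

```python
def overlaps(a, b):
    """True if stacks a and b share at least one base pair (an assumed definition)."""
    pairs_a = {(a[0] + t, a[1] - t) for t in range(a[2])}
    pairs_b = {(b[0] + t, b[1] - t) for t in range(b[2])}
    return bool(pairs_a & pairs_b)

def sensitivity_and_accuracy(true_stacks, predicted_stacks):
    hit_true = sum(any(overlaps(t, p) for p in predicted_stacks) for t in true_stacks)
    hit_pred = sum(any(overlaps(p, t) for t in true_stacks) for p in predicted_stacks)
    sensitivity = hit_true / len(true_stacks) if true_stacks else 0.0
    accuracy = hit_pred / len(predicted_stacks) if predicted_stacks else 0.0
    return sensitivity, accuracy

true_stacks = [(0, 30, 4), (5, 20, 4)]
predicted = [(1, 29, 3)]  # recovers part of the first true stack only
print(sensitivity_and_accuracy(true_stacks, predicted))
```

In this example one of two true stacks is found and the only prediction is correct, so a method can score well on accuracy while still missing structure.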

Test results

[Figure: sensitivity (y-axis, 0 to 1) of RNAfold, COVE, comRNA, and RNAscf for each family: 5s_rRNA, CRE_220(*), ctRNA_236(+), glmS(*), hammer_3, intron_II(+), lysine, purine, sam_ribo, thiamine(+), tRNA, ykok_element]

Test results

[Figure: accuracy (y-axis, 0 to 1) of RNAfold, COVE, comRNA, and RNAscf for each family: 5s_rRNA, CRE_220(*), ctRNA_236(+), glmS(*), hammer_3, intron_II(+), lysine, purine, sam_ribo, thiamine(+), tRNA, ykok_element]

Performance improves when the number of sequences increases

[Figure: sensitivity and accuracy (y-axis, 0.8 to 0.94) versus number of input sequences (x-axis, 0 to 80), using the Thiamine riboswitch subfamily (RF00059)]

RNAscf always finds the right consensus stack configuration.

(SAM riboswitch (RF00162))

Conclusion and future work

• RNAscf is a valid approach to RNA consensus structure prediction.
  – Use stack configuration to represent RNA secondary structure.
  – Propose a dynamic programming algorithm to find the optimal stack configuration for pairwise sequences.
  – Use both primary sequence information and energy information.
  – Use a star-alignment-like heuristic method to get the consensus structure for multiple sequences.

Conclusion

• There is a signal due to covarying mutations that is a good predictor of RNA structure.

• Can RNAscf scores be used as a statistic to discover ncRNA in ‘unaligned’ sequences?

• How good are sequence-based alignments? Do they preserve structure?
  – Not for diverged families
  – Possibly for orthologous regions

ncRNA discovery for specific families

Case study: miRNA

• dsRNA and siRNA can be used to silence genes in mammalian tissue culture.

• miRNA is a new member of this class of endogenous interfering RNA.

• RNA interference (RNAi) is a powerful new technique to study gene function.

Case Study: miRNA

• ncRNA ~22 nt in length

• Pairs to sites within the 3’ UTR, specifying translational repression.

• Similar to siRNA (involved in RNAi)

• Unlike siRNA, miRNA does not need perfect base complementarity

• No computational techniques to predict miRNA

• Most predictions are based on cloning small RNAs from size-fractionated samples

miRNA (vs. siRNA)

• Derived from transcripts that form local hairpin structures.

• Sequences of the precursor and of the processed miRNA are evolutionarily conserved.

• Usually distinct, and distant, from other genes.

• siRNA (by contrast):
  – Not evolutionarily conserved
  – Correspond to sequences of known or predicted mRNAs, transposons, or regions of heterochromatic DNA.

MiRscan

• Predicts miRNA.

• Start with evolutionarily conserved regions. Ex: C. elegans and C. briggsae

• 36000 hairpins were found (including 50/53 known miRNA).

• 50 known miRNA were used to train and score the 36000 hairpins.

Computational identification of miRNA

• 7 features are scored:
  1. miRNA base-pairing
  2. Base-pairing of the rest of the fold-back
  3. Stringent sequence conservation in the 5’ end of the fold-back
  4. Sequence conservation in the 3’ end of the fold-back
  5. Sequence bias in the first 5 bases of the miRNA
  6. Tendency to form symmetric internal loops
  7. Presence of 2-9 consensus base-pairs between the miRNA and the terminal loop region

• Red: conserved with C. briggsae

• Blue: varying residues that maintain their predicted paired or unpaired states

MiRscan scoring

• 35 previously unannotated hairpins exceeded the median score.

Molecular identification of miRNA

• Initial cloning and sequencing identified 300 clones representing 54 unique miRNA.

• A 10-fold scale-up of the procedure identified 3423 clones as miRNA. These contain 77 distinct miRNA genes.

• 77 − 54 = 23 novel miRNAs found.

• 20 were scored by MiRscan (yellow). 10 were among the top 35.

MiRscan results

• 35 predictions
  – 10 identified with a high-throughput screen (sequencing of 3423 clones)
  – 6 identified using a PCR assay
  – 4 identified as false positives (PCR hybridized to larger ncRNAs)
  – 15 unknown

• Evolutionary conservation is important for ncRNA detection
  – >97% of all miRNA had significant conservation between C. briggsae and C. elegans