Assignment 2: Papers read for this assignment Paper 1: PALMA: mRNA to Genome Alignments using Large Margin Algorithms Paper 2: Optimal spliced alignments

  • View

  • Download

Embed Size (px)

Text of Assignment 2: Papers read for this assignment Paper 1: PALMA: mRNA to Genome Alignments using Large...

Assignment 2

Assignment 2:

Papers read for this assignmentPaper 1: PALMA: mRNA to Genome Alignments using Large Margin Algorithms

Paper 2: Optimal spliced alignments of short sequence reads

Badil Elhady, Michael Chan

focus onmotivationprinciple resultsconclusionsPaper 1: PALMA: mRNA to Genome Alignments using Large Margin Algorithms

MotivationQuestion for the study?The correct alignment of mRNA sequences to genomic DNA is still a challenging task. ( Due to the presence of sequencing errors, micro-exons, alternative splicing)

MethodSplice Site PredictionSVM with large margin, decided under convex optimizationIntron Length ModelDynamic Programming is used to maximize the scoring function, leading to Optimal Alignment. (Smith-Waterman Alignments with Intron Model)

This leads to:Tuning the parameters of scoring function leads to.A larger score Other alignment would score lower

Accurately differentiates the exon-intron boundariesCompartmentalize the local alignment of EST.

Claim:Robust to mutations, insertions and deletions, as well as noise levels in accurately identifying intron boundaries as well as boundaries of the optimal local alignment.

Slice site predictionexpressed sequence tag or EST is a short sub-sequence of a transcribed cDNA sequence. They may be used to identify gene transcripts, and are instrumental in gene discovery and gene sequence determination. Splice Site PredictionsFrom a set of ETS, sequences were extracted of confirmed donor and acceptor slice sites.To recognize acceptor and donor slice sites, 2 SVM classifiers were trained. Using weighted degree kernel.kernel computes the similarity between sequences s and s.

The main idea of the algorithm is to compute a local alignment by determining the maximum over all alignments of all prefixes SE (1 : i) :=(SE(1), . . . , SE(i))SD(1 : j) := (SD(1), . . . , SD(j)) SE EST SequencesSD DNA SequencesRunning time is O(m*n*L)m length of SE n length of SD

Smith-Waterman does not distinguish between exons and introns.

Intron Length ModelScoring Function

In generalizing the Smith-Waterman algorithm by including an intron model taking splice site predictions as well as intron length into account.

The information is then used to optimize the parameters used for alignment.Smith-Waterman Alignments with Intron ModelSplice site prediction assisting paramsFor splice site predictionSet 1: genes (w/o full length cDNA) with EST confirmedSet 2: random subset for training large marginSet 3: parameter tuningSet 4: final evaluation

Experimental setupEvaluating PALMA vs. exalin, sim4, and blat.

Alignment of mRNA seq. artificially shorting the middle exon (3-50)nt as shown.

Artificially generating the data : as a control to know exactly what the correct alignment has to be. Add varying amounts of noise (p 0 ,1 ,5 and 10% of random mutations, deletions or insertions) to the query sequence.Replace a part of the DNA or mRNA sequence at its terminal ends with random sequence leading to a shortened correct alignment.

Experimental setup cont.This (2 setps )allows us to determine how well the methods perform infinding the correct local alignment including its boundaries.PALMA vs. exalin, sim4, and blat.

Add noisePALMA vs. exalin, sim4, and blat.

Varying lengthsPALMAs performance is essentially unaffected by the length of the exon (at least in thenoise-free case)ConclusionMotivation: high sensitivity detection of short exons in the midst of noise.Principles Splice Site PredictionIntron Length Modelmaximize scoring function, for Optimal Alignment.Results: PALMA detects short exons while exalin, blat, etc, are unsuccessful

Further Topicsvmatchsvmconvex optimization

Paper 2: Optimal spliced alignments of short sequence reads

SituationNGS has short length and inherent high error rate even compared to Sanger. It is fast but the accuracy?

Many methods are efficient and accurate if the sequence blocks (exons) are sufficiently long and are highly similar to the genomic sequence.

Reads from NG sequencing techniques do not have either of 2 properties.MotivationObjective to be able to accurately align the sequence reads over intron boundaries.

QPALMA takes the reads quality information as well as computational splice site predictions to compute accurate spliced alignments.

Next generation (NG) sequencing technologies are able to generate huge amounts of DNA sequence reads at a fraction of the cost of Sanger sequencing.

NG limitations concerning the read length and the rate of sequencing errors.The work on the previous paper proposed a method taking advantage of splice site predictions (Schulze et al., 2007).Instead of assembling the sequences before aligning them to the genome, one first aligns the single reads to the genome and then merges the alignments to infer gene structures.

PrinciplesLearn, in a supervised manner, how to score quality information, splice site predictions and sequence identity based on a representative set of sequence reads with known alignments.

Extended Smith-Waterman algorithm: Extension 1: Quality ScoresExtension 2: Splice SitesExtension 3: Non-affine Intron Length Model

Consider all genomic sequence reads overlapping with 2 consecutive exon boundaries.

Splice site predictionNeed to know acceptor and donor splice sites as well as suitable decoy sequences.

Extension 1: Quality ScoresExtension 2: Splice SitesExtension 3: Non-affine Intron Length Model

Extension 1: Quality Scoressame computational complexity as the original SmithWaterman algorithm (O(mn)). However, it uses a more complex scoring that may depend on the sequencing technology used.

Constant Extension 2: Splice Sites(O(mnL)) operations, where L is the maximal length of the intron.The idea is to maintain an additional recurrence matrix W used to keep track of the intron boundaries.

Extension 2: Splice Sites, cont

go and ge are the intron opening and extension scoresThe gdon (i) and gacc (i) gacc(i) are scoring functions for splice sites at position i in the sequence facc (i) := facc(gacc (i)) and fdon (i) := fdon(gdon (i)).Extension 3: Non-affine Intron Length ModelHere is scoring the intron length with an arbitrary function

Recurrence can be implemented as follows

Extension 3: Non-affine Intron Length ModelFor long introns this approach seem computationally infeasible.

An alignment pipeline against whole genomes!!! optimal alignments is time consuming => use vmatch(multi-step approach on enhanced suffix arrays) + high quality splice site detectionvmatch (1st round) finds global alignments of all short reads (max 2 mismatches) against the genome to identify large fraction of unspliced reads.If there are reads that cannot be aligned (leftover reads) spliced or low quality readsYet, there is possibility that the boundary of the reads are the spliced sitesCheck with QPALMA scoring function as a filter to quickly decide whether the read is spliced or not. all combinations of putative donor splice sites within the read and acceptor splice sites 2000 nt downstream of the read, and all combinations of putative acceptor splice sites within the read and donor splice sites 2000 nt upstream of the read.

[Optional]An alignment pipeline against whole genomes[Optional]An alignment pipeline against whole genomesleftover reads + spliced (predicted to be by QPALMA) used as seeds for vmatch (2nd round) and localize the splice sites with a windowResults

ConclusionMotivation: NGS is inaccurate. Principles 3 extentions to PALMAVmatch pipelining, for boundary precision.Results: lower errorQPALMA + vmatch pipelining = PALMA + 3extentions {SVM, large marigin}References A Tutorial on Support Vector Machines for Pattern RecognitionNCBI National Center forBiotechnology Information BioInfoBank Library Throughput Short Read Alignment via Bi-directional BWT

Mich_a___el__chan Badil_el ha dy