31
Assignment 2: Papers read for this assignment Paper 1: PALMA: mRNA to Genome Alignments using Large Margin Algorithms Paper 2: Optimal spliced alignments of short sequence reads Badil Elhady, Michael Chan

Assignment 2: Papers read for this assignment

  • Upload
    clancy

  • View
    63

  • Download
    0

Embed Size (px)

DESCRIPTION

Assignment 2: Papers read for this assignment. Paper 1: PALMA: mRNA to Genome Alignments using Large Margin Algorithms Paper 2: Optimal spliced alignments of short sequence reads. Badil Elhady, Michael Chan . Paper 1: PALMA: mRNA to Genome Alignments using Large Margin Algorithms. - PowerPoint PPT Presentation

Citation preview

Page 1: Assignment 2:  Papers read for this assignment

Assignment 2:

Papers read for this assignment

• Paper 1: PALMA: mRNA to Genome Alignments using Large Margin Algorithms

• Paper 2: Optimal spliced alignments of short sequence reads

• Badil Elhady, Michael Chan

Page 2: Assignment 2:  Papers read for this assignment

Paper 1: PALMA: mRNA to Genome Alignments using Large Margin Algorithms

Page 3: Assignment 2:  Papers read for this assignment

Motivation• Question for the study?

– The correct alignment of mRNA sequences to genomic DNA is still a challenging task. ( Due to the presence of sequencing errors, micro-exons, alternative splicing)

Page 4: Assignment 2:  Papers read for this assignment

Method• Splice Site Prediction

– SVM with large margin, decided under convex optimization• Intron Length Model• Dynamic Programming is used to maximize the scoring

function, leading to Optimal Alignment. (Smith-Waterman Alignments with Intron Model)

• This leads to:– Tuning the parameters of scoring function leads to.

• A larger score • Other alignment would score lower

Page 5: Assignment 2:  Papers read for this assignment

– Accurately differentiates the exon-intron boundaries– Compartmentalize the local alignment of EST.

– Claim:• Robust to mutations, insertions and deletions, as well as noise

levels in accurately identifying intron boundaries as well as boundaries of the optimal local alignment.

Slice site prediction

Page 6: Assignment 2:  Papers read for this assignment

Splice Site Predictions• From a set of ETS, sequences were

extracted of confirmed donor and acceptor slice sites.

• To recognize acceptor and donor slice sites, 2 SVM classifiers were trained. Using “ weighted degree ” kernel.

• kernel computes the similarity between sequences s and s’.

Page 7: Assignment 2:  Papers read for this assignment

• The main idea of the algorithm is to compute a local alignment by determining the maximum over all alignments of all prefixes– SE (1 : i) :=(SE(1), . . . , SE(i))– SD(1 : j) := (SD(1), . . . , SD(j))

» SE EST Sequences» SD DNA Sequences

– Running time is O(m*n*L)» m length of SE » n length of SD

» Smith-Waterman does not distinguish between exons and introns.

Intron Length Model

Page 8: Assignment 2:  Papers read for this assignment

Scoring Function

Page 9: Assignment 2:  Papers read for this assignment

• In generalizing the Smith-Waterman algorithm by including an intron model taking splice site predictions as well as intron length into account.

• The information is then used to optimize the parameters used for alignment.

Smith-Waterman Alignments with Intron Model

Splice site prediction assisting params

Page 10: Assignment 2:  Papers read for this assignment

Experimental setup• Evaluating PALMA vs. exalin, sim4, and blat.

• Alignment of mRNA seq. artificially shorting the middle exon (3-50)nt as shown.

Page 11: Assignment 2:  Papers read for this assignment

• Artificially generating the data : as a control to know exactly what the correct alignment has to be.

• Add varying amounts of noise (p ¼ 0 ,1 ,5 and 10% of random mutations, deletions or insertions) to the query sequence.

• Replace a part of the DNA or mRNA sequence at its terminal ends with random sequence leading to a shortened correct alignment.

Experimental setup cont.

Page 12: Assignment 2:  Papers read for this assignment

PALMA vs.

exalin, sim4, and blat.Add noise

Page 13: Assignment 2:  Papers read for this assignment

PALMA vs.

exalin, sim4, and blat.Varying lengths

Page 14: Assignment 2:  Papers read for this assignment

Conclusion– Motivation: high sensitivity detection of short

exons in the midst of noise.– Principles

• Splice Site Prediction• Intron Length Model• maximize scoring function, for Optimal Alignment.

– Results: PALMA detects short exons while exalin, blat, etc, are unsuccessful

Page 15: Assignment 2:  Papers read for this assignment

Further Topics• vmatch• svm• convex optimization

Page 16: Assignment 2:  Papers read for this assignment

Paper 2: Optimal spliced alignments of short sequence reads

Page 17: Assignment 2:  Papers read for this assignment

Situation• NGS has short length and inherent high error rate even

compared to Sanger. It is fast but the accuracy?

• Many methods are efficient and accurate if the sequence blocks (exons) are sufficiently long and are highly similar to the genomic sequence.

• Reads from NG sequencing techniques do not have either of 2 properties.

Page 18: Assignment 2:  Papers read for this assignment

Motivation• Objective to be able to accurately align the sequence

reads over intron boundaries.

• QPALMA takes the read’s quality information as well as computational splice site predictions to compute accurate spliced alignments.

Page 19: Assignment 2:  Papers read for this assignment

Principles• Learn, in a supervised manner, how to score quality

information, splice site predictions and sequence identity based on a representative set of sequence reads with known alignments.

• Extended Smith-Waterman algorithm: • Extension 1: Quality Scores• Extension 2: Splice Sites• Extension 3: Non-affine Intron Length Model

Page 20: Assignment 2:  Papers read for this assignment

1) Splice site prediction• Need to know acceptor and donor splice

sites as well as suitable decoy sequences.

• Extension 1: Quality Scores• Extension 2: Splice Sites• Extension 3: Non-affine Intron Length Model

Page 21: Assignment 2:  Papers read for this assignment

Extension 1: Quality Scores• same computational complexity as the original

Smith–Waterman algorithm (O(mn)). However, it uses a more complex scoring that may depend on the sequencing technology used.

Constant

Page 22: Assignment 2:  Papers read for this assignment

Extension 2: Splice Sites

• (O(mnL)) operations, where L is the maximal length of the intron.• The idea is to maintain an additional recurrence matrix W used to keep track

of the intron boundaries.

Page 23: Assignment 2:  Papers read for this assignment

• Recurrence can be implemented as follows

Extension 3: Non-affine Intron Length Model

• For long introns this approach seem computationally infeasible.

Page 24: Assignment 2:  Papers read for this assignment
Page 25: Assignment 2:  Papers read for this assignment

An alignment pipeline against whole genomes• !!! optimal alignments is time consuming • => use vmatch(multi-step approach on

enhanced suffix arrays) + high quality splice site detection

Page 26: Assignment 2:  Papers read for this assignment

• vmatch (1st round) finds global alignments of all short reads (max 2 mismatches) against the genome to identify large fraction of unspliced reads.– If there are reads that cannot be aligned (leftover reads) – spliced or

low quality reads• Yet, there is possibility that the boundary of the reads are the

spliced sites– Check with QPALMA scoring function as a filter to quickly decide

whether the read is spliced or not.– all combinations of putative donor splice sites within the read

and acceptor splice sites ≤2000 nt downstream of the read, and– all combinations of putative acceptor splice sites within the read

and donor splice sites ≤2000 nt upstream of the read.

[Optional]An alignment pipeline against whole genomes

Page 27: Assignment 2:  Papers read for this assignment

[Optional]An alignment pipeline against whole genomes

• leftover reads + spliced (predicted to be by QPALMA) used as seeds for vmatch (2nd round) and localize the splice sites with a ‘window’

Page 28: Assignment 2:  Papers read for this assignment

Results

Page 29: Assignment 2:  Papers read for this assignment

Conclusion– Motivation: NGS is inaccurate. – Principles

• 3 extentions to PALMA• Vmatch pipelining, for boundary precision.

– Results: lower error

• QPALMA + vmatch pipelining = PALMA + 3extentions – {SVM, large marigin}

Page 30: Assignment 2:  Papers read for this assignment

References • A Tutorial on Support Vector Machines for

Pattern Recognition• NCBI National Center for

Biotechnology Information http://www.ncbi.nlm.nih.gov/About/primer/genetics_genome.html

• BioInfoBank Library http://lib.bioinfo.pl/• High Throughput Short Read Alignment via Bi-

directional BWT

Page 31: Assignment 2:  Papers read for this assignment

Mich_a___el__chan Badil_el ha dy