BNFO 602 Lecture 2 Usman Roshan


Page 1: BNFO 602 Lecture 2

BNFO 602 Lecture 2

Usman Roshan

Page 2: BNFO 602 Lecture 2

DNA Sequence Evolution

[Figure: evolutionary tree of DNA sequences, by time level]

-3 mil yrs: AAGACTT

-2 mil yrs: AAGGCTT, T_GACTT

-1 mil yrs: _GGGCTT, TAGACCTT, A_CACTT

today: GGCTT (Mouse), TAGGCCTT (Human), TAGCCCTTA (Monkey), ACACTTC (Lion), ACCTT (Cat)

The same present-day sequences with gaps shown:

TAGGCCTT (Human)

TAGCCCTTA (Monkey)

A_C_CTT (Cat)

A_CACTTC (Lion)

_G_GCTT (Mouse)

Page 3: BNFO 602 Lecture 2

Sequence alignments

They tell us about

• Function or activity of a new gene/protein

• Structure or shape of a new protein

• Location or preferred location of a protein

• Stability of a gene or protein

• Origin of a gene or protein

• Origin or phylogeny of an organelle

• Origin or phylogeny of an organism

• And more…

Page 4: BNFO 602 Lecture 2

Pairwise sequence alignment

• How to align two sequences?

Page 5: BNFO 602 Lecture 2

Pairwise alignment

• How to align two sequences?

• We use dynamic programming

• Treat DNA sequences as strings over the alphabet {A, C, G, T}

Page 6: BNFO 602 Lecture 2

Pairwise alignment

Page 7: BNFO 602 Lecture 2

Dynamic programming

Define V(i,j) to be the optimal pairwise alignment score between S[1..i] and T[1..j] (|S|=m, |T|=n)

Page 8: BNFO 602 Lecture 2

Dynamic programming

Time and space complexity is O(mn)

Define V(i,j) to be the optimal pairwise alignment score between S[1..i] and T[1..j] (|S|=m, |T|=n)

Page 10: BNFO 602 Lecture 2

How do we pick gap parameters?

Page 11: BNFO 602 Lecture 2

Structural alignments

• Recall that proteins have 3-D structure.

Page 12: BNFO 602 Lecture 2

Structural alignment - example 1

Alignment of thioredoxins from human and fly taken from the Wikipedia website. This protein is found in nearly all organisms and is essential for mammals.

PDB ids are 3TRX and 1XWC.

Page 13: BNFO 602 Lecture 2

Structural alignment - example 2

Computer-generated aligned proteins

Unaligned proteins. 2bbm and 1top are proteins from fly and chicken respectively.

Taken from http://bioinfo3d.cs.tau.ac.il/Align/FlexProt/flexprot.html

Page 14: BNFO 602 Lecture 2

Structural alignments

• We can produce high-quality alignments by hand if the structure is available.

• These alignments can then serve as a benchmark to train gap parameters so that the alignment program produces correct alignments.

Page 15: BNFO 602 Lecture 2

Benchmark alignments

• Protein alignment benchmarks

  – BAliBASE, SABMARK, PREFAB, HOMSTRAD are frequently used in studies for protein alignment.

  – Protein benchmarks are generally large and have been in the research community for some time now.

  – BAliBASE 3.0

Page 16: BNFO 602 Lecture 2

Biologically realistic scoring matrices

• PAM and BLOSUM are most popular

• PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations between 71 families of closely related proteins

• BLOSUM is more recent and computed from blocks of sequences with sufficient similarity

Page 17: BNFO 602 Lecture 2

PAM

• We need to compute the probability transition matrix M which defines the probability of amino acid i converting to j

• Examine a set of closely related sequences which are easy to align; for PAM, 1572 mutations between 71 families

• Compute probabilities of change and background probabilities by simple counting
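
The "simple counting" step can be sketched as follows. This is a simplified illustration, not the actual PAM construction: it assumes the input is a list of already-aligned sequence pairs, counts each aligned pair in both directions, and omits the normalization steps used to build the real PAM matrices. The function name `substitution_probabilities` is hypothetical.

```python
from collections import Counter

def substitution_probabilities(aligned_pairs, gap="_"):
    """Estimate background frequencies and a transition matrix M by counting.

    aligned_pairs: list of (s, t) strings of equal length (already aligned).
    Simplified sketch: the real PAM procedure includes further normalization.
    """
    pair_counts = Counter()
    residue_counts = Counter()
    for s, t in aligned_pairs:
        for a, b in zip(s, t):
            if a == gap or b == gap:
                continue  # ignore columns that contain a gap
            # Count the pair in both directions so the counts are symmetric.
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
            residue_counts[a] += 1
            residue_counts[b] += 1
    total = sum(residue_counts.values())
    # Background probability of each residue.
    background = {a: c / total for a, c in residue_counts.items()}
    # M[a][b]: fraction of aligned occurrences of a that are paired with b.
    M = {
        a: {b: pair_counts[(a, b)] / residue_counts[a] for b in residue_counts}
        for a in residue_counts
    }
    return background, M

background, M = substitution_probabilities([("ACGT", "ACGA")])
```

Each row of `M` sums to 1, so it can be read as the probability of residue i being paired with residue j in the training alignments.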

Page 18: BNFO 602 Lecture 2

Local alignment

• Global alignment recursions:

$$V(i,j) = \max \left\{ \begin{array}{l} V(i-1,j-1) + S(x_i, y_j) \\ V(i-1,j) + g \\ V(i,j-1) + g \end{array} \right\}$$

• Local alignment recursions:

$$V(i,j) = \max \left\{ \begin{array}{l} 0 \\ V(i-1,j-1) + S(x_i, y_j) \\ V(i-1,j) + g \\ V(i,j-1) + g \end{array} \right\}$$
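
A minimal sketch of both recursions in Python. The boundary conditions (V(i,0) = i·g and V(0,j) = j·g for global alignment, zeros for local alignment) and the match/mismatch scoring function `simple_score` are assumptions for illustration; the slides do not state them.

```python
def simple_score(a, b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def alignment_score(x, y, score, g, local=False):
    """Fill the table V with the global or local recursion and return the score.

    g is the (negative) gap penalty. Time and space are O(mn).
    """
    m, n = len(x), len(y)
    V = [[0] * (n + 1) for _ in range(m + 1)]
    if not local:
        # Assumed global boundary: a prefix aligned against nothing is all gaps.
        for i in range(1, m + 1):
            V[i][0] = i * g
        for j in range(1, n + 1):
            V[0][j] = j * g
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            v = max(
                V[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                V[i - 1][j] + g,
                V[i][j - 1] + g,
            )
            if local:
                v = max(0, v)  # the extra 0 term of the local recursion
            V[i][j] = v
            best = max(best, v)
    # Global: score of aligning all of x with all of y. Local: best cell anywhere.
    return best if local else V[m][n]

global_score = alignment_score("ACGT", "AGT", simple_score, -2)
local_score = alignment_score("GGGACGT", "ACGTCCC", simple_score, -2, local=True)
```

The only difference between the two modes is the 0 term and where the final score is read from the table.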

Page 19: BNFO 602 Lecture 2

Local alignment traceback

• Let T(i,j) be the traceback matrix and let m and n be the lengths of the input sequences.

• Global alignment traceback:

  – Begin from T(m,n) and stop at T(0,0).

• Local alignment traceback:

  – Find i*,j* such that T(i*,j*) is the maximum over all T(i,j).

  – Begin traceback from T(i*,j*) and stop when T(i,j) <= 0.
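
The local traceback can be sketched as below. As an assumption, this version keeps only the score table V and recovers each move by re-checking which term of the recursion produced the cell, rather than storing a separate traceback matrix. `local_align` and `simple_score` are hypothetical names, and gaps are written as `_` to match the slides.

```python
def simple_score(a, b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def local_align(x, y, score, g):
    """Return (best score, aligned piece of x, aligned piece of y)."""
    m, n = len(x), len(y)
    V = [[0] * (n + 1) for _ in range(m + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            V[i][j] = max(
                0,
                V[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                V[i - 1][j] + g,
                V[i][j - 1] + g,
            )
            if V[i][j] > best:  # remember the maximum cell (i*, j*)
                best, bi, bj = V[i][j], i, j
    ax, ay = [], []
    i, j = bi, bj
    # Walk back from (i*, j*) and stop at the first cell with score <= 0.
    while i > 0 and j > 0 and V[i][j] > 0:
        if V[i][j] == V[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif V[i][j] == V[i - 1][j] + g:
            ax.append(x[i - 1])
            ay.append("_")
            i -= 1
        else:
            ax.append("_")
            ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

result = local_align("GGGACGT", "ACGTCCC", simple_score, -2)
```

A global traceback would differ only in starting at cell (m,n) and continuing until (0,0).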

Page 20: BNFO 602 Lecture 2

BLAST

• Local pairwise alignment heuristic

• Faster than standard pairwise alignment programs such as SSEARCH, but less sensitive.

• Online server: http://www.ncbi.nlm.nih.gov/blast

Page 21: BNFO 602 Lecture 2

BLAST

1. Given a query q and a target sequence, find substrings of length k (k-mers) of score at least t; these are also called hits. k is normally 3 to 5 for amino acids and 12 for nucleotides.

2. Extend each hit to a locally maximal segment. Terminate the extension when the reduction in score exceeds a pre-defined threshold.

3. Report maximal segments above score S.
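
Step 2 can be illustrated with a small ungapped extension routine. This is a sketch of the idea only and not the code of the BLAST program: the function `extend_hit`, the parameter name `x_drop` for the pre-defined threshold, and the scoring function `simple_score` are all assumptions.

```python
def simple_score(a, b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def extend_hit(q, t, qi, ti, k, score, x_drop):
    """Extend the hit q[qi:qi+k] / t[ti:ti+k] in both directions without gaps.

    Returns (segment score, start in q, end in q), with end exclusive.
    """
    seed = sum(score(q[qi + d], t[ti + d]) for d in range(k))

    def extend(qpos, tpos, step):
        best = run = 0
        best_len = length = 0
        while 0 <= qpos < len(q) and 0 <= tpos < len(t):
            run += score(q[qpos], t[tpos])
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run > x_drop:
                break  # score has dropped too far below the best seen
            qpos += step
            tpos += step
        return best, best_len

    right, rlen = extend(qi + k, ti + k, 1)
    left, llen = extend(qi - 1, ti - 1, -1)
    return seed + left + right, qi - llen, qi + k + rlen

segment = extend_hit("GGACGTACCC", "TTACGTACAA", 4, 4, 3, simple_score, 2)
```

Here the 3-mer hit `GTA` at position 4 grows into the segment `ACGTAC`. Step 3 would then keep only segments whose score is above S.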

Page 22: BNFO 602 Lecture 2

Finding k-mers quickly

• Preprocess the database of sequences:

  – For each sequence in the database store all k-mers in hash-table.

  – This takes linear time

• Query sequence:

  – For each k-mer in the query sequence look up the hash table of the target to see if it exists

  – Also takes linear time
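
A minimal sketch of the two phases using a Python dictionary as the hash table. It assumes exact k-mer matches only (no score threshold t), and the names `build_kmer_index` and `find_hits` are hypothetical.

```python
from collections import defaultdict

def build_kmer_index(sequences, k):
    """Preprocess: map every k-mer to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index

def find_hits(query, index, k):
    """Query: look up each k-mer of the query in the index."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, tpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, name, tpos))
    return hits

database = {"s1": "ACGTAC", "s2": "TTACG"}
index = build_kmer_index(database, 3)
hits = find_hits("TACG", index, 3)
```

Both loops visit each sequence position once, which is where the linear running time comes from (treating k as a constant and hash lookups as constant time).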

Page 23: BNFO 602 Lecture 2

Profile-sequence alignment

• Given a family alignment, how can we align it to a sequence?

• First, we compute a profile of the alignment.

• We then align the profile to the sequence using standard dynamic programming.

• However, we need to describe how to align a profile vector to a nucleotide or residue.

Page 24: BNFO 602 Lecture 2

Profile

• A profile can be described by a set of vectors of nucleotide/residue frequencies.

• For each position i of the alignment, we compute the normalized frequency of nucleotides A, C, G, and T
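
A short sketch of computing such a profile. As an assumption, gap characters (`_`) are left out of the normalization, and a column made only of gaps gets all-zero frequencies; the function name `profile` is hypothetical.

```python
def profile(alignment, alphabet="ACGT"):
    """Return one frequency vector (a dict) per alignment column."""
    columns = []
    for col in zip(*alignment):  # iterate over columns of the alignment
        counts = {a: col.count(a) for a in alphabet}
        total = sum(counts.values())  # gaps are not counted
        columns.append({a: (counts[a] / total if total else 0.0) for a in alphabet})
    return columns

p = profile(["ACG_", "ACT_", "A_T_"])
```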

Page 25: BNFO 602 Lecture 2

Aligning a profile vector to a nucleotide

• ClustalW/MUSCLE

  – Let f be the profile vector

  – $\mathrm{Score}(f,j) = \sum_{i \in \{A,C,G,T\}} f_i \, S(i,j)$

  – where S(i,j) is the substitution scoring matrix
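
The score is a frequency-weighted average of substitution scores, which a few lines of Python make concrete. The profile vector is assumed to be a dict from nucleotide to frequency, and `simple_score` is a hypothetical stand-in for the substitution matrix S.

```python
def simple_score(a, b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def profile_score(f, j, S):
    """Score(f, j) = sum over i of f_i * S(i, j)."""
    return sum(f[i] * S(i, j) for i in f)

f = {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}
score_a = profile_score(f, "A", simple_score)  # 0.5*1 + 0.5*(-1) = 0.0
score_g = profile_score(f, "G", simple_score)  # 0.5*(-1) + 0.5*(-1) = -1.0
```

This function can take the place of S(x_i, y_j) in the pairwise dynamic programming recursion when one side of the alignment is a profile.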

Page 26: BNFO 602 Lecture 2

Multiple sequence alignment

• “Two sequences whisper, multiple sequences shout out loud” (Arthur Lesk)

• Computationally very hard (NP-hard)

Page 27: BNFO 602 Lecture 2

Formally…

Page 28: BNFO 602 Lecture 2

Multiple sequence alignment

Unaligned sequences

GGCTT

TAGGCCTT

TAGCCCTTA

ACACTTC

ACCTT

Aligned sequences

_G__GCTT_

TAGGCCTT_

TAGCCCTTA

A__CACTTC

A__C_CTT_

Conserved regions help us to identify functionality

Page 29: BNFO 602 Lecture 2

Sum of pairs score

Page 30: BNFO 602 Lecture 2

Sum of pairs score

• What is the sum of pairs score of this alignment?
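
The sum-of-pairs score adds up the pairwise alignment scores induced by every pair of rows. A minimal sketch, assuming a simple match/mismatch score, a fixed gap penalty per gapped position, and zero cost for columns where both rows have a gap (all of these are illustrative choices, and the example alignment is made up):

```python
from itertools import combinations

def simple_score(a, b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def sum_of_pairs(alignment, score, gap_penalty, gap="_"):
    """Sum the induced pairwise scores over all pairs of rows."""
    total = 0
    for r1, r2 in combinations(alignment, 2):
        for a, b in zip(r1, r2):
            if a == gap and b == gap:
                continue  # assumed: gap-gap columns contribute nothing
            if a == gap or b == gap:
                total += gap_penalty
            else:
                total += score(a, b)
    return total

sp = sum_of_pairs(["AC_", "ACG", "A_G"], simple_score, -2)
```

For k rows of length L this takes O(k²L) time, since every pair of rows is compared column by column.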

Page 31: BNFO 602 Lecture 2

Iterative alignment (heuristic for sum-of-pairs)

• Pick a random sequence from input set S

• Do (n-1) pairwise alignments and align to closest one t in S

• Remove t from S and compute profile of alignment

• While sequences remaining in S

  – Do |S| pairwise alignments and align to closest one t

  – Remove t from S

Page 32: BNFO 602 Lecture 2

Iterative alignment

• Once the alignment is computed, randomly divide it into two parts

• Compute the profile of each sub-alignment and realign the profiles

• If the sum-of-pairs score of the new alignment is better than the previous one, keep it; otherwise continue with a different division until a specified iteration limit is reached

Page 33: BNFO 602 Lecture 2

Progressive alignment

• Idea: perform profile alignments in the order dictated by a tree

• Given a guide tree, do a post-order traversal and align sequences in that order

• Widely used heuristic
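
The order of profile alignments implied by a post-order traversal can be sketched as follows. The guide tree is assumed to be a nested tuple with sequence names at the leaves, the example topology is made up for illustration, and the sketch only lists which groups would be aligned at each step rather than performing the alignments.

```python
def merge_order(tree, merges=None):
    """Post-order traversal of a binary guide tree.

    Returns (leaves under this node, list of merges so far), where each merge
    is a pair (left group, right group) to be aligned profile-to-profile.
    """
    if merges is None:
        merges = []
    if isinstance(tree, str):
        return [tree], merges  # a leaf is a single sequence
    left, _ = merge_order(tree[0], merges)
    right, _ = merge_order(tree[1], merges)
    merges.append((left, right))  # children are done, so align their results
    return left + right, merges

# Hypothetical guide tree over the five example species.
guide_tree = (("Human", "Monkey"), ("Mouse", ("Cat", "Lion")))
leaves, merges = merge_order(guide_tree)
```

Each internal node is visited only after both of its subtrees, so every alignment step combines two groups that have already been aligned internally.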

Page 34: BNFO 602 Lecture 2

Popular alignment programs

• ClustalW: most popular, progressive alignment

• MUSCLE: fast and accurate, progressive and iterative combination

• T-COFFEE: slow but accurate, consistency based alignment (align sequences in multiple alignment to be close to the optimal pairwise alignment)

• PROBCONS: slow but highly accurate, probabilistic consistency progressive based scheme

• DIALIGN: very good for local alignments

Page 35: BNFO 602 Lecture 2

MUSCLE

Page 36: BNFO 602 Lecture 2

MUSCLE

Page 37: BNFO 602 Lecture 2

Evaluation of multiple sequence alignments

• Compare to benchmark “true” alignments

• Use simulation

• Measure conservation of an alignment

• Measure accuracy of phylogenetic trees

• How well does it align motifs?

• More…

Page 38: BNFO 602 Lecture 2

Comparison of alignments on BAliBASE