66
Bioinformatics Algorithms David Hoksza http://siret.ms.mff.cuni.cz/hoksza Multiple Sequence Alignment

Bioinformatics Algorithmssiret.ms.mff.cuni.cz/.../bioinformatics/lecture05_msa.pdfOutline •Motivation •Scoring functions •Algorithms •exhaustive •multidimensional dynamic

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

  • Bioinformatics Algorithms

    David Hoksza

    http://siret.ms.mff.cuni.cz/hoksza

    Multiple Sequence Alignment

  • Outline

    • Motivation

    • Scoring functions

    • Algorithms

    • exhaustive • multidimensional dynamic programming

    • heuristics• progressive alignment

    • iterative alignment/refinement

    • block(local)-based alignment

  • Multiple sequence alignment (MSA)

    • Goal of MSA is to find “optimal” mapping of a set of sequences

    • Homologous residues (originating in the same position in a common ancestor) among a set of sequences are aligned together in columns

    • Usually employs multiple pairwise alignment (PA) computations to reveal the evolutionarily equivalent positions across all sequences

  • Motivation

    • Distant homologues• faint similarity can become apparent when present in many sequences

    • motifs might not be apparent from pairwise alignment only

    • Detection of key functional residues• amino acids critical for function tend to be conserved during the evolution and

    therefore can be revealed by inspecting sequences within given family

    • Prediction of secondary/tertiary structure

    • Inferring evolutionary history

    4

  • Representation of MSA

    • Column-based representation

    • Profile representation (position specific scoring matrix)

    • Sequence logo

  • Manual MSA

    • High quality MSA can be carried out by hand using expert knowledge• specific columns

    • highly conserved residues• buried hydrophobic residues

    • secondary structure (especially in RNA alignment)

    • expected patterns of insertions and deletions

    • Tedious, but• high-quality source of family

    information• a benchmark for evaluation of

    automatic MSA algorithms

    • BAliBASE• https://lbgi.fr/balibase/

    • PROSITE• http://prosite.expasy.org/

    • Pfam• http://pfam.sanger.ac.uk/

    • TIGRFAM• http://www.jcvi.org/cgi-

    bin/tigrfams/index.cgi

    • … (some databases are semi-automatic and many of the databases construct the MSA from the structure information)

    http://bips.u-strasbg.fr/en/Products/Databases/BAliBASE/https://lbgi.fr/balibase/http://prosite.expasy.org/http://pfam.sanger.ac.uk/http://pfam.sanger.ac.uk/http://www.jcvi.org/cgi-bin/tigrfams/index.cgihttp://www.jcvi.org/cgi-bin/tigrfams/index.cgi

  • Scoring

    • How to score an MSA?

    𝑺 𝑨 = 𝑮 +𝑪𝑺(𝑨𝒊)

    • 𝐴𝑖 … 𝑖-th column• 𝐶𝑆(𝐴𝑖) … score of the 𝑖-th column• 𝐺 … gap function (assumes linear or constant gap penalty)

    • the score assumes independent columns

    • Two score types are usually considered• minimum entropy (ME)• sum of pairs (SP)

  • Minimum entropy (1)

    • ME aims to minimize entropy of each column

    • columns with low entropy (can be expressed with only few bits) are good for the alignment

    • the more bits we need to express a column, the more divers the column is

  • Minimum entropy (2)

    • Probability of a column • assumption of independency between columns and residues within columns

    𝑷 𝑨𝒊 = ෑ

    𝒂

    𝒑𝒊𝒂𝒄𝒊𝒂

    • 𝑐𝑖𝑎…observed counts for residue 𝑎 in 𝑖-th column 𝑐𝑖𝑎 = σ𝑗 ൝0 𝐴𝑖 [𝑗] ≠ 𝑎

    1 𝐴𝑖 𝑗 = 𝑎

    • 𝐴𝑖 [𝑗]… 𝑗-th symbol in 𝑖-th column

    • 𝑝𝑖𝑎… probability of residue 𝑎 in column 𝑖

    𝑪𝑴𝑬 𝑨𝒊 = −

    𝒂

    𝒄𝒊𝒂 𝐥𝐨𝐠𝒑𝒊𝒂 𝑴𝑬 =

    𝒊

    𝑪𝑴𝑬(𝑨𝒊)

    • completely conserved column would score 0

  • Sum of pairs

    • Sum of scores of all possible pairs in a multiple alignment 𝑨 for a particular scoring matrix

    • Score for each column is computed as the sum of all pairs of position in that column

    • Column scores are then summed to get the SP-score

    𝑆𝑃 𝐴 =

    𝑖=1

    |𝐴|

    𝐶𝑆𝑃 𝐴𝑖 =

    𝑖=1

    |𝐴|

    𝑘

  • SP - Example G K NT R N

    S H E

    -1 +1 +6 6• BLOSUM 62 scoring matrix

  • SP score drawback• Alignment of 𝑵 sequences, all containing leucine at given position from

    functional reasons

    • BLOSUM62 matrix 𝝈 𝑳, 𝑳 = 𝟒 → 𝑺𝑷 𝑨𝒊 = 𝟒 × 𝑵(𝑵 − 𝟏)/𝟐

    • Let us replace one of the leucines with glycine (incorrect alignment)

    𝝈 𝑳, 𝑮 = −𝟒 → the score decreases by 𝟖 × (𝑵 − 𝟏)

    • 𝑺𝑷 𝑨𝒊 is worse by a fraction of 8×(𝑁−1)

    4×𝑁(𝑁−1)/2=

    𝟒

    𝑵

    • Relative difference in score between the correct alignment and incorrectalignment decreases with the number of sequences in the alignment

    • BUT increasing the number of sequences (evidence) should give us more increased relative difference

  • Multidimensional dynamic programming (1)

    • Generalization of pairwise dynamic programming

    • 3 sequences: ATGC, AATC,TTGC

    • Resulting path • (0,0,0) → (1,1,0) → (1,2,1) → (2,3,2) → (3,3,3) → (4,4,4)

    A - T G C

    0 1 1 2 3 4

    A A T - C

    0 1 2 3 3 4

    - T T G C

    0 0 1 2 3 4

    x coordinate

    y coordinate

    z coordinate

  • Multidimensional dynamic programming (2)

    • Let us assume linear gap penalty model (not affine)

    • 𝛾 𝑔 = 𝑔𝑑 for a gap of length 𝑔 and gap cost 𝑑

    • initialization and backtracking are analogous with the 2D case

  • Multidimensional dynamic programming (3)

    • 7 edges• 3 edges

  • Computational complexity of MDP

    • Computation of each cell of the DP matrix takes 𝟐𝑵 − 𝟏 (all possible combinations of gaps column)

    • Let us assume all the sequences have approximately the same length 𝑳

    • Memory complexity

    𝑶 𝑳𝑵

    • Time complexity

    𝑶 𝟐𝑵𝑳𝑵

  • MDP - exercise

    • Let’s have sequence of length 50

    • Comparison of a pair of sequences using DP takes 0,1s

    • What is the time needed to compare 4 sequences?

    • Let’s say we have 1000 years and average sequence length is 50.

    • How many sequence can afford to compare?

  • Heuristic Algorithms

    • Progressive alignment methods• iterative building of the alignment

    • Feng & Doolittle

    • ClustalW, Clustal Omega

    • Consistency-based methods• T-Coffee

    • Iterative refinement• alignment built and then refined be

    realigning the constituent sequences• Barton & Sternberg

    • Block-based alignment• local alignment built by identifying

    blocks of ungapped MSA identified and assembled• DIALIGN

    • Mix of approaches• MAFFT, MUSCLE

  • Progressive alignment

    • Framework• First, two sequences are aligned using standard pairwise alignment

    • The remaining sequences are taken one by one and aligned to the previous ones

    • Repeated until all sequences are aligned

    • Parameters• The order in which the sequences are be aligned

    • Whether only one alignment is kept and sequences are added to it or whether also an alignment can be aligned to another alignment (as if a tree was being built)

    • The process used to align and score sequences or alignments against the existing ones

  • Star alignment

    • N sequences 𝒔𝟏, … , 𝒔𝑵 to be aligned

    1. Pick 𝒔𝒊 as a starting sequence – center

    2. Compute all optimal global alignments between 𝒔𝒊 and 𝒔𝒋, 𝑗 ≠ 𝑖

    3. Successively merge sequences into the arising MSA• once a gap always a gap rule

    • if a gap is introduced into the MSA it stays there forever

  • SA – example (1)

    S1: ATTGCCATT

    S2: ATGGCCATT

    S3: ATCCAATTTT

    S4: ATCTTCTT

    S5: ATTGCCGATT

    credit: Xingquan Zhu, Florida Atlantic University

    ATTGCCATT

    ATTGCCATT

    ATGGCCATT

    ATTGCCATT--

    ATC-CAATTTT

    ATTGCCATT

    ATCTTC-TT

    ATTGCC-ATT

    ATTGCCGATT

    ATTGCC-ATT--

    ATGGCC-ATT--

    ATTGCCGATT--

    ATCTTC--TT--

    ATC-CA-ATTTT

  • SA – example (2)

    pairwise alignment multiple alignment

    1.ATTGCCATT

    ATGGCCATT

    ATTGCCATT

    ATGGCCATT

    2.ATTGCCATT--

    ATC-CAATTTT

    ATTGCCATT--

    ATGGCCATT--

    ATC-CAATTTT

    3.ATTGCCATT

    ATCTTC-TT

    ATTGCCATT--

    ATGGCCATT--

    ATC-CAATTTT

    ATCTTC-TT--

    4.ATTGCC-ATT

    ATTGCCGATT

    ATTGCC-ATT--

    ATGGCC-ATT--

    ATC-CA-ATTTT

    ATCTTC--TT--

    ATTGCCGATT--

  • SA - choosing the center

    • Compute all pairwise alignment and pick sequence 𝒔𝒊 with maximum σ𝒋≠𝒊 𝒔(𝒔𝒊, 𝒔𝒋)• Choosing the sequence which is most similar to all the rest

    • Compute all pairwise alignments and compute MSA for every 𝒔𝒊 and pick the best

  • SA – time complexity

    • Average sequence length 𝐿

    • One global alignment computation in 𝐎(𝑳𝟐)

    • 𝑘 sequences → 𝐎(𝒌𝟐𝑳𝟐) pairwise computations

    • 𝑙 … upper bound on the MSA length → 𝐎(𝒍𝒌) for MSA construction

    𝑂 𝑘2𝐿2 + 𝑙𝑘 = 𝑶(𝒌𝟐𝑳𝟐)

  • SA - exercise

    • Compute SP for the constructed MSA

    • Compute SA for the previous example but add sequences to the MSA in different order. Does the order of addition impacts the score?

    • Compute MSA starting with S5. Does the score change?

    ATTGCC-ATT

    ATTGCCGATT

    ATGGCC-ATT

    ATTGCCGATT

    AT--CCAATTTT

    ATTGCCGATT--

    AT--CTTCTT

    ATTGCCGATT

  • Feng & Doolittle (1)

    1. Calculate a distance matrix from all-to-all pairwise alignments (𝑁(𝑁 − 1)/2)

    2. Convert raw alignment scores into (evolutionary) distances

    • 𝐷 = − log 𝑆𝑒𝑓𝑓 × 100 = − log𝑆𝑜𝑏𝑠−𝑆𝑟𝑎𝑛𝑑

    𝑆𝑚𝑎𝑥−𝑆𝑟𝑎𝑛𝑑× 100

    3. Construct a guide tree from the distance matrix using Fitch & Margoliash algorithm

    4. Align child nodes of each parent (can be sequence-sequence, sequence-MSA, MSA-MSA) in the order they were added to the tree

    • 𝑆𝑚𝑎𝑥 𝑎, 𝑏 =𝑆 𝑎,𝑎 +𝑆 𝑏,𝑏

    2

    • 𝑆𝑟𝑎𝑛𝑑 is an expected score obtained by randomization

    • 𝑆𝑒𝑓𝑓 can be viewed as normalized

    percentage similarity which

    decreases roughly exponentially to

    0 with increasing evolutionary

    distance.

    • –log makes the measure linear with

    evolutionary distance

    source: Feng, Da-Fei, and Russell F. Doolittle. "Progressive sequence alignment as a prerequisitetto correct phylogenetic trees." Journal of molecular evolution 25.4 (1987): 351-360.

  • Feng & Doolittle (2)

    • Sequence-sequence is aligned using classical dynamic programming

    • Sequence-MSA – sequence is aligned with each sequence in the group and the highest scoring alignment defines how the sequence is added to the group

    • MSA-MSA – as in previous case but all pairs of sequences are tested

    • When a sequence is added to a group, neutral symbol X is introduced instead of the gap position• allows to align gap positions

    • neutral – anything aligned with X scores 0

    • side effect – the gaps in two MSAs tend to come together in the resulting MSA

  • Profile/MSA Alignment

    • When adding a sequence to a group it is desirable to take into account the MSA built so far• mismatches at highly conserved positions should be penalized more

    • 2 MSA (profiles) of 𝑁 sequences, one from 1. . 𝑛, second 𝑛 + 1. . 𝑁

    𝒊

    𝑺 𝑨 𝒊 =

    𝒊

    𝒌

  • ClustalW

    • Similar to Feng & Doolittle but uses profile-based building

    1. Calculate matrix from all-to-all pairwise alignments (𝑁(𝑁 − 1)/2)

    2. Convert raw alignment similarity scores into evolutionary distances

    3. Construct a guided tree from the distance matrix using neighbor joining algorithm

    4. Progressively align the nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment

  • ClustalW - Alignment

    standard global

    alignment

    column-based

    global alignment

  • ClustalW – heuristics (1)

    • Weighting of subsequences to compensate for biased representation in large subfamilies – compensates defects of sum-of-pairs scoring• sequence contributions to the MSA are weighted by their relationships in the

    predicted evolutionary tree

    • Closely related sequences BLOSUM 80 – distant sequences BLOSUM 50

    • Position-specific gap open profile penalties multiplied by a modifier being function of the residues observed at the position• comes from structure-based alignments

  • ClustalW – heuristics (2)

    • Gap penalties are higher if there are no gaps at given position but some exist at nearby positions• force the gaps to be in the same places

    • Guide tree can be adjusted on the fly to postpone aligning low-similar sequences up to the point where more information is present

    • …

  • Clustal Omega

    • ClustalW does not scale well for a big number (thousands) of sequences

    • Bottleneck is the guide tree construction → 𝑂(𝑀 × 𝑁2)

    • Clustal Omega algorithm• mBed algorithm-based for guide tree construction

    • emBedding of each sequence in a space of 𝑛 dimensions where n is proportional to log N• Each sequence replaced by an 𝑛 element vector, where each element is the distance to one of 𝑛 reference sequences

    • Clustering of the vectors by UPGMA or K-means• Alignments of profiles using hidden Markov models (HHalign package)

    • Additional features• Adding sequences to existing alignments• External profiles - HMM profile from sequences homologous to the input set which can be used

    in MSA construction

    33

  • Progressive alignment drawbacks

    • Drawback of the progressive methods is the greedy nature of the algorithm• once an error, always an error

    • Solutions• Iterative approach

    • Tries to correct mistakes in the initial alignment … which might happen easily if pairwise sequence similarity is too low

    • Consistency-based approach• Tries to avoid mistakes in advance

    34

  • T-Coffee

    • A progressive alignment with the ability to consider information from all of the sequences during each alignment step, not just those being aligned at that stage (consistency)

    • Tackles ClustalW drawback

  • T-Coffee algorithm (1)

    1. Primary library generation• Generate primary libraries of sequence alignments (using several methods)• For each pair compute weights based on percent identity (shorter sequence)• Merge the primary libraries – pairs of aligned residues get a weight equal to the

    sum of the weights

    2. Extended library generation• For each pair of sequences try to align them using an intermediate sequence• Score each pair of residues by the lower of the two weights and sum the weights

    for that residue pair over all triplets

    3. Progressive alignment• Guided tree is used to build the multiple alignment• A pair of sequences is aligned using dynamic programming with weights based on

    the extended library• When aligning two sets of sequences, averaged library weights are used

  • T-Coffee algorithm (2)

    77+88

    = 165

  • Iterative refinement

    • Once an error, always an error• when a sequence is added to a MSA it cannot be changed later on (holds also for

    consistency-based approaches)

    • dependence on the initial alignment

    • Idea behind iterative refinement methods• optimal solution can be obtained by iterative improvement of suboptimal solutions

    • First, a suboptimal solution is identified by a fast heuristic method and then possibly improved by iterative removal and re-addition of the sequences

  • Barton-Sternberg

    1. Find the highest scoring pair of sequences and align them using the pairwise dynamic programming

    2. Identify most similar remaining sequence with respect to the existing profile and align it using sequence-profile alignment

    3. Repeat step 2 until all sequences are aligned

    4. Remove sequence 𝒔𝟏 and realign it to the profile by profile-sequence alignment. Repeat for 𝒔𝟐, … , 𝒔𝑵

    5. Repeat step 4 for a fixed number of times or until convergence

  • Block-based alignment

    • Both progressive and iterative methods assume that all parts of all sequences are consistent with a global alignment• not every position in the alignment has to correspond to a homologous site in all

    sequences

    • Block-based solution approach to the problem of global alignment by • splitting sequences into blocks

    • aligning the blocks

    • merging block alignments

  • DIALIGN

    • Alignment constructed from gap-free local alignments between pairs of sequences• based on diagonals in the dynamic programming matrix

    • Procedure• align all possible pairs of sequences• determine all diagonals and assign weights

    • for a diagonal 𝐷 of length 𝑙, score 𝒔 is obtained from substitution matrix• determine length-independent weight as 𝒘 𝑫 = − 𝐥𝐨𝐠𝑷(𝒍, 𝒔), where 𝑷(𝒍, 𝒔) is the

    probability that diagonal of sequence of length 𝒍 will have score at least 𝒔

    • build MSA by adding consistent diagonals in order of decreasing weight (and overlap with other diagonals)

    • explore unaligned regions and include them if possible

  • DIALIGN - Example

    atctaatagttaaactcccccgtgcttag

    cagtgcgtgtattactaacggttcaatcgcg

    caaagagtatcacccctgaattgaataa

  • atctaatagttaaactcccccgtgcttag

    cagtgcgtgtattactaacggttcaatcgcg

    caaagagtatcacccctgaattgaataa

    DIALIGN – Example

  • atctaatagttaaactcccccgtgcttag

    cagtgcgtgtattactaacggttcaatcgcg

    caaagagtatcacccctgaattgaataa

    DIALIGN – Example

  • atctaatagttaaactcccccgtgcttag

    cagtgcgtgtattactaacggttcaatcgcg

    caaagagtatcacccctgaattgaataa

    DIALIGN – Example

  • atctaatagttaaactcccccgtgcttag

    cagtgcgtgtattactaacggttcaatcgcg

    caaagagtatcacccctgaattgaataa

    DIALIGN – Example

  • atctaatagttaaactcccccgtgcttag

    cagtgcgtgtattactaacggttcaatcgcg

    caaagagtatcacccctgaattgaataa

    DIALIGN – Example

  • atc------taatagttaaactcccccgtgcttag

    cagtgcgtgtattactaacggttcaatcgcg

    caaagagtatcacccctgaattgaataa

    DIALIGN – Example

  • atc------taatagttaaactcccccgtgcttag

    cagtgcgtgtattactaacggttcaatcgcg

    caaa--gagtatcacccctgaattgaataa

    DIALIGN – Example

  • atc------taatagttaaactcccccgtgcttag

    cagtgcgtgtattactaacggttcaatcgcg

    caaa--gagtatcacc----------cctgaattgaataa

    DIALIGN – Example

  • atc------taatagttaaactcccccgtgc-ttag

    cagtgcgtgtattactaac----------gg-ttcaatcgcg

    caaa--gagtatcacc----------cctgaattgaataa

    DIALIGN – Example

  • atc------TAATAGTTAaactccccCGTGC-TTag

    cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg

    caaa--GAGTATCAcc----------CCTGaaTTGAATaa

    DIALIGN – Example

  • MAFFT

    • Multiple Alignment using Fast Fourier Transform

    • 2-cycle progressive method + iterative refinement• Fast low-quality all pairwise distances

    • Tentative MSA construction

    • Refinement of distances based on MSA

    • Second progressive alignment stage

    • Iterative refinement

    • Reducing CPU time• 6-mer method for pairwise alignments

    • FFT-based group-to-group alignment algorithm

    53

    FFT-NS-1

    FFT-NS-2

    FFT-NS-i

  • MAFFT – guide tree construction

    • Sequences and groups are progressively aligned using FFT-based alignment (next slides) in the order given by a guide tree

    • To quickly obtain the all-to-all distance matrix• 20 AAs grouped into 6 physico-chemical groups

    • Number of shared 6-tuples 𝑇𝑖𝑗 is computed and turned into 𝐷𝑖𝑗• 𝐷𝑖𝑗 = 1 − [𝑇𝑖𝑗/min(𝑇𝑖𝑖 , 𝑇𝑗𝑗)]

    • UPGMA method is used to obtain the guide tree from 𝐷𝑖𝑗

    54

  • MAFFT – FFT-based group-to-group alignment (1)

    • AA sequence converted to a sequence of vectors (signals) of normalized volume ( ො𝑣 𝑎 = 𝑣 𝑎 − 𝑣 /𝜎𝑣) and polarity ( Ƹ𝑝 𝑎 = 𝑝 𝑎 − 𝑝 /𝜎𝑝)

    • Correlation between 2 AA sequences is assessed using• 𝑐 𝑘 = 𝑐𝑣 𝑘 + 𝑐𝑝 𝑘

    • 𝑐𝑥 𝑘 = σ1≤𝑛≤𝑁,1≤𝑛+𝑘≤𝑀ෞ𝑥1(𝑛)ෞ𝑥2(𝑛 + 𝑘) , 𝑥 ∈ {𝑣, 𝑝}• ෝ𝑥𝑖(𝑗) … 𝑥 component (𝑣 or 𝑝) of 𝑗-th site in sequence 𝑖

    • If 𝑋𝑖(𝑚) are FFT of ෝ𝑥𝑖(𝑛), then, by cross-correlation theorem, 𝑐𝑥 𝑘 ֞𝑋1∗ 𝑚 ∗ 𝑋2 𝑚

    • FFT reduces the 𝑂(𝑛2) time to 𝑂(𝑛 log 𝑛)

    55

    • Correlation has high peaks when the two compared sequences have high similarity regions offset by the lags• sliding window of given size is used to reveal homologous regions

    (sequence identity in the window is measured)• Successive homologous sequences are combined

  • MAFFT – FFT-based group-to-group alignment (2)

    • Matrix 𝑺𝒊𝒋 (1 ≤ 𝑖, 𝑗 ≤ 𝑛) is constructed with values corresponding to the scores of the 𝒏 identified homologous segments (0 if (𝑖, 𝑗) is not homologous pair)

    • Optimal path through 𝑺 is identified

    • DP matrix of the sequences is then computed, where the optimal path must go through the centers of the segments of optimal path in 𝑆, thus restricting the number of elements which need to be visited

    56

  • MAFFT – FFT-based group-to-group alignment (3)

    • Group-to-group alignment is extension of the approach by considering groups as linear combinations of the volume/polar components of the groups

    • ො𝑥𝑔𝑟𝑜𝑢𝑝𝑖 𝑛 = σ𝑗∈𝑔𝑟𝑜𝑢𝑝𝑖𝑤𝑗 ො𝑥𝑗(𝑛) , 𝑥 ∈ 𝑣, 𝑝 , 𝑖 ∈ {1,2}

    • 𝑤𝑗 is weighting factor for sequence 𝑗 calculated as in CLUSTALW in the progressive stage, and as in (*) in the iterative stage

    • For nucleotide sequences, the 2D vectors consisting of polar/volume components are replaced by 4D vector of A, C, G, T components

    57(*) Gotoh, O. (1995). A weighting system and algorithm for aligning many phylogenetically related sequences. Bioinformatics, 11(5), 543-551.

  • MAFFT – iterative refinement

    • Alignment divided into two groups and realigned• Tree-dependent restricted partitioning *

    • A tree-dependent, restricted partitioning technique to efficiently reduce the execution time of iterative algorithms

    • Repeated until no better alignment is obtained

    58(*) Hirosawa, M., Totoki, Y., Hoshida, M., & Ishikawa, M. (1995). Comprehensive study on iterative algorithms of multiple sequence alignment. Bioinformatics, 11(1), 13-18.

  • MAFFT – improvements

    • Version 5 • Consistency-base scoring - new objective function to reveal distant homologues

    applied to the iterative refinement stage

    • TCoffee-like approach of incorporation of all pairwise alignment information into the objective function

    • Computed from all-to-all pairwise alignments before constructing MSA

    • Summation of weighted sum-of-pairs score

    • Dropped re-construction phase of the guide tree

    • Version 6• New tree-building algorithm, PartTree, for handling larger numbers of sequences

    • multiple ncRNA alignment framework incorporating structural information

    59

  • MUSCLE

    • MUltiple Sequence Comparison by Log-Expectation

    • Stages1. Draft progressive

    2. Improved progressive

    3. Refinement

    61

  • MUSCLE details

    • k-mer distance• Fraction of common k-mers in a pair of sequences

    • Possibly on compressed alphabet (similar residues, e.g. hydrophobic, get the same letter)

    • Approximates well fraction of common residues in global alignment

    • Kimura distance (correction)• Computed from fractional identity of sequences 𝐷 which is good approximation for closely related

    sequences• exact if positions are allowed to mutate only ones → multiple mutations on single site (more distant sequences)

    require correction

    • 𝑑𝑘𝑖𝑚𝑢𝑟𝑎 = − log𝑒(1 − 𝐷 −𝐷2

    5)

    • Stage 3 refinement• Choose edge (go from leafs to root) from TREE2 and delete it• Build profile for MSA of the resulting tree, re-align and accept change if it led to an improvement• Iterate until convergence

    62

  • Comparison of MSA algorithms - benchmark

    • Benchmarking → guidelines

    • Commonly used dataset for benchmarking MSA algorithms is BAliBASE• high quality manually refined reference alignments based on 3D structural

    superpositions

    63cases with small numbers of

    equidistant

    sequences,

    and was further

    subdivided by

    percent identity

    families with one

    or more “orphan”

    sequences

    divergent

    subfamilies, with

    less than 25%

    identity between

    the groups

    sequences with

    large terminal

    extensions

    sequences with

    large internal

    insertions and

    deletions

    sequences with

    repeats,

    transmembrane

    regions, and

    inverted domains,

    respectively

    families with linear

    motifs often found

    in disordered

    regions that are

    difficult to align

    sequences with

    subfamily specific

    features, motifs in

    disordered regions

    and

    fragmentary/erron

    eous sequences

    http://www.lbgi.fr/balibase/

  • Comparison of main algorithms

    64

    source: Pais, F. S. M., de Cássia Ruy, P., Oliveira, G., & Coimbra, R. S. (2014). Assessing the efficiency of multiple sequence alignment programs. Algorithms for Molecular Biology, 9(1), 4.

    SP: sum of pairs

    TC: based on number of columns aligned 100% correctly

  • Guidelines - accuracy• No MSA program outperformed all others in all test cases

    • For the R1-5 sets T-Coffee, Probcons, MAFFT and Probalign were superior with regard to alignment accuracy• When aligning available short versions of the sequences (BBS), Probcons and T-Coffee outperformed Probalign

    and MAFFT• statistically significant superiority of Probcons and T-Coffee in comparison to Probalign and MAFFT in R1 & R2

    • When aligning full-length proteins (BB) of R1, -3 and R5, which represent more difficult test cases, and also R4, where large terminal extensions are present, Probalign, MAFFT and, surprisingly, CLUSTAL OMEGA, generally outperformed both Probcons and T-Coffee• T-Cofee and Probcons worked great when aligning truncated sequences but did not do that well for datasets with

    long N/C terminal ends due to presence of non-conserved residues at terminal ends • MAFFT, Probalign and even CLUSTAL OMEGA may be preferred over T-Coffee and Probcons when aligning sequences with

    these long terminal extensions

    • Contradicting performance of CLUSTAL OMEGA - performed well in three reference R3-5 sets with full-length sequences but not with the short versions

    • CLUSTALW, DIALIGN-TX and POA had Z-scores below the average in almost all test cases from the first five reference sets

    65

    source: Pais, F. S. M., de Cássia Ruy, P., Oliveira, G., & Coimbra, R. S. (2014). Assessing the efficiency of multiple sequence alignment programs. Algorithms for Molecular Biology, 9(1), 4.

  • Guidelines – computational costs

    • Speed• CLUSTALW and MUSCLE were the fastest of the evaluated programs

    • T-Coffee and MAFFT deliver fast alignments in multi-core environment

    • Probcons and Probalign exceeded the 2.5 hours cutoff in the last three subsets from Reference 9

    • Memory• CLUSTALW consumed least memory

    • T-Coffee running in single-core mode - results indicated that the program consumed generally more RAM than the others and was also the slowest

    66

    source: Pais, F. S. M., de Cássia Ruy, P., Oliveira, G., & Coimbra, R. S. (2014). Assessing the efficiency of multiple sequence alignment programs. Algorithms for Molecular Biology, 9(1), 4.

  • PhyloDNA Puzzles

    67