Bioinformatics Algorithmssiret.ms.mff.cuni.cz/.../bioinformatics/lecture05_msa.pdfOutline •Motivation •Scoring functions •Algorithms •exhaustive •multidimensional dynamic

Bioinformatics Algorithms

David Hoksza

http://siret.ms.mff.cuni.cz/hoksza

Multiple Sequence Alignment

Outline

• Motivation

• Scoring functions

• Algorithms

• exhaustive • multidimensional dynamic programming

• heuristics• progressive alignment

• iterative alignment/refinement

• block(local)-based alignment

Multiple sequence alignment (MSA)

• Goal of MSA is to find “optimal” mapping of a set of sequences

• Homologous residues (originating in the same position in a common ancestor) among a set of sequences are aligned together in columns

• Usually employs multiple pairwise alignment (PA) computations to reveal the evolutionarily equivalent positions across all sequences

Motivation

• Distant homologues• faint similarity can become apparent when present in many sequences

• motifs might not be apparent from pairwise alignment only

• Detection of key functional residues• amino acids critical for function tend to be conserved during the evolution and

therefore can be revealed by inspecting sequences within given family

• Prediction of secondary/tertiary structure

• Inferring evolutionary history

4

Representation of MSA

• Column-based representation

• Profile representation (position specific scoring matrix)

• Sequence logo

Manual MSA

• High quality MSA can be carried out by hand using expert knowledge• specific columns

• highly conserved residues• buried hydrophobic residues

• secondary structure (especially in RNA alignment)

• expected patterns of insertions and deletions

• Tedious, but• high-quality source of family

information• a benchmark for evaluation of

automatic MSA algorithms

• BAliBASE• https://lbgi.fr/balibase/

• PROSITE• http://prosite.expasy.org/

• Pfam• http://pfam.sanger.ac.uk/

• TIGRFAM• http://www.jcvi.org/cgi-

bin/tigrfams/index.cgi

• … (some databases are semi-automatic and many of the databases construct the MSA from the structure information)

http://bips.u-strasbg.fr/en/Products/Databases/BAliBASE/https://lbgi.fr/balibase/http://prosite.expasy.org/http://pfam.sanger.ac.uk/http://pfam.sanger.ac.uk/http://www.jcvi.org/cgi-bin/tigrfams/index.cgihttp://www.jcvi.org/cgi-bin/tigrfams/index.cgi

Scoring

• How to score an MSA?

𝑺 𝑨 = 𝑮 +𝑪𝑺(𝑨𝒊)

• 𝐴𝑖 … 𝑖-th column• 𝐶𝑆(𝐴𝑖) … score of the 𝑖-th column• 𝐺 … gap function (assumes linear or constant gap penalty)

• the score assumes independent columns

• Two score types are usually considered• minimum entropy (ME)• sum of pairs (SP)

Minimum entropy (1)

• ME aims to minimize entropy of each column

• columns with low entropy (can be expressed with only few bits) are good for the alignment

• the more bits we need to express a column, the more divers the column is

Minimum entropy (2)

• Probability of a column • assumption of independency between columns and residues within columns

𝑷 𝑨𝒊 = ෑ

𝒂

𝒑𝒊𝒂𝒄𝒊𝒂

• 𝑐𝑖𝑎…observed counts for residue 𝑎 in 𝑖-th column 𝑐𝑖𝑎 = σ𝑗 ൝0 𝐴𝑖 [𝑗] ≠ 𝑎

1 𝐴𝑖 𝑗 = 𝑎

• 𝐴𝑖 [𝑗]… 𝑗-th symbol in 𝑖-th column

• 𝑝𝑖𝑎… probability of residue 𝑎 in column 𝑖

𝑪𝑴𝑬 𝑨𝒊 = −

𝒂

𝒄𝒊𝒂 𝐥𝐨𝐠𝒑𝒊𝒂 𝑴𝑬 =

𝒊

𝑪𝑴𝑬(𝑨𝒊)

• completely conserved column would score 0

Sum of pairs

• Sum of scores of all possible pairs in a multiple alignment 𝑨 for a particular scoring matrix

• Score for each column is computed as the sum of all pairs of position in that column

• Column scores are then summed to get the SP-score

𝑆𝑃 𝐴 =

𝑖=1

|𝐴|

𝐶𝑆𝑃 𝐴𝑖 =

𝑖=1

|𝐴|

𝑘

SP - Example G K NT R N

S H E

-1 +1 +6 6• BLOSUM 62 scoring matrix

SP score drawback• Alignment of 𝑵 sequences, all containing leucine at given position from

functional reasons

• BLOSUM62 matrix 𝝈 𝑳, 𝑳 = 𝟒 → 𝑺𝑷 𝑨𝒊 = 𝟒 × 𝑵(𝑵 − 𝟏)/𝟐

• Let us replace one of the leucines with glycine (incorrect alignment)

𝝈 𝑳, 𝑮 = −𝟒 → the score decreases by 𝟖 × (𝑵 − 𝟏)

• 𝑺𝑷 𝑨𝒊 is worse by a fraction of 8×(𝑁−1)

4×𝑁(𝑁−1)/2=

𝟒

𝑵

• Relative difference in score between the correct alignment and incorrectalignment decreases with the number of sequences in the alignment

• BUT increasing the number of sequences (evidence) should give us more increased relative difference

Multidimensional dynamic programming (1)

• Generalization of pairwise dynamic programming

• 3 sequences: ATGC, AATC,TTGC

• Resulting path • (0,0,0) → (1,1,0) → (1,2,1) → (2,3,2) → (3,3,3) → (4,4,4)

A - T G C

0 1 1 2 3 4

A A T - C

0 1 2 3 3 4

- T T G C

0 0 1 2 3 4

x coordinate

y coordinate

z coordinate


• Let us assume linear gap penalty model (not affine)

• 𝛾 𝑔 = 𝑔𝑑 for a gap of length 𝑔 and gap cost 𝑑

• initialization and backtracking are analogous with the 2D case


• 7 edges• 3 edges

Computational complexity of MDP

• Computation of each cell of the DP matrix takes 𝟐𝑵 − 𝟏 (all possible combinations of gaps column)

• Let us assume all the sequences have approximately the same length 𝑳

• Memory complexity

𝑶 𝑳𝑵

• Time complexity

𝑶 𝟐𝑵𝑳𝑵

MDP - exercise

• Let’s have sequence of length 50

• Comparison of a pair of sequences using DP takes 0,1s

• What is the time needed to compare 4 sequences?

• Let’s say we have 1000 years and average sequence length is 50.

• How many sequence can afford to compare?

Heuristic Algorithms

• Progressive alignment methods• iterative building of the alignment

• Feng & Doolittle

• ClustalW, Clustal Omega

• Consistency-based methods• T-Coffee

• Iterative refinement• alignment built and then refined be

realigning the constituent sequences• Barton & Sternberg

• Block-based alignment• local alignment built by identifying

blocks of ungapped MSA identified and assembled• DIALIGN

• Mix of approaches• MAFFT, MUSCLE

Progressive alignment

• Framework• First, two sequences are aligned using standard pairwise alignment

• The remaining sequences are taken one by one and aligned to the previous ones

• Repeated until all sequences are aligned

• Parameters• The order in which the sequences are be aligned

• Whether only one alignment is kept and sequences are added to it or whether also an alignment can be aligned to another alignment (as if a tree was being built)

• The process used to align and score sequences or alignments against the existing ones

Star alignment

• N sequences 𝒔𝟏, … , 𝒔𝑵 to be aligned

1. Pick 𝒔𝒊 as a starting sequence – center

2. Compute all optimal global alignments between 𝒔𝒊 and 𝒔𝒋, 𝑗 ≠ 𝑖

3. Successively merge sequences into the arising MSA• once a gap always a gap rule

• if a gap is introduced into the MSA it stays there forever

SA – example (1)

S1: ATTGCCATT

S2: ATGGCCATT

S3: ATCCAATTTT

S4: ATCTTCTT

S5: ATTGCCGATT

credit: Xingquan Zhu, Florida Atlantic University

ATTGCCATT

ATTGCCATT

ATGGCCATT

ATTGCCATT--

ATC-CAATTTT

ATTGCCATT

ATCTTC-TT

ATTGCC-ATT

ATTGCCGATT

ATTGCC-ATT--

ATGGCC-ATT--

ATTGCCGATT--

ATCTTC--TT--

ATC-CA-ATTTT

SA – example (2)

pairwise alignment multiple alignment

1.ATTGCCATT

ATGGCCATT

ATTGCCATT

ATGGCCATT

2.ATTGCCATT--

ATC-CAATTTT

ATTGCCATT--

ATGGCCATT--

ATC-CAATTTT

3.ATTGCCATT

ATCTTC-TT

ATTGCCATT--

ATGGCCATT--

ATC-CAATTTT

ATCTTC-TT--

4.ATTGCC-ATT

ATTGCCGATT

ATTGCC-ATT--

ATGGCC-ATT--

ATC-CA-ATTTT

ATCTTC--TT--

ATTGCCGATT--

SA - choosing the center

• Compute all pairwise alignment and pick sequence 𝒔𝒊 with maximum σ𝒋≠𝒊 𝒔(𝒔𝒊, 𝒔𝒋)• Choosing the sequence which is most similar to all the rest

• Compute all pairwise alignments and compute MSA for every 𝒔𝒊 and pick the best

SA – time complexity

• Average sequence length 𝐿

• One global alignment computation in 𝐎(𝑳𝟐)

• 𝑘 sequences → 𝐎(𝒌𝟐𝑳𝟐) pairwise computations

• 𝑙 … upper bound on the MSA length → 𝐎(𝒍𝒌) for MSA construction

𝑂 𝑘2𝐿2 + 𝑙𝑘 = 𝑶(𝒌𝟐𝑳𝟐)

SA - exercise

• Compute SP for the constructed MSA

• Compute SA for the previous example but add sequences to the MSA in different order. Does the order of addition impacts the score?

• Compute MSA starting with S5. Does the score change?

ATTGCC-ATT

ATTGCCGATT

ATGGCC-ATT

ATTGCCGATT

AT--CCAATTTT

ATTGCCGATT--

AT--CTTCTT

ATTGCCGATT

Feng & Doolittle (1)

1. Calculate a distance matrix from all-to-all pairwise alignments (𝑁(𝑁 − 1)/2)

2. Convert raw alignment scores into (evolutionary) distances

• 𝐷 = − log 𝑆𝑒𝑓𝑓 × 100 = − log𝑆𝑜𝑏𝑠−𝑆𝑟𝑎𝑛𝑑

𝑆𝑚𝑎𝑥−𝑆𝑟𝑎𝑛𝑑× 100

3. Construct a guide tree from the distance matrix using Fitch & Margoliash algorithm

4. Align child nodes of each parent (can be sequence-sequence, sequence-MSA, MSA-MSA) in the order they were added to the tree

• 𝑆𝑚𝑎𝑥 𝑎, 𝑏 =𝑆 𝑎,𝑎 +𝑆 𝑏,𝑏

2

• 𝑆𝑟𝑎𝑛𝑑 is an expected score obtained by randomization

• 𝑆𝑒𝑓𝑓 can be viewed as normalized

percentage similarity which

decreases roughly exponentially to

0 with increasing evolutionary

distance.

• –log makes the measure linear with

evolutionary distance

source: Feng, Da-Fei, and Russell F. Doolittle. "Progressive sequence alignment as a prerequisitetto correct phylogenetic trees." Journal of molecular evolution 25.4 (1987): 351-360.

Feng & Doolittle (2)

• Sequence-sequence is aligned using classical dynamic programming

• Sequence-MSA – sequence is aligned with each sequence in the group and the highest scoring alignment defines how the sequence is added to the group

• MSA-MSA – as in previous case but all pairs of sequences are tested

• When a sequence is added to a group, neutral symbol X is introduced instead of the gap position• allows to align gap positions

• neutral – anything aligned with X scores 0

• side effect – the gaps in two MSAs tend to come together in the resulting MSA

Profile/MSA Alignment

• When adding a sequence to a group it is desirable to take into account the MSA built so far• mismatches at highly conserved positions should be penalized more

• 2 MSA (profiles) of 𝑁 sequences, one from 1. . 𝑛, second 𝑛 + 1. . 𝑁

𝒊

𝑺 𝑨 𝒊 =

𝒊

𝒌

ClustalW

• Similar to Feng & Doolittle but uses profile-based building

1. Calculate matrix from all-to-all pairwise alignments (𝑁(𝑁 − 1)/2)

2. Convert raw alignment similarity scores into evolutionary distances

3. Construct a guided tree from the distance matrix using neighbor joining algorithm

4. Progressively align the nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment

ClustalW - Alignment

standard global

alignment

column-based

global alignment

ClustalW – heuristics (1)

• Weighting of subsequences to compensate for biased representation in large subfamilies – compensates defects of sum-of-pairs scoring• sequence contributions to the MSA are weighted by their relationships in the

predicted evolutionary tree

• Closely related sequences BLOSUM 80 – distant sequences BLOSUM 50

• Position-specific gap open profile penalties multiplied by a modifier being function of the residues observed at the position• comes from structure-based alignments

ClustalW – heuristics (2)

• Gap penalties are higher if there are no gaps at given position but some exist at nearby positions• force the gaps to be in the same places

• Guide tree can be adjusted on the fly to postpone aligning low-similar sequences up to the point where more information is present

• …

Clustal Omega

• ClustalW does not scale well for a big number (thousands) of sequences

• Bottleneck is the guide tree construction → 𝑂(𝑀 × 𝑁2)

• Clustal Omega algorithm• mBed algorithm-based for guide tree construction

• emBedding of each sequence in a space of 𝑛 dimensions where n is proportional to log N• Each sequence replaced by an 𝑛 element vector, where each element is the distance to one of 𝑛 reference sequences

• Clustering of the vectors by UPGMA or K-means• Alignments of profiles using hidden Markov models (HHalign package)

• Additional features• Adding sequences to existing alignments• External profiles - HMM profile from sequences homologous to the input set which can be used

in MSA construction

33

Progressive alignment drawbacks

• Drawback of the progressive methods is the greedy nature of the algorithm• once an error, always an error

• Solutions• Iterative approach

• Tries to correct mistakes in the initial alignment … which might happen easily if pairwise sequence similarity is too low

• Consistency-based approach• Tries to avoid mistakes in advance

34

T-Coffee

• A progressive alignment with the ability to consider information from all of the sequences during each alignment step, not just those being aligned at that stage (consistency)

• Tackles ClustalW drawback

T-Coffee algorithm (1)

1. Primary library generation• Generate primary libraries of sequence alignments (using several methods)• For each pair compute weights based on percent identity (shorter sequence)• Merge the primary libraries – pairs of aligned residues get a weight equal to the

sum of the weights

2. Extended library generation• For each pair of sequences try to align them using an intermediate sequence• Score each pair of residues by the lower of the two weights and sum the weights

for that residue pair over all triplets

3. Progressive alignment• Guided tree is used to build the multiple alignment• A pair of sequences is aligned using dynamic programming with weights based on

the extended library• When aligning two sets of sequences, averaged library weights are used

T-Coffee algorithm (2)

77+88

= 165

Iterative refinement

• Once an error, always an error• when a sequence is added to a MSA it cannot be changed later on (holds also for

consistency-based approaches)

• dependence on the initial alignment

• Idea behind iterative refinement methods• optimal solution can be obtained by iterative improvement of suboptimal solutions

• First, a suboptimal solution is identified by a fast heuristic method and then possibly improved by iterative removal and re-addition of the sequences

Barton-Sternberg

1. Find the highest scoring pair of sequences and align them using the pairwise dynamic programming

2. Identify most similar remaining sequence with respect to the existing profile and align it using sequence-profile alignment

3. Repeat step 2 until all sequences are aligned

4. Remove sequence 𝒔𝟏 and realign it to the profile by profile-sequence alignment. Repeat for 𝒔𝟐, … , 𝒔𝑵

5. Repeat step 4 for a fixed number of times or until convergence

Block-based alignment

• Both progressive and iterative methods assume that all parts of all sequences are consistent with a global alignment• not every position in the alignment has to correspond to a homologous site in all

sequences

• Block-based solution approach to the problem of global alignment by • splitting sequences into blocks

• aligning the blocks

• merging block alignments

DIALIGN

• Alignment constructed from gap-free local alignments between pairs of sequences• based on diagonals in the dynamic programming matrix

• Procedure• align all possible pairs of sequences• determine all diagonals and assign weights

• for a diagonal 𝐷 of length 𝑙, score 𝒔 is obtained from substitution matrix• determine length-independent weight as 𝒘 𝑫 = − 𝐥𝐨𝐠𝑷(𝒍, 𝒔), where 𝑷(𝒍, 𝒔) is the

probability that diagonal of sequence of length 𝒍 will have score at least 𝒔

• build MSA by adding consistent diagonals in order of decreasing weight (and overlap with other diagonals)

• explore unaligned regions and include them if possible

DIALIGN - Example

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

atctaatagttaaactcccccgtgcttag



DIALIGN – Example

atc------taatagttaaactcccccgtgcttag



DIALIGN – Example



caaa--gagtatcacccctgaattgaataa

DIALIGN – Example



caaa--gagtatcacc----------cctgaattgaataa

DIALIGN – Example

atc------taatagttaaactcccccgtgc-ttag

cagtgcgtgtattactaac----------gg-ttcaatcgcg

caaa--gagtatcacc----------cctgaattgaataa

DIALIGN – Example

atc------TAATAGTTAaactccccCGTGC-TTag

cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg

caaa--GAGTATCAcc----------CCTGaaTTGAATaa

DIALIGN – Example

MAFFT

• Multiple Alignment using Fast Fourier Transform

• 2-cycle progressive method + iterative refinement• Fast low-quality all pairwise distances

• Tentative MSA construction

• Refinement of distances based on MSA

• Second progressive alignment stage

• Iterative refinement

• Reducing CPU time• 6-mer method for pairwise alignments

• FFT-based group-to-group alignment algorithm

53

FFT-NS-1

FFT-NS-2

FFT-NS-i

MAFFT – guide tree construction

• Sequences and groups are progressively aligned using FFT-based alignment (next slides) in the order given by a guide tree

• To quickly obtain the all-to-all distance matrix• 20 AAs grouped into 6 physico-chemical groups

• Number of shared 6-tuples 𝑇𝑖𝑗 is computed and turned into 𝐷𝑖𝑗• 𝐷𝑖𝑗 = 1 − [𝑇𝑖𝑗/min(𝑇𝑖𝑖 , 𝑇𝑗𝑗)]

• UPGMA method is used to obtain the guide tree from 𝐷𝑖𝑗

54

MAFFT – FFT-based group-to-group alignment (1)

• AA sequence converted to a sequence of vectors (signals) of normalized volume ( ො𝑣 𝑎 = 𝑣 𝑎 − 𝑣 /𝜎𝑣) and polarity ( Ƹ𝑝 𝑎 = 𝑝 𝑎 − 𝑝 /𝜎𝑝)

• Correlation between 2 AA sequences is assessed using• 𝑐 𝑘 = 𝑐𝑣 𝑘 + 𝑐𝑝 𝑘

• 𝑐𝑥 𝑘 = σ1≤𝑛≤𝑁,1≤𝑛+𝑘≤𝑀ෞ𝑥1(𝑛)ෞ𝑥2(𝑛 + 𝑘) , 𝑥 ∈ {𝑣, 𝑝}• ෝ𝑥𝑖(𝑗) … 𝑥 component (𝑣 or 𝑝) of 𝑗-th site in sequence 𝑖

• If 𝑋𝑖(𝑚) are FFT of ෝ𝑥𝑖(𝑛), then, by cross-correlation theorem, 𝑐𝑥 𝑘 ֞𝑋1∗ 𝑚 ∗ 𝑋2 𝑚

• FFT reduces the 𝑂(𝑛2) time to 𝑂(𝑛 log 𝑛)

55

• Correlation has high peaks when the two compared sequences have high similarity regions offset by the lags• sliding window of given size is used to reveal homologous regions

(sequence identity in the window is measured)• Successive homologous sequences are combined


• Matrix 𝑺𝒊𝒋 (1 ≤ 𝑖, 𝑗 ≤ 𝑛) is constructed with values corresponding to the scores of the 𝒏 identified homologous segments (0 if (𝑖, 𝑗) is not homologous pair)

• Optimal path through 𝑺 is identified

• DP matrix of the sequences is then computed, where the optimal path must go through the centers of the segments of optimal path in 𝑆, thus restricting the number of elements which need to be visited

56


• Group-to-group alignment is extension of the approach by considering groups as linear combinations of the volume/polar components of the groups

• ො𝑥𝑔𝑟𝑜𝑢𝑝𝑖 𝑛 = σ𝑗∈𝑔𝑟𝑜𝑢𝑝𝑖𝑤𝑗 ො𝑥𝑗(𝑛) , 𝑥 ∈ 𝑣, 𝑝 , 𝑖 ∈ {1,2}

• 𝑤𝑗 is weighting factor for sequence 𝑗 calculated as in CLUSTALW in the progressive stage, and as in (*) in the iterative stage

• For nucleotide sequences, the 2D vectors consisting of polar/volume components are replaced by 4D vector of A, C, G, T components

57(*) Gotoh, O. (1995). A weighting system and algorithm for aligning many phylogenetically related sequences. Bioinformatics, 11(5), 543-551.

MAFFT – iterative refinement

• Alignment divided into two groups and realigned• Tree-dependent restricted partitioning *

• A tree-dependent, restricted partitioning technique to efficiently reduce the execution time of iterative algorithms

• Repeated until no better alignment is obtained

58(*) Hirosawa, M., Totoki, Y., Hoshida, M., & Ishikawa, M. (1995). Comprehensive study on iterative algorithms of multiple sequence alignment. Bioinformatics, 11(1), 13-18.

MAFFT – improvements

• Version 5 • Consistency-base scoring - new objective function to reveal distant homologues

applied to the iterative refinement stage

• TCoffee-like approach of incorporation of all pairwise alignment information into the objective function

• Computed from all-to-all pairwise alignments before constructing MSA

• Summation of weighted sum-of-pairs score

• Dropped re-construction phase of the guide tree

• Version 6• New tree-building algorithm, PartTree, for handling larger numbers of sequences

• multiple ncRNA alignment framework incorporating structural information

59

MUSCLE

• MUltiple Sequence Comparison by Log-Expectation

• Stages1. Draft progressive

2. Improved progressive

3. Refinement

61

MUSCLE details

• k-mer distance• Fraction of common k-mers in a pair of sequences

• Possibly on compressed alphabet (similar residues, e.g. hydrophobic, get the same letter)

• Approximates well fraction of common residues in global alignment

• Kimura distance (correction)• Computed from fractional identity of sequences 𝐷 which is good approximation for closely related

sequences• exact if positions are allowed to mutate only ones → multiple mutations on single site (more distant sequences)

require correction

• 𝑑𝑘𝑖𝑚𝑢𝑟𝑎 = − log𝑒(1 − 𝐷 −𝐷2

5)

• Stage 3 refinement• Choose edge (go from leafs to root) from TREE2 and delete it• Build profile for MSA of the resulting tree, re-align and accept change if it led to an improvement• Iterate until convergence

62

Comparison of MSA algorithms - benchmark

• Benchmarking → guidelines

• Commonly used dataset for benchmarking MSA algorithms is BAliBASE• high quality manually refined reference alignments based on 3D structural

superpositions

63cases with small numbers of

equidistant

sequences,

and was further

subdivided by

percent identity

families with one

or more “orphan”

sequences

divergent

subfamilies, with

less than 25%

identity between

the groups

sequences with

large terminal

extensions

sequences with

large internal

insertions and

deletions

sequences with

repeats,

transmembrane

regions, and

inverted domains,

respectively

families with linear

motifs often found

in disordered

regions that are

difficult to align

sequences with

subfamily specific

features, motifs in

disordered regions

and

fragmentary/erron

eous sequences

http://www.lbgi.fr/balibase/

Comparison of main algorithms

64

source: Pais, F. S. M., de Cássia Ruy, P., Oliveira, G., & Coimbra, R. S. (2014). Assessing the efficiency of multiple sequence alignment programs. Algorithms for Molecular Biology, 9(1), 4.

SP: sum of pairs

TC: based on number of columns aligned 100% correctly

Guidelines - accuracy• No MSA program outperformed all others in all test cases

• For the R1-5 sets T-Coffee, Probcons, MAFFT and Probalign were superior with regard to alignment accuracy• When aligning available short versions of the sequences (BBS), Probcons and T-Coffee outperformed Probalign

and MAFFT• statistically significant superiority of Probcons and T-Coffee in comparison to Probalign and MAFFT in R1 & R2

• When aligning full-length proteins (BB) of R1, -3 and R5, which represent more difficult test cases, and also R4, where large terminal extensions are present, Probalign, MAFFT and, surprisingly, CLUSTAL OMEGA, generally outperformed both Probcons and T-Coffee• T-Cofee and Probcons worked great when aligning truncated sequences but did not do that well for datasets with

long N/C terminal ends due to presence of non-conserved residues at terminal ends • MAFFT, Probalign and even CLUSTAL OMEGA may be preferred over T-Coffee and Probcons when aligning sequences with

these long terminal extensions

• Contradicting performance of CLUSTAL OMEGA - performed well in three reference R3-5 sets with full-length sequences but not with the short versions

• CLUSTALW, DIALIGN-TX and POA had Z-scores below the average in almost all test cases from the first five reference sets

65


Guidelines – computational costs

• Speed• CLUSTALW and MUSCLE were the fastest of the evaluated programs

• T-Coffee and MAFFT deliver fast alignments in multi-core environment

• Probcons and Probalign exceeded the 2.5 hours cutoff in the last three subsets from Reference 9

• Memory• CLUSTALW consumed least memory

• T-Coffee running in single-core mode - results indicated that the program consumed generally more RAM than the others and was also the slowest

66


PhyloDNA Puzzles

67

Documents

Bioinformatics Algorithmssiret.ms.mff.cuni.cz/.../bioinformatics/lecture05_msa.pdfOutline •Motivation •Scoring functions •Algorithms •exhaustive •multidimensional dynamic