Upload
others
View
8
Download
0
Embed Size (px)
Citation preview
Bioinformatics Algorithms
David Hoksza
http://siret.ms.mff.cuni.cz/hoksza
Multiple Sequence Alignment
Outline
• Motivation
• Scoring functions
• Algorithms
• exhaustive • multidimensional dynamic programming
• heuristics• progressive alignment
• iterative alignment/refinement
• block(local)-based alignment
Multiple sequence alignment (MSA)
• Goal of MSA is to find “optimal” mapping of a set of sequences
• Homologous residues (originating in the same position in a common ancestor) among a set of sequences are aligned together in columns
• Usually employs multiple pairwise alignment (PA) computations to reveal the evolutionarily equivalent positions across all sequences
Motivation
• Distant homologues• faint similarity can become apparent when present in many sequences
• motifs might not be apparent from pairwise alignment only
• Detection of key functional residues• amino acids critical for function tend to be conserved during the evolution and
therefore can be revealed by inspecting sequences within given family
• Prediction of secondary/tertiary structure
• Inferring evolutionary history
4
Representation of MSA
• Column-based representation
• Profile representation (position specific scoring matrix)
• Sequence logo
Manual MSA
• High quality MSA can be carried out by hand using expert knowledge• specific columns
• highly conserved residues• buried hydrophobic residues
• secondary structure (especially in RNA alignment)
• expected patterns of insertions and deletions
• Tedious, but• high-quality source of family
information• a benchmark for evaluation of
automatic MSA algorithms
• BAliBASE• https://lbgi.fr/balibase/
• PROSITE• http://prosite.expasy.org/
• Pfam• http://pfam.sanger.ac.uk/
• TIGRFAM• http://www.jcvi.org/cgi-
bin/tigrfams/index.cgi
• … (some databases are semi-automatic and many of the databases construct the MSA from the structure information)
http://bips.u-strasbg.fr/en/Products/Databases/BAliBASE/https://lbgi.fr/balibase/http://prosite.expasy.org/http://pfam.sanger.ac.uk/http://pfam.sanger.ac.uk/http://www.jcvi.org/cgi-bin/tigrfams/index.cgihttp://www.jcvi.org/cgi-bin/tigrfams/index.cgi
Scoring
• How to score an MSA?
𝑺 𝑨 = 𝑮 +𝑪𝑺(𝑨𝒊)
• 𝐴𝑖 … 𝑖-th column• 𝐶𝑆(𝐴𝑖) … score of the 𝑖-th column• 𝐺 … gap function (assumes linear or constant gap penalty)
• the score assumes independent columns
• Two score types are usually considered• minimum entropy (ME)• sum of pairs (SP)
Minimum entropy (1)
• ME aims to minimize entropy of each column
• columns with low entropy (can be expressed with only few bits) are good for the alignment
• the more bits we need to express a column, the more divers the column is
Minimum entropy (2)
• Probability of a column • assumption of independency between columns and residues within columns
𝑷 𝑨𝒊 = ෑ
𝒂
𝒑𝒊𝒂𝒄𝒊𝒂
• 𝑐𝑖𝑎…observed counts for residue 𝑎 in 𝑖-th column 𝑐𝑖𝑎 = σ𝑗 ൝0 𝐴𝑖 [𝑗] ≠ 𝑎
1 𝐴𝑖 𝑗 = 𝑎
• 𝐴𝑖 [𝑗]… 𝑗-th symbol in 𝑖-th column
• 𝑝𝑖𝑎… probability of residue 𝑎 in column 𝑖
𝑪𝑴𝑬 𝑨𝒊 = −
𝒂
𝒄𝒊𝒂 𝐥𝐨𝐠𝒑𝒊𝒂 𝑴𝑬 =
𝒊
𝑪𝑴𝑬(𝑨𝒊)
• completely conserved column would score 0
Sum of pairs
• Sum of scores of all possible pairs in a multiple alignment 𝑨 for a particular scoring matrix
• Score for each column is computed as the sum of all pairs of position in that column
• Column scores are then summed to get the SP-score
𝑆𝑃 𝐴 =
𝑖=1
|𝐴|
𝐶𝑆𝑃 𝐴𝑖 =
𝑖=1
|𝐴|
𝑘
SP - Example G K NT R N
S H E
-1 +1 +6 6• BLOSUM 62 scoring matrix
SP score drawback• Alignment of 𝑵 sequences, all containing leucine at given position from
functional reasons
• BLOSUM62 matrix 𝝈 𝑳, 𝑳 = 𝟒 → 𝑺𝑷 𝑨𝒊 = 𝟒 × 𝑵(𝑵 − 𝟏)/𝟐
• Let us replace one of the leucines with glycine (incorrect alignment)
𝝈 𝑳, 𝑮 = −𝟒 → the score decreases by 𝟖 × (𝑵 − 𝟏)
• 𝑺𝑷 𝑨𝒊 is worse by a fraction of 8×(𝑁−1)
4×𝑁(𝑁−1)/2=
𝟒
𝑵
• Relative difference in score between the correct alignment and incorrectalignment decreases with the number of sequences in the alignment
• BUT increasing the number of sequences (evidence) should give us more increased relative difference
Multidimensional dynamic programming (1)
• Generalization of pairwise dynamic programming
• 3 sequences: ATGC, AATC,TTGC
• Resulting path • (0,0,0) → (1,1,0) → (1,2,1) → (2,3,2) → (3,3,3) → (4,4,4)
A - T G C
0 1 1 2 3 4
A A T - C
0 1 2 3 3 4
- T T G C
0 0 1 2 3 4
x coordinate
y coordinate
z coordinate
Multidimensional dynamic programming (2)
• Let us assume linear gap penalty model (not affine)
• 𝛾 𝑔 = 𝑔𝑑 for a gap of length 𝑔 and gap cost 𝑑
• initialization and backtracking are analogous with the 2D case
Multidimensional dynamic programming (3)
• 7 edges• 3 edges
Computational complexity of MDP
• Computation of each cell of the DP matrix takes 𝟐𝑵 − 𝟏 (all possible combinations of gaps column)
• Let us assume all the sequences have approximately the same length 𝑳
• Memory complexity
𝑶 𝑳𝑵
• Time complexity
𝑶 𝟐𝑵𝑳𝑵
MDP - exercise
• Let’s have sequence of length 50
• Comparison of a pair of sequences using DP takes 0,1s
• What is the time needed to compare 4 sequences?
• Let’s say we have 1000 years and average sequence length is 50.
• How many sequence can afford to compare?
Heuristic Algorithms
• Progressive alignment methods• iterative building of the alignment
• Feng & Doolittle
• ClustalW, Clustal Omega
• Consistency-based methods• T-Coffee
• Iterative refinement• alignment built and then refined be
realigning the constituent sequences• Barton & Sternberg
• Block-based alignment• local alignment built by identifying
blocks of ungapped MSA identified and assembled• DIALIGN
• Mix of approaches• MAFFT, MUSCLE
Progressive alignment
• Framework• First, two sequences are aligned using standard pairwise alignment
• The remaining sequences are taken one by one and aligned to the previous ones
• Repeated until all sequences are aligned
• Parameters• The order in which the sequences are be aligned
• Whether only one alignment is kept and sequences are added to it or whether also an alignment can be aligned to another alignment (as if a tree was being built)
• The process used to align and score sequences or alignments against the existing ones
Star alignment
• N sequences 𝒔𝟏, … , 𝒔𝑵 to be aligned
1. Pick 𝒔𝒊 as a starting sequence – center
2. Compute all optimal global alignments between 𝒔𝒊 and 𝒔𝒋, 𝑗 ≠ 𝑖
3. Successively merge sequences into the arising MSA• once a gap always a gap rule
• if a gap is introduced into the MSA it stays there forever
SA – example (1)
S1: ATTGCCATT
S2: ATGGCCATT
S3: ATCCAATTTT
S4: ATCTTCTT
S5: ATTGCCGATT
credit: Xingquan Zhu, Florida Atlantic University
ATTGCCATT
ATTGCCATT
ATGGCCATT
ATTGCCATT--
ATC-CAATTTT
ATTGCCATT
ATCTTC-TT
ATTGCC-ATT
ATTGCCGATT
ATTGCC-ATT--
ATGGCC-ATT--
ATTGCCGATT--
ATCTTC--TT--
ATC-CA-ATTTT
SA – example (2)
pairwise alignment multiple alignment
1.ATTGCCATT
ATGGCCATT
ATTGCCATT
ATGGCCATT
2.ATTGCCATT--
ATC-CAATTTT
ATTGCCATT--
ATGGCCATT--
ATC-CAATTTT
3.ATTGCCATT
ATCTTC-TT
ATTGCCATT--
ATGGCCATT--
ATC-CAATTTT
ATCTTC-TT--
4.ATTGCC-ATT
ATTGCCGATT
ATTGCC-ATT--
ATGGCC-ATT--
ATC-CA-ATTTT
ATCTTC--TT--
ATTGCCGATT--
SA - choosing the center
• Compute all pairwise alignment and pick sequence 𝒔𝒊 with maximum σ𝒋≠𝒊 𝒔(𝒔𝒊, 𝒔𝒋)• Choosing the sequence which is most similar to all the rest
• Compute all pairwise alignments and compute MSA for every 𝒔𝒊 and pick the best
SA – time complexity
• Average sequence length 𝐿
• One global alignment computation in 𝐎(𝑳𝟐)
• 𝑘 sequences → 𝐎(𝒌𝟐𝑳𝟐) pairwise computations
• 𝑙 … upper bound on the MSA length → 𝐎(𝒍𝒌) for MSA construction
𝑂 𝑘2𝐿2 + 𝑙𝑘 = 𝑶(𝒌𝟐𝑳𝟐)
SA - exercise
• Compute SP for the constructed MSA
• Compute SA for the previous example but add sequences to the MSA in different order. Does the order of addition impacts the score?
• Compute MSA starting with S5. Does the score change?
ATTGCC-ATT
ATTGCCGATT
ATGGCC-ATT
ATTGCCGATT
AT--CCAATTTT
ATTGCCGATT--
AT--CTTCTT
ATTGCCGATT
Feng & Doolittle (1)
1. Calculate a distance matrix from all-to-all pairwise alignments (𝑁(𝑁 − 1)/2)
2. Convert raw alignment scores into (evolutionary) distances
• 𝐷 = − log 𝑆𝑒𝑓𝑓 × 100 = − log𝑆𝑜𝑏𝑠−𝑆𝑟𝑎𝑛𝑑
𝑆𝑚𝑎𝑥−𝑆𝑟𝑎𝑛𝑑× 100
3. Construct a guide tree from the distance matrix using Fitch & Margoliash algorithm
4. Align child nodes of each parent (can be sequence-sequence, sequence-MSA, MSA-MSA) in the order they were added to the tree
• 𝑆𝑚𝑎𝑥 𝑎, 𝑏 =𝑆 𝑎,𝑎 +𝑆 𝑏,𝑏
2
• 𝑆𝑟𝑎𝑛𝑑 is an expected score obtained by randomization
• 𝑆𝑒𝑓𝑓 can be viewed as normalized
percentage similarity which
decreases roughly exponentially to
0 with increasing evolutionary
distance.
• –log makes the measure linear with
evolutionary distance
source: Feng, Da-Fei, and Russell F. Doolittle. "Progressive sequence alignment as a prerequisitetto correct phylogenetic trees." Journal of molecular evolution 25.4 (1987): 351-360.
Feng & Doolittle (2)
• Sequence-sequence is aligned using classical dynamic programming
• Sequence-MSA – sequence is aligned with each sequence in the group and the highest scoring alignment defines how the sequence is added to the group
• MSA-MSA – as in previous case but all pairs of sequences are tested
• When a sequence is added to a group, neutral symbol X is introduced instead of the gap position• allows to align gap positions
• neutral – anything aligned with X scores 0
• side effect – the gaps in two MSAs tend to come together in the resulting MSA
Profile/MSA Alignment
• When adding a sequence to a group it is desirable to take into account the MSA built so far• mismatches at highly conserved positions should be penalized more
• 2 MSA (profiles) of 𝑁 sequences, one from 1. . 𝑛, second 𝑛 + 1. . 𝑁
𝒊
𝑺 𝑨 𝒊 =
𝒊
𝒌
ClustalW
• Similar to Feng & Doolittle but uses profile-based building
1. Calculate matrix from all-to-all pairwise alignments (𝑁(𝑁 − 1)/2)
2. Convert raw alignment similarity scores into evolutionary distances
3. Construct a guided tree from the distance matrix using neighbor joining algorithm
4. Progressively align the nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment
ClustalW - Alignment
standard global
alignment
column-based
global alignment
ClustalW – heuristics (1)
• Weighting of subsequences to compensate for biased representation in large subfamilies – compensates defects of sum-of-pairs scoring• sequence contributions to the MSA are weighted by their relationships in the
predicted evolutionary tree
• Closely related sequences BLOSUM 80 – distant sequences BLOSUM 50
• Position-specific gap open profile penalties multiplied by a modifier being function of the residues observed at the position• comes from structure-based alignments
ClustalW – heuristics (2)
• Gap penalties are higher if there are no gaps at given position but some exist at nearby positions• force the gaps to be in the same places
• Guide tree can be adjusted on the fly to postpone aligning low-similar sequences up to the point where more information is present
• …
Clustal Omega
• ClustalW does not scale well for a big number (thousands) of sequences
• Bottleneck is the guide tree construction → 𝑂(𝑀 × 𝑁2)
• Clustal Omega algorithm• mBed algorithm-based for guide tree construction
• emBedding of each sequence in a space of 𝑛 dimensions where n is proportional to log N• Each sequence replaced by an 𝑛 element vector, where each element is the distance to one of 𝑛 reference sequences
• Clustering of the vectors by UPGMA or K-means• Alignments of profiles using hidden Markov models (HHalign package)
• Additional features• Adding sequences to existing alignments• External profiles - HMM profile from sequences homologous to the input set which can be used
in MSA construction
33
Progressive alignment drawbacks
• Drawback of the progressive methods is the greedy nature of the algorithm• once an error, always an error
• Solutions• Iterative approach
• Tries to correct mistakes in the initial alignment … which might happen easily if pairwise sequence similarity is too low
• Consistency-based approach• Tries to avoid mistakes in advance
34
T-Coffee
• A progressive alignment with the ability to consider information from all of the sequences during each alignment step, not just those being aligned at that stage (consistency)
• Tackles ClustalW drawback
T-Coffee algorithm (1)
1. Primary library generation• Generate primary libraries of sequence alignments (using several methods)• For each pair compute weights based on percent identity (shorter sequence)• Merge the primary libraries – pairs of aligned residues get a weight equal to the
sum of the weights
2. Extended library generation• For each pair of sequences try to align them using an intermediate sequence• Score each pair of residues by the lower of the two weights and sum the weights
for that residue pair over all triplets
3. Progressive alignment• Guided tree is used to build the multiple alignment• A pair of sequences is aligned using dynamic programming with weights based on
the extended library• When aligning two sets of sequences, averaged library weights are used
T-Coffee algorithm (2)
77+88
= 165
Iterative refinement
• Once an error, always an error• when a sequence is added to a MSA it cannot be changed later on (holds also for
consistency-based approaches)
• dependence on the initial alignment
• Idea behind iterative refinement methods• optimal solution can be obtained by iterative improvement of suboptimal solutions
• First, a suboptimal solution is identified by a fast heuristic method and then possibly improved by iterative removal and re-addition of the sequences
Barton-Sternberg
1. Find the highest scoring pair of sequences and align them using the pairwise dynamic programming
2. Identify most similar remaining sequence with respect to the existing profile and align it using sequence-profile alignment
3. Repeat step 2 until all sequences are aligned
4. Remove sequence 𝒔𝟏 and realign it to the profile by profile-sequence alignment. Repeat for 𝒔𝟐, … , 𝒔𝑵
5. Repeat step 4 for a fixed number of times or until convergence
Block-based alignment
• Both progressive and iterative methods assume that all parts of all sequences are consistent with a global alignment• not every position in the alignment has to correspond to a homologous site in all
sequences
• Block-based solution approach to the problem of global alignment by • splitting sequences into blocks
• aligning the blocks
• merging block alignments
DIALIGN
• Alignment constructed from gap-free local alignments between pairs of sequences• based on diagonals in the dynamic programming matrix
• Procedure• align all possible pairs of sequences• determine all diagonals and assign weights
• for a diagonal 𝐷 of length 𝑙, score 𝒔 is obtained from substitution matrix• determine length-independent weight as 𝒘 𝑫 = − 𝐥𝐨𝐠𝑷(𝒍, 𝒔), where 𝑷(𝒍, 𝒔) is the
probability that diagonal of sequence of length 𝒍 will have score at least 𝒔
• build MSA by adding consistent diagonals in order of decreasing weight (and overlap with other diagonals)
• explore unaligned regions and include them if possible
DIALIGN - Example
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
DIALIGN – Example
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
DIALIGN – Example
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
DIALIGN – Example
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
DIALIGN – Example
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
DIALIGN – Example
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
DIALIGN – Example
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacccctgaattgaataa
DIALIGN – Example
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacc----------cctgaattgaataa
DIALIGN – Example
atc------taatagttaaactcccccgtgc-ttag
cagtgcgtgtattactaac----------gg-ttcaatcgcg
caaa--gagtatcacc----------cctgaattgaataa
DIALIGN – Example
atc------TAATAGTTAaactccccCGTGC-TTag
cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg
caaa--GAGTATCAcc----------CCTGaaTTGAATaa
DIALIGN – Example
MAFFT
• Multiple Alignment using Fast Fourier Transform
• 2-cycle progressive method + iterative refinement• Fast low-quality all pairwise distances
• Tentative MSA construction
• Refinement of distances based on MSA
• Second progressive alignment stage
• Iterative refinement
• Reducing CPU time• 6-mer method for pairwise alignments
• FFT-based group-to-group alignment algorithm
53
FFT-NS-1
FFT-NS-2
FFT-NS-i
MAFFT – guide tree construction
• Sequences and groups are progressively aligned using FFT-based alignment (next slides) in the order given by a guide tree
• To quickly obtain the all-to-all distance matrix• 20 AAs grouped into 6 physico-chemical groups
• Number of shared 6-tuples 𝑇𝑖𝑗 is computed and turned into 𝐷𝑖𝑗• 𝐷𝑖𝑗 = 1 − [𝑇𝑖𝑗/min(𝑇𝑖𝑖 , 𝑇𝑗𝑗)]
• UPGMA method is used to obtain the guide tree from 𝐷𝑖𝑗
54
MAFFT – FFT-based group-to-group alignment (1)
• AA sequence converted to a sequence of vectors (signals) of normalized volume ( ො𝑣 𝑎 = 𝑣 𝑎 − 𝑣 /𝜎𝑣) and polarity ( Ƹ𝑝 𝑎 = 𝑝 𝑎 − 𝑝 /𝜎𝑝)
• Correlation between 2 AA sequences is assessed using• 𝑐 𝑘 = 𝑐𝑣 𝑘 + 𝑐𝑝 𝑘
• 𝑐𝑥 𝑘 = σ1≤𝑛≤𝑁,1≤𝑛+𝑘≤𝑀ෞ𝑥1(𝑛)ෞ𝑥2(𝑛 + 𝑘) , 𝑥 ∈ {𝑣, 𝑝}• ෝ𝑥𝑖(𝑗) … 𝑥 component (𝑣 or 𝑝) of 𝑗-th site in sequence 𝑖
• If 𝑋𝑖(𝑚) are FFT of ෝ𝑥𝑖(𝑛), then, by cross-correlation theorem, 𝑐𝑥 𝑘 ֞𝑋1∗ 𝑚 ∗ 𝑋2 𝑚
• FFT reduces the 𝑂(𝑛2) time to 𝑂(𝑛 log 𝑛)
55
• Correlation has high peaks when the two compared sequences have high similarity regions offset by the lags• sliding window of given size is used to reveal homologous regions
(sequence identity in the window is measured)• Successive homologous sequences are combined
MAFFT – FFT-based group-to-group alignment (2)
• Matrix 𝑺𝒊𝒋 (1 ≤ 𝑖, 𝑗 ≤ 𝑛) is constructed with values corresponding to the scores of the 𝒏 identified homologous segments (0 if (𝑖, 𝑗) is not homologous pair)
• Optimal path through 𝑺 is identified
• DP matrix of the sequences is then computed, where the optimal path must go through the centers of the segments of optimal path in 𝑆, thus restricting the number of elements which need to be visited
56
MAFFT – FFT-based group-to-group alignment (3)
• Group-to-group alignment is extension of the approach by considering groups as linear combinations of the volume/polar components of the groups
• ො𝑥𝑔𝑟𝑜𝑢𝑝𝑖 𝑛 = σ𝑗∈𝑔𝑟𝑜𝑢𝑝𝑖𝑤𝑗 ො𝑥𝑗(𝑛) , 𝑥 ∈ 𝑣, 𝑝 , 𝑖 ∈ {1,2}
• 𝑤𝑗 is weighting factor for sequence 𝑗 calculated as in CLUSTALW in the progressive stage, and as in (*) in the iterative stage
• For nucleotide sequences, the 2D vectors consisting of polar/volume components are replaced by 4D vector of A, C, G, T components
57(*) Gotoh, O. (1995). A weighting system and algorithm for aligning many phylogenetically related sequences. Bioinformatics, 11(5), 543-551.
MAFFT – iterative refinement
• Alignment divided into two groups and realigned• Tree-dependent restricted partitioning *
• A tree-dependent, restricted partitioning technique to efficiently reduce the execution time of iterative algorithms
• Repeated until no better alignment is obtained
58(*) Hirosawa, M., Totoki, Y., Hoshida, M., & Ishikawa, M. (1995). Comprehensive study on iterative algorithms of multiple sequence alignment. Bioinformatics, 11(1), 13-18.
MAFFT – improvements
• Version 5 • Consistency-base scoring - new objective function to reveal distant homologues
applied to the iterative refinement stage
• TCoffee-like approach of incorporation of all pairwise alignment information into the objective function
• Computed from all-to-all pairwise alignments before constructing MSA
• Summation of weighted sum-of-pairs score
• Dropped re-construction phase of the guide tree
• Version 6• New tree-building algorithm, PartTree, for handling larger numbers of sequences
• multiple ncRNA alignment framework incorporating structural information
59
MUSCLE
• MUltiple Sequence Comparison by Log-Expectation
• Stages1. Draft progressive
2. Improved progressive
3. Refinement
61
MUSCLE details
• k-mer distance• Fraction of common k-mers in a pair of sequences
• Possibly on compressed alphabet (similar residues, e.g. hydrophobic, get the same letter)
• Approximates well fraction of common residues in global alignment
• Kimura distance (correction)• Computed from fractional identity of sequences 𝐷 which is good approximation for closely related
sequences• exact if positions are allowed to mutate only ones → multiple mutations on single site (more distant sequences)
require correction
• 𝑑𝑘𝑖𝑚𝑢𝑟𝑎 = − log𝑒(1 − 𝐷 −𝐷2
5)
• Stage 3 refinement• Choose edge (go from leafs to root) from TREE2 and delete it• Build profile for MSA of the resulting tree, re-align and accept change if it led to an improvement• Iterate until convergence
62
Comparison of MSA algorithms - benchmark
• Benchmarking → guidelines
• Commonly used dataset for benchmarking MSA algorithms is BAliBASE• high quality manually refined reference alignments based on 3D structural
superpositions
63cases with small numbers of
equidistant
sequences,
and was further
subdivided by
percent identity
families with one
or more “orphan”
sequences
divergent
subfamilies, with
less than 25%
identity between
the groups
sequences with
large terminal
extensions
sequences with
large internal
insertions and
deletions
sequences with
repeats,
transmembrane
regions, and
inverted domains,
respectively
families with linear
motifs often found
in disordered
regions that are
difficult to align
sequences with
subfamily specific
features, motifs in
disordered regions
and
fragmentary/erron
eous sequences
http://www.lbgi.fr/balibase/
Comparison of main algorithms
64
source: Pais, F. S. M., de Cássia Ruy, P., Oliveira, G., & Coimbra, R. S. (2014). Assessing the efficiency of multiple sequence alignment programs. Algorithms for Molecular Biology, 9(1), 4.
SP: sum of pairs
TC: based on number of columns aligned 100% correctly
Guidelines - accuracy• No MSA program outperformed all others in all test cases
• For the R1-5 sets T-Coffee, Probcons, MAFFT and Probalign were superior with regard to alignment accuracy• When aligning available short versions of the sequences (BBS), Probcons and T-Coffee outperformed Probalign
and MAFFT• statistically significant superiority of Probcons and T-Coffee in comparison to Probalign and MAFFT in R1 & R2
• When aligning full-length proteins (BB) of R1, -3 and R5, which represent more difficult test cases, and also R4, where large terminal extensions are present, Probalign, MAFFT and, surprisingly, CLUSTAL OMEGA, generally outperformed both Probcons and T-Coffee• T-Cofee and Probcons worked great when aligning truncated sequences but did not do that well for datasets with
long N/C terminal ends due to presence of non-conserved residues at terminal ends • MAFFT, Probalign and even CLUSTAL OMEGA may be preferred over T-Coffee and Probcons when aligning sequences with
these long terminal extensions
• Contradicting performance of CLUSTAL OMEGA - performed well in three reference R3-5 sets with full-length sequences but not with the short versions
• CLUSTALW, DIALIGN-TX and POA had Z-scores below the average in almost all test cases from the first five reference sets
65
source: Pais, F. S. M., de Cássia Ruy, P., Oliveira, G., & Coimbra, R. S. (2014). Assessing the efficiency of multiple sequence alignment programs. Algorithms for Molecular Biology, 9(1), 4.
Guidelines – computational costs
• Speed• CLUSTALW and MUSCLE were the fastest of the evaluated programs
• T-Coffee and MAFFT deliver fast alignments in multi-core environment
• Probcons and Probalign exceeded the 2.5 hours cutoff in the last three subsets from Reference 9
• Memory• CLUSTALW consumed least memory
• T-Coffee running in single-core mode - results indicated that the program consumed generally more RAM than the others and was also the slowest
66
source: Pais, F. S. M., de Cássia Ruy, P., Oliveira, G., & Coimbra, R. S. (2014). Assessing the efficiency of multiple sequence alignment programs. Algorithms for Molecular Biology, 9(1), 4.
PhyloDNA Puzzles
67