
Sequence alignment: why?  · Web view2019. 1. 17. · more recently. Analysis of evolutionary relationships between protein or gene sequences depends critically on sequence alignments


Sequence alignment: Pairwise and multiple

Contents

Sequence alignments: why?

Some definitions for sequence alignments

Algorithm for sequence alignments: dynamic programming

Database searching

Multiple sequence alignments: why?

An example of a multiple alignment

How to generate a multiple alignment?

Algorithmic complexity

A standard multiple alignment program: ClustalW

Databases of multiple alignments; domains

1. Sequence alignment: why?

Early in the days of protein and gene sequence analysis, it was discovered that the sequences from related proteins or genes were similar, in the sense that one could align the sequences so that many corresponding residues match. This discovery was very important, since strong similarity between two genes is a strong argument for their homology. Bioinformatics is based on it.

A note on terminology: Homology means that two (or more) sequences have a common ancestor. This is a statement about evolutionary history. Similarity simply means that two sequences are similar, by some criterion. It does not refer to any historical process, just to a comparison of the sequences by some method. It is a logically weaker statement. However, in bioinformatics these two terms are often confused and used interchangeably. The reason is probably that significant similarity is such a strong argument for homology.

The basis for comparison of proteins and genes using the similarity of their sequences is that the proteins or genes are related by evolution; they have a common ancestor. Random mutations in the sequences accumulate over time, so that proteins or genes that have a common ancestor far back in time are not as similar as proteins or genes that diverged from each other more recently. Analysis of evolutionary relationships between protein or gene sequences depends critically on sequence alignments.

There are many features of sequence alignments that give interesting information. For example, a closer analysis of the alignment can reveal which parts of the sequences are likely to be important for the function, if the proteins are involved in similar processes. In parts of the sequence of a protein which are not very critical for its function, the random mutations can accumulate more easily. In parts of the sequence that are critical for the function of the protein, hardly any mutations will be accepted; nearly all changes in such regions will destroy the function.

Sequence alignment has become an essential part of biological science. There are now many different techniques and implementations of methods to perform alignment of sequences.

The dotplot is a simple, graphical way of displaying sequence similarity. One can use it to easily spot segments of good sequence similarity. The two sequences are placed along the two sides of a two-dimensional matrix, and each cell in the matrix is then filled with a value for how well a short window of the sequences matches at that point.

This dotplot shows the hemoglobin A chain from human compared to erythrocruorin from Chironomus (insect).
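The windowed comparison behind a dotplot can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names, the window size and the use of plain identity counts (rather than a substitution matrix) are choices made for this sketch, not something prescribed by the text.

```python
def dotplot(seq_a, seq_b, window=3):
    """Return a matrix of match counts.

    Cell [i][j] holds the number of identical residues when the window of
    seq_a starting at i is laid over the window of seq_b starting at j.
    """
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    plot = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append(matches)
        plot.append(row)
    return plot


def render(plot, threshold):
    """Draw the matrix as text: '*' where the count reaches the threshold."""
    return "\n".join(
        "".join("*" if value >= threshold else "." for value in row)
        for row in plot
    )
```

Calling `render(dotplot(a, b, window=4), threshold=3)` on two related sequences would be expected to show diagonal runs of `*` wherever segments match well; the threshold controls how much background noise is displayed.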


2. Some definitions for sequence alignments

Gaps and insertions

In an alignment, one may achieve much better correspondence between two sequences if one allows a gap to be introduced in one sequence. Equivalently, one could allow an insertion in the other sequence. Biologically, this corresponds to a mutation event that eliminates a part of a gene, or introduces new DNA into a gene.

Optimal alignment

The alignment that is the best, given a defined set of rules and parameter values for comparing different sequence alignments. There is no such thing as the single best alignment, since optimality always depends on the assumptions one bases the alignment on. For example, what penalty should gaps carry? All sequence alignment procedures make some such assumptions.

Global alignment

An alignment that assumes that the two proteins are basically similar over the entire length of one another. The alignment attempts to match them to each other from end to end, even though parts of the alignment are not very convincing. A tiny example:

LGPSTKDFGKISESREFDN
|      ||||    |
LNQLERSFGKINMRLEDA

Local alignment

An alignment that searches for segments of the two sequences that match well. There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity, according to some criterion. Using the same sequences as above, one could get:

----------FGKI----------
          ||||
----------FGKI----------

It may seem that one should always use local alignments. However, it may be difficult to spot an overall similarity, as opposed to just a domain-to-domain similarity, if one uses only local alignment. So global alignment is useful in some cases. The popular programs BLAST and FASTA for searching sequence databases produce local alignments.

Substitution matrix

A substitution matrix describes the likelihood that two residue types would mutate to each other in evolutionary time. This is used to estimate how well two residues of given types would match if they were aligned in a sequence alignment. The matrix is a symmetrical 20*20 matrix, where each element contains the score for substituting a residue of type i with a residue of type j in a protein, where i and j each denote one of the 20 amino-acid residue types. Identical residues should obviously have high scores, but if we have different residues in a position, how should that be scored? There are many possibilities:

The same residues in a position give the score value 1, and different residues give 0.

The same residues give a score 1, similar residues (for example: Tyr/Phe, or Ile/Leu) give 0.5, and all others 0.

One may calculate, using well established sequence alignments, the frequencies (probabilities) that a particular residue in a position is exchanged for another. This was done originally by Margaret Dayhoff, and her matrices are called the PAM (Point Accepted Mutation) matrices, which describe the exchange frequencies after having accepted a given number of point mutations over the sequence. Typical values are PAM 120 (120 mutations per 100 residues in a protein) and PAM 250. There are many other substitution matrices: BLOSUM, Gonnet, etc.
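The first two scoring possibilities listed above are simple enough to write down directly. The sketch below is illustrative only: the function names are invented here, the set of "similar" pairs contains just the two example pairs mentioned in the text, and gap characters are not handled. A real PAM or BLOSUM matrix would replace the scoring function with a table lookup.

```python
# The two example pairs from the text; a realistic scheme would list more.
SIMILAR_PAIRS = {frozenset("YF"), frozenset("IL")}


def identity_score(a, b):
    """Scheme 1: identical residues score 1, everything else 0."""
    return 1.0 if a == b else 0.0


def similarity_score(a, b):
    """Scheme 2: identical 1, similar 0.5, all others 0."""
    if a == b:
        return 1.0
    if frozenset((a, b)) in SIMILAR_PAIRS:
        return 0.5
    return 0.0


def score_columns(seq_a, seq_b, score):
    """Sum per-column scores of two already aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(score(a, b) for a, b in zip(seq_a, seq_b))
```

For the aligned fragments `FGKI` and `YGKL`, the identity scheme sees two matching columns, while the similarity scheme also gives partial credit to the Phe/Tyr and Ile/Leu columns.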

Gap penalty

The gap penalty is used to help decide whether or not to accept a gap or insertion in an alignment when it is possible to achieve a good alignment residue-to-residue at some other neighbouring point in the sequence. One cannot let gaps/insertions occur without penalty, because an unreasonably 'gappy' alignment would result. Biologically, it should in general be easier for a protein to accept a different residue in a position, rather than having parts of the sequence chopped away or inserted. Gaps/insertions should therefore be rarer than point mutations (substitutions). Some different possibilities:

A single gap-open penalty. This will tend to stop gaps from occurring, but once they have been introduced, they can grow unhindered.

A gap penalty proportional to the gap length. This will work against larger gaps.

A gap penalty that combines a gap-open value with a gap-length value.
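The three possibilities can be compared side by side as small functions of the gap length k. The constants below are arbitrary placeholder values chosen for this sketch, and the functions return the size of the penalty; whether that number is subtracted from the score, or stored as a negative number and added, is a matter of convention.

```python
def gap_open_only(k, c_open=10.0):
    """Fixed cost for opening a gap; the gap may then grow for free."""
    return c_open if k > 0 else 0.0


def gap_length_only(k, c_length=1.0):
    """Cost proportional to the gap length."""
    return c_length * k


def gap_combined(k, c_open=10.0, c_length=1.0):
    """Gap-open cost plus a cost for every position in the gap."""
    return c_open + c_length * k if k > 0 else 0.0
```

With these placeholder values, a gap of length 1 and a gap of length 50 cost the same under the first scheme, whereas the combined scheme charges 11 and 60 respectively.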

3. Algorithm for sequence alignment: dynamic programming

Making an alignment by hand is possible, but tedious. In some cases, when one has a lot of information about the proteins, such as active site residues, secondary structure, 3D structure, mutations, etc, it may still be necessary to make a manual alignment (or at least edit an alignment) to fit all the data. The available automatic methods may not be able to produce a good enough alignment in such cases.

Of course, we would like to have a completely automatic method to perform sequence alignment. The method of choice is based on so-called dynamic programming, which is a general algorithm for solving certain optimization problems. The word "programming" does not mean that it has to be a computer program; this is just mathematical jargon for using a fixed set of rules to arrive at a solution.

For any automatic method to work, we need to be explicit about the assumptions that should go into it. We therefore need to have an explicit scheme for the gap penalties and for the substitution matrix. The chosen gap penalties and substitution matrices are often collectively called the scoring scheme.

There are many different possible scoring schemes. One may also complicate things further by allowing position-specific scores: If one knows from other sources (3D structure) that a gap should absolutely not be allowed in a certain part of a sequence, then the gap-open penalty could be set to a very high value in that part.

Given a scoring scheme, how does an alignment algorithm work? Let us use the classical Needleman-Wunsch-Sellers algorithm to demonstrate how a dynamic-programming algorithm can work. Please note that there are other variants of dynamic programming in sequence analysis.

The Needleman-Wunsch-Sellers algorithm sets up a matrix where each sequence is placed along the sides of the matrix. Each element in the matrix represents the two residues of the sequences being aligned at that position. To calculate the score in every position (i, j) one looks at the alignment that has already been made up to that point, and finds the best way to continue. Having gone through the entire matrix in this way, one can go back and trace which way through the matrix gives the best alignment.

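Let us use the following gap-penalty function, where $k$ is the length of the gap, $c_{\mathrm{open}}$ the gap-open penalty constant, and $c_{\mathrm{length}}$ the gap-length penalty constant:

$$W(k) = c_{\mathrm{open}} + c_{\mathrm{length}} \cdot k$$

The formula describing the Needleman-Wunsch-Sellers method is recursive. For the position $(i, j)$ it is as follows, where $D_{i,j}$ is the value of element $(i, j)$ in the matrix, $A_i$ and $B_j$ are the residues at positions $i$ and $j$ of the two sequences, and subst is the substitution matrix:

$$D_{i,j} = \max \begin{cases} D_{i-1,\,j-1} + \mathrm{subst}(A_i, B_j) \\ D_{i-1,\,j-k} + W(k) & (k = 1, \ldots, j-1) \\ D_{i-k,\,j-1} + W(k) & (k = 1, \ldots, i-1) \end{cases}$$

Note that $W(k)$ is added inside a maximisation, so for it to act as a penalty the constants $c_{\mathrm{open}}$ and $c_{\mathrm{length}}$ must be negative (or zero).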

After one has applied this to the whole matrix, one finds the optimal alignment by tracing backwards from the diagonal element to the previous highest value, and so on.
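To make the fill-then-trace-back idea concrete, here is a minimal Python sketch of global alignment by dynamic programming. It is deliberately simpler than the recursion given above: it assumes a linear gap penalty (one fixed, negative cost per gap position, with no separate gap-open term) and an identity-based substitution score. The function name and the default parameter values are choices made for this sketch.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b) for one optimal alignment.
    """
    n, m = len(a), len(b)
    # D[i][j] = best score for aligning a[:i] with b[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        D[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          D[i - 1][j] + gap,     # gap in b
                          D[i][j - 1] + gap)     # gap in a

    # Trace back from the bottom-right corner to recover one alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if D[i][j] == D[i - 1][j - 1] + s:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return D[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

The two nested loops visit every cell once, which is why the work grows with the product of the two sequence lengths. Supporting a gap-open term would require either the inner loop over k shown in the recursion, or additional matrices that track whether a gap is currently open.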


4. Database searching

When we have a sequence, and we want to find other sequences similar to it in a database, we do not really need the full alignment of this sequence against all others. All we want is a value, a score, that will tell us how similar our probe sequence is to every other sequence. This score should be sensitive (so that as many of the true homologs as possible are found) and specific (so that as few false positives as possible are hit).

There is a simple rule-of-thumb: A database hit having a sequence identity of 25% or more (protein lengths 200 residues or more) is almost certainly a true hit, if one uses reasonable parameter settings for the common programs BLAST or FASTA. There are cases where this is not true, for example when the sequences have a high amount of low-complexity regions (Ser-Thr-rich regions, and such), but this can usually be dealt with by applying a low-complexity filter.

But what to do about hits with a lower degree of identity? The basic problem is how to judge whether a score is significant or not. Could a given score be the result of pure chance? The various search programs (BLAST, FASTA) attempt to answer this question by computing an expectation value (or something similar). This is an estimate of how likely it is that a hit this good arises by pure chance, given the size of the database. This calculation uses probability theory and various (reasonable) assumptions. The value should be as low as possible. If it is not very small (say, 0.01 rather than 0.0 or 1.0e-45), then the hit is suspect.

5. Multiple sequence alignment: why?

Pairwise alignments are fundamental and useful, but there are some problems with them. For instance, when using one of the popular sequence searching programs (FASTA, BLAST) which perform pairwise alignments to find similar sequences in a database, one very often obtains many sequences that are significantly similar to the query sequence. Comparing each and every sequence to every other may be possible when one has just a few sequences, but it quickly becomes impractical as the number of sequences increases.

What we need is multiple sequence alignment, where all similar sequences can be compared in one single figure or table. The basic idea is that the sequences are aligned on top of each other, so that a coordinate system is set up, where each row is the sequence for one protein, and each column is the 'same' position in each sequence. Each column corresponds to a specific residue in the 'prototypical' protein.

As with pairwise alignment, there will be gaps in some sequences, most often shown by the dash '-' or dot '.' character. Note that to construct a multiple alignment, one may have to introduce gaps in sequences at positions where there were no gaps in the corresponding pairwise alignment.


This means that multiple alignments typically contain more gaps than any given pair of aligned sequences.

6. An example of a multiple alignment

Let us use the cellulose-binding domain of cellobiohydrolase I (CBD-CBH1) as an example of what one may do with a multiple sequence alignment. This is a small (about 30-35 residues) disulphide-bonded domain of known 3D structure (PDB code 1CBH). Homologous domains can be found in a number of other cellulose-degrading enzymes. It is believed that the function of the domain is to bind with high affinity to the cellulose fiber to allow the adjacent enzymatic domain to hydrolyse the cellulose. Another possibility is that the CBD domain wedges itself in between cellulose chains, making it easier for the enzymatic domain to attack the fiber.

The multiple alignment of these sequences is taken from Pfam (identifier CBD_1, accession code PF00734). Shown below is the so-called seed alignment, containing the sequences the Pfam curators have used to define the family. This is just a part of the complete alignment file; some comments have been removed. For each sequence, the SWISS-PROT identifier and the position in the parent protein is given on the left. The top line shows the position numbers using the 1CBH 3D structure scheme. The bottom line shows the consensus, which we define here as the same amino-acid residue type in 14 or more sequences (out of 18). Please note that this definition of consensus is just one of many possible.

                          1            2         3
                    45678901...234567890123456789012

GUX1_TRIRE/481-509  HYGQCGGI...GYSGPTVCASGTTCQVLNPYY
GUN1_TRIRE/427-455  HWGQCGGI...GYSGCKTCTSGTTCQYSNDYY
GUX1_PHACH/484-512  QWGQCGGI...GYTGSTTCASPYTCHVLNPYY
GUN2_TRIRE/25-53    VWGQCGGI...GWSGPTNCAPGSACSTLNPYY
GUX2_TRIRE/30-58    VWGQCGGQ...NWSGPTCCASGSTCVYSNDYY
GUN5_TRIRE/209-237  LYGQCGGA...GWTGPTTCQAPGTCKVQNQWY
GUNF_FUSOX/21-49    IWGQCGGN...GWTGATTCASGLKCEKINDWY
GUX3_AGABI/24-52    VWGQCGGN...GWTGPTTCASGSTCVKQNDFY
GUX1_PENJA/505-533  DWAQCGGN...GWTGPTTCVSPYTCTKQNDWY
GUXC_FUSOX/482-510  QWGQCGGQ...NYSGPTTCKSPFTCKKINDFY
GUX1_HUMGR/493-521  RWQQCGGI...GFTGPTQCEEPYICTKLNDWY
GUX1_NEUCR/484-512  HWAQCGGI...GFSGPTTCPEPYTCAKDHDIY
PSBP_PORPU/26-54    LYEQCGGI...GFDGVTCCSEGLMCMKMGPYY
GUNB_FUSOX/29-57    VWAQCGGQ...NWSGTPCCTSGNKCVKLNDFY
PSBP_PORPU/69-97    PYGQCGGM...NYSGKTMCSPGFKCVELNEFF
GUNK_FUSOX/339-370  AYYQCGGSKSAYPNGNLACATGSKCVKQNEYY
PSBP_PORPU/172-200  RYAQCGGM...GYMGSTMCVGGYKCMAISEGS
PSBP_PORPU/128-156  EYAACGGE...MFMGAKCCKFGLVCYETSGKW

consensus           ...QCGG.......G...C.....C.......
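A count-based consensus of the kind defined above (a residue type appears in the consensus line if it occurs in at least a given number of sequences) can be sketched as follows. The function name, the treatment of '.' and '-' as gap characters, and the toy alignment in the usage note are assumptions made for this illustration.

```python
from collections import Counter

GAP_CHARS = "-."


def consensus(alignment, min_count):
    """Return a consensus string for a list of equal-length aligned rows.

    A column contributes its most common residue if that residue occurs
    at least min_count times; otherwise the column is written as '.'.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    result = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c not in GAP_CHARS)
        best = counts.most_common(1)
        if best and best[0][1] >= min_count:
            result.append(best[0][0])
        else:
            result.append(".")
    return "".join(result)
```

For the toy alignment `["QCGG", "QCGA", "ACGG", "QC.G"]`, a threshold of 3 keeps every column, while a threshold of 4 keeps only the fully conserved Cys column. Changing the threshold changes the consensus, which is one reason why there are many possible definitions.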


It is somewhat difficult to see the patterns of conservation in this table. The positions that are nearly completely conserved are easy to spot, but with more varied patterns, it becomes more difficult. For example, in position 24 there is a threonine in 9 sequences, and lysine in 5.

There is usually a problem with the numbering scheme in multiple alignments: the numbers in a multiple alignment are usually different from the numbering of any of the single sequences. (The terms "absolute numbers" versus "relative" have been used to describe the difference). Therefore it is necessary to be very careful when using sequence numbers from a multiple alignment; the numbers may be very different from the actual positions of the residues in any single sequence. For a few protein families (e.g. serine proteases of the trypsin family), a general scheme has been adopted that most scientists in the field use.

It is common to use shaded boxes or coloured background to highlight residues or segments of a multiple alignment where the residues are strongly conserved. Commercial and some academic software can be used to add such features, but there is no common standard for exactly how this should look. It is necessary to check with the program documentation to figure out exactly how it works.

There are several other observations one might make about multiple alignments. For example, the fact that residues are aligned in a column does not necessarily mean that they are actually aligned structurally or in any other way. There is no common way of showing a 'frayed' alignment.

The correlation between residues far apart in the sequence in a protein family is usually difficult to spot in a multiple alignment. Other methods must be used to visualize this.

The combination of a known 3D structure and a multiple alignment can be very powerful for understanding a protein domain. Of course, knowledge of the biology and chemistry of the proteins increases the understanding. Often, a multiple alignment can help tie together many different observations into a coherent view of the structure and function of a domain.

Here is a schematic image of the 3D structure (PDB code 1CBH), as determined by protein NMR (Kraulis et al, 1989). A few residues are labelled, just to show some features of the structure, and to help with comparing with the multiple alignment.


A very useful representation of the conservation patterns is the so-called sequence logo. This shows the conserved residues as larger characters, where the total height of a column is proportional to how conserved that position is. Technically, the height is proportional to the information content of the position. Here is a web site for generating a sequence logo in PostScript format from an alignment.
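The information content that sets the column height in a sequence logo can be estimated from the residue counts in that column. The sketch below assumes a 20-letter protein alphabet, ignores gap characters, and omits the small-sample correction that logo programs usually apply; the function name is chosen for this illustration.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=20):
    """Information content of one alignment column, in bits.

    Computed as log2(alphabet_size) minus the Shannon entropy of the
    observed residue frequencies. Gap characters are ignored.
    """
    residues = [c for c in column if c not in "-."]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum(
        (n / total) * math.log2(n / total)
        for n in Counter(residues).values()
    )
    return math.log2(alphabet_size) - entropy
```

A completely conserved column reaches the maximum of log2(20), about 4.32 bits, and a column split evenly between two residue types scores exactly one bit less.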


On closer examination of this multiple alignment (together with the known 3D structure), there are a number of features in the multiple sequence alignment CBD_1 that stand out. For example:

One of the sequences (GUNK_FUSOX) contains an insertion. The location of this insertion corresponds to a turn in the 3D structure, so it can easily be accommodated without large rearrangements of any other parts of the domain.

There is a problem close to the insertion in the sequence GUNK_FUSOX. The Tyr residue immediately to the left of it should probably be in the column 13, which has a conserved aromatic residue (see above). Otherwise, the entire loop structure in this region would have to be very different. However, having two gaps on either side of a residue is something most multiple alignment programs do not like.

There are three strongly conserved Gly residues, numbers 9, 10 and 15 in the numbering of the alignment above. There could be several (different) reasons for this; the phi-psi angles may have values forbidden or highly unfavourable for other residues, or the positions in the structure may not allow for any sidechain without destroying the function.

Inspection of the conserved residues shows that several of them (5 Trp/Tyr, 7 Gln, 31 Tyr/Phe/Trp, 32 Tyr) are located on the same side of the 3D structure. There seems to be no particular architectural reason for the conservation of these residues (they do not form a hydrophobic core, for instance), so maybe it has to do with the function of binding cellulose?

Residue 13 Tyr/Phe/Trp has a very special reason for being conserved as an aromatic residue, as shown by the structure. There is a hydrogen bond between the amide H of the next residue with the pi-system of the aromatic ring.

There are three strictly conserved Cys residues in the alignment. This is strange. These proteins are secreted, so one would assume that the cysteines form disulfide bridges, so the number of conserved Cys residues should be even. If one looks closer at the individual sequences in SWISS-PROT, one can spot another conserved Cys three residues further on from the end of the Pfam alignment. Apparently, the Pfam definition ought really to contain a few more residues at the C-terminus.


7. How to generate a multiple alignment?

Given a pairwise alignment, just add the third sequence, then the fourth, and so on, until all have been aligned.

The result depends not only on the various parameters (insertion/deletion penalties, substitution coefficients,...) but also on the order in which sequences are added to the multiple alignment.

In pairwise alignments, one has a two-dimensional matrix with the sequences on each axis, and the elements in the matrix are initially the substitution coefficients, which are then operated on (as you have done previously) to locate the best "path" through the matrix. The number of operations required to do this is approximately proportional to the product of the lengths of the two sequences.

A possible general method would be to extend the pairwise alignment method into a simultaneous N-wise alignment, using a complete dynamic-programming algorithm in N dimensions. Algorithmically, this is not difficult to do.

In the case of three sequences to be aligned, one can visualize this reasonably easily: One would set up a three-dimensional matrix (a cube) instead of the two-dimensional matrix for the pairwise comparison. Then one basically performs the same procedure as for the two-sequence case. This time, the result is a path that goes diagonally through the cube from one corner to the opposite.

The problem here is that the time to compute this N-wise alignment becomes prohibitive as the number of sequences grows. The algorithmic complexity is something like $O(c^{2n})$, where $c$ is a constant, and $n$ is the number of sequences. (See the next section for an explanation of this notation.) This is disastrous, as may be seen in a simple example: if a pairwise alignment of two sequences takes 1 second, then four sequences would take $10^4$ seconds (2.8 hours), five sequences $10^6$ seconds (11.6 days), six sequences $10^8$ seconds (3.2 years), seven sequences $10^{10}$ seconds (317 years), and so on.
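The numbers in this example can be reproduced with a one-line formula. The sketch assumes c = 10, which is the value implied by the figures quoted in the text rather than something stated there, and normalises the running time so that two sequences take one second. The function name is chosen for this illustration.

```python
def n_wise_seconds(n, c=10, pairwise_seconds=1.0):
    """Estimated time for a simultaneous alignment of n sequences.

    Assumes the time grows as c**(2n) and is scaled so that n = 2
    takes pairwise_seconds.
    """
    return pairwise_seconds * c ** (2 * n) / c ** 4


SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY
```

Dividing `n_wise_seconds(4)` by `SECONDS_PER_HOUR` gives about 2.8, and dividing `n_wise_seconds(7)` by `SECONDS_PER_YEAR` gives about 317, in line with the values quoted above. Each additional sequence multiplies the time by a factor of 100.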

8. Algorithmic complexity

An algorithm is supposed to do two things:

1. Compute the correct answer given valid input data.

2. Perform the computation in a reasonable time.

The second point is of course essential. A correct algorithm is useless if it is too slow. In such cases, we must find another algorithm, or maybe we will have to accept some approximation which allows us to use another algorithm, which runs in reasonable time, but only generates an approximately correct result (whatever that means).

Computer science is the science that deals with the issues of proving that an algorithm is correct, and what its properties are, such as how much time it can be expected to need in the average case, or how much time it will use in the worst case. These analyses are done without reference to any specific computer or programming language; the issue is how many calculation steps (such as additions and multiplications, or exchanges of items in an array) are needed to obtain the result, and how this depends on the size of the problem.

One of the most important properties of an algorithm is how its execution time increases as the problem is made larger. By a larger problem, we mean e.g. more sequences to align, or longer sequences to align. This is the so-called algorithmic (or computational) complexity of the algorithm.

Let's say we have two alternative algorithms (A and B) for solving the same problem. Algorithm A is the fastest when we have a small number of input data points. But what happens when we have larger input data sets? If the time required by A and B increases in the same way, then A will always be the best. But what if the time to execute A is proportional to the square of the input data size, while B is linear? Then clearly there will be a point at which B becomes the better choice.

There is a notation to describe this, called the big-O notation (in Swedish: "stort ordo"). If we have a problem size (number of input data points) $n$, then an algorithm takes $O(n)$ time if the time increases linearly with $n$. If the algorithm needs time proportional to the square of $n$, then it is $O(n^2)$.

It is important to realize that an algorithm that is quick on small problems may be totally useless on large problems if it has a bad O() behaviour. As a rule of thumb one can use the following characterizations (which a proper computer scientist would not like), where n is the size of the problem, and c is a constant:


$O(c)$          utopian
$O(\log n)$     excellent
$O(n)$          very good
$O(n \log n)$   decent
$O(n^2)$        not so good
$O(n^3)$        pretty bad
$O(c^n)$        disaster

Please note that the phrase "the size of the problem" may mean different things depending on the context. For example, in sequence searching, it may mean the number of residues in the query sequence or in the database, or the number of sequences in the database. The O() of an algorithm may be different depending on which parameter is relevant for describing the size of the problem. Also keep in mind that there are other resources, such as the amount of memory an algorithm needs, that can limit its usefulness.

Alternative algorithms for solving the same problem may differ in how much computation must be done to set up the initial data structure, the initialization stage. There are examples of algorithms where the setup stage is expensive, but the computations that follow it are cheap, compared to some other method. In these cases, the choice of the best algorithm depends on how often the problem arises, and whether the setup can be saved and maintained between runs.

It is in some cases necessary to distinguish between the behaviour of an algorithm in the worst case compared to the average case. Some algorithms may have very different behaviour depending on the exact values of the input data. For example, an algorithm to do multiple alignment may finish quickly if the sequences are very similar, or slowly if the sequences are just barely similar. Some algorithms have quite good average O() behaviour, but may be awful in the worst case. Clearly, this can be important to know.

9. A standard multiple alignment program: ClustalW

From what we have learned in previous sections, doing a simultaneous N-wise alignment is not a realistic option if we have, say, 50 sequences to align. What to do? The obvious alternative is to use a so-called progressive alignment method: The alignment is built up in stages where a new sequence is added to an existing alignment, using some rules to determine in which order the sequences should be added, and how.

ClustalW (Thompson, Higgins & Gibson, 1994) is one of the standard programs implementing one variant of the progressive method in wide use today for multiple sequence alignment. The W denotes a specific version that has been developed from the original Clustal program.

The basic steps of the algorithm implemented in ClustalW are:

1. Compute the pairwise alignments for all against all sequences. The similarities are stored in a matrix (sequences versus sequences).

2. Convert the sequence similarity matrix values to distance measures, reflecting evolutionary distance between each pair of sequences.

3. Construct a tree (the so-called guide tree) for the order in which pairs of sequences are to be aligned and combined with previous alignments. This is done using a neighbour-joining clustering algorithm. In the case of ClustalW, a method by Saitou & Nei is used.

4. Progressively align the sequences/alignments together into each branch point of the guide tree, starting with the least distant pairs of sequences. At each branch point, one must do either a sequence-sequence, sequence-profile, or profile-profile alignment.
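The first two steps, and the choice of where step 4 starts, can be illustrated with a small sketch. It is not ClustalW's actual procedure: it assumes the sequences already have equal length so that the fraction of identical positions can stand in for a real pairwise alignment score, it uses the simple conversion distance = 1 - identity, and it does not build a neighbour-joining tree. All function names are chosen for this illustration.

```python
from itertools import combinations


def fraction_identity(a, b):
    """Fraction of identical positions in two non-empty, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """All-against-all distances for a dict {name: sequence}.

    Returns a dict keyed by (name1, name2) pairs.
    """
    return {
        (p, q): 1.0 - fraction_identity(seqs[p], seqs[q])
        for p, q in combinations(list(seqs), 2)
    }


def closest_pair(dist):
    """The least distant pair, i.e. where progressive alignment would start."""
    return min(dist, key=dist.get)
```

For n sequences, `distance_matrix` performs n(n-1)/2 comparisons, which is where the quadratic growth in the number of sequences comes from. A guide tree would then be built from these distances to decide the order of all later alignment steps.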

Note that the original idea of using a simultaneous N-wise dynamic programming alignment method had the algorithmic complexity $O(c^{2n})$, whereas this method has something like $O(n^2)$ (which comes from the all-against-all pairwise comparison step).

A number of rules (tricks, some would say) are used to increase the success rate of the procedure:

Each sequence is weighted according to how different it is from the other sequences. This accounts for the case where one specific subfamily is overrepresented in the data set.

The substitution matrix used for each alignment step depends on the similarity of the sequences (a somewhat circular argument, but what the hell...).

Position-specific gap-open penalties are modified according to residue type using empirical observations in a set of alignments based on 3D structures. In general, hydrophobic residues have higher gap penalties than hydrophilic, since they are more likely to be in the hydrophobic core, where gaps should not occur.

Gap-open penalties are decreased if the position is spanned by a consecutive stretch of five or more hydrophilic residues.

Both gap-open and gap-extend penalties are increased if there are no gaps in a column, but gaps occur nearby in the alignment.

The guide tree can in some circumstances be overridden, for instance by deferring joining two branches if they are too dissimilar, until more information has been added by processing other branches.
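As an illustration of one of these rules, the following Python sketch lowers the gap-open penalty at positions that lie inside a stretch of five or more consecutive hydrophilic residues. It is a simplified sketch under several assumptions: it looks at a single sequence rather than at the columns of an alignment, the set of residues counted as hydrophilic (GPSNDQEKR) is assumed here, and the base penalty and the reduction factor are made-up values, not ClustalW's actual numbers.

```python
# Assumption: residues treated as hydrophilic for this illustration.
HYDROPHILIC = set("GPSNDQEKR")

def hydrophilic_run_positions(seq, min_run=5):
    """Return one boolean per position: True where the residue lies inside
    a run of at least min_run consecutive hydrophilic residues."""
    flags = [False] * len(seq)
    start = None
    # The sentinel "#" is not hydrophilic, so it closes a run at the end.
    for i, res in enumerate(seq + "#"):
        if res in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                for k in range(start, i):
                    flags[k] = True
            start = None
    return flags

def gap_open_penalties(seq, base=10.0, reduction=0.5):
    """Position-specific gap-open penalties (hypothetical values):
    the base penalty, multiplied by `reduction` inside hydrophilic runs."""
    return [base * reduction if f else base
            for f in hydrophilic_run_positions(seq)]

print(gap_open_penalties("LLGSNDQLL"))
# [10.0, 10.0, 5.0, 5.0, 5.0, 5.0, 5.0, 10.0, 10.0]
```

The other rules in the list would modify the same per-position penalty vector in a similar way before each alignment step.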

There are some specific cases where ClustalW is known to have problems.

If the sequences are similar only in some smaller regions, while the larger parts are not recognisably similar, then ClustalW may have problems aligning all sequences properly. This is because ClustalW tries to find global alignments, not local ones. In such a case, it may be wise to cut out the similar parts with some other tool (for example, a text editor).

If one sequence contains a large insertion compared to the rest, then there may be problems, for much the same reason as the previous point.

If one sequence contains a repetitive element (such as a domain), while another sequence only contains one copy of the element, then ClustalW may split the single domain into two half-domains to try to align the first half with the first domain in the first sequence, and the other half with the second domain in the first sequence. There are many proteins that contain multiple, very similar copies of a domain, so one should watch out for this.

ClustalW is an example of an algorithm that has given up on trying to be perfect (because it takes too much time), and instead uses an approximation strategy, combined with more-or-less intelligent tricks that guide the computation towards a successful (but not necessarily optimal) result. This is called a heuristic algorithm.

One important point to keep in mind is that since ClustalW is a heuristic algorithm, it cannot produce a solution that is guaranteed to be optimal. But in practice, the results it produces are good enough, and one should perhaps worry more about the quality of the input data. For example, if one has sequences that are just barely significantly similar, one should worry more about whether all of them really belong in the alignment at all, rather than whether the alignment is perfect or not (which in such a case it almost certainly isn't).

ClustalW has a number of parameters that the user can change. These affect the exact manner in which the computation proceeds, and it may be useful to compare runs with different parameters; the best parameter set varies with the specific case.

10. Databases of multiple alignments; domains

Very early in the days of protein sequence analysis, it was observed that some protein sequences contained long segments that were very similar to other proteins, while the rest of the sequence in that protein had no detectable similarity. Today, we take more or less for granted that proteins are composed of domains, segments of sequence which have been joined together by genetic events during evolution so that the new protein has a function that is based on the activities of the domains it contains.

Often the domains detectable by sequence analysis correspond to structural domains in the 3D structure as well. There are now many well-documented cases where it has been shown that domains can exist perfectly well in isolation, when excised from the original protein. Surprisingly often, a domain can be expressed and folded all on its own.

There are today several databases that keep track of which domains have been discovered, which proteins are involved, and that store the multiple sequence alignments of the relevant segments of the protein sequences. We have already discussed one such database, Pfam. Also, several of the primary sequence databases now contain information about the domains in the sequence entries.

The idea behind Pfam is twofold:

1. Create and maintain good-quality multiple sequence alignments of well-defined protein sequence domains from proteins in SWISS-PROT.

2. Use these multiple alignments for creating so-called HMMs that can be used in profile searches of sequence databases.

The multiple alignment used to define a domain (protein family) in Pfam is called the seed alignment. It is created by a curator, or taken from the literature. It is used to generate a profile HMM for identifying other sequences in the databases (SWISS-PROT and TREMBL) that contain the domain. The search results are inspected to decide which cutoff should be used for that particular Pfam entry. The search hits are then aligned automatically into a so-called full alignment.

There are a number of other useful databases of multiple sequence alignments, such as:

PRINTS, multiple motifs consisting of ungapped, aligned segments of sequences, which serve as fingerprints for a protein family.

BLOCKS, multiple motifs of ungapped, locally aligned segments created automatically.

These databases allow analysis of new sequences in terms of which domains can be detected in the sequence. This is often more useful, and sometimes also more sensitive (although this is somewhat controversial) than doing sequence-to-sequence comparisons. For instance, if a new protein has a kinase domain, then it is more helpful to use a domain database (with some appropriate search software, such as HMMER for Pfam) to identify it directly in the sequence. The alternative, using BLAST or FASTA to find similar sequences, would return thousands of sequences, and it would require some work to sort out that this is because the query sequence contains a very common kinase domain.
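The idea of searching a new sequence with a family description instead of a single sequence can be sketched in Python for the simplest case, an ungapped block of aligned segments. This is not how Pfam or HMMER work internally (profile HMMs also model insertions and deletions), and it is not the actual scoring used by PRINTS or BLOCKS. The sketch assumes a uniform background frequency for the 20 amino acids and a pseudocount of 1, and the block and the query are invented examples.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(block, pseudocount=1.0):
    """Log-odds score per residue for each column of an ungapped block,
    relative to an assumed uniform background frequency."""
    length = len(block[0])
    if any(len(s) != length for s in block):
        raise ValueError("all segments in the block must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(block) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(s[col] for s in block)
        profile.append({aa: math.log((counts[aa] + pseudocount) / total / background)
                        for aa in AMINO_ACIDS})
    return profile

def best_window(profile, query):
    """Slide the profile along the query without gaps and return
    (start, score) of the best-scoring window, or None if the query is too short."""
    width = len(profile)
    best = None
    for start in range(len(query) - width + 1):
        # Residues outside the 20 standard amino acids score 0.0.
        score = sum(profile[k].get(query[start + k], 0.0) for k in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best

block = ["GKST", "GKSS", "GKTT"]        # invented ungapped block
profile = build_profile(block)
start, score = best_window(profile, "AAAGKSTAAA")
print(start)  # 3: the window "GKST"
```

Because the profile summarises the whole family, a hit against it points directly to the domain, whereas a list of similar sequences from a pairwise search still has to be interpreted.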
