Alignments Lecture

  • 8/8/2019 Alignments Lecture


    Pharmaceutical Bioinformatics, 7.5p Lecture notes


    Alignments in bioinformatics

    Lecture notes

    Compiled by:

    Ola Spjuth

    Department of Pharmaceutical Biosciences

    Uppsala University


    Sequence Analysis
        Biological Background for Sequence Analysis
        Searching of databases for sequences similar to a new sequence

    Sequence alignment

    Multiple sequence alignment
        Evaluating local multiple alignments

    Tools for sequence alignment
        BLAST
        Clustal

    Uses of multiple alignment
        Searching
        PCR primer design

    Structural alignments
        Data produced by structural alignment

    References


    Four different nucleotides taken three at a time can result in 64 different possible triplet codes, more than enough to encode 20 amino acids. The 64 codes are mapped onto the 20 amino acids in two ways: first, one amino acid may be encoded by 1 to 6 different triplet codes, and second, 3 of the 64 codes, called stop codons, specify "end of peptide sequence". Where multiple codons specify the same amino acid, the different codons are used with unequal frequency, and this distribution of frequency is referred to as "codon usage". Codon usage varies between species.
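    The counts quoted above (64 triplets, 20 amino acids, 3 stop codons, 1 to 6 codons per amino acid) can be checked with a short Python sketch. It assumes the standard genetic code written in the compact one-letter form, with bases in the order T, C, A, G and "*" marking a stop codon; the variable names are illustrative only.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
# "*" marks a stop codon. Other organisms/organelles may use variant codes.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every triplet to its one-letter amino acid (or "*").
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# The triplets that signal "end of peptide sequence".
stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

# Number of different codons encoding each amino acid (stops excluded).
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

if __name__ == "__main__":
    print("codons:", len(CODON_TABLE))
    print("stop codons:", stop_codons)
    print("amino acids:", len(degeneracy))
    print("codons per amino acid:", dict(degeneracy))
```

    Codon usage itself is not modelled here; it would require counting codons in real coding sequences of a given species.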

    The fact that DNA nucleotides need to be read three at a time to specify a protein sequence implies that a DNA sequence has three different reading frames, determined by whether you start at nucleotide one, two, or three. (Nucleotide four will be in the same frame as nucleotide one, and so on.) Both strands of DNA can be copied into RNA (for translation into protein). Thus, a DNA sequence with its (inferred) complementary strand can specify six different reading frames.
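    As a minimal sketch of the six reading frames, the Python below splits a DNA string into codons for each of the three offsets on the given strand and on its reverse complement. The function names and the frame labels ("+1" to "-3") are illustrative choices, and the input is assumed to contain only the upper-case letters A, C, G and T.

```python
# Translation table for complementing a DNA string (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the inferred complementary strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def codons(dna, offset):
    """Split dna into complete triplets, starting at the given offset (0, 1 or 2)."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna):
    """Return a dict mapping frame label to the list of codons in that frame."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = codons(seq, offset)
    return frames


if __name__ == "__main__":
    for label, frame in six_frames("ATGGCCTAA").items():
        print(label, frame)
```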

    It is possible to chemically determine the sequence of amino acids in a protein and of nucleotides in RNA or DNA. However, it is vastly easier at present to determine the sequence of DNA than that of RNA or protein. Since the sequence of a protein can be determined from the DNA sequence that encodes it, most protein sequences are in fact inferred from DNA sequences. Conversion of RNA to a DNA copy (cDNA) is a simple laboratory procedure, so RNA molecules are themselves sequenced as cDNA copies.

    Searching of databases for sequences similar to a new sequence

    If you have just determined a sequence of an interesting bit of DNA, one of the first questions you are likely to ask yourself is "has anybody else seen anything like this?" Fortunately, there has been a very successful international effort to collect all the sequences people have determined in one place so they can be searched. For DNA sequences, three groups have cooperated in this effort, one in Japan, one in Europe, and one in the United States, to produce DDBJ, EMBL and GenBank, respectively. These databases are frequently reconciled with each other, so that searching any one is virtually the same as searching all three. The problem is that these databases are HUGE and, as a result, you must compare your sequence with this vast number of other sequences efficiently. A number of programs have been written to rapidly search a database for a query sequence, two of which, BLAST and FASTA, will be discussed in this course. The techniques used by these programs to make searching rapid result in some loss of rigor of comparison. It is possible (although, as it turns out, unlikely) that a weak but relevant similarity could be missed by these programs. In addition, many times these programs will flag a sequence as being similar to your query sequence when this similarity is not significant. Thus, these programs should be seen as tools for identifying a small subset of sequences from the database for retrieval and further analysis rather than ends in themselves.
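    To illustrate why fast searching trades away some rigor, here is a toy Python sketch of a word-based prefilter: database sequences are indexed by their short words (k-mers), and only sequences sharing at least one word with the query are kept as candidates. This is only in the spirit of the word-matching heuristics behind BLAST and FASTA, not their actual algorithms; the function names, the word length and the tiny database are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return the set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_kmer_index(database, k):
    """Map each word of length k to the ids of the sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in database.items():
        for word in kmers(seq, k):
            index[word].add(seq_id)
    return index


def shared_kmer_counts(query, index, k):
    """Count, per database sequence, how many distinct query words it contains."""
    counts = Counter()
    for word in kmers(query, k):
        for seq_id in index.get(word, ()):
            counts[seq_id] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical two-sequence "database".
    database = {"s1": "ACGTACGTGA", "s2": "TTTTTTTTTT"}
    index = build_kmer_index(database, k=4)
    print(shared_kmer_counts("CGTACG", index, k=4))
```

    A sequence whose similarity to the query never produces an exact shared word is silently skipped, which is the kind of weak but relevant match such shortcuts can miss. Conversely, a shared word alone does not make a hit significant, so candidates still need proper alignment and evaluation.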

    Databases of protein sequences, including UniProt and PIR, also exist and can similarly be searched.


    First 90 positions of a protein multiple sequence alignment of instances of the acidic ribosomal protein P0 (L10E) from several organisms. Generated with ClustalW.

    Sequences can be aligned across their entire length (global alignment) or only in certain regions (local alignment). This is true for pairwise and multiple alignments. Global alignments need to use gaps (representing insertions/deletions) while local alignments can avoid them, aligning regions between gaps.

    Evaluating local multiple alignments

    Some programs give quantitative measures for the significance of the alignment. These are usually based on the chance occurrence of such alignments and depend on the size and composition of the aligned sequences. Empirical measures are also extremely useful for deciding the 'correctness' of the multiple alignment. Consistency is a powerful measure for correct multiple alignments. If the same alignment is found in the sequence-to-sequence searches and various multiple alignment methods, it is most probably correct. One pitfall to avoid is biased sequence composition that may lead to trivial alignments.
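    One simple way to spot the biased composition mentioned above is to compute the Shannon entropy of a sequence's residue composition: a sequence dominated by one or two residues has low entropy and will align trivially with many others. The Python sketch below is an illustrative check, not a method prescribed by these notes, and any cut-off for "too biased" is left to the reader.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Return the composition entropy of seq in bits per residue."""
    if not seq:
        raise ValueError("sequence must not be empty")
    total = len(seq)
    counts = Counter(seq)
    # Sum of -p * log2(p) over the observed residues.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # A uniform four-letter composition reaches the DNA maximum of 2 bits;
    # a single repeated residue carries no information at all.
    for seq in ("ACGTACGTACGT", "AAAAAAAAAAAT", "AAAAAAAAAAAA"):
        print(seq, round(shannon_entropy(seq), 3))
```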

    Experimental data can be used in evaluating, and even constructing, multiple alignments. For example, if we know the catalytic site in the aligned proteins, we expect the sites to be aligned together and may 'force' that alignment. Such manual alignments can serve as a seed to an alignment with more sequences.

    Local multiple alignments (blocks) from different programs can be joined or used together. Another approach is 'divide and conquer'. Blocks present in all sequences divide them into separate parts, in each of which more blocks can be searched for.

    Tools for sequence alignment


    BLAST is an acronym for Basic Local Alignment Search Tool, and it consists of a set of algorithms for comparing biological sequences such as nucleotide or protein sequences. A nucleotide sequence is simply a DNA sequence (or part of one) expressed as a long string of 4 characters: A, T, C and G. They stand for adenine, thymine, cytosine and guanine, respectively. So, every nucleotide sequence consists of only these four characters arranged in different orders.

    BLAST allows you to compare your sequence against a database of sequences and informs you if your sequence matches any of the sequences in the database, along with a lot of information like:

    * Identity of match (% of characters matched; often loosely called homology)
    * Alignment length (over what length did the nucleotides match)
    * E-value (expectation value: the number of different alignments with scores equivalent to or better than the alignment score S that are expected to occur in a database search by chance. The lower the E-value, the more significant the score)
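    The three quantities in the list above can be sketched in Python as follows. Percent identity and alignment length are computed from two already-aligned strings in which "-" marks a gap. The E-value uses the Karlin-Altschul form E = K * m * n * exp(-lambda * S); the parameters K and lambda depend on the scoring system and sequence composition, so the numbers in the usage example are placeholders, and real BLAST additionally corrects the sequence lengths. Function names are illustrative.

```python
import math


def alignment_stats(aligned_a, aligned_b):
    """Return (alignment length, percent identity) for two gapped, equal-length strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    length = len(aligned_a)
    # A column counts as a match only if both residues agree and neither is a gap.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return length, 100.0 * matches / length


def expected_hits(score, query_len, db_len, k, lam):
    """Expected number of chance alignments scoring at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    length, identity = alignment_stats("ACGT-A", "ACCTTA")
    print(f"alignment length {length}, identity {identity:.1f}%")
    # Placeholder K and lambda, chosen only to show the trend with score.
    for s in (20, 40, 60):
        print(s, expected_hits(s, query_len=300, db_len=10**9, k=0.1, lam=0.3))
```

    Raising the score lowers the expected number of chance hits exponentially, which is why a lower E-value indicates a more significant match.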

    For a complete BLAST glossary you may visit

    So, now that you know BLAST can be used to align two sequences and to study the similarity between two or more sequences, let us look into the principles of sequence alignment briefly.

    Sequence alignment refers to arranging two sequences in an order such that their similar portions are highlighted.

    For example:

    AGCTATGGGCAAATTTGGAACAAACCAAAAAGT
    ........ ........ ...............


    The portions in the sequence which do not match are shown by gaps in the alignment.

    Global Alignment: It refers to the alignment in which all the characters in both sequences participate in the alignment.

    Local Alignment: It refers to finding closely matching regions between sequences. In local alignment the beginning part (say nucleotides 0-100) of a sequence may align with the ending part of another sequence (say 400-500).
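    The difference between the two definitions can be made concrete with the classic dynamic-programming scores: Needleman-Wunsch for global alignment and Smith-Waterman for local alignment. These algorithms are not named in the notes; they are used here as the standard textbook instances. The sketch below computes scores only (no traceback), and the match, mismatch and gap values are arbitrary assumptions.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: every character of a and b must take part."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman score: best-scoring pair of regions anywhere in a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment start afresh at any cell.
            value = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            cur.append(value)
            best = max(best, value)
        prev = cur
    return best


if __name__ == "__main__":
    # The end of the first sequence matches the start of the second.
    x, y = "TTTTACGT", "ACGTCCCC"
    print("global:", global_score(x, y))
    print("local: ", local_score(x, y))
```

    In the usage example the shared region ACGT sits at the end of one sequence and the start of the other, so the local score rewards it fully while the global score is dragged down by the unmatched flanks.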

    BLAST flavours

    The BLAST programs are widely used tools for searching DNA and protein databases for sequence similarity to identify homologs to a query sequence. While often referred to as just "BLAST", this can really be thought of as a set of programs: blastp, blastn, blastx, tblastn, and tblastx.

    The five flavours of BLAST perform the following tasks:

    blastp
        o Compares an amino acid query sequence against a protein sequence database

    blastn
        o Compares a nucleotide query sequence against a nucleotide sequence database

    blastx
        o Compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database

    tblastn
        o Compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands)

    tblastx
        o Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. (Due to the nature of tblastx, gapped alignments are not available with this option.)
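    The five flavours listed above can be summarised as a lookup from the type of query and the type of database to a program name. The Python helper below is purely hypothetical (it is not part of any BLAST distribution); the `translated` flag is an assumption introduced to separate blastn from tblastx, which both take a nucleotide query and a nucleotide database.

```python
def choose_blast_program(query_type, db_type, translated=False):
    """Pick a BLAST flavour from the query and database types.

    query_type and db_type are "protein" or "nucleotide". `translated`
    only matters for nucleotide-vs-nucleotide searches: if True, both
    sides are compared as six-frame translations (tblastx).
    """
    if query_type == "nucleotide" and db_type == "nucleotide":
        return "tblastx" if translated else "blastn"
    programs = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",
        ("protein", "nucleotide"): "tblastn",
    }
    try:
        return programs[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"unsupported combination: {query_type!r}, {db_type!r}")


if __name__ == "__main__":
    for q in ("protein", "nucleotide"):
        for d in ("protein", "nucleotide"):
            print(q, "vs", d, "->", choose_blast_program(q, d))
```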

    Links for BLAST:

    NCBI's BLAST tool can be found at

    An article on methodology behind BLAST:

    How t