  • Introduction to Bioinformatics

    Lecture 1: Overview of Bioinformatics and Molecular Biology What is Bioinformatics?

    Defining the terms bioinformatics and computational biology is not necessarily an easy task, as evidenced by the multiple definitions available on the web. A recent Google search for "definition of bioinformatics" returned over 43,000 results! In the past few years, as both areas have grown, confusion over these two terms has increased. For some, bioinformatics and computational biology have become completely interchangeable terms, while for others there is a great distinction. I'll throw in my two cents, based on my experience of the consensus use of these two terms.

    Computational biology and bioinformatics are multidisciplinary fields, involving researchers from different areas of specialty, including (but by no means limited to) statistics, computer science, physics, biochemistry, genetics, molecular biology and mathematics. The goals of these two fields are as follows:

    Bioinformatics: Typically refers to the field concerned with the collection and storage of biological information. All matters concerned with biological databases are considered bioinformatics.

    Computational biology: Refers to the aspect of developing algorithms and statistical models necessary to analyze biological data through the aid of computers.

    In this respect, my understanding of bioinformatics and computational biology follows the NIH definitions listed below:

    Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

    Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

    Others have offered various opinions on these definitions as well: http://kbrin.kwing.louisville.edu/~rouchka/definition.html

    www.jntuworld.com

  • Image Source: http://ccb.wustl.edu/

    Bioinformatics = Hot Field

    Smart Money: #1 among next hot jobs http://smartmoney.com/consumer/index.cfm?story=working-june02

    Business Week: Among 50 Masters of Innovation http://www.businessweek.com/bw50/content/mar2001/bf20010323_198.htm

    So why is bioinformatics a hot field? One answer is that it is tied to the human genome project, which has generated a lot of popular interest. Various advances in molecular biology techniques (such as genome sequencing and microarrays) have led to a large amount of data that needs to be analyzed. Now that we are close to having the human genome finished, what does it all mean? That's where bioinformatics steps in. Bioinformatics can lead to important discoveries as well as help companies save time and money in the long run. In addition, there need to be methods to manage large amounts of data.

    One of the biggest reasons for bioinformatics being a hot field is the old supply and demand adage. There are simply too few people adequately trained in both biology and computer science to solve the problems that biologists need to have solved.

  • Introduction to Molecular Biology

    (For a good overview of this topic, please read: http://www.ebi.ac.uk/microarray/biology_intro.html)

    In order to be a good computational biologist, it is important to understand the terminology and basic processes behind the biological problems. Many interesting problems arise out of sequence analysis. There are two different types of biological sequences studied in this class: nucleic acid sequences (DNA/RNA) and amino acid sequences. But first, let's make sure the basics are covered.

    Cells

    Every organism is made up of tiny structures called cells. Most cells are too small to be seen with the naked eye. Each cell is itself a complex system enclosed in a membrane. Some organisms, such as bacteria and baker's yeast, are composed of only a single cell (i.e. they are unicellular). Other organisms are made up of many different cells (i.e. they are multicellular). For instance, the human body is composed of tens of trillions of cells (published estimates vary), belonging to a few hundred different cell types, each having a different function or structural property.

    Structure of an animal cell. Image source: www.ebi.ac.uk/microarray/biology_intro.htm

    There are two types of organisms: eukaryotes and prokaryotes. Eukaryotes (or, as Bruce Roe from the University of Oklahoma calls them, the "You and I Karyotes") represent most of the organisms which we can see, including plants and animals. Prokaryotes (such as bacteria) are smaller than eukaryotic cells and have a simpler structure. Prokaryotes are single-celled organisms (but not all single-celled organisms are prokaryotes!).

    So what is the difference between the two types of cells? A eukaryotic cell has a nucleus, which is separated from the rest of the cell by a membrane. Inside the nucleus are the chromosomes, where the genetic information for the organism is stored. In addition, eukaryotic cells contain organelles with various functions: membrane-bound organelles such as lysosomes and mitochondria, as well as structures such as centrioles and ribosomes that are not enclosed by a membrane. Contained within the nucleus are one or several long double-stranded DNA molecules organized as chromosomes. For humans, there are 22 pairs of autosomes, as well as one pair of sex chromosomes. One copy of each pair is inherited from each parent.

    Karyotype showing the 23 pairs of human chromosomes. Image source: http://avery.rutgers.edu/WSSP/StudentScholars/Session8/Session8.html

  • Image source: www.biotec.or.th/Genome/whatGenome.html

    DNA

    Deoxyribonucleic acid (DNA) is the molecule that encodes the information of life. A single-stranded DNA molecule, called a polynucleotide (or an oligomer when it is short), is a chain of small molecules called nucleotides. There are four different nucleotides, distinguished by their bases: adenine (A), cytosine (C), guanine (G) and thymine (T). The bases can be separated into two different types: purines (A and G) and pyrimidines (C and T). The difference between purines and pyrimidines is in the base structure. By stringing together a simple alphabet of four characters we can get enough information to create a complex organism!

    Different nucleotides can be strung together to form a polynucleotide. However, the two ends of the polynucleotide are different, meaning that each polynucleotide sequence has a directionality. The ends of the polynucleotide are marked either 3' or 5'. The general convention is to write the coding strand from 5' to 3' (left to right).

  • For instance, the following is a polynucleotide:

    5' GTAAAGTCCCGTTAGC 3'

    DNA can be either single-stranded or double-stranded. When DNA is double-stranded, the second strand is referred to as the reverse complement strand. This name is derived from the fact that the second strand runs in the opposite direction to the first, and the fact that the bases in the second strand are complementary to the bases in the first. Complementary bases are determined by which pairs of nucleotides can form bonds between them. In the case of DNA, A binds to T, and C binds to G. For the polynucleotide given above, the double-stranded polynucleotide is as follows:

    5' GTAAAGTCCCGTTAGC 3'
       ||||||||||||||||
    3' CATTTCAGGGCAATCG 5'

    Two complementary polynucleotide chains form a stable structure known as the DNA double helix. This spring represents the 50th anniversary of the discovery of the double helix structure of DNA by Watson, Crick and Franklin.
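The pairing rule above (A with T, C with G, second strand read in the opposite direction) can be sketched in a few lines of Python. This is only an illustration; the function name `reverse_complement` is a choice made here, not something defined in the lecture.

```python
# Complement of each DNA base: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


top = "GTAAAGTCCCGTTAGC"
bottom = reverse_complement(top)  # second strand, written 5' to 3'
print(bottom)                     # GCTAACGGGACTTTAC
print(bottom[::-1])               # CATTTCAGGGCAATCG (the same strand drawn 3' to 5')
```

Reading the second strand from its own 5' end gives GCTAACGGGACTTTAC; reversing that string reproduces the lower row of the double-stranded diagram.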

    DNA double helix structure.

    Image source: www.genecrc.org/site/lc/lc2b.htm

    Note that in this image there appear to be two types of grooves: a larger one, called the major groove, and a smaller one, known as the minor groove. In addition, there are roughly 10.5 base pairs in one complete turn of the helix.

    RNA

    Ribonucleic acid (RNA) is similar to DNA in that it is constructed from nucleotides. However, instead of thymine (T), RNA contains the alternative base uracil (U). RNA can be found as double-stranded or single-stranded, and can also be part of a hybrid helix where one strand is an RNA strand and the other is a DNA strand. RNA is generally found as a single-stranded molecule that may form secondary or tertiary structures due to complementary bases in different parts of the same strand. RNA folding will be discussed in detail during a later class period.

    RNA is important in the cell and contributes in a variety of ways. One of the most important roles of RNA is in protein synthesis. Two of the major RNA molecules involved in protein synthesis are messenger RNA (mRNA) and transfer RNA (tRNA).

    Secondary structure for E. coli RNase P RNA. Image source: www.mbio.ncsu.edu/JWB/MB409/lecture/lecture05/lecture05.htm

    mRNA

    mRNA carries the genetic information copied from the DNA molecule. Transcription is the process in which DNA is copied into an RNA molecule. The resulting linear molecule is an mRNA transcript. In eukaryotic cells, the mRNA needs to be modified before it can be translated into a protein. Most eukaryotic genes are organized in pieces, where coding regions, called exons, are interspersed with noncoding regions, called introns. One of the steps in processing the mRNA is to remove the intronic regions and to splice together the coding, or exonic, regions. The processed mRNA can then be transported out of the nucleus and translated into a protein sequence.
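As a rough illustration of transcription followed by splicing, the sketch below treats its input as the coding strand (so the transcript is the same sequence with U in place of T) and takes exon positions as given. The toy gene, the exon coordinates, and the function names are all hypothetical; real splice sites are not found this simply.

```python
def transcribe(coding_strand: str) -> str:
    """Write the RNA copy of a coding strand: same sequence, U instead of T."""
    return coding_strand.upper().replace("T", "U")


def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon slices (start, end), 0-based and end-exclusive, in order."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Hypothetical toy gene: exon, intron, exon.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTCAG", "GAATAA"
gene = exon1 + intron + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(gene))]

pre_mrna = transcribe(gene)
mrna = splice(pre_mrna, exons)
print(pre_mrna)  # AUGGCCGUAAGUUUCAGGAAUAA
print(mrna)      # AUGGCCGAAUAA
```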

  • mRNA processing. Image source: http://departments.oxy.edu/biology/Stillman/bi221/111300/processing_of_hnrnas.htm

    tRNA

    tRNA molecules fold into a well-defined three-dimensional structure which is critical in the creation of proteins. Attached to each tRNA molecule is an amino acid (amino acids will be discussed momentarily). The amino acid to be attached is determined by a three-base sequence called the anticodon, which is complementary to a codon in the mRNA. Translation is the process in which the nucleotide sequence of the processed mRNA is used to order and join the amino acids into a protein with the help of ribosomes and tRNA.

  • tRNA secondary structure. Image source: http://www.tulane.edu/~biochem/nolan/lectures/rna/frames/trnabtx2.htm

    tRNA tertiary structure. Image source: www.biology.ucsc.edu/people/areslab/BMB100A/11-26.html

    Genetic Code

    Since there are 4 possible bases (A, C, G, U) and 3 bases in a codon, there are 4 * 4 * 4 = 64 possible codons. The codon AUG also serves as the signal to initiate translation, while the codons UAA, UAG, and UGA are termination (stop) codons signaling the end of translation. That leaves 61 codons that code for amino acids (AUG codes for an amino acid as well as signaling initiation). However, there are only 20 amino acids. Therefore the genetic code is redundant, meaning that a single amino acid can be coded for by several different codons.
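The counting argument can be checked directly. The sketch below builds the standard genetic code from a compact 64-letter string, with codons enumerated in U, C, A, G order and `*` marking a stop codon, then counts codons and translates a short made-up mRNA. The names `CODON_TABLE` and `translate` are illustrative choices, and the translation simply starts at the first base of the input rather than searching for an AUG.

```python
from itertools import product

BASES = "UCAG"
# One letter per codon, in the order UUU, UUC, UUA, UUG, UCU, ...; '*' = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
amino_acids = set(AMINO) - {"*"}


def translate(mrna: str) -> str:
    """Translate codon by codon from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(len(CODON_TABLE), len(sense_codons), len(amino_acids))  # 64 61 20
print(stop_codons)                                            # ['UAA', 'UAG', 'UGA']
print(translate("AUGGCCGAAUAA"))                              # MAE
```

Leucine, serine and arginine each appear six times in the string, which is the redundancy described above.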

  • Second Position of Codon

    First                                                            Third
    Position  U            C            A            G               Position

    U         UUU Phe [F]  UCU Ser [S]  UAU Tyr [Y]  UGU Cys [C]     U
              UUC Phe [F]  UCC Ser [S]  UAC Tyr [Y]  UGC Cys [C]     C
              UUA Leu [L]  UCA Ser [S]  UAA STOP     UGA STOP        A
              UUG Leu [L]  UCG Ser [S]  UAG STOP     UGG Trp [W]     G

    C         CUU Leu [L]  CCU Pro [P]  CAU His [H]  CGU Arg [R]     U
              CUC Leu [L]  CCC Pro [P]  CAC His [H]  CGC Arg [R]     C
              CUA Leu [L]  CCA Pro [P]  CAA Gln [Q]  CGA Arg [R]     A
              CUG Leu [L]  CCG Pro [P]  CAG Gln [Q]  CGG Arg [R]     G

    A         AUU Ile [I]  ACU Thr [T]  AAU Asn [N]  AGU Ser [S]     U
              AUC Ile [I]  ACC Thr [T]  AAC Asn [N]  AGC Ser [S]     C
              AUA Ile [I]  ACA Thr [T]  AAA Lys [K]  AGA Arg [R]     A
              AUG Met [M]  ACG Thr [T]  AAG Lys [K]  AGG Arg [R]     G

    G         GUU Val [V]  GCU Ala [A]  GAU Asp [D]  GGU Gly [G]     U
              GUC Val [V]  GCC Ala [A]  GAC Asp [D]  GGC Gly [G]     C
              GUA Val [V]  GCA Ala [A]  GAA Glu [E]  GGA Gly [G]     A
              GUG Val [V]  GCG Ala [A]  GAG Glu [E]  GGG Gly [G]     G

    Genetic Code. Note that the initiator codon is labeled in green, and the terminal codons are labeled in red. The first column gives the triplet base; the second the three-letter amino acid label, and the third the one-letter amino acid label. Adapted from: http://psyche.uthct.edu/shaun/SBlack/geneticd.html

    Amino Acids

    Amino acids are the building blocks from which proteins are made. There are 20 different amino acids that vary from each other by their side chain groups. Amino acids can be classified into different groups based on their solubility in water. Hydrophilic amino acids are water soluble, while hydrophobic amino acids are not. This property becomes important when a protein sequence folds. Amino acids are linked to one another via a single chemical bond, called a peptide bond. A linear chain of amino acids can be referred to as a peptide (if it is short, less than about 30 amino acids long) or a polypeptide (which can be upwards of 4000 residues long).

  • One-letter  Three-letter  Full name

    G           GLY           Glycine
    A           ALA           Alanine
    V           VAL           Valine
    L           LEU           Leucine
    I           ILE           Isoleucine
    F           PHE           Phenylalanine
    P           PRO           Proline
    S           SER           Serine
    T           THR           Threonine
    C           CYS           Cysteine
    M           MET           Methionine
    W           TRP           Tryptophan
    Y           TYR           Tyrosine
    N           ASN           Asparagine
    Q           GLN           Glutamine
    D           ASP           Aspartic acid
    E           GLU           Glutamic acid
    K           LYS           Lysine
    R           ARG           Arginine
    H           HIS           Histidine

    Amino Acid Codes.

    Proteins

    Proteins are polypeptides that have a three-dimensional structure. They can be described through four different hierarchical levels:

    Primary structure: the sequence of amino acids constituting the polypeptide chain.

    Secondary structure: the local organization of parts of the polypeptide chain into secondary structures such as helices and sheets.

    Tertiary structure: the three-dimensional arrangement of the amino acids as they interact with one another due to the polarity of, and resulting interactions between, their side chains.

    Quaternary structure: if a protein consists of several protein subunits held together, then the protein can also be described by the number and relative positions of the subunits.

    Visualization of Protein Structures.

    Magenta: alpha helix Gold: Beta Sheets

    Blue: Monomer A Orange: Monomer B

    Image source: http://www.ebi.ac.uk/microarray/biology_intro.html

    Calculating the secondary and tertiary structure of a protein given its primary structure is not an easy task. Protein folding prediction will be covered at some point close to the end of the semester.

    Monomer: Any small molecule that can be linked with others of the same type to form a polymer. For the purpose of this class, the molecules could be nucleotides, amino acids, or protein subunits.

    Dimer: Two small molecules of the same type linked together.

    Trimer: Three small molecules of the same type linked together.

    Oligomer: General term for a short polymer, most commonly consisting of nucleotides or amino acids.

    Polymer: Any large molecule consisting of multiple identical or similar subunits linked by covalent bonds.

  • Putting it all together, we get the flow of genetic information. That is, DNA directs the synthesis of RNA, and RNA then in turn directs the synthesis of protein. This flow of genetic information from nucleic acids to protein has been called the Central Dogma of Molecular Biology.

    Central Dogma of Molecular Biology. Image Source: http://www.people.virginia.edu/~rjh9u/dnaprot.html

    DNA -> RNA -> PROTEIN

    What is a Gene?

    Aaah, the million dollar question. In short, a gene can be described as the physical and functional unit of heredity that carries information from one generation to the next. A gene can be thought of as the DNA sequence necessary for the synthesis of a functional protein or RNA molecule.

    Genome, Transcriptome, Proteome

    Whenever the term genome is used, it typically refers to the chromosomal DNA of an organism or, as far as sequencing is concerned, the euchromatic regions of the chromosomal DNA. The number of chromosomes and the genome size vary quite significantly from one organism to another. An example list of genome sizes is given below. Don't be fooled by this table into thinking that the size of the genome or the number of genes determines the complexity of an organism. In fact, many plant genomes are much greater in size than the human genome!

  • ORGANISM                             CHROMOSOMES  GENOME SIZE (bp)  GENES

    Homo sapiens (Humans)                23           3,200,000,000     ~30,000
    Mus musculus (Mouse)                 20           2,600,000,000     ~30,000
    Drosophila melanogaster (Fruit Fly)  4            180,000,000       ~18,000
    Saccharomyces cerevisiae (Yeast)     16           14,000,000        ~6,000
    Zea mays (Corn)                      10           2,400,000,000     ???

    The term transcriptome refers to the complete collection of all possible mRNAs (including splice variants) of an organism. This can be thought of as the regions of an organism's genome that get transcribed into messenger RNA. In some cases, the transcriptome can be extended to include all transcribed elements, including non-coding RNAs used for structural and regulatory purposes.

    The term proteome refers to the complete collection of proteins that can be produced by an organism. The proteome can be studied either as a static entity (the sum of all proteins possible) or as a dynamic entity (all proteins found at a specific time point).

  • Molecular Biology Reference Books

    Lewin, B (1999), Genes VII (published by Oxford University Press) ISBN 019879276X
    Lodish et al (1995), Molecular Cell Biology, 3rd edition (published by Scientific American Books, Freeman and Co., New York) ISBN 0 7167 2380 8
    Gonick, L & Wheelis, M (1991), The Cartoon Guide to Genetics (published by Harper Perennial, New York) ISBN 0 06 273099 1

    Online tutorials

    The Learning Center: http://www.genecrc.org/site/lc/lc1a.htm
    On-Line Biology Book: http://www.emc.maricopa.edu/faculty/farabee/BIOBK/BioBookTOC.html
    EMBL-EBI Introduction to Biology: http://www.ebi.ac.uk/microarray/biology_intro.html
    One site you will be intimately familiar with by the end of the semester: http://www.ncbi.nlm.nih.gov

    Reading assignment

    http://www.ebi.ac.uk/microarray/biology_intro.html
    Chapters 1 & 2 (Durbin, et al.)
    Chapters 1 & 3 (Mount)

    Introduction to Bioinformatics

    Lecture 2: Pairwise Sequence Alignment

    In molecular biology, a common question is whether or not two sequences are related. The most common way to tell is to compare them to one another to see if they are similar. If we look at the English language, we note that two words that are spelled similarly may mean two completely different things, such as the words pear and tear.

  • Biological sequences that are similar (but not identical) provide useful information to help discover functional, structural, and evolutionary relationships. One common mistake is to describe two sequences as having some sort of homology, or a percent homology, based on their sequence similarity. This is a misuse of the biological term. Two sequences are homologous if they have been derived from a common ancestor sequence. Two sequences may or may not be homologous regardless of their sequence similarity. However, the greater the sequence similarity, the greater the chance that they share similar function and/or structure.

    SEQUENCE SIMILARITY ≠ HOMOLOGY!

    Biological Definitions for Related Sequences

    Homologs are similar sequences that have been derived from a common ancestor sequence. Homologs can be described as either orthologous or paralogous.

    Orthologs are similar sequences in two different organisms that have arisen due to a speciation event. Orthologs typically retain their functionality throughout evolution.

    Paralogs are similar sequences within a single organism that have arisen due to a gene duplication event.

    Xenologs are similar sequences that do not share the same evolutionary origin, but rather have arisen out of horizontal transfer events through symbiosis, viruses, etc.

  • Image Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html

    Hamming or edit distance

    One method of determining sequence similarity is to determine the edit distance between two sequences. If we take the example of pear and tear, how similar are these two words? We notice that if we change the p to a t, and keep the ear, then we can change pear to tear. Thus, there is a mismatch in the first letter, and matches in the last three. An alignment of these two is as follows:

    P E A R
      | | |
    T E A R

    One way to score this alignment is to calculate the Hamming distance, which is the number of positions at which two words of equal length differ. It is calculated by summing up the number of mismatches when the two words are aligned to one another. In this example, the Hamming distance is 1.
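A minimal sketch of the Hamming distance as described here, assuming the two strings are already aligned position by position and have the same length (the function name is an illustrative choice):

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("PEAR", "TEAR"))  # 1
```

Because the definition needs equal lengths, it cannot handle insertions or deletions, which is why gaps are introduced next.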

  • With biological sequences, it is often necessary to align two sequences that are of different lengths, or that have regions that have been inserted or deleted over time. Thus, the notion of gaps needs to be introduced. Consider the words alignment and ligament. One alignment of these two words is as follows:

    A L I G N M E N T
      | | |   | | | |
    - L I G A M E N T

    In this case, a gap is denoted in the alignment by a - character. Now an alignment can produce one of the following: a match between two characters, a mismatch between two characters (also called a substitution or mutation), a gap in the first sequence (which can be thought of as the deletion of a character in the first sequence), or a gap in the second sequence (which can be thought of as the insertion of a character in the first sequence).

    Consider the following two nucleic acid sequences: ACGGACT and ATCGGATCT. The following are two valid alignments (the gap characters within the ACGGACT rows did not survive in this copy, so those rows are shown ungapped):

    A C G G A C T
    A T C G G A T - C T

    A T C G G A T C T
    A C G G A C T

    Alignment scoring schemes

    Which alignment is the better alignment? One way to judge this is to assign a positive score for each match, a negative score for each mismatch, and a negative score for each insertion/deletion (collectively referred to as indels). One scoring scheme might assign the following values:

    match: +2
    mismatch: -1
    indel: -2

    Using this scoring scheme, the first alignment has 5 matches, 1 mismatch, and 4 indels. The score for this alignment is 5(2) - 1(1) - 4(2) = 10 - 1 - 8 = 1. The second alignment has 6 matches, 1 mismatch, and 2 indels. The score for the second alignment is 6(2) - 1(1) - 2(2) = 12 - 1 - 4 = 7.
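The scoring arithmetic can be reproduced with a short sketch. The gapped rows used below are hypothetical layouts, chosen here only because they reproduce the match, mismatch, and indel counts stated in the text; they are not necessarily the lecture's exact alignments.

```python
MATCH, MISMATCH, INDEL = 2, -1, -2  # scoring scheme from the text


def tally(row1: str, row2: str, gap: str = "-") -> tuple[int, int, int]:
    """Return (matches, mismatches, indels) for two gapped rows of equal length."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = indels = 0
    for x, y in zip(row1, row2):
        if x == gap or y == gap:
            indels += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, indels


def alignment_score(row1: str, row2: str) -> int:
    matches, mismatches, indels = tally(row1, row2)
    return matches * MATCH + mismatches * MISMATCH + indels * INDEL


# Hypothetical layouts of ACGGACT against ATCGGATCT.
first = ("A-CGG-AC-T", "ATCGGAT-CT")
second = ("A-CGG-ACT", "ATCGGATCT")

print(tally(*first), alignment_score(*first))    # (5, 1, 4) 1
print(tally(*second), alignment_score(*second))  # (6, 1, 2) 7
```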

  • Therefore, using the above scoring scheme, the second alignment is the better alignment, since it produces a higher alignment score.

    Visual Alignments: Dot Plots

    One of the more basic, yet important, techniques for determining the alignment between two sequences is a visual alignment known as a dot plot. Dot plots of sequence similarity are created using a matrix where the rows correspond to the characters in the first sequence and the columns correspond to the characters in the second sequence. The dot plot is created as follows: loop through each row. For the current row, take the character in that row and compare it to the character in each column. If they are equal, place a dot in the matrix. Continue until all cells in the matrix have been considered.

    [Figure: dot plot with the sequence ACCTGAGCTCACCTGAGTTA labeling both the rows and the columns.]

  • Results for aligning ACCTGAGCTCACCTGAGTTA to itself using the Dot Matrix option of the AlignX feature of Informax's Vector NTI program.

    When a dot plot is created to compare nucleic acids, there will be a lot of noise, since one out of every four positions will match at random if A, C, G, and T occur in equal numbers in the sequence. Therefore, dot plots can be filtered for stringency by requiring that a certain percentage of nucleotides match in a given window size. With the example above, if we filter the sequences to only show matches of two or more consecutive nucleotides, the dot plot looks as follows:
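A text-only sketch of the dot plot procedure and of the simplest stringency filter (keep a dot only if it lies on a diagonal run of at least `min_run` consecutive matches). This stands in for the missing figures and is not the Vector NTI algorithm; `dot_plot`, `render`, and `min_run` are names chosen here for illustration.

```python
def dot_plot(seq1: str, seq2: str, min_run: int = 1) -> list[list[bool]]:
    """Mark cell (i, j) when seq1[i] == seq2[j] and the match belongs to a
    diagonal run of at least min_run consecutive matching characters."""
    rows, cols = len(seq1), len(seq2)
    dots = [[False] * cols for _ in range(rows)]
    for i in range(rows - min_run + 1):
        for j in range(cols - min_run + 1):
            if all(seq1[i + k] == seq2[j + k] for k in range(min_run)):
                for k in range(min_run):
                    dots[i + k][j + k] = True
    return dots


def render(dots: list[list[bool]]) -> str:
    """Draw the matrix with '*' for a dot and '.' for an empty cell."""
    return "\n".join("".join("*" if d else "." for d in row) for row in dots)


seq = "ACCTGAGCTCACCTGAGTTA"
raw = dot_plot(seq, seq)                  # every single-base match
filtered = dot_plot(seq, seq, min_run=2)  # runs of two or more only

print(render(raw))
print()
print(render(filtered))
```

The main diagonal survives the filter, as does the off-diagonal run produced by the repeated ACCTGAG, while isolated single-base matches disappear.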

  • Information within Dot Plots

    Dot plots are useful as a first-level filter for determining an alignment between two sequences. Regions of similarity will show up as diagonals within the dot plot matrix. Regions containing insertions/deletions can be readily identified. One potential application is to determine the number of coding regions (exons) contained within a processed mRNA by plotting it against the genomic sequence it was transcribed from.

    Example of a Dot Plot showing insertion/deletion events.

    Regions of genomic DNA can contain repetitive regions. For instance, approximately 50 percent of the human genome is composed of repetitive elements, which can be on the order of a few hundred bases (SINEs, such as Alu elements) or a few thousand bases (LINEs). In addition, regions of low complexity are present as well. Repetitive elements and methods to filter them out will be discussed during a later class period.

    In addition to repetitive elements, regions of a genome can be duplicated. The duplicated region can be found either as a direct repeat, meaning that it occurs in the same direction, or as an inverted repeat, meaning that the sequence of the duplicated region is found in the reverse complement direction. Dot plots can readily show regions of direct and inverted repeats.

    Dot plots show all possible matches of residues between two sequences given a certain threshold level. Thus, the researcher can decide which alignments are the most significant.

  • Example dot plots showing the presence of direct and inverted repeats.

    Dot plots can also be used to compare two different assemblies of the same sequence. Below are three dot plots of various chromosomes. The first shows two separate assemblies of human chromosome 5 compared against each other. The second shows one assembly of chromosome 5 compared against itself, indicating the presence of repetitive regions. The final dot plot shows chromosome Y compared against itself, indicating the presence of inverted repeats.

    Comparison of two assemblies of chromosome 5. The figure to the left indicates the alignment of two separate assemblies, while the figure to the right indicates the alignment of a single assembly against itself.

  • Self plot of chromosome Y. Indicated are several regions of both direct and inverted repeats.

    Available Dot Plot Packages

    Vector NTI software package (under AlignX)
    Dotlet Java applet: http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
    Dotter: http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html
    GCG software package:
      Compare: http://www.hku.hk/bruhk/gcgdoc/compare.html
      DotPlot+: http://www.hku.hk/bruhk/gcgdoc/dotplot.html
    EMBOSS software package:
      Dotmatcher: http://www.hku.hk/bruhk/emboss/dotmatcher.html
      Dotpath
      Dotup
    DNA Strider
    PipMaker: http://bio.cse.psu.edu/pipmaker/ (returns a PDF of the alignment)
    Overview of dot plot techniques: http://imagebeat.com/dotplot/overview.html

  • Dot Plot Articles

    Gibbs, A. J. & McIntyre, G. A. (1970). The diagram method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 16, 1-11.

    Staden, R. (1982). An interactive graphics program for comparing and aligning nucleic-acid and amino-acid sequences. Nucl. Acid. Res. 10 (9), 2951-2961.

    The shortcoming of visual methods is that they do not yield a direct measure of the similarity between two sequences. In order to get such a measure, dynamic programming can be employed.

    Finding an optimal alignment of two sequences

    Suppose there are two sequences X and Z to be aligned, where |X| = m and |Z| = n. If gaps are allowed in the sequences, then the potential length of both the first and second sequences is m+n. Several methods will be discussed to align these sequences.

    Brute Force Method

    If we are interested in determining the optimal alignment (either global or local), then we note that there are $2^{m+n}$ subsequences with spaces for the sequence X, and $2^{m+n}$ subsequences with spaces for the sequence Z, using the power set rules. Thus, a brute force method of comparing these two sequences for the optimal alignment would require $2^{m+n} \cdot 2^{m+n} = 2^{2(m+n)} = 4^{m+n}$ comparisons. It doesn't take long for this to become an impossible search!

    Dynamic Programming

    Luckily, sequence alignment has an optimal-substructure property, and therefore there is a much easier way to consider all of the possible alignments using what is called dynamic programming (DP). Dynamic programming techniques are used in many different areas of computer science. DP algorithms solve optimization problems by dividing the problem into subproblems whose solutions are reused many times. Each subproblem is solved only once, and the answer is stored in a table, thus avoiding the work of recomputing the solution.
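As a preview of the table-filling idea, here is a minimal global alignment sketch. It assumes the +2/-1/-2 scoring scheme from the earlier example and a fixed penalty per gap character, and it treats the alignment of two prefixes as the subproblem; the matrix initialization and scoring used later in the lecture may differ. The function name and parameters are illustrative.

```python
def global_alignment_matrix(x: str, z: str, match: int = 2,
                            mismatch: int = -1, indel: int = -2) -> list[list[int]]:
    """Fill a (len(z)+1) x (len(x)+1) table where S[i][j] is the best score
    for aligning the first i characters of z with the first j characters of x."""
    rows, cols = len(z) + 1, len(x) + 1
    S = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):          # row 0: prefix of x against gaps only
        S[0][j] = j * indel
    for i in range(1, rows):          # column 0: prefix of z against gaps only
        S[i][0] = i * indel
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = S[i - 1][j - 1] + (match if z[i - 1] == x[j - 1] else mismatch)
            S[i][j] = max(diagonal,              # align z[i-1] with x[j-1]
                          S[i - 1][j] + indel,   # gap in x
                          S[i][j - 1] + indel)   # gap in z
    return S


S = global_alignment_matrix("ACGGACT", "ATCGGATCT")
print(len(S), "x", len(S[0]))  # 10 x 8
print(S[-1][-1])               # 10
```

Each cell is computed once from three neighbors, so the work grows with the product of the two lengths rather than exponentially. Under these assumed scores the best global alignment of ACGGACT with ATCGGATCT scores 10 (seven matches and two indels), higher than either alignment scored by hand earlier.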


    With sequence alignment, the subproblems can be thought of as the alignment of the prefixes of the two sequences up to a certain point. Therefore, a dynamic programming matrix is computed. The optimal alignment score for any particular point in the matrix is built upon the optimal alignments that have been computed up to that point.

    Dynamic programming techniques align two sequences by beginning at the ends of the two sequences and attempting to align all possible pairs of characters (one from each sequence) using a scoring scheme for matches, mismatches, and gaps. The highest set of scores defines the optimal alignment between the two sequences. We will first consider dynamic programming in terms of DNA, where only exact matches are considered for a match score. Later we will discuss how substitution matrices can be used to score amino acid matches and mismatches.

    Dynamic programming approaches are guaranteed to provide the optimal alignment given a particular scoring scheme. For large sequences, dynamic programming can be slow and memory intensive. Discuss the time and space necessary for microarray analysis.

    Setting up the Dynamic Programming Matrix

    Now we are ready to start creating the dynamic programming matrix. The first step is to place one of the sequences across the columns of the matrix and the other sequence down the rows. Note that an alignment can also begin with a gap in one of the sequences, so that has to be taken care of as well. Let's assume that we want to align the sequence GAATTCAGTTA to GGATCGA. The length of the first sequence is 11 residues, and the length of the second is 7. Since it is possible to begin an alignment with a gap, the size of the matrix should be 8 x 12. Row 0 and column 0 will represent gaps. Rows 1-7 will be labeled with the corresponding residue of the sequence GGATCGA, while columns 1-11 will be labeled with the corresponding residue of the sequence GAATTCAGTTA. The initial matrix, S, is as follows:

             -    G    A    A    T    T    C    A    G    T    T    A
        -
        G
        G
        A
        T
        C
        G
        A


    Now we need to decide upon the scoring scheme to be used. This requires parameters for a match score, a mismatch score, and a gap score. The match and mismatch scores will be combined into a single match/mismatch score, $s(a_i, b_j)$. We'll see later how this can be used with a substitution matrix. There will also be a single linear gap penalty score, $w$. For our first example, we have the following parameters:

    Sequence #1: GAATTCAGTTA; M = 11
    Sequence #2: GGATCGA; N = 7

    $s(a_i, b_j) = +5$ if $a_i = b_j$ (match score)
    $s(a_i, b_j) = -3$ if $a_i \neq b_j$ (mismatch score)
    $w = -4$ (gap penalty)
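    As a minimal sketch, the scoring scheme above can be written as a small function plus a constant. The names `MATCH`, `MISMATCH`, `GAP`, and `s` are chosen here for illustration only.

```python
MATCH, MISMATCH, GAP = 5, -3, -4  # parameters of the first example


def s(a: str, b: str) -> int:
    """Combined match/mismatch score for one pair of residues."""
    return MATCH if a == b else MISMATCH


print(s("G", "G"), s("G", "A"), GAP)  # 5 -3 -4
```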

    Three steps in dynamic programming

    Once you have the scoring functions set and the sequences to align, there are three steps involved in calculating the optimal scoring alignment. The methods used to finish these three steps depend upon whether global or local sequence alignment is desired. The three steps are as follows:

    1. Initialization
    2. Matrix Fill (scoring)
    3. Traceback (alignment)

    Global Alignment: Needleman-Wunsch Algorithm

    In global sequence alignment, an attempt is made to align the entirety of two different sequences, up to and including the ends of the sequences. Needleman and Wunsch (1970) were among the first to describe a dynamic programming algorithm for global sequence alignment.


    Initialization Step. In the initialization step of global alignment, each cell $S_{i,0}$ in column 0 is set to $w \cdot i$. In addition, each cell $S_{0,j}$ in row 0 is set to $w \cdot j$. Remember that $w$ is the gap penalty. Using the scoring scheme described above, the initialization step results in the following:

             -    G    A    A    T    T    C    A    G    T    T    A
        -    0   -4   -8  -12  -16  -20  -24  -28  -32  -36  -40  -44
        G   -4
        G   -8
        A  -12
        T  -16
        C  -20
        G  -24
        A  -28
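    The initialization step can be sketched as follows. The function name `init_global` and the list-of-lists layout are assumptions made for illustration; sequence #1 runs across the columns and sequence #2 down the rows, as in the matrix above.

```python
def init_global(seq1, seq2, w=-4):
    """Build the (len(seq2)+1) x (len(seq1)+1) matrix with gap-only borders."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        S[i][0] = w * i  # column 0: i gaps in a row
    for j in range(cols):
        S[0][j] = w * j  # row 0: j gaps in a row
    return S


S = init_global("GAATTCAGTTA", "GGATCGA")
print(S[0])                   # first row: 0, -4, ..., -44
print([row[0] for row in S])  # first column: 0, -4, ..., -28
```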

    Matrix Fill Step. One possible solution of the matrix fill step finds the maximum global alignment score by starting in the upper left hand corner of the matrix and finding the maximal score $S_{i,j}$ for each position in the matrix. In order to find $S_{i,j}$ for any $i,j$ it is sufficient to know the scores for the matrix positions to the left of, above, and diagonal to $i,j$. In terms of matrix positions, it is necessary to know $S_{i-1,j}$, $S_{i,j-1}$ and $S_{i-1,j-1}$.

    For each position, $S_{i,j}$ is defined to be the maximum score at position $i,j$; i.e.

    $$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + s(a_i, b_j) & \text{(match/mismatch in the diagonal)} \\ S_{i,j-1} + w & \text{(residue } b_j \text{ of sequence \#1 aligned to a gap in sequence \#2)} \\ S_{i-1,j} + w & \text{(residue } a_i \text{ of sequence \#2 aligned to a gap in sequence \#1)} \end{cases}$$

    Here $a_i$ is the residue labeling row $i$ and $b_j$ is the residue labeling column $j$. Note that in the original figures, $S_{i-1,j-1}$ is shown in red, $S_{i,j-1}$ in green and $S_{i-1,j}$ in blue.
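    The recurrence above can be sketched as a short fill routine. This is an illustrative sketch, not a reference implementation: the function name `nw_fill` is made up, sequence #1 is taken as the column sequence and sequence #2 as the row sequence, and the default scores are those of the running example.

```python
def nw_fill(seq1, seq2, match=5, mismatch=-3, w=-4):
    """Fill the global alignment matrix S; rows follow seq2, columns follow seq1."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        S[i][0] = w * i
    for j in range(cols):
        S[0][j] = w * j
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s,  # diagonal: match/mismatch
                          S[i][j - 1] + w,      # left: gap in the row sequence
                          S[i - 1][j] + w)      # up: gap in the column sequence
    return S


S = nw_fill("GAATTCAGTTA", "GGATCGA")
print(S[1][1], S[1][2], S[2][1])  # 5 1 1
print(S[-1][-1])                  # 11
```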

    Using this information, the score at position 1,1 in the matrix can be calculated. Since the first residue in both sequences is a G, $s(a_1, b_1) = 5$, and by the assumptions stated earlier, $w = -4$. Thus, $S_{1,1} = \max[S_{0,0} + 5,\ S_{1,0} - 4,\ S_{0,1} - 4] = \max[5, -8, -8] = 5$.


    A value of 5 is then placed in position 1,1 of the scoring matrix. Note that there is also an arrow placed back into the cell that resulted in the maximum score, $S_{0,0}$.

             -    G    A    A    T    T    C    A    G    T    T    A
        -    0   -4   -8  -12  -16  -20  -24  -28  -32  -36  -40  -44
        G   -4    5
        G   -8
        A  -12
        T  -16
        C  -20
        G  -24
        A  -28

    Now we proceed to $S_{1,2}$. Since $a_1$ = G and $b_2$ = A, there is a mismatch. Therefore, $s(a_1, b_2) = -3$ and by the assumptions stated earlier, $w = -4$. Thus, $S_{1,2} = \max[S_{0,1} - 3,\ S_{1,1} - 4,\ S_{0,2} - 4] = \max[-4 - 3,\ 5 - 4,\ -8 - 4] = \max[-7, 1, -12] = 1$. An arrow is placed back into the cell that resulted in the maximum score, which is the cell $S_{1,1}$.

             -    G    A    A    T    T    C    A    G    T    T    A
        -    0   -4   -8  -12  -16  -20  -24  -28  -32  -36  -40  -44
        G   -4    5    1
        G   -8
        A  -12
        T  -16
        C  -20
        G  -24
        A  -28


    We can proceed to fill in the rest of the first row in a similar fashion, resulting in the following matrix:

             -    G    A    A    T    T    C    A    G    T    T    A
        -    0   -4   -8  -12  -16  -20  -24  -28  -32  -36  -40  -44
        G   -4    5    1   -3   -7  -11  -15  -19  -23  -27  -31  -35
        G   -8
        A  -12
        T  -16
        C  -20
        G  -24
        A  -28

    Now we can start to fill in the second row, beginning with $S_{2,1}$. Note that $a_2$ = G and $b_1$ = G, so $s(a_2, b_1) = 5$ and by the assumptions stated earlier, $w = -4$. Thus, $S_{2,1} = \max[S_{1,0} + 5,\ S_{2,0} - 4,\ S_{1,1} - 4] = \max[-4 + 5,\ -8 - 4,\ 5 - 4] = \max[1, -12, 1] = 1$. Note that in this case, there are two possible paths to the maximum value. Therefore, an arrow is placed back into each cell resulting in the maximum score, which are cells $S_{1,0}$ and $S_{1,1}$.

             -    G    A    A    T    T    C    A    G    T    T    A
        -    0   -4   -8  -12  -16  -20  -24  -28  -32  -36  -40  -44
        G   -4    5    1   -3   -7  -11  -15  -19  -23  -27  -31  -35
        G   -8    1
        A  -12
        T  -16
        C  -20
        G  -24
        A  -28


    We can then proceed to fill in the rest of the matrix in a similar fashion. The resulting matrix is as follows:

             -    G    A    A    T    T    C    A    G    T    T    A
        -    0   -4   -8  -12  -16  -20  -24  -28  -32  -36  -40  -44
        G   -4    5    1   -3   -7  -11  -15  -19  -23  -27  -31  -35
        G   -8    1    2   -2   -6  -10  -14  -18  -14  -18  -22  -26
        A  -12   -3    6    7    3   -1   -5   -9  -13  -17  -21  -17
        T  -16   -7    2    3   12    8    4    0   -4   -8  -12  -16
        C  -20  -11   -2   -1    8    9   13    9    5    1   -3   -7
        G  -24  -15   -6   -5    4    5    9   10   14   10    6    2
        A  -28  -19  -10   -1    0    1    5   14   10   11    7   11

    Each cell has one to three arrows indicating from which cell the maximum score was obtained. The matrix fill step is now complete.

    Traceback Step. After the matrix fill step, the maximum global alignment score for the two sequences is 11 (the value in the lower right hand cell). The traceback step will obtain the actual alignment(s) that result in the maximum score. The traceback begins in position $S_{M,N}$; i.e. the position where both sequences are globally aligned. Since pointers have been kept back to all possible predecessors, the traceback is simple. At each cell, we look to see where we move next according to the pointers. To begin, the only possible predecessor is the diagonal match, so the first step moves from $S_{7,11} = 11$ back to $S_{6,10} = 6$.


    This gives us an alignment of

        A
        |
        A

    Note that in the original figure, blue letters and gold arrows indicate the path leading to the maximum score. We can continue to follow the path through the cells $S_{6,9}$, $S_{6,8}$, $S_{5,7}$, $S_{5,6}$, $S_{4,5}$, $S_{3,4}$, $S_{3,3}$, $S_{2,2}$ and $S_{1,1}$ until we reach $S_{0,0}$.

    The resulting global alignment is as follows:

        G A A T T C A G T T A
        |   |   | |   |     |
        G G A - T C - G - - A
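    The full procedure (initialization, matrix fill, traceback) can be sketched in one function. This is a minimal sketch under stated assumptions: the name `needleman_wunsch` is illustrative, pointers are not stored but recomputed during traceback, and ties are broken in the fixed order diagonal, left, up, which is one arbitrary but consistent rule.

```python
def needleman_wunsch(seq1, seq2, match=5, mismatch=-3, w=-4):
    """Return (score, aligned seq1, aligned seq2) for one optimal global alignment."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        S[i][0] = w * i
    for j in range(cols):
        S[0][j] = w * j
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s, S[i][j - 1] + w, S[i - 1][j] + w)

    # Traceback from the lower right-hand cell back to S[0][0].
    top, bottom = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        s = None
        if i > 0 and j > 0:
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
        if s is not None and S[i][j] == S[i - 1][j - 1] + s:
            top.append(seq1[j - 1]); bottom.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and S[i][j] == S[i][j - 1] + w:
            top.append(seq1[j - 1]); bottom.append("-")  # gap in seq2
            j -= 1
        else:
            top.append("-"); bottom.append(seq2[i - 1])  # gap in seq1
            i -= 1
    return S[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


score, a1, a2 = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
print(score)  # 11
print(a1)     # GAATTCAGTTA
print(a2)     # GGA-TC-G--A
```

    A different tie-breaking order may return a different alignment with the same score.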

    Remembering that the scoring scheme used was +5 for a match, -3 for a mismatch, and -4 for a gap, we can double check the score of the alignment:

        G A A T T C A G T T A
        |   |   | |   |     |
        G G A - T C - G - - A
        + - + - + + - + - - +
        5 3 5 4 5 5 4 5 4 4 5

        5 - 3 + 5 - 4 + 5 + 5 - 4 + 5 - 4 - 4 + 5 = 11

    so this alignment results in a global alignment score of 11. Note that the traceback above followed a single path. It is possible for multiple alignments to yield the same score, as evidenced by having multiple ways to obtain the maximal score in a given cell of the scoring matrix. Here, for example, $S_{4,5} = 8$ can be reached either diagonally or from the left, so the alignment G G A T - C - G - - A also scores 11. In such a case, the traceback can be accomplished in any manner desired, as long as the same set of rules is used consistently for the sake of reproducibility.

    Local Alignment: Smith-Waterman Algorithm

    In 1981, Temple Smith and Mike Waterman proposed a modification to the Needleman-Wunsch algorithm in order to obtain a local sequence alignment, resulting in the highest-scoring local match between two sequences. Why choose a local alignment algorithm?

    - It is more meaningful: it points out conserved regions between two sequences
    - It allows two sequences of different lengths to be matched
    - It aligns two partially overlapping sequences
    - It aligns two sequences where one is a subsequence of another

    There are only two slight modifications that need to be made to the Needleman-Wunsch algorithm in order to make it a local alignment algorithm. The first modification requires negative scores for mismatches. The second modification requires that when a value in the dynamic programming scoring matrix becomes negative, the value is set to zero, which has the effect of terminating any alignment up to that point. This has the effect of changing the matrix score to:

    $$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + s(a_i, b_j) & \text{(match/mismatch in the diagonal)} \\ S_{i,j-1} + w & \text{(residue } b_j \text{ of sequence \#1 aligned to a gap in sequence \#2)} \\ S_{i-1,j} + w & \text{(residue } a_i \text{ of sequence \#2 aligned to a gap in sequence \#1)} \\ 0 \end{cases}$$

    The local alignments are then produced by starting at the highest-scoring positions in the scoring matrix and following a trace path from those positions back to a cell that scores zero.
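    The two modifications can be sketched as follows. The function names `smith_waterman` and `local_traceback` are illustrative, the default scores are those of the running example, and the tie-breaking order in the traceback (diagonal, left, up) is an arbitrary choice, so only one of several equally scoring alignments is returned per starting cell.

```python
def smith_waterman(seq1, seq2, match=5, mismatch=-3, w=-4):
    """Fill the local matrix H; return H, the best score, and the cells holding it."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    H = [[0] * cols for _ in range(rows)]  # row 0 and column 0 stay at 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s, H[i][j - 1] + w, H[i - 1][j] + w, 0)
    best = max(max(row) for row in H)
    cells = [(i, j) for i in range(rows) for j in range(cols) if H[i][j] == best]
    return H, best, cells


def local_traceback(H, seq1, seq2, start, match=5, mismatch=-3, w=-4):
    """Follow one path from `start` back to the first cell that scores zero."""
    top, bottom = [], []
    i, j = start
    while H[i][j] > 0:
        s = match if seq2[i - 1] == seq1[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(seq1[j - 1]); bottom.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i][j - 1] + w:
            top.append(seq1[j - 1]); bottom.append("-")
            j -= 1
        else:
            top.append("-"); bottom.append(seq2[i - 1])
            i -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


x, y = "GAATTCAGTTA", "GGATCGA"
H, best, cells = smith_waterman(x, y)
print(best, cells)  # 14 [(6, 8), (7, 7)]
for cell in cells:
    print(local_traceback(H, x, y, cell))
```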


    Initialization Step. In the initialization step of local alignment, each cell $S_{i,0}$ in column 0 is set to 0. In addition, each cell $S_{0,j}$ in row 0 is set to 0. Using the scoring scheme described above, the initialization step results in the following:

             -    G    A    A    T    T    C    A    G    T    T    A
        -    0    0    0    0    0    0    0    0    0    0    0    0
        G    0
        G    0
        A    0
        T    0
        C    0
        G    0
        A    0

    Matrix Fill Step. One possible solution of the matrix fill step finds the maximum local alignment score by starting in the upper left hand corner of the matrix and finding the maximal score $S_{i,j}$ for each position in the matrix. In order to find $S_{i,j}$ for any $i,j$ it is sufficient to know the scores for the matrix positions to the left of, above, and diagonal to $i,j$. In terms of matrix positions, it is necessary to know $S_{i-1,j}$, $S_{i,j-1}$ and $S_{i-1,j-1}$.

    For each position, $S_{i,j}$ is defined to be the maximum score at position $i,j$; i.e.

    $$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + s(a_i, b_j) & \text{(match/mismatch in the diagonal)} \\ S_{i,j-1} + w & \text{(residue } b_j \text{ of sequence \#1 aligned to a gap in sequence \#2)} \\ S_{i-1,j} + w & \text{(residue } a_i \text{ of sequence \#2 aligned to a gap in sequence \#1)} \\ 0 \end{cases}$$

             -    G    A    A    T    T    C    A    G    T    T    A
        -    0    0    0    0    0    0    0    0    0    0    0    0
        G    0    5
        G    0
        A    0
        T    0
        C    0
        G    0
        A    0


    Note that in the original figures, $S_{i-1,j-1}$ is shown in red, $S_{i,j-1}$ in green and $S_{i-1,j}$ in blue. Using this information, the score at position 1,1 in the matrix can be calculated. Since the first residue in both sequences is a G, $s(a_1, b_1) = 5$, and by the assumptions stated earlier, $w = -4$. Thus, $S_{1,1} = \max[S_{0,0} + 5,\ S_{1,0} - 4,\ S_{0,1} - 4,\ 0] = \max[5, -4, -4, 0] = 5$.

    Now we proceed to $S_{1,2}$. Since $a_1$ = G and $b_2$ = A, there is a mismatch. Therefore, $s(a_1, b_2) = -3$ and by the assumptions stated earlier, $w = -4$. Thus, $S_{1,2} = \max[S_{0,1} - 3,\ S_{1,1} - 4,\ S_{0,2} - 4,\ 0] = \max[0 - 3,\ 5 - 4,\ 0 - 4,\ 0] = \max[-3, 1, -4, 0] = 1$. An arrow is placed back into the cell that resulted in the maximum score, which is the cell $S_{1,1}$.

             -    G    A    A    T    T    C    A    G    T    T    A
        -    0    0    0    0    0    0    0    0    0    0    0    0
        G    0    5    1
        G    0
        A    0
        T    0
        C    0
        G    0
        A    0

    Now we proceed to $S_{1,3}$. Since $a_1$ = G and $b_3$ = A, there is a mismatch. Therefore, $s(a_1, b_3) = -3$ and by the assumptions stated earlier, $w = -4$. Thus, $S_{1,3} = \max[S_{0,2} - 3,\ S_{1,2} - 4,\ S_{0,3} - 4,\ 0] = \max[0 - 3,\ 1 - 4,\ 0 - 4,\ 0] = \max[-3, -3, -4, 0] = 0$. Since the maximum score is 0 (all other possible scores are negative), no arrow is drawn back from this location.

             -    G    A    A    T    T    C    A    G    T    T    A
        -    0    0    0    0    0    0    0    0    0    0    0    0
        G    0    5    1    0
        G    0
        A    0
        T    0
        C    0
        G    0
        A    0


    We can then proceed to fill in the rest of the matrix in a similar fashion. The resulting matrix is as follows:

             -    G    A    A    T    T    C    A    G    T    T    A
        -    0    0    0    0    0    0    0    0    0    0    0    0
        G    0    5    1    0    0    0    0    0    5    1    0    0
        G    0    5    2    0    0    0    0    0    5    2    0    0
        A    0    1   10    7    3    0    0    5    1    2    0    5
        T    0    0    6    7   12    8    4    1    2    6    7    3
        C    0    0    2    3    8    9   13    9    5    2    3    4
        G    0    5    1    0    4    5    9   10   14   10    6    2
        A    0    1   10    6    2    1    5   14   10   11    7   11

    Each nonzero cell has one to three arrows indicating from which cell the maximum score was obtained. The matrix fill step is now complete.

    Traceback Step. After the matrix fill step, the maximum local alignment score for the two sequences is 14, which can be found by locating the highest values in the score matrix. Note that 14 is found in two separate cells, indicating that there are multiple alignments producing the maximal alignment score. The traceback step will find the actual local alignments resulting in the maximum score. The traceback begins in the position with the highest value. Since pointers have been kept back to all possible predecessors, the traceback is simple. At each cell, we look to see where we move next according to the pointers. When we reach a cell where there is no pointer to a previous cell, we have reached the beginning of the local alignment. First, consider the case where the 14 is in the last row.

    The traceback starts at $S_{7,7} = 14$. Note that in the original figures, blue letters and gold arrows indicate the path leading to the maximum score. We can continue to follow the path through $S_{6,6} = 9$ and $S_{5,6} = 13$ until we reach $S_{4,5} = 8$.

    At this point, our alignment (which is built starting at the end of the alignment) is as follows:

        C - A
        |   |
        C G A

    Now the current cell gets its score either from a match of the Ts or from a gap in the second sequence. We'll consider both as possibilities: match of the Ts (1) and gap in the second sequence (2). Both paths rejoin at $S_{3,3} = 7$ and continue diagonally through $S_{2,2} = 2$ and $S_{1,1} = 5$.


    Once we reach the node with 0 and there are no pointers from this node, we are finished. The two local alignments resulting in a score of 14 in the final row are:

        G A A T T C - A
        |   | |   |   |
        G G A T - C G A
        + - + + - + - +
        5 3 5 5 4 5 4 5

        G A A T T C - A
        |   |   | |   |
        G G A - T C G A
        + - + - + + - +
        5 3 5 4 5 5 4 5

    As you can see, each of these has 5 matches, 1 mismatch, and 2 gaps, so the score is 5(5) - 1(3) - 2(4) = 25 - 3 - 8 = 14. This coincides with the maximum local alignment score calculated in the matrix.

    Incorporation of Scoring Matrices

    Amino Acids

    Certain amino acid substitutions commonly occur in related proteins from different species. Since the proteins in all of the species are functional, the substitutions maintain protein structure and function. Often the substitutions result in a chemically similar amino acid. Other substitutions are relatively rare. Thus, rather than create a dynamic programming matrix with a single match/mismatch score, it would be better to weight the score for two residues according to the likelihood that such a substitution would be observed in nature. In a substitution matrix (whether it is for amino acids or nucleic acids), the residues are listed both as column and row headings. Each position in the matrix is filled with a score reflecting how often one residue would be paired with another in an alignment of related sequences.
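    As a small sketch of the idea, the match/mismatch function can be replaced by a table lookup. To keep the example short it uses a hypothetical four-letter nucleotide table rather than a 20 x 20 amino acid matrix; the scores (+2 identity, -5 transition, -7 transversion) are illustrative values of the kind tabulated for nucleotides later in these notes, and the function names are made up.

```python
PURINES = {"A", "G"}


def subst(a, b):
    """Illustrative score: +2 identity, -5 transition, -7 transversion."""
    if a == b:
        return 2
    same_class = (a in PURINES) == (b in PURINES)
    return -5 if same_class else -7


def build_matrix(alphabet="ACGT"):
    """Tabulate subst() so the DP fill can score a pair with one lookup."""
    return {(a, b): subst(a, b) for a in alphabet for b in alphabet}


def score_ungapped(x, y, matrix):
    """Sum the substitution scores of two equal-length, gap-free sequences."""
    return sum(matrix[(a, b)] for a, b in zip(x, y))


M = build_matrix()
print(M[("A", "G")], M[("A", "C")])            # -5 -7
print(score_ungapped("GATTACA", "GACTATA", M))  # 0
```

    In the dynamic programming recurrences, `matrix[(a, b)]` would simply take the place of the +5/-3 match/mismatch score.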


    Percent Accepted Mutation (PAM) Matrices

    Margaret Dayhoff pioneered research on amino acid substitutions found through the alignment of related protein sequences. The resulting Percent Accepted Mutation (PAM) matrices give the changes expected for a given period of evolutionary time. The assumption with this evolutionary model is that amino acid substitutions observed over short periods of evolutionary history can be extrapolated to longer distances.

    Assumptions in Creating PAM matrices

    Each change in the current amino acid at a particular site is assumed to be independent of previous mutational events at that site.

    Calculation of PAM matrices

    - Amino acid substitutions of evolving proteins were estimated using 1572 changes in 71 groups of protein sequences at least 85% similar. Since the proteins have similar functions, the mutations are called accepted mutations, meaning they are accepted by natural selection without negatively affecting a protein's fitness.
    - Similar sequences were organized into phylogenetic trees.
    - The number of changes of each amino acid into every other amino acid was counted.
    - Relative mutabilities were evaluated by counting the number of changes of each amino acid divided by a normalization factor. This normalized the data for variations in amino acid composition, mutation rate, and sequence length.
    - The amino acid exchange counts and mutability values were used to generate a 20 x 20 mutation probability matrix representing all possible amino acid changes.

    A detailed example of calculating the PAM matrix is located in Mount, p. 50. Since the changes are independent of previous mutational events, the PAM1 matrix can be multiplied by itself N times to give the transition matrix for sequences that have undergone N accepted mutations per 100 amino acids. Thus, the PAM250 matrix can be used for sequences that are 20% similar, while the PAM120, PAM80, and PAM60 matrices represent 40%, 50%, and 60% similarity. Note that PAM1 is 1 accepted mutation per 100 amino acids; PAM10 is 10 accepted mutations per 100 amino acids; PAM250 is 250 accepted mutations per 100 amino acids, and so on. Thus, the substitution matrix chosen when aligning two sequences should take into account the divergence between the two sequences.
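    The repeated multiplication can be sketched as follows. To stay short, the example uses a hypothetical two-residue alphabet with a made-up "PAM1" matrix rather than the real 20 x 20 Dayhoff data; only the mechanics of raising the matrix to the Nth power are the point here.

```python
def mat_mul(A, B):
    """Plain-Python product of two square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(P, n):
    """Multiply P by itself n times (n = 0 gives the identity matrix)."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


# Hypothetical PAM1-like matrix: each residue is kept with probability 0.99.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam2 = mat_power(toy_pam1, 2)
toy_pam250 = mat_power(toy_pam1, 250)
print(toy_pam2[0])    # close to [0.9802, 0.0198]
print(toy_pam250[0])  # drifts toward [0.5, 0.5]; each row still sums to 1
```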


    Example PAM1 matrix (normalized probabilities multiplied by 10000)

          Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
          A R N D C Q E G H I L K M F P S T W Y V

    Ala A 9867 2 9 10 3 8 17 21 2 6 4 2 6 2 22 35 32 0 2 18

    Arg R 1 9913 1 0 1 10 0 0 10 3 1 19 4 1 4 6 1 8 0 1

    Asn N 4 1 9822 36 0 4 6 6 21 3 1 13 0 1 2 20 9 1 4 1

    Asp D 6 0 42 9859 0 6 53 6 4 1 0 3 0 0 1 5 3 0 0 1

    Cys C 1 1 0 0 9973 0 0 0 1 1 0 0 0 0 1 5 1 0 3 2

    Gln Q 3 9 4 5 0 9876 27 1 23 1 3 6 4 0 6 2 2 0 0 1

    Glu E 10 0 7 56 0 35 9865 4 2 3 1 4 1 0 3 4 2 0 1 2

    Gly G 21 1 12 11 1 3 7 9935 1 0 1 2 1 1 3 21 3 0 0 5

    His H 1 8 18 3 1 20 1 0 9912 0 1 1 0 2 3 1 1 1 4 1

    Ile I 2 2 3 1 2 1 2 0 0 9872 9 2 12 7 0 1 7 0 1 33

    Leu L 3 1 3 0 0 6 1 1 4 22 9947 2 45 13 3 1 3 4 2 15

    Lys K 2 37 25 6 0 12 7 2 2 4 1 9926 20 0 3 8 11 0 1 1

    Met M 1 1 0 0 0 2 0 0 0 5 8 4 9874 1 0 1 2 0 0 4

    Phe F 1 1 1 0 0 0 0 1 2 8 6 0 4 9946 0 2 1 3 28 0

    Pro P 13 5 2 1 1 8 3 2 5 1 2 2 1 1 9926 12 4 0 0 2

    Ser S 28 11 34 7 11 4 6 16 2 2 1 7 4 3 17 9840 38 5 2 2

    Thr T 22 2 13 4 1 3 2 2 1 11 2 8 6 1 5 32 9871 0 2 9

    Trp W 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 9976 1 0

    Tyr Y 1 0 3 0 3 0 1 0 4 1 1 0 0 21 0 1 1 2 9945 1

    Val V 13 2 1 1 3 2 2 3 3 57 11 1 17 1 3 2 10 0 2 9901

    Taken from: http://www.techfak.uni-bielefeld.de/bcd/Curric/PrwAli/nodeE.html#page7

    Log Odds matrices

    PAM matrices are usually converted into another form, called log odds matrices. Odds ratios are converted into logarithms so that the scores may be added, rather than multiplied. Each cell of the log-odds matrix is calculated by first finding the odds ratio for each substitution. The odds ratio is calculated by taking the score in the above matrix, which is the probability of one amino acid mutating to another given amino acid, and dividing it by the frequency of the first amino acid. Such a ratio gives the relative frequency of change. The ratio is then converted to a log10, so that the scores are additive, and multiplied by 10. The log odds for converting from the first amino acid to the second is added to the log odds for converting from the second amino acid to the first, and the average is taken to produce a symmetric matrix, since the direction of mutation cannot necessarily be inferred. An example of how the log odds score for changes between Phe and Tyr is calculated is given in Mount, pp. 80-81. Make sure to look at this to see if you have any questions.
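    The conversion just described can be sketched in a few lines. The function name and argument names are illustrative, and the numbers in the usage line are hypothetical inputs chosen only to exercise the formula; they are not taken from these notes.

```python
import math


def log_odds(p_ab, f_a, p_ba, f_b):
    """Symmetric log-odds score in units of 10 * log10.

    p_ab: probability of a changing to b;  f_a: frequency of a
    p_ba: probability of b changing to a;  f_b: frequency of b
    """
    forward = 10 * math.log10(p_ab / f_a)
    backward = 10 * math.log10(p_ba / f_b)
    return (forward + backward) / 2  # average of the two directions


# Hypothetical inputs, for illustration only.
print(round(log_odds(0.15, 0.04, 0.20, 0.03), 2))  # 6.99
```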


    [Figure: log-odds form of the PAM250 scoring matrix. Image source: http://www.blc.arizona.edu/courses/bioinformatics/dayhoff.html (image in Mount, p. 82)]

    Blocks Amino Acid Substitution Matrices (BLOSUM)

    One of the arguments against the Dayhoff PAM matrices is that they represent only a small number of families, and therefore may not truly reflect the amino acid distributions that one is likely to encounter. Therefore, another set of substitution matrices, called BLOSUM matrices, was developed using a much larger number of protein families. The BLOSUM matrices were developed by Stephen and Georgia Henikoff by looking at a large set of approximately 2000 amino acid patterns organized into blocks, which are conserved regions within protein families as identified by the protein database Prosite. The blocks that were studied were also signatures of a protein family, indicating that members of the family could be found by searching for these blocks. In order to deal with the overrepresentation of amino acid substitutions occurring in the most closely related members of a family, a consensus sequence of the block is formed. Sequences that were 60% identical to the consensus were grouped together to form the BLOSUM60 matrix; sequences 80% identical were grouped together to form the BLOSUM80 matrix, etc.

    Nucleic Acid Scoring Matrices

    In addition to using a match/mismatch scoring scheme for DNA sequences, nucleotide mutation matrices can be constructed as well. These matrices are based upon two different models of nucleotide evolution: the first, the Jukes-Cantor model, assumes there are uniform mutation rates among nucleotides, while the second, the Kimura model, assumes that there are two separate mutation rates: one for transitions (where a purine is replaced by a purine, or a pyrimidine by a pyrimidine), and one for transversions. Generally, the rate of transitions is thought to be higher than the rate of transversions.

    Jukes-Cantor Model of Evolution: $\alpha$ = common rate of base substitution

    Kimura Model of Evolution: $\alpha$ = rate of transitions; $\beta$ = rate of transversions

    [Figure: the four nucleotides A, C, G and T, with arrows for the possible substitutions]

    Purines: A, G
    Pyrimidines: C, T
    Transitions: A <-> G; C <-> T
    Transversions: A <-> C, A <-> T, C <-> G, G <-> T


    http://www.cs.man.ac.uk/~jowh6/phase/node26.html

    Tables 3.4 and 3.5 show nucleotide substitution matrices with the equivalent distance of 1 PAM.

    Table 3.4 PAM1 Odds Matrices

    A. Model of uniform mutation rates among nucleotides.

             A        G        T        C
        A    0.99
        G    0.00333  0.99
        T    0.00333  0.00333  0.99
        C    0.00333  0.00333  0.00333  0.99

    B. Model of 3-fold higher transitions than transversions.

             A        G        T        C
        A    0.99
        G    0.006    0.99
        T    0.002    0.002    0.99
        C    0.002    0.002    0.006    0.99

    Table 3.5 PAM1 Log-Odds Matrices

    A. Model of uniform mutation rates among nucleotides.

             A    G    T    C
        A    2
        G   -6    2
        T   -6   -6    2
        C   -6   -6   -6    2

    B. Model of 3-fold higher transitions than transversions.

             A    G    T    C
        A    2
        G   -5    2
        T   -7   -7    2
        C   -7   -7   -5    2
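    As a quick check on how the two tables relate, the sketch below converts the probabilities of Table 3.4 into rounded log-odds scores. It assumes equal background frequencies of 0.25 for each nucleotide and base-2 logarithms; these assumptions are not stated in the notes, but they reproduce the values of Table 3.5.

```python
import math


def log_odds_bits(p, background=0.25):
    """Rounded log2 odds of a substitution probability against the background."""
    return round(math.log2(p / background))


print(log_odds_bits(0.99))     # identity: 2
print(log_odds_bits(0.00333))  # uniform model, any change: -6
print(log_odds_bits(0.006))    # 3-fold model, transition: -5
print(log_odds_bits(0.002))    # 3-fold model, transversion: -7
```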

    Gap Penalties

    The scoring schemes used to this point assume a linear gap penalty, where each gap position is given the same penalty score. However, over evolutionary time, it is more likely that a contiguous block of residues has been inserted or deleted in a certain region (for example, it is more likely to have 1 gap of length k than k gaps of length 1). Therefore, a better scoring scheme uses a higher initial penalty for opening a gap and a smaller penalty for extending the gap. The affine gap penalty can then be formulated as follows:

    $$w_x = g + r(x - 1)$$

    where $w_x$ is the total gap penalty, $g$ is the gap open penalty, $r$ is the gap extend penalty, and $x$ is the length of the gap. The gap penalty needs to be chosen relative to the score matrix, so that gaps are neither excluded from the alignment nor propagated throughout the alignment. Typical values are 12 for gap opening and 4 for gap extension. Affine gap penalties increase the number of matrices (or at least the storage space) to be filled out. The information to be processed is now:

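    $$M_{i,j} = \max \begin{cases} D_{i-1,j-1} + \mathrm{subst}(A_i, B_j) \\ M_{i-1,j-1} + \mathrm{subst}(A_i, B_j) \\ I_{i-1,j-1} + \mathrm{subst}(A_i, B_j) \end{cases}$$

    $$D_{i,j} = \max \begin{cases} D_{i,j-1} - \mathrm{extend} \\ M_{i,j-1} - \mathrm{open} \end{cases}$$

    $$I_{i,j} = \max \begin{cases} M_{i-1,j} - \mathrm{open} \\ I_{i-1,j} - \mathrm{extend} \end{cases}$$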
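    A score-only sketch of these three recurrences for global alignment follows, with M, D and I the match, delete and insert matrices. The function names are illustrative, the border initialization (a leading gap of length x costs $w_x$) is an assumption not spelled out in the notes, and the +5/-3 substitution function in the usage lines is borrowed from the earlier DNA example.

```python
NEG_INF = float("-inf")


def affine_gap_penalty(x, g=12, r=4):
    """Total penalty w_x = g + r(x - 1) for a gap of length x."""
    return g + r * (x - 1)


def affine_global_score(a, b, subst, gap_open=12, gap_extend=4):
    """Best global alignment score using the M, D, I recurrences."""
    rows, cols = len(a) + 1, len(b) + 1
    M = [[NEG_INF] * cols for _ in range(rows)]
    D = [[NEG_INF] * cols for _ in range(rows)]
    I = [[NEG_INF] * cols for _ in range(rows)]
    M[0][0] = 0
    for j in range(1, cols):  # leading gap of length j along row 0
        D[0][j] = -affine_gap_penalty(j, gap_open, gap_extend)
    for i in range(1, rows):  # leading gap of length i along column 0
        I[i][0] = -affine_gap_penalty(i, gap_open, gap_extend)
    for i in range(1, rows):
        for j in range(1, cols):
            s = subst(a[i - 1], b[j - 1])
            M[i][j] = max(M[i - 1][j - 1], D[i - 1][j - 1], I[i - 1][j - 1]) + s
            D[i][j] = max(D[i][j - 1] - gap_extend, M[i][j - 1] - gap_open)
            I[i][j] = max(I[i - 1][j] - gap_extend, M[i - 1][j] - gap_open)
    return max(M[-1][-1], D[-1][-1], I[-1][-1])


def dna_subst(x, y):
    return 5 if x == y else -3


print(affine_gap_penalty(3))                          # 20
print(affine_global_score("ACGT", "AT", dna_subst))   # -6
```

    In the second call the best alignment matches A and T and opens one gap of length 2, giving 5 + 5 - (12 + 4) = -6.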

    Where M is the match matrix, D is the delete matrix, and I is the insert matrix.

    Assessing the significance of sequence alignments

    When two sequences of length m and n are not obviously similar but show an alignment, it becomes necessary to assess the significance of the alignment. The scores of alignments of random sequences have been shown to follow a Gumbel extreme value distribution.


    [Figure: Gumbel extreme value distribution. Image source: http://roso.epfl.ch/mbi/papers/discretechoice/node11.html]

    Using a Gumbel extreme value distribution, the expected number of alignments with a score of at least S (the E-value) is:

    $$E = K m n \, e^{-\lambda S}$$

    Where:
    m, n: lengths of the sequences
    K, $\lambda$: statistical parameters dependent upon the scoring system and the background residue frequencies

    Recall that the log-odds scoring schemes examined to this point normally use an $S = 10 \log_{10} x$ scoring system. We can normalize the raw scores obtained using these non-gapped scoring systems to obtain the number of bits of information contained in a score, or the number of nats of information contained in a score.


    Converting to bit scores

    A raw score $S$ can be normalized to a bit score $S'$ using the formula:

    $$S' = \frac{\lambda S - \ln K}{\ln 2}$$

    The E-value corresponding to a given bit score can then be calculated as:

    $$E = m n \, 2^{-S'}$$

    Converting to nats is similar. However, we just substitute e for 2 in the above equations. Converting scores to either bits or nats gives a standardized unit by which the scores can be compared.

    P-values

    P-values can be calculated as the probability of obtaining a given score at random. P-values can be estimated as:

    $$P = 1 - e^{-E}$$

    which is approximately $E$ when $E$ is small.

    A quick determination of significance

    If a scoring matrix has been scaled to bit scores, then it can quickly be determined whether or not an alignment is significant. For a typical amino acid scoring matrix, K = 0.1 and $\lambda$ depends on the values of the scoring matrix. If a PAM or BLOSUM matrix is used, then $\lambda$ is precomputed. For instance, if the log odds matrix is in units of bits, then $\lambda = \ln 2$, and the significance cutoff can be calculated as $\log_2(mn)$.

    Example (p. 110, Mount)


    Suppose we have two sequences, each approximately 250 amino acids long, that are aligned using a Smith-Waterman approach and the PAM250 matrix. The following local alignment occurs:

        F W L E V E G N S M T A P T G
        F W L D V Q G D S M T A P A G

    Using the PAM250 matrix (p. 82), the score for this local alignment can be calculated as:

        S = 9 + 17 + 6 + 3 + 4 + 2 + 5 + 2 + 2 + 6 + 3 + 2 + 6 + 1 + 5 = 73

    S is in units of $10 \log_{10} x$, so this should be converted to a bit score $S'$:

    $$S = 10 \log_{10} x \;\Rightarrow\; \frac{S}{10} = \log_{10} x \;\Rightarrow\; \frac{S}{10} \log_2 10 = \log_{10} x \cdot \log_2 10 = \log_2 x$$

    Since $\log_2 10 \approx 3.32$, $\log_2 x \approx \frac{1}{3} S$, so $S' \approx \frac{1}{3} S$. In this case, $S' = \frac{1}{3} \cdot 73 = 24.3$.

    The significance cutoff is $\log_2(mn) = \log_2(250 \cdot 250) \approx 16$ bits. Since the alignment score is above the significance cutoff, this is a significant local alignment.

    Estimation of P and E

    When a PAM250 scoring matrix is being used, K is estimated to be 0.09, while $\lambda$ is estimated to be 0.229. Using equations 30 and 31 (Mount), we can convert the score to a normalized score:

    $$S' = 0.229 \cdot 73 - \ln(0.09 \cdot 250 \cdot 250) = 16.72 - 8.63 = 8.09 \text{ bits}$$

    $$P(S' \geq 8.09) = 1 - \exp(-e^{-8.09}) = 3.1 \times 10^{-4}$$

    Therefore, we see that the probability of observing an alignment with a score greater than 8.09 is about 3 in 10,000.
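    The arithmetic of this example can be reproduced with a few lines of Python. The function names are illustrative; the formulas are the ones used in the example above, with K = 0.09 and lambda = 0.229 taken from the text.

```python
import math


def bits_from_tenth_log10(raw):
    """Convert a score in 10*log10 units to bits: (S / 10) * log2(10)."""
    return raw / 10 * math.log2(10)


def significance_cutoff_bits(m, n):
    """Quick significance cutoff log2(mn) for a matrix scaled to bits."""
    return math.log2(m * n)


def normalized_score(raw, lam, K, m, n):
    """S' = lambda * S - ln(K * m * n), as in the worked example."""
    return lam * raw - math.log(K * m * n)


def p_value(s_norm):
    """P(S' >= x) = 1 - exp(-e^(-x))."""
    return 1 - math.exp(-math.exp(-s_norm))


raw, m, n = 73, 250, 250
print(round(bits_from_tenth_log10(raw), 2))     # 24.25, i.e. roughly S/3
print(round(significance_cutoff_bits(m, n), 2))  # 15.93
s_norm = normalized_score(raw, 0.229, 0.09, m, n)
print(round(s_norm, 2))                          # 8.08
print(p_value(s_norm))                           # about 3.1e-4
```

    The unrounded normalized score is 8.08 rather than 8.09 because the text rounds the two terms before subtracting.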


    Significance of Gapped Alignments

    Gapped alignments make use of the same statistics as ungapped alignments in determining statistical significance. However, in gapped alignments, the values for $\lambda$ and K cannot be easily estimated. Empirical estimates for particular scoring matrices and gap scores have been determined by looking at the alignments of randomized sequences.

    Bayesian Statistics

    Bayesian statistics are built upon conditional probabilities, which are used to derive the joint probability of two events or conditions. P(B|A) is the probability of B given that condition A is true. P(B) is the probability of condition B occurring, regardless of condition A.

    Suppose that A can have two states, A1 and A2, and B can have two states, B1 and B2. Suppose that P(B1) = 0.3 is known. Therefore, P(B2) = 1 - 0.3 = 0.7. These probabilities are known as marginal probabilities. Now we would like to determine the probability of A1 and B1 occurring together, which is denoted P(A1, B1) and is called the joint probability. Note that in this case the marginal probabilities of A1 and A2 are missing. Thus, there is not enough information at this point to calculate the joint probability. However, if more information about the joint occurrence of A1 and B1 is given, then the joint probabilities may be derived using Bayes' Rule:

        P(A1, B1) = P(B1) P(A1|B1)
        P(A1, B1) = P(A1) P(B1|A1)

    Suppose that we are given P(A1|B1) = 0.8. Then, since there are only two possible states for A, P(A2|B1) = 1 - 0.8 = 0.2. If we are also given P(A2|B2) = 0.7, then P(A1|B2) = 0.3. Using Bayes' Rule, the joint probability of having states A1 and B1 occurring at the same time is P(B1) P(A1|B1) = 0.3 * 0.8 = 0.24, and P(A2, B2) = P(B2) P(A2|B2) = 0.7 * 0.7 = 0.49. The other joint probabilities can be calculated in the same way. The calculation of the joint probabilities results in posterior probabilities, since they are not known initially, but are calculated using prior probabilities and initial information.
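    The example can be checked with a short sketch that builds all four joint probabilities from the given marginals and conditionals. The dictionary layout and variable names are illustrative choices.

```python
# Marginal probabilities of B and conditional probabilities P(A|B) from the text.
p_b = {"B1": 0.3, "B2": 0.7}
p_a_given_b = {
    ("A1", "B1"): 0.8, ("A2", "B1"): 0.2,
    ("A1", "B2"): 0.3, ("A2", "B2"): 0.7,
}

# Joint probability: P(A, B) = P(B) * P(A|B)
joint = {(a, b): p_b[b] * p for (a, b), p in p_a_given_b.items()}

# Marginal probability of A, obtained by summing the joint over B.
p_a = {a: sum(v for (x, _), v in joint.items() if x == a) for a in ("A1", "A2")}

for key in sorted(joint):
    print(key, round(joint[key], 2))  # 0.24, 0.21, 0.06, 0.49
print({a: round(v, 2) for a, v in p_a.items()})  # {'A1': 0.45, 'A2': 0.55}
```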
    Applications of Bayesian Statistics

    Bayesian statistics have many applications in bioinformatics. One application is in determining the evolutionary distance between two sequences (Agarwal and States, 1996; covered in Mount, pp. 122-124). Another is in sequence alignment algorithms (Zhu et al., 1998; Mount, pp. 124-134). The significance of an alignment can also be computed using a Bayesian framework (Durbin et al., pp. 36-38). More applications using Bayesian statistics will be examined when the Gibbs Sampling algorithm is discussed during a later class period.

  • Drawbacks to Dynamic Programming Approach

    Dynamic programming approaches are guaranteed to give the optimal alignment between two sequences given a scoring scheme. However, the two main drawbacks to DP approaches are that they are compute and memory intensive: in the cases discussed to this point, they take at least $O(n^2)$ time and space. Linear space algorithms have been used to deal with the memory drawback. The basic idea is to concentrate only on those areas of the matrix more likely to contain the maximum alignment. The most well-known of these linear space algorithms is the Myers-Miller algorithm.

    Available pairwise sequence alignment programs

    FASTA suite of programs
    LALIGN
    BESTFIT
    SIM, GAP, NAP, LAP2, GAP2
    http://genome.cs.mtu.edu/align/align.html

    EMBOSS applications
    http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/index.html

    Web forms for EMBOSS applications
    http://bioweb.pasteur.fr/seqanal/alignment/intro-uk.html#EMBOSS
    http://bioinfo.pbi.nrc.ca:8090/EMBOSS/index.html

    Bayesian tutorial
    http://www.wadsworth.org/resnres/bioinfo/tut1/index.htm

    Expressed Sequences to Genomes

    Sim4
    est2genome
    spidey
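Returning to the memory drawback of dynamic programming noted above, the sketch below shows one simple way the space cost can drop from quadratic to linear when only the optimal score is needed: each row of the matrix depends only on the row before it, so two rows suffice. This is not the Myers-Miller algorithm itself (recovering the full alignment in linear space needs an additional divide-and-conquer step), and the function name and the scoring values (match 5, mismatch -4, linear gap -8) are assumptions chosen for illustration.

```python
def nw_score_linear_space(x, y, match=5, mismatch=-4, gap=-8):
    """Global alignment score using O(len(y)) memory instead of O(len(x) * len(y))."""
    # Row 0 of the DP matrix: aligning a prefix of y against nothing.
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        # First column: aligning a prefix of x against nothing.
        curr = [i * gap] + [0] * len(y)
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + s,    # align x[i-1] with y[j-1]
                prev[j] + gap,      # x[i-1] aligned to a gap
                curr[j - 1] + gap,  # y[j-1] aligned to a gap
            )
        prev = curr  # the older row is no longer needed
    return prev[len(y)]


print(nw_score_linear_space("ACGT", "AGT"))
```

The time cost is still proportional to the product of the two lengths; only the memory footprint changes.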


  • needle results

    ########################################
    # Program: needle
    # Rundate: Wed Jan 22 20:09:50 2003
    # Report_file: outfile.align
    ########################################
    #=======================================
    #
    # Aligned_sequences: 2
    # 1: gi
    # 2: gi
    # Matrix: EDNAFULL
    # Gap_penalty: 12.0
    # Extend_penalty: 4.0
    #
    # Length: 1030
    # Identity: 537/1030 (52.1%)
    # Similarity: 537/1030 (52.1%)
    # Gaps: 493/1030 (47.9%)
    # Score: 1649.0
    #
    #
    #=======================================

    gi 1 0

    gi 1 ATACAAAATTTACGTGACTGGAGGGTGAAAGGGAATGTGGGAGGTCAGTG 50

    gi 1 GGCAATAATGATACAATGTATCATGCCTCT 30 |||||||||||||||||||||||||||||| gi 51 CATTTAAAACATAAAGAAATGGCAATAATGATACAATGTATCATGCCTCT 100

    gi 31 TTGCACCATTCTAAAGAATAACAGTGATAATTTCTGGGTTAAGGCAATAG 80 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 101 TTGCACCATTCTAAAGAATAACAGTGATAATTTCTGGGTTAAGGCAATAG 150

    gi 81 CAA---------------------------ATAAATTGTAACTGATGTAA 103 ||| |||||||||||||||||||| gi 151 CAATATCTCTGCATATAAATATTTCTGCATATAAATTGTAACTGATGTAA 200

    gi 104 GAGGTTTCATATTGCTAATAGCAGCTACAATCCAGCTACCATTCTGCTTT 153 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 201 GAGGTTTCATATTGCTAATAGCAGCTACAATCCAGCTACCATTCTGCTTT 250

    gi 154 TATTTTA---------------------------------------TGGT 164 ||||||| |||| gi 251 TATTTTAAATTTATATGCAGAAATATTTATATGCAGAGATATTGCTTGGT 300

    gi 165 TGGGATAAGGCTGGATTATTCTGAGTCCAAGCTAGGCCCTTTTGCTAATC 214 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 301 TGGGATAAGGCTGGATTATTCTGAGTCCAAGCTAGGCCCTTTTGCTAATC 350

    gi 215 ATGTTCATACCTCTTATCTTCCTCCCACGGCTCCTGGGCAACGTGCTGGT 264 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 351 ATGTTCATACCTCTTATCTTCCTCCCACGGCTCCTGGGCAACGTGCTGGT 400

    gi 265 CTGTGTGC--------------------------------CCAGTGCAGG 282 |||||||| |||||||||| gi 401 CTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGG 450

    gi 283 CTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAG 332 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 451 CTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAG 500

    gi 333 TATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTT 382 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 501 TATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTT 550

    gi 383 GTTCCCTAAGTCCAACTACTAAAC-------------------------- 406 |||||||||||||||||||||||| gi 551 GTTCCCTAAGTCCAACTACTAAACAAGCTAGGCCCTTTTGCTAATCATGT 600


  • gi 407 -----------------------TGGGGGATATTATGAAGGGCCTTGAGC 433 ||||||||||||||||||||||||||| gi 601 TCATACCTCTTATCTTCCTCCCATGGGGGATATTATGAAGGGCCTTGAGC 650

    gi 434 ATCTGGATTCTGCCTAATAAAA---------------------------- 455 |||||||||||||||||||||| gi 651 ATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGCATCTGCATAT 700

    gi 456 -----------------------------------------TATTTCTGA 464 ||||||||| gi 701 AAATATTTCTGCATATAAATTGTAACATGATGTATTTAAATTATTTCTGA 750

    gi 465 ATA-------------------------------TTTTACTAAAAAGGGA 483 ||| |||||||||||||||| gi 751 ATAAGAAATCTTACCACGTTTCTCCGTACTATGTTTTTACTAAAAAGGGA 800

    gi 484 ATGTGGGAGGTCAGTGCATTTAAAACATAAAGAAATGAAGAGCTAGTTCA 533 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 801 ATGTGGGAGGTCAGTGCATTTAAAACATAAAGAAATGAAGAGCTAGTTCA 850

    gi 534 AACC 537 |||| gi 851 AACCACTTACATCAGTTACAATTTATATGCAGAAATATTTATATGCAGAG 900

    gi 538 537

    gi 901 ATATTGCTTTAGGTCGGAATAGGGTTGGTATTTTATTTTCGTCTTACCAT 950

    gi 538 537

    gi 951 CGACCTAACATCGACGATAATAGCAGCTACAATCCAGCTACCATTCTGCT 1000

    gi 538 537

    gi 1001 TTTATTTTATGGTTGGGATAAGGCTGGATT 1030

    #---------------------------------------
    #---------------------------------------


  • water results

    ########################################
    # Program: water
    # Rundate: Wed Jan 22 20:11:48 2003
    # Report_file: outfile.align
    ########################################
    #=======================================
    #
    # Aligned_sequences: 2
    # 1: gi
    # 2: gi
    # Matrix: EDNAFULL
    # Gap_penalty: 12.0
    # Extend_penalty: 4.0
    #
    # Length: 660
    # Identity: 484/660 (73.3%)
    # Similarity: 484/660 (73.3%)
    # Gaps: 152/660 (23.0%)
    # Score: 1660.0
    #
    #
    #=======================================

    gi 1 GGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATA 50 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 71 GGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATA 120

    gi 51 ACAGTGATAATTTCTGGGTTAAGGCAATAGCAA----------------- 83 ||||||||||||||||||||||||||||||||| gi 121 ACAGTGATAATTTCTGGGTTAAGGCAATAGCAATATCTCTGCATATAAAT 170

    gi 84 ----------ATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATA 123 |||||||||||||||||||||||||||||||||||||||| gi 171 ATTTCTGCATATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATA 220

    gi 124 GCAGCTACAATCCAGCTACCATTCTGCTTTTATTTTA------------- 160 ||||||||||||||||||||||||||||||||||||| gi 221 GCAGCTACAATCCAGCTACCATTCTGCTTTTATTTTAAATTTATATGCAG 270

    gi 161 --------------------------TGGTTGGGATAAGGCTGGATTATT 184 |||||||||||||||||||||||| gi 271 AAATATTTATATGCAGAGATATTGCTTGGTTGGGATAAGGCTGGATTATT 320

    gi 185 CTGAGTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTCTTATCTT 234 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 321 CTGAGTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTCTTATCTT 370

    gi 235 CCTCCCACGGCTCCTGGGCAACGTGCTGGTCTGTGTGC------------ 272 |||||||||||||||||||||||||||||||||||||| gi 371 CCTCCCACGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACT 420

    gi 273 --------------------CCAGTGCAGGCTGCCTATCAGAAAGTGGTG 302 |||||||||||||||||||||||||||||| gi 421 TTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTG 470

    gi 303 GCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCT 352 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 471 GCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCT 520

    gi 353 TGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACT 402 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 521 TGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACT 570

    gi 403 AAAC---------------------------------------------- 406 |||| gi 571 AAACAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTCTTATCTTCCTC 620

    gi 407 ---TGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAA 453 |||||||||||||||||||||||||||||||||||||||||||||||

    gi 621 CCATGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAA 670

    gi 454 AATATTTCTGAATATTTTACTAAAAAGGGAATGTGGGAGGTCAGTGCATT 503 ||.|..| | ||||||..|...|...|.||.| ..|..|...|||||. gi 671 AAAACAT-T---TATTTTCATTGCATCTGCATAT-AAATATTTCTGCATA 715

    gi 504 TAAAACATAA 513 ||||...||| gi 716 TAAATTGTAA 725

    #---------------------------------------
    #---------------------------------------


  • Blast 2 sequences

    Score = 258 bits (134), Expect = 7e-66
    Identities = 134/134 (100%)
    Strand = Plus / Plus

    Query: 273 ccagtgcaggctgcctatcagaaagtggtggctggtgtggctaatgccctggcccacaag 332 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 441 ccagtgcaggctgcctatcagaaagtggtggctggtgtggctaatgccctggcccacaag 500

    Query: 333 tatcactaagctcgctttcttgctgtccaatttctattaaaggttcctttgttccctaag 392 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 501 tatcactaagctcgctttcttgctgtccaatttctattaaaggttcctttgttccctaag 560

    Query: 393 tccaactactaaac 406 |||||||||||||| Sbjct: 561 tccaactactaaac 574

    Score = 216 bits (112), Expect = 4e-53
    Identities = 112/112 (100%)
    Strand = Plus / Plus

    Query: 161 tggttgggataaggctggattattctgagtccaagctaggcccttttgctaatcatgttc 220 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 297 tggttgggataaggctggattattctgagtccaagctaggcccttttgctaatcatgttc 356

    Query: 221 atacctcttatcttcctcccacggctcctgggcaacgtgctggtctgtgtgc 272 |||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 357 atacctcttatcttcctcccacggctcctgggcaacgtgctggtctgtgtgc 408 .


  • LALIGN

    /seqprg/slib/bin/lalign -N 5000 -n -r "+5/-4" -f -12 -g -4 -w 75 -q @ @
    resetting to DNA matrix
    resetting to DNA matrix
    LALIGN finds the best local alignments between two sequences
    version 2.1u03 April 2000
    Please cite: X. Huang and W. Miller (1991) Adv. Appl. Math. 12:373-381

    resetting to DNA matrix
    alignments < E( 0.05):score: 75 (50 max)
    Comparison of:
    (A) @ gi|22758817|gb|AY128651.1| Homo sapiens beta-globi - 537 nt
    (B) @ gi|22758817|gb|AY128651.1| Homo sapiens beta-globi - 1058 nt
    using matrix file: DNA, gap penalties: -12/-4 E(limit) 0.05

    73.3% identity in 660 nt overlap (1-513:99-753); score: 1660 E(10000): 3.5e-130

    10 20 30 40 50 60 70 gi|227 GGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATAACAGTGATAATTTCTGGGTTAAGGC :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

    gi|227 GGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATAACAGTGATAATTTCTGGGTTAAGGC 100 110 120 130 140 150 160 170

    80 90 100 110 120 gi|227 AATAGCAA---------------------------ATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATA :::::::: ::::::::::::::::::::::::::::::::::::::::

    gi|227 AATAGCAATATCTCTGCATATAAATATTTCTGCATATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATA 180 190 200 210 220 230 240

    130 140 150 160 gi|227 GCAGCTACAATCCAGCTACCATTCTGCTTTTATTTTA-------------------------------------- :::::::::::::::::::::::::::::::::::::

    gi|227 GCAGCTACAATCCAGCTACCATTCTGCTTTTATTTTAAATTTATATGCAGAAATATTTATATGCAGAGATATTGC 250 260 270 280 290 300 310 320

    170 180 190 200 210 220 230 gi|227 -TGGTTGGGATAAGGCTGGATTATTCTGAGTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTCTTATCTT ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

    gi|227 TTGGTTGGGATAAGGCTGGATTATTCTGAGTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTCTTATCTT 330 340 350 360 370 380 390

    240 250 260 270 gi|227 CCTCCCACGGCTCCTGGGCAACGTGCTGGTCTGTGTGC--------------------------------CCAGT :::::::::::::::::::::::::::::::::::::: :::::

    gi|227 CCTCCCACGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGT 400 410 420 430 440 450 460 470

    280 290 300 310 320 330 340 350 gi|227 GCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCT :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

    gi|227 GCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCT 480 490 500 510 520 530 540

    360 370 380 390 400 gi|227 TGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAAC--------------------- ::::::::::::::::::::::::::::::::::::::::::::::::::::::

    gi|227 TGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACAAGCTAGGCCCTTTTGCTAAT 550 560 570 580 590 600 610 620

    410 420 430 440 450 gi|227 ----------------------------TGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAA :::::::::::::::::::::::::::::::::::::::::::::::

    gi|227 CATGTTCATACCTCTTATCTTCCTCCCATGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAA 630 640 650 660 670 680 690

    460 470 480 490 500 510 gi|227 AATATTTCTGAATATTTTACTAAAAAGGGAATGTGGGAGGTCAGTGCATTTAAAACATAA :: : : : :::::: : : : :: : : : ::::: :::: :::

    gi|227 AAAACAT-T---TATTTTCATTGCATCTGCATATAA-ATATTTCTGCATATAAATTGTAA 700 710 720 730 740 750


  • CECS 694-02 Introduction to Bioinformatics

    Lecture 3: Multiple Sequence Alignment

    Two Issues with the Programming Project

    1. Amino acid sequence alignment
    2. Calculating alignment score using affine gap penalties

    Amino Acid Sequence Alignment

    With amino acid sequence alignment, there is no longer a single match/mismatch score as there is with DNA sequence alignment, since some amino acids can be substituted for others while still maintaining the functionality of a protein. Therefore, when aligning two amino acid sequences, it is necessary to use a lookup table to find the match score between two amino acids. This lookup table is a scoring matrix as described in the previous class, such as a PAM or BLOSUM matrix. Given two residues, you look up their entry in this matrix to determine their match score. You can use the symmetric PAM250 matrix on page 82 for amino acid sequence alignments for this project.

    PAM250 matrix, p. 82 (Mount)
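The lookup described above might be sketched as follows. The three-residue table `TOY_MATRIX` is a small placeholder with illustrative values, not the full PAM250 matrix; in the project it would be replaced with the entries from the table in Mount. The function names are also assumptions. Because the matrix is symmetric, only one triangle is stored and the lookup tries both orderings of the pair.

```python
# Placeholder scores for three residues (A, R, N). Illustrative only;
# substitute the entries from the real scoring matrix.
TOY_MATRIX = {
    ("A", "A"): 2, ("A", "R"): -2, ("A", "N"): 0,
    ("R", "R"): 6, ("R", "N"): 0,
    ("N", "N"): 2,
}


def pair_score(a, b, matrix=TOY_MATRIX):
    """Look up the score for residues a and b in a symmetric matrix."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]  # symmetric: s(a, b) == s(b, a)


def ungapped_score(x, y, matrix=TOY_MATRIX):
    """Score two equal-length sequences position by position, with no gaps."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b, matrix) for a, b in zip(x, y))


print(ungapped_score("ARN", "RRA"))
```

In a dynamic programming alignment, `pair_score` would simply take the place of the fixed match/mismatch value used for DNA.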


  • Calculating alignments using affine gap penalties

    (Don't worry about this for the current project; it will be part of the second programming assignment.)

    To calculate an alignment using affine gap penalties, it is necessary to consider the possibility of either extending an existing gap or opening a new gap. To calculate the maximum alignment score matrix, $V$, it is necessary to consider three separate matrices: a match matrix ($M$), an insertion matrix ($I$), and a deletion matrix ($D$). The scores for each of these matrices are calculated as follows, where $g$ is the gap open penalty and $r$ is the gap extend penalty. In the $I$ and $D$ recurrences, the first term opens a new gap and the second term extends an existing gap:

    $M_{i,j} = \max\{ M_{i-1,j-1} + s(x_i, y_j),\ I_{i-1,j-1} + s(x_i, y_j),\ D_{i-1,j-1} + s(x_i, y_j) \}$
    $I_{i,j} = \max\{ M_{i-1,j} - g,\ I_{i-1,j} - r \}$
    $D_{i,j} = \max\{ M_{i,j-1} - g,\ D_{i,j-1} - r \}$
    $V_{i,j} = \max\{ M_{i,j},\ I_{i,j},\ D_{i,j} \}$
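The recurrences above can be turned into a short score-only routine. This is a sketch rather than a reference solution for the assignment: the notes do not specify boundary conditions, so the initialization here (a leading gap of length k costs g + (k - 1) * r, and all other border cells are minus infinity) is an assumption, as are the function name and the scoring function passed in.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, score, g, r):
    """Global alignment score with affine gaps, using matrices M, I and D."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    # Assumed boundaries: a leading gap of length k costs g + (k - 1) * r.
    for i in range(1, n + 1):
        I[i][0] = -g - (i - 1) * r
    for j in range(1, m + 1):
        D[0][j] = -g - (j - 1) * r
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(x[i - 1], y[j - 1])
            # Match matrix: x[i-1] aligned with y[j-1].
            M[i][j] = max(M[i - 1][j - 1], I[i - 1][j - 1], D[i - 1][j - 1]) + s
            # Insertion matrix: open a new gap or extend an existing one.
            I[i][j] = max(M[i - 1][j] - g, I[i - 1][j] - r)
            # Deletion matrix: open a new gap or extend an existing one.
            D[i][j] = max(M[i][j - 1] - g, D[i][j - 1] - r)
    # V at the final cell is the best of the three matrices.
    return max(M[n][m], I[n][m], D[n][m])


def dna_score(a, b):
    """Simple DNA scoring: +5 for a match, -4 for a mismatch (illustrative)."""
    return 5 if a == b else -4


print(affine_global_score("ACGT", "AT", dna_score, g=12, r=4))
```

With g = 12 and r = 4, a gap of length two costs 16 rather than 24, which is why a single longer gap is preferred over two separate gaps.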

    Multiple Sequence Alignment

    Description

    Similar genes that perform similar or identical functions can be conserved across species. Many genes are represented in highly conserved forms across organisms.

    Unique human and mouse genes

    By performing a simultaneous alignment of multiple sequences having similar or identical functions, we can gain information about which regions have been subject to mutations over evolutionary time and which are evolutionarily conserved. Such knowledge tells us which regions or domains of a gene are critical to its functionality. Sometimes genes that are similar in sequence can be mutated or rearranged to perform an altered function. By looking at multiple alignments of such sequences, we can tell which changes in the sequence have caused a change in functionality.


  • Multiple sequence alignment yields information concerning the structure and function of proteins. It can help lead to the discovery of important sequence domains or motifs with biological significance, while at the same time uncovering evolutionary relationships among genes.

    In multiple sequence alignment, the idea is to take three or more sequences and align them so that the greatest number of similar characters are aligned in the same column of the alignment. The difficulty is that there are now many different combinations of matches, insertions, and deletions that must be considered when looking at several different sequences. Methods that guarantee the highest scoring alignment are not computationally feasible for more than a few sequences. Therefore, approximation methods are put to use in multiple sequence alignment.

    Example multiple alignment of 8 immunoglobulin sequences.

    There are four approaches to multiple sequence alignment we will consider: dynamic programming, progressive alignment, iterative alignment, and statistical modeling.

    Extension of Dynamic Programming Approach

    The attractiveness of dynamic programming with two sequences is that it is guaranteed to give the optimal alignment of the sequences given a specific scoring scheme. In addition, it is a relatively easy method to implement.


  • Dynamic programming approaches can be extended to multiple alignment as well. Consider the example where we have three amino acid sequences VSNS, SNA, and AS to align. Instead of filling a two dimensional matrix as we did with two sequences, we now fill a three dimensional space.

    Figure source: http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/node2.html#SECTION00020000000000000000
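As a rough illustration of filling a three-dimensional space, the sketch below computes the optimal score for three sequences. It assumes sum-of-pairs column scoring with made-up values (match 1, mismatch -1, gap -1, and 0 for a gap paired with a gap); neither the scoring values nor the function names come from the lecture. Each cell (i, j, k) is reached from up to seven predecessor cells, one for each non-empty choice of which sequences contribute a residue to the column.

```python
from itertools import product


def sp_column(col, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score of one alignment column (gap/gap pairs score 0)."""
    total = 0
    for a in range(len(col)):
        for b in range(a + 1, len(col)):
            p, q = col[a], col[b]
            if p == "-" and q == "-":
                continue
            if p == "-" or q == "-":
                total += gap
            else:
                total += match if p == q else mismatch
    return total


def align3_score(x, y, z):
    """Optimal sum-of-pairs score for three sequences via a 3-D DP table."""
    nx, ny, nz = len(x), len(y), len(z)
    neg_inf = float("-inf")
    V = [[[neg_inf] * (nz + 1) for _ in range(ny + 1)] for _ in range(nx + 1)]
    V[0][0][0] = 0
    # Seven moves: every non-empty subset of sequences advances by one residue.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i, j, k in product(range(nx + 1), range(ny + 1), range(nz + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            col = (
                x[pi] if di else "-",
                y[pj] if dj else "-",
                z[pk] if dk else "-",
            )
            V[i][j][k] = max(V[i][j][k], V[pi][pj][pk] + sp_column(col))
    return V[nx][ny][nz]


print(align3_score("VSNS", "SNA", "AS"))
```

The table has (nx + 1)(ny + 1)(nz + 1) cells, so even this small example with sequences of length 4, 3, and 2 fills 60 cells with 7 candidate moves each.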

    Suppose the length of each sequence is $n$ residues. If there are two such sequences, then the number of comparisons needed to fill in the scoring matrix is $n^2$, since it is a two-dimensional matrix. The number of comparisons needed to fill in the scoring cube when three sequences are aligned is $n^3$, and when four sequences are aligned, the number of comparisons needed is $n^4$. Thus, the number of comparisons grows exponentially with the number of sequences: it is $n^N$, where $n$ is the length of the sequences and $N$ is the number of sequences. For example, with $n = 100$, two sequences need $10^4$ comparisons, while five sequences need $10^{10}$. Without any changes, the dynamic programming approach therefore becomes impractical for even a small number of short sequences.

    Carrillo and Lipman (1988): sum of pairs
    MSA: Lipman et al. (1989); Gupta et al. (1995)
    Substantial reduction in memory and number of required steps

    Idea for reduction of memory and computations:


  • Multiple sequence alignment imposes an alignment on each of the pairs of sequences. Alignments found for each of the pairs of sequences can impose bounds on the location of the MSA within the cube (three sequences) or N-dimensional space (N sequences). Step 1: Fin