
G4120: Introduction to Computational Biology

Oliver Jovanovic, Ph.D.
Columbia University
Department of Microbiology

Lecture 2
January 30, 2003

Copyright © 2003 Oliver Jovanovic, All Rights Reserved.

Binary Computing

Modern computers are digital, and function at the binary level, with the most basic measure of information, a bit, representing only one of two possibilities: 0 or 1.

1 bit  = 0 or 1 = 2 possibilities
2 bits = 2 x 2 = 4 possibilities
3 bits = 2 x 2 x 2 = 8 possibilities
4 bits = 2 x 2 x 2 x 2 = 16 possibilities
5 bits = 2 x 2 x 2 x 2 x 2 = 32 possibilities
6 bits = 2 x 2 x 2 x 2 x 2 x 2 = 64 possibilities
7 bits = 2 x 2 x 2 x 2 x 2 x 2 x 2 = 128 possibilities
8 bits = 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 = 256 possibilities
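The doubling can be checked by listing every pattern directly. This is a minimal sketch; the function name `bit_patterns` is made up for illustration.

```python
from itertools import product

def bit_patterns(n_bits):
    """Return every distinct pattern of n_bits binary digits as a string."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]

for n in range(1, 9):
    # Each additional bit doubles the number of patterns.
    print(n, "bits =", len(bit_patterns(n)), "possibilities")
```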

Binary Representation of DNA

DNA has only four possibilities (so each base can be represented by 2 bits)
G = 00
C = 11
A = 01
T = 10

Complementation (with intelligent choice of representation)
G C C A = 00 11 11 01
C G G T = 11 00 00 10
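With this choice of codes, complementing a sequence amounts to flipping every bit. The sketch below packs a sequence into an integer at 2 bits per base and complements it with a single XOR. The helper names (`pack`, `unpack`, `complement`) are illustrative, not part of the lecture.

```python
# 2-bit codes from the lecture; complementary bases are bitwise inverses.
CODE = {"G": 0b00, "C": 0b11, "A": 0b01, "T": 0b10}
BASE = {bits: base for base, bits in CODE.items()}

def pack(seq):
    """Pack a DNA string into a single integer, 2 bits per base."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | CODE[base]
    return value

def unpack(value, length):
    """Recover a DNA string of the given length from its packed integer."""
    bases = []
    for shift in range(2 * (length - 1), -1, -2):
        bases.append(BASE[(value >> shift) & 0b11])
    return "".join(bases)

def complement(seq):
    """Complement every base by flipping all bits of the packed value."""
    mask = (1 << (2 * len(seq))) - 1  # two one-bits per base
    return unpack(pack(seq) ^ mask, len(seq))

print(format(pack("GCCA"), "08b"))  # 00111101
print(complement("GCCA"))           # CGGT
```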

Additional Binary Units
8 bits = 1 byte
1,024 bits = 1 kilobit
1,024 bytes (approximately 1 thousand bytes) = 1 kilobyte (K)
1,024 kilobytes (approximately 1 million bytes) = 1 megabyte (M)
1,024 megabytes (approximately 1 billion bytes) = 1 gigabyte (G)
1,024 gigabytes (approximately 1 trillion bytes) = 1 terabyte (T)

ASCII Representation of DNA

American Standard Code for Information Interchange (ASCII)
• For practical purposes, DNA and RNA are generally represented in ASCII code, using the upper or lower case letters A, C, G, and T or A, C, G, and U.
• Each ASCII character occupies one byte. A byte has 256 possible values, more than enough to cover all upper and lower case letters of the English alphabet, the ten Arabic numerals, punctuation, and special characters, such as @.
• Thus, a kilobase of DNA (1,000 base pairs) occupies just under a kilobyte of storage in ASCII. An entire human genome, roughly 3 billion base pairs, occupies just under 3 gigabytes of storage in ASCII.
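The storage figures follow from one byte per base and the 1,024-based units defined earlier. A small sketch of the arithmetic, in which the function names are illustrative and the 2-bit figure assumes the packed representation from the previous section:

```python
import math

KILOBYTE = 1024
GIGABYTE = 1024 ** 3

def ascii_size_bytes(n_bases):
    """ASCII storage: one character, and so one byte, per base."""
    return n_bases

def two_bit_size_bytes(n_bases):
    """Packed storage at 2 bits per base, rounded up to whole bytes."""
    return math.ceil(n_bases * 2 / 8)

print(len("GATTACA".encode("ascii")))              # 7 bases occupy 7 bytes
print(ascii_size_bytes(1_000) / KILOBYTE)          # a kilobase, in kilobytes
print(ascii_size_bytes(3_000_000_000) / GIGABYTE)  # a human genome, in gigabytes
print(two_bit_size_bytes(3_000_000_000) / GIGABYTE)
```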

Transcription
Transcription is computationally trivial. One need only substitute a U for a T if dealing with a sense strand, or complement, then transcribe if dealing with the antisense strand.
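Both cases can be sketched in a few lines. The function names are illustrative. Reversing the complemented antisense strand is an added assumption here: it takes the input to be written 5' to 3' and makes the returned RNA read 5' to 3' as well.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def transcribe_sense(dna):
    """Sense strand: the RNA has the same sequence with U in place of T."""
    return dna.replace("T", "U").replace("t", "u")

def transcribe_antisense(dna):
    """Antisense strand: complement each base, then substitute U for T.

    Assumption: the input is written 5'->3', so the complement is reversed
    to make the returned RNA read 5'->3' too.
    """
    return transcribe_sense(dna.translate(COMPLEMENT)[::-1])

print(transcribe_sense("ATGGCT"))      # AUGGCU
print(transcribe_antisense("AGCCAT"))  # AUGGCU
```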

Translation
Translation is also a computationally trivial task. A computer program can refer to a species-appropriate translation table to translate DNA or RNA into the appropriate protein sequence.

AUA  I  Isoleucine
AUC  I  Isoleucine
AUG  M  Methionine (start)
AUU  I  Isoleucine
etc.
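A table lookup of this kind might look like the sketch below. It assumes the standard genetic code (NCBI translation table 1); a species-appropriate table would be swapped in where it differs. The function name `translate` and the choice to stop at the first stop codon are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each
# position; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(seq):
    """Translate DNA or RNA, reading codons from the first base to the first stop."""
    seq = seq.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGAUAAUCAUUUAA"))  # MIII
```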

Alternate Representation
An ASCII representation of DNA can readily be converted into other forms, such as graphics, or even music.

Algorithms in Computational Biology

Algorithm
An algorithm is simply a series of steps used to solve a problem. One of a computer's great strengths is its ability to rapidly and accurately repeat recursive steps in an algorithm.

Computational Biology Algorithms
Early algorithms for searching sequence data depended on consensus sequences. Thus, to find a prokaryotic promoter, one would try to find something that matched a consensus -10 sequence (TATAAT), not too far downstream of a consensus -35 sequence (TTGACA).
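A strict consensus search of this kind reduces to exact pattern matching. In the sketch below the two hexamers come from the text, while the allowed spacer of 15 to 19 bases is an illustrative assumption for "not too far downstream". The function name is also made up.

```python
import re

# Consensus hexamers from the text; the 15-19 base spacer range is an
# illustrative assumption, not a value taken from the lecture.
# The lookahead lets overlapping candidates be reported as well.
PROMOTER = re.compile(r"(?=(TTGACA[ACGT]{15,19}TATAAT))")

def find_consensus_promoters(seq):
    """Return (start, matched_text) for every exact consensus match."""
    return [(m.start(), m.group(1)) for m in PROMOTER.finditer(seq.upper())]

perfect = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "CC"
near_miss = "GG" + "TTGACA" + "A" * 17 + "TATGAT" + "CC"
print(find_consensus_promoters(perfect))    # one hit starting at position 2
print(find_consensus_promoters(near_miss))  # no hits: one mismatch is enough
```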

It rapidly became clear that biologically significant sequences rarely perfectly matched a consensus, and more sophisticated approaches were adopted, including the use of matrices, Markov chains and hidden Markov models.

Matrices take into account the distribution of every possible nucleotide (or amino acid) at a position in a set of known sequences. Searching with a matrix is therefore more sensitive than searching with a consensus, and can find biological features that a strict consensus approach would miss.

Markov chains and hidden Markov models are probabilistic models of sequences, and have proven useful in database searching, gene finding and multiple sequence alignment.

Consensus v. Matrix

Consensus
-35                    -10
TTGACA.................TATAAT

Matrix

-35 Region
                              T  T  G  A  C  A
A 11  8  8  7  8  7  3  5  5  0  1  0 14  5  9  5
C  3  4  2  4  4  3  5  2  8  1  1  2  3 11  2  5
G  3  2  4  2  4  5  5  5  5  2  1 17  1  2  3  3
T  4  7  7  8  5  6  8  9  3 17 18  2  4  3  7  9

Spacer Region
Length  9 10 11 12 13 14 15
        1  6 14  6  1  1  1

-10 Region
                  T  A  T  A  A  T
A  4  5  3  4  4  0 20  5 12 11  0  7  4  6
C  5  4  5  4  5  2  0  3  3  4  1  2  7  6
G  2  5  5  8  7  2  0  3  3  3  0  6  5  6
T 10  6  8  5  6 17  1  9  3  4 20  6  5  4
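To illustrate how a matrix search differs from a consensus search, the sketch below scores six-base windows using the counts from the six columns under T A T A A T in the -10 matrix. Converting counts to log-odds scores with a pseudocount of 1 and a uniform background of 0.25 is an assumption made for this example, not a method stated in the lecture. The function names are illustrative.

```python
import math

# Counts for the six core columns of the -10 region (under T A T A A T).
COUNTS = {
    "A": [0, 20, 5, 12, 11, 0],
    "C": [2, 0, 3, 3, 4, 1],
    "G": [2, 0, 3, 3, 3, 0],
    "T": [17, 1, 9, 3, 4, 20],
}
WIDTH = 6
PSEUDOCOUNT = 1.0  # assumption: avoids log(0) for bases never observed
BACKGROUND = 0.25  # assumption: all four bases equally likely

def log_odds_matrix(counts):
    """Turn per-position counts into log2(observed / background) scores."""
    matrix = []
    for i in range(WIDTH):
        total = sum(counts[b][i] for b in "ACGT") + 4 * PSEUDOCOUNT
        matrix.append({b: math.log2((counts[b][i] + PSEUDOCOUNT) / total / BACKGROUND)
                       for b in "ACGT"})
    return matrix

MATRIX = log_odds_matrix(COUNTS)

def score(window):
    """Sum the per-position scores for a window of WIDTH bases."""
    return sum(MATRIX[i][base] for i, base in enumerate(window))

def best_hit(seq):
    """Return (score, start, window) for the highest-scoring window."""
    seq = seq.upper()
    return max((score(seq[i:i + WIDTH]), i, seq[i:i + WIDTH])
               for i in range(len(seq) - WIDTH + 1))

# TATGAT differs from the consensus at one position, so an exact consensus
# search misses it, but it still scores well against the matrix.
print(best_hit("GGCCTATGATGGCC"))
```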

Search Algorithms in Computational Biology

Global Alignment Search
Needleman-Wunsch algorithm, Needleman & Wunsch, 1970

• Finds the best alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps.
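The dynamic programming idea behind global alignment can be sketched as follows. This is a simplified illustration, not the Gap program: it assumes a flat score of +1 per match, -1 per mismatch and -1 per gap position rather than a substitution matrix and separate gap creation and extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b); '-' marks a gap.
    """
    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # leading gaps are penalized in a global alignment
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))

print(needleman_wunsch("GATTACA", "GATACA"))
```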

Local Alignment Search
Smith-Waterman algorithm, Smith & Waterman, 1981

• Makes an optimal alignment of the best segment of similarity between two sequences.

• Often better for comparing sequences of different lengths, or when looking at a particular region of interest.
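Local alignment uses the same kind of table, with two changes: no cell is allowed to fall below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The sketch below assumes illustrative scores of +2 per match, -1 per mismatch and -2 per gap position. It is a simplification, not the BestFit program.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b) for one best-scoring segment pair.
    """
    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            # The floor at 0 lets an alignment restart anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("TTTTGATTACATTTT", "CCCGATTACACCC"))
```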

Heuristic Approximations to Smith-Waterman
FASTA, Pearson, 1988
BLAST, Altschul, 1990
BLAST 2 (aka Gapped BLAST), Altschul, 1997

Global Alignment (Needleman-Wunsch)

Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps.

GAP RK2_ssb x Ecoli_ssb January 29, 2003 00:07

RK2_ssb      1 ..MSHNQFQFIGNLTRDTEVRHGNSNKPQAIFDIAVNEEWRNDA.GDKQE 47
Ecoli_ssb    1 ASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKATGEMKE 50

RK2_ssb     48 RTDFFRIKCFGSQAEAHGKYLGKGSLVFVQGKIRNTKY.EKDGQTVYGTD 96
Ecoli_ssb   51 QTEWHRVVLFGKLAEVASEYLRKGSQVYIEGQLRTRKWTDQSGQDRYTTE 100

RK2_ssb     97 FIAD...KVDYLDTKAPGGSNQE........................... 116
Ecoli_ssb  101 VVVNVGGTMQMLGGRQGGGAPAGGNIGGGQPQGGWGQPQQPQGGNQFSGG 150

RK2_ssb        ...........................
Ecoli_ssb  151 AQSRPQQSAPAAPSNEPPMDFDDDIPF 177

Matrix: blosum62
Gap Penalties: default
Length: 177
Percent Similarity: 45.690
Percent Identity: 32.759

Local Alignment (Smith-Waterman)

BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local homology algorithm of Smith and Waterman.

BESTFIT RK2_ssb x Ecoli_ssb January 29, 2003 00:08

RK2_ssb      4 NQFQFIGNLTRDTEVRHGNSNKPQAIFDIAVNEEWRNDA.GDKQERTDFF 52
Ecoli_ssb    6 NKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKATGEMKEQTEWH 55

RK2_ssb     53 RIKCFGSQAEAHGKYLGKGSLVFVQGKIRNTKY.EKDGQTVYGTDFIAD 100
Ecoli_ssb   56 RVVLFGKLAEVASEYLRKGSQVYIEGQLRTRKWTDQSGQDRYTTEVVVN 104

Matrix: blosum62
Gap Penalties: default
Length: 99
Percent Similarity: 50.515
Percent Identity: 36.082
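The reported percent identity can be recomputed from the two aligned rows. The sketch below assumes identity is counted only over columns that contain no gap. That denominator is an assumption about how the program computes the figure, though it does reproduce the 36.082 reported for the BestFit alignment. The function name is illustrative.

```python
def percent_identity(row_a, row_b, gap="."):
    """Percent of gap-free alignment columns that hold identical residues."""
    columns = [(x, y) for x, y in zip(row_a, row_b) if x != gap and y != gap]
    identical = sum(1 for x, y in columns if x == y)
    return 100.0 * identical / len(columns)

# The two rows of the BestFit alignment of RK2_ssb and Ecoli_ssb.
RK2 = ("NQFQFIGNLTRDTEVRHGNSNKPQAIFDIAVNEEWRNDA.GDKQERTDFF"
       "RIKCFGSQAEAHGKYLGKGSLVFVQGKIRNTKY.EKDGQTVYGTDFIAD")
ECOLI = ("NKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKATGEMKEQTEWH"
         "RVVLFGKLAEVASEYLRKGSQVYIEGQLRTRKWTDQSGQDRYTTEVVVN")

print(len(RK2))                                # 99 alignment columns
print(round(percent_identity(RK2, ECOLI), 3))  # 36.082
```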

Global v. Local Alignment

The local alignment (BestFit) covers only residues 4 to 100 of RK2_ssb and 6 to 104 of Ecoli_ssb, the best segment of similarity between the two sequences. The global alignment (Gap) spans both complete sequences, residues 1 to 116 of RK2_ssb and 1 to 177 of Ecoli_ssb, and so includes a long terminal gap where RK2_ssb has no counterpart to the end of Ecoli_ssb.

Search Algorithm Details

Dayhoff Matrices
For protein comparison, all of these search algorithms use Dayhoff substitution matrices, which encode the log likelihoods of amino acid substitutions. When scoring an alignment, penalties are assigned for what the substitution matrix considers poor substitutions: the worse the substitution, the greater the penalty.

Blosum 62
Blosum 62 is the default matrix for NCBI BLAST. It is optimized for known close homologies.

PAM 250
PAM 250 is a matrix which is optimized for known distant homologies.

Gaps
When scoring an alignment, penalties are assigned for creating and extending gaps. The longer the gap, the greater the penalty.
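One common way to combine a creation penalty with an extension penalty is an affine function of gap length. The sketch below uses made-up values (10 to create, 1 per gapped position) that are not the defaults of any particular program. Conventions also differ on whether the first gapped position is charged the extension cost.

```python
def gap_penalty(length, gap_open=10, gap_extend=1):
    """Affine gap penalty: a one-time creation cost plus a cost per gapped position.

    The default values are illustrative assumptions only.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * length

for n in (1, 2, 5, 27):
    print(n, gap_penalty(n))
```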

Variables
• Can vary the gap penalties.
• Can get different results depending on which matrix you use. The choice should depend on the task.
• Can get different results when using the slower, but optimal Smith-Waterman algorithm, or a faster heuristic approximation such as BLAST.

Computational Biology and the Internet

Overview
In 1981 there were 213 computers acting as Internet hosts; in 2001 there were over 60 million Internet hosts. Computational biology uses applications with Internet connectivity (EndNote, MacVector), Internet applications (DNA Artist), Web applications (BLAST, GenMark) and Web databases (GenBank, PubMed), among other Internet resources.

Email
Recommend using Apple’s Mail.

Usenet
Recommend using groups.google.com.

FTP (File Transfer Protocol)
Recommend using Fetch and Internet Explorer.

WWW (World Wide Web)
Recommend Microsoft’s Internet Explorer and Apple's Safari for browsing, www.google.com for searching.

URL (Uniform Resource Locator)
http://www.columbia.edu
ftp://ftp.ncbi.nlm.nih.gov
file://Macintosh%20HD/Documents/Lecture%201.pdf (note how a blank space “ ” is replaced by “%20”).
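The replacement of a space by %20 is percent-encoding, which Python's standard library can apply and reverse. A short sketch using the path from the example URL:

```python
from urllib.parse import quote, unquote

path = "Macintosh HD/Documents/Lecture 1.pdf"
encoded = quote(path)  # spaces become %20; "/" is left alone by default
print(encoded)         # Macintosh%20HD/Documents/Lecture%201.pdf
print(unquote(encoded))
```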

Internet Protocols

TCP/IP (Transmission Control Protocol/Internet Protocol)
A multilayered protocol architecture that allows for the transmission of data over networks.

Router
A device that routes packets of data between networks. A router sits between your computer and local area network and the networks beyond it.

Datagram
Packets of data containing the IP addressing information needed to switch them from one network to another until they arrive at their final destination.

TCP (Transmission Control Protocol)
Provides reliable delivery of datagrams using connection oriented streaming with error detection and correction.

UDP (User Datagram Protocol)
Provides low overhead connectionless datagram delivery.

Internet Addressing

IP (Internet Protocol) Address
An IP address is a 32 bit number, written in the form of four decimal numbers in the range 0-255 that are separated by dots (e.g. 128.59.59.214).
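Each of the four decimal numbers is one byte of the 32 bit value. The conversion in both directions can be sketched with plain bit shifting. The function names are illustrative.

```python
def ip_to_int(dotted):
    """Convert a dotted-decimal IPv4 address into its 32 bit integer value."""
    parts = [int(p) for p in dotted.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError("expected four numbers in the range 0-255")
    value = 0
    for p in parts:
        value = (value << 8) | p  # each number fills the next 8 bits
    return value

def int_to_ip(value):
    """Convert a 32 bit integer back into dotted-decimal form."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))

print(format(ip_to_int("128.59.59.214"), "032b"))
print(int_to_ip(ip_to_int("128.59.59.214")))
```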

Subnet Mask
A subnet mask allows for defining a local network, called a subnet, within a larger network.
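As a sketch of what a mask does, Python's standard `ipaddress` module can derive the subnet from an address and mask and test whether other addresses fall inside it. The address and mask are ones that appear in the Health Sciences settings later in this lecture. Pairing them this way is only for illustration.

```python
import ipaddress

# An address combined with the mask 255.255.255.0: the first 24 bits name
# the subnet, the last 8 bits name a host within it.
interface = ipaddress.ip_interface("156.111.60.150/255.255.255.0")
subnet = interface.network

print(subnet)                # 156.111.60.0/24
print(subnet.num_addresses)  # 256
print(ipaddress.ip_address("156.111.60.1") in subnet)    # True
print(ipaddress.ip_address("156.111.70.150") in subnet)  # False
```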

DNS (Domain Name Server)
These specialized servers automatically translate an easy to remember domain name (e.g. cancercenter.columbia.edu) into the appropriate IP address (e.g. 156.111.5.92).

Health Sciences Internet Setup

Overview
• The Health Sciences campus network has two core routers, both redundantly linked to a router in each building. Each floor of a building then has its own router. Several microwave dishes connect the core routers to the downtown Columbia campus, which has multiple high speed cable connections to the rest of the Internet.
• The network is centrally administered by a group called CUbhis (Columbia University Biomedical and Health Information Services).
• There are also AppleTalk zones (an older networking protocol still used by Apple) throughout the Health Sciences campus, but these are not centrally administered.

IP Address: 156.111.x.x or 156.145.x.x

Subnet Mask: 255.255.255.0

Router: 156.x.x.1

DNS Servers
156.111.60.150

156.111.70.150

Search Domains (Optional)

columbia.edu

NCBI and PubMed

NCBI (National Center for Biotechnology Information)
http://www.ncbi.nlm.nih.gov

PubMed
http://ncbi.nlm.nih.gov/pubmed

EndNote and PubMed
• EndNote can directly connect to PubMed and search and retrieve references from it.
• Alternately, the PubMed Clipboard can be used to collect references, which are then exported as a text file in MEDLINE format, which can be imported by EndNote.
Send to Clipboard, then select Display MEDLINE and Send to Text, click on Send to, then Save As Plain Text,
or select Display MEDLINE and Send to File, click on Send to, then Save File As.

EndNote

Directly Connecting to PubMed with EndNote
Edit -> Connection Files -> Open Connection Manager -> PubMed (NLM). Click to mark as favorite.

Tools -> Connect -> PubMed (NLM)

Importing PubMed text files in MEDLINE format with EndNote
Edit -> Import Filters -> Open Filter Manager -> PubMed (NLM). Click to mark as favorite.

File -> Import -> Import Options: PubMed (NLM), click Choose File, select file, then click Import.

EndNote and PDF Files
• EndNote can also be used to organize PDF files, which are otherwise easy to lose track of. It is possible to add PDFs to the Image field of a Chart, Equation or Figure reference type, but to organize PDFs most effectively, it is best to add an Image field called "PDF" to the default Journal Article reference type.
EndNote -> Preferences, select Reference Types, click on Modify Reference Types, type PDF into the blank Image field of Journal Article, then click OK, then click Save.
Select References -> Insert Object, which inserts the file into the Image field. Then double click to open the PDF.

Computational Biology Resources

Books
Developing Bioinformatics Computer Skills, by Cynthia Gibas & Per Jambeck.
Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Second Edition, by A.D. Baxevanis & B.F.F. Ouellette.
Introduction to Computational Biology: Maps, Sequences and Genomes, by Michael S. Waterman.
Bioinformatics: Sequence and Genome Analysis, by David W. Mount.
Biological Sequence Analysis, by Richard Durbin, et al.

Journals
Bioinformatics (formerly CABIOS)
Journal of Computational Biology
Comparative and Functional Genomics
Biotechnology Software & Internet Journal

Websites
http://www.ncbi.nlm.nih.gov National Center for Biotechnology Information
http://bio.oreilly.com O'Reilly Bioinformatics
http://www.iscb.org International Society for Computational Biology
http://www.sanger.ac.uk Wellcome Trust Sanger Institute
http://www.ebi.ac.uk European Bioinformatics Institute
http://www.google.com Google