Bioinformatics Class

BIOINFORMATICS

"Biology easily has 500 years of exciting problems to work on." -- Donald E. Knuth

Ever since the structure of DNA was unraveled in 1953, molecular biology has witnessed tremendous advances.

The need to process ever-growing biological data has created entirely new problems that are interdisciplinary in nature.

Scientists from the biological sciences are the creators and ultimate users of this data.

Because of the huge size and high complexity of biological data, help from other disciplines, in particular mathematics and computer science, is required.

This need has created a new field: "Computational Molecular Biology & Bioinformatics".

The Commercial Market

The current bioinformatics market is worth $300 million / year (half of it software)

Prediction: $2 billion / year in 5-6 years

~50 bioinformatics companies: Genomatrix Software, Genaissance Pharmaceuticals, Lynx, Lexicon Genetics, DeCode Genetics, CuraGen, AlphaGene, Bionavigation, Pangene, InforMax, TimeLogic, GeneCodes, LabOnWeb.com, Darwin, Celera, Incyte, BioResearch Online, BioTools, Oxford Molecular, Genomica, NetGenics, Rosetta, Lion BioScience, DoubleTwist, eBioinformatics, Prospect Genomics, Neomorphic, Molecular Mining, GeneLogic, GeneFormatics, Molecular Simulations, Bioinformatics Solutions, ..., Biocon (India)

"Computational Molecular Biology & Bioinformatics" (CMB)

"CMB consists of the development and use of computer science and mathematical techniques to solve problems in molecular biology."

Bioinformatics

Unit 1. Basic Concepts
Unit 2. Suffix Trees and Applications
Unit 3. Sequence Alignment: Pairwise Alignment, Multiple Alignments
Unit 4. Sequencing
Unit 5. Motif Prediction

Bioinformatics

Unit 1. Basic Concepts of Molecular Biology: Cellular architecture, nucleic acids (RNA & DNA), DNA replication, repair and recombination. Transcription, genetic code, gene expression, protein structure and function, molecular biology tools. Statistical Methods: Estimation, hypothesis testing, random walks, Markov models (HMM).

Unit 2. Suffix Trees: Definition and examples, Ukkonen's linear-time suffix tree algorithm, applications (exact string matching, longest common substrings of two strings, recognizing DNA contamination). Pairwise Sequence Alignment (edit distance, dynamic programming calculation of edit distance, string similarity, gaps).

Unit 3. Sequence Alignment: Pairwise sequence alignment (local), HMM for pairwise alignment. Multiple String Alignments: need for MSA, family and superfamily representation, multiple sequence comparison for structural inferences, multiple alignments with sum-of-pairs and consensus objective functions. Profile HMM for multiple sequence alignment. Database searching for similar sequences (FASTA, BLAST), PAM and BLOSUM substitution matrices.

Unit 4. Sequencing: Fragment assembly (shortest common superstring, algorithms based on multigraphs), sequencing by hybridization, protein sequencing.

Unit 5. Motif Prediction: Motif prediction, gene prediction, introduction to protein structure prediction.

BOOKS RECOMMENDED

Dan Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997. (Chapters: 5, 6, 7, 10, 11, 14, 15)

J. Setubal & J. Meidanis, Introduction to Computational Molecular Biology, PWS Publishing Company, 1997. (Chapters: 1, 8)

W.J. Ewens & G.R. Grant, Statistical Methods in Bioinformatics, Springer, 1989.

R. Durbin, S.R. Eddy, A. Krogh and G.J. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998. (Chapters: 3, 5 & 6)

R.C. Deonier, S. Tavaré, M.S. Waterman, Computational Genome Analysis, Springer, 2005.

N.C. Jones and P.A. Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.

D.E. Krane, M.L. Raymer, Fundamental Concepts of Bioinformatics, Pearson Education, 2003.

J. Tisdall, Beginning Perl for Bioinformatics, O'Reilly, 2001.

M.S. Waterman, Introduction to Computational Biology, CRC Press, 2000.

A. Baxevanis & B. Ouellette, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Wiley-Interscience, 2001.

M. Ridley, Genome: The Autobiography of a Species, Fourth Estate, 1999.

Lodish, Berk, Zipursky, Baltimore & Darnell, Molecular Cell Biology, W.H. Freeman, 2000.

Class 2

UNIT 1

LIFE AT ITS SIMPLEST: DNA, RNA, PROTEIN, GENETIC CODE

QUICK PRIMER ON GENETICS

EVERY CELL IN THE HUMAN BODY CONTAINS A COPY OF THE GENOME.

THINK OF THE GENOME AS A BOOK: THE BLUEPRINT THAT CONTAINS DETAILS OF WHAT EACH INDIVIDUAL OUGHT TO BE LIKE. EACH HUMAN GENOME CONTAINS 23 PAIRS OF CHROMOSOMES.

IF THE GENOME WERE A BOOK, THEN THINK OF CHROMOSOMES AS THE CHAPTERS IN IT.

EACH OF THESE CHAPTERS TELLS SEVERAL THOUSAND STORIES CALLED GENES,

e.g. the colour of an individual's skin, eyes and hair, left- or right-handedness, IQ, and everything else that matters.

What did Venter do?

He figured out what language the book was written in and how to read it.

He figured out the grammar of the book and, therefore, how to write in the language.

This is the kind of language that would allow him to create life.

Venter... magic... new life?

Venter used his knowledge to create Mycoplasma laboratorium, a chromosome that is 381 genes long. Transplanted into a living cell, it is expected to take control of the cell and become a new "life-form".

The new life-form could mop up excess carbon dioxide and contribute to resolving problems like global warming.

Until now, scientists had managed to take the genome out of one cell, put it into another cell, and create an altogether new organism.

But nobody knew how to create the genome itself. Venter did just that.


Living vs. Nonliving

Both kinds of matter are composed of the same atoms and conform to the same physical and chemical rules.

What is the difference, then?

Living things can move, reproduce, grow, and eat.
They actively participate in their environment.

Living beings act the way they do because of a complex array of chemical reactions that occur inside them. These reactions never cease.

A living organism is constantly exchanging matter and energy with its surroundings.

Anything that is in equilibrium with its surroundings can generally be considered dead.

(Exceptions are vegetative forms, like seeds, and viruses, which may be completely inactive for long periods of time and are not dead.)

Life starts...

Life started some 3.5 billion years ago, shortly after the Earth itself was formed.

The first life forms were very simple, but over billions of years a continuously acting process called evolution made them evolve and diversify.

Both complex and simple organisms have a similar molecular chemistry, or biochemistry.

The main actors in the chemistry of life are molecules called proteins and nucleic acids.

ACTORS IN THE CHEMISTRY OF LIFE: proteins & nucleic acids

Proteins are responsible for what a living being is and does in a physical sense. "We are our proteins" -- Russell Doolittle

Nucleic acids, on the other hand, encode the information necessary to produce proteins and are responsible for passing this "recipe" along to subsequent generations.

Recent research is devoted to understanding the structure and function of proteins and nucleic acids.

Proteins

Different roles of proteins:
Enzymes
Carry signals
Transport small molecules such as oxygen
Form cellular structures (tissues)
Regulate cell processes (such as defense mechanisms)

What are proteins made of?

Amino acids: a chain of amino acids = a protein

Amino acids

Backbone of the polypeptide chain

Convention: begin at the N-terminal, end at the C-terminal

Torsion (rotation) angles around: the N-Cα bond (φ) and the Cα-C bond (ψ)

PROTEIN STRUCTURE

A protein is not just a linear sequence of residues (its primary structure).
Proteins actually fold in 3D, presenting secondary, tertiary and quaternary structures.
The 3D shape of a protein is related to its function.
Proteins can be made out of 20 different kinds of amino acids, which makes the resulting 3D structure very complex and without symmetry.
No simple and accurate method for determining the 3D structure is known.

Genomic Code

DNA: deoxyribonucleic acid

Basic unit = nucleotide: sugar, phosphate, base (A, G, T, C)

Bases: adenine, thymine, cytosine, guanine.

DNA is a chain of simpler molecules.
Actually it is a double chain (two strands).
Each single chain has a backbone consisting of repetitions of the same basic unit.
This unit is formed by a sugar molecule called 2'-deoxyribose attached to a phosphate residue.
The sugar molecule contains five carbon atoms, labeled 1' through 5'.
DNA molecules also have an orientation (a strand starts at the 5' end and finishes at the 3' end).

Attached to each 1' carbon in the backbone are other molecules called bases.

There are 4 kinds of bases: A (adenine), G (guanine), C (cytosine), T (thymine).

Bases A & G belong to a larger group of substances called purines, whereas C & T belong to the pyrimidines.
When we see the basic unit of a DNA molecule as consisting of the sugar, the phosphate, and its base, we call it a nucleotide.
Bases and nucleotides are not the same thing.
A DNA molecule having only a few nucleotides is referred to as an oligonucleotide.
DNA molecules in nature are very long, much longer than proteins.
In a human cell, each DNA molecule has hundreds of millions of nucleotides.

DNA molecules are double stranded.
The two strands are tied together in a helical structure (Watson & Crick, 1953).
Each base in one strand is paired with a base in the other strand.
A pairs with T and C pairs with G (complementary bases, or Watson-Crick base pairs).
The base pair (bp) provides the unit of length.

RNA

RNA is a nucleic acid made from a long chain of nucleotides.

Each nucleotide consists of a nitrogenous base, a ribose sugar, and a phosphate.

RNA is very similar to DNA, but differs in the following basic compositional and structural ways.

DNA vs. RNA

Strands: DNA is double stranded; RNA is single stranded.
Length: DNA is a very long chain of nucleotides; RNA is a comparatively shorter chain.
Sugar: DNA contains deoxyribose; RNA contains ribose.
Stability: DNA is more stable; RNA is less stable and more prone to hydrolysis.
Base pairing: the complementary nucleotide to adenine is thymine in DNA and uracil in RNA.
Function: DNA performs essentially one function; there are different kinds of RNA performing different functions.

Class 3

26/06/08

Central Dogma of Molecular Biology

How does the information in DNA result in proteins?

A promoter is a region before each gene in the DNA that serves as an indication to the cellular machinery that a gene is ahead.

Having recognized the beginning of a gene, the cell makes a copy of the gene on an RNA molecule.

The resulting RNA is mRNA (the same sequence, with U substituted for T). This process is called transcription.

The mRNA will be used to manufacture a protein.

After transcription, the introns are spliced out of the mRNA. Introns are the parts of a gene that are not used in protein synthesis.

After the introns are spliced out, the shortened mRNA, containing copies of only the exons plus regulatory regions at the beginning and end, leaves the nucleus.

Because of the intron/exon phenomenon, we use different names for the entire gene and for the spliced sequence consisting of exons only.

The former is called genomic DNA and the latter complementary DNA, or cDNA.

tRNAs are the molecules that actually implement the genetic code, in a process called translation. They make the connection between a codon and the specific amino acid this codon codes for.

When a stop codon appears, no tRNA associates with it and synthesis ends.

Central Dogma: DNA -> RNA -> Protein

DNA: CCTGAGCCAACTATTGATGAA
  (transcription)
RNA: CCUGAGCCAACUAUUGAUGAA
  (translation)
Protein: PEPTIDE
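The example sequence on this slide can be followed through both steps in a few lines of Python. This is only a minimal sketch: the function names `transcribe` and `translate` are illustrative, the standard genetic code is assumed, and splicing is ignored (the input is treated as an already spliced coding sequence).

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("CCTGAGCCAACTATTGATGAA")
    print(mrna)             # CCUGAGCCAACUAUUGAUGAA
    print(translate(mrna))  # PEPTIDE
```

The one-letter amino acid codes spell out the word PEPTIDE, which is why this particular sequence appears on the slide.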

JUNK DNA

Genes are certain contiguous regions of the chromosome, but they do not cover the entire molecule.

There are intergenic regions which do not have any known function. They are called "junk DNA" because they appear to be there for no particular use.

Recent research has shown that junk DNA has more information content than previously believed.

The amount of junk DNA varies from species to species.

In humans, more than 90% of the DNA is junk DNA.

OPEN READING FRAME (ORF)

An ORF in a DNA sequence is a contiguous stretch of the sequence beginning at a start codon and containing an integral number of codons, none of which is a stop codon.

Consider the sequence TAATCGAATGGGC:
first reading frame: TAA TCG AAT GGG
second reading frame: AAT CGA ATG GGC
third reading frame: ATC GAA TGG
fourth reading frame: TCG AAT GGG

The fourth frame is a subset of the frame starting at position 1, so there are only three distinct frames on one strand.
Sometimes we talk about 6, not 3, different frames in a sequence (look at the opposite strand and count another 3).
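The reading frames of the example sequence can be enumerated programmatically. The sketch below is illustrative only: the helper names are made up for this example, ATG is assumed as the start codon, and TAA, TAG and TGA are assumed as the stop codons (the standard code).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def reading_frames(dna):
    """Split one strand into its three reading frames (lists of full codons)."""
    return [
        [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]
        for offset in range(3)
    ]


def six_frames(dna):
    """Three frames on the given strand plus three on the opposite strand."""
    return reading_frames(dna) + reading_frames(reverse_complement(dna))


def orfs_in_frame(codons):
    """Collect stretches that begin at a start codon and contain no stop codon."""
    orfs, current = [], None
    for codon in codons:
        if current is None:
            if codon == START_CODON:
                current = [codon]
        elif codon in STOP_CODONS:
            orfs.append("".join(current))
            current = None
        else:
            current.append(codon)
    if current:
        orfs.append("".join(current))
    return orfs


if __name__ == "__main__":
    for frame in six_frames("TAATCGAATGGGC"):
        print(" ".join(frame), "->", orfs_in_frame(frame))
```

For the slide's sequence, only the second frame contains a start codon, so it is the only frame on the given strand that yields an ORF under this definition.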

Genome

The complete set of chromosomes inside a cell is called a genome.

The number of chromosomes in a genome is characteristic of the species.

Every cell in a human being has 46 chromosomes, whereas in mice this number is 40

Is Genome like a computer program?

Genome = computer program
The genome of an organism is seen as a computer program that completely specifies the organism.

Cell machinery = interpreter of this program

Biological functions performed by proteins = execution of this program

Class 4

01/07/08

The Eye of the Fly

Fruit flies (Drosophila melanogaster) have a gene called eyeless which, if it is "knocked out" (i.e. eliminated from the genome using molecular biology methods), results in fruit flies without eyes.

It is obvious that the eyeless gene plays a role in eye development.

Researchers have identified a human gene responsible for a condition called aniridia.

In humans who are missing this gene (or in whom the gene has mutated just enough for its protein product to stop functioning properly), the eyes develop without irises.

If the gene for aniridia is inserted into an eyeless Drosophila "knockout", it causes the production of normal Drosophila eyes. This is an interesting observation.

Could there be some similarity in how eyeless and aniridia function, even though flies and humans are vastly different organisms?

To gain insight into how eyeless and aniridia work, we can compare their sequences.

20 years ago, finding similarity between the eyeless and aniridia DNA sequences would have been like looking for a needle in a haystack.

Most scientists compared the respective gene sequences by hand, aligning them one under the other in a word processor and looking for matches character by character.

This was time-consuming and hard on the eyes.

In the late 1980s, fast computer programs for comparing sequences changed molecular biology forever.

Many tools that are widely available to the biology community, including everything from multiple alignment, phylogenetic analysis, motif identification, and homology modeling software to web-based database search services, rely on pairwise sequence comparison algorithms as a core element of their function.

How is the genome studied? Say, the human genome.

Sequencing: The basic information we want to extract from any piece of DNA is its base pair sequence. The process of obtaining this information is called sequencing.

A human chromosome has around 10^8 base pairs, but the largest pieces of DNA that can be sequenced in the laboratory are about 700 bp long.

=> There is a gap of some 10^5 between the scale of what we can actually sequence and the size of a chromosome.

This gap is at the heart of many problems in computational biology.
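The size of this gap follows directly from the two figures on the slide. A small sketch of the arithmetic (function names are illustrative; the "minimum fragments" figure assumes perfect, non-overlapping coverage, which real sequencing does not achieve):

```python
import math


def scale_gap_order(chromosome_bp, read_bp):
    """Order of magnitude (power of ten) of the chromosome-to-read size ratio."""
    return round(math.log10(chromosome_bp / read_bp))


def min_fragments(chromosome_bp, read_bp):
    """Fewest fragments of length read_bp that could tile the chromosome.

    Integer ceiling division; assumes no overlap and no gaps.
    """
    return -(-chromosome_bp // read_bp)


if __name__ == "__main__":
    chromosome_bp = 10**8  # approximate chromosome size from the slide
    read_bp = 700          # longest sequenceable piece from the slide
    print(scale_gap_order(chromosome_bp, read_bp))  # 5, i.e. a gap of about 10^5
    print(min_fragments(chromosome_bp, read_bp))    # 142858
```

In practice fragments must overlap so that they can be assembled, so the real number of reads needed is several times larger than this lower bound.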

Cutting & Breaking DNA

Because a DNA molecule is so long, some tool is needed to cut it at specific points (like a pair of scissors) or to break it apart in some way.

The pair of scissors is provided by restriction enzymes.

They cut DNA molecules at all places where a certain sequence appears (usually a palindromic sequence, i.e. one that equals its own reverse complement).

Some common types of restriction enzymes are 4-cutters, 6-cutters, and 8-cutters.

It is rare to see an odd cutter, because sequences of odd length cannot be palindromes in this sense: the middle base would have to be its own complement.
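The palindrome property and the search for recognition sites can be sketched as follows. The helper names are hypothetical, and GAATTC (the recognition sequence of the 6-cutter EcoRI) is used only as a convenient example; the code reports where the site occurs, not the exact position of the cut within it.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def is_palindromic_site(site):
    """A site is palindromic if it reads the same on both strands."""
    return site == reverse_complement(site)


def find_sites(dna, site):
    """Return the 0-based start positions of every occurrence of site in dna."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions


if __name__ == "__main__":
    print(is_palindromic_site("GAATTC"))           # True
    print(find_sites("AAGAATTCGGGAATTC", "GAATTC"))  # [2, 10]
    # No sequence of odd length is palindromic: check all 5-mers.
    five_mers = ("".join(p) for p in product("ACGT", repeat=5))
    print(any(is_palindromic_site(s) for s in five_mers))  # False
```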

COPYING DNA

Also known as DNA amplification.
Very important in DNA cloning.
Given a piece of DNA, one way of obtaining further copies is to use nature itself.
We insert this piece into the genome of an organism (the host) and then let the organism multiply.
As the host multiplies, the inserted piece gets multiplied along with the original DNA.
Then we kill the host and dispose of the rest, keeping only the inserts in the desired quantity.
DNA produced in this way is called recombinant.

Reading & measuring DNA

Reading is done with a technique known as gel electrophoresis, which is based on the separation of molecules by their size.

This process involves a gel medium and a strong electric field.

HUMAN GENOME PROJECT

What is the Human Genome Project?

A U.S. government project coordinated by the Department of Energy and the National Institutes of Health.

Goals (1998-2003):
identify the approximately 100,000 genes in human DNA
determine the sequences of the 3 billion bases that make up human DNA
store this information in databases
develop tools for data analysis
address the ethical, legal, and social issues that arise from genome research

Why is the Department of Energy involved?

After atomic bombs were dropped during World War II, Congress told DOE's predecessor agencies to conduct studies to understand the biological and health effects of radiation and of the chemical by-products of all energy production.

The best way to study these effects is at the DNA level.

Whose genome is being sequenced?

The first reference genome is a composite genome from several different people,

generated from 10-20 primary samples taken from numerous anonymous donors across racial and ethnic groups.

Benefits of HGP Research

improvements in medicine
microbial genome research for fuel and environmental cleanup
DNA forensics
improved agriculture and livestock
better understanding of evolution and human migration
more accurate risk assessment

Ethical, Legal, and Social Implications of HGP Research

fairness in the use of genetic information
privacy and confidentiality
psychological impact and stigmatization
genetic testing
reproductive issues
education, standards, and quality control
commercialization
conceptual and philosophical implications

For More Information...

Human Genome Project Information Website: http://www.ornl.gov/hgmis

A large effort like this cannot be undertaken by a single lab!

On the computer science side, databases with updated and consistent information have to be maintained, and fast access to the data has to be provided.

After the sequencing, there is still the difficult task of analyzing the data obtained.

Work on treating genetic diseases based on data produced by the Human Genome Project is still going on, although encouraging pioneering efforts have already yielded results.

Class 5

08/07/08

What is a database?

A collection of information, usually stored in an electronic format, that can be searched by a computer.

A brief history of biological databases

1965: M. O. Dayhoff et al. publish "Atlas of Protein Sequence and Structure"

1982: EMBL initiates a DNA sequence database, followed within a year by GenBank (then at LANL) and in 1984 by the DNA Data Bank of Japan

1988: EMBL/GenBank/DDBJ agree on a common format for data elements

Biological databases: why?

There are two main functions of biological databases:

To make biological data available to scientists. As much as possible of a particular type of information should be available in one single place (book, site, database). Published data may be difficult to find or access, and collecting it from the literature is very time-consuming. Moreover, not all data is actually published explicitly in an article (genome sequences!).

To make biological data available in computer-readable form. Since analysis of biological data almost always involves computers, having the data in computer-readable form (rather than printed on paper) is a necessary first step.

The different types of databases

One may characterize the available biological databases by several different properties. Here is a list to help you think about the various properties a particular database may have.

Type of data
  nucleotide sequences
  protein sequences
  proteins
  sequence patterns or motifs
  macromolecular 3D structure
  gene expression data
  metabolic pathways

Data entry and quality control
  Scientists (teams) deposit data directly
  Appointed curators add and update data
  Are erroneous data removed or marked?
  Type and degree of error checking
  Consistency, redundancy, conflicts, updates

Primary or derived data
  Primary databases: experimental results entered directly into the database
  Secondary databases: results of analysis of primary databases
  Aggregate of many databases: links to other data items, combination of data, consolidation of data

Technical design
  Flat files
  Relational database (SQL)
  Object-oriented database (e.g. CORBA, XML)

Maintainer status
  Large, public institution (e.g. EMBL, NCBI)
  Quasi-academic institute (e.g. Swiss Institute of Bioinformatics, TIGR)
  Academic group or scientist
  Commercial company

Availability
  Publicly available, no restrictions
  Available, but with copyright
  Accessible, but not downloadable
  Academic, but not freely available
  Proprietary, commercial; possibly free for academics

Accession codes vs identifiers

Many databases in bioinformatics (SWISS-PROT, EMBL, GenBank, Pfam) use a system where an entry can be identified in two different ways.

Basically, it has two names: an identifier and an accession code (or number).

Identifier

An identifier ("locus" in GenBank, "entry name" in SWISS-PROT) is a string of letters and digits that generally is interpretable in some meaningful way by a human, for instance as a recognizable abbreviation of the full protein or gene name.

SWISS-PROT uses a system where the entry name consists of two parts: the first denotes the protein and the second part denotes the species it is found in. For example, KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo sapiens.

An identifier can change. For example, the database curators may decide that the identifier for an entry is no longer appropriate. However, this does not happen very often. In fact, it happens so rarely that it's not really a big problem.

Accession code (number)

An accession code (or number) is a number (possibly with a few characters in front) that uniquely identifies an entry in its database. For example, the accession code for KRAF_HUMAN in SWISS-PROT is P04049.

The main conceptual difference from the identifier is that it is supposed to be stable: any given accession code will, as soon as it has been issued, always refer to that entry, or its ancestors. It is often called the primary key for the entry. The accession code, once issued, must always point to its entry, even after large changes have been made to the entry. This means that in discussions about specific database entries (e.g. an article about a specific protein), one should always give the accession code for the entry in the relevant database.

When two entries are merged into a single one, the new entry will have both accession codes, one as the primary and the other as the secondary accession code. When an entry is split into two, both new entries will get new accession codes, but will also have the old accession code as a secondary code.
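The merge and split rules can be modelled with a small data structure. This is a hypothetical sketch, not how any of the databases actually store entries: the class `Entry`, the helper names, and every accession code other than P04049 are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    identifier: str   # human-readable entry name; may change over time
    primary: str      # stable primary accession code
    secondary: list = field(default_factory=list)  # older accession codes

    def accessions(self):
        """All accession codes that resolve to this entry."""
        return [self.primary] + self.secondary


def parse_entry_name(entry_name):
    """Split a SWISS-PROT style entry name into (protein, species)."""
    protein, species = entry_name.split("_", 1)
    return protein, species


def merge_entries(kept, absorbed):
    """Merge two entries; the absorbed entry's codes become secondary."""
    return Entry(kept.identifier, kept.primary,
                 kept.secondary + absorbed.accessions())


def split_entry(entry, new_entries):
    """Split an entry; each part gets a new primary code and keeps the old codes.

    new_entries is a list of (identifier, new_accession) pairs.
    """
    return [Entry(ident, acc, entry.accessions()) for ident, acc in new_entries]


if __name__ == "__main__":
    print(parse_entry_name("KRAF_HUMAN"))  # ('KRAF', 'HUMAN')
    raf = Entry("KRAF_HUMAN", "P04049")
    other = Entry("DEMO_HUMAN", "X00001")  # hypothetical entry
    print(merge_entries(raf, other).accessions())  # ['P04049', 'X00001']
```

Because old codes are never discarded, a lookup by any accession code that was ever issued still finds the current entry or entries.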

Nucleotide sequence databases

Primary nucleotide sequence databases

The databases EMBL, GenBank, and DDBJ are the three primary nucleotide sequence databases:

They include sequences submitted directly by scientists and genome sequencing groups, and sequences taken from literature and patents. There is comparatively little error checking and there is a fair amount of redundancy.

The entries in the EMBL, GenBank and DDBJ databases are synchronized on a daily basis, and the accession numbers are managed in a consistent manner between these three centers.

The nucleotide databases have reached such large sizes that they are available in subdivisions that allow searches or downloads that are more limited, and hence less time-consuming. For example, GenBank currently has 17 divisions.

There are no legal restrictions on the use of the data in these databases. However, there are some patented sequences in the databases.

EMBL (www.ebi.ac.uk/embl/)
The EMBL (European Molecular Biology Laboratory) nucleotide sequence database is maintained by the European Bioinformatics Institute (EBI) in Hinxton, Cambridge, UK.

GenBank (www.ncbi.nlm.nih.gov/Genbank/)
The GenBank nucleotide database is maintained by the National Center for Biotechnology Information (NCBI), which is part of the National Institutes of Health (NIH), a federal agency of the US government.
It can be accessed and searched through the Entrez system at NCBI (http://www.ncbi.nlm.nih.gov/Entrez/), or one can download the entire database as flat files.

DDBJ (www.ddbj.nig.ac.jp)
The DNA Data Bank of Japan began as a collaboration with EMBL and GenBank. It is run by the National Institute of Genetics. One can search for entries by accession number.

Other nucleotide sequence databases (secondary databases)

The following databases contain subsets of the EMBL/GenBank databases. Some also contain more information or links than the primary ones, or have a different organization of the data to better serve some specific purpose. However, the nucleotide sequences themselves should always be available in the EMBL/GenBank databases. In this sense, the databases below are secondary databases.

UniGene (http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene)
The UniGene system attempts to process the GenBank sequence data into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.

SGD (http://www.yeastgenome.org/)
The Saccharomyces Genome Database (SGD) is a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae.

EBI Genomes (www.ebi.ac.uk/genomes/)
This web site provides access and statistics for the completed genomes, and information about ongoing projects.

Genome Biology (www.ncbi.nlm.nih.gov/Genomes/)
The Genome Biology site at NCBI contains information about the available complete genomes.

Ensembl (www.ensembl.org)
Ensembl is a joint project between EMBL-EBI and the Sanger Centre to develop a software system which produces and maintains automatic annotation on eukaryotic genomes.

Protein sequence databases

The two protein sequence databases SWISS-PROT and PIR are different from the nucleotide databases in that they are both curated.

This means that groups of designated curators (scientists) prepare the entries from literature and/or contacts with external experts.

SWISS-PROT, TrEMBL (www.expasy.ch/sprot)
SWISS-PROT is a protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases.

It was started in 1986 by Amos Bairoch in the Department of Medical Biochemistry at the University of Geneva. This database is generally considered one of the best protein sequence databases in terms of the quality of the annotation. Its size is given in the table below.

TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT. The procedure that is used to produce it was developed by Rolf Apweiler. The annotation of an entry in TrEMBL has not (yet) reached the standards required for inclusion into SWISS-PROT proper.

The SWISS-PROT database has some legal restrictions: the entries themselves are copyrighted, but freely accessible and usable by academic researchers. Commercial companies must pay a license fee to SIB.

PIR (pir.georgetown.edu)
The Protein Information Resource (PIR) is a division of the National Biomedical Research Foundation (NBRF) in the US. It collaborates with the Munich Information Center for Protein Sequences (MIPS) and the Japanese International Protein Sequence Database (JIPID). The PIR-PSD (Protein Sequence Database) release 70.01 (22 Oct 2000) contains 254,293 entries.

PIR grew out of Margaret Dayhoff's work in the middle of the 1960s. It strives to be comprehensive, well-organized, accurate, and consistently annotated. However, it is generally believed that it does not reach the same level of completeness in entry annotation as SWISS-PROT. Although SWISS-PROT and PIR overlap extensively, there are still many sequences which can be found in only one of them.

One can search for entries or do sequence similarity searches at the PIR site.

PIR also produces NRL-3D, a database of sequences extracted from the three-dimensional structures in the Protein Data Bank (PDB).

It appears that the PIR web site, and possibly also the underlying database, has improved considerably over the past year. This means that if one is interested in protein sequences, there is now even more reason to check out PIR.

Other relevant databases

GeneCards (www.genecards.org)
GeneCards is a database of human genes, their products and their involvement in diseases. It offers concise information about the functions of all human genes that have an approved symbol, as well as selected others. It is a typical example of a secondary database, which contains many links to other databases, and attempts to consolidate the information that is available for a specific class of entity, in this case human genes.

GeneLynx (www.genelynx.org)
GeneLynx is a database of Web links for human genes. It contains pointers to a large number of other databases. This is also a typical secondary database. It is maintained by Boris Lenhard and Wyeth Wasserman at CGB, KI, Sweden.

KEGG (www.genome.ad.jp/kegg/)
The Kyoto Encyclopedia of Genes and Genomes (KEGG) is an effort to computerize current knowledge of molecular and cellular biology in terms of the information pathways that consist of interacting molecules or genes and to provide links from the gene catalogs produced by genome sequencing projects.

Amos' WWW links page (www.expasy.org/links.html)
A page of many links to biological databases and/or web sites.

POPULAR BIOINFORMATICS DATABASES

http://everest.bic.nus.edu.sg/~bhuvana/lsm2104/popualr-bioinformatics-databases.htm

Class 6

10/07/08

Growth of GenBank database

[Figure: GenBank growth by year, showing base pairs (billions) and sequences (millions)]

www.ncbi.nlm.nih.gov

Created in 1988 as part of the National Library of Medicine at NIH

-Establish public databases

-Research in computational biology

-Develop software tools for sequence analysis

-Disseminate biomedical information

http://www.ncbi.nlm.nih.gov/

Types of databases at NCBI

Primary databases

-Original submissions by experimentalists

-Content controlled by the submitter

-Examples: GenBank, SNP, GEO

Derivative databases

-Built from primary data

-Content controlled by third party (NCBI)

-Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain, Gene

Entrez

Literature & Text

PubMed

-15 million citations in MEDLINE

-Links to participating online journals

Books

-Linked from PubMed and other records

-Searchable from within Entrez

Nucleotide databases

GenBank, RefSeq, PDB

Type         Database                   Count
Primary      GenBank / EMBL / DDBJ      54,694,591
Derivative   RefSeq                      1,132,972
Derivative   Third Party Annotation          4,763
Derivative   PDB                             5,887
Total                                   55,838,213

EMBL / GenBank / DDBJ (EMBL: European Molecular Biology Laboratory)

Archive containing all sequences from:

-genome projects

-sequencing centers

-individual scientists

-patent offices

Database is doubling every 15 months

Sequences from >200,000 different species

>1000 new species added every month
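To make the "doubling every 15 months" figure concrete, the short Python sketch below computes the implied growth factor over a given period. It assumes steady exponential growth, which is a simplification; the function name `growth_factor` is only illustrative.

```python
def growth_factor(months, doubling_time_months=15):
    """Factor by which a database grows over `months`,
    assuming it doubles every `doubling_time_months` (idealized)."""
    return 2 ** (months / doubling_time_months)

# After 5 years (60 months) there are four doublings.
print(growth_factor(60))  # 16.0
```

Under this assumption a database would be 16 times larger after five years.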

Protein Databases

-GenPept: CDS from GenBank entries

-TrEMBL (1996): Automatic CDS translations from EMBL

-Highly redundant

-Not all experimentally determined

-Many inaccuracies

Secondary protein databases

-SWISS-PROT (1986): Best annotated, least redundant

-PIR (Protein Information Resource): More automated annotation; collaborations with MIPS and JIPID

UniProt (2003)

UniProt (Universal Protein Resource) is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.

UniProt

-UniProt Knowledgebase (UniProtKB): Central access point for extensive curated protein information, including function, classification, and cross-reference.

-UniProt Non-redundant Reference (UniRef): Set of databases that combine closely related sequences into a single record to speed searches.

-UniProt Archive (UniParc): Comprehensive repository, reflecting the history of all protein sequences. No annotation, used internally.

NCBI Derivative Sequence Data

[Figure: diagram of sequence fragments with the labels GenBank, UniGene, RefSeq, Genome Assembly, Labs, Curators, Algorithms.]

Top image: confocal detection by the MegaBACE sequencer of fluorescently labeled DNA

Bottom image: computer image of sequence read by automated sequencer

High-throughput DNA sequencing

The trend of data growth

[Figure: nucleotides (billions) by year, 1980 to 2000.]

The 21st century is a century of biotechnology & bioinformatics:

Microarray: Global expression analysis: RNA levels of every gene in the genome analyzed in parallel.

Proteomics: Global protein analysis generates large mass spectra libraries.

Metabolomics: Global metabolite analysis: 25,000 secondary metabolites characterized.

Genomics: New sequence information is being produced at increasing rates. (The contents of GenBank double every one and a half years.)

Glycomics: Global sugar metabolism analysis.

How to handle the large amount of information?

Drew Sheneman, New Jersey--The Newark Star Ledger

Answer: bioinformatics and Internet

Bioinformatics – NEED FOR ALGORITHM?

IBM 7090 computer

In the 1960s: the birth of bioinformatics

Margaret Oakley Dayhoff created:

-The first protein database

-The first program for sequence assembly

There is a need for computers and algorithms that allow: access, processing, storing, sharing, retrieving, visualizing, annotating…

Why do we need the Internet?

“Omics” projects and the information associated with them involve a huge amount of data that is stored on computers all over the world.

Because it is impossible to maintain up-to-date copies of all relevant databases within the lab, access to the data is via the Internet.

[Figure: diagram with the labels "You are here", "Your request", "Database storage", "Results".]

Scope

Make you familiar with bioinformatics resources available on the web

They are big databases and searching either one should produce similar results because they exchange information routinely.

-GenBank (NCBI): http://www.ncbi.nlm.nih.gov

-DDBJ (DNA DataBase of Japan): http://www.ddbj.nig.ac.jp

-TIGR: http://tigr.org/tdb/tgi

-Yeast: http://yeastgenome.org

-E. coli: http://colibase.bham.ac.uk/blast/

Specialized databases: tissues, species…

-ESTs (Expressed Sequence Tags)

~at NCBI http://www.ncbi.nlm.nih.gov/dbEST

~at TIGR http://tigr.org/tdb/tgi

- ...many more!

Protein (amino acid) databases

They are big databases too:

-Swiss-Prot (very high level of annotation): http://au.expasy.org/

-PIR (Protein Information Resource), the world's most comprehensive catalog of information on proteins: http://www.pir.uniprot.org/

Translated databases:

-TREMBL (translated EMBL): includes entries that have not yet been annotated into Swiss-Prot. http://www.ebi.ac.uk/trembl/access.html

-GenPept (translation of coding regions in GenBank)

-pdb (sequences derived from the 3D structures in the Brookhaven PDB): http://www.rcsb.org/pdb/

Database homology searching

Uses algorithms that search efficiently and rest on a mathematical basis, so that search scores can be translated into statistical significance.

Assumes that sequence, structure, and function are inter-related.

All similarity searching methods rely on the concepts of alignment and distance between sequences.

A similarity score is calculated from a distance: the number of DNA bases or amino acids that are different between two sequences.
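As a minimal sketch of the distance idea above, the Python below counts differing positions between two sequences that are already aligned to the same length (the Hamming distance) and converts that into a percent identity. The function names and the example sequences are illustrative assumptions, not part of any particular tool.

```python
def hamming_distance(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must already be aligned to equal length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)

def percent_identity(seq_a, seq_b):
    """Similarity derived from the distance: share of identical positions."""
    distance = hamming_distance(seq_a, seq_b)
    return 100.0 * (len(seq_a) - distance) / len(seq_a)

print(hamming_distance("ATTGACTA", "ATTCACTT"))  # 2
print(percent_identity("ATTGACTA", "ATTCACTT"))  # 75.0
```

Real search tools use substitution matrices and gap penalties rather than a plain count of differences, but the relation between distance and similarity is the same.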

Calculating alignment scores

Scoring system: Uses scoring matrices that allow biologists to quantify the quality of sequence alignments.

The raw score S is calculated by summing the scores for each aligned position and the scores for gaps. Gap creation/extension scores are inherent to the scoring system in use (BLAST, FASTA…)

The score for an identity or a mismatch is given by the specified substitution matrix (e.g., BLOSUM62).
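The raw score calculation can be sketched in Python as follows. This is an illustration under stated assumptions: the match, mismatch, and gap values are made-up toy numbers (not BLOSUM62 and not the defaults of BLAST or FASTA), and the convention assumed here is that the first position of a gap costs `gap_open` and each further position costs `gap_extend`.

```python
def raw_score(aligned_a, aligned_b, match=4, mismatch=-2,
              gap_open=-5, gap_extend=-1):
    """Sum per-column scores of a pairwise alignment ('-' marks a gap).

    All scoring values are hypothetical toy numbers for illustration.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    prev_gap_row = None  # row ('a' or 'b') gapped in the previous column
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            # Continuing a gap in the same row is cheaper than opening one.
            score += gap_extend if prev_gap_row == row else gap_open
            prev_gap_row = row
        else:
            score += match if x == y else mismatch
            prev_gap_row = None
    return score

# 4 + 4 - 2 + 4 (aligned residues) - 5 - 1 (one gap of length 2) + 4 = 8
print(raw_score("ACGT--A", "ACCTGGA"))  # 8
```

In a real search the match and mismatch values would be looked up in the chosen substitution matrix instead of being two constants.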

Devising a scoring system

How the matrices were created:

-Very similar sequences were aligned.

-From these alignments, the frequency of substitution between each pair of amino acids was calculated and then PAM1 was built.

-The full series of PAM matrices is calculated by multiplying the PAM1 mutation probability matrix by itself (PAMn is PAM1 raised to the n-th power); each matrix is then normalized to log-odds format.

Some popular scoring matrices are:

-PAM (Percent Accepted Mutation): for evolutionary studies. For example, PAM1 corresponds to 1 accepted point mutation per 100 amino acids.

-BLOSUM (BLOcks amino acid SUbstitution Matrix): for finding common motifs. For example, BLOSUM62 is built from alignments of sequences sharing no more than 62% identity.
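The step of multiplying PAM1 by itself can be illustrated with a small Python sketch. The 3x3 matrix `TOY_PAM1` below is a made-up mutation probability matrix over a hypothetical three-letter alphabet (the real PAM1 is 20x20 and derived from data); only the repeated-multiplication idea is meant to carry over.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Hypothetical "PAM1": each row sums to 1; about 1% chance of change per residue.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

Each row of the result still sums to 1, and the diagonal entries shrink as the evolutionary distance grows, which is why higher PAM numbers suit more divergent sequences.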

Devising a scoring system

Importance: Scoring matrices appear in all analyses involving sequence comparison.

The choice of matrix can strongly influence the outcome of the analysis.

Understanding the theory underlying a given scoring matrix can aid in making the proper choice:

-Some matrices reflect similarity: good for database searching

-Some reflect distance: good for phylogenies

Log-odds matrices, a normalisation method for matrix values:

S_ij = log( q_ij / (p_i p_j) )

S_ij is the log of the ratio between the probability that two residues, i and j, are aligned by evolutionary descent and the probability that they are aligned by chance. q_ij is the frequency with which i and j are observed to align in sequences known to be related. p_i and p_j are their frequencies of occurrence in the set of sequences.
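A log-odds score can be computed directly from the three frequencies defined above. The Python sketch below assumes base-2 logarithms (scores in bits) and uses invented frequencies; published matrices additionally scale and round the values, which is not shown here.

```python
import math

def log_odds(q_ij, p_i, p_j, base=2.0):
    """Log-odds score: observed pair frequency q_ij over the frequency
    p_i * p_j expected if the two residues aligned purely by chance."""
    return math.log(q_ij / (p_i * p_j), base)

# Hypothetical frequencies: the pair is seen twice as often as chance predicts.
print(log_odds(0.02, 0.1, 0.1))   # about +1 bit: favoured substitution
# Seen half as often as chance predicts.
print(log_odds(0.005, 0.1, 0.1))  # about -1 bit: disfavoured substitution
```

Positive scores therefore mark pairs that align more often than expected by chance, and negative scores mark pairs that align less often.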

Database search methods: Sequence Alignment

Two broad classes of sequence alignments exist:

Global alignment: aligns the sequences over their entire length; not sensitive to short regions of similarity

Local alignment: aligns only the most similar regions; faster

Example: the two sequences

QKESGPSSSYC

VQQESGLVRTTC

share the local alignment

ESG

ESG

The most widely used local similarity algorithms are:

-Smith-Waterman (http://www.ebi.ac.uk/MPsrch/)

-Basic Local Alignment Search Tool (BLAST, http://www.ncbi.nih.gov)

-Fast Alignment (FASTA, http://fasta.genome.jp; http://www.ebi.ac.uk/fasta33/; http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl)
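To show how a local alignment such as ESG/ESG is found, here is a compact Python sketch of the Smith-Waterman dynamic programming recurrence with a linear gap penalty. The scoring values (match +1, mismatch -2, gap -2) are assumptions chosen for this toy example; real searches use a substitution matrix and affine gap penalties.

```python
def smith_waterman(a, b, match=1, mismatch=-2, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the matrix; cell values never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

print(smith_waterman("QKESGPSSSYC", "VQQESGLVRTTC"))  # (3, 'ESG', 'ESG')
```

With these penalties the shared ESG segment is the best local alignment; a gentler mismatch penalty would extend the result to the flanking residues.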

Which algorithm to use for database similarity search?

Speed: BLAST > FASTA > Smith-Waterman (Smith-Waterman is VERY SLOW and uses a LOT OF COMPUTER POWER)

Sensitivity/statistics:

-FASTA is more sensitive and misses fewer homologues

-Smith-Waterman is even more sensitive

-BLAST calculates probabilities

FASTA is more accurate than BLAST for DNA-DNA searches

Genomics: Completed genomes as of 2002

Currently the genomes of over 600 organisms have been sequenced:

Organism                                 Base pairs       Whole-genome shotgun   Map-based
54 Bacteria                              0.8-6 million    +                      –
Yeast                                    15 million       –                      +
C. elegans (roundworm)                   100 million      –                      +
Drosophila (fruitfly)                    120 million      +                      –
Arabidopsis (thale cress)                130 million      –                      +
Rice                                     435 million      –                      +
Human                                    3 billion        +                      +
Fugu (puffer fish)                       365 million      +                      –
Anopheles (malaria-carrying mosquito)    278 million      +                      –

This generates large amounts of information to be handled by individual computers.

The dilemma: DNA or protein?

Is the comparison of two nucleotide sequences accurate?

By translating into amino acid sequence, are we losing information?

-The genetic code is degenerate (two or more codons can represent the same amino acid)

-Very different DNA sequences may code for similar protein sequences

-We certainly do not want to miss those cases!
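The degeneracy point can be checked with a few lines of Python. The sketch builds the standard genetic code table and translates two invented DNA sequences that differ at most positions yet encode the same peptide; the sequences and function names are illustrative assumptions.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate a DNA string in reading frame 1, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

dna_1 = "TTATCACGT"  # hypothetical sequence
dna_2 = "CTGAGCAGA"  # hypothetical sequence using synonymous codons

identical = sum(1 for x, y in zip(dna_1, dna_2) if x == y)
print(translate(dna_1), translate(dna_2))  # LSR LSR
print(identical, "of", len(dna_1), "nucleotides identical")  # 2 of 9
```

A nucleotide-level search would see only 2 matching bases out of 9 here, while a protein-level comparison sees a perfect match.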

Search by similarity

Using nucleotide seq. Using amino acid seq.

Tools to search databases

Reasons for translating

Comparing DNA sequences gives more random matches:

[Figure: example alignments labeled "A good alignment with end-gaps" and "A very poor alignment"; annotation: "Almost 50% identity!"]

Conservation of protein in evolution (DNA similarity decays faster!)
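One way to see why DNA comparisons produce more random matches is to compute the chance that two unrelated residues are identical. The Python sketch below assumes uniform residue frequencies, which real sequences do not have, so the numbers are only a rough guide.

```python
from fractions import Fraction

def chance_identity(frequencies):
    """Probability that two independently drawn residues are identical,
    given the residue frequencies (sum of squared frequencies)."""
    return sum(f * f for f in frequencies)

dna = chance_identity([Fraction(1, 4)] * 4)        # 4 nucleotides, assumed uniform
protein = chance_identity([Fraction(1, 20)] * 20)  # 20 amino acids, assumed uniform

print(dna)      # 1/4
print(protein)  # 1/20
```

Under this simplification, a quarter of the positions in two unrelated DNA sequences match by chance before any gaps are introduced, compared with one in twenty for proteins.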

Conclusion:

It is almost always better to compare coding sequences in their amino acid form, especially if they are very divergent.

Very highly similar nucleotide sequences may give better results.

BLAST and FASTA variants

FASTA: Compares a DNA query to DNA database, or a protein query to protein database

FASTX: Compares a translated DNA query to a protein database

TFASTA: Compares a protein query to a translated DNA database

BLASTN: Compares a DNA query to DNA database.

BLASTP: Compares a protein query to protein database.

BLASTX: Compares the 6-frame translations of DNA query to protein database.

TBLASTN: Compares a protein query to the 6-frame translations of a DNA database.

TBLASTX: Compares the 6-frame translations of DNA query to the 6-frame translations of a DNA database (each sequence is comparable to BLASTP searches!)

PSI-BLAST: Performs iterative database searches. The results from each round are incorporated into a 'position specific' score matrix, which is used for further searching
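The "6-frame translation" used by BLASTX, TBLASTN and TBLASTX can be sketched in Python as below: three reading frames on the given strand plus three on its reverse complement. This is an illustration of the concept with an invented input sequence, not the code used by BLAST; the frame labels "+1" to "-3" are a naming assumption.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate from the first base, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return the three forward and three reverse-strand translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames

for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

For the toy input, frame +1 reads ATG GCC TAA and gives "MA*", while frame -1 reads the reverse complement TTAGGCCAT and gives "LGH".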

A practical example of sequence alignment: http://www.ncbi.nlm.nih.gov

BLAST results

Detailed BLAST results

E value: the expectation value, i.e. the number of hits with a score at least as good as yours that would be expected by chance in a database of this size. The lower the E value, the more significant the score.
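The relation between a raw score and its E value is usually described by the Karlin-Altschul formula E = K m n e^(-λS), where m and n are the query and database lengths. The Python sketch below applies it; the default `lam` and `k` are commonly quoted values for gapped BLOSUM62 searches and should be treated as assumptions here, since the real values depend on the scoring system and are reported in the BLAST output.

```python
import math

def e_value(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Expected number of chance hits scoring at least `raw_score`.

    lam and k are assumed, illustrative Karlin-Altschul parameters.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)

def p_value(e):
    """Probability of at least one chance hit, given an expected count e."""
    return -math.expm1(-e)  # equals 1 - exp(-e), accurate for small e

# Hypothetical search: 250-residue query against a 50-million-residue database.
for score in (50, 100, 150):
    e = e_value(score, 250, 50_000_000)
    print(score, e, p_value(e))
```

The E value falls exponentially as the score rises, and for small E the probability of a chance hit is nearly equal to E, which is why the two are often used interchangeably.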

Database searching tips

Use latest database version.

Use BLAST first, then a finer tool (FASTA,…)

Search both strands when using FASTA.

Translate sequences where relevant

Search 6-frame translation of DNA database

E < 0.05 is statistically significant, usually biologically interesting.

If the query has repeated segments, delete them and repeat search

Most widely used sites for sequence analysis

Sites for alignment of 2 sequences:

-T-COFFEE (http://www.ch.embnet.org/software/TCoffee.html): more accurate than ClustalW for sequences with less than 30% identity.

-ClustalW (http://www.ch.embnet.org/software/ClustalW.html; http://align.genome.jp)

-bl2seq (http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi)

-LALIGN (http://www.ch.embnet.org/software/LALIGN_form.html)

-MultAlin (http://prodes.toulouse.inra.fr/multalin/multalin.html)

Sites for DNA to protein translation: These tools can translate DNA sequences in any of the three forward or three reverse reading frames.

-Translate (http://au.expasy.org/tools/dna.html)

-Translate a DNA sequence (http://www.vivo.colostate.edu/molkit/translate/index.html)

-Transeq (http://www.ebi.ac.uk/emboss/transeq)
