vishal-agarwal
BIOINFORMATICS
"Biology easily has 500 years of exciting problems to work on." (Donald E. Knuth)
Ever since the structure of DNA was unraveled in 1953, molecular biology has witnessed tremendous advances.
The need to process ever-growing biological data has created entirely new problems that are interdisciplinary in nature.
Scientists from the biological sciences are the creators and ultimate users of this data.
Because of the huge size and high complexity of biological data, help from other disciplines, in particular mathematics and computer science, is required.
This need has created a new field: "Computational Molecular Biology & Bioinformatics".
The Commercial Market
The current bioinformatics market is worth 300 million / year (half of it software)
Prediction: $2 billion / year within 5-6 years
~50 Bioinformatics companies: Genomatrix Software, Genaissance Pharmaceuticals, Lynx, Lexicon Genetics, DeCode Genetics, CuraGen, AlphaGene, Bionavigation, Pangene, InforMax, TimeLogic, GeneCodes, LabOnWeb.com, Darwin, Celera, Incyte, BioResearch Online, BioTools, Oxford Molecular, Genomica, NetGenics, Rosetta, Lion BioScience, DoubleTwist, eBioinformatics, Prospect Genomics, Neomorphic, Molecular Mining, GeneLogic, GeneFormatics, Molecular Simulations, Bioinformatics Solutions….BIOCON(INDIA)
"Computational Molecular Biology & Bioinformatics" (CMB)
"CMB consists of the development and use of computer science and mathematical techniques to solve problems in molecular biology."
Bioinformatics
Unit – 1. Basic Concepts
Unit – 2. Suffix Trees and Applications
Unit – 3. Sequence Alignment: Pairwise Alignment, Multiple Alignments
Unit – 4. Sequencing
Unit – 5. Motif Prediction
Bioinformatics
Unit – 1. Basic Concepts of Molecular Biology: Cellular architecture, nucleic acids (RNA & DNA), DNA replication, repair and recombination, transcription, genetic code, gene expression, protein structure and function, molecular biology tools. Statistical Methods: estimation, hypothesis testing, random walks, Markov models (HMM).
Unit – 2. Suffix Trees: Definition and examples, Ukkonen's linear-time suffix tree algorithm, applications (exact string matching, longest common substrings of two strings, recognizing DNA contamination). Pairwise sequence alignment (edit distance, dynamic programming calculation of edit distance, string similarity, gaps).
Unit – 3. Sequence Alignment: Pairwise sequence alignment (local), HMM for pairwise alignment. Multiple string alignments: need for MSA, family and superfamily representation, multiple sequence comparison for structural inferences, multiple alignments with sum-of-pairs, consensus objective functions. Profile HMM for multiple sequence alignment. Database searching for similar sequences (FASTA, BLAST), PAM and BLOSUM substitution matrices.
Unit – 4. Sequencing: Fragment assembly (shortest common superstring, algorithms based on multigraphs), sequencing by hybridization, protein sequencing.
Unit – 5. Motif Prediction: Motif prediction, gene prediction, introduction to protein structure prediction.
BOOKS RECOMMENDED
Dan Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997. (Chapters: 5, 6, 7, 10, 11, 14, 15)
J. Setubal & J. Meidanis, Introduction to Computational Molecular Biology, PWS Publishing Company, 1997. (Chapters: 1, 8)
W.J. Ewens & G.R. Grant, Statistical Methods in Bioinformatics, Springer, 1989.
R. Durbin, S.R. Eddy, A. Krogh and G.J. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998. (Chapters: 3, 5 & 6)
R.C. Deonier, S. Tavaré, M.S. Waterman, Computational Genome Analysis, Springer, 2005.
N.C. Jones and P.A. Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
D.E. Krane, M.L. Raymer, Fundamental Concepts of Bioinformatics, Pearson Education, 2003.
J. Tisdall, Beginning Perl for Bioinformatics, O'Reilly, 2001.
M.S. Waterman, Introduction to Computational Biology, CRC Press, 2000.
A. Baxevanis & B. Ouellette, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Wiley-Interscience, 2001.
M. Ridley, Genome: The Autobiography of a Species, Fourth Estate, 1999.
Lodish, Berk, Zipursky, Baltimore & Darnell, Molecular Cell Biology, W.H. Freeman, 2000.
Class 2
UNIT 1: LIFE AT ITS SIMPLEST, DNA, RNA, PROTEIN, GENETIC CODE
QUICK PRIMER ON GENETICS
EVERY CELL IN THE HUMAN BODY CONTAINS A COPY OF THE GENOME.
THINK OF THE GENOME AS A BOOK: THE BLUEPRINT THAT CONTAINS DETAILS OF WHAT EACH INDIVIDUAL OUGHT TO BE LIKE.
EACH HUMAN GENOME CONTAINS 23 PAIRS OF CHROMOSOMES.
IF THE GENOME WERE A BOOK, THEN THINK OF THE CHROMOSOMES AS THE CHAPTERS IN IT.
EACH OF THESE CHAPTERS TELLS SEVERAL THOUSAND STORIES CALLED GENES,
for example the colour of an individual's skin, eyes and hair, left- or right-handedness, IQ, and much else.
What did Venter do?
He figured out what language the book was written in and how to read it.
He figured out the grammar of the book and, therefore, how to write in the language.
This is the kind of language that will allow him to create life.
Venter… magic… new life?
Venter used his knowledge to create Mycoplasma laboratorium, a chromosome that is 381 genes long. Transplanted into a living cell, it is expected to take control of the cell and become a new "life-form".
Such a life-form could mop up excess carbon dioxide and so help with problems like global warming.
Until now, scientists have managed to take the genome out of one cell, put it into another cell, and so create an altogether new organism.
But nobody knew how to create the genome itself. Venter did just that.
Living vs. Nonliving
Both kinds of matter are composed of the same atoms and conform to the same physical and chemical rules.
What is the difference, then?
Living things can move, reproduce, grow, and eat.
They participate actively in their environment.
Living beings act the way they do because of a complex array of chemical reactions that occur inside them. These reactions never cease.
A living organism is constantly exchanging matter and energy with its surroundings.
Anything that is in equilibrium with its surroundings can generally be considered dead.
(Exceptions are vegetative forms, like seeds, and viruses, which may be completely inactive for long periods of time and yet are not dead.)
Life starts…
Life started some 3.5 billion years ago, shortly after the Earth itself was formed.
The first life forms were very simple, but over billions of years a continuously acting process called evolution made them evolve and diversify.
Both complex and simple organisms have a similar molecular chemistry, or biochemistry.
The main actors in the chemistry of life are molecules called proteins and nucleic acids.
ACTORS IN THE CHEMISTRY OF LIFE: Proteins & nucleic acids
Proteins are responsible for what a living being is and does in a physical sense. "We are our proteins" (Russell Doolittle)
Nucleic acids, on the other hand, encode the information necessary to produce proteins and are responsible for passing this "recipe" along to subsequent generations.
Recent research is devoted to understanding the structure and function of proteins and nucleic acids.
Proteins
Different Roles of Proteins
  Enzymes
  Carry signals
  Transport small molecules such as oxygen
  Form cellular structures (tissues)
  Regulate cell processes (such as defense mechanisms)
What are proteins made of?
Amino acids: a chain of amino acids = a protein
Amino acids
Backbone of the polypeptide chain
Convention: begin at the N-terminal, end at the C-terminal
Torsion (rotation) angles around the N-Cα bond (φ) and the Cα-C bond (ψ)
PROTEIN STRUCTURE
A protein is not just a linear sequence of residues (its primary structure).
Proteins actually fold in 3D, presenting secondary, tertiary and quaternary structures.
The 3D shape of a protein is related to its function.
A protein can be made out of 20 different kinds of amino acids, which makes the resulting 3D structure very complex and without symmetry.
No simple and accurate method for determining the 3D structure is known.
Genomic Code
DNA: deoxyribonucleic acid
Basic unit = nucleotide: sugar, phosphate, base (A, G, T, C)
Bases: adenine, guanine, thymine, cytosine.
DNA is a chain of simpler molecules.
Actually it is a double chain (two strands).
Each single chain has a backbone consisting of repetitions of the same basic unit.
This unit is formed by a sugar molecule called 2'-deoxyribose attached to a phosphate residue.
The sugar molecule contains five carbon atoms, labeled 1' through 5'.
DNA molecules also have an orientation (a strand starts at the 5' end and finishes at the 3' end).
Attached to each 1' carbon in the backbone are other molecules called bases.
There are 4 kinds of bases: A (adenine), G (guanine), C (cytosine), T (thymine).
Bases A & G belong to a larger group of substances called purines, whereas C & T belong to the pyrimidines.
When we see the basic unit of a DNA molecule as consisting of the sugar, the phosphate, and its base, we call it a nucleotide.
Bases and nucleotides are not the same thing.
A DNA molecule having only a few nucleotides is referred to as an oligonucleotide.
DNA molecules in nature are very long, much longer than proteins.
In a human cell, each DNA molecule has tens to hundreds of millions of nucleotides.
DNA molecules are double-stranded.
The two strands are tied together in a helical structure (Watson & Crick, 1953).
Each base in one strand is paired with a base in the other strand.
A pairs with T and C pairs with G (complementary bases, or Watson-Crick base pairs).
The base pair (bp) provides the unit of length.
RNA
RNA is a nucleic acid made from long chain of nucleotides
Each nucleotide consists of a nitrogen base, a ribose sugar, and a phosphate
RNA is very similar to DNA, but differs from it in the following basic compositional and structural ways.
DNA vs. RNA
DNA:
  DNA is double-stranded
  Very long chain of nucleotides
  DNA contains deoxyribose
  DNA is more stable
  The complementary nucleotide to adenine is thymine
  DNA performs essentially one function
RNA:
  RNA is single-stranded
  Comparatively shorter chain of nucleotides
  RNA contains ribose
  Less stable, more prone to hydrolysis
  The complementary nucleotide to adenine is uracil
  There are different kinds of RNA performing different functions
Class 3
26/06/08
Central Dogma of Molecular Biology
How does the information in DNA result in proteins?
A promoter is a region before each gene in the DNA that signals to the cellular machinery that a gene is ahead.
Once the beginning of a gene has been recognized, a copy of the gene is made on an RNA molecule.
The resulting RNA is messenger RNA, or mRNA (the same sequence as the gene, with U substituted for T). This process is called transcription.
The mRNA will be used to manufacture a protein.
After transcription, the introns are spliced out of the mRNA. Introns are the parts of a gene that are not used in protein synthesis.
After the introns are spliced out, the shortened mRNA, containing copies of only the exons plus regulatory regions at the beginning and end, leaves the nucleus.
Because of the intron/exon phenomenon, we use different names for the entire gene and for the spliced sequence consisting of exons only.
The former is called genomic DNA and the latter complementary DNA, or cDNA.
tRNAs are the molecules that actually implement the genetic code, in a process called translation. They make the connection between a codon and the specific amino acid this codon codes for.
When a stop codon appears, no tRNA associates with it and the synthesis ends.
Central Dogma: DNA -> RNA -> Protein
DNA:     CCTGAGCCAACTATTGATGAA
  (transcription)
RNA:     CCUGAGCCAACUAUUGAUGAA
  (translation)
Protein: PEPTIDE
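The slide's example can be reproduced with a few lines of Python. This is a minimal sketch, not how any particular tool does it. It follows the slide's simplification of transcription (copy the gene and substitute U for T) and uses the standard genetic code with one-letter amino acid symbols. The names `transcribe`, `translate` and `CODON_TABLE` are made up for this illustration.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G order.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with U
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Slide-level transcription: copy the sequence, substituting U for T."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Read the mRNA codon by codon and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("CCTGAGCCAACTATTGATGAA")
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```

The example sequence was evidently chosen so that its seven codons spell the word PEPTIDE in one-letter amino acid codes.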
JUNK DNA
Genes are certain contiguous regions of the chromosome, but they do not cover the entire molecule.
There are intergenic regions which do not have any known function. They are called "junk DNA" because they appear to be there for no particular use.
Recent research has shown that junk DNA has more information content than previously believed.
The amount of junk DNA varies from species to species.
In humans, more than 90% of the DNA is junk DNA.
OPEN READING FRAME (ORF)
An ORF in a DNA sequence is a contiguous stretch of the sequence that begins at a start codon and has an integral number of codons, none of which is a stop codon.
Consider the sequence TAATCGAATGGGC:
  first reading frame: TAA TCG AAT GGG
  second reading frame: AAT CGA ATG GGC
  third reading frame: ATC GAA TGG
  fourth reading frame: TCG AAT GGG
The fourth frame is a subset of the first frame (the one starting at position 1), so a single strand has only three distinct reading frames.
Sometimes we talk about 6 rather than 3 reading frames in a sequence: look at the opposite strand and count another 3.
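The reading frames above can be enumerated mechanically. The sketch below splits a strand into its three frames, adds the three frames of the opposite strand (the reverse complement), and lists ORFs in one frame. It is an illustration under stated assumptions: ATG is taken as the only start codon, TAA, TAG and TGA as the stop codons, and all function names are invented for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Sequence of the opposite strand, read in its own 5' to 3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def reading_frames(seq):
    """Split one strand into its three frames, keeping complete codons only."""
    return [
        [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        for offset in range(3)
    ]


def six_frames(seq):
    """Three frames of the given strand plus three of the opposite strand."""
    return reading_frames(seq) + reading_frames(reverse_complement(seq))


def orfs_in_frame(codons):
    """ORFs in one frame: from a start codon up to (not including) a stop
    codon, or up to the end of the frame if no stop codon follows."""
    orfs, current = [], None
    for codon in codons:
        if current is None:
            if codon == START_CODON:
                current = [codon]
        elif codon in STOP_CODONS:
            orfs.append("".join(current))
            current = None
        else:
            current.append(codon)
    if current is not None:
        orfs.append("".join(current))
    return orfs


frames = reading_frames("TAATCGAATGGGC")
for frame in frames:
    print(" ".join(frame))
print(orfs_in_frame(frames[1]))  # the second frame contains ATG GGC
```

In the example sequence only the second frame contains a start codon, so it is the only frame on that strand with an ORF.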
Genome
The complete set of chromosomes inside a cell is called a genome.
The number of chromosomes in a genome is characteristic of a species.
Every cell in a human being has 46 chromosomes, whereas in mice this number is 40.
Is Genome like a computer program?
Genome = computer program
The genome of an organism is seen as a computer program that completely specifies the organism.
Cell machinery = interpreter of this program
Biological functions performed by proteins = execution of this program
Class 4
01/07/08
The Eye of the Fly
Fruit flies (Drosophila melanogaster) have a gene called eyeless which, if it is "knocked out" (i.e. eliminated from the genome using molecular biology methods), results in fruit flies without eyes.
The eyeless gene therefore evidently plays a role in eye development.
Researchers have identified a human gene responsible for a condition called aniridia.
In humans who are missing this gene (or in whom the gene has mutated just enough for its protein product to stop functioning properly), the eyes develop without irises.
If the gene for aniridia is inserted into an eyeless Drosophila "knock out", it causes the production of normal Drosophila eyes.
Could there be some similarity in how eyeless and aniridia function, even though flies and humans are vastly different organisms?
To gain insight into how eyeless and aniridia work, we can compare their sequences.
20 years ago, finding similarity between the eyeless and aniridia DNA sequences would have been like looking for a needle in a haystack.
Most scientists compared gene sequences by hand, aligning them one under the other in a word processor and looking for matches character by character.
This was time-consuming and hard on the eyes.
In the late 1980s, fast computer programs for comparing sequences changed molecular biology forever.
Many tools that are widely available to the biology community, including everything from multiple alignment, phylogenetic analysis, motif identification, and homology modeling software to web-based database search services, rely on pairwise sequence comparison algorithms as a core element of their function.
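To give a feel for what a pairwise sequence comparison algorithm computes, here is a minimal sketch of the unit-cost edit distance from the syllabus (Unit 2), calculated by dynamic programming. It is only the simplest member of the family: real tools use scoring matrices, gap penalties and heuristics, none of which are modelled here. The function name `edit_distance` is chosen for this illustration.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,                      # delete char_a
                curr[j - 1] + 1,                  # insert char_b
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ACGT", "AGT"))  # one deletion
```

Only two rows of the dynamic programming table are kept at a time, so memory use grows with the length of one sequence while running time grows with the product of the two lengths.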
How is the genome studied? Take the human genome as an example.
Sequencing: The basic information we want to extract from any piece of DNA is its base-pair sequence. The process of obtaining this information is called sequencing.
A human chromosome has around 10^8 base pairs, but the largest pieces of DNA that can be sequenced in the laboratory are about 700 bp long.
So there is a gap of some 10^5 (10^8 / 700 ≈ 1.4 × 10^5) between the scale of what we can actually sequence and the size of a chromosome.
This gap is at the heart of many problems in computational biology.
Cutting & Breaking DNA
Because a DNA molecule is so long, some tool is needed to cut it at specific points (like a pair of scissors) or to break it apart in some way.
The role of the scissors is played by restriction enzymes.
They cut DNA molecules at all places where a certain sequence appears (usually a palindromic sequence, i.e. one that equals its own reverse complement).
Some common types of restriction enzymes are 4-cutters, 6-cutters, and 8-cutters.
It is rare to see an odd cutter because sequences of odd length cannot be palindromes in this sense.
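A short sketch makes the palindrome remark concrete. In DNA, a palindrome is a sequence that reads the same as its reverse complement, so in an odd-length site the middle base would have to be its own complement, which no base is. The example uses GAATTC, the recognition site of the well-known 6-cutter EcoRI. The function names and the test DNA string are made up for illustration, and the sketch only locates sites; it does not model where within the site the enzyme cuts.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def is_dna_palindrome(site):
    """True if the site equals its own reverse complement."""
    return site == reverse_complement(site)


def find_sites(dna, site):
    """Start positions (0-based) of every occurrence of the site in dna."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions


print(is_dna_palindrome("GAATTC"))                 # True: a valid 6-cutter site
print(is_dna_palindrome("GAATC"))                  # False: odd length
print(find_sites("AAGAATTCTTGAATTC", "GAATTC"))    # [2, 10]
```

Cutting at every reported position would split the molecule into fragments whose lengths can then be measured, which is what the gel technique described later in these notes does.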
COPYING DNA
Also known as DNA amplification.
Very important in DNA cloning.
Given a piece of DNA, one way of obtaining further copies is to use nature itself.
We insert this piece into the genome of an organism (the host) and then let the organism multiply.
As the host multiplies, the inserted piece gets multiplied along with the original DNA.
Then we kill the host and dispose of the rest, keeping only the inserts in the desired quantity.
DNA produced in this way is called recombinant.
Reading & measuring DNA
Reading is done with a technique known as gel electrophoresis which is based on separation of molecules by their size
This process involves a gel medium & a strong electric field
HUMAN GENOME PROJECT
What is the Human Genome Project?
A U.S. government project coordinated by the Department of Energy and the National Institutes of Health.
Goals (1998-2003):
  identify the approximately 100,000 genes in human DNA
  determine the sequences of the 3 billion bases that make up human DNA
  store this information in databases
  develop tools for data analysis
  address the ethical, legal, and social issues that arise from genome research
Why is the Department of Energy involved?
After atomic bombs were dropped during World War II, Congress told DOE to conduct studies to understand the biological and health effects of radiation and of the chemical by-products of all energy production.
The best way to study these effects is at the DNA level.
Whose genome is being sequenced?
the first reference genome is a composite genome from several different people
generated from 10-20 primary samples taken from numerous anonymous donors across racial and ethnic groups
Benefits of HGP Research
  improvements in medicine
  microbial genome research for fuel and environmental cleanup
  DNA forensics
  improved agriculture and livestock
  better understanding of evolution and human migration
  more accurate risk assessment
Ethical, Legal, and Social Implications of HGP Research
  fairness in the use of genetic information
  privacy and confidentiality
  psychological impact and stigmatization
  genetic testing
  reproductive issues
  education, standards, and quality control
  commercialization
  conceptual and philosophical implications
For more information: Human Genome Project Information website, http://www.ornl.gov/hgmis
A large effort like this cannot be undertaken by a single lab.
On the computer science side, databases with updated and consistent information have to be maintained, and fast access to the data has to be provided.
After the sequencing, there is still the difficult task of analyzing the data obtained.
Treatment of genetic diseases based on data produced by the Human Genome Project is still a work in progress, although encouraging pioneering efforts have already yielded results.
Class 5
08/07/08
What is a database?
A collection of information, usually stored in an electronic format that can be searched by a computer.
A brief history of biological databases
1965: M. O. Dayhoff et al. publish "Atlas of Protein Sequence and Structure"
1982: EMBL initiates a DNA sequence database, followed within a year by GenBank (then at LANL) and in 1984 by the DNA Database of Japan
1988: EMBL/GenBank/DDBJ agree on a common format for data elements
Biological databases: why?
There are two main functions of biological databases:
Make biological data available to scientists. As much as possible of a particular type of information should be available in one single place (book, site, database). Published data may be difficult to find or access, and collecting it from the literature is very time-consuming. And not all data is actually published explicitly in an article (genome sequences!).
To make biological data available in computer-readable form. Since analysis of biological data almost always involves computers, having the data in computer-readable form (rather than printed on paper) is a necessary first step.
The different types of databases
One may characterize the available biological databases by several different properties. Here is a list to help you think about the various properties a particular database may have
Type of data
  nucleotide sequences
  protein sequences
  proteins
  sequence patterns or motifs
  macromolecular 3D structure
  gene expression data
  metabolic pathways
Data entry and quality control
  Scientists (teams) deposit data directly
  Appointed curators add and update data
  Are erroneous data removed or marked?
  Type and degree of error checking
  Consistency, redundancy, conflicts, updates
Primary or derived data
  Primary databases: experimental results entered directly into the database
  Secondary databases: results of analysis of primary databases
  Aggregate of many databases: links to other data items, combination of data, consolidation of data
Technical design
  Flat files
  Relational database (SQL)
  Object-oriented database (e.g. CORBA, XML)
Maintainer status
  Large, public institution (e.g. EMBL, NCBI)
  Quasi-academic institute (e.g. Swiss Institute of Bioinformatics, TIGR)
  Academic group or scientist
  Commercial company
Availability
  Publicly available, no restrictions
  Available, but with copyright
  Accessible, but not downloadable
  Academic, but not freely available
  Proprietary, commercial; possibly free for academics
Accession codes vs identifiers
Many databases in bioinformatics (SWISS-PROT, EMBL, GenBank, Pfam) use a system where an entry can be identified in two different ways.
Basically, an entry has two names:
  Identifier
  Accession code (or number)
Identifier
An identifier ("locus" in GenBank, "entry name" in SWISS-PROT) is a string of letters and digits that generally is interpretable in some meaningful way by a human, for instance as a recognizable abbreviation of the full protein or gene name.
SWISS-PROT uses a system where the entry name consists of two parts: the first denotes the protein and the second denotes the species it is found in. For example, KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo sapiens.
An identifier can change. For example, the database curators may decide that the identifier for an entry is no longer appropriate. However, this happens so rarely that it is not really a big problem.
Accession code (number)
An accession code (or number) is a number (possibly with a few characters in front) that uniquely identifies an entry in its database. For example, the accession code for KRAF_HUMAN in SWISS-PROT is P04049.
The main conceptual difference from the identifier is that it is supposed to be stable: any given accession code will, as soon as it has been issued, always refer to that entry, or its ancestors. It is often called the primary key for the entry. The accession code, once issued, must always point to its entry, even after large changes have been made to the entry. This means that in discussions about specific database entries (e.g. an article about a specific protein), one should always give the accession code for the entry in the relevant database.
In the case where two entries are merged into one single, then the new entry will have both accession codes, where one will be the primary and the other the secondary accession code. When an entry is split into two, both new entries will get new accession codes, but will also have the old accession code as secondary codes.
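The identifier and accession rules described above can be sketched as a small data structure. This is a toy model for illustration only, not how SWISS-PROT or GenBank is implemented. The class names, the `merge` method, the renamed identifier, and the codes X00001 and X00002 are all hypothetical. Only KRAF_HUMAN and P04049 come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    identifier: str   # human-readable name, may be changed by curators
    accession: str    # primary accession code, stable once issued
    secondary: list = field(default_factory=list)  # codes of merged entries


class Database:
    """Toy database in which every accession code resolves to an entry."""

    def __init__(self):
        self._by_accession = {}

    def add(self, entry):
        for code in [entry.accession, *entry.secondary]:
            self._by_accession[code] = entry

    def lookup(self, code):
        return self._by_accession[code]

    def merge(self, keep_code, drop_code, identifier):
        """Merge two entries: one code stays primary, the other becomes
        secondary, and both keep pointing at the merged entry."""
        keep, drop = self.lookup(keep_code), self.lookup(drop_code)
        merged = Entry(
            identifier,
            keep.accession,
            keep.secondary + [drop.accession] + drop.secondary,
        )
        self.add(merged)
        return merged


db = Database()
db.add(Entry("KRAF_HUMAN", "P04049"))

# A curator renames the entry (hypothetical new name); P04049 still works.
db.lookup("P04049").identifier = "RENAMED_HUMAN"
print(db.lookup("P04049").identifier)

# Two hypothetical entries are merged; the old code remains valid.
db.add(Entry("A_TEST", "X00001"))
db.add(Entry("B_TEST", "X00002"))
merged = db.merge("X00001", "X00002", "AB_TEST")
print(merged.accession, merged.secondary)
```

Because lookups go through accession codes only, renaming an entry never breaks a reference, which is why citing the accession code rather than the identifier is recommended.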
Nucleotide sequence databases
Primary nucleotide sequence databases
The databases EMBL, GenBank, and DDBJ are the three primary nucleotide sequence databases:
They include sequences submitted directly by scientists and genome sequencing groups, and sequences taken from literature and patents. There is comparatively little error checking and there is a fair amount of redundancy.
The entries in the EMBL, GenBank and DDBJ databases are synchronized on a daily basis, and the accession numbers are managed in a consistent manner between these three centers.
The nucleotide databases have reached such large sizes that they are available in subdivisions that allow searches or downloads that are more limited, and hence less time-consuming. For example, GenBank currently has 17 divisions.
There are no legal restrictions on the use of the data in these databases. However, there are some patented sequences in the databases.
EMBL  www.ebi.ac.uk/embl/
The EMBL (European Molecular Biology Laboratory) nucleotide sequence database is maintained by the European Bioinformatics Institute (EBI) in Hinxton, Cambridge, UK.
GenBank  www.ncbi.nlm.nih.gov/Genbank/
The GenBank nucleotide database is maintained by the National Center for Biotechnology Information (NCBI), which is part of the National Institutes of Health (NIH), a federal agency of the US government.
It can be accessed and searched through the Entrez system at NCBI, or one can download the entire database as flat files.
DDBJ  www.ddbj.nig.ac.jp
The DNA Data Bank of Japan began as a collaboration with EMBL and GenBank. It is run by the National Institute of Genetics. One can search for entries by accession number.
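Flat files are plain text, so they can be read with very little code. The sketch below parses a deliberately simplified record in the general style of a GenBank flat file: keyword lines starting in the first column, a sequence block after ORIGIN, and `//` ending the record. The sample record, its accession X00000 and the function name are made up for illustration. Real records carry many more fields (a feature table, references, sub-keywords), which this crude parser simply appends to the preceding keyword.

```python
SAMPLE = """\
LOCUS       EXAMPLE1                  21 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up record for illustration.
ACCESSION   X00000
ORIGIN
        1 cctgagccaa ctattgatga a
//
"""


def parse_flat_file(text):
    """Return one dict per record, with keyword fields and the sequence."""
    records, fields, sequence = [], {}, []
    in_sequence, key = False, None
    for line in text.splitlines():
        if line.startswith("//"):            # end of record
            fields["sequence"] = "".join(sequence).upper()
            records.append(fields)
            fields, sequence, in_sequence, key = {}, [], False, None
        elif in_sequence:                    # drop position numbers and spaces
            sequence.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):      # sequence block starts here
            in_sequence = True
        elif line[:1].strip():               # keyword begins in the first column
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
        elif key is not None and line.strip():   # indented continuation line
            fields[key] += " " + line.strip()
    return records


record = parse_flat_file(SAMPLE)[0]
print(record["ACCESSION"], record["sequence"])
```

For real work an established parser is the better choice; the point here is only that a flat file is a line-oriented text format that a program can scan from top to bottom.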
Other nucleotide sequence databases: secondary databases
The following databases contain subsets of the EMBL/GenBank databases. Some also contain more information or links than the primary ones, or have a different organization of the data to better serve some specific purpose. However, the nucleotide sequences themselves should always be available in the EMBL/GenBank databases. In this sense, the databases below are secondary databases.
UniGene  http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene
The UniGene system attempts to process the GenBank sequence data into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.
SGD  http://www.yeastgenome.org/
The Saccharomyces Genome Database (SGD) is a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae.
EBI Genomes  www.ebi.ac.uk/genomes/
This web site provides access and statistics for the completed genomes, and information about ongoing projects.
Genome Biology  www.ncbi.nlm.nih.gov/Genomes/
The Genome Biology site at NCBI contains information about the available complete genomes.
Ensembl  www.ensembl.org
Ensembl is a joint project between EMBL-EBI and the Sanger Centre to develop a software system which produces and maintains automatic annotation on eukaryotic genomes.
Protein sequence databases
The two protein sequence databases SWISS-PROT and PIR are different from the nucleotide databases in that they are both curated.
This means that groups of designated curators (scientists) prepare the entries from literature and/or contacts with external experts.
SWISS-PROT, TrEMBL  www.expasy.ch/sprot
SWISS-PROT is a protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases.
It was started in 1986 by Amos Bairoch in the Department of Medical Biochemistry at the University of Geneva. This database is generally considered one of the best protein sequence databases in terms of the quality of the annotation. Its size is given in the table below.
TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT. The procedure that is used to produce it was developed by Rolf Apweiler. The annotation of an entry in TrEMBL has not (yet) reached the standards required for inclusion into SWISS-PROT proper.
The SWISS-PROT database has some legal restrictions: the entries themselves are copyrighted, but freely accessible and usable by academic researchers. Commercial companies must buy a license from SIB.
PIR  pir.georgetown.edu
The Protein Information Resource (PIR) is a division of the National Biomedical Research Foundation (NBRF) in the US. It is involved in a collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japanese International Protein Sequence Database (JIPID). The PIR-PSD (Protein Sequence Database) release 70.01 (22 Oct 2000) contains 254,293 entries.
PIR grew out of Margaret Dayhoff's work in the middle of the 1960s. It strives to be comprehensive, well-organized, accurate, and consistently annotated. However, it is generally believed that its entry annotation does not reach the level of completeness of SWISS-PROT. Although SWISS-PROT and PIR overlap extensively, there are still many sequences which can be found in only one of them.
One can search for entries or do sequence similarity searches at the PIR site.
PIR also produces NRL-3D, which is a database of sequences extracted from the three-dimensional structures in the Protein Databank (PDB).
It appears that the PIR web site, and possibly also the underlying database, has improved considerably over the past year. This means that if one is interested in protein sequences, there is now even more reason to check out PIR.
Other relevant databases
GeneCards  www.genecards.org
GeneCards is a database of human genes, their products and their involvement in diseases. It offers concise information about the functions of all human genes that have an approved symbol, as well as selected others. It is a typical example of a secondary database, which contains many links to other databases, and attempts to consolidate the information that is available for a specific class of entity, in this case human genes.
GeneLynx  www.genelynx.org
GeneLynx is a database of Web links for human genes. It contains pointers to a large number of other databases. This is also a typical secondary database. It is maintained by Boris Lenhard and Wyeth Wasserman at CGB, KI, Sweden.
KEGG  www.genome.ad.jp/kegg/
The Kyoto Encyclopedia of Genes and Genomes (KEGG) is an effort to computerize current knowledge of molecular and cellular biology in terms of the information pathways that consist of interacting molecules or genes and to provide links from the gene catalogs produced by genome sequencing projects.
Amos' WWW links page  www.expasy.org/links.html
A page of many links to biological databases and/or web sites.
POPULAR BIOINFORMATICS DATABASES
http://everest.bic.nus.edu.sg/~bhuvana/lsm2104/popualr-bioinformatics-databases.htm
Class 6
10/07/08
Growth of GenBank database
[Figure: base pairs (billions) and sequences (millions) in GenBank, plotted by year]
NCBI  www.ncbi.nlm.nih.gov
Created in 1988 as part of the National Library of Medicine at NIH
  Establish public databases
  Research in computational biology
  Develop software tools for sequence analysis
  Disseminate biomedical information
Types of databases at NCBI
Primary databases
  Original submissions by experimentalists
  Content controlled by the submitter
  Examples: GenBank, SNP, GEO
Derivative databases
  Built from primary data
  Content controlled by third party (NCBI)
  Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain, Gene
Entrez
Literature & Text
PubMed
  15 million citations in MEDLINE
  Links to participating online journals
Books
  Linked from PubMed and other records
  Searchable from within Entrez
Nucleotide databases (GenBank, RefSeq, PDB)
Category    Database                 Count
Primary     GenBank / EMBL / DDBJ    54,694,591
Derivative  RefSeq                   1,132,972
Derivative  Third Party Annotation   4,763
Derivative  PDB                      5,887
Total                                55,838,213
EMBL / GenBank / DDBJ (EMBL: European Molecular Biology Laboratory)
Archive containing all sequences from:
- genome projects
- sequencing centers
- individual scientists
- patent offices
Database is doubling every 15 months
Sequences from >200,000 different species
>1000 new species added every month
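To get a feel for what a 15-month doubling time means, the growth factor over any period can be computed directly. This is only an arithmetic sketch; the function name `growth_factor` is made up for illustration, and it assumes the doubling time stays constant.

```python
def growth_factor(months: float, doubling_time_months: float = 15.0) -> float:
    """Factor by which a database grows in `months`, assuming a constant doubling time."""
    return 2.0 ** (months / doubling_time_months)


# After 5 years (60 months) with a 15-month doubling time: 2**4 = 16-fold growth.
print(growth_factor(60))
# The same period with an 18-month doubling time gives a smaller factor.
print(growth_factor(60, doubling_time_months=18.0))
```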
Protein Databases
GenPept: CDS translations from GenBank entries
TrEMBL (1996): automatic CDS translations from EMBL
- Highly redundant
- Not all experimentally determined
- Many inaccuracies
Secondary protein database
SWISS-PROT (1986) Best annotated, least redundant
PIR (Protein Information Resource) More automated annotation Collaborations with MIPS and JIPID
UniProt (2003)
UniProt (Universal Protein Resource) is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.
UniProt Knowledgebase (UniProtKB)
- Central access point for extensive curated protein information, including function, classification, and cross-reference.
UniProt Non-redundant Reference (UniRef)
- Set of databases that combine closely related sequences into a single record to speed searches.
UniProt Archive (UniParc)
- Comprehensive repository, reflecting the history of all protein sequences.
- No annotation, used internally
NCBI Derivative Sequence Data
[Figure: diagram with example sequence fragments; labels: Labs, GenBank, Algorithms, UniGene, Curators, RefSeq, Genome Assembly]
High-throughput DNA sequencing
Top image: confocal detection by the MegaBACE sequencer of fluorescently labeled DNA
Bottom image: computer image of sequence read by automated sequencer
The trend of data growth
[Figure: nucleotides (billions) by year, 1980 to 2000]
The 21st century is a century of biotechnology & bioinformatics:
Microarray: Global expression analysis: RNA levels of every gene in the genome analyzed in parallel.
Proteomics: Global protein analysis generates large mass spectra libraries.
Metabolomics: Global metabolite analysis: 25,000 secondary metabolites characterized.
Genomics: New sequence information is being produced at increasing rates. (The contents of GenBank double every one and a half years.)
Glycomics: Global sugar metabolism analysis.
How to handle the large amount of information?
Drew Sheneman, New Jersey--The Newark Star Ledger
Answer: bioinformatics and Internet
Bioinformatics – NEED FOR ALGORITHM?
IBM 7090 computer
In the 1960s: the birth of bioinformatics
Margaret Oakley Dayhoff created:
- The first protein database
- The first program for sequence assembly
There is a need for computers and algorithms that allow: access, processing, storing, sharing, retrieving, visualizing, annotating…
Why do we need the Internet?
“Omics” projects and the information associated with them involve a huge amount of data that is stored on computers all over the world.
Because it is impossible to maintain up-to-date copies of all relevant databases within the lab, access to the data is via the Internet.
[Figure: diagram of a database query over the Internet; labels: You are here, Your request, Database storage, Results]
Scope
Make you familiar with bioinformatics resources available on the web
They are big databases and searching either one should produce similar results because they exchange information routinely.
-GenBank (NCBI): http://www.ncbi.nlm.nih.gov
-DDBJ (DNA DataBase of Japan): http://www.ddbj.nig.ac.jp
-TIGR: http://tigr.org/tdb/tgi
-Yeast: http://yeastgenome.org
-E. coli: http://colibase.bham.ac.uk/blast/
Specialized databases: Tissues, species…
-ESTs (Expressed Sequence Tags)
  ~at NCBI: http://www.ncbi.nlm.nih.gov/dbEST
  ~at TIGR: http://tigr.org/tdb/tgi
- ...many more!
Protein (amino acid) databases
They are big databases too:
-Swiss-Prot (very high level of annotation): http://au.expasy.org/
-PIR (Protein Information Resource), the world's most comprehensive catalog of information on proteins: http://www.pir.uniprot.org/
Translated databases:
-TrEMBL (translated EMBL): includes entries that have not yet been annotated into Swiss-Prot. http://www.ebi.ac.uk/trembl/access.html
-GenPept (translation of coding regions in GenBank)
-pdb (sequences derived from the 3D structures in the Brookhaven PDB): http://www.rcsb.org/pdb/
Database homology searching
Uses algorithms that give searches an efficient mathematical basis, so that the results can be expressed in terms of statistical significance.
Assumes that sequence, structure, and function are inter-related.
All similarity searching methods rely on the concepts of alignment and distance between sequences.
A similarity score is calculated from a distance: the number of DNA bases or amino acids that are different between two sequences.
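As a minimal sketch of the distance idea, the snippet below counts differing positions between two sequences that are already aligned to the same length and turns that count into a percent identity. The function names and the two example sequences are illustrative assumptions, not part of any particular search tool.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length, pre-aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def percent_identity(a: str, b: str) -> float:
    """Similarity derived from the distance: share of identical positions, in percent."""
    return 100.0 * (len(a) - hamming_distance(a, b)) / len(a)


seq1 = "ATTGACTA"
seq2 = "ATTCACTT"
print(hamming_distance(seq1, seq2))   # 2 positions differ
print(percent_identity(seq1, seq2))   # 75.0
```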
Calculating alignment scores
Scoring system: Uses scoring matrices that allow biologists to quantify the quality of sequence alignments.
The raw score S is calculated by summing the scores for each aligned position and the scores for gaps. Gap creation/extension scores are inherent to the scoring system in use (BLAST, FASTA…)
The score for an identity or a mismatch is given by the specified substitution matrix (e.g., BLOSUM62).
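The raw score described above can be sketched as a sum over alignment columns. To keep the example short, a flat match/mismatch score stands in for a real substitution matrix such as BLOSUM62, and the gap penalties are arbitrary illustrative values; none of these numbers are the defaults of BLAST or FASTA.

```python
def raw_score(aligned_a: str, aligned_b: str, match: int = 2, mismatch: int = -1,
              gap_open: int = -5, gap_extend: int = -1) -> int:
    """Sum per-column scores of a gapped alignment ('-' marks a gap).

    gap_open is charged for the first position of a gap, gap_extend for
    each further position of the same gap (assumed, simplified scheme).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_in = None  # which sequence currently has an open gap: "a", "b" or None
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            score += gap_extend if gap_in == side else gap_open
            gap_in = side
        else:
            score += match if x == y else mismatch
            gap_in = None
    return score


# Five matches (+10), one gap of length two (-5 - 1): S = 4
print(raw_score("ACGTTGA", "AC--TGA"))
```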
Devising a scoring system
How the matrices were created:
- Very similar sequences were aligned.
- From these alignments, the frequency of substitution between each pair of amino acids was calculated and then PAM1 was built.
- The full series of PAM matrices is calculated by multiplying the PAM1 mutation probability matrix by itself (PAM-n is PAM1 raised to the n-th power); each result is then converted to log-odds format.
Some popular scoring matrices are:
- PAM (Percent Accepted Mutation): for evolutionary studies. For example, PAM1 corresponds to 1 accepted point mutation per 100 amino acids.
- BLOSUM (BLOcks amino acid SUbstitution Matrix): for finding common motifs. For example, BLOSUM62 is derived from alignments of sequences sharing no more than 62% identity.
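The PAM extrapolation step can be illustrated with a toy example. The 3-letter "alphabet" and the numbers in `TOY_PAM1` below are invented for illustration (a real PAM1 is a 20 x 20 matrix estimated from data); the point is only that PAM-n probabilities come from multiplying the PAM1 mutation probability matrix by itself n times.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Raise square matrix m to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical PAM1-like matrix: each row gives the probability that a residue
# stays the same (diagonal) or is replaced by another one; rows sum to 1.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

At larger evolutionary distances the diagonal shrinks and the off-diagonal entries grow, which is why high-numbered PAM matrices suit distantly related sequences.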
Devising a scoring system
Importance:
- Scoring matrices appear in all analyses involving sequence comparison.
- The choice of matrix can strongly influence the outcome of the analysis.
- Understanding the theory underlying a given scoring matrix can aid in making the proper choice:
  -Some matrices reflect similarity: good for database searching
  -Some reflect distance: good for phylogenies
Log-odds matrices, a normalisation method for matrix values:
\[ S_{ij} = \log\left(\frac{q_{ij}}{p_i\,p_j}\right) \]
S_ij is the logarithm of the ratio between the probability that two residues, i and j, are aligned by evolutionary descent and the probability that they are aligned by chance. q_ij is the frequency with which i and j are observed to align in sequences known to be related. p_i and p_j are their frequencies of occurrence in the set of sequences.
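A log-odds score compares the observed pair frequency q_ij with the frequency p_i * p_j expected by chance, conventionally as S_ij = log(q_ij / (p_i p_j)). The sketch below uses base-2 logarithms and a scale factor of 2 (half-bit units); both choices, and the example frequencies, are assumptions made for illustration.

```python
import math


def log_odds(q_ij: float, p_i: float, p_j: float, scale: float = 2.0) -> float:
    """scale * log2 of observed pair frequency over the frequency expected by chance."""
    return scale * math.log2(q_ij / (p_i * p_j))


# Pair seen twice as often as chance predicts: positive score (about +2).
print(log_odds(q_ij=0.02, p_i=0.1, p_j=0.1))
# Pair seen half as often as chance predicts: negative score (about -2).
print(log_odds(q_ij=0.005, p_i=0.1, p_j=0.1))
```

Positive scores mark substitutions that occur more often in related sequences than chance would predict, negative scores mark those that occur less often.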
Database search methods: Sequence Alignment
Two broad classes of sequence alignments exist:
Global alignment: not sensitive
Local alignment: faster
Example:
QKESGPSSSYC
VQQESGLVRTTC
Local alignment:
ESG
ESG
The most widely used local similarity algorithms are:
- Smith-Waterman (http://www.ebi.ac.uk/MPsrch/)
- Basic Local Alignment Search Tool (BLAST, http://www.ncbi.nih.gov)
- Fast Alignment (FASTA, http://fasta.genome.jp; http://www.ebi.ac.uk/fasta33/; http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl)
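A compact sketch of Smith-Waterman local alignment is given below. It uses a flat match/mismatch score and a linear gap penalty (illustrative values, not those of any server listed above) instead of a substitution matrix with affine gaps, so it shows the dynamic programming idea rather than a production implementation.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned part of a, aligned part of b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score of an alignment ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap
            left = H[i][j - 1] + gap
            H[i][j] = max(0, diag, up, left)  # 0 lets a new local alignment start anywhere
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a cell with score 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# The two sequences from the slide; with this simple scoring the reported
# region may extend beyond the shared ESG core.
print(smith_waterman("QKESGPSSSYC", "VQQESGLVRTTC"))
```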
Which algorithm to use for database similarity search?
Speed: BLAST > FASTA > Smith-Waterman (it is VERY SLOW and uses a LOT OF COMPUTER POWER)
Sensitivity/statistics:
- FASTA is more sensitive and misses fewer homologues
- Smith-Waterman is even more sensitive
- BLAST calculates probabilities
- FASTA is more accurate than BLAST for DNA-DNA searches
Genomics: Completed genomes as of 2002
Currently the genomes of over 600 organisms have been sequenced:
Organism                                Base pairs      Whole-genome shotgun   Map-based
54 Bacteria                             0.8-6 million   +                      –
Yeast                                   15 million      –                      +
C. elegans (roundworm)                  100 million     –                      +
Drosophila (fruitfly)                   120 million     +                      –
Arabidopsis (thale cress)               130 million     –                      +
Rice                                    435 million     –                      +
Human                                   3 billion       +                      +
Fugu (puffer fish)                      365 million     +                      –
Anopheles (malaria-carrying mosquito)   278 million     +                      –
This generates large amounts of information to be handled by individual computers.
The dilemma: DNA or protein?
Is the comparison of two nucleotide sequences accurate?
By translating into amino acid sequence, are we losing information?
The genetic code is degenerate (two or more codons can represent the same amino acid).
Very different DNA sequences may code for similar protein sequences. We certainly do not want to miss those cases!
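The effect of degeneracy can be shown with two made-up DNA fragments that share only 2 of 9 bases yet translate to the same peptide. The codon table below is the standard genetic code written in the usual compact TCAG ordering; the example sequences and helper names are illustrative assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate a DNA string in frame 1; trailing bases that do not fill a codon are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


dna_a = "CTTAGCCGT"   # CTT AGC CGT
dna_b = "TTGTCAAGA"   # TTG TCA AGA
identical = sum(1 for x, y in zip(dna_a, dna_b) if x == y)

print(translate(dna_a), translate(dna_b))            # both give LSR
print(f"{identical} of {len(dna_a)} bases identical")
```

A nucleotide-level search would see little similarity here, while a protein-level comparison sees a perfect match.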
Tools to search databases
Search by similarity:
- Using nucleotide seq.
- Using amino acid seq.
Reasons for translating
Comparing DNA sequences gives more random matches:
[Figure: a good alignment with end-gaps next to a very poor alignment; almost 50% identity!]
Conservation of protein in evolution (DNA similarity decays faster!)
Conclusion:
- It is almost always better to compare coding sequences in their amino acid form, especially if they are very divergent.
- Very highly similar nucleotide sequences may give better results.
BLAST and FASTA variants
FASTA: Compares a DNA query to a DNA database, or a protein query to a protein database.
FASTX: Compares a translated DNA query to a protein database.
TFASTA: Compares a protein query to a translated DNA database.
BLASTN: Compares a DNA query to a DNA database.
BLASTP: Compares a protein query to a protein database.
BLASTX: Compares the 6-frame translations of a DNA query to a protein database.
TBLASTN: Compares a protein query to the 6-frame translations of a DNA database.
TBLASTX: Compares the 6-frame translations of a DNA query to the 6-frame translations of a DNA database (with 6 × 6 frame combinations, each sequence comparison is comparable to 36 BLASTP searches!).
PSI-BLAST: Performs iterative database searches. The results from each round are incorporated into a 'position specific' score matrix, which is used for further searching.
A practical example of sequence alignment: http://www.ncbi.nlm.nih.gov
BLAST results
Detailed BLAST results
E value: the expectation value, i.e. the number of hits with a score at least as good as this one that you would expect to find by chance in a database of this size. The lower the E, the more significant the score.
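The E value is commonly modelled with the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where S is the raw score, m the query length and n the database length. The sketch below applies that relation; the lambda and K values are illustrative assumptions only, since the real parameters depend on the scoring matrix and gap penalties in use.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalised (bit) score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score: float, query_len: int, db_len: int, lam: float, k: float) -> float:
    """Expected number of chance hits scoring at least raw_score: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


LAMBDA, K = 0.318, 0.134   # assumed example parameters, not taken from any specific BLAST run
m, n = 100, 1_000_000      # hypothetical query length and database length

for s in (30, 50, 80):
    print(s, round(bit_score(s, LAMBDA, K), 1), e_value(s, m, n, LAMBDA, K))
```

Higher raw scores give exponentially smaller E values, and a larger database raises E for the same score, which is why E values from different database versions are not directly comparable.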
Database searching tips
Use latest database version.
Use BLAST first, then a finer tool (FASTA,…)
Search both strands when using FASTA.
Translate sequences where relevant
Search 6-frame translation of DNA database
E < 0.05 is statistically significant, usually biologically interesting.
If the query has repeated segments, delete them and repeat search
Most widely used sites for sequence analysis
Sites for alignment of 2 sequences:
- T-COFFEE (http://www.ch.embnet.org/software/TCoffee.html): more accurate than ClustalW for sequences with less than 30% identity.
- ClustalW (http://www.ch.embnet.org/software/ClustalW.html; http://align.genome.jp)
- bl2seq (http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi)
- LALIGN (http://www.ch.embnet.org/software/LALIGN_form.html)
- MultAlin (http://prodes.toulouse.inra.fr/multalin/multalin.html)
Sites for DNA to protein translation: These tools can translate DNA sequences in any of the three forward or three reverse frames.
- Translate (http://au.expasy.org/tools/dna.html)
- Translate a DNA sequence (http://www.vivo.colostate.edu/molkit/translate/index.html)
- Transeq (http://www.ebi.ac.uk/emboss/transeq)
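What such translation tools do can be sketched in a few lines: translate the sequence at offsets 0, 1 and 2, then do the same on the reverse complement. This is a simplified illustration using the standard genetic code; the frame labels (+1 to +3, -1 to -3) and function names are assumptions, and real tools also handle ambiguous bases and alternative genetic codes.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base and reverse the strand."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate from the first base; an incomplete final codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frame_translations(dna: str) -> dict:
    """Return the three forward and three reverse-strand translations keyed by frame label."""
    forward = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


for frame, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
    print(frame, protein)
```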