2/3/2016
1
Computational
Molecular Biology
Alexander P. Goultiaev
Erwin M. Bakker
Organization
Contact:
Alexander P. Goultiaev
Erwin M. Bakker,
email: [email protected]
Website:
For agenda, materials, assignments, references, links, etc.
http://liacs.leidenuniv.nl/~bakkerem2/cmb2016/
Course (6EC):
Lectures
Lecture-Assignments
Assignments (40% of the grade):
• lab assignments (25% of the grade)
• final assignment (15% of the grade)
Exam (60% of the grade; grade >5; closed book exam)
Final grade = 0.4 x grade assignments + 0.6 x grade exam
Overview
General Introduction
Sequence Alignment
Introduction to Molecular Biology
Sequence Alignment and Database Search
Lab: Sequence Alignment
Multiple Sequence Alignment
Lab: Multiple Sequence Alignment
Sequence Alignment Lab Review
Overview
Classical Profiles
Hidden Markov Models
Multiple Sequence Alignment Lab Review
Gene Finding
RNA Structure Prediction
Lab: Structure Prediction
Protein Structure Prediction
BWT and Next Generation Sequencing
The Structure of DNA
Rosalind Franklin, James D. Watson, Francis Crick (1953)
Nucleotides (bases)
• Adenine (A)
• Cytosine (C)
• Guanine (G)
• Thymine (T)
Complementary Binding:
• T – A
• A – T
• C – G
• G – C
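The pairing rules above can be turned into a few lines of code. This is a minimal sketch; the function names `complement` and `reverse_complement` are illustrative choices, not part of the course material.

```python
# Watson-Crick pairing as listed on the slide: A-T, T-A, C-G, G-C.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIR[base] for base in strand.upper())

def reverse_complement(strand: str) -> str:
    """Return the complement read in the opposite direction."""
    return complement(strand)[::-1]

print(complement("AGTCA"))          # TCAGT
print(reverse_complement("AGTCA"))  # TGACT
```

Because the two strands of a DNA molecule run in opposite directions, the reverse complement is usually the form needed when comparing a sequence against the other strand.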
Genes
Gene: Contiguous subparts of single-strand DNA that are templates for producing proteins. Genes can appear on either DNA strand.
Chromosomes: compact chains of coiled DNA.
Genome: The complete DNA of a given organism, containing the set of all its genes.
Noncoding part: The function of the DNA material between genes is largely unknown. Certain intergenic regions of DNA are known to play a major role in cell regulation (they control the production of proteins and their possible interactions with DNA).
Source: www.mtsinai.on.ca/pdmg/Genetics/basic.htm
Central Dogma of Molecular Biology
From Sequence to Function
DNA: {A,C,T,G}*
→ Transcription and Splicing
RNA: {A,C,U,G}*
→ Translation
Protein: {20 symbols}*
Code is Structure is Function
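The mapping between the three alphabets can be sketched in code. This is a simplified illustration under stated assumptions: splicing is ignored, the input is taken to be the coding strand (so transcription reduces to replacing T by U), and translation uses the standard genetic code starting at the first base. The names `transcribe` and `translate` are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' is a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe(dna: str) -> str:
    """Coding-strand DNA to RNA: same sequence with U in place of T."""
    return dna.upper().replace("T", "U")

def translate(rna: str) -> str:
    """Translate RNA codon by codon until a stop codon or the end of the input."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3].upper().replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

rna = transcribe("ATGAGGCTGCTGACCCTCCTG")
print(rna)             # AUGAGGCUGCUGACCCUCCUG
print(translate(rna))  # MRLLTLL
```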
Molecule of the Month
www.pdb.org
Animated gifs from: proteinexplorer.org
March 2008:
Cadherin
• Adhesive Proteins
• Selective Stickiness: The red tyrosine amino acid will bind to Cadherins on neighbouring cells
Molecule of the Month
www.pdb.org
November 2015:
Glutamate-gated Chloride
Receptors
• Receptors of chloride ion channels in the nervous systems of parasites such as worms
• Targets for antiparasitic drugs, because our own cells don't use them
The Cell
Evolution: The Tree of Life
Phylogenetic trees:
• Traditionally: morphology
• New insights from the study of biological sequences, genomes, etc.
From: wikipedia
From: Mark Ragan, Phylogenetics without multiple sequence alignment,
IPAM Workshop on Multiple Sequence Alignment UCLA, 13 January 2015
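Typos (Dutch example):
Dit is een gen. (This is a gene.)
Dit is geen gen. (This is not a gene.)
Dit is een gen. (This is a gene.)
Dit is een zen. (This is a zen.)
Dit is een den. (This is a pine tree.)
q w e r t
a s d f g
z x c v b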
Biological (Sequence) Databases
Efficient algorithms for searching in ever-growing biological databases:
Search for pattern P in a text T.
(Knuth-Morris-Pratt, 1974/1977).
BLAST, FASTA
[Image collage; panel captions:]
SPOT FINDING IN MICROARRAYS
DROSOPHILA: DYNAMIC GENE EXPRESSION PATTERNS
FLUORESCENT TRANSGENIC MOUSE
BUILDING 3D ATLAS
ZEBRA FISH
TAGGING INDIVIDUAL CHROMOSOMES
FISH
TAGGING TELOMERES
PROTEIN GELS
GO ANNOTATION
KEGG PATHWAYS
ENSEMBL CONTIG VIEW
CLUSTER VIEW
~10^12 bp
Overview
String Alignment
t h i s _ i s _ a _ t e s t _ s t r i n g
x x x x x x x x x x x x x x x
t h o s _ i s a _ t e x t _ s t r o n g
t h i s _ i s _ a _ t e s t _ s t r i n g
t d c c
t h o s _ i s a _ t e x t _ s t r o n g
t = typo; c = change leading to different semantics; d = deletion;
Overview
String Alignment
A G T C A      A G T C A      A G T - C A
    s              d                i
A G A C A      A G - C A      A G T A C A
K S Q E T      K S Q E T      K S Q E - T
  s              d                    i
K V Q E T      K - Q E T      K S Q E V T
s = substitution; d = deletion; i = insertion;
DNA sequences.
Protein sequences: Peptide or amino acid sequences.
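The three operations (substitution, deletion, insertion) are exactly the moves counted by the classic edit distance. The following is a minimal dynamic-programming sketch with unit costs; the function name `edit_distance` and the unit cost per operation are assumptions made for illustration, and real alignment tools use scoring matrices and gap penalties instead.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, deletions and insertions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("AGTCA", "AGACA"))   # 1 (substitution)
print(edit_distance("AGTCA", "AGCA"))    # 1 (deletion)
print(edit_distance("AGTCA", "AGTACA"))  # 1 (insertion)
```

The same function works unchanged on protein sequences, since it only compares symbols for equality.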
Overview
Alignment and Database Search
(symmetric)
BLAST Queries:
Overview
Multiple Alignment, Profiles, Gene Finding
Overview Phylogeny
Overview
Physical Mapping (PQ Trees), Sequencing
Overview Hidden Markov Models
MODEL
- OBSERVER -
Overview
Applications of HMM
Speech Recognition
Music Recognition
Weather Prediction
…
Molecular Biology
CpG Island Detection
Protein Profile Alignment
Gene Finding
Copy Number Detection
…
Overview
RNA Structure
RNA Folding Simulations
Overview
Structure Prediction
DNA
sequence 3D structure protein functions
DNA (gene) →→→ pre-RNA →→→ RNA →→→ Protein
RNA-polymerase Spliceosome Ribosome
ACCGACCAAGCGGCGTTCACC
ATGAGGCTGCTGACCCTCCTG
GGCCTTCTG…
TDQAAFDTNIVTLTRFVMEQG
RKARGTGEMTQLLNSLCTAVK
AISTAVRKAGIAHLYGIAGST
NVTGDQVKKLDVLSNDLVINV
LKSSFATCVLVTEEDKNAIIV
EPEKRGKYVVCFDPLDGSSNI
DCLVSIGTIFGIYRKNSTDEP
SEKDALQPGRNLVAAGYALYG
SATML
Overview
Structure Prediction
Primary structure, i.e., the sequence of amino acids, is given/determined.
Secondary structure prediction algorithms consider the residues in a polypeptide chain to be in one of:
Three (helix, strand, coil) states,
Four (helix, strand, coil, turn) states, or even
Eight states
and try to predict the location of these states.
…
Overview
Structure Prediction
Tertiary structure prediction: three general strategies
1. Comparative (homology) modeling. If there is a clear sequence homology between the target and one or more known structures, an algorithm tries to obtain the most accurate structural model for the target, consistent with the known set.
2. Fold recognition: approaches that try to recognize a known fold in a domain within the target protein. Alternative sequence structure alignments are scored using some kind of conformational energy calculation, based on statistics of known structures.
3. Ab initio methods. Modeling of structures using potential energy calculations.
Sequencing
1953 Watson and Crick: the structure of the DNA molecule. With DNA established as the carrier of genetic information, the challenge of reading the DNA sequence became central to biological research.
Early methods for DNA sequencing were extremely inefficient, laborious and costly.
1965 Holley: reliably sequencing the yeast gene for tRNA-Ala required the equivalent of a full year's work per person per base pair (bp) sequenced (1 bp/person-year).
1970s Two classical methods for sequencing DNA fragments, by Sanger and by Gilbert (both published in 1977).
Sanger sequencing (sketch):
a. The polymerase extends the labeled primer, randomly incorporating either:
• a normal dCTP base, or
• a modified ddCTP base
At every position where a ddCTP is inserted, polymerization terminates. The result is a population of fragments, where the length of each fragment is determined by the distance from the primer to the incorporated modified base.
b. Electrophoretic separation of the products of each of the four reaction tubes (ddG, ddA, ddT, and ddC), run in individual lanes. The bands on the gel represent the respective fragments (labelled strands). Reading the bands from bottom to top gives the complement of the original template.
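The read-out logic of the gel can be illustrated with a small, idealised simulation. It is a sketch under simplifying assumptions that are not in the slides: every possible termination is observed exactly once, the primer length is ignored, and the template is written in the order in which the polymerase reads it. The function names `sanger_lanes` and `read_gel` are hypothetical.

```python
from typing import Dict, List

PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def sanger_lanes(template: str) -> Dict[str, List[int]]:
    """For each ddNTP tube, list the lengths of the terminated fragments."""
    lanes: Dict[str, List[int]] = {base: [] for base in "ACGT"}
    for length, template_base in enumerate(template.upper(), start=1):
        incorporated = PAIR[template_base]  # base added opposite the template
        lanes[incorporated].append(length)  # a ddNTP here ends the fragment
    return lanes

def read_gel(lanes: Dict[str, List[int]]) -> str:
    """Read bands from shortest (bottom of the gel) to longest (top)."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)

lanes = sanger_lanes("AGTCA")
print(lanes)            # {'A': [3], 'C': [2], 'G': [4], 'T': [1, 5]}
print(read_gel(lanes))  # TCAGT
```

The string read from the gel is the complement of the template, which matches the statement in step b.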
Next Generation Sequencing Technologies
(1) A modified nucleotide is added to the complementary DNA strand by a DNA polymerase enzyme.
(2) A laser is used to obtain a read of the nucleotide just added.
(3) The full sequence of a fragment is thus determined through successive iterations of the process.
(4) A visualization of the matrix where fragment clusters are attached to flow cells.
Next Generation Sequencing
Picture from: A. Gritsenko, Scaffolding of next-generation sequencing assemblies using diverse information sources, MSc Thesis, Leiden, 2011
Overview
Exact Pattern Matching, Aho-Corasick, Suffix Trees
Exact Matching
Search a pattern P in text T,
where P and T are strings
Knuth Morris Pratt
Preprocess pattern P
Aho Corasick
Pattern P is set of strings {P1,…,Pr}
Suffix Trees
Preprocess text T, where T can be a set of texts,
i.e., a database of texts
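As an illustration of the "preprocess pattern P" idea, here is a compact sketch of Knuth-Morris-Pratt matching. The function names `failure_function` and `kmp_search` are illustrative; Aho-Corasick and suffix trees are not covered by this example.

```python
from typing import List

def failure_function(pattern: str) -> List[int]:
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text: str, pattern: str) -> List[int]:
    """Return the start positions of all (possibly overlapping) occurrences."""
    if not pattern:
        return []
    fail = failure_function(pattern)
    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits

print(kmp_search("AGTCAAGTCAAGTCA", "AGTCA"))  # [0, 5, 10]
```

The preprocessing takes time proportional to the length of P and the scan takes time proportional to the length of T, because the text pointer never moves backwards.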
Molecular Biology Databases
UniProt / SWISS-PROT
Genbank (NCBI, America)
EMBL (Europe)
DDBJ (Japan)
ENTREZ (NCBI)
Others
Sequence Databases
First molecular biology database:
Atlas of Protein Sequence and Structure (1965)
As a result of novel efficient DNA sequencing techniques:
Early 1980s: three major nucleotide sequence databases
EMBL (European Molecular Biology Laboratory)
Genbank (America, National Institutes of Health)
DDBJ (DNA Data Bank of Japan)
See www.oxfordjournals.org/nar/database/c/ for an overview of a collection of 1552 Molecular Biology Databases
Sequence Databases
The three major nucleotide sequence databases
EMBL (European)
Genbank (America)
DDBJ (Japan)
In 1988 a common (flat) file format was defined
Currently all sequences are exchanged on a daily basis between these three databases
Submission of a sequence entry at one of the databases
The database that ‘owns’ the submission will update the sequence in the remaining two databases
More than 300 000 organisms
Genbank Flat-file Format
an elementary unit of a sequence database
represents a single sequence entry
used to exchange data between the databases
GenBank, EMBL, and DDBJ use flat-files that have only small syntactic differences:
EMBL flat files have line-type prefixes at every line
Etc. …
Genbank Flat-file Format
The flat-file has the following global structure:
Header
Description of the record
Features
Annotations on the record
Sequence
The nucleotide sequence itself
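The three-part structure can be made concrete with a small parser sketch. It assumes the usual GenBank layout in which the feature table begins at a line starting with `FEATURES`, the sequence begins after a line starting with `ORIGIN`, and the record ends with `//`. The record `TOY_RECORD` and the function `split_flat_file` are hypothetical and for illustration only; the record is not a real database entry.

```python
from typing import List, Tuple

# Hypothetical toy record, not a real GenBank entry.
TOY_RECORD = """\
LOCUS       TOY00001                  15 bp    DNA     linear   SYN 01-JAN-2016
DEFINITION  Hypothetical toy record for illustration only.
ACCESSION   U12345
VERSION     U12345.1
FEATURES             Location/Qualifiers
     source          1..15
ORIGIN
        1 agtcaagtca agtca
//
"""

def split_flat_file(record: str) -> Tuple[List[str], List[str], str]:
    """Split one flat-file record into header lines, feature lines and sequence."""
    header: List[str] = []
    features: List[str] = []
    sequence_lines: List[str] = []
    section = header
    for line in record.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if line.startswith("FEATURES"):  # feature table starts here
            section = features
        elif line.startswith("ORIGIN"):  # sequence starts on the next line
            section = sequence_lines
            continue
        section.append(line)
    # Sequence lines carry position numbers and blanks; keep only the letters.
    sequence = "".join(ch for line in sequence_lines for ch in line if ch.isalpha())
    return header, features, sequence.upper()

header, features, sequence = split_flat_file(TOY_RECORD)
print(len(header), len(features), sequence)  # 4 2 AGTCAAGTCAAGTCA
```

For real work an established parser is preferable; this sketch only shows how the Header, Features and Sequence parts are delimited.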
Genbank Flat-file Format
Header
The description of the sequence, such as the name of the organism, gene name, sequence length, keywords, and literature reference data
Identifiers for searching
Accession number ( www.ncbi.nlm.nih.gov/Sequin/acc.html )
• Combination of 1 letter and 5 digits, or 2 letters and 6 digits (U12345, AF123456), indicating the ‘owner’ of the nucleotide sequence.
• Does not change, even if the data in the record changes, for example due to a sequence correction.
• A version number is used in those cases: U12345.1; U12345.2
GI number (GenInfo number)
• Identifier unique for every sequence
• If a sequence changes, a new GI number is assigned
ORF numbers
• For every open reading frame (ORF) encoded by a nucleotide sequence, a separate number is used.
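The accession number rule stated above (1 letter and 5 digits, or 2 letters and 6 digits, optionally followed by a version number) can be checked with a regular expression. This sketch covers only the two patterns named on the slide; other accession formats are not handled, and the function name `parse_accession` is illustrative.

```python
import re
from typing import Optional, Tuple

# One letter + 5 digits, or two letters + 6 digits, with an optional ".version".
ACCESSION = re.compile(
    r"^(?P<acc>[A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(?P<version>\d+))?$"
)

def parse_accession(text: str) -> Optional[Tuple[str, Optional[int]]]:
    """Return (accession, version) or None if the text does not match."""
    match = ACCESSION.match(text.strip())
    if match is None:
        return None
    version = match.group("version")
    return match.group("acc"), (int(version) if version else None)

print(parse_accession("U12345"))    # ('U12345', None)
print(parse_accession("U12345.2"))  # ('U12345', 2)
print(parse_accession("AF123456"))  # ('AF123456', None)
print(parse_accession("U1234"))     # None
```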
Genbank Flat-file Format
Features
Contain the relevant biological annotations of the sequence
Encoded proteins
RNA sequences
Signals such as binding sites
protein binding sites
mRNA
Exons
introns, etc.
Genbank Flat-file Format
Data fields linkage
Many data fields provide links to other databases such as
Pubmed literature database
Protein databases
Etc.
These are hard links between entries from different sequence databases that share the same data fields of the ENTREZ retrieval system
ENTREZ Retrieval System
Maintained by the American National Center for Biotechnology Information (NCBI)
ENTREZ provides unified access to various databases and resources
Access to the primary database of nucleotide sequences
Access to curated databases
RefSeq, non-redundant reference sequences
ENTREZ Gene, gene-centered information
Etc.
NCBI ENTREZ (GQuery) http://www.ncbi.nlm.nih.gov/gquery/
NCBI http://www.ncbi.nlm.nih.gov/
EMBL http://www.embl.org/
EMBL-EBI http://www.ebi.ac.uk/
EMBL-EBI http://www.ebi.ac.uk/services
Programmatic access for processing pipelines.
GenBank
http://www.ncbi.nlm.nih.gov/genbank/
Genbank Growth http://www.ncbi.nlm.nih.gov/genbank/statistics
Dec. 2015: 203 939 111 071
DDBJ http://www.ddbj.nig.ac.jp/index-e.html
SWISS-PROT (UniProtKB)
http://www.uniprot.org/
SWISS-PROT
By University of Geneva and EMBL
Created in 1987, one of the oldest and most used protein sequence databases
Access via:
Entrez (NCBI)
SRS (European Bioinformatics Institute (EBI))
It is a secondary database
Manually curated
Adds value to data available in primary nucleotide sequence database
Only very small portion of data is directly submitted to SWISS-PROT
SWISS-PROT
Data
Core Data
Sequence
Bibliography
Taxonomic information
Annotation
Protein function
Domains
Specific sites
Etc.
SWISS-PROT
ENTREZ
ENTREZ Protein entries from SWISS-PROT
Translations from coding regions in GenBank, RefSeq
Hard links to relevant entries in the other databases
Links to similar, neighboring protein sequences
NCBI Example: Lactococcus lactis
http://www.ncbi.nlm.nih.gov/gquery
NCBI
Microbial Genomes
Lactococcus lactis subsp. lactis KF147
Lactococcus lactis. This microbe is a member of the lactic acid bacteria and produces lactic acid from sugars. It is found in many environments including plant and animal habitats. Lactococcus lactis is used as a starter culture for the production of cheese products (such as cheddar) and in milk fermentations and, as such, is one of the most important microbes in the food industry. Many of the important functions for fermentation are encoded on the many conjugative plasmids these bacteria contain. The degradation of casein, acidification by lactic acid, and production of flavor compounds, processes that are caused by the bacteria, contribute to the final product.
Lactococcus lactis subsp. lactis KF147. This strain will be sequenced for comparative genome analysis.
Project ID: 42831
Ensembl http://www.ensembl.org/index.html
Gorilla
http://www.uni-giessen.de/ecoli/IECA/index.php
HOMEWORK
Visit each of the following sites and get an impression of the available tools and data sources:
UniProt / SWISS-PROT
Genbank (NCBI, America)
EMBL (Europe)
DDBJ (Japan)
ENTREZ (NCBI)