
Computational Molecular Biology Lecture 1

liacs.leidenuniv.nl/~bakkerem2/cmb2016/CMB2016_Lecture01.pdf


2/3/2016


Computational Molecular Biology

Alexander P. Goultiaev

Erwin M. Bakker

Organization

Contact:

Alexander P. Goultiaev

Erwin M. Bakker,

email: [email protected]

Website:

For agenda, materials, assignments, references, links, etc.

http://liacs.leidenuniv.nl/~bakkerem2/cmb2016/

Course (6EC):

Lectures

Lecture-Assignments

Assignments (40% of the grade):

• lab assignments (25% of the grade)

• final assignment (15% of the grade)

Exam (60% of the grade; grade >5; closed book exam)

Final grade = 0.4 x grade assignments + 0.6 x grade exam


Overview

General Introduction

Sequence Alignment

Introduction to Molecular Biology

Sequence Alignment and Database Search

Lab: Sequence Alignment

Multiple Sequence Alignment

Lab: Multiple Sequence Alignment

Sequence Alignment Lab Review

Overview

Classical Profiles

Hidden Markov Models

Multiple Sequence Alignment Lab Review

Gene Finding

RNA Structure Prediction

Lab: Structure Prediction

Protein Structure Prediction

BWT and Next Generation Sequencing


The Structure of DNA

Rosalind Franklin, James D. Watson, Francis Crick (1953)

Nucleotides (bases)

• Adenine (A)

• Cytosine (C)

• Guanine (G)

• Thymine (T)

Complementary binding:

• T – A

• A – T

• C – G

• G – C
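The pairing rules above can be written as a lookup table. The sketch below is only an illustration: the function names are made up for this example, and it assumes an input made of the four letters A, C, G and T. Because the two strands of the double helix run in opposite directions, the partner strand is usually reported as the reverse complement.

```python
# Watson-Crick complementary pairs as listed on the slide.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complement read in the opposite direction."""
    return complement(strand)[::-1]


print(complement("AGTCA"))          # TCAGT
print(reverse_complement("AGTCA"))  # TGACT
```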


Genes

Gene: a contiguous stretch of single-stranded DNA that serves as a template for producing a protein. Genes can appear on either DNA strand.

Chromosomes: compact chains of coiled DNA.

Genome: the complete DNA of a given organism, including all of its genes.

Noncoding part: the function of the DNA between genes is largely unknown. Certain intergenic regions are known to play a major role in cell regulation (they control the production of proteins and their possible interactions with DNA).

Source: www.mtsinai.on.ca/pdmg/Genetics/basic.htm


Central Dogma of Molecular Biology

From Sequence to Function

DNA: strings in {A,C,T,G}*

→ transcription and splicing →

RNA: strings in {A,C,U,G}*

→ translation →

Protein: strings in {20 symbols}*

Code is Structure is Function
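The two mappings between the three alphabets can be sketched in a few lines. This is a simplified illustration, not a model of the cell: it skips splicing, treats the input as the coding strand, and uses the standard genetic code written in the compact TCAG ordering. The function names are chosen for this example only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(dna: str) -> str:
    """Map the DNA alphabet {A,C,T,G} to the RNA alphabet {A,C,U,G}."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Translate an RNA string codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3].replace("U", "T")  # the table is keyed on DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("ATGGCCTAA")
print(rna)             # AUGGCCUAA
print(translate(rna))  # MA
```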


Molecule of the Month (www.pdb.org)

Animated gifs from: proteinexplorer.org

March 2008: Cadherin

• Adhesive proteins

• Selective stickiness: the tyrosine amino acid (shown in red) will bind to cadherins on neighbouring cells

Molecule of the Month (www.pdb.org)

November 2015: Glutamate-gated Chloride Receptors

• Receptors of chloride ion channels in the nervous systems of parasites such as worms

• Targets for antibiotics, because our own cells don't use them


The Cell

Evolution: The Tree of Life

Phylogenetic trees:

• Traditionally: morphology

• New insights from the study of biological sequences, genomes, etc.

From: Wikipedia


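From: Mark Ragan, Phylogenetics without multiple sequence alignment, IPAM Workshop on Multiple Sequence Alignment, UCLA, 13 January 2015

Typing errors (Dutch example sentences):

Dit is een gen. (This is a gene.)

Dit is geen gen. (This is not a gene.)

Dit is een gen. (This is a gene.)

Dit is een zen. (This is a zen.)

Dit is een den. (This is a pine tree.)

Keyboard rows (left-hand side):

q w e r t

a s d f g

z x c v b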


Biological (Sequence) Databases

Efficient algorithms for searching in ever-growing biological databases:

Find pattern P in a text T (Knuth-Morris-Pratt, 1974/1977).

BLAST, FASTA

Figure panels: spot finding in microarrays; Drosophila dynamic gene expression patterns; fluorescent transgenic mouse; building a 3D atlas of zebrafish; tagging individual chromosomes (FISH); tagging telomeres; protein gels; GO annotation; KEGG pathways; Ensembl contig view; cluster view; ~10^12 bp

Overview

String Alignment

this_is_a_test_string
  x    xxxxxxxxxxxxxx
thos_isa_text_strong

this_is_a_test_string
  t    d    c     c
thos_is a_text_strong

t = typo; c = change leading to different semantics; d = deletion
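The contrast between the two pictures can be reproduced numerically. The sketch below is only an illustration with made-up function names: the first function compares the strings position by position, as in the row of x marks, and the second computes the Levenshtein edit distance, which allows a deletion and so only charges for the four marked edits.

```python
def positional_mismatches(a: str, b: str) -> int:
    """Count differing positions; unpaired trailing characters also count."""
    paired = sum(x != y for x, y in zip(a, b))
    return paired + abs(len(a) - len(b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # match or substitute
        prev = curr
    return prev[-1]


s = "this_is_a_test_string"
t = "thos_isa_text_strong"
print(positional_mismatches(s, t))  # 15
print(edit_distance(s, t))          # 4
```

A single deleted character shifts everything after it, which is why the position-by-position comparison reports 15 differences while only 4 edits separate the strings.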


Overview

String Alignment

DNA sequences:

AGTCA    AGTCA    AGT-CA
  s        d         i
AGACA    AG-CA    AGTACA

Protein sequences (peptide or amino acid sequences):

KSQET    KSQET    KSQE-T
 s        d           i
KVQET    K-QET    KSQEVT

s = substitution; d = deletion; i = insertion
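Alignments with gaps like the ones above can be computed by dynamic programming. The following is a minimal sketch of Needleman-Wunsch global alignment; the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions made for this example, not values taken from the lecture.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Needleman-Wunsch global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])  # character of a aligned to a gap in b
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")       # gap in a aligned to a character of b
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(global_align("AGTCA", "AGCA"))  # (3, 'AGTCA', 'AG-CA')
```

Which of the two sequences is taken as the original decides whether a gap is read as a deletion or an insertion; the alignment itself is the same.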

Overview

Alignment and Database Search

(symmetric)


Overview

Alignment and Database Search

BLAST Queries:

Overview

Multiple Alignment, Profiles, Gene Finding


Overview Phylogeny

Overview Phylogeny


Overview Phylogeny

Overview

Physical Mapping (PQ Trees), Sequencing


Overview

Physical Mapping (PQ Trees), Sequencing

Overview Hidden Markov Models

MODEL

- OBSERVER -


Overview

Applications of HMM

Speech Recognition

Music Recognition

Weather Prediction

Molecular Biology

CpG Island Detection

Protein Profile Alignment

Gene Finding

Copy Number Detection
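To give a flavour of the CpG island application, here is a toy two-state HMM decoded with the Viterbi algorithm. Everything numeric in it is hypothetical: the state names, the start, transition and emission probabilities are invented for illustration and are not estimated from real data or taken from the lecture.

```python
import math

STATES = ("I", "B")  # I = CpG island, B = background
# All probabilities below are hypothetical, chosen only for illustration.
START = {"I": 0.5, "B": 0.5}
TRANS = {"I": {"I": 0.9, "B": 0.1}, "B": {"I": 0.1, "B": 0.9}}
EMIT = {
    "I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "B": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq: str) -> str:
    """Return the most probable hidden state path for seq (log-space Viterbi)."""
    if not seq:
        return ""
    # best[s] = log-probability of the best path that ends in state s
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for symbol in seq[1:]:
        pointers, new_best = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_best[s] = (best[prev] + math.log(TRANS[prev][s])
                           + math.log(EMIT[s][symbol]))
        back.append(pointers)
        best = new_best
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("ATATATCGCGCGCG"))  # BBBBBBIIIIIIII
```

The observer only sees the nucleotides; the island and background labels are the hidden part of the model that decoding tries to recover.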

Overview

RNA Structure

RNA Folding Simulations


Overview

Structure Prediction

DNA sequence → 3D structure → protein functions

DNA (gene) → pre-RNA → RNA → Protein

(carried out by RNA polymerase, the spliceosome, and the ribosome, respectively)

ACCGACCAAGCGGCGTTCACC

ATGAGGCTGCTGACCCTCCTG

GGCCTTCTG…

TDQAAFDTNIVTLTRFVMEQG

RKARGTGEMTQLLNSLCTAVK

AISTAVRKAGIAHLYGIAGST

NVTGDQVKKLDVLSNDLVINV

LKSSFATCVLVTEEDKNAIIV

EPEKRGKYVVCFDPLDGSSNI

DCLVSIGTIFGIYRKNSTDEP

SEKDALQPGRNLVAAGYALYG

SATML

Overview

Structure Prediction

Primary structure, i.e., the sequence of amino acids, is given/determined.

Secondary structure prediction algorithms consider the residues in a polypeptide chain to be in:

• three states (helix, strand, coil),

• four states (helix, strand, coil, turn), or even

• eight states,

and try to predict the location of these states.


Overview

Structure Prediction

Tertiary structure prediction: three general strategies

1. Comparative (homology) modeling. If there is a clear sequence homology between the target and one or more known structures, an algorithm tries to obtain the most accurate structural model for the target, consistent with the known set.

2. Fold recognition: approaches that try to recognize a known fold in a domain within the target protein. Alternative sequence structure alignments are scored using some kind of conformational energy calculation, based on statistics of known structures.

3. Ab initio methods. Modeling of structures using potential energy calculations.

Sequencing

1953 Watson and Crick: the structure of the DNA molecule. With DNA identified as the carrier of the genetic information, the challenge of reading the DNA sequence became central to biological research.

Early methods for DNA sequencing were extremely inefficient, laborious and costly.

1965 Holley: reliably sequencing the yeast gene for tRNA(Ala) required the equivalent of a full year's work per person per base pair (bp) sequenced (1 bp/person-year).

1977 Two classical methods for sequencing DNA fragments, by Sanger and by Gilbert.


Sanger sequencing (sketch):

a. The polymerase extends the labeled primer, randomly incorporating either:

a normal dCTP base, or

a modified ddCTP base

At every position where a ddCTP is inserted, polymerization terminates. The result is a population of fragments, where the length of each fragment is determined by the distance from the modified base to the primer.

b. Electrophoretic separation of the products of each of the four reaction tubes (ddG, ddA, ddT, and ddC), run in individual lanes. The bands on the gel represent the respective fragments (labelled strands) shown to the right. Reading the bands from bottom to top gives the complement of the original template.
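The logic of reading a sequence from the four lanes can be mimicked with an idealized simulation. The sketch below makes strong simplifying assumptions: the primer is ignored, the template is given in the order in which the polymerase reads it, and every possible termination is assumed to show up as a band. The function names are invented for this illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def sanger_lanes(template: str) -> dict:
    """For each ddNTP lane, list the fragment lengths at which synthesis can stop."""
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(template, start=1):
        # The polymerase adds the complement of the template base; if it picks
        # the dideoxy variant, the fragment ends at this length.
        lanes[COMPLEMENT[base]].append(position)
    return lanes


def read_gel(lanes: dict) -> str:
    """Read bands from shortest to longest fragment to recover the new strand."""
    bands = sorted((length, base) for base, lengths in lanes.items() for length in lengths)
    return "".join(base for _, base in bands)


lanes = sanger_lanes("ATGC")
print(lanes)            # {'A': [2], 'C': [3], 'G': [4], 'T': [1]}
print(read_gel(lanes))  # TACG
```

Shorter fragments travel further in the gel, so sorting by length corresponds to reading the bands from bottom to top.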

Next Generation Sequencing Technologies

(1) A modified nucleotide is added to the complementary DNA strand by a DNA polymerase enzyme.

(2) A laser is used to obtain a read of the nucleotide just added.

(3) The full sequence of a fragment is thus determined through successive iterations of the process.

(4) A visualization of the matrix where fragment clusters are attached to flow cells.


Next Generation Sequencing

Picture from: A. Gritsenko, Scaffolding of next-generation sequencing assemblies using diverse information sources, MSc Thesis, Leiden, 2011

Overview

Exact Pattern Matching, Aho-Corasick, Suffix Trees

Exact Matching

Search for a pattern P in a text T, where P and T are strings

Knuth-Morris-Pratt

Preprocess pattern P

Aho-Corasick

Pattern P is a set of strings {P1,…,Pr}

Suffix Trees

Preprocess text T, where T can be a set of texts, i.e., a database of texts
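As an illustration of what preprocessing the pattern means, here is a compact sketch of Knuth-Morris-Pratt search. The function names are chosen for this example; the failure table records, for each prefix of P, how much of an earlier partial match can be reused after a mismatch, so the text is scanned once without backing up.

```python
def failure_table(pattern: str) -> list:
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text: str, pattern: str) -> list:
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # allow overlapping occurrences
    return matches


print(failure_table("ABABC"))           # [0, 0, 1, 2, 0]
print(kmp_search("AGTCAAGTCA", "GTC"))  # [1, 6]
```

Aho-Corasick generalizes the same idea to a set of patterns, while suffix trees move the preprocessing to the text, which pays off when many patterns are searched in one large database.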


Molecular Biology Databases

UniProt / SWISS-PROT

GenBank (NCBI, America)

EMBL (Europe)

DDBJ (Japan)

ENTREZ (NCBI)

Others

Sequence Databases

First molecular biology database:

Atlas of Protein Sequence and Structure (1965)

As a result of novel efficient DNA sequencing techniques:

Early 1980s: three major nucleotide sequence databases

EMBL (European Molecular Biology Laboratory)

GenBank (America, National Institutes of Health)

DDBJ (DNA Data Bank of Japan)

See www.oxfordjournals.org/nar/database/c/ for an overview of a collection of 1552 molecular biology databases


Sequence Databases

The three major nucleotide sequence databases

EMBL (European)

GenBank (America)

DDBJ (Japan)

In 1988 a common (flat) file format was defined

Currently all sequences are exchanged on a daily basis between these three databases

Submission of a sequence entry at one of the databases

The database that ‘owns’ the submission will update the sequence in the remaining two databases

More than 300 000 organisms

GenBank Flat-file Format

An elementary unit of a sequence database

Represents a single sequence entry

Used to exchange data between the databases

GenBank, EMBL, and DDBJ use flat files that have only small syntactic differences:

EMBL flat files have line-type prefixes at every line

Etc.


GenBank Flat-file Format

The flat-file has the following global structure:

Header

Description of the record

Features

Annotations on the record

Sequence

The nucleotide sequence itself
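The three-part structure can be made concrete with a small splitter. This is a rough sketch, not a full parser: it relies only on the FEATURES and ORIGIN keywords and the `//` end-of-record marker of the GenBank flat-file layout, and the record below is an invented toy example rather than a real database entry.

```python
def split_flat_file(record: str) -> dict:
    """Split one GenBank-style record into header, features and sequence parts."""
    header, features, sequence = [], [], []
    current = header
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            current = features
        elif line.startswith("ORIGIN"):
            current = sequence
            continue  # the ORIGIN line itself carries no sequence data
        elif line.startswith("//"):  # end-of-record marker
            break
        current.append(line)
    # Sequence lines carry position numbers and spaces; keep only the letters.
    bases = "".join(ch for line in sequence for ch in line if ch.isalpha())
    return {"header": header, "features": features, "sequence": bases}


# Invented toy record, shaped like a GenBank entry but not a real one.
TOY_RECORD = """\
LOCUS       TOY000001   21 bp    DNA     linear   UNK
DEFINITION  Made-up example sequence.
FEATURES             Location/Qualifiers
     source          1..21
ORIGIN
        1 accgaccaag cggcgttcac c
//
"""

parts = split_flat_file(TOY_RECORD)
print(len(parts["header"]), len(parts["features"]))  # 2 2
print(parts["sequence"])                             # accgaccaagcggcgttcacc
```

For real work an established parser is the safer choice; the point here is only that a flat file is plain text with keyword-delimited sections.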

GenBank Flat-file Format

Header

The description of the sequence, such as the name of the organism, gene name, sequence length, keywords, and literature reference data

Identifiers for searching

Accession number (www.ncbi.nlm.nih.gov/Sequin/acc.html)

• Combination of 1 letter and 5 digits, or 2 letters and 6 digits (U12345, AF123456), indicating the ‘owner’ of the nucleotide sequence.

• Does not change, even if the data in the record changes, for example due to a sequence correction.

• A version number is used in those cases: U12345.1; U12345.2

GI number (GenInfo number)

• Identifier unique for every sequence

• If a sequence changes, a new GI number is assigned

ORF numbers

• For every open reading frame (ORF) encoded by a nucleotide sequence a separate number is used.
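The accession shapes described above (1 letter with 5 digits, or 2 letters with 6 digits, optionally followed by a version) translate directly into a regular expression. The sketch below covers only these two shapes; other accession formats exist and are deliberately not handled, and the function name is made up for this example.

```python
import re

# The two accession shapes named on the slide, with an optional ".version" suffix.
ACCESSION = re.compile(r"^(?P<acc>[A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(?P<version>\d+))?$")


def parse_accession(text: str):
    """Return (accession, version or None), or None if the text does not match."""
    m = ACCESSION.match(text.strip())
    if m is None:
        return None
    version = m.group("version")
    return m.group("acc"), (int(version) if version else None)


print(parse_accession("U12345"))      # ('U12345', None)
print(parse_accession("U12345.2"))    # ('U12345', 2)
print(parse_accession("AF123456.1"))  # ('AF123456', 1)
print(parse_accession("12345"))       # None
```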


GenBank Flat-file Format

Features

Contain the relevant biological annotations of the sequence

Encoded proteins

RNA sequences

Signals such as binding sites

protein binding sites

mRNA

Exons

introns, etc.

GenBank Flat-file Format

Data fields linkage

Many data fields provide links to other databases such as

Pubmed literature database

Protein databases

Etc.

These are hard links between entries from different sequence databases that share the same data fields of the ENTREZ retrieval system


ENTREZ Retrieval System

Maintained by the American National Center for Biotechnology Information (NCBI)

ENTREZ provides unified access to various databases and resources

Access to the primary database of nucleotide sequences

Access to curated databases:

RefSeq, non-redundant reference sequences

ENTREZ Gene, gene-centered information

Etc.
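Besides the web pages, ENTREZ can also be queried from scripts. The slides do not cover this, so the following is an assumption-laden sketch: it only builds a request URL for NCBI's E-utilities `efetch` endpoint and does not send it. The accession `U12345` is just the example identifier shape used earlier, and the default database and return type are choices made for this illustration.

```python
from urllib.parse import urlencode

EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str, db: str = "nuccore", rettype: str = "gb") -> str:
    """Build an efetch URL asking for one record as plain text."""
    query = urlencode({"db": db, "id": accession, "rettype": rettype, "retmode": "text"})
    return f"{EUTILS_EFETCH}?{query}"


print(efetch_url("U12345"))
# https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=U12345&rettype=gb&retmode=text
```

Opening such a URL, for example with `urllib.request.urlopen`, would return the record text; NCBI's usage guidelines on request rates apply to any automated access.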

NCBI ENTREZ (GQuery) http://www.ncbi.nlm.nih.gov/gquery/


NCBI http://www.ncbi.nlm.nih.gov/

EMBL http://www.embl.org/


EMBL-EBI http://www.ebi.ac.uk/

EMBL-EBI http://www.ebi.ac.uk/services

Programmatic access for processing pipelines.


GenBank

http://www.ncbi.nlm.nih.gov/genbank/

GenBank Growth http://www.ncbi.nlm.nih.gov/genbank/statistics


Dec. 2015:

203 939 111 071

DDBJ http://www.ddbj.nig.ac.jp/index-e.html


DDBJ

SWISS-PROT (UniProtKB)

http://www.uniprot.org/


SWISS-PROT

By University of Geneva and EMBL

Created in 1987, one of the oldest and most used protein sequence databases

Access via:

Entrez (NCBI)

SRS (European Bioinformatics Institute (EBI))

It is a secondary database

Manually curated

Adds value to the data available in the primary nucleotide sequence databases

Only a very small portion of the data is directly submitted to SWISS-PROT

SWISS-PROT

Data

Core Data

Sequence

Bibliography

Taxonomic information

Annotation

Protein function

Domains

Specific sites

Etc.


SWISS-PROT

ENTREZ

ENTREZ Protein entries from SWISS-PROT

Translations from coding regions in GenBank, RefSeq

Hard links to relevant entries in the other databases

Links to similar, neighboring protein sequences

NCBI Example: Lactococcus lactis


http://www.ncbi.nlm.nih.gov/gquery


NCBI

Microbial Genomes

NCBI

Microbial Genomes


NCBI

Microbial Genomes

Lactococcus lactis subsp. lactis KF147

Lactococcus lactis. This microbe is a member of the lactic acid bacteria and produces lactic acid from sugars. It is found in many environments, including plant and animal habitats. Lactococcus lactis is used as a starter culture for the production of cheese products (such as cheddar) and in milk fermentations and, as such, is one of the most important microbes in the food industry. Many of the important functions for fermentation are encoded on the many conjugative plasmids these bacteria contain. The degradation of casein, acidification by lactic acid, and production of flavor compounds, processes that are caused by the bacteria, contribute to the final product.

Lactococcus lactis subsp. lactis KF147. This strain will be sequenced for comparative genome analysis.

Project ID: 42831

NCBI

Microbial Genomes


NCBI

Microbial Genomes

NCBI

Microbial Genomes


Ensembl http://www.ensembl.org/index.html

Ensembl


Ensembl

Gorilla

http://www.uni-giessen.de/ecoli/IECA/index.php


HOMEWORK

Visit each of the following sites and get an impression of the available tools and data sources:

UniProt / SWISS-PROT

GenBank (NCBI, America)

EMBL (Europe)

DDBJ (Japan)

ENTREZ (NCBI)