Erwin M. Bakkerliacs.leidenuniv.nl/~bakkerem2/cmb2015/CMB2015... · 2/11/2015 25 Biology Fundamentals: Cell Biology • A cell is made up of molecular components that can be viewed

2/12/2015

1

Introduction to Molecular Biology

Erwin M. Bakker

Compilation of slides, and pictures taken from [7-11]

Overview

• Introduction

– DNA

– Genes

– Proteins

• Proteins

• Sequencing

• 3D Structure of Proteins

• Micro-Arrays

• Gene finding

2/12/2015

2

Organism -> Cell -> Proteins

Top-down view:

- Organism

- Organ

- Tissue

- Cells

Overview of a Cell

2/12/2015

3

Example of a Protein at Work

2/11/2015 6

The Structure of DNARosalind Franklin, James D. Watson, Francis Crick (1953)

Nucleotides (bases)

• Adenine (A)

• Cytosine (C)

• Guanine (G)

• Thymine (T)

Complementary

Binding:

• T – A

• A – T

• C – G

• G - C

2/12/2015

4

2/11/2015 7

Biology Fundamentals: Genes

• Gene: Contiguous subparts of single

strand DNA that are templates for

producing proteins. Genes can appear in

either of the DNA strand.

– Chromosomes: compact chains of

coiled DNA

• Genome: The set of all genes in a given

organism.

• Noncoding part: The function of DNA

material between genes is largely

unknown. Certain intergenic regions of

DNA are known to play a major role in

cell regulation (controls the production of

proteins and their possible interactions

with DNA).Source: www.mtsinai.on.ca/pdmg/Genetics/basic.htm

2/11/2015 8

Biology Fundamentals: Transcription

• Proteins: Produced from DNA using 3 operations or transformations:

transcription, splicing and translation

– In eukaryotes (cells with nucleus): genes are only a minute part of the total

DNA

– In prokaryotes (cells without nucleus): the phase of splicing does not

occur (no pre-RNA generated)

• DNA is capable of replicating itself (DNA-polymerase)

• Center dogma: The capability of DNA for replication and undergoing the

three (or two) transformations

– Genes are transcribed into pre-RNA by a complex ensemble of molecules

(RNA-polymerase). During transcription T is substituted by the letter U

(for uracil).

– Pre-RNA can be represented by alternations of sequence segments called

exons and introns. The exons represents the parts of pre-RNA that will be

expressed, i.e., translated into proteins.

2/12/2015

5

2/11/2015 9

Biological Information:

From Genes to Proteins

GeneDNA

RNA

Transcription

Translation

Protein Protein folding

genomics

molecular

biology

structural

biology

biophysics

PROTEINS

2/12/2015

6

2/11/2015 11

Biology Fundamentals: Proteins

• Splicing (by spliceosome—an ensemble of proteins): concatenates the exons

and excises introns to form mRNA (or simply RNA)

• Translation (by ribosomes—an ensemble of RNA and proteins)

– Repeatedly considers a triplet of consecutive nucleotides (called codon)

in RNA and produces one corresponding amino acid

– In RNA, there is one special codon called start codon and a few others

called stop codons

• An Open Reading Frame (ORF): a sequence of codons starting with a start

codon and ending with an end codon. The ORF is thus a sequence of

nucleotides that is used by the ribosome to produce the sequence of amino

acid that makes up a protein.

• There are basically 20 amino acids (A, L, V, S, ...) but in certain rare

situations, others can be added to that list.

Creating the proteins in the cell

• Starting from the DNA

-ACTGCATGTACGTAGCTGAC-

-TGACGTACATGCATCGACTG-l l l l l l l l l l l l l l l l l l l l

2/12/2015

7


• Transcriptions: Making a copy of the DNA


• mRNA is a 4 letter alphabet (A, U, T, G)

• Amino acids 20 letter alphabet (All the

amino acids)

• Codons

2/12/2015

8


Structure of a Protein

• Consists of building blocks: amino acids

• Zwitter ion

2/12/2015

9

Side chains of Amino Acids

• Types of side chains:

– Hydrophobic

– Hydrophilic

• Charged

• Non-charged

Amino acid reaction

2/12/2015

10

Amino acid reaction

2/11/2015 20

Biology Fundamentals: 3D Structure

• Since there are 64 different codons and 20 amino acids, the “table look-up”

for translating each codon into an amino acid is redundant: multiple codons

can produce the same amino acid

• The table used by nature to perform translation is called the genetic code

• Due to the redundancy of the genetic code, certain nucleotide changes in

DNA may not alter the resulting protein

• Once a protein is produced, it folds into a unique structure in 3D space, with

3 types of components:α-helices, β-sheets and coils.

• The secondary structure of a protein is its sequence of amino acids,

annotated to distinguish the boundary of each component

• The tertiary structure is its 3D representation

2/12/2015

11

Secondary structure &

Tertiary Structure

DNA

sequence 3D structure protein functions

DNA (gene) →→→ pre-RNA →→→ RNA →→→ Protein

RNA-polymerase Spliceosome Ribosome2/11/2015 22

ACCGACCAAGCGGCGTTCACC

ATGAGGCTGCTGACCCTCCTG

GGCCTTCTG…

TDQAAFDTNIVTLTRFVMEQG

RKARGTGEMTQLLNSLCTAVK

AISTAVRKAGIAHLYGIAGST

NVTGDQVKKLDVLSNDLVINV

LKSSFATCVLVTEEDKNAIIV

EPEKRGKYVVCFDPLDGSSNI

DCLVSIGTIFGIYRKNSTDEP

SEKDALQPGRNLVAAGYALYG

SATML

From Amino Acids to Proteins Functions

2/12/2015

12

Source: fajerpc.magnet.fsu.edu/Education/2010/Lectures/26_DNA_Transcription.htm2/11/2015 23

Biology Fundamentals:

Functional Genomics

• The function of a protein is the

way it participates with other

proteins and molecules in

keeping the cell alive and

interacting with its environment

• Function is closely related to

tertiary structure

• Functional genomics: studies the

function of all the proteins of a

genome

2/11/2015 24

Protein is Function

Produced

mRNA

Reading the

Corresponding

DNA code

Protein at work:

T7 RNA Polymerase

2/12/2015

13

2/11/2015 25

Biology Fundamentals: Cell Biology

• A cell is made up of molecular components that can be viewed as 3D-structures of various shapes

• In a living cell, the molecules interact with each other (w. shape and location). An important type of interaction involves catalysis (enzyme) that facilitate interaction.

• A metabolic pathway is a chain of molecular interactions involving enzymes

• Signaling pathways are molecular interactions that enable communication through the cell’s membrane

Source: www.mtsinai.on.ca/pdmg/images/pairscolour.jpg

Human Genome—23 pairs of chromosomes

2/11/2015 26

The Cell as Computing Device

2/12/2015

14

Alzheimer related Protein Interactions

SEQUENCING

2/12/2015

15

2/11/2015 29

Lab Tools for Determining Bio. Data (I)

• Sequencer: machines capable of reading off a sequence of nucleotides in a

strand of DNA in biological samples

– It can produce 300k base pairs per day at relatively low cost

– A user can order from biotech companies vials containing short

sequences of nucleotides specified by the user

• Since sequences gathered in a wet lab consist of short random segments, one

has to use the shotgun method (a program) to reassemble them

– Difficulty: redundancy of seq. and ambiguity of assembly.

• Mass spectroscopy: identifies proteins by cutting them into short sequences

of amino acids (peptides) whose molecular weights can be determined by a

mass spectrograph, and then computationally infer the constituents of

peptides

Sequencing

1953 Watson and Crick: the structure of the DNA molecule DNA carrier of the genetic information, the challenge of reading the

DNA sequence became central to biological research.

methods for DNA sequencing were extremely inefficient, laborious and costly.

1965 Holley: reliably sequencing the yeast gene for tRNAAla required the equivalent of a full year's work per person per base pair (bp) sequenced (1bp/person year).

1970 Two classical methods for sequencing DNA fragments by Sanger and Gilbert.

2/12/2015

16

Sequencing: DNA Replication

• DNA Polymerase is a family of enzymes carrying out DNA replication.

• A Primer is a short fragment of DNA or RNA that pairs with a template DNA.

• Polymerase Chain Reaction– Mixture of templates and primers is heated such that templates

separate in strands

– Uses a pair of primers spanning a target region in template DNA

– Polymerizes partner strands in each direction from the primers using thermo stable DNA polymerase

– repeat

Sequencing: DNA Replication Enzyme

2/12/2015

17

Sequencing: Primers

2/12/2015

18

Sanger sequencing (sketch): a. The polymerase extends the

labeled primer, randomly incorporating either:

– a normal dCTP base, or

– a modified ddCTP base

At every position where a ddCTP is inserted, polymerization terminates ! => a population of fragments, where the length of each fragment is a function of the relative distance from the modified base to the primer.

b. electrophoretic separation of the products of each of the four reaction tubes (ddG, ddA, ddT, and ddC), run in individual lanes. The bands on the gel represent the respective fragments (Labelled Strands) shown to the right => the complement of the original template (bottom to top)

Sequencing

1980 In the 1980s methods were augmented by • partial automation

• the cloning method, which allowed fast and exponential replication of a DNA fragment.

1990 Start of the human genome project: sequencing efficiency reached 200,000 bp/person-year.

2002 End of the human genome project: 50,000,000 bp/person-year. (Total cost summed to $3 billion.)

Note: Moore’s Law doubling transistor count every 2 years

Here: doubling of #base pairs/person-year every 1.5 years

2/12/2015

19

Sequencing

Recently (2007 – 2011/2012):

• new sequencing technologies next generation sequencing (NGS) or deep sequencing– reliable sequencing of 100x109 bp/person-year.

– NGS now allows compiling the full DNA sequence of a person for ~ $10,000

– within 3-5 years cost is expected to drop to ~$1000.

• Currently full humane genome scan with a coverage of >3 for around 2000 euro.

Next Generation Sequencing Technologies

Next Generation Sequencing technology of Illumina, one of the leading companies in the field:

– DNA is replicated => millions of copies.

– All DNA copies are randomly shredded into various fragments using restriction enzymes or mechanical means.

– The fragments form the input for the sequencing phase.

– Hundreds of copies of each fragment are generated in one spot (cluster) on the surface of a huge matrix.

– Initially it is not known which fragment sits where.

Note: There are quite a few other available technologies available and currently under development.

2/12/2015

20

Next Generation Sequencing

Technologies

(1) A modified nucleotide is added to the complementary DNA strand by a DNA polymerase enzyme.

(2) A laser is used to obtain a read of the nucleotide just added.

(3) The full sequence of a fragment thus determined through successive iterations of the process.

(4) A visualization of the matrix where fragment clusters are attached to flow cells.

The Mapping Problem

The Short Read Mapping Problem

INPUT: m reads S1,…, Sm of length l and an approximate

reference genome R.

QUESTION: What are the positions x1,…, xm along R where each read S1,…, Sm matches, respectively?

For example: after sequencing the genome of a person we want to map it to an existing sequence of the human genome.

• The new sample will not be 100% identical to the reference genome:– natural variation in the population

– some reads may align to their position in the reference with mismatches or gaps

– sequencing errors

– repetitive regions in the genome make it difficult to decide where to map the read

• Humans are diploid organisms– different alleles on the maternal and paternal chromosomes

– two slightly different reads mapping to the same location (some perhaps with mismatches)

2/12/2015

21

41

Motivation: Coding Regions

Human Genome Project

• 3.21 x109 bases (GRCh38, 24-12 2013)

• ~21.000 genes (upper limit ~40k)

• ~2% coding sequences

• The Gold Standard completed in April 2003

• Many contributors

See: http://www.sanger.ac.uk/about/history/hgp/

Both Images by Morag Lewis

3D STRUCTURE OF PROTEIN

2/12/2015

22

2/11/2015 43

3D Structure of Proteins

• The 3D-structure of proteins is mainly determined (costly) by

– X-ray crystallography: X-ray passing through a crystallized sample of that protein, and

– nuclear magnetic resonance (NMR): obtain a number of matrices that express that fact that two atoms are within a certain distance and then deduce a 3D shape

2/11/2015 Data Mining: Principles and

Algorithms44

Molecule of the Month

www.pdb.org

Animated gifs from: proteinexplorer.org

March 2008:

Cadherin

• Adhesive Proteins

• Selective Stickiness:

The red tyrosine

amino acid will

bind to Cadherins

on neighbouring

cells

2/12/2015

23


Algorithms45


www.pdb.org


April 2009:

Oct and Sox

Transcription

Factors

• Determine which genes

will be turned on or off.

• Human contains about

30,000 genes

• There are only about

3000 transcription factors,

1 for every 10 genes


Algorithms46


www.pdb.org


November 2010:

Inteins

In most cases, each gene encodes a single protein,

but cells have found ways around this limitation.

Viruses, with their tiny genomes, often contain genes

that encode long polyproteins, which are then

chopped into a bunch of functional pieces by

enzymes. Inteins are another way that cells make

several proteins from one gene.

2/12/2015

24

Protein 3D Structure: Problem

• ~100.000.000.000 (=1011) DNA sequences

• ~100.000 (=105) 3D structures (in PDB)

• Cost of high resolution X-ray/NMR

– Purification of protein

– Actual observation of protein

– Interpretation of the data

– Only for 1 protein

• DNA Sequencers will produce (part of) the

entire DNA of an organism

2/11/2015 48

Protein Structure Design

Computer Aided Design of a Ligand

Specific to 14-3-3 Gamma Isomorf,

Danio Rerio (Zebra Fish)

by H.S. Faddiev

2/12/2015

25

MICROARRAYS

2/11/2015 50

Microarrays

• Microarrays: determine simultaneously the amount of mRNA production (gene

expression) of thousands of genes. It has 3 phases:

– Place thousands of different one-strand chunks of RNA in minuscule wells

on the surface of a small glass chip

– Spread genetic material obtained by a cell experiment one wishes to

perform

– Use a laser scanner and computer to measure the amount of combined

material and determine the degree (a real number) of gene expression for

each gene on the chip

2/12/2015

26


Algorithms

51

Gene Expression and Microarray

GENE FINDING

2/12/2015

27

DNA

http://en.wikipedia.org/wiki/DNA

53

54

Prokaryotes DNA Structure

2/12/2015

28

55

Eukaryotes DNA Structure

56

Genes

2/12/2015

29

57

Genes: Coding Regions

central dogma:

DNA transcription RNA translation protein

Eukaryotes only ~2% coding … find these regions

Prokaryotes Eukaryotes

no nucleus nucleus

most of genome is coding

H.influenza: 70 % coding

part of genome is coding

Human: ~2% coding

continuous genes introns & exons

Fast replication Slow transcription and

splicing

58

Genese: Transcription

Gene Expression from DNA to Protein

1. Transcription

2. Translation

(1) Transcription• DNA -> mRNA (messenger RNA).

• Takes place from the 5’ end to the 3’ , i.e., downstream (other direction is upstream)

• RNA polymerase enzyme starts a few bases upstream of the coding region, and terminates a few bases after the end of the coding region.

• The upstream and downstream regions that are not used for the code of the protein are called untranslated regions (UTRs)

• RNA polymerase molecules start transcription by recognizing and binding to promoter regions upstream of the transcription start sites.

• Each promoter region has a signal which can encourage or suspend transcription.

2/12/2015

30

59

Genes: Translation

Translation mRNA -> Protein is executed by ribosomes:

• Each triplet of bases in the mRNA is a command for the ribosomes, called codon.

• 64 different possible codons and only 20 amino acids => multiple codons represent the same amino acid.

Ribosome starts scanning the mRNA molecule from 5’ end to 3’ end;

If Ribosome detects start codon (also code for Methionine)

Then

While (Ribosome did not detect any of the three possible stop codons)

do

Generating an amino acid sequence coded by the mRNA;

Scan next codon;

od

Detach the chain of aminoacids from the Ribosome;

Endif

60

Codons

start

AUG

stop

UAA,

UAG,

UGA

UUU

UUC

2/12/2015

31

61

Codons

start

AUG

stop

UAA,

UAG,

UGA

UUU

UUC

62

Reading Frames

AGGCATGCGATCCAAGTTCCACCATGATGACATGATGACTA

5’-end 3’-enddownstream

upstream

TCCGTACGCTAGGTTCAAGGTGGTACTACTGTACTACTGAT

5’-end3’-end downstream

Complementary DNA strand

DNA strand

C G

A T

2/12/2015

32

63

Reading Frames

AGG CAT GCG ATC CAA GTT CCA CCA TGA TGA CAT GAT GAC TAAGGC ATG CGA TCC AAG TTC CAC CAT GAT GAC ATG ATG ACT AA.GCA TGC GAT CCA AGT TCC ACC ATG ATG ACA TGA TGA CTA A..

5’-end 3’-enddownstream

upstream

..T CCG TAC GCT AGG TTC AAG GTG GTA CTA CTG TAC TAC TGA.TC CGT ACG CTA GGT TCA AGG TGG TAC TAC TGT ACT ACT GATTCC GTA CGC TAG GTT CAA GGT GGT ACT ACT GTA CTA CTG ATT

5’-end3’-end downstream

Complementary DNA strand

DNA strand

64

Genes are not Random

motto

Leu Leucine 6 codons 6.9

Ala Alanine 4 6.5

Trp Tryptophan 1 1

codon frequencies (see table next slide)

‘random’ coding

ratios ratios

Also:

A or T in third position of a codon

sometimes in 90% of the cases.

=> Genes are not random!

=>

2/12/2015

33

65

Nucleotide Based Measure

• For CpG islands √

• For Non coding ORFs (NORFs)NORFs Genes

66

Codon Based Measure

NORFs Genes

2/12/2015

34

67

Promotor Regions

Construct statistics fb,i frequency of base b

in position i of known promotor region suffixes =>

… n TTGAC n18 TATAAT n6 N n …

position weight matrix

pos 1 2 3 4 5 6A 2 95 26 59 51 1C 9 2 14 13 20 3G 10 1 16 15 13 0T 79 3 44 13 17 96

Note: there is a 80% correlation between

the weight matrix score of the region and the binding energy.

start

codingTATA box

cf. ‘profile’

68

Promotor Region Detection

‘consensus’ sequence around RNA

transcription start point i.e. not exact

… n TTGAC n18 TATAAT n6 N n …

start

codingTATA box

• Promotor region is an ‘anchor’ point for polymerase, i.e.,

regulatory region that controls transcription rate.

• TATAAT is called TATA- or Pribnow-box.

• Use frequency of the occurrence of these sequences

• Variability of binding sites => no exact method for TATA-

box identification ….

2/12/2015

35

69

Typical Distributions: Vertebrates

Some typical data:

• Average gene length: 30Kb with coding region ~1-2Kb long.

• The average coding region has 6 exons, each ~150bp long.

• The promoter is about 6bp long and appears about 30bp upstream of the transcription start site (TSS).

• Transcription rate of less than 50b/sec

• Splicing process takes several minutes

But huge deviations exist:

• dystrophin is 2.4MB long.

• Blood coagulation-factor VIII has 26 exons whit sizes from 69bp to 3106bp.

70

Typical Distributions: Vertebrates

2/12/2015

36

71

Gene Expression

Polymerase

2/12/2015

37

Histones

2/12/2015

38

75

introns: splicing

• Splicing sites should be precise as a miss would lead to non-sense interpretation of the rest of the sequence.

• So called anchor points in the intron called branch point appears frequently.

• Also pyrimidine (bases C,T) rich areas appear between the branchpoint and the acceptor site.

algorithms based on position specific weight matrices.

does not exploit all the information (reading frames, intron/exon states, etc) and is not suitable for short genes.

Picture from: http://en.wikipedia.org/wiki/RNA_splicing

76

introns: splicing

AGGUAAGU … … CTGAC … … NCAGG …

donor

siteacceptor

sitebranchpoint

< 15 bp >

pyrimidine

rich CT

62 77 100 100 60 74 84 50 63-91 78 100 100 55

intron

consensus sequences / position specific weight matrices

Freq%

2/12/2015

39

77

HMM for Gene Finding

Geometric distribution of the exon state length k:

P(exon of length k) = pk(1-p)

HMM memory-less => modeled length distribution is geometric.

But exon length does not have a geometric distribution! =>

78

Gene Finding

Difficulties

– Low information contents of sequence signals

– Real signals difficult to detect

– Handling of sequencing errors

Prokaryotes:

– Short genes can easily be missed

– Overlapping genes

Eukaryotes:

– Complex gene structure

– Splicing

– Pseudo genes

2/12/2015

40

79

mGene.Web

Gabriele Schweikert, et al., mGene.web: a

web service for accurate computational

gene finding, Nucleic Acids Research, Vol.

37, 2009.

Bibliography

[1] Ben Langmead, Cole Trapnell, Mihai Pop and Steven L. Salzberg. Ultrafast and memory-effcient alignment of short DNA sequences to the human genome. Genome Biology, 2009.

[2] Michael Burrows and David Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.

[3] Brent Ewing and Phil Green. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research, 1998.

[4] Paulo Ferragina and Giovani Manzini. Opportunistic data structures with applications. FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science, 2000.

[5] Heng Li, Jue Ruan and Richard Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research, 2008.

[6] Daniel R. Zerbino and Ewan Birney. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 2008.

[7] Ron Shamir, Computational Genomics Fall Semester, 2010 Lecture 12: Algorithms for Next Generation Sequencing Data, January 6, 2011 Scribe: Anat Gluzman and Eran Mick

2/12/2015

41

Bibliography

[8] J. Pagano, Multi-Scale Analysis of Enhancer Gene Relations in Nine Inbred Rat Strains, Internal Report CS Bioinformatics Track 14-04, August 2014.

[9] D. Zafeiropoulou, Genotypes-Phenotype Predictions in Patients diagnosed with Early Onset Alzheimer, Internal Report CS Bioinformatics Track 14-05, August 2014.

[10] J. Neuteboom, Protein Structure Prediction by Iterative Fragment Assembly (PITA), CS Master Thesis, December 2014.

[11] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Presentation on Chapter 8, 2006. ( www.cs.uiuc.edu/~hanj )

http://www.cs.uiuc.edu/~hanj

Documents

Erwin M. Bakkerliacs.leidenuniv.nl/~bakkerem2/cmb2015/CMB2015... · 2/11/2015 25 Biology Fundamentals: Cell Biology • A cell is made up of molecular components that can be viewed