2/12/2015
1
Introduction to Molecular Biology
Erwin M. Bakker
Compilation of slides, and pictures taken from [7-11]
Overview
• Introduction
– DNA
– Genes
– Proteins
• Proteins
• Sequencing
• 3D Structure of Proteins
• Micro-Arrays
• Gene finding
Organism -> Cell -> Proteins
Top-down view:
- Organism
- Organ
- Tissue
- Cells
Overview of a Cell
Example of a Protein at Work
2/11/2015 6
The Structure of DNA
Rosalind Franklin, James D. Watson, Francis Crick (1953)
Nucleotides (bases)
• Adenine (A)
• Cytosine (C)
• Guanine (G)
• Thymine (T)
Complementary
Binding:
• T – A
• A – T
• C – G
• G - C
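The pairing rules above can be written as a small lookup table. This is a minimal sketch; the function names are illustrative and not part of the slides.

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIR[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """Return the complementary strand written in its own 5'->3' direction."""
    return complement(strand)[::-1]

print(complement("ACTGCATGTACGTAGCTGAC"))  # TGACGTACATGCATCGACTG
```

The two DNA strands run in opposite directions, which is why the reverse complement is usually the more useful form when working with sequence data.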
Biology Fundamentals: Genes
• Gene: Contiguous subparts of single-strand
DNA that are templates for
producing proteins. Genes can appear on
either of the DNA strands.
– Chromosomes: compact chains of
coiled DNA
• Genome: The set of all genes in a given
organism.
• Noncoding part: The function of DNA
material between genes is largely
unknown. Certain intergenic regions of
DNA are known to play a major role in
cell regulation (controls the production of
proteins and their possible interactions
with DNA).
Source: www.mtsinai.on.ca/pdmg/Genetics/basic.htm
Biology Fundamentals: Transcription
• Proteins: Produced from DNA using 3 operations or transformations:
transcription, splicing and translation
– In eukaryotes (cells with nucleus): genes are only a minute part of the total
DNA
– In prokaryotes (cells without nucleus): the phase of splicing does not
occur (no pre-RNA generated)
• DNA is capable of replicating itself (DNA-polymerase)
• Central dogma: The capability of DNA for replication and for undergoing the
three (or two) transformations
– Genes are transcribed into pre-RNA by a complex ensemble of molecules
(RNA-polymerase). During transcription T is substituted by the letter U
(for uracil).
– Pre-RNA can be represented by alternations of sequence segments called
exons and introns. The exons represent the parts of pre-RNA that will be
expressed, i.e., translated into proteins.
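As a rough illustration of transcription (T replaced by U) followed by splicing (exons kept, introns removed), consider the sketch below. The toy gene, its intron, and the function names are hypothetical and chosen only for this example.

```python
def transcribe(coding_strand: str) -> str:
    """Write the pre-RNA for a gene: same letters as the DNA, with T replaced by U."""
    return coding_strand.replace("T", "U")

def splice(pre_rna: str, exons: list[tuple[int, int]]) -> str:
    """Concatenate the exon segments (0-based, end-exclusive); introns are dropped."""
    return "".join(pre_rna[start:end] for start, end in exons)

# Hypothetical toy gene: exon, intron, exon.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTCAG", "GGATGA"
gene = exon1 + intron + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(gene))]

pre_rna = transcribe(gene)
mrna = splice(pre_rna, exons)
print(pre_rna)  # AUGGCCGUAAGUUUCAGGGAUGA
print(mrna)     # AUGGCCGGAUGA
```

In a prokaryote the splicing step would simply be skipped, so the transcript is used as is.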
Biological Information:
From Genes to Proteins
[Diagram: Gene (DNA) -> Transcription -> RNA -> Translation -> Protein -> Protein folding. Fields along the chain: genomics, molecular biology, structural biology, biophysics. PROTEINS]
Biology Fundamentals: Proteins
• Splicing (by spliceosome—an ensemble of proteins): concatenates the exons
and excises introns to form mRNA (or simply RNA)
• Translation (by ribosomes—an ensemble of RNA and proteins)
– Repeatedly considers a triplet of consecutive nucleotides (called codon)
in RNA and produces one corresponding amino acid
– In RNA, there is one special codon called start codon and a few others
called stop codons
• An Open Reading Frame (ORF): a sequence of codons starting with a start
codon and ending with a stop codon. The ORF is thus a sequence of
nucleotides that is used by the ribosome to produce the sequence of amino
acids that makes up a protein.
• There are basically 20 amino acids (A, L, V, S, ...) but in certain rare
situations, others can be added to that list.
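The ORF definition above can be sketched as a simple scan: find a start codon, then step through codons in the same frame until the first stop codon. The start codon AUG and the stop codons UAA, UAG and UGA are the standard ones; the function name and the example sequence are illustrative assumptions.

```python
START = "AUG"
STOPS = {"UAA", "UAG", "UGA"}

def find_orfs(rna: str) -> list[str]:
    """Return every ORF: a start codon, then in-frame codons up to and including the first stop codon."""
    orfs = []
    for i in range(len(rna) - 2):
        if rna[i:i + 3] != START:
            continue
        # Step through the codons of this reading frame.
        for j in range(i, len(rna) - 2, 3):
            if rna[j:j + 3] in STOPS:
                orfs.append(rna[i:j + 3])
                break
    return orfs

print(find_orfs("GGAUGGCCUUUUAAGG"))  # ['AUGGCCUUUUAA']
```

A start codon with no in-frame stop codon after it yields no ORF in this sketch.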
Creating the proteins in the cell
• Starting from the DNA
-ACTGCATGTACGTAGCTGAC-
 ||||||||||||||||||||
-TGACGTACATGCATCGACTG-
Creating the proteins in the cell
• Transcription: making an RNA copy of the DNA
Creating the proteins in the cell
• mRNA uses a 4-letter alphabet (A, U, C, G)
• Amino acids use a 20-letter alphabet (one letter per
amino acid)
• Codons
Creating the proteins in the cell
Structure of a Protein
• Consists of building blocks: amino acids
• Zwitterion
Side chains of Amino Acids
• Types of side chains:
– Hydrophobic
– Hydrophilic
• Charged
• Non-charged
Amino acid reaction
Biology Fundamentals: 3D Structure
• Since there are 64 different codons and 20 amino acids, the “table look-up”
for translating each codon into an amino acid is redundant: multiple codons
can produce the same amino acid
• The table used by nature to perform translation is called the genetic code
• Due to the redundancy of the genetic code, certain nucleotide changes in
DNA may not alter the resulting protein
• Once a protein is produced, it folds into a unique structure in 3D space, with
3 types of components: α-helices, β-sheets and coils.
• The secondary structure of a protein is its sequence of amino acids,
annotated to distinguish the boundary of each component
• The tertiary structure is its 3D representation
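The redundancy of the genetic code can be checked by building the 64-entry lookup table and counting codons per amino acid. The sketch below encodes the standard genetic code as a 64-letter string in U, C, A, G order, which is a common compact representation; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order ('*' marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

codons_per_aa = Counter(CODE.values())
print(len(CODE))  # 64
print(codons_per_aa["L"], codons_per_aa["A"], codons_per_aa["W"])  # 6 4 1

# A change in the third position of a codon often leaves the amino acid unchanged.
print(CODE["GCU"], CODE["GCC"])  # A A
```

Leucine has six codons, alanine four and tryptophan only one, so a nucleotide change is far more likely to be silent in some codons than in others.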
Secondary structure &
Tertiary Structure
DNA sequence -> 3D structure -> protein functions
DNA (gene) -> pre-RNA -> RNA -> Protein
(steps carried out by RNA-polymerase, Spliceosome and Ribosome, respectively)
ACCGACCAAGCGGCGTTCACC
ATGAGGCTGCTGACCCTCCTG
GGCCTTCTG…
TDQAAFDTNIVTLTRFVMEQG
RKARGTGEMTQLLNSLCTAVK
AISTAVRKAGIAHLYGIAGST
NVTGDQVKKLDVLSNDLVINV
LKSSFATCVLVTEEDKNAIIV
EPEKRGKYVVCFDPLDGSSNI
DCLVSIGTIFGIYRKNSTDEP
SEKDALQPGRNLVAAGYALYG
SATML
From Amino Acids to Proteins Functions
Source: fajerpc.magnet.fsu.edu/Education/2010/Lectures/26_DNA_Transcription.htm
Biology Fundamentals:
Functional Genomics
• The function of a protein is the
way it participates with other
proteins and molecules in
keeping the cell alive and
interacting with its environment
• Function is closely related to
tertiary structure
• Functional genomics: studies the
function of all the proteins of a
genome
Protein is Function
[Figure: Protein at work: T7 RNA Polymerase reading the DNA code and producing the corresponding mRNA]
Biology Fundamentals: Cell Biology
• A cell is made up of molecular components that can be viewed as 3D-structures of various shapes
• In a living cell, the molecules interact with each other (through their shape and location). An important type of interaction is catalysis, in which an enzyme facilitates the interaction.
• A metabolic pathway is a chain of molecular interactions involving enzymes
• Signaling pathways are molecular interactions that enable communication through the cell’s membrane
Source: www.mtsinai.on.ca/pdmg/images/pairscolour.jpg
Human Genome—23 pairs of chromosomes
The Cell as Computing Device
Alzheimer related Protein Interactions
SEQUENCING
Lab Tools for Determining Bio. Data (I)
• Sequencer: machines capable of reading off a sequence of nucleotides in a
strand of DNA in biological samples
– It can produce 300k base pairs per day at relatively low cost
– A user can order from biotech companies vials containing short
sequences of nucleotides specified by the user
• Since sequences gathered in a wet lab consist of short random segments, one
has to use the shotgun method (a program) to reassemble them
– Difficulty: redundancy of sequences and ambiguity of assembly.
• Mass spectroscopy: identifies proteins by cutting them into short sequences
of amino acids (peptides) whose molecular weights can be determined by a
mass spectrograph, and then computationally inferring the constituents of
the peptides
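To give a feeling for shotgun reassembly, here is a toy greedy assembler that repeatedly merges the two fragments with the largest suffix/prefix overlap. It is only a sketch under strong assumptions (error-free reads, no repeats, single strand); real assemblers are considerably more involved, and the function names are illustrative.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads: list[str]) -> str:
    """Repeatedly merge the two reads with the largest overlap (toy shotgun assembly)."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        n, a, b = max(
            ((overlap(a, b), a, b) for a in reads for b in reads if a != b),
            key=lambda t: t[0],
        )
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[n:])  # glue b onto a, skipping the shared part
    return reads[0]

reads = ["ACTGCATG", "CATGTACG", "TACGTAGC", "TAGCTGAC"]
print(greedy_assemble(reads))  # ACTGCATGTACGTAGCTGAC
```

The ambiguity mentioned above shows up as soon as two different pairs of reads have equally good overlaps, for example inside a repeated region; the greedy choice can then be wrong.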
Sequencing
1953 Watson and Crick: the structure of the DNA molecule. With DNA established as the carrier of the genetic information, the challenge of reading the
DNA sequence became central to biological research.
Early methods for DNA sequencing were extremely inefficient, laborious and costly.
1965 Holley: reliably sequencing the yeast gene for tRNA-Ala required the equivalent of a full year's work per person per base pair (bp) sequenced (1 bp/person-year).
1970s Two classical methods for sequencing DNA fragments, by Sanger and by Gilbert.
Sequencing: DNA Replication
• DNA Polymerase is a family of enzymes carrying out DNA replication.
• A Primer is a short fragment of DNA or RNA that pairs with a template DNA.
• Polymerase Chain Reaction
– Mixture of templates and primers is heated such that templates
separate into strands
– Uses a pair of primers spanning a target region in template DNA
– Polymerizes partner strands in each direction from the primers using thermostable DNA polymerase
– Repeat
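The PCR steps can be mimicked on strings: locate the forward primer on the given strand, locate the reverse primer via its reverse complement, and take the region in between. This is an idealized sketch (exact primer matches, perfect doubling per cycle); the function names and the toy template are assumptions made for illustration.

```python
from typing import Optional

PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    return "".join(PAIR[b] for b in reversed(seq))

def amplicon(template: str, fwd: str, rev: str) -> Optional[str]:
    """Region of the template spanned by a primer pair, or None if a primer does not bind.

    fwd matches the given strand; rev binds the opposite strand, so its
    reverse complement is searched for on the given strand.
    """
    start = template.find(fwd)
    if start < 0:
        return None
    end = template.find(reverse_complement(rev), start + len(fwd))
    if end < 0:
        return None
    return template[start:end + len(rev)]

def copies_after(cycles: int, initial: int = 1) -> int:
    """Idealized PCR: every cycle doubles the number of target copies."""
    return initial * 2 ** cycles

print(amplicon("TTACGGATCCAGTCAAGG", "ACGG", "TTGA"))  # ACGGATCCAGTCAA
print(copies_after(30))                                # 1073741824
```

The doubling per cycle is what makes the "repeat" step so effective: thirty cycles already give about a billion copies of the target region in the ideal case.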
Sequencing: DNA Replication Enzyme
Sequencing: Primers
Sanger sequencing (sketch):
a. The polymerase extends the labeled primer, randomly incorporating either:
– a normal dCTP base, or
– a modified ddCTP base
At every position where a ddCTP is inserted, polymerization terminates => a population of fragments, where the length of each fragment is a function of the relative distance from the modified base to the primer.
b. Electrophoretic separation of the products of each of the four reaction tubes (ddG, ddA, ddT, and ddC), run in individual lanes. The bands on the gel represent the respective fragments (labelled strands) shown to the right => reading the bands from bottom to top gives the complement of the original template.
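The logic of the four lanes can be simulated deterministically: each ddNTP tube contains fragments ending at every position where that base is incorporated, and sorting all bands by length recovers the synthesized strand. The sketch ignores the primer length and strand orientation, and its function names are illustrative assumptions.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def sanger_lanes(template: str) -> dict[str, list[int]]:
    """For each ddNTP tube, list the lengths of the terminated fragments.

    The polymerase synthesizes the complement of the template; in the ddX tube
    a fragment can stop at every position where base X is incorporated.
    """
    synthesized = "".join(PAIR[b] for b in template)
    lanes: dict[str, list[int]] = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesized, start=1):
        lanes[base].append(length)
    return lanes

def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read the bands from shortest to longest fragment to recover the synthesized strand."""
    bands = sorted((length, base) for base, lengths in lanes.items() for length in lengths)
    return "".join(base for _, base in bands)

lanes = sanger_lanes("TGACGT")
print(lanes)            # {'A': [1, 6], 'C': [2, 5], 'G': [4], 'T': [3]}
print(read_gel(lanes))  # ACTGCA
```

Shorter fragments travel further on the gel, so reading the bands in order of increasing length corresponds to reading the gel from bottom to top.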
Sequencing
1980s Methods were augmented by
• partial automation
• the cloning method, which allowed fast and exponential replication of a DNA fragment.
1990 Start of the human genome project: sequencing efficiency reached 200,000 bp/person-year.
2002 End of the human genome project: 50,000,000 bp/person-year. (Total cost summed to $3 billion.)
Note: Moore’s Law doubling transistor count every 2 years
Here: doubling of #base pairs/person-year every 1.5 years
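The 1.5-year doubling time follows directly from the two efficiency figures of the human genome project; a quick check using only the numbers quoted above:

```python
from math import log2

bp_1990 = 200_000       # bp/person-year at the start of the project
bp_2002 = 50_000_000    # bp/person-year at the end of the project
years = 2002 - 1990

doublings = log2(bp_2002 / bp_1990)   # log2(250), about 7.97 doublings
print(round(years / doublings, 2))    # 1.51 (years per doubling)
```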
Sequencing
Recently (2007 – 2011/2012):
• New sequencing technologies, called next generation sequencing (NGS) or deep sequencing:
– reliable sequencing of 100 × 10^9 bp/person-year.
– NGS now allows compiling the full DNA sequence of a person for ~$10,000
– within 3-5 years cost is expected to drop to ~$1000.
• Currently: a full human genome scan with a coverage of >3 for around 2000 euro.
Next Generation Sequencing Technologies
Next Generation Sequencing technology of Illumina, one of the leading companies in the field:
– DNA is replicated => millions of copies.
– All DNA copies are randomly shredded into various fragments using restriction enzymes or mechanical means.
– The fragments form the input for the sequencing phase.
– Hundreds of copies of each fragment are generated in one spot (cluster) on the surface of a huge matrix.
– Initially it is not known which fragment sits where.
Note: There are quite a few other technologies available and currently under development.
Next Generation Sequencing Technologies
(1) A modified nucleotide is added to the complementary DNA strand by a DNA polymerase enzyme.
(2) A laser is used to obtain a read of the nucleotide just added.
(3) The full sequence of a fragment is thus determined through successive iterations of the process.
(4) A visualization of the matrix where fragment clusters are attached to flow cells.
The Mapping Problem
The Short Read Mapping Problem
INPUT: m reads S1,…, Sm of length l and an approximate
reference genome R.
QUESTION: What are the positions x1,…, xm along R where each read S1,…, Sm matches, respectively?
For example: after sequencing the genome of a person we want to map it to an existing sequence of the human genome.
• The new sample will not be 100% identical to the reference genome:
– natural variation in the population
– some reads may align to their position in the reference with mismatches or gaps
– sequencing errors
– repetitive regions in the genome make it difficult to decide where to map the read
• Humans are diploid organisms:
– different alleles on the maternal and paternal chromosomes
– two slightly different reads mapping to the same location (some perhaps with mismatches)
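A naive solution to the short read mapping problem slides each read along the reference and accepts every position with at most a given number of mismatches. The sketch below handles substitutions only (no gaps) and takes time proportional to the read length times the reference length per read; practical mappers use index structures instead. The function names and the mismatch threshold are illustrative assumptions.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def map_read(read: str, reference: str, max_mismatches: int = 1) -> list[int]:
    """Return all 0-based positions x where the read matches R with at most max_mismatches substitutions."""
    l = len(read)
    return [
        x for x in range(len(reference) - l + 1)
        if hamming(read, reference[x:x + l]) <= max_mismatches
    ]

reference = "AGGCATGCGATCCAAGTTCCACCATGATGACATGATGACTA"
print(map_read("GATCCA", reference, 0))  # [8]
print(map_read("GATCCA", reference, 1))  # [8, 15]
print(map_read("GATGAC", reference, 0))  # [25, 33]
```

The last two calls show the two difficulties named above: allowing mismatches adds candidate positions, and a repeated region gives several exact matches for the same read.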
Motivation: Coding Regions
Human Genome Project
• 3.21 × 10^9 bases (GRCh38, 24-12-2013)
• ~21,000 genes (upper limit ~40k)
• ~2% coding sequences
• The Gold Standard completed in April 2003
• Many contributors
See: http://www.sanger.ac.uk/about/history/hgp/
Both Images by Morag Lewis
3D STRUCTURE OF PROTEIN
3D Structure of Proteins
• The 3D-structure of proteins is mainly determined (costly) by
– X-ray crystallography: X-ray passing through a crystallized sample of that protein, and
– nuclear magnetic resonance (NMR): obtain a number of matrices that express the fact that two atoms are within a certain distance and then deduce a 3D shape
2/11/2015 Data Mining: Principles and
Algorithms44
Molecule of the Month
www.pdb.org
Animated gifs from: proteinexplorer.org
March 2008:
Cadherin
• Adhesive Proteins
• Selective Stickiness:
The red tyrosine
amino acid will
bind to Cadherins
on neighbouring
cells
Molecule of the Month
www.pdb.org
Animated gifs from: proteinexplorer.org
April 2009:
Oct and Sox
Transcription
Factors
• Determine which genes
will be turned on or off.
• The human genome contains about
30,000 genes
• There are only about
3000 transcription factors,
1 for every 10 genes
Molecule of the Month
www.pdb.org
Animated gifs from: proteinexplorer.org
November 2010:
Inteins
In most cases, each gene encodes a single protein,
but cells have found ways around this limitation.
Viruses, with their tiny genomes, often contain genes
that encode long polyproteins, which are then
chopped into a bunch of functional pieces by
enzymes. Inteins are another way that cells make
several proteins from one gene.
Protein 3D Structure: Problem
• ~100,000,000,000 (= 10^11) DNA sequences
• ~100,000 (= 10^5) 3D structures (in PDB)
• Cost of high resolution X-ray/NMR
– Purification of protein
– Actual observation of protein
– Interpretation of the data
– Only for 1 protein
• DNA Sequencers will produce (part of) the
entire DNA of an organism
Protein Structure Design
Computer Aided Design of a Ligand
Specific to 14-3-3 Gamma Isomorf,
Danio Rerio (Zebra Fish)
by H.S. Faddiev
MICROARRAYS
Microarrays
• Microarrays: determine simultaneously the amount of mRNA production (gene
expression) of thousands of genes. An experiment has 3 phases:
– Place thousands of different one-strand chunks of RNA in minuscule wells
on the surface of a small glass chip
– Spread over the chip the genetic material obtained from the cell experiment
one wishes to perform
– Use a laser scanner and computer to measure the amount of combined
material and determine the degree (a real number) of gene expression for
each gene on the chip
Gene Expression and Microarray
GENE FINDING
DNA
http://en.wikipedia.org/wiki/DNA
Prokaryotes DNA Structure
Eukaryotes DNA Structure
Genes
Genes: Coding Regions
Central dogma:
DNA -> (transcription) -> RNA -> (translation) -> protein
Eukaryotes: only ~2% coding … find these regions

Prokaryotes | Eukaryotes
no nucleus | nucleus
most of genome is coding (H. influenzae: 70% coding) | part of genome is coding (Human: ~2% coding)
continuous genes | introns & exons
fast replication | slow transcription and splicing
Genes: Transcription
Gene Expression from DNA to Protein
1. Transcription
2. Translation
(1) Transcription
• DNA -> mRNA (messenger RNA).
• Takes place from the 5’ end to the 3’ end, i.e., downstream (the other direction is upstream)
• RNA polymerase enzyme starts a few bases upstream of the coding region, and terminates a few bases after the end of the coding region.
• The upstream and downstream regions that are not used for the code of the protein are called untranslated regions (UTRs)
• RNA polymerase molecules start transcription by recognizing and binding to promoter regions upstream of the transcription start sites.
• Each promoter region has a signal which can encourage or suspend transcription.
Genes: Translation
Translation mRNA -> Protein is executed by ribosomes:
• Each triplet of bases in the mRNA is a command for the ribosomes, called codon.
• 64 different possible codons and only 20 amino acids => multiple codons represent the same amino acid.
Ribosome starts scanning the mRNA molecule from the 5’ end to the 3’ end;
If Ribosome detects the start codon (which also codes for Methionine)
Then
  While (Ribosome has not detected any of the three possible stop codons)
  do
    Generate the amino acid coded by the current codon;
    Scan next codon;
  od
  Detach the chain of amino acids from the Ribosome;
Endif
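The ribosome loop above translates almost line by line into runnable code. In this sketch the standard genetic code is stored as a 64-letter string in U, C, A, G order (a common compact encoding); the function name and the example mRNA are illustrative assumptions.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order ('*' marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna: str) -> str:
    """Scan 5'->3' for the first AUG, then emit one amino acid per codon until a stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""                    # no start codon: nothing is produced
    chain = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODE[mrna[i:i + 3]]
        if amino_acid == "*":        # UAA, UAG or UGA: detach the chain
            break
        chain.append(amino_acid)
    return "".join(chain)

print(translate("GGAUGGCCUUUUAAGG"))  # MAF
```

The resulting chain always begins with M, since the start codon AUG also codes for methionine.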
Codons
start: AUG
stop: UAA, UAG, UGA
UUU, UUC
Reading Frames
DNA strand (downstream ->, upstream <-):
5’-end AGGCATGCGATCCAAGTTCCACCATGATGACATGATGACTA 3’-end
Complementary DNA strand (<- downstream):
3’-end TCCGTACGCTAGGTTCAAGGTGGTACTACTGTACTACTGAT 5’-end
Base pairing: C–G, A–T
Reading Frames
DNA strand (5’-end on the left, 3’-end on the right; downstream ->):
AGG CAT GCG ATC CAA GTT CCA CCA TGA TGA CAT GAT GAC TA
A GGC ATG CGA TCC AAG TTC CAC CAT GAT GAC ATG ATG ACT A
AG GCA TGC GAT CCA AGT TCC ACC ATG ATG ACA TGA TGA CTA
Complementary DNA strand (3’-end on the left, 5’-end on the right; <- downstream):
T CCG TAC GCT AGG TTC AAG GTG GTA CTA CTG TAC TAC TGA T
TC CGT ACG CTA GGT TCA AGG TGG TAC TAC TGT ACT ACT GAT
TCC GTA CGC TAG GTT CAA GGT GGT ACT ACT GTA CTA CTG AT
Genes are not Random
motto
Amino acid | number of codons (‘random’ ratio) | coding ratio
Leu (Leucine) | 6 | 6.9
Ala (Alanine) | 4 | 6.5
Trp (Tryptophan) | 1 | 1
Codon frequencies: see the table on the next slide.
Also:
A or T in third position of a codon
sometimes in 90% of the cases.
=> Genes are not random!
=>
Nucleotide Based Measure
• For CpG islands √
• For non-coding ORFs (NORFs)
[Plot legend: NORFs, Genes]
Codon Based Measure
NORFs Genes
Promoter Regions
Construct statistics f_{b,i}: the frequency of base b
in position i of known promoter region suffixes =>
… n TTGAC n18 TATAAT n6 N n …
[Diagram labels: TATA box, start, coding]
Position weight matrix (cf. ‘profile’):
pos |  1 |  2 |  3 |  4 |  5 |  6
A   |  2 | 95 | 26 | 59 | 51 |  1
C   |  9 |  2 | 14 | 13 | 20 |  3
G   | 10 |  1 | 16 | 15 | 13 |  0
T   | 79 |  3 | 44 | 13 | 17 | 96
Note: there is an 80% correlation between
the weight matrix score of the region and the binding energy.
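One common way to use such a weight matrix is to turn the frequencies into log-odds scores and slide a 6-base window along the sequence. The sketch below uses the percentages from the slide; the uniform background of 25%, the pseudocount and the function names are assumptions made for illustration, not something the slides prescribe.

```python
from math import log2

# Percent frequencies f[b][i] from the slide (positions 1..6 of the TATAAT box).
FREQ = {
    "A": [2, 95, 26, 59, 51, 1],
    "C": [9, 2, 14, 13, 20, 3],
    "G": [10, 1, 16, 15, 13, 0],
    "T": [79, 3, 44, 13, 17, 96],
}
BACKGROUND = 25.0  # assumption: uniform base composition (25% per base)
PSEUDO = 0.5       # assumption: pseudocount so that a 0% entry keeps a finite score

def score(window: str) -> float:
    """Log-odds score of a 6-base window against the weight matrix."""
    return sum(log2((FREQ[b][i] + PSEUDO) / BACKGROUND) for i, b in enumerate(window))

def best_hit(seq: str) -> tuple[int, str]:
    """Position and sequence of the highest-scoring 6-base window."""
    pos = max(range(len(seq) - 5), key=lambda i: score(seq[i:i + 6]))
    return pos, seq[pos:pos + 6]

print(best_hit("GCGCTATAATGCGC"))  # (4, 'TATAAT')
```

Because binding sites vary, a real detector would report every window above some score threshold rather than only the single best hit.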
Promoter Region Detection
‘Consensus’ sequence around the RNA
transcription start point, i.e., not exact:
… n TTGAC n18 TATAAT n6 N n …
[Diagram labels: TATA box, start, coding]
• The promoter region is an ‘anchor’ point for the polymerase, i.e.,
a regulatory region that controls the transcription rate.
• TATAAT is called the TATA box or Pribnow box.
• Use the frequency of occurrence of these sequences
• Variability of binding sites => no exact method for TATA-box
identification ….
Typical Distributions: Vertebrates
Some typical data:
• Average gene length: 30 kb, with a coding region ~1-2 kb long.
• The average coding region has 6 exons, each ~150 bp long.
• The promoter is about 6 bp long and appears about 30 bp upstream of the transcription start site (TSS).
• Transcription rate of less than 50 b/sec
• Splicing process takes several minutes
But huge deviations exist:
• Dystrophin is 2.4 Mb long.
• Blood coagulation factor VIII has 26 exons with sizes from 69 bp to 3106 bp.
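A few lines of arithmetic put these numbers in relation to each other. Only the figures quoted above are used; the variable names are illustrative.

```python
gene_length_bp = 30_000        # average gene length
exons, exon_length_bp = 6, 150
rate_bp_per_s = 50             # upper bound on the transcription rate
dystrophin_bp = 2_400_000

coding_bp = exons * exon_length_bp
print(coding_bp)                                      # 900 bp, near the lower end of ~1-2 kb
print(coding_bp / gene_length_bp)                     # 0.03, i.e. about 3% of the gene is exon
print(gene_length_bp / rate_bp_per_s / 60)            # 10.0 minutes at the very least
print(round(dystrophin_bp / rate_bp_per_s / 3600, 1)) # 13.3 hours at the very least
```

Since the transcription rate is an upper bound, the two durations are lower bounds: an average gene needs at least ten minutes to transcribe, and dystrophin more than half a day.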
Typical Distributions: Vertebrates
Gene Expression
Polymerase
Histones
introns: splicing
• Splice sites must be recognized precisely, as a miss would lead to a nonsense interpretation of the rest of the sequence.
• A so-called anchor point in the intron, the branch point, appears frequently.
• Also, pyrimidine-rich (bases C, T) areas appear between the branch point and the acceptor site.
Algorithms based on position-specific weight matrices
do not exploit all the information (reading frames, intron/exon states, etc.) and are not suitable for short genes.
Picture from: http://en.wikipedia.org/wiki/RNA_splicing
introns: splicing
Intron consensus sequences / position-specific weight matrices:
AGGUAAGU … … CTGAC … … NCAGG …
donor site:    A  G  G   U   A  A  G  U
Freq%:         62 77 100 100 60 74 84 50
branchpoint:   CTGAC
pyrimidine-rich (C, T) region between branchpoint and acceptor site (< 15 bp): Freq% 63-91
acceptor site: N  C  A   G   G
Freq%:            78 100 100 55
HMM for Gene Finding
Geometric distribution of the exon state length k:
$P(\text{exon of length } k) = p^k (1 - p)$
HMM memory-less => modeled length distribution is geometric.
But exon length does not have a geometric distribution! =>
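The geometric length distribution can be inspected numerically. In the sketch below p is read as the probability of staying in the exon state for one more step; the value 0.99 is hypothetical and chosen only for illustration.

```python
def p_exon_length(k: int, p: float) -> float:
    """P(exon of length k) = p**k * (1 - p) for an exon state with self-transition probability p."""
    return p ** k * (1 - p)

p = 0.99  # hypothetical self-transition probability
probs = [p_exon_length(k, p) for k in range(5000)]

print(abs(sum(probs) - 1) < 1e-6)                    # True: the probabilities sum to ~1
print(all(a > b for a, b in zip(probs, probs[1:])))  # True: strictly decreasing in k
print(round(p / (1 - p)))                            # 99: mean length for this p
```

Whatever p is chosen, the most probable length under this model is always the shortest one, whereas observed exon lengths are not distributed that way. This is the mismatch the slide points out.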
Gene Finding
Difficulties
– Low information contents of sequence signals
– Real signals difficult to detect
– Handling of sequencing errors
Prokaryotes:
– Short genes can easily be missed
– Overlapping genes
Eukaryotes:
– Complex gene structure
– Splicing
– Pseudo genes
mGene.Web
Gabriele Schweikert, et al., mGene.web: a
web service for accurate computational
gene finding, Nucleic Acids Research, Vol.
37, 2009.
Bibliography
[1] Ben Langmead, Cole Trapnell, Mihai Pop and Steven L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 2009.
[2] Michael Burrows and David Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
[3] Brent Ewing and Phil Green. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research, 1998.
[4] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science, 2000.
[5] Heng Li, Jue Ruan and Richard Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research, 2008.
[6] Daniel R. Zerbino and Ewan Birney. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 2008.
[7] Ron Shamir, Computational Genomics Fall Semester, 2010 Lecture 12: Algorithms for Next Generation Sequencing Data, January 6, 2011 Scribe: Anat Gluzman and Eran Mick
[8] J. Pagano, Multi-Scale Analysis of Enhancer Gene Relations in Nine Inbred Rat Strains, Internal Report CS Bioinformatics Track 14-04, August 2014.
[9] D. Zafeiropoulou, Genotypes-Phenotype Predictions in Patients diagnosed with Early Onset Alzheimer, Internal Report CS Bioinformatics Track 14-05, August 2014.
[10] J. Neuteboom, Protein Structure Prediction by Iterative Fragment Assembly (PITA), CS Master Thesis, December 2014.
[11] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Presentation on Chapter 8, 2006. ( www.cs.uiuc.edu/~hanj )