Molecular Biology of the Genome
Christine Queitsch
Department of Genome Sciences
Outline
• Information Flow in Genomics
• Gene Structure
• Genetic Linkage
• Chromatin Structure
• Genome Sequencing
DNA and the Flow of Information
The genetic material: DNA - Four kinds of subunits (bases A, C, G, T)
[Figure: amino acid sequence of a protein, written as three-letter residues from the N-terminus (H3N+) to the C-terminus (COO-)]
Activities within the cell performed by proteins - Twenty kinds of subunits (amino acids)
A coding problem: four kinds of bases (A, C, G, T) must specify twenty kinds of amino acids
The “Central Dogma” of Molecular Biology
Information flows one way, into protein
A universal code: 3 nucleotides = 1 amino acid
DNA → (transcription) → RNA → (translation) → Protein → phenotype
DNA → (replication) → DNA: heredity
DNA Structure
• Information content is in the sequence of bases along a DNA molecule
- rules of base pairing: each strand of the double helix has all the information needed to recreate the other strand
• Genetic variation: differences in the base sequence between different individuals
- why individuals vary in their phenotypes
• Redundancy in the code
- multiple ways that DNA can specify a single amino acid
Central Dogma: DNA Replication
DNA structure: polarity and base pairing
Watson strand: 5’ → 3’
Crick strand: 3’ → 5’
A pairs with T; G pairs with C
DNA replication: what’s the point?
duplicate the entire genome prior to cell division
new subunits can only be added to the 3’OH of the growing chain
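The base-pairing rules above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `complement_strand` and the example sequence are assumptions chosen for the demo. It shows that one strand carries all the information needed to rebuild its partner.

```python
# Watson-Crick pairing rules from the slide: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement_strand(strand_5_to_3):
    """Return the partner strand, also written 5' -> 3'.

    The two strands are antiparallel, so the partner is the
    complement read in reverse (the "reverse complement").
    """
    return "".join(PAIRS[base] for base in reversed(strand_5_to_3.upper()))

watson = "ATGACTTCAG"                 # hypothetical example sequence
crick = complement_strand(watson)
print(crick)                          # partner strand, written 5' -> 3'
print(complement_strand(crick))       # recreates the original strand
```

Applying the function twice returns the starting sequence, which is the property replication relies on.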
[Figure: replication fork with the leading strand and lagging strand, 5’ and 3’ ends labeled]
Central Dogma: Transcription
Genes: specific segments along the chromosomal DNA that code for some function
Transcription: “copy” gene into RNA (to make a specific protein)
[Figure: chromosomal DNA carrying several genes, each with a promoter and a terminator; mRNA is transcribed from individual genes]
Transcription
Transcription: “copy” gene into RNA to make a specific protein
[Figure: double-stranded DNA with strand W (5’ → 3’) and strand C (3’ → 5’); the gene’s coding (sense) strand and template strand are labeled, and RNA polymerase is synthesizing mRNA]
Where’s the 5’ end of the gene? of the mRNA?
Which way is RNA polymerase moving?
RNA: ribonucleic acid… uses uracil (U) in place of thymine (T)
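A short Python sketch of the relationship between the two DNA strands and the mRNA. The function names and the example sequences are illustrative assumptions, not from the slides. The mRNA matches the coding strand (with U for T) and is complementary to the template strand.

```python
# DNA template base -> RNA base added by RNA polymerase.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe_from_coding(coding_5_to_3):
    """mRNA has the same sequence as the coding strand, with U instead of T."""
    return coding_5_to_3.upper().replace("T", "U")

def transcribe_from_template(template_3_to_5):
    """Pair each template base (given 3' -> 5') with an RNA base,
    which yields the mRNA written 5' -> 3'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())

coding = "ATGACTTCAGTAACCATCTAA"      # coding strand, 5' -> 3'
template = "TACTGAAGTCATTGGTAGATT"    # template strand, 3' -> 5'
print(transcribe_from_coding(coding))
print(transcribe_from_template(template))  # same mRNA either way
```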
Transcription in vivo
[Figure: a gene being transcribed, with the DNA, RNA polymerases, and nascent RNA transcripts labeled]
Practice Question
1. Which way (to the right or left) are RNA polymerases moving?
2. Which strand (W or C) is the template strand?
[Figure: a gene on double-stranded DNA; strand W runs 5’ → 3’ and strand C runs 3’ → 5’]
Processing of pre-mRNA
Eukaryotic genes are interrupted by introns (non-coding information). They must be removed from the RNA before translation in a process called “splicing.”
[Figure: a gene and its pre-mRNA with exons, introns, the ORF, and UTRs (untranslated regions) labeled; introns are discarded and exons are spliced together to give the mature mRNA]
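Splicing can be pictured as cutting out intervals and joining what remains. The sketch below is a toy model under stated assumptions: the function name `splice`, the coordinates, and the made-up pre-mRNA are hypothetical, and real splice-site recognition is far more involved.

```python
def splice(pre_mrna, exons):
    """Join exon segments and discard everything between them.

    `exons` holds (start, end) pairs, 0-based, end exclusive.
    """
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))

# Hypothetical pre-mRNA: exon 1 + intron + exon 2.
exon1, intron, exon2 = "AUGGCC", "GUAAGUUUUCAG", "UUUUAA"
pre_mrna = exon1 + intron + exon2
exons = [(0, 6), (18, 24)]

mature = splice(pre_mrna, exons)
print(mature)  # exons joined, intron removed
```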
Review of the Central Dogma: Translation
Translating the nucleic acid code to a peptide code…
Possible coding systems:
1 base per amino acid
Could only code for 4 amino acids!
2 bases per amino acid
Could only code for 16 amino acids
3 bases per amino acid
64 possible combinations… that’s plenty!
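The counting argument above is just 4 to the power of the word length. A small Python check, where the helper name `codewords` is an assumption for illustration:

```python
from itertools import product

BASES = "ACGU"

def codewords(n):
    """All possible 'words' made of n bases."""
    return ["".join(p) for p in product(BASES, repeat=n)]

for n in (1, 2, 3):
    count = len(codewords(n))
    print(f"{n} base(s) per amino acid: {count} codewords, "
          f"enough for 20 amino acids? {count >= 20}")
```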
The triplet code
3 bases = 1 amino acid
More than 1 triplet can code for the same amino acid
Translation: reads the information in RNA to order the amino acids in a protein
mRNA:    5’ AUG ACU UCA GUA ACC AUC UAA 3’   (each triplet is a codon)
protein: NH3+ Met Thr Ser Val Thr Ile COO-
Punctuation:
mRNA:    5’ AUG ACU UCA GUA ACC AUC UAA 3’
protein: NH3+ Met Thr Ser Val Thr Ile COO-, then STOP
Start: AUG = methionine, the first amino acid in (almost) all proteins
Stop: UAA, UAG, and UGA. NOT an amino acid!
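The start and stop rules can be tried out with a small virtual translation. This is a sketch, not a full model of translation: the names `CODON_TABLE` and `translate` are assumptions, amino acids are printed as one-letter codes, and the table is the standard genetic code written in the conventional U, C, A, G order. Under that code AUC is read as Ile (I).

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, first base varying slowest; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""                      # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":          # stop: not an amino acid
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUAUGACUUCAGUAACCAUCUAACA"))  # MTSVTI
```

The same table also shows the redundancy of the code: leucine (L), for example, appears six times.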
The Genetic Code: Who is the interpreter? Where’s the dictionary? What are the rules of grammar?
tRNA = transfer RNA
amino acid + tRNA → charged tRNA (joined by an aminoacyl tRNA synthetase)
anticodon: recognizes codon in mRNA
[Figure: charged Met-tRNA whose anticodon 3’ UAC 5’ base-pairs with the codon 5’ AUG 3’ in the mRNA]
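Codon-anticodon recognition follows the same pairing logic as DNA, with U in place of T. A minimal sketch, assuming strict Watson-Crick pairing only (wobble pairing at the third position is ignored) and a hypothetical helper name:

```python
RNA_PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}

def anticodon_3_to_5(codon_5_to_3):
    """Anticodon written 3' -> 5', so it lines up base by base with the codon."""
    return "".join(RNA_PAIRS[base] for base in codon_5_to_3)

for codon in ("AUG", "ACU", "UCA"):
    print(codon, "pairs with anticodon", anticodon_3_to_5(codon))
```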
The ribosome: mediates translation
Locates the 1st AUG, sets the reading frame for codon-anticodon base-pairing
mRNA: 5’ …AUAUGACUUCAGUAACCAUCUAACA… 3’
After the 1st two tRNAs have bound…
[Figure: ribosome on the mRNA with Met-tRNA (anticodon UAC) and Thr-tRNA (anticodon UGA) bound to the first two codons]
mRNA: 5’ …AUAUGACUUCAGUAACCAUCUAACA… 3’
the ribosome breaks the Met-tRNA bond; Met is instead joined to the second amino acid
[Figure: ribosome holding the tRNAs with anticodons UAC (Met) and UGA (Thr)]
mRNA: 5’ …AUAUGACUUCAGUAACCAUCUAACA… 3’
the ribosome breaks the Met-tRNA bond; Met is instead joined to the second amino acid
…and the now-empty tRNA that carried Met is released
…then ribosome moves over by 1 codon in the 3’ direction
[Figure: ribosome with its P-site and A-site labeled, holding the tRNAs with anticodons UAC (Met) and UGA (Thr)]
mRNA: 5’ …AUAUGACUUCAGUAACCAUCUAACA… 3’
[Figure: Met and Thr on the tRNA with anticodon UGA; Ser-tRNA (anticodon AGU) at the next codon]
mRNA: 5’ …AUAUGACUUCAGUAACCAUCUAACA… 3’
When the ribosome reaches the Stop codon… termination
[Figure: chain Met Thr Ser Val Thr Ile on the last tRNA (anticodon UAG), with the ribosome at the STOP codon]
mRNA: 5’ …AUAUGACUUCAGUAACCAUCUAACA… 3’
The finished peptide!
NH3+ Met Thr Ser Val Thr Ile COO-
Practice Question
Which strand on the DNA sequence is the coding (sense) strand? How can you tell?
[Figure: labels N-terminus and C-terminus]
Finding Sense in Nonsense
cbdryloiaucahjdhtheflybitthedogbutnotthecatjhhajctipheq
GGGTATAGAAAATGAATATAAACTCATAGACAAGATCGGTGAGGGAACATTTTCGTCAGTGTATAAAGCCAAAGATATCACTGGGAAAATAACAAAAAAATTTGCATCACATTTTTGGAATTATGGTTCGAACTATGTTGCTTTGAAGAAAATATACGTTACCTCGTCACCGCAAAGAATTTATAATGAGCTCAACCTGCTGTACATAATGACGGGATCTTCGAGAGTAGCCCCTCTATGTGATGCAAAAAGGGTGCGAGATCAAGTCATTGCTGTTTTACCGTACTATCCCCACGAGGAGTTCCGAACTTTCTACAGGGATCTACCAATCAAGGGAATCAAGAAGTACATTTGGGAGCTACTAAGAGCATTGAAGTTTGTTCATTCGAAGGGAATTATTCATAGAGACATCAAACCGACAAATTTTTTATTTAATTTGGAATTGGGGCGTGGAGTGCTTGTTGATTTTGGTCTAGCCGAGGCTCAAATGGATTATAAAAGCATGATATCTAGTCAAAACGATTACGACAATTATGCAAATACAAACCATGATGGTGGATATTCAATGAGGAATCACGAACAATTTTGTCCATGCATTATGCGTAATCAATATTCTCCTAACTCACATAACCAAACACCTCCTATGGTCACCATACAAAATGGCAAGGTCGTCCACTTAAACAATGTAAATGGGGTGGATCTGACAAAGGGTTATCCTAAAAATGAAACGCGTAGAATTAAAAGGGCTAATAGAGCAGGGACTCGTGGATTTCGGGCACCAGAAGTGTTAATGAAGTGTGGGGCTCAAAGCACAAAGATTGATATATGGTCCGTAGGTGTTATTCTTTTAAGTCTTTTGGGCAGAAGATTTCCAATGTTCCAAAGTTTAGATGATGCGGATTCTTTGCTAGAGTTATGTACTATTTTTGGTTGGAAAGAATTAAGAAAATGCGCAGCGTTGCATGGATTGGGTTTCGAAGCTAGTGGGCTCATTTGGGATAAACCAAACGGATATTCTAATGGATTGAAGGAATTTGTTTATGATTTGCTTAATAAAGAATGTACCATAGGTACGTTCCCTGAGTACAGTGTTGCTTTTGAAACATTCGGATTTCTACAACAAGAATTACATGACAGGATGTCCATTGAACCTCAATTACCTGACCCCAAGACAAATATGGATGCTGTTGATGCCTATGAGTTGAAAAAGTATCAAGAAGAAATTTGGTCCGATCATTATTGGTGCTTCCAGGTTTTGGAACAATGCTTCGAAATGGATCCTCAAAAGCGTAGTTCAGCAGAAGATTTACTGAAAACCCCGTTTTTCAATGAATTGAATGAAAACACATATTTACTGGATGGCGAGAGTACTGACGAAGATGACGTTGTCAGCTCAAGCGAGGCAGATTTGCTCGATAAGGATGTTCT
How do you find out whether a sequence contains a gene? How do you identify the gene?
Reading Frame: the ribosome establishes the grouping of nucleotides that correspond to codons by the first AUG encountered.
ORF: open reading frame, from the first AUG to the first in-frame stop. The ORF encodes the information for the protein.
mRNA: 5’ …AU AUG ACU UCA GUA ACC AUC UAA CA… 3’
Starts counting triplets from the A of the first AUG
More generally: a reading frame with a stretch of codons not interrupted by stop – non-coding RNAs!
Looking for ORFs
- read the sequence 5’ → 3’, looking for stop codons
- try each reading frame
- since we know the genetic code, we can do a virtual translation if necessary
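The search procedure above can be sketched as a simple scan. This is a minimal illustration under assumptions: the function name `find_orfs` and the `min_codons` cutoff are hypothetical, only the three forward frames of the given strand are scanned (a fuller search would repeat this on the reverse complement), and real gene finders use much more evidence than ORF length.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ATG-to-stop ORF in the forward frames.

    Coordinates are 0-based; `end` is exclusive and includes the stop codon.
    `min_codons` counts the codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                        # try each reading frame
        start = None
        for i in range(frame, len(dna) - 2, 3):   # read 5' -> 3' in triplets
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                         # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                      # look for the next ORF
    return orfs

# DNA version of the mRNA used on the translation slides.
print(find_orfs("ATATGACTTCAGTAACCATCTAACA"))     # [(2, 23, 2)]
```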
How to identify genes experimentally?
Gene Structure: The Parts List
Genomic DNA for a protein-coding eukaryotic gene is composed of regulatory and coding sequences
• Promoter: proximal regulatory element
• Enhancer: distal regulatory element
• CRM (cis-regulatory motif): can be upstream or downstream of promoter, proximal or distal
• Transcribed region: 5’ UTR, exons, introns, 3’ UTR
Promoters
• Promoters are specific sites on DNA where RNA polymerase first binds to initiate the transcription of a gene
• Composed of a variety of different cis-sequence elements, which recruit trans-acting factors through DNA-protein interactions
Core Promoter Elements
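[Figure: gene map (enhancer, promoter, 5’ UTR, exons, introns, 3’ UTR) with the core promoter enlarged; positions ~-50, ~-30, and +1 are marked]
• BRE: consensus (G/C)(G/C)(G/A)CGCC
• TATA box (~-30): consensus TATA(A/T)A(A/T)
• inr (initiator, +1): consensus PyPyAN(T/A)PyPy
- not all elements required
- many promoters lack a TATA box, using instead the functionally analogous initiator (inr) element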
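Consensus elements like these can be searched for with regular expressions. The sketch below is only an illustration: it assumes the TATA consensus TATA(A/T)A(A/T) and the inr consensus PyPyAN(T/A)PyPy (Py = C or T, N = any base), the helper name `find_motif` and the test sequence are made up, and a consensus match alone does not show that a site is a functional promoter.

```python
import re

# Consensus motifs written as regular expressions.
TATA_BOX = r"TATA[AT]A[AT]"
INR = r"[CT][CT]A[ACGT][TA][CT][CT]"

def find_motif(motif, dna):
    """Return the 0-based start positions of all (possibly overlapping) matches."""
    # The lookahead lets overlapping occurrences be reported as well.
    return [m.start() for m in re.finditer(f"(?=({motif}))", dna.upper())]

# Hypothetical sequence containing one TATA box and one inr match.
dna = "GCGC" + "TATAAAA" + "GGGCGG" + "CCATTCC" + "GG"
print("TATA box at:", find_motif(TATA_BOX, dna))  # [4]
print("inr at:", find_motif(INR, dna))            # [17]
```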
Combinatorial Gene Regulation
• Most eukaryotic genes have multiple cis regulatory motifs located outside of the core promoter region
• Can be located in promoter-proximal regions, 3’ downstream regions, and many kb away from the target gene
• Allows for combinatorial control of gene expression
Distal regulatory elements: Enhancers
[Figure: “Enhanceosome”; source: http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=mcb.figgrp.2601]
Enhancers:
- Can function in either orientation
- Can occur far (>50 kb) from the gene
- Can be up or downstream
- Range in size between ~50-200 bp
- Contain multiple TF binding sites
Untranslated Regions (UTRs)
[Figure: transcript map: 5’ UTR, exon, exon, 3’ UTR]
• Most eukaryotic mRNAs contain untranslated regions at their 5’ and 3’ ends
• The 5’ UTR is the region between the start of transcription and the start of translation
• The 3’ UTR is the region between the stop codon and the poly-A tail
• Both the 5’ and 3’ UTRs can contain cis regulatory sequences that bind TFs, influence transport to the cytoplasm, and mediate transcript stability and translational control
Alternative Splicing
• Pre-mRNA from some genes can be spliced into two or more distinct transcripts
• Creates protein diversity (isoforms)
[Figure: alternative splicing, with 5’ splice sites and 3’ splice sites marked]
Transmission of Genetic Information
Elements of cell division
• Cell growth
• Chromosome duplication
• Chromosome segregation
[Figure: cycle between diploid (2N) and haploid (1N) stages, with chromosomes shown condensed and decondensed]
Meiosis
Interphase: Chromosomes replicate
Meiosis I: Reductive division, homologous chromosomes separate
Meiosis II: Sister chromatids separate
Recombination
How Does Distance Between Loci Affect Transmission?
Independent Assortment: loci are unlinked or far enough apart that they are transmitted independently from one another
Genetic linkage: loci are close enough together on a chromosome to be transmitted together
Genetic Mapping
The frequency of recombination between loci depends on the distance between them
Recombination Is A Measure of Distance
• Recombination fraction, θ = the probability that a recombinant gamete is transmitted
• If two loci are on different chromosomes, they will segregate independently
=> recombination fraction θ = 0.5
• If two loci are right next to each other, they will segregate together during meiosis
=> recombination fraction θ = 0
• Jargon:
θ < 0.5: the loci are close (they are linked)
θ = 0.5: the loci are far apart (they are not linked)
Recombination Is A Measure of Distance
Map Distance = (Number of Recombinant Gametes / Total Number of Gametes) × 100
Centimorgan (cM): a unit of chromosome length, equal to the length of chromosome over which crossing-over occurs with 1% frequency
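The map-distance formula is a one-line calculation. A minimal sketch, where the function name and the example counts (12 recombinants among 200 gametes) are hypothetical:

```python
def map_distance_cm(recombinant_gametes, total_gametes):
    """Map distance in centimorgans: percent of gametes that are recombinant."""
    if total_gametes <= 0:
        raise ValueError("total_gametes must be positive")
    return 100.0 * recombinant_gametes / total_gametes

# Hypothetical cross: 12 recombinant gametes out of 200 scored.
print(map_distance_cm(12, 200))  # 6.0
```

This simple estimate works best for closely linked loci, since the observed recombination fraction cannot exceed 0.5 (50 map units).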
Practice Question
• In maize, consider three recessive phenotypes: lazy growth (ll), glossy leaves (gg), and sugary endosperm (ss).
• The following cross was made: Ll Gg Ss x ll gg ss. The observed progeny distribution is shown below (neither gene order nor linkage phase is known)
Phenotype Number
wildtype 286
lazy 33
glossy 59
sugary 4
lazy, glossy 2
lazy, sugary 44
glossy, sugary 40
lazy, glossy, sugary 272
Total 740
• Determine order and distances among the three genes
Where to begin?
Parental types will constitute ≥ 50% of all progeny, so…
Rule 1: The two most frequent gamete types are the parental types
L G S / l g s x l g s / l g s
Parental classes: wild-type for all three traits, and lazy, glossy, sugary; all other classes are recombinant
Progeny phenotype      Progeny genotype    Number
wildtype               L G S // l g s      286
lazy                   l G S // l g s      33
glossy                 L g S // l g s      59
sugary                 L G s // l g s      4
lazy, glossy           l g S // l g s      2
lazy, sugary           l G s // l g s      44
glossy, sugary         L g s // l g s      40
lazy, glossy, sugary   l g s // l g s      272
Total                                      740
L G S // l g s x l g s // l g s
Linkage phase in heterozygous parent?
• L G S // l g s, or
• L g S // l G s, or
• l g S // L G s, or
• L g s // l G S
Rule 2: The double-recombinant gametes will be the two least frequent types.
[Figure: homologs A B C and a b c]
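Rule 3: The effect of double crossovers is to interchange the members of the middle pair of alleles between the chromosomes
A B C / a b c → A b C / a B c
Parental types: L G S and l g s
Double-crossover types: L G s and l g S
Which gene is in the middle? Only the S/s alleles are switched between the parental and double-crossover types, so S is in the middle.
Parental types in gene order: L S G and l s g
Double-crossover types in gene order: L s G and l S g
Now you know the linkage phase of the heterozygous parent and the gene order… how far apart are these genes?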
Count the cross-overs between adjacent genes
• In the heterozygous parent, the L allele is on the same homolog as S, and l is on the same homolog as s. If these get broken up, there was a cross-over between the L and S loci
• In the heterozygous parent, S is on the same homolog as G, and s is on the same homolog as g. If these get broken up, there was a cross-over between the S and G loci
L S G
l s g
Rule 4: Reciprocal products are expected to occur in approximately equal numbers
• LGS ≈ lgs (286 ≈ 272)
• LgS ≈ lGs (59 ≈ 44)
• Lgs ≈ lGS (40 ≈ 33)
• LGs ≈ lgS (4 ≈ 2)
Rec Freq L-S: l G S 33 + L g s 40 + L G s 4 + l g S 2 = 79
Rec Freq S-G: L g S 59 + l G s 44 + L G s 4 + l g S 2 = 109
Progeny phenotype      Progeny genotype   Number   Crossover or non-crossover?
wildtype               L G S / l g s      286      Parental (NCO)
lazy                   l G S / l g s      33       single CO between L and S
glossy                 L g S / l g s      59       single CO between S and G
sugary                 L G s / l g s      4        double CO
lazy, glossy           l g S / l g s      2        double CO
lazy, sugary           l G s / l g s      44       single CO between S and G
glossy, sugary         L g s / l g s      40       single CO between L and S
lazy, glossy, sugary   l g s / l g s      272      Parental (NCO)
Total                                     740
79/740 or 10.7% of gametes are recombinant between L & S: distance between L & S = 10.7 map units
109/740 or 14.7% of gametes are recombinant between S & G: distance between S & G = 14.7 map units
L ---- 10.7 mu ---- S ---- 14.7 mu ---- G
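The four rules can be turned into a short script that reproduces the worked answer. This is a sketch under assumptions: the function name `three_point_cross` is made up, each gamete is written as a three-letter string in the arbitrary order L, G, S, and the approach relies on the parental and double-crossover classes being clearly the most and least frequent.

```python
# Gamete contributed by the heterozygous parent -> number of progeny.
COUNTS = {"LGS": 286, "lGS": 33, "LgS": 59, "LGs": 4,
          "lgS": 2, "lGs": 44, "Lgs": 40, "lgs": 272}

def three_point_cross(counts):
    """Return (middle gene, {gene pair: map distance in map units})."""
    total = sum(counts.values())
    ranked = sorted(counts, key=counts.get)
    double_co, parental = ranked[0], ranked[-1]    # Rules 2 and 1
    # Rule 3: compared with a parental type, a double crossover switches
    # the middle gene relative to the two flanking genes.
    differs = [p != d for p, d in zip(parental, double_co)]
    middle = next(i for i in range(3) if differs.count(differs[i]) == 1)
    distances = {}
    for flank in range(3):
        if flank == middle:
            continue
        # Recombinant between two loci: parental phase kept at one, not the other.
        recombinants = sum(
            n for gamete, n in counts.items()
            if (gamete[flank] == parental[flank]) != (gamete[middle] == parental[middle])
        )
        pair = f"{parental[flank].upper()}-{parental[middle].upper()}"
        distances[pair] = 100.0 * recombinants / total
    return parental[middle].upper(), distances

middle_gene, distances = three_point_cross(COUNTS)
print("middle gene:", middle_gene)                 # S
for pair, d in distances.items():
    print(f"{pair}: {d:.1f} map units")            # L-S: 10.7, G-S: 14.7
```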
Chromosome Structure: Coils of Coils of Coils…
Local unpacking of chromatin allows gene expression and replication
[Figure: levels of DNA packing, from the nucleosome to the chromosome at mitosis]
Nucleosomes
• ~146 bp of DNA wrapped around the nucleosome
• ~80 bp linker
• histone octamer
Histone Modification and Chromatin Activity
What Do These Modifications Do? A Histone Code?
• modifications change interaction with DNA and trans-factors
• can activate or repress transcription
• reinforce regulatory patterns set up by TFs
Carey et al. Cell (2007) 128:707
“Distinct histone modifications, on one or more tails, act sequentially or in combination to form a ‘histone code’ that is read by other proteins to bring about distinct downstream events” (Strahl and Allis, 2000, Nature, 403:41)
DNA Sequencing Technology
• Sanger sequencing
• Next-Generation
• 3rd and 4th Generation
Genome Sequencing: Hierarchical Shotgun Sequencing
• Shear genomic DNA into smaller pieces and subclone into library (such as BACs, Cosmids, etc.)
• Create physical map
• Shotgun sequence each BAC from minimal tiling path (shearing of ~150kb BAC clone into ~ 2kb fragments)
• Data from linkage and physical maps used to assemble sequence maps of chromosomes
Genome Sequencing: Whole Genome Shotgun Sequencing
• Whole genome randomly sheared three times
– Plasmid library constructed with ~2 kb inserts
– Plasmid library with ~10 kb inserts
– BAC library with ~200 kb inserts
• Computer program assembles sequences into chromosomes
• No physical map construction
• Only one BAC library
• Overcomes problems of repeat sequences…only not really
Next-Generation Sequencing Technology
• Illumina HiSeq:
– 4 billion reads per flow cell X 100 bases, paired = 400 Gbp
– 8 samples per flow cell = 50 Gbp each (one human genome = 3 Gbp)
– Reagent cost ~$8K per run
Updated: HiSeq 3000/4000 SBS Kits enable up to 1500 Gb (1.5 Tb) of output per dual flow cell run
• ABI SOLID: similar yield
• Roche 454: 1 million reads X 500 bases = 0.5 Gbp
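The yield figures above are simple products, and dividing by genome size gives the average depth of coverage. A minimal sketch with assumed helper names; it treats coverage as total bases divided by genome size and ignores duplicates, unmapped reads, and uneven coverage.

```python
def yield_gbp(reads, read_length_bases):
    """Total sequence output in gigabase pairs (Gbp)."""
    return reads * read_length_bases / 1e9

def fold_coverage(output_gbp, genome_gbp=3.0):
    """Average depth: bases sequenced per base of genome (human = 3 Gbp)."""
    return output_gbp / genome_gbp

hiseq = yield_gbp(4e9, 100)          # 4 billion reads x 100 bases
per_sample = hiseq / 8               # 8 samples per flow cell
roche_454 = yield_gbp(1e6, 500)      # 1 million reads x 500 bases

print(hiseq, "Gbp per flow cell")                                   # 400.0
print(per_sample, "Gbp per sample")                                 # 50.0
print(round(fold_coverage(per_sample), 1), "x human genome coverage")  # 16.7
print(roche_454, "Gbp per 454 run")                                 # 0.5
```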
Illumina sequencing
Illumina sequencing: clusters
[Figure: four numbered panels (1 to 4); source: Mardis, ER, 2008, ARGHG]
Illumina sequencing: sequence reaction
Sequence clusters are imaged after each cycle of synthesis
What is missed?
Plenty: repetitive DNA and structural variation
Example: short tandem repeats
[Figure: a short tandem repeat drawn with C, A, and G bases]
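A short tandem repeat is easy to spot when the whole tract is available in one sequence, which is exactly what a read shorter than the tract cannot provide. The sketch below is illustrative only: the function name, the parameters `unit_len` and `min_copies`, and the example sequence are assumptions.

```python
import re

def find_tandem_repeats(dna, unit_len=3, min_copies=4):
    """Return (start, unit, copies) for runs of a repeated unit of `unit_len` bases."""
    # Capture one unit, then require at least (min_copies - 1) further copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    return [(m.start(), m.group(1), len(m.group(0)) // unit_len)
            for m in pattern.finditer(dna.upper())]

# Hypothetical sequence with six copies of a CAG unit.
dna = "TTGA" + "CAG" * 6 + "TTC"
print(find_tandem_repeats(dna))  # [(4, 'CAG', 6)]
```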
3rd Generation Sequencing Technology
• Single Molecule Real Time (SMRT) sequencing technology (PacBio RS)
• based on ‘circular’ DNA molecules read by polymerase
• long reads, up to 10 kb
• error-prone
4th Generation Sequencing Technology
• Protein nanopore sequencing (Oxford Nanopore)
• ultra-long reads, up to 1 Mb, limited by integrity of the DNA
• high error rate, low throughput
Next-Gen Sequencing - What’s All the Fuss About?
The Era of Personal Genomics?
James D. Watson (5/31/2007)
J. Craig Venter (8/4/2007)
Image sources: http://www.ffrf.org/day/img/0406_watson.gif, http://www-news.uchicago.edu/releases/07/images/070601.watson.jpg, http://www.usnews.com/usnews/images/news/photos/venter051022.jpg
It is here. The challenge is interpretation.
“Censoring” of Watson’s ApoE gene (3.6 kb)
Important ethical issues confront personal genomics.
Interpreting Genome Sequences
The ENCODE Project: comprehensive parts list of the functional elements in the human genome
• Pilot Project Description: ENCODE Project Consortium et al. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science (2004) vol. 306 (5696)
• Pilot Project Results: ENCODE Project Consortium et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature (2007) vol. 447 (7146)
Let’s Play “Gene” or “No Gene”
A gene is often a segment of DNA that encodes a protein.
How about DNA that encodes:
• a micro RNA that binds to an mRNA to inhibit translation?
• an RNA spliced out of an intron and used for another function?
• an antisense transcript?
• a long non-coding RNA of unknown function?
• a pseudogene?