
Molecular Biology of the Genome

Christine Queitsch

Department of Genome Sciences

[email protected]

1

Outline

• Information Flow in Genomics
• Gene Structure
• Genetic Linkage
• Mutations
• Chromatin Structure

2

DNA and the Flow of Information

The genetic material: DNA
- Four kinds of subunits (bases A, C, G, T)

Activities within the cell are performed by proteins
- Twenty kinds of subunits (amino acids)

[Figure: amino acid sequence of a protein, running from the N-terminus (H3N+) to the C-terminus (COO-)]

A coding problem: how can four bases (A, C, G, T) specify twenty amino acids?

The “Central Dogma” of Molecular Biology

Information flows one way, into protein
A universal code: 3 nucleotides = 1 amino acid

replication: DNA -> DNA (heredity)
transcription: DNA -> RNA
translation: RNA -> Protein (phenotype)

DNA Structure

• Information content is in the sequence of bases along a DNA molecule
  - rules of base pairing: each strand of the double helix has all the info needed to recreate the other strand

• Redundancy in the code
  - multiple ways that DNA can specify a single amino acid

• Genetic variation: differences in the base sequence between different individuals
  - why individuals vary in their phenotypes

Central Dogma: DNA Replication

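DNA structure: polarity and base pairing

[Figure: double-stranded DNA with antiparallel strands; the Watson strand is drawn 5’ to 3’ and the Crick strand 3’ to 5’]

A pairs with T; G pairs with C

DNA replication: what’s the point?
- duplicate the entire genome prior to cell division
- new subunits can only be added to the 3’ OH of the growing chain

[Figure: replication fork with 5’ and 3’ ends marked, showing the leading strand and the lagging strand]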
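The pairing and polarity rules above are enough to rebuild one strand from the other. A minimal Python sketch, in which the example sequence and the function name are illustrative assumptions rather than part of the slides:

```python
# Watson-Crick pairing rules: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, also written 5' to 3'.

    The two strands are antiparallel, so the input is read backwards
    while each base is swapped for its pairing partner.
    """
    return "".join(PAIR[base] for base in reversed(strand.upper()))


watson = "ATGACTTCAG"  # hypothetical sequence, written 5' -> 3'
crick = reverse_complement(watson)
print("Watson 5'-" + watson + "-3'")
print("Crick  5'-" + crick + "-3'")
```

Applying the function twice returns the original strand, which is one way to see that each strand carries all the information needed to recreate the other.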

Central Dogma: Transcription

Genes: specific segments along the chromosomal DNA that code for some function

Transcription: “copy” gene into RNA (to make a specific protein)

[Figure: chromosomal DNA carrying several genes, each with a promoter and a terminator; an mRNA is copied from each gene]

7

Transcription

Transcription: “copy” gene into RNA to make a specific protein

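[Figure: a gene on double-stranded DNA (strand W drawn 5’ to 3’, strand C drawn 3’ to 5’), labeled with the coding or sense strand, the template strand, RNA polymerase, and the mRNA]

Where’s the 5’ end of the gene? Of the mRNA?

Which way is RNA polymerase moving?

RNA = ribonucleic acid… uses uracil (U) in place of thymine (T)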
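The relationship between the coding strand, the template strand, and the mRNA can be sketched in a few lines of Python. The DNA sequence below is a hypothetical example and the function names are illustrative assumptions:

```python
# DNA base -> complementary DNA base (used to derive the template strand).
DNA_COMPLEMENT = str.maketrans("ATGC", "TACG")
# Template DNA base -> RNA base added by RNA polymerase (U replaces T).
TEMPLATE_TO_RNA = str.maketrans("ATGC", "UACG")


def transcribe(template_3to5: str) -> str:
    """Copy a template strand, written 3' -> 5', into mRNA written 5' -> 3'.

    RNA polymerase reads the template 3' -> 5' and adds bases to the
    3' end of the growing RNA, so the two strings stay aligned base by base.
    """
    return template_3to5.upper().translate(TEMPLATE_TO_RNA)


coding_5to3 = "ATGACTTCAGTAACCATCTAA"  # hypothetical coding (sense) strand
template_3to5 = coding_5to3.translate(DNA_COMPLEMENT)
mrna = transcribe(template_3to5)

print("coding   5'-" + coding_5to3 + "-3'")
print("template 3'-" + template_3to5 + "-5'")
print("mRNA     5'-" + mrna + "-3'")
```

The mRNA has the same sequence as the coding strand with U in place of T, which is why the coding strand is also called the sense strand.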

8

Transcription in vivo

[Figure: a gene being transcribed in vivo; labels: DNA, gene, RNA polymerases, nascent RNA transcripts]

Practice Question

1. Which way (to the right or left) are RNA polymerases moving?

2. Which strand (W or C) is the template strand?

[Figure: a gene on double-stranded DNA; strand W is drawn 5’ to 3’ and strand C 3’ to 5’]

10

Processing of pre-mRNA

Eukaryotic genes are interrupted by introns (non-coding information). Introns must be removed from the RNA before translation in a process called “splicing.”

[Figure: gene (exons and introns) -> pre-mRNA -> mature mRNA; introns are discarded and exons are spliced together. The mature mRNA contains the ORF flanked by UTRs (untranslated regions)]
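Splicing can be pictured as cutting the exons out of the pre-mRNA and joining them in order. The sketch below assumes the exon coordinates are already known; the sequence, the coordinates, and the function name are hypothetical and not taken from the slides:

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon segments in 5' -> 3' order, discarding everything between them.

    `exons` holds 0-based, half-open (start, end) coordinates on the pre-mRNA.
    """
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


# Hypothetical pre-mRNA: exon 1, one intron, exon 2.
exon1 = "AUGGCC"
intron = "GUAAGUUUUCAG"
exon2 = "UUCUAA"
pre_mrna = exon1 + intron + exon2

exon_coords = [
    (0, len(exon1)),
    (len(exon1) + len(intron), len(pre_mrna)),
]
mature = splice(pre_mrna, exon_coords)
print(mature)
```

A real splicing machinery recognizes splice-site signals in the RNA; here the coordinates are simply given, so the example only illustrates the bookkeeping.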

11

Review of the Central Dogma: Translation

Translating the nucleic acid code to a peptide code…

Possible coding systems:

- 1 base per amino acid: could only code for 4 amino acids!
- 2 bases per amino acid: could only code for 16 amino acids
- 3 bases per amino acid: 64 possible combinations… that’s plenty!
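The counting argument above is just powers of four. A short Python sketch (the function name is an illustrative assumption):

```python
from itertools import count

N_BASES = 4         # A, C, G, U
N_AMINO_ACIDS = 20  # kinds of amino acids to encode


def codewords(bases_per_amino_acid: int) -> int:
    """Number of distinct words of the given length over a 4-letter alphabet."""
    return N_BASES ** bases_per_amino_acid


for n in (1, 2, 3):
    enough = codewords(n) >= N_AMINO_ACIDS
    print(f"{n} base(s) per amino acid: {codewords(n)} combinations, enough: {enough}")

# Shortest word length that can distinguish all twenty amino acids.
min_length = next(n for n in count(1) if codewords(n) >= N_AMINO_ACIDS)
print("minimum codon length:", min_length)
```

With 64 combinations for 20 amino acids plus stop signals, there is room for several codons per amino acid.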

12

The triplet code

3 bases = 1 amino acid
More than 1 triplet can code for the same amino acid

Translation: reads the information in RNA to order the amino acids in a protein

[Figure: an mRNA (5’ to 3’) divided into codons, starting at AUG, aligned with the protein it encodes (NH3+ to COO-)]

13

Punctuation:

Start: AUG = methionine, the first amino acid in (almost) all proteins

Stop: UAA, UAG, and UGA. NOT an amino acid!

[Figure: the same mRNA (5’ to 3’) and protein (NH3+ to COO-), with the STOP codon marked after the last amino acid]
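The start and stop rules can be combined into a small virtual translation. This is a sketch that assumes the standard genetic code and scans from the first AUG; the table construction and the function name are illustrative choices, and the example mRNA is the one used on the ribosome slides:

```python
from itertools import product

# Standard genetic code, one-letter amino acids, codons ordered UUU, UUC, UUA, UUG, UCU, ...
# "*" marks the three stop codons (UAA, UAG, UGA).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG to the first in-frame stop codon."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: not an amino acid
            break
        peptide.append(amino_acid)
    return "".join(peptide)


peptide = translate("AUAUGACUUCAGUAACCAUCUAACA")
print(peptide)
```

Under the standard code the last sense codon of this mRNA, AUC, reads as isoleucine. Counting table entries per amino acid (for example, six codons give leucine) shows the redundancy mentioned above.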

The Genetic Code: Who is the interpreter? Where’s the dictionary? What are the rules of grammar?

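tRNA = transfer RNA

- an aminoacyl tRNA synthetase joins an amino acid to its tRNA, giving a charged tRNA
- the anticodon of the tRNA recognizes the codon in the mRNA

[Figure: Met-tRNA with anticodon 3’-UAC-5’ base-paired to the codon 5’-AUG-3’ in the mRNA]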
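Codon-anticodon recognition follows the same antiparallel pairing as DNA, with U in place of T. A minimal sketch that assumes strict Watson-Crick pairing and ignores wobble at the third codon position (function names are illustrative):

```python
RNA_PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}


def anticodon_3to5(codon_5to3: str) -> str:
    """Anticodon written 3' -> 5', so it lines up base by base under the codon."""
    return "".join(RNA_PAIR[base] for base in codon_5to3.upper())


def anticodon_5to3(codon_5to3: str) -> str:
    """The same anticodon written in the conventional 5' -> 3' direction."""
    return anticodon_3to5(codon_5to3)[::-1]


for codon in ("AUG", "ACU", "UCA"):
    print(f"codon 5'-{codon}-3'  pairs with anticodon 3'-{anticodon_3to5(codon)}-5'")
```

For the start codon AUG this gives the anticodon 3’-UAC-5’ shown in the figure.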

15

The ribosome: mediates translation

Locates the 1st AUG, sets the reading frame for codon-anticodon base-pairing

5’ …AUAUGACUUCAGUAACCAUCUAACA… 3’

After the 1st two tRNAs have bound…

[Figure: ribosome on the mRNA with Met-tRNA (anticodon UAC) paired to AUG in the P-site and Thr-tRNA (anticodon UGA) paired to ACU in the A-site]

The ribosome breaks the Met-tRNA bond; Met is instead joined to the second amino acid

…and the now-uncharged tRNA is released

…then the ribosome moves over by 1 codon in the 3’ direction

[Figure: the tRNA carrying Met-Thr (anticodon UGA) is followed by Ser-tRNA (anticodon AGU), which pairs with the next codon, UCA]

When the ribosome reaches the Stop codon… termination

[Figure: the last tRNA (anticodon UAG) is paired with AUC; the ribosome stops at the stop codon UAA]

The finished peptide!

N-terminus  NH3+ Met Thr Ser Val Thr Ile COO-  C-terminus

Practice Question

Which strand on the DNA sequence is the coding (sense) strand? How can you tell?

22

Finding Sense in Nonsense

cbdryloiaucahjdhtheflybitthedogbutnotthecatjhhajctipheq

GGGTATAGAAAATGAATATAAACTCATAGACAAGATCGGTGAGGGAACATTTTCGTCAGTGTATAAAGCCAAAGATATCACTGGGAAAATAACAAAAAAATTTGCATCACATTTTTGGAATTATGGTTCGAACTATGTTGCTTTGAAGAAAATATACGTTACCTCGTCACCGCAAAGAATTTATAATGAGCTCAACCTGCTGTACATAATGACGGGATCTTCGAGAGTAGCCCCTCTATGTGATGCAAAAAGGGTGCGAGATCAAGTCATTGCTGTTTTACCGTACTATCCCCACGAGGAGTTCCGAACTTTCTACAGGGATCTACCAATCAAGGGAATCAAGAAGTACATTTGGGAGCTACTAAGAGCATTGAAGTTTGTTCATTCGAAGGGAATTATTCATAGAGACATCAAACCGACAAATTTTTTATTTAATTTGGAATTGGGGCGTGGAGTGCTTGTTGATTTTGGTCTAGCCGAGGCTCAAATGGATTATAAAAGCATGATATCTAGTCAAAACGATTACGACAATTATGCAAATACAAACCATGATGGTGGATATTCAATGAGGAATCACGAACAATTTTGTCCATGCATTATGCGTAATCAATATTCTCCTAACTCACATAACCAAACACCTCCTATGGTCACCATACAAAATGGCAAGGTCGTCCACTTAAACAATGTAAATGGGGTGGATCTGACAAAGGGTTATCCTAAAAATGAAACGCGTAGAATTAAAAGGGCTAATAGAGCAGGGACTCGTGGATTTCGGGCACCAGAAGTGTTAATGAAGTGTGGGGCTCAAAGCACAAAGATTGATATATGGTCCGTAGGTGTTATTCTTTTAAGTCTTTTGGGCAGAAGATTTCCAATGTTCCAAAGTTTAGATGATGCGGATTCTTTGCTAGAGTTATGTACTATTTTTGGTTGGAAAGAATTAAGAAAATGCGCAGCGTTGCATGGATTGGGTTTCGAAGCTAGTGGGCTCATTTGGGATAAACCAAACGGATATTCTAATGGATTGAAGGAATTTGTTTATGATTTGCTTAATAAAGAATGTACCATAGGTACGTTCCCTGAGTACAGTGTTGCTTTTGAAACATTCGGATTTCTACAACAAGAATTACATGACAGGATGTCCATTGAACCTCAATTACCTGACCCCAAGACAAATATGGATGCTGTTGATGCCTATGAGTTGAAAAAGTATCAAGAAGAAATTTGGTCCGATCATTATTGGTGCTTCCAGGTTTTGGAACAATGCTTCGAAATGGATCCTCAAAAGCGTAGTTCAGCAGAAGATTTACTGAAAACCCCGTTTTTCAATGAATTGAATGAAAACACATATTTACTGGATGGCGAGAGTACTGACGAAGATGACGTTGTCAGCTCAAGCGAGGCAGATTTGCTCGATAAGGATGTTCT

How do you find out if a sequence contains a gene? How do you identify the gene?

23

Reading Frame: the ribosome establishes the grouping of nucleotides into codons from the first AUG encountered.

ORF: open reading frame, from the first AUG to the first in-frame stop. The ORF encodes the information for the protein.

5’ …AUAUGACUUCAGUAACCAUCUAACA… 3’

Starts counting triplets from the A of the first AUG

More generally: a reading frame with a stretch of codons not interrupted by a stop codon. Caveat: non-coding RNAs!

24

Looking for ORFs

- read the sequence 5’ to 3’, looking for stop codons
- try each reading frame
- since we know the genetic code, we can do a virtual translation if necessary
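The search steps above can be sketched as a scan over the three reading frames of one strand. This is a simplified illustration, not a gene finder: it assumes an ORF runs from the first ATG in a frame to the next in-frame stop, and the function name, the `min_codons` parameter, and the example sequence are hypothetical:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 1) -> list[tuple[int, int, int]]:
    """Scan the three forward reading frames of a 5' -> 3' DNA sequence.

    Returns (frame, start, end) tuples; `start` is the index of the A in ATG
    and `end` is one past the last base of the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG in this frame opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


sequence = "ATATGACTTCAGTAACCATCTAACA"  # DNA version of a short example mRNA
for frame, start, end in find_orfs(sequence):
    print(f"frame {frame}: {sequence[start:end]}")
```

To cover the other strand as well, the same scan would be run on the reverse complement. Raising `min_codons` filters out the short ORFs that appear by chance in any sequence.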

25

How to identify genes experimentally?

Outline

• Information Flow in Genomics
• Gene Structure
• Genetic Linkage
• Mutations
• Chromatin Structure

26

Gene Structure: The Parts List

Genomic DNA for a protein-coding eukaryotic gene comprises regulatory and coding sequences

[Figure: gene model showing Enhancer, Promoter, 5’ UTR, Exons, Introns, and 3’ UTR, with CRMs marked]

• Promoter – proximal regulatory element
• Enhancer – distal regulatory element
• CRM (cis-regulatory motif) – can be upstream or downstream of promoter, proximal or distal

27

Promoters

• Promoters are specific sites on DNA to which RNA polymerase first binds to initiate transcription of a gene

• They are composed of a variety of cis-sequence elements, which recruit trans-acting factors through DNA-protein interactions

28

Core Promoter Elements

[Figure: gene model (Enhancer, Promoter, 5’ UTR, Exons, Introns, 3’ UTR) with the core promoter expanded to show the BRE, the TATA box, and the initiator (inr, consensus PyPyANTAPyPy); positions are marked from about -50 to +1]

- not all elements required
- many promoters lack a TATA box, using instead the functionally analogous initiator (inr) element

29

Combinatorial Gene Regulation

• Most eukaryotic genes have multiple cis regulatory motifs located outside of the core promoter region

• Can be located in promoter-proximal regions, 3’ downstream regions, and many kb away from the target gene

• Allows for combinatorial control of gene expression

30

Distal regulatory elements: Enhancers

Enhancers:

- Can function in either orientation
- Can occur far (>50 kb) from the gene
- Can be up or downstream
- Range in size from ~50 to 200 bp
- Contain multiple TF binding sites

[Figure: “Enhanceosome”, from http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=mcb.figgrp.2601]

31

Untranslated Regions (UTRs)

[Figure: mRNA with 5’ UTR, Exons, and 3’ UTR]

• Most eukaryotic mRNAs contain untranslated regions at their 5’ and 3’ ends

• The 5’ UTR is the region between the start of transcription and the start of translation

• The 3’ UTR is the region between the stop codon and the poly-A tail

• Both the 5’ and 3’ UTRs can contain cis regulatory sequences that bind TFs, influence transport to the cytoplasm, and mediate transcript stability and translational control

32

Alternative Splicing

• mRNA from some genes can be spliced into two or more distinct transcripts

• Creates protein diversity (isoforms)

5’ splice site 3’ splice site

33

Let’s Play “Gene” or “No Gene”

A gene is often a segment of DNA that encodes a protein.

How about DNA that encodes:

- a micro RNA that binds to an mRNA to inhibit translation?
- an RNA spliced out of an intron and used for another function?
- an antisense transcript?
- a long non-coding RNA of unknown function?
- a pseudogene?

Outline

• Information Flow in Genomics
• Gene Structure
• Genetic Linkage
• Mutations
• Chromatin Structure

35

Transmission of Genetic Information

Elements of cell division

- Cell growth
- Chromosome duplication
- Chromosome segregation

[Figure: life cycle alternating between diploid (2N) and haploid (1N) stages, with chromosomes shown decondensed and condensed]

Meiosis

Interphase: Chromosomes replicate

Meiosis I: Reductive division, homologous chromosomes separate

Meiosis II: Sister chromatids separate

37

Recombination

38

How Does Distance Between Loci Affect Transmission?

Independent Assortment: loci are unlinked or far enough apart that they are transmitted independently from one another

Genetic linkage: loci are close enough together on a chromosome to be transmitted together

39

Genetic Mapping

The frequency of recombination between loci is based on the distance between them

40

Recombination Is A Measure of Distance

• Recombination fraction, θ = the probability that a recombinant gamete is transmitted

• If two loci are on different chromosomes, they will segregate independently
  => recombination fraction θ = 0.5

• If two loci are right next to each other, they will segregate together during meiosis
  => recombination fraction θ = 0

• Jargon:
  θ < 0.5: the loci are close (they are linked)
  θ = 0.5: the loci are far apart (they are not linked)

Recombination Is A Measure of Distance

 

Map Distance = (Number of Recombinant Gametes / Total Number of Gametes) x 100

Centimorgan (cM): a unit of chromosome length; equals the length of chromosome over which crossing-over occurs with 1% frequency
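The map-distance formula is a one-line calculation. In the sketch below the gamete counts are made-up numbers for illustration, and the function name is an assumption:

```python
def map_distance_cm(recombinant_gametes: int, total_gametes: int) -> float:
    """Map distance in centimorgans: percent of gametes that are recombinant."""
    if total_gametes <= 0:
        raise ValueError("total_gametes must be positive")
    return 100 * recombinant_gametes / total_gametes


# Hypothetical cross: 15 recombinant gametes out of 300 scored.
distance = map_distance_cm(15, 300)
print(f"{distance:.1f} cM")
```

This direct conversion is only a good approximation for closely linked loci; over longer intervals, double crossovers go undetected and the observed recombination frequency underestimates the true distance.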

42

Practice Question

• In maize, consider three recessive phenotypes: lazy growth (ll), glossy leaves (gg), and sugary endosperm (ss).

• The following cross was made: Ll Gg Ss x ll gg ss, and the observed progeny distribution was as follows (neither gene order nor linkage phase is known):

Phenotype               Number
wildtype                   286
lazy                        33
glossy                      59
sugary                       4
lazy, glossy                 2
lazy, sugary                44
glossy, sugary              40
lazy, glossy, sugary       272
Total                      740

• Determine the order of, and the distances among, the three genes

43

Where to begin?

Rule 1: The two most frequent gamete types are the parental types

Parental types will constitute ≥ 50% of all progeny, so the cross was:

L G S / l g s (wild-type for all)  x  l g s / l g s (lazy, glossy, sugary)

Progeny Phenotype       Progeny Genotype    Number
wildtype                L G S / l g s          286
lazy                    l G S / l g s           33
glossy                  L g S / l g s           59
sugary                  L G s / l g s            4
lazy, glossy            l g S / l g s            2
lazy, sugary            l G s / l g s           44
glossy, sugary          L g s / l g s           40
lazy, glossy, sugary    l g s / l g s          272
Total                                          740

Linkage phase in the heterozygous parent?

Possible arrangements: L G S / l g s, L g S / l G s, L g s / l G S, L G s / l g S

…which variants of L, G, and S are physically linked, and in which order?

Rule 2: The double-recombinant gametes will be the two least frequent types.

Rule 3: The effect of double crossovers is to interchange the members of the middle pair of alleles between the chromosomes

A B C / a b c  ->  A b C / a B c

Parental types: L G S and l g s

Double-crossover types: L G s and l g S

Which gene is in the middle? The double-crossover types differ from the parental types only at S, so S is in the middle: the parental chromosomes are L S G / l s g, and the double crossover gives L s G and l S g.

Now you know the linkage phase of the heterozygous parent and the gene order… how far apart are these genes?

Count the cross-overs between adjacent genes

• In the parent, the L allele is on the same homolog as S, and l is on the same homolog as s. If these get broken up, there was a cross-over between the L and S loci

• In the parent, S is on the same homolog as G, and s is on the same homolog as g. If these get broken up, there was a cross-over between the S and G loci

Rule 4: Reciprocal products are expected to occur in approximately equal numbers

• LSG ≈ lsg (286 ≈ 272)
• LSg ≈ lsG (59 ≈ 44)
• Lsg ≈ lSG (40 ≈ 33)
• LsG ≈ lSg (4 ≈ 2)

Progeny Phenotype       Progeny Genotype    Number   Crossover or Non-Crossover?
wildtype                L G S / l g s          286   Parental (NCO)
lazy                    l G S / l g s           33   single CO between L and S
glossy                  L g S / l g s           59   single CO between S and G
sugary                  L G s / l g s            4   double CO
lazy, glossy            l g S / l g s            2   double CO
lazy, sugary            l G s / l g s           44   single CO between S and G
glossy, sugary          L g s / l g s           40   single CO between L and S
lazy, glossy, sugary    l g s / l g s          272   Parental (NCO)
Total                                          740

Rec Freq L-S         Rec Freq S-G
l G S    33          L g S    59
L g s    40          l G s    44
L G s     4          L G s     4
l g S     2          l g S     2
Total    79          Total   109

79/740 or 10.7% of gametes are recombinant between L & S.

distance between L & S = 10.7 map units

109/740 or 14.7% of gametes are recombinant between S & G.

distance between S & G = 14.7 map units

L ---- 10.7 mu ---- S ---- 14.7 mu ---- G
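The four rules can be applied mechanically to the progeny counts. The sketch below is one possible encoding, assuming each gamete from the heterozygous parent is written as a three-letter string in the order L, G, S (upper case for the wild-type allele); the variable and function names are illustrative:

```python
# Gametes contributed by the heterozygous parent, alleles listed as L, G, S.
counts = {
    "LGS": 286, "lGS": 33, "LgS": 59, "LGs": 4,
    "lgS": 2, "lGs": 44, "Lgs": 40, "lgs": 272,
}
LOCI = "LGS"
total = sum(counts.values())

# Rules 1 and 2: most frequent classes are parental, least frequent are double crossovers.
ranked = sorted(counts, key=counts.get, reverse=True)
parentals, double_cos = ranked[:2], ranked[-2:]

# Rule 3: a double crossover swaps only the middle locus relative to a parental type.
middle = None
for parent in parentals:
    diff = [i for i in range(3) if parent[i] != double_cos[0][i]]
    if len(diff) == 1:
        middle = diff[0]
outer = [i for i in range(3) if i != middle]


def recombination_percent(i: int, j: int) -> float:
    """Percent of gametes whose alleles at loci i and j are in a non-parental combination."""
    parental_pairs = {(p[i], p[j]) for p in parentals}
    recombinant = sum(n for g, n in counts.items() if (g[i], g[j]) not in parental_pairs)
    return 100 * recombinant / total


print("gene order:", LOCI[outer[0]], LOCI[middle], LOCI[outer[1]])
for i in outer:
    print(f"{LOCI[i]}-{LOCI[middle]}: {recombination_percent(i, middle):.1f} map units")
print(f"{LOCI[outer[0]]}-{LOCI[outer[1]]} (two-point): "
      f"{recombination_percent(outer[0], outer[1]):.1f} map units")
```

The two-point estimate between the outer genes comes out smaller than the sum of the two adjacent intervals, because double-crossover gametes look non-recombinant for the outer pair.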

Outline

• Information Flow in Genomics
• Gene Structure
• Genetic Linkage
• Mutations
• Chromatin Structure

54

Causes and types of mutations

55

Small-scale mutations

• Spontaneous mutations – DNA decay (deamination, change in hydrogen bond, etc.)
• Replication errors – failure to repair DNA damage in template strand
• DNA repair errors – double strand break repairs are error-prone (NHEJ repair)
• Induced mutations – oxidative damage, mutagens (e.g. EMS)

-> substitutions, small insertions and deletions

Most commonly considered in human genetics!

Causes and types of mutations

56

Large-scale mutations

• Loss of heterozygosity – common in cancers
• Large duplications and deletions – errors in recombination through micro-homologies or repetitive (transposable) elements
• Chromosomal rearrangements – inversions or translocations of chromosomal segments
• Aneuploidy – whole chromosome loss or duplication
• Induced mutations – ionizing radiation causes dsDNA breaks

-> mutations often affect many genes, generate gene fusions, place genes in altered regulatory context, alter gene dosage

Commonly ignored in human genetics!

Mutation types and consequences

57

Recessive mutations – phenotypic consequences are buffered by the wild-type allele; dosage from one allele is sufficient

Dominant mutations – phenotypic consequences arise from one mutated allele

• Gain of function – prevalent in cancer, gene fusions
• Haploinsufficiency – one wild-type allele is not enough for function: copy number variant-associated disorders, autism, Williams syndrome, polydactyly, Marfan syndrome
• Dominant negative – poisons the function of the wild-type protein: p53, Marfan syndrome

Outline

• Information Flow in Genomics
• Gene Structure
• Genetic Linkage
• Mutations
• Chromatin Structure

58

Chromosome Structure: Coils of Coils of Coils…

Local unpacking of chromatin allows gene expression and replication

[Figure: levels of chromatin packing, from the nucleosome up to the condensed chromosome at mitosis]

59

Nucleosomes

• ~146 bp of DNA wrapped around nucleosome
• ~80 bp linker
• histone octamer

60

Histone Modification and Chromatin Activity

61

What Do These Modifications Do? A Histone Code?

• modifications change interaction with DNA and trans-factors
• can activate or repress transcription
• reinforce regulatory patterns set up by TFs

“Distinct histone modifications, on one or more tails, act sequentially or in combination to form a ‘histone code’ that is read by other proteins to bring about distinct downstream events” (Strahl and Allis, 2000, Nature, 403:41)

Carey et al. Cell (2007) 128:707

62

DNA modification also contributes to chromatin state

DNA methylation can change the activity of a DNA segment without changing the sequence. In gene promoters, DNA methylation typically acts to repress gene transcription.

[Figure label: methylated Adenine]

DNA methylation is essential for normal development and is associated with a number of key processes including genomic imprinting, X-chromosome inactivation, repression of transposable elements, aging, and carcinogenesis.

DNA methylation patterns differ among organisms

• no DNA methylation in common model organisms such as C. elegans and D. melanogaster

• in plants and other organisms, DNA methylation occurs as CpG, CHG or CHH (where H corresponds to A, T or C)

• in mammals, almost exclusively as CpG, with the exception of embryonic stem cells and developing neuronal cells, which show CHH

• CpGs are depleted in mammalian genomes, with the exception of CpG islands in gene promoters (in ~70% of genes, un-methylated)

64

Deamination of 5-methyl cytosine is mutagenic

65

• spontaneous deamination -> C/G into T/A
• most common base substitution in human genome
• explains CpG depletion in the genome
• unmethylated CpGs in promoters protected by DNA repair

[Figure: chemical structures of cytosine and uracil]

Genome sequencing

ENCODE, modENCODE,

“plantENCODE”
