Page 1: In PowerPoint format

Hilary Term 04: “The Human Genome”

20.1 The Human Genome – evolutionary issues (Hein)

27.1 Non-Genic Selection in the Human Genome (Lunter)

3.2 Mammalian Genes I: Conservation and slow evolution (Ponting)

10.2 Mammalian Genes II: Functional innovation and rapid change (Ponting/Goodstadt)

17.2 RNAs in Human Genome (Sam Griffiths-Jones)

24.2 Population Genetics of the Human Genome (Gil McVean)

2.3 Association Mapping and the Human Genome (Lon Cardon)

9.3 The Human Genome and Human Evolution (Chris Tyler-Smith)

Page 2: In PowerPoint format

The Human Genome – key issues

The Human Genome Project

Few basic facts of the human genome

Grammar of Genes

Basic events happening to a genome per mitosis/generation

Genealogical Structures: Phylogenies, Pedigrees and the ARG

Long term Dynamics of the Human Genome: The comparative aspect

(Genotype Phenotype) & (Population Genetics/History) => Gene Mapping

History

Our interests.

Page 3: In PowerPoint format

History of the Human Genome Project

Strachan and Read, HMG3 p213

1956 Physical map. 24 types and total set of 46 chromosomes

1977 Sanger publishes dideoxy sequencing method

1980 Botstein proposes human genetic map using RFLPs

1987 US DOE publishes report discussing HGP

1988 HUGO is established

1990 Official start of HGP with $3 billion and a 15-year horizon.

1991 Genome Database (GDB) is established

1992 Genethon publishes map based on microsatellites.

1995 Lander et al. detailed map based on sequence tagged sites.

1998 Comprehensive map based on gene markers.

1999 Sanger Centre publishes chromosome 22

2001 Draft Genome published: Celera & Public

2003 Completion (almost) of Human Genome

Page 4: In PowerPoint format

Sequencing Strategies (from Myers 99)

Public effort – strategy:

Celera – strategy:

Celera’s view of the International Consortium (IC): Unfair competition: IC delivering the same goods but with state funding.

International Consortium’s view of Celera: Unfair competition: Celera delivering the same goods but can use IC data, while IC cannot use Celera data.

Page 5: In PowerPoint format

Other Genome Projects

1976/79 First viral genome – MS2/ΦX174

1980 Mitochondrion

1982 First shotgun sequenced genome – Bacteriophage lambda

1995 First prokaryotic genome – H. influenzae

1996 First unicellular eukaryotic genome – Yeast

1998 The first multicellular eukaryotic genome – C.elegans

2000 Drosophila melanogaster

2000 Arabidopsis thaliana

2001 Human Genome

2002 Mouse Genome

The Genome OnLine Database knows of 958 genome sequencing projects, of which 169 are completed

Page 6: In PowerPoint format

Favourite and Model Organisms

Strachan and Read (2004) Chapter 8

Multicellular Animals

Mammals: Human 3.5 Gb, Mouse 3.2 Gb, Cow 3.0 Gb, Dog 2.8 Gb, Rat 3.1 Gb, Chimp 3.5 Gb, Pig 3.0 Gb

Fish: Puffer Fish 0.4 Gb, Zebra Fish 1.9 Gb

Insects: Drosophila 165 Mb, Honey Bee 270 Mb, Yellow Fever Mosquito 780 Mb, Malaria Mosquito 278 Mb

Birds: Chicken 1.2 Gb

Frog: Xenopus laevis 1.7 Gb

Nematodes: Caenorhabditis elegans 100 Mb, Caenorhabditis briggsae 80 Mb

Sea Urchin: Strongylocentrotus purpuratus 800 Mb

Multicellular Plants

Arabidopsis thaliana 125 Mb, Rice 430 Mb

Page 7: In PowerPoint format

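The Human Genome I http://www.sanger.ac.uk/HGP/ & R.Harding & HMG (2004) p 245

[Figure: successive magnifications from the whole genome down to DNA sequence and protein. Scale labels: 3.2*10^9 bp, 6*10^4 bp, 3*10^3 bp, 30 bp. Magnification labels: *5.000, *20, *10^3. Gene panel: globin (chromosome 11) with 5’ flanking, Exon 1, Exon 2, Exon 3, 3’ flanking; further labels: Myoglobin, globin. DNA: ATTGCCATGTCGATAATTGGACTATTTGGA (30 bp). Protein: aa aa aa aa aa aa aa aa aa aa.]

[Karyotype panel: chromosomes 1-22, X and Y, with sizes listed in that order (unit not given, apparently Mb): 279, 251, 221, 197, 198, 176, 163, 148, 140, 143, 148, 142, 118, 107, 100, 104, 88, 86, 72, 66, 45, 48, 163, 51. Mitochondria: .016]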

Page 8: In PowerPoint format

The Human Genome II http://www.sanger.ac.uk/HGP/

Strachan and Read (2004) Chapter 9

                            Nuclear Genome          Mitochondria
Highly conserved - coding   1.5%                    93%
Highly conserved - other    3.5%                    5%
Transposon based repeats    45%                     -
Heterochromatin             6.6%                    -
Other non-conserved         44%                     2%
                            Mendelian inheritance   Maternal inheritance
                            1 (typically)           Possibly thousands
                            Recombination           No recombination
Gene Density:               1/130 kb                2 kb

Pseudogenes: 20000

Processed Pseudogenes

Page 9: In PowerPoint format

The Human Genome III http://www.sanger.ac.uk/HGP/

Strachan and Read (2004) Chapter 9 + Lander et al.(2001)

Gene families

Clustered

-globins (7), growth hormone (5), Class I HLA heavy chain (20),….

Dispersed

Pyruvate dehydrogenase (2), Aldolase (5), PAX (>12),..

Clustered and Dispersed

HOX (38 – 4), Histones (61 – 2), Olfactory receptors (>900 – 25),…

Transposons

Page 10: In PowerPoint format

Genes and Gene Structures I

•Presently estimated Gene Number: 24.000 (reference: )

•Average Gene Size: 27 kb

•The largest gene: Dystrophin 2.4 Mb - 0.6% coding – 16 hours to transcribe.

•The shortest gene: tRNATYR 100% coding

•Largest exon: ApoB exon 26 is 7.6 kb Smallest: <10bp

•Average exon number: 9

•Largest exon number: Titin 363 Smallest: 1

•Largest intron: WWOX intron 8 is 800 kb Smallest: 10s of bp

•Largest polypeptide: Titin 38.138 smallest: tens – small hormones.

•Intronless Genes: mitochondrial genes, many RNA genes, Interferons, Histones,..

Jobling, Hurles & Tyler-Smith (2004) HEG p 29 + HMG chapt. 9

Page 11: In PowerPoint format

Genes and Gene Structures II

Genes within Genes:

Intron 26 of neurofibromatosis type I (NF1) contains 3 internal (2 exons) genes in the opposite direction.

Overlapping Genes:

Class III region of HLA

Strachan and Read (2004) Chapter 9 p 258

Simple Eukaryotic

Page 12: In PowerPoint format

Alternative Splicing

Cartegni,L. et al.(2002) “Listening to Silence and understanding nonsense: Exonic mutations that affect splicing” Nature Reviews Genetics 3.4.285-

HMG p291-294

1. A challenge to automated annotation.

2. How widespread is it?

3. Is it always functional?

4. How does it evolve?

Page 13: In PowerPoint format

RNAs in the Genome

Strachan and Read (2004) p.247 F9.4

~200 snoRNA small nucleolar, over 100 types - RNA modification and processing

~100 snRNA small nuclear - involved in splicing

~200 miRNA very small (~22 nt), regulation

~175 28S, 5.8S, 5S large cytosolic subunit

~175 18S small cytosolic subunit

~250 5S large cytosolic subunit

>500 tRNA transfer RNA

>1500 Antisense RNA > 1500 types

Page 14: In PowerPoint format

Genome Annotation

Ensembl

http://www.ensembl.org

Santa Cruz Genome Browser

http://genome.ucsc.edu/

Genomes

Proteins

ESTs

Page 15: In PowerPoint format

Gene Finding and Protein (HMM) Descriptors

Burge & Karlin jmb 96

A. Assign gene characteristics to each nucleotide. Extract a legal prediction by dynamic programming.

B. Use HMM to describe biological knowledge of gene structure.
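Points A and B can be illustrated with a toy example. The sketch below is not the Burge & Karlin model: it assumes a hypothetical two-state HMM ("noncoding" and "coding") with invented start, transition and emission probabilities, and uses the Viterbi dynamic-programming recursion to label each nucleotide with its most probable state.

```python
import math

# Hypothetical two-state HMM; every probability here is invented for illustration.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.95, "coding": 0.05},
    "coding": {"noncoding": 0.05, "coding": 0.95},
}
EMIT = {
    "noncoding": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},  # AT-rich
    "coding": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},     # GC-rich
}


def viterbi(sequence):
    """Return the most probable state path for a non-empty DNA string."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][sequence[0]]) for s in STATES}
    backpointers = []
    for symbol in sequence[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor of state s at this position
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][symbol]))
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    seq = "AT" * 10 + "GC" * 30 + "AT" * 10
    labels = viterbi(seq)
    print("".join("C" if s == "coding" else "n" for s in labels))
```

A real gene finder uses many more states (exons in three frames, introns, splice sites, UTRs) so that only "legal" gene structures have non-zero probability; the recursion stays the same.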

Page 16: In PowerPoint format

Mutations and Mutation Rates

1 mitosis or generation

Average Number of Mitoses

Male generation (age:mitoses): 15:35 .. 20:150

Female generation: ~24

Crow,JF (2000) “The Origins, Patterns and Implications of Human Spontaneous Mutation” Nature Review Genetics 1.1.40-47 + Strachan and Read (2004) chapter 11 +Jobling, Hurles and Tyler-Smith (2004) chapter 2

• Single nucleotide substitutions: ~10^-7

• Microsatellites (~100.000): ~10^-2

• Small insertion deletions: ~10^-8

Page 17: In PowerPoint format

Recombination

1 meiosis

Lander et al.(2001) “Initial sequencing and analysis of the human genome” Nature 409.860-912. + Kong,A. et al.(2002) “A high resolution recombination map of the human genome” Nature Genetics

Recombination:

Gene Conversion:

•Total Haploid length males: 25.9 M - females: 44.6 M.

•Gene conversions 1-2 orders of magnitude higher. Length 300-2000 bp.
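As a rough worked example, the map lengths above can be turned into genome-wide average recombination rates. The sketch assumes the 3.2*10^9 bp genome size quoted earlier in the deck as the denominator and takes the sex-averaged length as the plain mean of the male and female figures; the function names are hypothetical.

```python
# Map lengths (Morgans) are the figures quoted on the slide.
# Using 3.2*10^9 bp as the haploid genome size is an assumption for this sketch.
GENOME_BP = 3.2e9
MALE_MAP_MORGANS = 25.9
FEMALE_MAP_MORGANS = 44.6


def cm_per_mb(map_length_morgans, genome_bp=GENOME_BP):
    """Average recombination rate in centimorgans per megabase."""
    centimorgans = map_length_morgans * 100.0
    megabases = genome_bp / 1e6
    return centimorgans / megabases


def sex_averaged_map_length(male=MALE_MAP_MORGANS, female=FEMALE_MAP_MORGANS):
    """Plain mean of the male and female map lengths, in Morgans."""
    return (male + female) / 2.0


if __name__ == "__main__":
    rows = (
        ("male", MALE_MAP_MORGANS),
        ("female", FEMALE_MAP_MORGANS),
        ("sex-averaged", sex_averaged_map_length()),
    )
    for label, length in rows:
        print(f"{label}: {length:.2f} M, {cm_per_mb(length):.2f} cM/Mb")
```

Since one Morgan corresponds to one expected crossover per meiosis, the map length in Morgans can also be read directly as the expected number of crossovers per meiosis.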

Page 18: In PowerPoint format

Selection: Positive & Negative

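[Figure: a “One sequence scenario” panel and a “Population scenario” panel (alleles A and C). A second “One sequence scenario” panel follows a pair of codons through successive substitutions, with labels in this order: ThrSer ACGTCA; ThrProPro ACGCCA; ThrSer ACGCCG; ArgSer AGGCCG; ThrSer ACTCTG; AlaSer GCTCTG; AlaSer GCACTG.]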

Certain events have functional consequences and will be selected out. The strength and localization of this selection is of great interest.

The selection criteria could in principle be anything, but selection against amino acid changes is by far the most important.

Page 19: In PowerPoint format

The Genetic Code

Substitutions Number Percent

Total in all codons 549 100

Synonymous 134 25

Nonsynonymous 415 75

Missense 392 71

Nonsense 23 4
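The counts in this table can be reproduced by enumerating every single-nucleotide change in each of the 61 sense codons of the standard genetic code. The sketch below does that; the function name and the compact codon-table encoding are choices made for this example, not something taken from the slides.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitutions(table=CODON_TABLE):
    """Count single-nucleotide substitutions in sense codons by their effect."""
    counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
    for codon, aa in table.items():
        if aa == "*":  # stop codons are not used as starting points
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                new_aa = table[mutant]
                if new_aa == aa:
                    counts["synonymous"] += 1
                elif new_aa == "*":
                    counts["nonsense"] += 1
                else:
                    counts["missense"] += 1
    return counts


if __name__ == "__main__":
    result = classify_substitutions()
    total = sum(result.values())
    for kind, n in result.items():
        print(f"{kind}: {n} of {total}")
```

Each of the 61 sense codons has 9 one-step neighbours, which gives the total of 549; nonsynonymous is the sum of missense and nonsense.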

Page 20: In PowerPoint format

Examples of rates remade from Li,1997

Organism          Gene            Syno/year     Non-Syno/year

RNA Virus
Influenza A       Hemagglutinin   13.1*10^-3    3.6*10^-3
Hepatitis C       E               6.9*10^-3     0.3*10^-3
HIV 1             gag             2.8*10^-3     1.7*10^-3

DNA virus
Hepatitis B       P               4.6*10^-5     1.5*10^-5
Herpes Simplex    Genome          3.5*10^-8

Nuclear Genes
Mammals           c-mos           5.2*10^-9     0.9*10^-9
Mammals           α-globin        3.9*10^-9     0.6*10^-9
Mammals           histone 3       6.2*10^-9     0.0
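One way to read the table is through the ratio of the nonsynonymous to the synonymous rate, which is small when amino acid changes are strongly selected against. The sketch below computes that ratio from the values above; the dictionary layout and function name are assumptions of this example, and Herpes Simplex is left out because only one rate is listed for it.

```python
# (synonymous rate, nonsynonymous rate) per site per year, copied from the table.
RATES = {
    ("Influenza A", "Hemagglutinin"): (13.1e-3, 3.6e-3),
    ("Hepatitis C", "E"): (6.9e-3, 0.3e-3),
    ("HIV 1", "gag"): (2.8e-3, 1.7e-3),
    ("Hepatitis B", "P"): (4.6e-5, 1.5e-5),
    ("Mammals", "c-mos"): (5.2e-9, 0.9e-9),
    ("Mammals", "alpha-globin"): (3.9e-9, 0.6e-9),
    ("Mammals", "histone 3"): (6.2e-9, 0.0),
}


def nonsyn_to_syn_ratio(synonymous, nonsynonymous):
    """Ratio of nonsynonymous to synonymous substitution rate."""
    return nonsynonymous / synonymous


if __name__ == "__main__":
    for (organism, gene), (syn, nonsyn) in RATES.items():
        ratio = nonsyn_to_syn_ratio(syn, nonsyn)
        print(f"{organism} {gene}: {ratio:.2f}")
```

Histone 3 sits at one extreme with a ratio of zero, while the viral genes show both much higher absolute rates and, in some cases, a larger share of amino acid changing substitutions.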

Page 21: In PowerPoint format

Genealogical Structures

Homology: The existence of a common ancestor (for instance for 2 sequences)

Phylogeny Pedigree:

Ancestral Recombination Graph – the ARG

ccagtcg

ccggtcgcagtct

Only finding common ancestors. Only one ancestor.

i. Finding common ancestors.

ii. A sequence encounters Recombinations

iii. A “point” ARG is a phylogeny

Page 22: In PowerPoint format

Populations

Now

Parents

Grand parents

Page 23: In PowerPoint format

Genealogical approach to Population Variation Analysis

Africa Non-Africa

Inter.SNP Consortium (2001): A map of human genome sequence variation containing 1.42 million SNPs. Nature 409.928-33

Page 24: In PowerPoint format

Pedigrees

Icelandic: http://www.decode.com + Helgason, A. et al. (2003 June) “A population-wide coalescent analysis of Icelandic matrilineal and patrilineal genealogies: Evidence for a faster evolutionary rate of mtDNA lineages than Y-chromosomes” American Journal of Human Genetics.

Chinese: http://demography.anu.edu.au/People/Staff/zhongwei.html

Burke’s British Peerage: http://www.burkes-peerage.net/sites/wars/sitepages/home.asp

Mormons: http://genealogy-mormons.com/

Quebec French: Heyer and Tremblay, 1998 PNAS

[Figure (Helgason): total pedigree linking an ancestor cohort (ancestral cohort born 1848-1892) to a contemporary cohort (descendant cohort born after 1972; year axis 1848, 1892, 1972, 2002), shown separately for matrilines and patrilines. Recoverable labels: N = 31,817; N = 31,659; N = 66,910; N = 64,150; 77.9% / 22.1%; 8.3% / 91.7%; 86.2% / 13.8%; 73.9% / 26.1%.]

Page 25: In PowerPoint format

Genealogical Questions

Pedigrees

Time back to first individual common ancestor to everyone

ARG questions:

The height of ARGs - correlation between local phylogenies

Gene Phylogeny Questions

Total Branch Length - Height

Page 26: In PowerPoint format

Long Term Evolutionary History: Myr/Gyr

Origin of Life

Last Universal Common Ancestor – LUCA

First Eukaryotes

First Chordates

First Vertebrates

First Mammals

First Primates

First Hominoids

Chimp-Human Split

Hedges, SB (2002) “The Origin and Evolution of Model Organisms” Nature Review Genetics 3.11.838-848.

Brown (2003) “Horizontal Genetic Transfers “ Nature Genetics

Page 27: In PowerPoint format

The Comparative Aspect.

[Figure: three observable sequences (ATTGCGTATATAT….CAG) joined through an unobservable evolutionary path to the MRCA (Most Recent Common Ancestor), marked “?”. Parameters: time, rates, selection. Arrow: time direction.]

3 Problems:

i. Test all possible relationships.

ii. Examine unknown internal states.

iii. Explore unknown paths between states at nodes.

Page 28: In PowerPoint format

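[Figure: observable sequence and unobservable structure, illustrated with an RNA structure; labels: U, C, G, A, C, AU, AC.]

$$P(\text{Structure} \mid \text{Sequence}) = \frac{P(\text{Sequence} \mid \text{Structure})\,P(\text{Structure})}{P(\text{Sequence})}$$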

Goldman, Thorne & Jones, 96

RNA Structure

Gene Structure

One Principle of Comparative Genomics

Protein Structure

Page 29: In PowerPoint format

Molecular Evolution and Gene Finding: Two HMMs

Simple Prokaryotic Simple Eukaryotic

AGTGGTACCATTTAATGCG.....

AGTGGTACTATTTAGTGCG.....

Pcoding{ATG-->GTG} or Pnon-coding{ATG-->GTG}

Page 30: In PowerPoint format

The Rise of Comparative Genomics

Lander et al(2001) Figure 25A

Page 31: In PowerPoint format

The Domain of Comparative Genomics

Sequences: ACTGT, ACTCCT

RNA (Secondary) Structure

Protein Structure: Renin, HIV proteinase

Gene Order/Orientation. [Figure: gene orders in Cabbage and Turnip; labels: 87654321, 4, 75 31 86 2]

Gene Structure

Interaction Networks

Any Graph.

General Theme.

Formal Model of Structure

Stochastic Model of Structure Evolution.

Page 32: In PowerPoint format

Linkage Mapping

rM

D

From McVean

Page 33: In PowerPoint format

A set of characters.

Binary decision (0,1).

Quantitative Character.

Dominant/Recessive.

Penetrance

Spurious Occurrence

Heterogeneity

genotype Genotype Phenotype phenotype

2Ne generations

Association/Fine scale mapping

Page 34: In PowerPoint format

Single marker association

Bayesian analysis

1000 cases and 1000 controls typed at 8 microsatellite markers

BRCA2 example

Rafnar et al.(2004) – Morris et al(2001) + Causative SNPs.
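For orientation, the simplest form of single marker association is a Pearson chi-square test on allele counts in cases versus controls. The sketch below shows that classical test, not the Bayesian analysis cited on this slide; it assumes a two-allele marker, and the counts in the usage example are hypothetical, chosen only to match 1000 cases and 1000 controls (2000 alleles each).

```python
import math


def allelic_chi_square(case_a, case_b, control_a, control_b):
    """Pearson chi-square (1 d.f.) for a 2x2 table of allele counts.

    Returns (chi2, p_value). Every row and column total must be positive.
    """
    table = [[case_a, case_b], [control_a, control_b]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_a + control_a, case_b + control_b]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical allele counts: 600/1400 in cases, 500/1500 in controls.
    chi2, p = allelic_chi_square(600, 1400, 500, 1500)
    print(f"chi2 = {chi2:.2f}, p = {p:.2g}")
```

A multi-allelic microsatellite would need a larger table (or allele grouping) and more degrees of freedom, and testing 8 markers calls for some multiple-testing correction; both are outside this sketch.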

Page 35: In PowerPoint format

Short Term Evolutionary History: Kyr/Myr

Oldest Polymorphisms

Neutral Human Autosomal Polymorphisms

First Out-of-Africa

Anatomically Modern Man

Peopling of the Globe – genetic and fossil evidence.

The globe & migrations:

Cavalli-Sforza,2001 + HEG (2004)

Supposedly well behaved populations

Iceland

Finland

Sardinia

Page 36: In PowerPoint format

HapMap

“The International HapMap Project” Nature 426, 789 - 796 (18 Dec 2003) http://www.hapmap.org/

Started October 27-29, 2002

Page 37: In PowerPoint format

HapMap

Page 38: In PowerPoint format

Ontologies

http://www.geneontology.org

Gene Ontology Consortium (2001) “Creating the Gene Ontology Resource: Design and Implementation.” Genome Research 11.1425-33

Gene Ontology Consortium (2004) “The Gene Ontology (GO) database and informatics resource” Nucleic Acids Research 32.D258-61.

A Structured Vocabulary – Consistent across species.

Purpose:

Facilitate communication among researchers

Facilitate communication among computer systems

2001: Three Ontologies:

Molecular Function

Biological Process

Cellular Component

Source NAR (2004) 32.D258-

Page 39: In PowerPoint format

Structural Genomics: Systematic Structure Determination

http://www.strgen.org/ http://www.nysgrc.org/ http://www.oppf.ox.ac.uk/ http://pdb.ccdc.cam.ac.uk/pdb/strucgen.html

John Westbrook, Zukang Feng, Li Chen, Huanwang Yang and Helen M. Berman “The Protein Data Bank and structural genomics” Nucleic Acids Research, 2003, Vol. 31, No. 1 489-491

PDB Holdings List: 10-Feb-2004 (by Molecule Type)

Exp. Tech.                    Proteins, Peptides, and Viruses   Protein/Nucleic Acid Complexes   Nucleic Acids   Carbohydrates   Total
X-ray Diffraction and other   19014                             898                              719             14              20645
NMR                           2934                              96                               569             4               3603
Total                         21948                             994                              1288            18              24248

Examples:

•Center for Eukaryotic Structural Genomics

•Structural Genomics of Pathogenic Protozoa Consortium

•Berkeley Structural Genomics Center : Mycoplasma genitalium and Mycoplasma pneumoniae

Page 40: In PowerPoint format

Structural Genomics: Mycoplasma pneumoniae proteins

http://www.strgen.org/status/mpoverview.html

Page 41: In PowerPoint format

Proteomics

http://www.hupo.org

Hanash,S.(2003) “Disease Proteomics” Nature 422.226-

Aebersold,R. and M.Mann (2003) “Mass spectrometry-based proteomics” Nature 422.198-

Gavin et al. (2002) “Functional Organisation of the Yeast Proteome by systematic analysis of protein complexes” Nature 415.141-

2D PAGE gels (polyacrylamide gel electrophoresis)

MALDI

Protein Micro-arrays

Source: Hanash (2003)

Source Gavin et al.(2002)

Page 42: In PowerPoint format

The Genome

Genomes: Variation and long term evolution.

Genealogical Structures: Phylogenies, Pedigrees and the ARG

Long term Dynamics of the Human Genome: The comparative aspect

(Genotype Phenotype) & (Population Genetics/History) => Gene Mapping

Summary

Page 43: In PowerPoint format

Our Genomically Motivated Projects

1. Comparative gene annotation (Meyer, Skou Pedersen)

2. Superimposed selective constraints (Forsberg, Meyer, Skou Pedersen) *

3. Haplotype Blocks (Song) *

4. Genome transformations (Miklos)

5. Ancestral Blocks*

6. Statistical Sequence Comparison (Drummond, Lunter, Miklos)

7. Substitutions and insertion-deletions at the Genome Level (Lunter) Next week

Page 44: In PowerPoint format

a: (3,4) b: (3,4) c: (15,16) d: (16,17) e: (35,36) f: (35,36) g: (36,37)

Minimal ARGs and Haplotype Blocks (Song)

Page 45: In PowerPoint format

Combining Levels of Selection.

Forsberg, Meyer, Pedersen

Protein-Protein

Hein & Støvlbæk, 1995 Codon Nucleotide Independence Heuristic

Jensen & Pedersen, 2001

Contagious Dependence

Assume multiplicativity: f_{A,B} = f_A * f_B

Protein-RNA

Doublets Singlet

Contagious Dependence

Page 46: In PowerPoint format

A randomly picked ancestor: (ancestral material comes in batteries!)

0

0 52.000

260 Mb

06890 8360

7.5 Mb

*35

0 30kb

*250

Parameters used 4Ne 20.000 Chromos. 1: 263 Mb. 263 cM

Chromosome 1: Segments 52.000 Ancestors 6.800

All chromosomes: Ancestors 86.000

Physical Population: 1.3-5.0 Mill.

Applications to Human Genome (Wiuf and Hein,97)

Page 47: In PowerPoint format

References: Books & www-pages.

Books:

Strachan and Read (2004) “Human Molecular Genetics” (3rd Ed.) Bioscience

Jobling, Hurles and Tyler-Smith (2004) “Human Evolutionary Genetics” Bioscience

Sulston, J.(2002) “The Common Thread” Corgi Books

Ridley, Matt (2001) “Genome”

“Encyclopedia of the Human Genome” (2003) Nature Publishing Group

Cavalli-Sforza,L. (2001) “Genes, People and Language” Penguin

Key articles:

Lander et al.(2001) “Initial Sequencing and Analysis of the Human Genome” Nature

Venter et al.(2001) “The Sequence of the Human Genome” Science 291.1304-1351

Page 48: In PowerPoint format

References: www-pages.

Major sequencing centers:

Baylor College of Medicine Genome Sequencing Center hgsc.bcm.tcm.edu/

Celera www.celera.com

DoE Joint Genome Institute www.jgi.doe.gov

Genoscope www.genoscope.cns.fr

TIGR www.tigr.org

Washington University Genome Sequencing Center www.genome.wustl.edu

Wellcome Trust Sanger Institute www.sanger.ac.uk

Whitehead Institute/MIT Center for Genome Research www.-genome.wi.mit.edu

Ensembl genome annotator - www.ensembl.org

European Bioinformatics Institute - www.ebi.ac.uk

NCBI - www.ncbi.nlm.nih.gov

Nature Genome Gateway http://www.nature.com/genomics/human/

Integrated Genomics http://wit.integratedgenomics.com/GOLD/

Ebi genome databases http://www2.ebi.ac.uk/genomes/

Primate Sequencing Projects http://sayer.lab.nig.jp/~silver/index.html

European Bioinformatics Institute Proteomics http://www.ebi.ac.uk/proteome/

National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/

HapMap Project Homepage http://www.hapmap.org/

Online Mendelian Inheritance in Man http://www.ncbi.nlm.nih.gov/omim/