
Introduction to Genomics using yeasts as model organisms

Strasbourg team

Bleykasten Claudine

Despons Laurence

de Montigny Jacky

Friedrich Anne

Ivanov Samy (Sofia)

Jung Paul

Kugler Valérie

Leh Véronique

Potier Serge

Schacherer Joseph

Souciet Jean-Luc

Straub Marie-laure

Spehner Catherine

Uzunov Zlatyo (Sofia)

Génolevures teams

J.-L. Souciet, Université de Strasbourg, CNRS (Génolevures Project Coordinator)

28 rue Goethe, F-67000 Strasbourg, France

B. Dujon, Institut Pasteur, CNRS, Université Pierre et Marie Curie

UFR927, 25 rue du Docteur Roux, F-75724 Paris, France

C. Gaillardin, AgroParisTech, INRA, CNRS, F-78850 Thiverval-Grignon, France

D. Sherman, LaBRI, CNRS, 351 cours de la Libération, 33405 Talence Cedex, France

associated with the sequencing center

J. Weissenbach, Génoscope (CEA), 2 rue Gaston Crémieux, BP 191, F-91057 Evry Cedex, France

http://www.genolevures.org/

Preliminary introduction

Darwin

The 2009 bicentenary celebration is over

2010 is the Year of Biodiversity

However, we should remember Darwin’s contribution to the study of genome evolution

Darwinian evolution in the light of genomics

Koonin E.V. Nucleic Acids Research 2009, 37, 1011-1034

In brief, from Darwin:

1/ Undirected, random variation is the main driving force of evolution

2/ Evolution proceeds by fixation of the rare beneficial variations and elimination of deleterious variations = natural selection

3/ Beneficial changes that are fixed by natural selection are infinitesimally small = evolution is the accumulation of these tiny modifications

4/ The evolutionary processes have remained the same throughout the history of life

5/ Evolution can be represented by a single tree, the TOL (Tree Of Life)

(much later after Darwin: LUCA, the Last Universal Common Ancestor)

Conclusions from genome analysis (genomics sensu lato) studies, April 2009:

1/ Infinitesimal changes = point mutations

2/ All kinds of duplications (WGD, segmental, single gene) and deletions (single gene, chromosomal segment, gene erosion or relics) = gene loss

3/ Horizontal Gene Transfer (HGT) of single genes or blocks of genes

4/ Various types of genome rearrangements

5/ Diverse types of selfish genetic elements

6/ No unidirectional fate for genes (e.g. genes gained and lost several times)

7/ The relative contributions of the different evolutionary forces vary greatly from lineage to lineage

8/ Each genome is a PALIMPSEST = a diverse collection of genes with different evolutionary fates

A palimpsest is a manuscript page from a scroll or book that has been

scraped off and used again. The word "palimpsest" comes through

Latin from Greek παλιν + ψαω = ("again" + "I scrape"), and meant

"scraped (clean and used) again." Romans wrote on wax-coated

tablets that could be smoothed and reused, and a passing use of the

rather bookish term "palimpsest" by Cicero seems to refer to this

practice.

The term has come to be used in similar context in a variety of

disciplines, notably architectural archaeology….

and Genomics!

Definition extracted from Wikipedia

Genomics

Basic questions:

How do genes arise?

How do genes disappear?

How do genes arise?

First, by duplication.

By which mechanisms?

- segmental duplications

- tandem duplications

- duplications by retroposition

- duplication by aneuploidy

- duplication by polyploidy

- etc….

How do genes disappear?

- deletions

- accumulation of numerous point mutations

- translocations

The Génolevures strategy

- phylogenetically related species

- choice of species or clades (Kurtzman)

- criteria?

- complete sequencing? telomere to telomere

(link to new emerging sequencing technologies)

- just working by similarity, no functional analysis

- annotation by experts?

(high-quality proteome; most of the detectable ncRNAs are identified)

- database devoted to comparative genomic analysis

(http://www.genolevures.org/)

Fungal biodiversity

(Euascomycetes)

(Hemiascomycetes)

(Archiascomycetes)

At least 3 major groups within the Saccharomycotina (previously Hemiascomycetes):

- Saccharomyces complex or Saccharomycetaceae: 14 clades (Kurtzman 2003), point centromeres

- reassignment of the genetic code: CTG codons are translated as serine rather than leucine, e.g. Candida albicans or Debaryomyces hansenii

- GC-rich genome, e.g. Yarrowia lipolytica
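The CTG reassignment can be illustrated with a minimal translation sketch. It assumes the standard genetic code written in TCAG codon order and a variant table in which only CTG is changed to serine; the names `STANDARD`, `CTG_CLADE` and `translate` are illustrative, not part of any Génolevures tool.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, codons enumerated in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(codon): aa
            for codon, aa in zip(product(BASES, repeat=3), STANDARD_AA)}

# Variant table (assumption for illustration): identical except CTG -> serine.
CTG_CLADE = {**STANDARD, "CTG": "S"}


def translate(cds, table=STANDARD):
    """Translate a coding sequence codon by codon; trailing partial codons are ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(table[cds[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATGCTGTAA"))             # standard code
print(translate("ATGCTGTAA", CTG_CLADE))  # CTG read as serine
```

The same CDS therefore yields a different protein depending on the table, which is why the species of origin must be known before a proteome is predicted.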

WGD

Kurtzman 2003

14 clades for the Saccharomyces complex or Saccharomycetaceae (compact centromere)

- Wolfe 1997 - Kellis 2004

(P.Philippsen)

Protoploids

- goals, strains, genome sizes: eukaryotes/prokaryotes

- chromosomal mapping, genomic DNA banks, genome coverage, redundancy, overlapping fragments

- DNA sequencing (and new methods, e.g. pyrosequencing, evolving very fast = new approaches)

- chromosomal assembly

- identification of the genetic elements (ORFs/CDSs, TR, tRNA, repetitive sequences)

- definitions: orthologs, paralogs, duplication, common and specific genes, species-specific genes, synteny

- ncRNA,…

- comparisons using BLAST

- mirror effects

- small ORFs /CDSs

- intron detection (a possible strategy)

- mechanisms of genome evolution

- comparative genomics and phylogenetics

- impact of new sequencing technologies: 454, Solexa

- other developments: transcriptomics

TOL three domains of life (cf LUCA)

Eubacteria

Eukaryotes

Archaea

Tree of life

Saccharomyces cerevisiae

Caenorhabditis elegans

Drosophila melanogaster

Mus musculus

Arabidopsis thaliana

Model organisms

Goals of genomics

Systematic sequencing of each chromosome

(eukaryotes: from telomere to telomere)

(prokaryotes: often, but not always (!), circular; note the number of DNA molecules per strain)

Both DNA strands, with perfect complementarity

Towards 0% error (difficult)

Once the previous steps are finished, it becomes (theoretically) possible to detect all the genetic objects within the sequenced species

General strategy:

- step 1: from the chromosome to the DNA sequence

- step 2: identification within the DNA sequence of all the genetic objects, to produce a very precise chromosomal map

The G+C content:

Highly variable among prokaryotes

- 28.6% Borrelia burgdorferi

- 69.4% Thermus thermophilus

Less variable among eukaryotes

- 41% Mus musculus

- 38% S. cerevisiae

- 49% Yarrowia lipolytica

- 35% Arabidopsis thaliana

one example

link = genetic code, codon usage

The G+C value is important. Example: J.A. Chapman et al., Nature, March 2010

Hydra magnipapillata: 29% G+C

Associated with Hydra is a bacterium, a novel Curvibacter (the reference name will be published later), with a G+C content of up to 60%

If contigs (see later) differ markedly in their G+C content, this indicates that they probably come from different genomes
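The G+C check described above can be sketched in a few lines. This is only an illustration: the function names, the toy contigs and the 10-point tolerance are assumptions, and a real screen would use much longer contigs and a threshold tuned to the data.

```python
def gc_content(seq):
    """Return the G+C percentage of a DNA sequence, ignoring symbols other than A, C, G, T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / (gc + at)


def flag_contigs(contigs, expected, tolerance=10.0):
    """Names of contigs whose G+C % differs from `expected` by more than `tolerance` points."""
    return [name for name, seq in contigs.items()
            if abs(gc_content(seq) - expected) > tolerance]


# Toy contigs (hypothetical): one AT-rich like the host, one GC-rich like a bacterium.
contigs = {"contig_host": "ATATATATGC", "contig_other": "GCGCGCATGC"}
print({name: gc_content(seq) for name, seq in contigs.items()})
print(flag_contigs(contigs, expected=29.0))
```

A contig flagged this way is only a candidate: a G+C difference suggests a different genome but does not prove it.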

Classes of interspersed repeats in the human genome

Class                      Autonomy         Length        Copy number   Fraction of genome
LINEs                      Autonomous       6-8 kb        850,000       21%
SINEs                      Non-autonomous   100-300 bp    1,500,000     13%
Retrovirus-like elements   Autonomous       6-11 kb       450,000       8%
                           Non-autonomous   1.5-3 kb
DNA transposon fossils     Autonomous       2-3 kb        300,000       3%
                           Non-autonomous   80-3,000 bp

(figure labels: AAA = poly(A) tail; A and B boxes; gag, pol, (env); (gag); transposase)

Retrovirus: LTR, gag, pol, env, LTR (7 kb)

Ty1/copia retrotransposon

LINE (Long Interspersed Nuclear Elements): gag?, pol, poly(A) (6 kb)

SINE (Short Interspersed Nuclear Elements): 0.3 kb

JA Chapman et al. Nature 000, 1-5 (2010) doi:10.1038/nature08830

Dynamics of transposable element expansion in

Hydra reveals several periods of transposon activity.

Sequencing strategy :

(using only Sanger technology, up to 2007)

Genomic DNA bank 1: 3 to 5 kb DNA fragments; high-copy-number E. coli vector

Genomic DNA bank 2: 5 to 7/10 kb DNA fragments; low-copy-number E. coli vector

Genomic DNA bank:

- cosmids (35-40 kb)

- BACs (70-150 kb)

Sequencing of both ends: required for contig assembly

Checking the supercontig assembly:

- by fingerprint

- by assignment of marker genes to chromosomes

(hybridization or chromosome painting, depending on the species)

(However, this depends on the genome size and on the % of repetitive elements)

Genomics using yeasts as model organisms:

- goals, strains, genome sizes: eukaryotes/prokaryotes

- chromosomal mapping, genomic DNA banks, genome coverage, redundancy, overlapping fragments

- DNA sequencing (and new methods, e.g. pyrosequencing)

- chromosomal assembly

- identification of the genetic elements (ORFs/CDSs, TR, tRNA, repetitive sequences)

- definitions: orthologs, paralogs, duplication, common and specific genes, speciation genes?, synteny

- comparison with BLAST

- mirror effects

- small ORFs /CDSs

- intron detection

- mechanisms of genome evolution

- comparative genomics and phylogenetics

- impact of new sequencing technologies: 454

- other developments: transcriptomics

Number of amino acids in the enzymes of the glycolytic pathway in S. cerevisiae

Glycolysis

Gene    Enzyme                                           Length
GLK1    aldohexose specific glucokinase                  500 aa
HXK2    hexokinase II                                    486 aa
HXK1    hexokinase I                                     485 aa
PGI1    glucose-6-phosphate isomerase                    554 aa
FBA1    fructose-bisphosphate aldolase                   359 aa
GPD1    glyceraldehyde-3-phosphate dehydrogenase 1       332 aa
GPD2    glyceraldehyde-3-phosphate dehydrogenase 2       332 aa
GPD3    glyceraldehyde-3-phosphate dehydrogenase 3       332 aa
PGK1    phosphoglycerate kinase                          416 aa
GPM1    phosphoglycerate mutase                          247 aa
GPM3    phosphoglycerate mutase                          303 aa
GPM2    phosphoglycerate mutase                          311 aa
ENO2    enolase II (2-phosphoglycerate dehydratase)      437 aa
ENO1    enolase I (2-phosphoglycerate dehydratase)       437 aa
PYK2    pyruvate kinase                                  506 aa
PYK1    pyruvate kinase                                  500 aa

Alcoholic fermentation

Gene    Enzyme                                           Length
PDC1    pyruvate decarboxylase, isozyme 1                563 aa
PDC5    pyruvate decarboxylase, isozyme 2                563 aa
PDC6    pyruvate decarboxylase, isozyme 3                563 aa

Part of a chromosome area:

CDS or gene 1 | intergenic area | CDS or gene 2 | intergenic area | CDS or gene 3

The cover shows the devastating results of potato blight (Phytophthora infestans) infestation, the pathogen that triggered the Irish potato famine in the nineteenth century. The genome of this still dangerous pathogen has now been sequenced, revealing fast-evolving effector genes that may contribute to the rapid adaptability to host plants that has made potato blight so difficult to control. Nature 461, pp. 315-438, September 2009

75% repeated sequences, mainly RT elements

- goals, strains, genome sizes: eukaryotes/prokaryotes

Standardisation at the haploid level

The C value represents the haploid genome size of an organism

(C for « constant » or « characteristic »)

(1000 bp = 1 kb, 1000 kb = 1 Mb, 1000 Mb = 1 Gb)

Completely sequenced genomes of some Archaea

Completely sequenced genomes of some Eubacteria

Completely sequenced genomes of some Eukaryotes

Conclusion:

In prokaryotes, genome size is relatively small and correlates with the number of protein-coding genes

In eukaryotes, there is no correlation between genome size and the number of protein-coding genes

- species

- strains and polymorphism (HETEROZYGOTE!)

- genetic analysis = deletions, duplications, translocations, … in addition to basic point mutations

- consequence = the gene number differs from strain to strain (very often "gene number" refers only to the protein-coding genes, but this is wrong: there are also tDNA, rDNA, snRNA genes, …)

- "the gene number of Saccharomyces cerevisiae is 5813" is a wrong expression

- "the gene number of Saccharomyces cerevisiae strain S288C is 5813" is a good expression, + the date, because annotation is a never-ending process

The same holds for Homo sapiens sapiens and all other living organisms

Genomic DNA banks (10 X coverage)

- redundancy (including overlaps at fragment boundaries)

- representativeness

- richness (99% of the clones carry the expected DNA inserts)

The BAC cloning scheme

Average Fragment Sizes of Mammalian DNA Produced by

Cleavage with Rare Cutting Restriction Enzymes

Probability of Having One or More Clones/Locus within a Library as a Function of Library Size

Library size (genome coverage)    P (%)
0.5                               39.3
1                                 63.2
2                                 86.4
3                                 95.0
4                                 98.2
5                                 99.3
6                                 99.75
7                                 99.91
8                                 99.97
9                                 99.99
10                                99.995

The probability of finding x clones from a library of c coverage is calculated as

$P(x) = \frac{c^{x}}{x!}\, e^{-c}$

The probabilities shown are those of finding one or more clones (1 - P(0)) for libraries with genome coverage from 0.5 to 10.
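The Poisson formula above is easy to evaluate directly. The sketch below recomputes 1 - P(0) for a given coverage c and, as an added convenience that is not in the slides, inverts it to find the coverage needed for a target probability; the function names are illustrative.

```python
import math


def p_clones(x, c):
    """Poisson probability of finding exactly x clones for a locus at coverage c."""
    return c ** x * math.exp(-c) / math.factorial(x)


def p_at_least_one(c):
    """Probability that a locus is represented by one or more clones: 1 - P(0)."""
    return 1.0 - p_clones(0, c)


def coverage_needed(target):
    """Coverage c such that 1 - exp(-c) equals `target` (0 < target < 1)."""
    return -math.log(1.0 - target)


for c in (0.5, 1, 3, 5, 10):
    print(f"{c:>4}x  {100 * p_at_least_one(c):.3f} %")
print(f"coverage for 99 %: {coverage_needed(0.99):.2f}x")
```

The steep rise between 1x and 5x explains why libraries are typically built at several-fold coverage rather than at 1x.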

Number of Clones Required for 7.5x Genomic Coverage

Haploid genome size (Mb)   Average insert size (kb)   No. of clones   No. of 384-well plates
20                         50                         3,000           8
100                        100                        7,500           20
3000                       50                         450,000         1172
3000                       150                        150,000         391
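The clone numbers follow from a simple relation: coverage × genome size / insert size, rounded up, with the plate count obtained by dividing by 384 wells. The sketch below is a minimal illustration of that arithmetic; the function names are hypothetical.

```python
import math


def clones_required(genome_mb, insert_kb, coverage=7.5):
    """Number of clones needed to reach `coverage` for a haploid genome of `genome_mb` Mb."""
    return math.ceil(coverage * genome_mb * 1000 / insert_kb)


def plates_required(n_clones, wells=384):
    """Number of multi-well plates needed to array `n_clones` clones."""
    return math.ceil(n_clones / wells)


for genome_mb, insert_kb in [(20, 50), (3000, 50), (3000, 150)]:
    n = clones_required(genome_mb, insert_kb)
    print(genome_mb, "Mb,", insert_kb, "kb inserts:", n, "clones,", plates_required(n), "plates")
```

Tripling the insert size divides the number of clones by three, which is the practical argument for BAC libraries on large genomes.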

Library size

The importance of the physical mapping of large DNA fragments: duplicated DNA sequences and possible assembly mistakes

- contigs

- ordering large DNA fragments, or chromosome walking

- correspondence between the contig map and the physical map

Ordering clones by chromosome walking

(figure labels: X, A, B; chromosome V; YEP06, FCY21, FCY22)

Contig map check by fingerprint: an example with NotI digestion

HindIII is also often used (target size and resolution)

Note: for large DNA fragments (more than 10 kb), sizes are determined by PFGE

Digestion of BACs with NotI

HindIII fingerprints of human BACs and PACs
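A fingerprint can be simulated in silico by cutting a sequence at every recognition site and comparing the resulting fragment sizes between clones. The sketch below assumes a linear molecule (a real BAC is circular) and uses the usual cut positions GC^GGCCGC for NotI and A^AGCTT for HindIII; the function names and the toy sequence are illustrative.

```python
# (recognition site, offset of the cut within the site on the top strand)
NOTI = ("GCGGCCGC", 2)
HINDIII = ("AAGCTT", 1)


def digest(seq, enzyme):
    """Fragment lengths obtained by cutting a linear sequence at every site of `enzyme`."""
    site, cut_offset = enzyme
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    edges = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(edges, edges[1:])]


def shared_bands(frags_a, frags_b, tolerance=0):
    """Count fragments of A that match a distinct fragment of B within `tolerance` bp."""
    remaining = list(frags_b)
    shared = 0
    for size in frags_a:
        match = next((s for s in remaining if abs(s - size) <= tolerance), None)
        if match is not None:
            remaining.remove(match)
            shared += 1
    return shared


toy = "CCCCAAGCTTGGGGAAGCTTCC"   # two HindIII sites in a 22 bp toy sequence
print(digest(toy, HINDIII))
print(shared_bands([5, 10, 7], [10, 7, 3]))
```

Both sites are palindromic, so scanning one strand is enough. Two clones that share many bands are candidates for overlap, subject to the sizing resolution of the gel (the `tolerance` parameter).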

Raw sequences:

5’ATGTCTGCTGCTGCTGATAGATAACTTAACTTCCGGCCAC
5’AACTTAACTTCCGGCCACTTGAATGCTGGT
5’CTGCTGATAGATAACTTAACTTCCGGCCACTTGAAT
5’CCGGCCACTTGAATGCTGGTAGA
5’CTGATAGATAACTTAACTTCCGGCCACTTGAATGCTGG
5’AACTTAACTTCCGGCCACTT
5’CTGATAGATAACTTAACTTCCGGCCACTT
5’GATAACTTAACTTCCGGCCACTTGAATGCT
5’ATGTCTGCTGCTGCTGATAGATAACTT
5’GCCACTTGAATGCTGGTAGA
5’ACTTAACTTCCGGCCACTTGAATGCTGG
5’TGCTGCTGATAGATAACTTAACTTCCGG

After assembly…

ATGTCTGCTGCTGCTGATAGATAACTTAACTTCCGGCCAC
                      AACTTAACTTCCGGCCACTTGAATGCTGGT
          CTGCTGATAGATAACTTAACTTCCGGCCACTTGAAT
                                CCGGCCACTTGAATGCTGGTAGA
             CTGATAGATAACTTAACTTCCGGCCACTTGAATGCTGG
                      AACTTAACTTCCGGCCACTT
             CTGATAGATAACTTAACTTCCGGCCACTT
                   GATAACTTAACTTCCGGCCACTTGAATGCT
ATGTCTGCTGCTGCTGATAGATAACTT
                                   GCCACTTGAATGCTGGTAGA
                       ACTTAACTTCCGGCCACTTGAATGCTGG
        TGCTGCTGATAGATAACTTAACTTCCGG

consensus sequence

5’ATGTCTGCTGCTGCTGATAGATAACTTAACTTCCGGCCACTTGAATGCTGGTAGA
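The step from raw reads to a consensus can be sketched with a toy greedy overlap assembler. This is not the method used by production assemblers: it assumes error-free reads from a single strand, and the minimum overlap of 10 bases is an assumption chosen to be longer than the short repeats (such as CTGCTGCTG) present in this example.

```python
READS = [
    "ATGTCTGCTGCTGCTGATAGATAACTTAACTTCCGGCCAC",
    "AACTTAACTTCCGGCCACTTGAATGCTGGT",
    "CTGCTGATAGATAACTTAACTTCCGGCCACTTGAAT",
    "CCGGCCACTTGAATGCTGGTAGA",
    "CTGATAGATAACTTAACTTCCGGCCACTTGAATGCTGG",
    "AACTTAACTTCCGGCCACTT",
    "CTGATAGATAACTTAACTTCCGGCCACTT",
    "GATAACTTAACTTCCGGCCACTTGAATGCT",
    "ATGTCTGCTGCTGCTGATAGATAACTT",
    "GCCACTTGAATGCTGGTAGA",
    "ACTTAACTTCCGGCCACTTGAATGCTGG",
    "TGCTGCTGATAGATAACTTAACTTCCGG",
]
CONSENSUS = "ATGTCTGCTGCTGCTGATAGATAACTTAACTTCCGGCCACTTGAATGCTGGTAGA"


def overlap(a, b, min_len):
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if below min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_len=10):
    """Repeatedly merge the pair of sequences with the longest suffix/prefix overlap."""
    unique = sorted(set(reads))
    # Reads fully contained in another read add no new sequence and are dropped.
    contigs = [r for r in unique if not any(r != o and r in o for o in unique)]
    while len(contigs) > 1:
        length, a, b = max(
            ((overlap(a, b, min_len), a, b)
             for a in contigs for b in contigs if a != b),
            key=lambda item: item[0],
        )
        if length == 0:
            break  # no remaining overlap: the contigs stay separate
        contigs.remove(a)
        contigs.remove(b)
        contigs.append(a + b[length:])
    return contigs


contigs = greedy_assemble(READS)
print(contigs)
print(contigs == [CONSENSUS])
```

With a shorter minimum overlap, the repeated motifs could create false joins; this is the toy-scale version of the assembly mistakes caused by duplicated sequences, and the reason assemblies are checked against a physical map.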

C-J Rubin et al. Nature 000, 1-7 (2010) doi:10.1038/nature08832

Chicken lines resequenced.

Resequencing
