Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world E. V. Koonin and Y. I. Wolf Nucleic Acids Research 36:6688-6719 October 2008

Extent of Prokaryotic Diversity

• Only about 0.1% of bacteria can be cultured in the laboratory!

• Currently about 1200 sequenced prokaryotic genomes

• Large scale metagenomic surveys have not revealed abundant bacteria outside of already known phyla

– Metagenomics = sequencing DNA found in the environment without growing or purifying the organisms.

– Biggest survey: Craig Venter seawater survey.

– Only about 10% of metagenomic sequences have no discernable homologs.

– Possibly many new species exist in some unusual habitats?

Genome Size

• Current smallest: Carsonella ruddii = 180 kbp

• Current largest: Sorangium cellulosum = 13 Mbp

• Genomes less than 1 Mbp are all parasites or intracellular symbionts, which don’t need to make all compounds from scratch.

– 1 Mbp seems to be about the minimum size for a free-living bacterium

• The largest viruses (mimiviruses) are 1 Mbp or so; such viruses are common in marine habitats.

• The smallest eukaryotic genome (that of the obligate intracellular parasite Encephalitozoon intestinalis) is 2.3 Mbp

Gene Density

• Roughly 1 gene per 1000 bp in both bacteria and archaea

• Intergenic spaces are either almost 0 bp (within operons) or average about 100 bp.

• Longer intergenic spaces probably contain RNA-only genes or pseudogenes

• Nearly all prokaryotic genes are a single open reading frame, with very few introns or split genes.

• Gene overlaps are no more than a few base pairs: no documented cases of long overlaps.
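
As a rough illustration of the density figure on this slide, the sketch below estimates a gene count from genome length, assuming about 1 gene per 1000 bp. The function name and its default parameter are hypothetical, and real genomes deviate from this rule of thumb.

```python
def estimate_gene_count(genome_bp: int, bp_per_gene: int = 1000) -> int:
    """Rule-of-thumb gene count: genome length divided by average bp per gene."""
    if bp_per_gene <= 0:
        raise ValueError("bp_per_gene must be positive")
    return round(genome_bp / bp_per_gene)


# Genome sizes quoted on the Genome Size slide.
small = estimate_gene_count(180_000)       # smallest genome mentioned, 180 kbp
large = estimate_gene_count(13_000_000)    # largest genome mentioned, 13 Mbp
print(small, large)
```

The estimate for the smallest genome lands close to the roughly 170 genes quoted later in these slides, which is about as much agreement as a rule of thumb can promise.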

Clusters of Orthologous Groups of Genes

• “orthologs” are genes that descend from the same gene in an ancestral species.

– Need to be a bit looser in prokaryotes, where horizontal gene transfer is common

– Often defined by “bidirectional best hits” (BBH): two genes (in different genomes) are each other’s best BLAST hit in those genomes.

– Problem of gene duplication: paralogues. Paralogues are also derived from a common ancestral gene, but by duplication within a genome, and have often evolved different functions.

• COGs are based on identifying orthologous genes, even if there is more than one in a given genome.

• New derivative of COGs: EggNOGs (yuck). The database includes genes from 312 bacterial species and 26 archaea.
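
The bidirectional best hit rule can be sketched in a few lines. This is a minimal illustration, not the COG or EggNOG pipeline: it assumes the similarity scores have already been computed (for example as BLAST bit scores) and stored in two hypothetical dictionaries, one per search direction. All gene names and scores are made up.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: genome A searched against genome B, and the reverse.
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
          ("a2", "b2"): 180.0, ("a3", "b2"): 120.0}
b_vs_a = {("b1", "a1"): 245.0, ("b2", "a2"): 175.0, ("b2", "a3"): 118.0}

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here a3 is left unpaired: its best hit is b2, but b2 prefers a2. That is the paralogue situation the slide warns about, and it is why COGs allow more than one member per genome.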

COG results

• How widespread are different orthologous gene families?

• In most sequenced genomes, about 80% of genes can be assigned to a COG.

– The rest of the genes have no detectable homology with any other protein; they are often called “ORFans”

• There are very few COGs found in most or all organisms (“core” genes: about 70 gene clusters)

• A larger, but still small number of COGs is moderately conserved, found in many genomes (“shell” genes: 5700 gene clusters)

• The large majority of COGs are found in only a few genomes (“cloud” genes: 24,000 clusters)

Out of the 338 genomes in EggNOG, how many are missing in each COG? Same data in both plots, but the bottom one is semi-log.
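
The core / shell / cloud split can be sketched as a classification on the fraction of genomes that contain each COG. The cutoffs below (90% and 10%) are hypothetical choices for illustration; the slide does not say which thresholds were used. The COG names and counts are also made up.

```python
from collections import Counter


def classify_cog(n_present: int, n_genomes: int,
                 core_min: float = 0.9, cloud_max: float = 0.1) -> str:
    """Label a COG by the fraction of genomes in which it is present."""
    frac = n_present / n_genomes
    if frac >= core_min:
        return "core"
    if frac <= cloud_max:
        return "cloud"
    return "shell"


# Hypothetical counts: number of genomes (out of 338) that contain each COG.
cog_counts = {"COG_a": 335, "COG_b": 120, "COG_c": 3, "COG_d": 1}
classes = {cog: classify_cog(n, 338) for cog, n in cog_counts.items()}
summary = Counter(classes.values())
print(classes)
print(summary)
```

A COG present in exactly one genome, like COG_d here, sits at the far end of the cloud, which is where the ORFans discussed on a later slide would fall.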

Percentage of Genes in EggNOG COGs

ORFans vs. ELFs

• A gene with no detectable homology to any other protein in another species is an ORFan

• What are ORFans?

– Some are ELFs = Evil Little Fellows: falsely predicted genes; hypothetical genes that aren’t real. (BTW--I don’t think ELF is going to make it into standard genomics jargon, but ORFan might).

– Some are real genes derived from bacteriophages.

• Metagenomic studies suggest that the world of bacteriophage genomes is vast and very under-explored. This is a very important concept that we will explore later.

• In genome annotation, it is common to find prophages, which look like regions of the genome with many hypothetical genes mixed with a few genes labeled “phage protein” or “integrase/recombinase”

– Some are just the tail end of the distribution of the “cloud” genes that are found in only a few genomes (in this case, just 1 genome).

• How big is “gene space”, the totality of all genes? Could be several orders of magnitude larger than we know now, and almost all of it will be genes found in only 1 or a few genomes, or perhaps only in bacteriophage.

COGs in Phylogenetic Groups

• The presence or absence of members of each of the 30,000+ COG groups in all the 338 EggNOG genomes can be used for cluster analysis. (Self-organizing map, here).

– On the SOM, genomes close to each other share more COGs than genomes far apart on the map.

– There is quite a good correlation between COG presence/absence and known phylogenetic groups (based on 16S rRNA): different members of the same phylum group together.

– with a few exceptions: gamma-proteobacteria are split, possibly due to a diversity of life styles.
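
The idea that nearby genomes share more COGs can be made concrete with a pairwise distance on presence/absence data. The sketch below uses the Jaccard distance between COG sets, a deliberately simpler stand-in for the self-organizing map on the slide, not a reimplementation of it. Genome names and COG memberships are hypothetical.

```python
from itertools import combinations


def jaccard_distance(cogs_a: set, cogs_b: set) -> float:
    """1 minus (shared COGs / all COGs seen in either genome)."""
    union = cogs_a | cogs_b
    if not union:
        return 0.0
    return 1.0 - len(cogs_a & cogs_b) / len(union)


# Hypothetical presence sets: which COGs occur in each genome.
genomes = {
    "genome_1": {"COG1", "COG2", "COG3", "COG4"},
    "genome_2": {"COG1", "COG2", "COG3", "COG5"},
    "genome_3": {"COG1", "COG6", "COG7", "COG8"},
}

distances = {
    (a, b): jaccard_distance(genomes[a], genomes[b])
    for a, b in combinations(sorted(genomes), 2)
}
for pair, dist in distances.items():
    print(pair, round(dist, 3))
```

A distance matrix of this kind could be fed to any clustering method; genome_1 and genome_2 come out close because they share three of their five distinct COGs.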

COGs vs. Gene Function

• Genome annotation is based on the principle that if someone experimentally determines a gene’s function, then all other genes with similar protein sequences are assumed to perform the same function.

– Annotation also uses information about the gene’s chromosomal neighborhood: genes that are part of the same subsystem are often found grouped together.

– We are not likely to be able to predict a protein’s function directly from its amino acid sequence anytime soon.

• Non-orthologous gene displacement is common. When two organisms are compared, the same gene function is often performed by two entirely different, non-homologous proteins.

– This happens even in very fundamental processes like DNA replication: the primary enzymes for replication are entirely different between the bacteria and the archaea.

• Because of non-orthologous gene displacement, the “gene sequence homology space” in the previous slide is not identical to a SOM of “gene function space”

– SOM set up the same way: vector of presence/absence of different gene functions (functional roles) in different species

– Here, the phyla are less well grouped. Perhaps because even closely related bacteria often have very different lifestyles.

Genome Architecture

• Most prokaryotes have a single DNA origin of replication (ori), which is used to define base 1 in a genomic sequence, as well as the orientation of the sequence.

• DNA polymerase starts replication at ori and goes in both directions, which defines a “leading strand” (the right half of the genome) and a “lagging strand” (left half).

– These can also be called the right and left replichores.

• The two halves often have noticeably different base compositions (GC content, etc.).

• Most genes, especially highly transcribed genes, are oriented in the same direction as replication.
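
One common way to look at the compositional difference between the two replichores is GC skew, (G - C) / (G + C), computed in windows along the sequence. The slide only says that base composition differs, so the choice of GC skew here is an assumption. The toy sequence is artificial and built so that the two halves have opposite skew.

```python
def gc_skew(seq: str, window: int) -> list:
    """GC skew, (G - C) / (G + C), for consecutive non-overlapping windows."""
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window].upper()
        g = chunk.count("G")
        c = chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


# Artificial sequence: the first half is G-rich, the second half is C-rich.
toy_genome = "GGGCAT" * 10 + "CCCGAT" * 10
skews = gc_skew(toy_genome, window=30)
print(skews)
```

In a real genome the sign change in windowed skew is often used as a hint to where the origin and terminus lie, but the signal is noisy and should be treated as suggestive rather than definitive.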

Dotplots to Compare Genome Structure

• Compare positions of orthologous genes between 2 genomes, then plot positions.

– A. Closely related genomes are mostly collinear, or syntenic (this is two Geobacillus species)

• syntenic means that neighboring genes in one species are also neighbors in another species

– B. Moderately related bacteria show an X-shaped pattern due to multiple inversions across the origin (which preserves the direction of transcription). Shewanella

– C. X-pattern in 2 Archaea: Pyrococcus

– D. Distantly related species show a random distribution of orthologs: genome is well-scrambled.

• In general, only closely related species show any common genome architecture. The overall arrangement of genes on the chromosome is not well preserved.
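
The dotplot construction can be sketched as follows: each ortholog pair becomes one point whose coordinates are the gene's position in each genome. The second function is a hypothetical summary statistic, not something from the slides: the fraction of neighbors in genome A whose orthologs are also neighbors in genome B. The sketch assumes gene-order indices rather than base-pair coordinates, assumes every gene has an ortholog, and ignores the circular wrap-around.

```python
def dotplot_points(orthologs, order_a, order_b):
    """One (index in A, index in B) point per ortholog pair, sorted along A."""
    return sorted((order_a[a], order_b[b]) for a, b in orthologs)


def conserved_adjacency(points):
    """Fraction of consecutive points along A that are also adjacent in B."""
    if len(points) < 2:
        return 0.0
    kept = sum(1 for (_, y1), (_, y2) in zip(points, points[1:])
               if abs(y2 - y1) == 1)
    return kept / (len(points) - 1)


# Hypothetical genomes with four genes each; gene a_i is orthologous to b_i.
orthologs = [(f"a{i}", f"b{i}") for i in range(4)]
order_a = {f"a{i}": i for i in range(4)}
collinear_b = {"b0": 0, "b1": 1, "b2": 2, "b3": 3}
scrambled_b = {"b0": 2, "b1": 0, "b2": 3, "b3": 1}

collinear = dotplot_points(orthologs, order_a, collinear_b)
scrambled = dotplot_points(orthologs, order_a, scrambled_b)
print(conserved_adjacency(collinear), conserved_adjacency(scrambled))
```

Because the adjacency test uses the absolute difference, an inverted block still counts as conserved, which matches the slide's point that inversions across the origin keep neighborhoods intact while changing the overall picture.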

Bacillus Dotplot: B. megaterium vs. B. cereus--organization is conserved in the vicinity of the replication origin, but not in other regions.

Operons

• The classic operon is the E. coli lac operon.

– Jacob and Monod, 1961

– Three genes involved in lactose utilization are transcribed onto a single messenger RNA

– Transcription is under the control of a single transcription factor, the lac repressor.

– When the lac repressor detects lactose, it allows the operon to be transcribed.

• Most prokaryotes have numerous operons of many types

Operons Across Species

• Operon structure is conserved much better than overall chromosomal synteny

• especially for genes whose proteins physically interact, such as the ribosomal proteins.

– Interpretable as selection for having a balanced number of all subunits.

• The 50+ ribosomal proteins are found grouped in different patterns across all prokaryotes. The ribosomal “superoperon”

– other groups of partially conserved operons also exist, giving the general concept of the conserved gene neighborhood: even when they are not part of the same operon, genes involved in the same subsystem tend to stay near each other.

• However, most operons are not part of superoperons, but rather just 2-4 genes that are oriented in the same direction and are co-transcribed and co-regulated.

• Conservation is moderate: operon membership tends to change over phylogenetic distance.

– However, most groups of adjacent genes in the same orientation are actually co-regulated as operons.

• The percentage of genes in operons varies: very high in Thermotoga, very low in Cyanobacteria.
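
The observation that adjacent genes in the same orientation are usually co-transcribed suggests a very simple operon predictor: walk along the chromosome and group neighboring genes that share a strand and are separated by a short gap. This is a naive sketch, not a published method. The 50 bp cutoff is a hypothetical value chosen between the "almost 0 bp" and "about 100 bp" intergenic distances quoted on the Gene Density slide, and the gene names and coordinates are made up.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def predict_operons(genes: List[Gene], max_gap: int = 50) -> List[List[str]]:
    """Group adjacent same-strand genes separated by at most max_gap bp."""
    groups: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        prev = groups[-1][-1] if groups else None
        if (prev is not None and gene.strand == prev.strand
                and gene.start - prev.end <= max_gap):
            groups[-1].append(gene)
        else:
            groups.append([gene])
    return [[g.name for g in group] for group in groups]


# Hypothetical annotation: three tightly packed "+" genes, then two "-" genes.
genes = [
    Gene("geneA", 100, 1000, "+"),
    Gene("geneB", 1005, 2000, "+"),
    Gene("geneC", 2010, 2600, "+"),
    Gene("geneD", 2620, 3500, "-"),
    Gene("geneE", 3700, 4500, "-"),
]
operons = predict_operons(genes)
print(operons)
```

geneD starts a new group because the strand changes, and geneE is split off because its 200 bp gap exceeds the cutoff. Single-gene groups are simply genes not predicted to be in an operon.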

Gene Regulation and Signal Transduction

• Lac operon model: a single protein senses something in the environment (lactose) and directly alters transcription.

– Some variants: genes transcribed from a common regulatory region in opposite directions (a divergon), and genes in multiple locations affected by the same regulatory protein (regulon)

• The transcription factors (DNA binding proteins that affect transcription) are well conserved, but which genes are affected varies widely.

– Transcription factors generally consist of a ligand-binding domain (e.g. the part that binds to lactose) and a DNA binding domain.

• Two component histidine kinase systems:

– one protein is a membrane-bound histidine kinase that senses something in the extracellular environment.

– The histidine kinase phosphorylates another protein, the response regulator, which is soluble and binds to the DNA to affect transcription.

• Many other systems, often originally found in eukaryotes: cyclic AMP, cyclic di-GMP, programmed cell death systems, and more.

Genome Size

• Minimal number of genes:

– for growth on rich medium, where there are very few biosynthetic requirements: maybe 250 genes.

• Carsonella ruddii, an obligate intracellular parasite, has only 170 genes. It even lacks some aminoacyl tRNA synthetases, and probably uses host enzymes for this function. Perhaps it is being converted into an organelle? (that’s just speculation, however)

– for a free living heterotroph, maybe 1000 genes are needed

• Pelagibacter ubique has about 1100 genes

– given the presence of non-orthologous gene displacement and different lifestyles, there are undoubtedly many more-or-less minimal genomes that survive.

Gene Class vs. Genome Size

• Some genes are found in about the same numbers in all genomes: translation machinery, cell division machinery.

• Other genes are proportional to genome size: metabolic genes, transporters, DNA replication and repair

• Other genes are proportional to the square of genome size: regulatory proteins.

– Small genomes have very few regulatory proteins, while large genomes have lots. The fraction of regulatory genes increases as the total number of genes increases.

• Note the exponent on the equations in the figure.

• Leads to a proposed maximum genome size of about 20,000 genes: where each non-regulatory gene has its own regulatory gene
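
The scaling argument can be sketched numerically. The exponent for a gene class is the slope of log(class count) against log(total genes), and if regulators grow as c·N², they reach half of all genes (one regulator per non-regulatory gene) when c·N² = N/2, that is, at N = 1/(2c). The data below are synthetic, and the coefficient c = 2.5e-5 is a hypothetical value picked only so that the crossover lands at the slide's figure of about 20,000 genes; it is not taken from the paper.

```python
import math


def fit_exponent(total_genes, class_counts):
    """Least-squares slope of log(class count) against log(total gene count)."""
    xs = [math.log(n) for n in total_genes]
    ys = [math.log(c) for c in class_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def crossover_size(c: float) -> float:
    """Total gene count N at which regulators (c * N**2) are half of all genes."""
    return 1.0 / (2.0 * c)


# Synthetic, illustrative data only (not read off the paper's figure).
total = [1000, 2000, 4000, 8000]
metabolic = [0.2 * n for n in total]            # linear class, exponent 1
regulatory = [2.5e-5 * n ** 2 for n in total]   # quadratic class, exponent 2

print(fit_exponent(total, metabolic))
print(fit_exponent(total, regulatory))
print(crossover_size(2.5e-5))
```

On real data the fitted exponents would carry error bars, and the crossover is best read as an order-of-magnitude bound rather than a precise limit.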

Horizontal Gene Transfer

• Defined as DNA transfer across species lines

– As opposed to vertical gene transfer: genes transmitted from parent to offspring through chromosome replication and cell division.

– Once considered unusual or controversial, it is now obvious that HGT is a frequent event with major effects on all prokaryotic genomes.

– HGT has made the definition of “species” difficult in prokaryotes.

• Pathogenicity islands: regions of up to 100 kbp, often near tRNA genes and often containing multiple prophage insertions. They contain genes needed for pathogenic behavior, such as toxins and type II secretion systems.

• The classical three sexual processes in prokaryotes:

– Conjugation: direct transfer of DNA between two cells. Certain plasmids have genes that cause conjugation.

– Transduction: transfer of DNA through a bacteriophage intermediate

• Gene transfer agents (GTAs) are defective bacteriophages that package and transfer random pieces of the bacterial genome, without killing the host cells.

– Transformation: uptake of naked DNA from the environment.

More HGT

• In the absence of a direct genome comparison, horizontal gene transfer can be detected by differences in DNA composition: GC content, codon usage, oligonucleotide frequency, etc.

– However, acquired genes undergo a process of “amelioration”, where selectively neutral mutations shift the DNA composition to match the host’s DNA.

• Organisms that share a common environment often transfer genes, even across the bacteria-archaea divide.

– Hyperthermophilic bacteria have up to 20% of their genes with better matches in the archaea than in other bacteria.

– Similarly, mesophilic archaea share more genes with mesophilic bacteria

Top = bacteria, bottom = archaea. In both cases, a mesophile is on the left and a hyperthermophile is on the right.
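
Composition-based detection of transferred genes can be sketched with the simplest of the listed signals, GC content: flag any gene whose GC fraction differs from the genome-wide value by more than some cutoff. The 0.10 cutoff, the genome-wide GC of 0.65, and the two short sequences are all hypothetical. Real analyses use much longer sequences and usually combine several signals such as codon usage.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def flag_atypical_genes(genes: dict, genome_gc: float,
                        max_diff: float = 0.10) -> list:
    """Names of genes whose GC content deviates from the genome average."""
    return [name for name, seq in genes.items()
            if abs(gc_content(seq) - genome_gc) > max_diff]


# Hypothetical gene sequences in a genome assumed to be 65% GC overall.
genes = {
    "geneA": "GCGCGCATGCAT",   # 8 of 12 bases are G or C, close to the genome
    "geneB": "ATATATGCATAT",   # 2 of 12 bases are G or C, far from the genome
}
candidates = flag_atypical_genes(genes, genome_gc=0.65)
print(candidates)
```

Because of amelioration, a check like this only catches relatively recent acquisitions; older transfers will have drifted toward the host composition and pass unflagged.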

HGT and Gene Loss

• Genes are gained by horizontal gene transfer as well as by internal processes like duplication, and genes are also lost. The rates of gain and loss must be balanced to keep the genome reasonably constant in size.

• Probably all COG groups have had at least one horizontal transfer.

– But still, most genes are transferred vertically most of the time.

– Several studies have shown that most genes within a group of organisms have a common phylogeny that matches the expectation of vertical descent.

• Are genes involved in replication, transcription, and translation less prone to HGT? Based on the idea that these genes interact so intimately that they can’t be easily replaced. However, many cases of HGT in these genes have been seen, and there probably isn’t a big difference in rates.

– The problem is, much HGT in informational genes is not easily detected because the COG families for these genes are “core”: nearly all species use slight variations on the same gene. Genes involved in the same metabolic function tend to fall into several different, non-homologous COGs.

Selfish Operons

• An operon can be thought of as a group of genes that act together to perform a single metabolic function.

– Operons often even come with their own regulatory protein.

– An operon thus provides a phenotype that natural selection can act on.

• You can think of operons as selfish: travelling between genomes, conferring a useful trait that increases the number of copies of that operon in the world.

• An example: membrane-bound ATP synthases, the primary way most species generate energy. There is an archaeal version and a bacterial version. Both are encoded by a single operon, which has been transferred many times across the domain boundary.

The Prokaryotic Mobilome

• “-ome” means the set of all things with this function. I will admit to feeling that this suffix is over-used these days.

• The mobilome is the set of bacteriophages, plasmids, transposable elements, and associated genes that frequently travel between the genomes of cellular life.

• All sequenced genomes show signs of multiple integrated phages and plasmids

• Bacteriophages are everywhere: it has been estimated that there are 10 times as many phage particles as cells in sea water.

• Plasmids are replicons independent of the chromosome.

– Usually circular, but sometimes linear.

– Usually not necessary for life, but some are very integrated into the life of the cell.

– Some integrate into the chromosome.

• Plasmid addiction: seen in restriction/modification systems and toxin/antitoxin systems

– Both the toxin and the antitoxin are encoded on the plasmid, but the toxin is long-lived and the antitoxin is short-lived. If the plasmid is lost, the cell dies because the remaining toxin outlasts its antidote.

– This is a selfish process: the plasmid benefits but the cell is forced to keep replicating a useless plasmid.

The Principal Processes of Prokaryotic Evolution

• Three basic processes:

– 1. vertical transfer from parent to child

– 2. horizontal transfer between species

– 3. mobilome genes that are occasionally recruited to perform useful functions for the cells.

• Another important process: gene loss

– Sometimes under strong selection, as in the development of parasitism, where many genes are no longer needed.

– More generally, there is a weak selection pressure to remove unneeded genes.

• Also important: recombination within a single genome can generate duplications and other rearrangements that cause genes to evolve.

Evolutionary Theory

• The gene-centric perspective, as opposed to the classical genome-centric viewpoint. Individual genes can be considered distinct evolutionary units that are subject to selection across species and compete with other genes.

– In the gene-centric view, a genome is a community of genes that have some degree of selfishness.

– The genome-centric view is that selection occurs at the level of the entire organism, the classical view

– Both views have validity.

• Population genetics theory has two main mechanisms by which major genetic innovations can spread:

– if the mutation confers a major advantage, such as antibiotic resistance

– if the population size is very small, so random genetic drift can cause neutral mutations to take over.

– It’s not obvious how this works in prokaryotes, but it is worth thinking about.

Genetic Signatures of Different Lifestyles

• You would think that organisms living at very high temperatures or in high radiation environments would need some special genes to adapt.

• Only 1 gene is routinely found in hyperthermophiles but not mesophiles: reverse gyrase, an enzyme that introduces positive supercoiling into DNA (as opposed to the usual negative supercoiling)

– The secrets of how to survive at high temperatures are not obvious from looking at the genes. Pretty much the same genes in hyperthermophiles as are in mesophiles.

• Similarly, no specific genes in Deinococcus radiodurans, which is highly radiation (and desiccation) tolerant. Radiation resistance is thought to be a side effect of desiccation resistance.

– Some genes are up-regulated under irradiation, but knocking them out doesn’t affect radiation tolerance. However, other, non-upregulated genes do affect radiation sensitivity.

• So, at this point, all of our genomics work has failed to pinpoint how complex phenotypes are generated.

Some General Principles

• The split between bacteria and archaea is obvious when 16S rRNA, universal genes, or almost any other comparison is made.

• All prokaryotes have the same general genome pattern: a single circular chromosome with (usually) one origin of replication, and with genes packed closely together with some co-regulated as operons.

• Scaling: informational genes tend to be constant in number, metabolic and transport genes are found in numbers proportional to the overall genome size, and regulatory genes are proportional to the square of genome size.

• Horizontal gene transfer is widespread, both between genomes and between cellular and “mobilome” life.

– All prokaryotes seem to form a single common gene pool, although not all transfers are equally likely.

– Not all genes within an organism have the same evolutionary history.

• Orthologous genes have patchy distributions across genomes, because non-orthologous gene displacement is common.

– Gene function and gene homology are not strictly related