
Genomics: sequencing of microbial genomes

This lecture illustrates the strategies used in microbial genome sequencing projects, compares genome content and organisation amongst microbes, and shows how to derive information on gene function across a genome.

Objectives for students:

Expected to describe strategies involved in microbial genome sequencing and functional genomics

Provide examples of information that can be derived from genomics

Microbial Genome Sequencing

Genome Sequencing Projects
o strategy & methods
o annotation

Comparative genomics
o organisation
o gene content

Functional genomics
o transcriptome
o proteome
o genome-wide mutation

Concentrate on strategy & ideas

Genome Sequencing Projects

Genome sequencing progress (2009)

Complete:
o Archaeal: 70 (2007 = 49) (2008 = 55)
o Bacterial: 945 (2007 = 554) (2008 = 728)
o (Eukaryotic: 121) (2007 = 76) (2008 = 97)

Ongoing:
o Archaeal: 111
o Bacterial: 3498
o (Eukaryotic: 1223)
o Metagenome projects: 200

University of Leicester –Genomes–Microbial Genomics -October 2010 Page 1

www.genomesonline.org

Bacterial genome projects

Many completed:

Haemophilus influenzae
Escherichia coli
Bacillus subtilis
Mycoplasma genitalium
Helicobacter pylori (x2)
Campylobacter jejuni
Treponema pallidum
Neisseria meningitidis
Neisseria gonorrhoeae
Vibrio cholerae
E. coli O157

Links:

http://www.tigr.org/
http://www.ncbi.nlm.nih.gov/
http://www.sanger.ac.uk/
http://www.genomesonline.org/

Completed microbial eukaryote projects



Yeast - Saccharomyces cerevisiae
Plasmodium falciparum
Aspergillus nidulans, A. niger, A. oryzae & A. fumigatus
Trypanosoma cruzi & brucei
Leishmania
Entamoeba histolytica
Giardia lamblia
Candida albicans & glabrata
Paramecium

Genome sequencing strategy

In the pre-genome era there were a number of considerations regarding the benefits of sequencing. The piecemeal collection of sequenced genes was slow and costly. Issues also arose over ownership, strain choice, approach and data release. The genome project, however, provided a rational approach to sequencing which was efficient and rapid, and was able to address novel questions. The post-genomic era has allowed the application of comparative and functional genomics.

Genome sequencing strategy: Strategy choice

o large collaborative cosmid/BAC-based projects
    now better suited for larger genomes
    slow

o small insert shotgun approach
    centralised
    rapid and efficient
    choice for bacteria

Strain choice
o fresh isolate vs lab strain
o clinical vs environmental
o subsequent genetic analysis

E.g. Yeast genome sequence strategy

Yeast chromosomes (16) individually sequenced; several approaches used

o Make genome library in cosmids and order the cosmid library
    need to know which cosmid overlaps with which
    link cosmid to genome map
    produce tiled set of cosmids
    only sequence minimum number
o Use chromosome-specific probe to identify chromosome-specific cosmids
o Sequence cosmid inserts by subcloning
o Solve problems by direct PCR sequencing, walking and other libraries (lambda)
o Telomeres


Tiled set

Whole genome/chromosome shot-gun strategy (WGS)

o Rapid
o Generation of small insert genomic library
o Library is not initially ordered
o DNA sequence ends of inserts
o Depends on powerful computing to assemble sequence reads


[Figure: Ordering clones. Cosmids c1 to c5, each identified by a pair of end markers (c1: A, B; c2: C, D; c3: E, F; c4: G, H; c5: I, J), are ordered by overlap to give a tiled set.]

Main steps in generating a complete genome sequence


Step                  Minimum time period (weeks)
Isolation             2
Construction          4-6
Shotgun sequencing    2-4
Finishing             12
Annotation            12

[Figure: shotgun library construction. Bacterial chromosome; random shearing; size selection; cloning into plasmid vector; library of clones; individual clones; sequence the end of each clone.]

Automated sequencers:

Manual chain-termination sequencing requires four reaction tubes, each containing a different type of terminator base as well as a radioactive nucleotide for labelling the newly synthesised DNA fragments. Each of the four reactions is electrophoresed in a separate lane of a gel. Demand for the ability to read more sequence in a shorter amount of time led to the automation of the DNA sequencing process.

The attachment of a different fluorescent dye to each of the four terminator bases meant that four separate sequencing reactions were no longer required; the entire sequencing reaction could be accomplished in a single tube. The development of automated sequencing machines using multiple capillaries (thin, hollow glass tubes filled with a gel polymer) removed the need for a technician to add each sequencing reaction into an individual lane of the gel prior to the run.

ABI 3700

The ABI 3700s (made by Applied Biosystems) are the most widely used automated sequencers. They have 96 capillaries, with a robot loading from 384-well plates.

MegaBACE

The MegaBACE is made by Amersham. It also has 96 capillaries and robotic loading from 384-well plates. Each run takes two to four hours, and can read up to 800 bases.

These advances have led to the industrialization of sequencing. Most genome sequencing projects divide tasks (such as genome libraries, production sequencing and finishing) among different teams.


[Figure, continued: sequencing individual clones; assembly; genome sequence with gaps.]

Sequencing machines are run 24 hours a day, 7 days a week, and many tasks can be performed by robots.

454 sequencing- the future?

454 sequencing was developed by 454 Life Sciences (part of Roche), and relies on a technique known as pyrosequencing (sequencing by synthesis). It differs from Sanger sequencing in relying on the detection of pyrophosphate release on nucleotide incorporation, rather than chain termination with dideoxynucleotides.

Nucleotides are flowed sequentially in a fixed order across the PicoTiterPlate device during a sequencing run.

During the nucleotide flow, hundreds of thousands of beads each carrying millions of copies of a unique single-stranded DNA molecule are sequenced in parallel.

If a nucleotide complementary to the template strand is flowed into a well, the polymerase extends the existing DNA strand by adding nucleotide(s).

Addition of one (or more) nucleotide(s) results in a reaction that generates a light signal that is recorded by the CCD camera in the instrument.

The signal strength is proportional to the number of nucleotides incorporated in a single nucleotide flow.

The GS FLX System software tracks the location of DNA-carrying beads on an XY axis. Each bead corresponds to an XY-coordinate on a series of images. The signal intensity per nucleotide flow is recorded for each bead over time and is plotted to generate a flowgram. Each 10 hour sequencing run on the GS FLX Titanium series will typically produce over one million flowgrams, one flowgram per read.
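The link between a flowgram and a read can be sketched in a few lines of Python. This is only an illustration of the principle (signal strength proportional to the number of nucleotides incorporated per flow), not the GS FLX software: the flow order "TACG", the function name and the example intensities are all assumptions made for the sketch.

```python
def flowgram_to_sequence(intensities, flow_order="TACG"):
    """Decode a flowgram into a base sequence.

    intensities: one normalised light signal per nucleotide flow.
    flow_order:  assumed cyclic order in which nucleotides are flowed.
    """
    bases = []
    for i, signal in enumerate(intensities):
        # Signal is proportional to the number of bases incorporated,
        # so rounding gives the homopolymer length for this flow.
        count = max(0, round(signal))
        bases.append(flow_order[i % len(flow_order)] * count)
    return "".join(bases)


# Hypothetical flowgram covering two cycles of T, A, C, G.
example_flowgram = [1.02, 0.05, 2.10, 0.00, 0.90, 1.10, 0.03, 2.95]
decoded = flowgram_to_sequence(example_flowgram)
print(decoded)
```

Rounding is the weak point of this approach: for long runs of the same base the signals for n and n+1 bases become hard to tell apart, which is why homopolymers are a known source of error in pyrosequencing.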


Source: 454 Sequencing © Roche Diagnostics

The development and impact of 454 sequencing. Rothberg and Leamon, Nature Biotechnology 26, 1117-1124 (2008). http://www.ncbi.nlm.nih.gov/pubmed/18846085

Work involved in whole genome sequencing: individual sequencing reads accumulate
o each read about 500 bp
o computing used to assemble reads
o contiguous sequences called contigs

Aim for 8-10 read coverage of genome for accuracy. Example:
o H. influenzae: 19,687 templates; 24,304 reads assembled; 11,631,485 bp
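The H. influenzae figures can be turned into a mean read length and a fold coverage with simple arithmetic. The sketch below uses only numbers quoted in these notes (the genome size of 1.830 Mbp is the value given in the genome size section); the function and variable names are illustrative.

```python
# Figures quoted in these notes for H. influenzae.
assembled_bases = 11_631_485   # total bp in the assembled reads
num_reads = 24_304             # reads assembled
genome_size = 1_830_000        # 1.830 Mbp


def fold_coverage(total_bases, genome_bp):
    """Average number of times each base of the genome was read."""
    return total_bases / genome_bp


mean_read_length = assembled_bases / num_reads
coverage = fold_coverage(assembled_bases, genome_size)

print(f"mean read length: {mean_read_length:.0f} bp")
print(f"coverage: {coverage:.1f}x")
```

On these numbers the reads average a little under 500 bp and the shotgun phase gives roughly six-fold coverage, somewhat below the 8-10 target, which is one reason a finishing stage is needed.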

Gaps in genome sequence need to be filled in:



Bridging Gaps

A contig is a set of gel readings that are related to one another by overlap of their sequences. The gel readings in a contig can be summed to form a contiguous consensus sequence; the length of this sequence is the length of the contig.

As reads accumulate:
o rise in contig number as amount of reads increases
o steady fall as accumulating sequence bridges gaps between contigs
o levels off as new reads more likely in known contig than gap
o start finishing
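The rise and fall in contig number can be reproduced with a simplified form of the Lander-Waterman model, in which the expected number of contigs is N·exp(-c) for N reads at coverage c = N·L/G. This model is not part of the original notes; it is brought in here as a hedged illustration, it ignores the minimum overlap needed to join two reads, and the read length of 480 bp is an assumption close to the "about 500 bp" quoted above.

```python
import math


def expected_contigs(num_reads, read_len, genome_size):
    """Simplified Lander-Waterman estimate: N * exp(-c), with c = N * L / G."""
    coverage = num_reads * read_len / genome_size
    return num_reads * math.exp(-coverage)


GENOME = 1_830_000   # H. influenzae, 1.830 Mbp (from these notes)
READ_LEN = 480       # assumed mean read length

# Contig number first rises, peaks near 1x coverage, then falls.
for n in (1_000, 4_000, 10_000, 24_304):
    estimate = expected_contigs(n, READ_LEN, GENOME)
    print(f"{n:>6} reads -> about {estimate:.0f} contigs")
```

The curve flattens at high read numbers because each new read is far more likely to land inside an existing contig than in a gap, which is the point at which random sequencing stops being cost-effective and finishing begins.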

Finishing

Why are gaps present?

Gap bridging
o sequence gaps: choose appropriate clone and walk
o physical gaps: alternative libraries (which?); PCR across gap

Mistakes/poor sequence
o areas where sequence reads are less than 8-10
o repeated sequences - rRNA

Closure and completion

Genome annotation


[Figure: physical gaps and sequence gaps, showing the genome, library clones, sequence reads and contigs.]

Find ORFs
o look for ATG-Stop (+ alternatives)
o over certain size
o overlaps
o computer based (“Glimmer” & “Orpheus”) and trained eye

ORF function
o Search databases with predicted translated sequences - BLASTX
o Consider level of similarity and context
o Domain comparisons: Pfam/Prosite

Other features
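The "ATG to stop, over a certain size" rule can be written as a naive scan of all six reading frames. This is a teaching sketch only: it is not how Glimmer or Orpheus work (they use statistical models), it ignores alternative start codons, and the function names and the `min_codons` threshold are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, orf) for each ATG...stop ORF.

    Coordinates are 0-based, end-exclusive, and for the "-" strand refer
    to positions on the reverse-complemented sequence. min_codons counts
    the start and stop codons.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None                   # look for the next ORF
    return orfs


example_orfs = find_orfs("CCATGAAATTTTAGGG")
print(example_orfs)
```

In a real genome the size threshold would be set much higher (typically around 100 codons) so that random ATG to stop stretches are not reported, and overlapping candidates on opposite strands would then be resolved by software or by eye.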


http://www.yeastgenome.org/MAP/GENOMICVIEW/GenomicView.shtml

http://mips.gsf.de/genre/proj/yeast/index.jsp



http://www.yeastgenome.org/index.shtml

Artemis: sequence viewer and annotation tool from the Sanger Centre (http://www.sanger.ac.uk/Software/Artemis/)

http://xbase.bham.ac.uk/

xBASE is a database for comparative genome analysis of all bacterial genome sequences

Chaudhuri RR, Pallen MJ. xBASE, a collection of online databases for bacterial comparative genomics. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D335-7.

http://nar.oxfordjournals.org/cgi/content/full/34/suppl_1/D335




Post Genome Sequence Approaches

Comparative genomics
o comparing genome organisation and content
o genome size
o genome repeats/Tn/phages
o gene content
o minimal gene content

Functional genomics: ascribing gene function across a genome
o gene function - knowns
o phenotype prediction
o gene function - unknowns
o investigating function

Bacteria-Yeast

Bacteria: Does genome size matter?

Link genome size to adaptive capability
o biosynthetic capability: synthesis of nutrients
o stress resistance: resist environmental insults
o structural complexity: surface structures, sporogenesis

Regulation: sensing signals and transcriptional responses
o detect change or requirement and respond appropriately
o transcriptional regulation

[Figure: A conceptual diagram of the flux and information in a network-based genome-sequencing project. Labels: coordinator; DNA; shotgun templates; shotgun sequences; finishing instructions; finishing sequences; annotation tasks; annotations; bioinformatics lab; sequencing sites (S); working draft sequence; finished sequence; finished annotated sequence.]

Although the size of the bacterial genome is important, how the genome is expressed and regulated within its environment is also important:

Small genomes:

Mycoplasma genitalium
o 580,070 bp
o smallest genome for self-replicating organism
o free living but infects host cells
o few biosynthesis and regulatory systems
o has replication & transcription & translation, metabolism etc functions

Borrelia burgdorferi
o 910,725 bp
o Lyme disease
o few cellular biosynthetic systems

Mycoplasma pneumoniae (0.8 Mbp); Chlamydia trachomatis (1.0 Mbp)

Larger genomes:

Haemophilus influenzae
o 1.830 Mbp
o colonises human respiratory tract
o limited environment

Helicobacter pylori
o 1.667 Mbp
o colonises human stomach
o limited environment

Campylobacter jejuni
o 1.641 Mbp
o colonises intestine
o limited environment

Very large:

Escherichia coli (K-12)
o 4.639 Mbp

Bacillus subtilis
o 4.214 Mbp
o soil/plant organism
o secondary metabolites

Pseudomonas aeruginosa
o incomplete (5.9 Mbp)

Yersinia pestis (4.4 Mbp)

Clostridium spp (4-5 Mbp)

Mycobacterium tuberculosis
o 4.411 Mbp
o slow growing (double in 24h)
o large proportion of genome on lipid metabolism

Streptomyces coelicolor (~8 Mbp)
o secondary metabolites - antibiotics!

Organisation of bacterial genomes

Linear chromosomes
o Borrelia burgdorferi
o Streptomyces coelicolor

Multiple chromosomes
o Vibrio cholerae

Plasmids
o Borrelia burgdorferi
o 17 linear & circular plasmids
o 50% genome size
o plasmid replication, “decaying genes”, antigenic variation

Transposons, IS elements, phages
o found in most genomes
o although Campylobacter has none

Repeats

Replication: origin (oriC) and termination (terC) of replication
o oriC often near dnaA gene (replication initiation protein)
o in Borrelia burgdorferi (linear) oriC (& dnaA) in centre

Strand bias
o which strand is each gene on?
o transcription in same direction as replication - more efficient
o variation in level of strand bias: M. tuberculosis 55% vs B. subtilis 75%
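Strand bias figures such as 55% or 75% are simply the fraction of genes transcribed in the same direction as the replication fork that copies them. A minimal sketch for a circular chromosome is given below; the coordinates, the gene list and the function name are hypothetical, and real annotation files would need parsing first.

```python
def codirectional_fraction(genes, ori, ter, genome_len):
    """Fraction of genes transcribed in the direction of replication.

    genes: list of (position, strand) with strand "+" or "-".
    ori, ter: coordinates of the replication origin and terminus.
    The fork leaving ori towards increasing coordinates copies the
    stretch ori -> ter; the other fork copies the rest of the circle.
    """
    if not genes:
        raise ValueError("no genes supplied")
    forward_span = (ter - ori) % genome_len
    same_direction = 0
    for position, strand in genes:
        on_forward_replichore = (position - ori) % genome_len < forward_span
        # "+" genes are co-directional on the forward replichore,
        # "-" genes are co-directional on the other one.
        if (strand == "+") == on_forward_replichore:
            same_direction += 1
    return same_direction / len(genes)


# Hypothetical 1000 bp circular chromosome with ori at 0 and ter at 500.
toy_genes = [(100, "+"), (200, "-"), (600, "-"), (900, "-")]
bias = codirectional_fraction(toy_genes, ori=0, ter=500, genome_len=1000)
print(f"{bias:.0%} of genes are co-directional with replication")
```

A value near 50% means genes are spread evenly between the two strands; higher values indicate stronger selection for transcription and replication to run in the same direction.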

Genes can be annotated according to sequence similarity, e.g. gene families such as regulators, transport and biosynthesis, or domain matches such as trans-membrane domains or DNA-binding domains. Paralogues and orthologues can also be noted. Paralogues are members of the same family (homologous) in the same genome, but are likely to have evolved to have a different exact function; orthologues, on the other hand, are homologous genes (same family) in different genomes that may have identical function.

This allows the deduction of metabolic pathways in newly sequenced organisms:

e.g. Vibrio cholerae

Reprinted by permission from Macmillan Publishers Ltd: [NATURE] (Heidelberg et al, 406 ,477-483), copyright (2000)

A significant proportion of the genome contains ORFs of unknown function. Some may be orthologues of unknowns in other organisms, whilst others may be unique to the organism and important for its biology. For example, H. influenzae has 42% of genes with no known function, whilst H. pylori has 33%, E. coli has 38% and M. tuberculosis between 16% and 60%. The number of these genes of unknown function is decreasing, however.

Comparison between genomes indicates the differing genomic arrangements within species, for example:

Comparison of Salmonella enterica serovar Typhi CT18 and Salmonella enterica serovar Typhi Ty2 shows an inversion that spans the terminus.


http://www.nature.com/

Variation in genomes may occur by gain or loss. Regions shared by closely related species are referred to as core regions. There is also an additional “flexible” gene pool containing variable regions acquired from mobile genetic elements. These were first described as pathogenicity islands, although they are also found in non-pathogens and, having wider roles than pathogenicity, are now referred to as genomic islands. These islands are found in pathogens, commensals, symbionts and environmental bacteria. The gain of a genomic island can be associated with gene loss, e.g. gene reduction in obligate intracellular pathogens. Genome organisation as well as genome content correlates with microbial lifestyle.

Inserted genomic islands are frequently located adjacent to tRNA genes and are known as tRNA-associated elements, e.g. tRNAProL and tRNAArgU in S. Typhi and E. coli.


http://www.sanger.ac.uk/resources/software/act/

The supragenome

The distributed-genome hypothesis (DGH) states that bacteria possess a number of virulence traits that are expressed only at the population level and are not operational at the single-cell level, i.e. that bacteria have a (supra)genome much larger than the genome of any single bacterium.


[Figure legend: black arrows = Sal + Ec; white arrows = Sal or Ec; grey = strain/serovar specific. GC is for S. Typhi.]

Infection and Immunity, May 2002, p. 2351-2360, Vol. 70, No. 5

The supragenome consists of core and non-core gene sets. For example, Hiller et al. (Journal of Bacteriology, November 2007, p. 8186-8195, Vol. 189, No. 22; http://jb.asm.org/cgi/content/abstract/189/22/8186) sequenced 8 strains of Streptococcus pneumoniae and analysed a further 9 previously available. They found a core set of genes in all strains, but 20-30% of genes were non-core (not present in all strains) due to genetic recombination generating diversity across strains. This was also observed in Haemophilus influenzae by Hogg et al. (Genome Biology 2007, 8:R103; http://genomebiology.com/2007/8/6/R103), who found ~1400 genes in the core set and ~1300 non-core genes in subsets of strains.
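Splitting a supragenome into core and non-core sets is a set operation: the core is the intersection of the gene sets of all strains, and the non-core set is whatever remains of their union. The strain and gene names below are hypothetical placeholders, not data from Hiller et al. or Hogg et al.

```python
def core_and_noncore(strain_genes):
    """Split genes into core (in every strain) and non-core sets.

    strain_genes: dict mapping strain name -> set of gene identifiers.
    """
    gene_sets = list(strain_genes.values())
    supragenome = set().union(*gene_sets)   # every gene seen in any strain
    core = set.intersection(*gene_sets)     # genes present in all strains
    return core, supragenome - core


# Hypothetical gene content for three strains.
strains = {
    "strain_A": {"g1", "g2", "g3"},
    "strain_B": {"g1", "g2", "g4"},
    "strain_C": {"g1", "g2", "g3", "g5"},
}
core, noncore = core_and_noncore(strains)
print("core:", sorted(core))
print("non-core:", sorted(noncore))
```

In practice the hard part is deciding when two genes in different strains count as "the same gene", which requires sequence clustering before any set arithmetic can be done.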

Yeast

16 chromosomes totalling 12.068 Mbp
5885 ORFs (6275 predicted, but translation is thought to be unlikely in 390)
Few introns (~4%)
Average gene size 2 kb (worm ~6 kb and human >30 kb)

GC content varies along chromosome length
o low GC at telomere & centromere
o GC-rich regions correlate with higher recombination

Tn and remnants in genome
o evidence of hotspots

50% of ORFs of known function
o for some the exact role is unclear
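Variation in GC content along a chromosome is usually shown by computing the GC fraction in a sliding window. A minimal sketch follows; the window and step sizes and the toy sequence are arbitrary choices for illustration, and a real yeast chromosome would use windows of several kilobases.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window, step):
    """Return (start, GC fraction) for each full window along the sequence."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence: AT-rich ends with a GC-rich centre.
profile = gc_windows("ATATGCGCATAT", window=4, step=4)
print(profile)
```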

http://genome-www.stanford.edu/Saccharomyces

http://mips.gsf.de/projects/fungi

Functional genomics

Functional genomics involves ascribing gene function across a genome.

Micro and Chip Arrays

Microarrays
o Glass slides with <10,000 individual samples applied in known position
o Use of robotics
o Samples can be PCR products or oligos
o example: oligos complementary to each unique tag
o example: oligo/PCR product complementary to each ORF

Chip arrays
o silicon based
o >10,000 sequences
o http://www.affymetrix.com/index.html



Transcriptome

The transcriptome is the total set of RNAs (including mRNA, rRNA, tRNA, and non-coding RNA) produced by a single cell or population of cells, and provides a genome-wide expression level for each ORF. The expression of a gene relates to its role, so the transcriptome also allows the assessment of mutants, by comparing the expression of each ORF in different conditions. Both genome-wide expression maps and global patterns of expression can be produced.

http://www.bio.davidson.edu/courses/genomics/chip/chip.html

e.g. Expression profiling C. jejuni in low iron
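Comparing the expression of each ORF between two conditions is commonly summarised as a log2 fold change. The sketch below assumes one intensity value per ORF per condition; the ORF names, the intensities, the pseudocount and the two-fold threshold are all hypothetical choices, not values from the C. jejuni experiment.

```python
import math


def log2_fold_changes(control, treated, pseudocount=1.0):
    """log2(treated / control) per ORF, for ORFs measured in both conditions."""
    return {
        orf: math.log2((treated[orf] + pseudocount) / (control[orf] + pseudocount))
        for orf in control
        if orf in treated
    }


def differentially_expressed(fold_changes, threshold=1.0):
    """Keep ORFs whose expression changed at least 2**threshold-fold."""
    return {orf: fc for orf, fc in fold_changes.items() if abs(fc) >= threshold}


# Hypothetical array intensities: iron-replete (control) vs low iron.
iron_replete = {"orfA": 99.0, "orfB": 199.0, "orfC": 49.0}
low_iron = {"orfA": 399.0, "orfB": 199.0, "orfC": 11.5}

changes = log2_fold_changes(iron_replete, low_iron)
hits = differentially_expressed(changes)
print(hits)
```

Positive values mark ORFs induced in low iron and negative values mark repressed ones; a real analysis would also need replicate arrays and a statistical test before calling a change significant.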



[Figure: chip array. One cell = one specific sequence; a laser reads the individual sequences and bound sample.]



Proteome

The proteome is the entire set of proteins expressed by a genome, cell, tissue or organism at a given time under defined conditions. This genome-wide determination of protein expression provides information on how protein expression is linked to function. It allows assessment of mutants, in particular regulatory mutants which affect several proteins. Bacteria are grown under defined conditions, and their proteins extracted and electrophoresed on a 2D gel. Proteins can then be identified by spot identification, mass spectrometry and peptide size predictions from genome data.

E.g. growth of C. jejuni in iron
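Peptide predictions from genome data come from digesting each predicted protein in silico and comparing the expected peptides with those seen by mass spectrometry. The notes only say "protease"; the sketch below assumes trypsin, which cleaves after lysine (K) or arginine (R) unless the next residue is proline (P). The protein sequence is made up for illustration.

```python
def trypsin_digest(protein):
    """Split a protein sequence into the peptides trypsin would release."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        # Cleave after K or R, but not when followed by P.
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])   # C-terminal peptide
    return peptides


# Hypothetical protein translated from a predicted ORF.
peptides = trypsin_digest("MKTAYRPGLKAAR")
for peptide in peptides:
    print(peptide, len(peptide))
```

Each predicted peptide has a calculable mass, so a set of masses measured from one gel spot can be matched against the digests of every ORF in the genome to identify the protein.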


Which protein? Which conditions? Which other proteins are co-expressed?

http://depts.washington.edu/yeastrc/pages/ms.html

[Figure: 2D gel separating proteins by pI and molecular mass; spots are digested with protease and analysed by mass spectrometry.]

Mutantome

Mass mutagenesis can be used to create a mutantome, where every ORF in the genome has been mutated via organism-specific technology. This allows high-throughput analysis of the phenotype, allowing analysis of many thousands of mutants under many conditions. Signature-tagged technology enables analysis of mutant pools, but requires array technology for genome-wide projects.

Signature Tagging involves the addition of short unique DNA sequence tags. Each tag is linked to a mutation, with each individual mutant having a unique tag.

By inserting a “molecular barcode” within a gene for a number of mutants and then subjecting this pool of mutants to a treatment, copies of the barcode present post-treatment can be determined. The process allows identification of missing bar coded mutants, and also those genes which can be assumed to have a role in adapting to the treatment environment.
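The comparison of barcode copies before and after treatment can be sketched as follows. Barcode counts are normalised to the pool total, and a mutant is reported as missing when its share of the pool falls below a chosen fraction of its starting share. The tag names, counts and the 10% cut-off are hypothetical.

```python
def missing_mutants(before, after, min_ratio=0.1):
    """Barcodes whose share of the pool drops below min_ratio of the input.

    before, after: dicts mapping barcode -> copy number (e.g. array signal)
    in the input pool and the post-treatment pool.
    """
    total_before = sum(before.values())
    total_after = sum(after.values())
    lost = []
    for tag, count_before in before.items():
        if count_before == 0:
            continue                      # never in the input pool
        if total_after == 0:
            lost.append(tag)              # nothing survived at all
            continue
        share_before = count_before / total_before
        share_after = after.get(tag, 0) / total_after
        if share_after / share_before < min_ratio:
            lost.append(tag)
    return sorted(lost)


# Hypothetical pool of four bar-coded mutants.
input_pool = {"tag01": 100, "tag02": 100, "tag03": 100, "tag04": 100}
post_treatment = {"tag01": 120, "tag02": 0, "tag03": 110, "tag04": 5}

lost_tags = missing_mutants(input_pool, post_treatment)
print(lost_tags)
```

Because each tag is linked to one mutation, the list of lost tags maps directly to the genes that are candidates for a role in surviving the treatment.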

Nature Reviews Genetics 7, 929-939 (December 2006: http://www.nature.com/nrg/journal/v7/n12/full/nrg1984.html)


[Figure: an ORF (ORF X) is mutated to give chromosomal mutants, which are combined into mutant pools; comparing each condition with ‘normal’ suggests a functional role.]

Interactome: Yeast two-hybrid screening allows the identification of protein-protein interactions and protein-DNA interactions by testing for physical interactions (such as binding) between two proteins or between a single protein and a DNA molecule, respectively.

The premise behind the test is the activation of downstream reporter gene(s) by the binding of a transcription factor onto an upstream activating sequence (UAS). For two-hybrid screening, the transcription factor is split into two separate fragments, called the binding domain (BD) and activating domain (AD). The BD is the domain responsible for binding to the UAS and the AD is the domain responsible for the activation of transcription.


[Figure: signature tagging. Each mutant (mutant 1, mutant 2, mutant 3, mutant 4 and so on to mutant 1654) carries a mutant-specific DNA sequence, alongside the “normal”, un-mutated Campylobacter. The copies of barcodes present in the mutant pool and in the post-treatment mutant pool are compared on a bar code array. Which bar-coded mutants are missing? These identify genes involved in the process. Image source: www.freedigitalphotos.net/]

Expression libraries of binding-domain fusions (protein 1, the bait) and activation-domain fusions (protein 2, the prey) allow combinations of all open reading frames within a genome to be tested.

http://en.wikipedia.org/wiki/File:Two_hybrid_assay.svg

Genomic indexing

Microarray techniques can be used to assess gene inventories. Genomic indexing evaluates the distribution of genes of sequenced bacterial strains among un-sequenced strains of the same or related species, and can be used to determine the repertoire of virulence genes found in bacterial pathogens.

For example, an array of all known genes in a microbe is created and hybridised with labelled chromosomal DNA from each isolate. Genes 1, 2, 3 & 14 hybridise with DNA from every isolate and so form the minimal gene set, whereas the remaining genes are detected only in some isolates. The patterns obtained from different isolates can then be identified and compared.



Reprinted by permission from Macmillan Publishers Ltd: [NATURE REVIEWS GENETICS] (Mazurkiewicz et al. 7 929-939), copyright (2006)


[Figure: genomic indexing. An array of all known genes in the microbe (genes 1 to 15) is hybridised with labelled chromosomal DNA from Isolate 1, Isolate 2 and Isolate 3. Genes 1, 2, 3 & 14 form the minimal gene set.]

http://www.nature.com/nrg/

This marks the end of the lecture notes for Genomes on Microbial Genomes.

