Microbial Genomics & Bioinformatics - Fall 2012
Lecture 1 - September 4, 2012
Wackett and Khodursky

Microbes are small but prolific
On the head of a pin

Microbes are everywhere
In high temp
In high salinity
In high radiation fields

When we refer to microbes we mean:
Prokaryotes: Bacteria, Archaea
Single celled eukaryotes: Yeast, Protozoa
These microscopic, single-celled organisms are in all three domains of life




Why genomics?

A. The complete DNA sequence provides “operating instructions” for an organism
B. Comparing DNA sequences supplies information about taxonomy and evolution
C. Determining the clustering of genes reveals related function and regulation
D. Can use genome sequence to explore global gene expression
E. Learn global genome structure, e.g., number of chromosomes and plasmids, gene organization
F. On cusp of new era: can synthesize genes, gene clusters, entire genome
G. Genome knowledge and construction becoming central to industrial biotechnology

Why microbial genomics?

A. Learn about pathogens: how they cause disease and how they can be stopped
B. Learn about evolution; prokaryotes are the most diverse life forms
C. Learn about organismal interactions, e.g., genes present or absent in endosymbionts
D. Learn about biogeochemical cycling in the environment
E. Microbial genomes are more accessible due to smaller size and denser coding
F. Because of above, greater chance for comparative genomics
G. Industrial microbes can be understood and manipulated using genomic information

Background on genomes: It starts with DNA

1. DNA composition
A. Complete depiction
Visualization by EM
Atomic structure
DNA depicted as string of letters: …..TCAAGGATCCCGGAATTACG….

Physical and chemical description of DNA
As mass, in pg (picograms)
As atomic mass (Daltons): molar mass of one independent copy
As nucleotide base pairs (Mb): 1 Mb = one million base pairs

One picogram DNA = 978 Mb (Doležel, J., J. Bartoš, H. Voglmayr, and J. Greilhuber. 2003. Cytometry 51A: 127-128)

Table 1: Relative Molecular Weights of Nucleotides†

Nucleotide                            Chemical formula    Relative molecular weight
2′-deoxyadenosine 5′-monophosphate    C10H14N5O6P         331.2213
2′-deoxythymidine 5′-monophosphate    C10H15N2O8P         322.2079
2′-deoxyguanosine 5′-monophosphate    C10H14N5O7P         347.2207
2′-deoxycytidine 5′-monophosphate     C9H14N3O7P          307.1966

†Source of table: Doležel, J., J. Bartoš, H. Voglmayr, and J. Greilhuber. 2003. Cytometry 51A: 127-128

Relative weights of nucleotide pairs can be calculated as follows: AT = 615.383 and GC = 616.3711, bearing in mind that formation of one phosphodiester linkage involves a loss of one H2O molecule. Further, phosphates of nucleotides in the DNA chain are acidic so at physiologic pH the H+ ion is dissociated. Provided the ratio of AT to GC pairs is 1:1, the mean relative weight of one nucleotide pair is 615.8771.
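The arithmetic behind the pair weights and the "one picogram = 978 Mb" figure can be checked with a short Python sketch. The nucleotide weights and the mean pair weight come from the notes; the values used for water, the proton, and Avogadro's number are assumptions added here, and the function names are illustrative only.

```python
AVOGADRO = 6.02214076e23   # molecules per mole (assumed constant, not in the notes)
MEAN_BP_WEIGHT = 615.8771  # g/mol, mean nucleotide pair weight from the notes (AT:GC = 1:1)

# Free nucleotide weights from Table 1
NUCLEOTIDE_WEIGHT = {"A": 331.2213, "T": 322.2079, "G": 347.2207, "C": 307.1966}

# Assumed values, not given in the notes; small rounding differences are expected
WATER = 18.0153   # g/mol, H2O lost per phosphodiester linkage
PROTON = 1.00794  # g/mol, H+ dissociated from each phosphate


def pair_weight(base_a, base_b):
    """Weight of one nucleotide pair in a DNA chain.

    Each of the two nucleotides loses one H2O and one H+.
    """
    free = NUCLEOTIDE_WEIGHT[base_a] + NUCLEOTIDE_WEIGHT[base_b]
    return free - 2 * WATER - 2 * PROTON


def picograms_to_base_pairs(picograms):
    """Convert a DNA mass in pg to a number of base pairs."""
    grams = picograms * 1e-12
    return grams * AVOGADRO / MEAN_BP_WEIGHT


def base_pairs_to_picograms(base_pairs):
    """Convert a number of base pairs to a DNA mass in pg."""
    return base_pairs * MEAN_BP_WEIGHT / AVOGADRO * 1e12


if __name__ == "__main__":
    print(f"AT pair: {pair_weight('A', 'T'):.3f}")
    print(f"GC pair: {pair_weight('G', 'C'):.3f}")
    print(f"1 pg = {picograms_to_base_pairs(1.0) / 1e6:.0f} Mb")
```

With these assumed constants the pair weights agree with the quoted values to about three decimal places, and one picogram comes out close to 978 Mb.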


Representative sizes of prokaryote genomes

Organism                       Genome size (bp)
Prokaryotes
Carsonella ruddii              0.16 x 10^6
Buchnera aphidicola            0.4 x 10^6
Mycoplasma genitalium          0.6 x 10^6
Borrelia burgdorferi           1.0 x 10^6
Synechocystis sp. PCC6803      3.5 x 10^6
Escherichia coli K12           4.6 x 10^6
Pseudomonas fluorescens        6.6 x 10^6
Streptomyces coelicolor        8.7 x 10^6
Myxococcus xanthus             9.5 x 10^6
Ktedonobacter racemifer        13.6 x 10^6

Genome sizes – all living things

Note: Correction on graph above, human genome is 3,000,000,000 bp.


Summary of genome sizes (ranges) in scientific notation

               Genome size (bp)              Number of genes in a single organism
Prokaryotes    0.16 – 13.6 x 10^6            150 – 11,000
Eukaryotes     9.2 x 10^6 – 8.0 x 10^11      4700 – 50,000(?)

Genomics has been enormously impacted by sequencing methods development
The first DNA sequencing was manual and slow. Only a small virus could be done initially
The interest has been driven by human health (the human genome and disease causing bacteria)
However, DNA sequencing has become so easy, rapid and cheap, that it has transformed all biology

Several things to note about the figure above:

1. The Y-axis is logarithmic; sequencing output has grown faster than a single exponential

2. The human genome sequence was announced in 2000; output has increased 10^7-fold since then

3. The increase has opened up many new fields of genomics and revolutionized biology

4. The sequencing revolution has sparked a corresponding need for computation (informatics)

5. Bioinformatics is still catching up to the huge amounts of data being generated

6. There is a need for greater integration of genomics data with other fields of biology

7. Now, the first thing to do with a new microbe or new environment is to sequence DNA


Using genomic data is lagging behind generating it! Data storage and manipulation are also computationally limited. Bioinformatics must catch up!


What’s easy and what’s hard in bioinformatics?

Finding gene boundaries is a little harder but still fairly easy for current software tools
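As a rough illustration of what finding gene boundaries involves, the Python sketch below scans the three forward reading frames for stretches that begin with ATG and end at the first in-frame stop codon. It is a deliberately simplified stand-in, not the method used by real gene-finding software: it ignores the reverse strand, alternative start codons, and any statistical model of coding sequence. The function name and the `min_codons` parameter are illustrative assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) positions of simple open reading frames.

    Positions are 0-based with an exclusive end, forward strand only.
    An ORF here runs from the first ATG in a frame to the next in-frame
    stop codon (stop included); min_codons counts the stop codon too.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # remember where this candidate gene begins
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next start codon in this frame
    return sorted(orfs)


if __name__ == "__main__":
    example = "CCATGAAATTTTAGGG"
    for start, end in find_orfs(example):
        print(start, end, example[start:end])
```

Real genomes need a much larger `min_codons` threshold to avoid reporting short chance ORFs.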


Comparing two sequences, one new sequence against many known sequences, is relatively easy
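One way to picture comparing a new sequence against many known ones is to count shared short words (k-tuples) and rank the known sequences by that count. The Python sketch below does only this counting step; it is not BLAST or any other production search tool, and the function names, the example sequences, and the default word length are assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of all words of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_by_shared_words(query, database, k=3):
    """Rank database sequences by how many k-letter words they share with the query.

    database maps a sequence name to its sequence string.
    Returns (name, shared_word_count) pairs, best match first.
    """
    query_words = kmers(query.upper(), k)
    scores = {
        name: len(query_words & kmers(seq.upper(), k))
        for name, seq in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    known = {"similar": "ATGCGA", "unrelated": "TTTTTT"}  # hypothetical sequences
    print(rank_by_shared_words("ATGCGT", known))
```

Shared-word counts only flag candidates; a proper alignment and a significance estimate are still needed before calling two sequences related.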


In biology, proteins are a certain size, so that means genes are a certain size

A small bacterial genome of 1,000,000 base pairs will encode about 1000 proteins
A small bacterial genome of 2,000,000 base pairs will encode about 2000 proteins
and so on……….
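The two examples above imply a rule of thumb of roughly one protein per 1000 base pairs. A minimal Python sketch of that estimate follows; the constant and function name are illustrative, and the result is only an approximation, since actual gene counts depend on the organism.

```python
BP_PER_GENE = 1000  # rule of thumb implied by the examples in the notes


def estimate_gene_count(genome_size_bp, bp_per_gene=BP_PER_GENE):
    """Estimate the number of protein-coding genes from genome size in bp."""
    if genome_size_bp < 0:
        raise ValueError("genome size must be non-negative")
    return round(genome_size_bp / bp_per_gene)


if __name__ == "__main__":
    # Genome sizes taken from the table of representative prokaryote genomes
    genome_sizes_bp = {
        "Mycoplasma genitalium": 0.6e6,
        "Escherichia coli K12": 4.6e6,
        "Streptomyces coelicolor": 8.7e6,
    }
    for organism, size in genome_sizes_bp.items():
        print(f"{organism}: about {estimate_gene_count(size)} genes")
```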


We can also look at aggregate properties of genomes, such as:
1. Overall size
2. GC content
3. Structural arrangement of DNA elements


AT and GC content can affect overall properties of the genome
GC content and genome size appear to show some correlation
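GC content is simply the fraction of bases that are G or C. The Python sketch below computes it for a DNA string; the function name is illustrative, and skipping characters other than A, C, G, and T (such as N) is an assumption about how ambiguous bases should be handled.

```python
def gc_content(seq):
    """Return the fraction of G and C among the A, C, G, T bases in seq."""
    bases = [base for base in seq.upper() if base in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T bases")
    gc = sum(1 for base in bases if base in "GC")
    return gc / len(bases)


if __name__ == "__main__":
    # The short example sequence shown earlier in the notes
    print(gc_content("TCAAGGATCCCGGAATTACG"))
```

AT content is one minus this value.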


Overall conclusions:

1. Even with 3.6 billion years of evolution, prokaryote genome size has remained relatively small
2. Range of prokaryote genome sizes has remained narrow, generally one order of magnitude
3. In prokaryotes, unlike eukaryotes, most DNA encodes gene products (RNA/protein)
4. Points above argue for selective pressure maintaining a compact genome in prokaryotes
5. Prokaryote genome organization is variable; may be:
   (1) single chromosome (E. coli)
   (2) multiple chromosomes
   (3) chromosome(s) plus plasmid(s)
   (4) circular chromosomes
   (5) linear chromosomes


Some useful genome, DNA sequence, protein databases, and tools freely available

1. NCBI Entrez – links to GenBank, etc. http://www.ncbi.nih.gov/Entrez
2. European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database http://www.ebi.ac.uk/embl/index.html
3. DNA DataBank of Japan (DDBJ) http://www.ddbj.nig.ac.jp/
4. J. Craig Venter Institute http://www.jcvi.org
5. Integrated Microbial Genomes http://img.jgi.doe.gov/cgi-bin/w/main.cgi
6. Kyoto Encyclopedia of Genes and Genomes (KEGG) http://www.genome.jp/kegg/kegg.html
7. Universal Protein Resource http://www.uniprot.org
8. ExPASy Proteomics server http://expasy.org
10. Protein DataBank – Protein structures http://www.pdb.org/pdb/home/home.do
11. KEGG Pathway Database http://www.genome.jp/kegg/pathway.html
12. Metacyc Metabolism Database http://metacyc.org/
13. University of Minnesota Biocatalysis/Biodegradation Database (UM-BBD) http://umbbd.msi.umn.edu
14. UM-BBD Pathway Prediction System http://umbbd.msi.umn.edu/predict/
15. BLAST – Basic Local Alignment Search Tool http://blast.ncbi.nlm.nih.gov/Blast.cgi
16. Structure-Function Linkage Database http://sfld.rbvi.ucsf.edu
17. Microbial Genomics at the U.S. Department of Energy http://microbialgenomics.energy.gov/
18. Comprehensive Microbial Resource http://cmr.jcvi.org/tigr-scripts/CMR/CmrHomePage.cgi
19. CAMERA: Community Cyberinfrastructure for Advanced Microbial Ecology Research http://camera.calit2.net/index.shtm
20. Sanger Centre Microbial Genomes http://www.sanger.ac.uk/resources/downloads/bacteria/

Definition of terms

Alignment – the procedure for comparing two or more sequences by looking for a series of individual characters or character patterns that are in the same approximate order in the sequences
Annotation – the prediction of gene name and function in a genome
Archaea – domain or main kingdom of life comprised of prokaryotes which are distinct from Bacteria
Bacteria (or Eubacteria) – domain or main kingdom of life comprised of prokaryotes distinct from Archaea
Bioinformatics – an interdisciplinary field involving biology, computer science, mathematics, and statistics to analyze biological sequence data, genome content or arrangement and to predict the structure or function of macromolecules


Codon usage – analysis of the preferred use of triplet codons in any given organism
Comparative genomics – studies whereby one genome is analyzed with respect to other genomes
Contig – stretches of sequenced DNA that can be assembled in a linear order to rebuild the order found in vivo
Database – a computerized repository of data with a standardized format for locating, adding, removing and changing data
Expect value – in a database similarity search, the probability that an alignment score as good as the one found between a query sequence and a database sequence would be found in as many comparisons between random sequences as was done to find the matching sequence
False negative – a negative data point collected in a data set that was incorrectly reported due to a failure of the test in avoiding negative results
False positive – a positive data point collected in a data set that was incorrectly reported due to a failure of the test. If the test had correctly measured the data point, the data would have recorded it as negative
Functional genomics – study of the biological function of a whole set of genes, usually by comparing genes or organisms
Gap – mismatch in the alignment of two sequences caused by either an insertion in one or a deletion in another
Genome – a single haploid set of the complete hereditary nucleic acid component of an organism
Genomics – the study of genomes and genome-controlled processes on a broad scale
Global alignment – attempts to match as many characters as possible, from end to end, in two or more sequences
Homologs – a pair of genes (or proteins) derived from a common ancestor; thus, they are evolutionarily related
Horizontal gene transfer – the transfer of genetic material between two distinct species. Plasmids are important in transferring genes between very different organisms and hence facilitate horizontal gene transfer
Indel – an insertion or deletion in a sequence alignment
K-tuple – identical short stretches of sequence, also called “words”
Local alignment – attempts to align regions of sequences with the highest density of matches
Maximum likelihood – the most likely outcome (tree or alignment), given a probabilistic model of evolutionary change in sequences
Maximum parsimony – the minimum number of evolutionary steps required to generate the observed variation in a set of sequences, as found by comparison of the number of steps in all possible phylogenetic trees
Microbial genomics – genomics of single-celled microscopic organisms (mostly prokaryotes; a few eukaryotes)
Molecular clock hypothesis – the hypothesis that sequences change at the same rate in the branches of an evolutionary tree
Mutation matrix – a scoring matrix compiled from the observation of point mutations between aligned sequences
Object-oriented database – object-oriented databases attempt to model the structure of a given dataset. Data is typically stored as “objects”, which might be gene or enzyme names, which are then assembled into a model when data is requested
Optimal alignment – the highest scoring alignment found by an algorithm capable of producing multiple solutions
Orthologs – a pair of homologous genes (or proteins) that perform the same function in different species
Paralogs – a pair of homologous genes (or proteins) that perform different, but perhaps related, functions in the same organism
PAM scoring matrices – Percent Accepted Mutation (PAM) matrices describe the probability that one base or amino acid has changed during the course of evolution


Pairwise alignment – an alignment performed between two sequences
Percent identity – the percentage of the columns in an alignment of two sequences that has identical amino acids
Percent similarity – the percentage of the columns in an alignment of two sequences that include either identical amino acids or amino acids that are frequently found substituted for each other because of similar properties
Plasmid – autonomously replicating DNA, typically containing genes non-essential for the viability of the organism under ideal conditions
Proteomics – the study of proteins in complex mixtures such as a cell or cell extract, often using genomic information as a guide
Relational database – organizes information into tables where each column represents the fields of information that can be stored in a single record
Selectivity (in database similarity searches) – the ability of a search method to locate members of a protein family without making a false-positive identification of members of another family
Sensitivity (in database similarity searches) – the ability of a search method to locate as many members of a gene or protein family as possible
Significance – a significant result is one that has not simply occurred by chance, and therefore is probably true. In sequence analysis, the significance of an alignment score may be calculated as the probability that such a score would be found between random or unrelated sequences
Specificity (in database similarity searches) – the ability of a search method to locate members of one protein family, including distantly related members
Synteny – the presence of a set of homologous genes in the same order on two genomes
Threading – in protein structure prediction, the alignment of the sequence of unknown structure with a known three-dimensional structure to determine whether the amino acid sequence is spatially compatible with that structure
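
The definitions of percent identity and gap can be made concrete with a small Python sketch. It takes two sequences that have already been aligned (equal length, with "-" marking gaps) and follows the definition above by dividing by the total number of alignment columns. The function names, the gap symbol, and the example peptides are illustrative assumptions; some tools use a different denominator, such as the number of ungapped columns.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns holding the same residue in both sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * identical / len(aligned_a)


def count_gap_columns(aligned_a, aligned_b, gap="-"):
    """Number of alignment columns where either sequence has a gap."""
    return sum(1 for x, y in zip(aligned_a, aligned_b) if x == gap or y == gap)


if __name__ == "__main__":
    a = "MKT-LV"  # hypothetical aligned peptides
    b = "MKSALV"
    print(f"{percent_identity(a, b):.1f}% identity, "
          f"{count_gap_columns(a, b)} gap column(s)")
```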