www.sciencemag.org/cgi/content/full/337/6101/1661/DC1

Supplementary Materials for

Fermentation, Hydrogen, and Sulfur Metabolism in Multiple Uncultivated Bacterial Phyla

Kelly C. Wrighton, Brian C. Thomas, Itai Sharon, Christopher S. Miller, Cindy Castelle, Nathan C. VerBerkmoes, Michael J. Wilkins, Robert L. Hettich, Mary S. Lipton, Kenneth H. Williams, Philip E. Long, and Jillian F. Banfield*

*To whom correspondence should be addressed. E-mail: [email protected]

Published 21 September 2012, Science 337, 1661 (2012)
DOI: 10.1126/science.1224041

This PDF file includes

Materials and Methods
Supplementary Text
Figs. S1 and S4 to S9
Tables S1, S2, and S4 to S6
Full References

Other Supplementary Material for this manuscript includes the following: (available at www.sciencemag.org/cgi/content/full/337/6101/1661/DC1)

Fig. S2. Maximum likelihood phylogenetic tree of 16S rRNA gene.
Fig. S3. Phylogenetic trees constructed using conserved gene sequences from at least 31 ACD candidate division genomes.
Table S3. Summary of proteomic data for the A, C, and D samples.
Database S1 (as zipped archive): Individual FASTA files for gene and protein sequences referenced in this manuscript.

Materials and Methods

Field experiment

The field experiment was carried out between July 20 and November 7, 2008 at the Rifle Integrated Field Research Challenge (IFRC) site adjacent to the Colorado River (Western Colorado, USA). For biostimulation experiments at the site, acetate was chosen as the amendment because it is a common electron donor and carbon source in anoxic systems. 50 mM acetate (plus 20 mM bromide as a tracer) was added to the groundwater 3, 4, and 5 m below ground surface through 10 injection wells. Acetate was transported 2.5 m down gradient to 12 monitoring wells via groundwater flow. The experiment was designed to achieve concentrations of ~5 mM acetate (bromide ~ 2 mM) in the monitoring wells from which samples were recovered for microbial community analysis.

Samples for measurement of aqueous geochemistry were taken 17 feet below the surface. Prior to sample collection, one well volume of fluid (12 L) was discarded to ensure that groundwater representative of the target zone was analyzed. Ferrous iron and sulfide concentrations (Figure S1) were analyzed immediately after sampling using the HACH phenanthroline assay and a sulfide reagent kit, respectively (HACH, CO). Acetate concentrations were determined using a Dionex ICS1000 ion chromatograph equipped with a CD25 conductivity detector and a Dionex IonPac AS22 column (Dionex, CA). A more detailed discussion of the experimental design and geochemical sampling is provided in (30).

Sample collection

The A, C, and D samples were recovered from groundwater 5, 7, and 10 days, respectively, after the start of acetate amendment. For each sample, 200 L of groundwater pumped at approximately 2 L min⁻¹ from the well was filtered through a prefilter (1.2 μm pore size, 293 mm diameter Supor disc filter; Pall Corporation, NY). Cells were recovered using a Pall tangential flow filtration (TFF) system consisting of two stacked 0.2 μm Centramate filter cartridges (Pall Corporation, NY) to concentrate biomass. As a consequence of the filtering procedure, which is standard operating procedure for microbial sampling in many systems, our analyses focus on cells between 0.2 and 1.2 μm in diameter. Groundwater was passed through a series of chilling baths containing an ice-rock salt mixture as soon as it was pumped to the surface to minimize changes to the biomass. The chilling baths ensured that the groundwater temperature was approximately 1°C as it passed through the filtration system, which was located in an air-conditioned trailer. Biomass was concentrated to ~200 ml in the retentate vessel and centrifuged at 4,000 rpm for 40 min at 4°C using sterile disposable centrifuge bottles. The resulting pellet was resuspended in ~5 ml of groundwater in a 15 ml Falcon tube, immediately frozen in an ethanol-dry ice mix, and shipped overnight on dry ice to Oak Ridge National Laboratory (ORNL) and Pacific Northwest National Laboratory (PNNL) for proteomic analysis. A subsample of the PNNL sample not used for proteomics was subsequently sent to UC Berkeley for DNA extraction. Following each sampling event, the Pall TFF apparatus was cleaned by passing ~5 L of bleach solution through the filters, followed by ~5 L of DI H2O.

DNA extraction and sequencing

For genomic DNA extraction, approximately 3 ml of each of the A, C, and D frozen TFF concentrate samples was thawed and homogenized. Biomass was pelleted by centrifugation at 7,000 × g for 5 minutes at 4 °C. The supernatant was removed, and 1 ml of phosphate buffered saline, pH 7.0, was added to the cell pellet. 500 µl of cell suspension was added to tubes with pre-warmed (65 °C) lysis solution from the PowerSoil DNA Isolation Kit (MoBio Laboratories, Carlsbad, CA, USA). This mixture was incubated with shaking at 120 rpm for 30 minutes at 65 °C and was inverted every 10 minutes. The manufacturer's protocol was followed for DNA extraction, with the exception that the products of two DNA extractions were combined onto one spin filter as a concentration step. DNA was eluted in 50 µl of Tris buffer prior to being submitted for library preparation.

Illumina library preparation and sequencing following standard protocols were carried out at the UC Davis DNA Technologies Core Facility (http://dnatech.genomecenter.ucdavis.edu/). Illumina libraries were prepared using the NEBNext DNA Sample Prep Reagent kit following the manufacturer's instructions. Briefly, genomic DNA was sheared by sonication, and sheared fragments were end-repaired and phosphorylated. Blunt-end fragments were A-tailed, and sequencing adapters were ligated to the fragments. Fragments with an insert size of around 200 bp were extracted using E-Gel SizeSelect gels (Invitrogen). Library fragments were enriched with 12 cycles of PCR before library quantification and validation. Libraries were pooled and sequenced on the Illumina GAIIx platform, and paired-end reads of 85 cycles were collected. Fastq files were generated using the Illumina pipeline (CASAVA 1.7).

For each sample a single flow cell lane was used to obtain paired-end 85-bp reads with 200 bp inserts. The raw reads have been deposited in the NCBI Sequence Read Archive under accession number SRA050978.1.

Proteomic analyses

Because of the novel nature of the subsurface experiments, biomass samples were sent to two independent proteomics laboratories. This approach allowed evaluation of protein identification (and abundance metrics) resulting from slightly different sample handling and analytical approaches. In contrast to some previous proteogenomic studies, protein identification rates for most organisms are low, partly because of the evenness of the species abundances.

Protein Extraction. PNNL: Harvested cell pellets were washed and suspended in 100 mM NH4HCO3 (pH 8.4) and then lysed via pressure cycling technology (PCT) using a barocycler (Pressure BioSciences Inc., South Easton, MA). The barocycler was operated for 20 s at 35 kpsi, followed by 10 s at ambient pressure. These conditions were repeated for 10 cycles. Following cell lysis, global protein fractions were extracted from the cell lysates using established protocols (31). ORNL: For all samples, equal quantities (500 µL) of concentrated groundwater were diluted into 2 ml of 6 M guanidine/10 mM DTT and heated at 60 °C for 1 hr to lyse cells and denature proteins. Samples were then diluted 6-fold with Tris buffer, and proteomes were digested into peptides with sequencing grade trypsin (Promega Corp, Madison, WI). Complex peptide solutions were then desalted via solid phase extraction, concentrated, filtered, aliquoted, and frozen.

PNNL 2D-LC-MS/MS analysis. The 2D-LC system was custom built using two Agilent 1200 nanoflow pumps and one 1200 capillary pump (Agilent Technologies, Santa Clara, CA), various Valco valves (Valco Instruments Co., Houston, TX), and a PAL autosampler (Leap Technologies, Carrboro, NC). Full automation was made possible by custom software that allows for parallel event coordination, providing near 100% MS duty cycle through use of two trapping and analytical columns. All columns were manufactured in-house by slurry packing media into fused silica (Polymicro Technologies Inc., Phoenix, AZ) using a 1 cm sol-gel frit for media retention. The first dimension SCX column was 5-µm PolySULFOETHYL A (PolyLC Inc., Columbia, MD), 15 cm x 360 µm o.d. x 150 µm i.d. Trapping columns were 5-µm Jupiter C18 (Phenomenex, Torrance, CA), 4 cm x 360 µm o.d. x 150 µm i.d. Second dimension reversed-phase columns were 3-µm Jupiter C18 (Phenomenex, Torrance, CA), 35 cm x 360 µm o.d. x 75 µm i.d. Mobile phases consisted of 0.1 mM NaH2PO4 (A) and 0.3 M NaH2PO4 (B) for the first dimension and 0.1% formic acid in water (A) and 0.1% formic acid in acetonitrile (B) for the second dimension. MS analysis was performed using an LTQ Orbitrap Velos ETD mass spectrometer (Thermo Scientific, San Jose, CA) outfitted with a custom electrospray ionization (ESI) interface. Electrospray emitters were custom made using 150 µm o.d. x 20 µm i.d. chemically etched fused silica (32). The heated capillary temperature and spray voltage were 275 °C and 2.2 kV, respectively. Data were acquired for 100 min, beginning 65 min after sample injection and 15 min into the gradient. Orbitrap spectra (AGC 1 × 10⁶) were collected from 400-2000 m/z at a resolution of 60k, followed by data-dependent ion trap CID MS/MS (collision energy 35%, AGC 3 × 10⁴) of the ten most abundant ions. A dynamic exclusion time of 60 sec was used to discriminate against previously analyzed ions.
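The data-dependent acquisition logic described above (top ten ions per survey scan, 60 s dynamic exclusion) can be illustrated with a minimal sketch. This is not the instrument's acquisition software: the class name `DynamicExclusion`, the `(m/z, intensity)` peak representation, and the m/z matching tolerance `mz_tol` are all assumptions introduced for illustration; only the top-ten selection and the 60 s exclusion time come from the text.

```python
class DynamicExclusion:
    """Toy model of top-N precursor selection with dynamic exclusion."""

    def __init__(self, top_n=10, exclusion_s=60.0, mz_tol=0.01):
        self.top_n = top_n              # ten most abundant ions (from the text)
        self.exclusion_s = exclusion_s  # 60 s exclusion time (from the text)
        self.mz_tol = mz_tol            # matching window: an assumption
        self._excluded = []             # list of (mz, expiry_time_s)

    def select(self, time_s, peaks):
        """peaks: iterable of (mz, intensity) from one survey scan.
        Returns the m/z values that would be sent to MS/MS."""
        # Forget exclusions whose time window has elapsed.
        self._excluded = [(mz, exp) for mz, exp in self._excluded if exp > time_s]
        chosen = []
        # Walk peaks from most to least intense.
        for mz, _intensity in sorted(peaks, key=lambda p: p[1], reverse=True):
            if len(chosen) == self.top_n:
                break
            if any(abs(mz - ex_mz) <= self.mz_tol for ex_mz, _ in self._excluded):
                continue  # recently fragmented, skip it
            chosen.append(mz)
            self._excluded.append((mz, time_s + self.exclusion_s))
        return chosen


# Usage with a reduced top_n so the effect is visible on three peaks.
de = DynamicExclusion(top_n=2)
scan = [(500.0, 100.0), (600.0, 90.0), (700.0, 80.0)]
first = de.select(0.0, scan)    # two most intense ions
second = de.select(30.0, scan)  # those two are excluded, so the third is picked
third = de.select(61.0, scan)   # first two exclusions have expired
```

The point of the exclusion list is that abundant peptides do not monopolize MS/MS scans, which matters in a community where many organisms are present at similar abundance.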

ORNL 2D-LC-MS/MS analysis. Peptide aliquots were loaded onto a 5 cm x 150 µm strong cation exchange (SCX) back column. The SCX column was attached to a Dionex HPLC with a home-built split system to deliver a flow rate of ~300 µL/min. The SCX column was briefly washed with a water-organic gradient and then connected to a 15 cm x 100 µm reverse phase (RP) front column (both the SCX and RP packing materials were obtained in bulk from Phenomenex, CA). The front column had an integrated nanospray tip and was connected to a dual linear ion trap Orbitrap mass spectrometer (Thermo Fisher Scientific LTQ Velos Orbitrap). A 22 hr two-dimensional gradient, consisting of 11 salt pulses each followed by a two-hour reverse phase gradient, was provided by the HPLC. During the entire chromatographic run the MS system oscillated between high resolution full scans in the Orbitrap (30K resolution) and ten low resolution data-dependent MS/MS scans in the dual ion trap. For all runs, 2 microscans were averaged for full scans and MS/MS scans; dynamic exclusion was set at 1 with an exclusion time of 60 seconds to minimize re-analysis of abundant peptides.

Proteomic Data Analysis. The MS/MS data from individual 22 hr runs from both the ORNL and PNNL laboratories were searched against a predicted protein database via SEQUEST (33). The database was generated from the complete metagenomic dataset (87 genome bins, plus unassigned genome fragments) plus sequence from a closely related Geobacter species known to be present but poorly reconstructed (Geobacter bemidjiensis Bem, DSM 16622). Common contaminants and protein standards were also included. ORNL SEQUEST search outputs were filtered and sorted with DTASelect using conservative filters (delCN of at least 0.08 and cross-correlation scores (Xcorrs) of at least 1.8 (+1), 2.5 (+2), and 3.5 (+3)) and then re-filtered by high mass accuracy (+/-10 ppm parent mass for peptides). At least two peptides were required per predicted protein sequence for the protein to be considered confidently identified, though a list of one-peptide identifications is included in Table S3. PNNL peptide identifications were filtered using an MS-GF cutoff value of 1e-10 (34). Spectral count data (Table S3) for each identified protein were normalized using NSAF calculations (35) that account for both protein length and variations in protein abundance between samples.

16S rRNA phylogenetic analyses

Near-full-length ribosomal 16S rRNA sequences were reconstructed from Illumina reads using EMIRGE (8), supported via conventional 16S rRNA clone library analysis. For EMIRGE analysis, reads from A, C, and D were trimmed from the 3′ end until a base with a quality score of ≥3 was encountered. Paired-end reads where both reads were at least 60 nucleotides in length after trimming were used as inputs. We ran EMIRGE for 50 iterations to reconstruct SSU sequences. Next, 16S rRNA clone sequences (>1200 bp) were appended to a trimmed (length between 1200 and 1800 bp) version of the Silva small subunit rRNA database (SSURef version 102) that had been clustered at 97% identity to remove similar sequences. Ambiguous characters in the database were replaced with a random nucleotide. The combined clone plus Silva dataset was used as the input database for the final EMIRGE analysis. Data from each sample were processed separately with standard parameters for 120 iterations, and SSU sequences (>1200 bp) with relative abundances of >1% were included.
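The read-trimming rule described above (trim from the 3′ end until a base with quality ≥3 is reached, then keep only pairs in which both mates are at least 60 nt) can be sketched as follows. This is an illustrative reconstruction, not the EMIRGE code or the authors' script: the function names are hypothetical, and quality scores are assumed to be already decoded to integers (the FASTQ offset is not stated in the text).

```python
def trim_3prime(seq, quals, min_q=3):
    """Drop bases from the 3' end until one with quality >= min_q is found.

    seq   : read sequence (str)
    quals : per-base integer quality scores, same length as seq (assumed decoded)
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def keep_pair(read1, read2, min_len=60, min_q=3):
    """Trim both mates; return the trimmed pair, or None if either mate
    is shorter than min_len after trimming."""
    t1 = trim_3prime(*read1, min_q=min_q)
    t2 = trim_3prime(*read2, min_q=min_q)
    if len(t1[0]) >= min_len and len(t2[0]) >= min_len:
        return t1, t2
    return None


# Toy example: the three trailing low-quality bases are removed; the internal
# low-quality base at position 4 is kept because trimming stops at the first
# acceptable base encountered from the 3' end.
trimmed_seq, trimmed_q = trim_3prime("ACGTACGT", [30, 30, 30, 2, 30, 2, 2, 1])

# An 85-nt mate with 30 low-quality trailing bases trims to 55 nt, so the pair fails.
good = ("A" * 85, [30] * 85)
bad = ("A" * 85, [30] * 55 + [2] * 30)
```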

16S rRNA clone libraries were constructed from samples A and D. Bacterial 16S rRNA genes were amplified using universal primers 27F (5’ AGAGTTTGATCCTGGCTCAG) and 1492R (5’ ACGGGCGGTGTGTAC) and ligated into pCR4-TOPO vectors (Invitrogen, CA). Ligated plasmids were transformed into E. coli TOP10 electrocompetent cells according to the manufacturer’s recommended protocol (Invitrogen, CA). Clones were randomly selected, and inserts were sequenced bi-directionally using M13 vector-specific primers. Sequences were primer and vector screened using cross_match, quality scored using Phred, and assembled into contigs using Phrap. Sequences were trimmed to retain only bases with Phred quality ≥ Q20, and high quality contigs were tested for chimeras using UClust v. 3.0 (36).

For the 16S rRNA phylogenetic analysis, SSU sequences were first aligned with MUSCLE (37) using default parameters. Columns in the full alignment with gaps were removed from the alignment for tree construction and pairwise percent identity calculations. The alignment was used to generate a 16S rRNA-based maximum likelihood tree with RAxML (38) using the GTRCAT model of nucleotide substitution and 1000 bootstrap replicates. Trees were viewed using iTOL (39). Additionally, we uploaded EMIRGE and clone library 16S rRNA sequences to Greengenes and aligned them to create a second, 7682-character NAST multiple sequence alignment (MSA), with common gaps removed by the lane-masking program (40). The tree built from this second alignment showed no differences in topology (data not shown).

Assembly

Illumina sequences from each sample (A, C, and D) were assembled individually using Abyss (41), then co-assembled using Velvet (42). Using Velvet, the A, C, and D datasets were assembled individually and in combination, and the results compared. Based on this comparison, we find no indication that co-assembly was a significant source of chimeras (genome assembly depends on overlapping near identical sequences, which are rarely anticipated in genomes from different organisms, and become increasingly improbable as evolutionary distance increases).

To further ensure chimeric assemblies were not a problem in the ACD dataset, we employed a number of complementary approaches to avoid and detect errors. We have in-house scripts to check for mate-pair consistency, ensuring all bases from reported contigs are flanked by multiple, correctly oriented paired-end reads (chimeric contigs will fail this check). We performed comparative analyses of multiple assemblies using different assembly algorithms (which have different strengths and weaknesses). We tested for sampling depth consistency across the three datasets (chimeric contigs would often show inconsistent coverage profiles between the falsely joined segments). We manually performed extensive phylogenetic analyses using many different marker genes. Typically, we find that assembly errors, when they occur, join fragments from the same genome (e.g., due to intragenomic repeats), not fragments from different genomes.

For the co-assembly, we used an iterative assembly approach (43) to address the problem of multiple coverage levels in metagenomic data (due to variable abundance levels of different organisms).

Sequencing used in the assembly:

Sample   Total reads   Total sequencing (bps)
A        85,465,255    7,104,464,305
C        79,421,294    6,534,671,196
D        67,992,430    5,545,128,096

Summary iteration steps for the combined (A+C+D) assembly:

           Parameters for iteration                        Results of iteration
Iteration  k-mer size  Expected coverage  Coverage cutoff  Scaffolds  Total sequence (bps)
1          67          80.3               56               779        3,077,091
2          65          59.5               33.8             1,257      6,602,257
3          63          46.1               28.2             2,025      3,330,504
4          55          30.6               16.6             5,263      8,310,386
5          47          27.1               17.2             10,069     10,531,216
6          41          19.6               11.1             16,270     25,790,291
7          39          13.7               16.3             64,614     83,449,037
8          35          6.3                4                519,586    242,129,100

We focus our analyses on the Velvet co-assembly, which provided better results.

However, for two reasons, the Abyss assembly provides support for the reliability of our approach. First, we compared genomes for the same organism (ACD2, ACD49, ACD78 and ACD80 genomes) from both assemblies to confirm overall that essentially the same genome was reconstructed using the two methods (in a few cases we noted discrepancy in fragment order; genomes from different organisms appeared not to co-assemble). Second, we compared the genome of a reference isolate co-recovered in our analyses (ACD53) with its known genome sequence. Geobacter sulfurreducens isolate cells were introduced during sample collection (~2 % of the samples). We manually curated the Abyss assembly of genome fragments from this organism and recovered essentially the complete genome.

For the A, C, and D datasets, 50.0%, 59.4%, and 58.5% of reads mapped back to the assembled scaffolds, respectively, resulting in an average of 56.0% reads mapping to the ACD co-assembly.
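The average mapping rate reported for the A, C, and D datasets can be checked with a short calculation. The sketch below uses only numbers given in this section (per-sample mapping percentages and total read counts); the variable names are illustrative, and the read-weighted mean is shown only for comparison, since the text does not say which kind of average was used.

```python
# Percent of reads mapping back to the assembled scaffolds (from the text).
mapped_pct = {"A": 50.0, "C": 59.4, "D": 58.5}

# Total reads per sample (from the sequencing table).
total_reads = {"A": 85_465_255, "C": 79_421_294, "D": 67_992_430}

# Simple (unweighted) mean across the three samples.
unweighted_mean = sum(mapped_pct.values()) / len(mapped_pct)

# Mean weighted by the number of reads in each sample, for comparison.
weighted_mean = (
    sum(mapped_pct[s] * total_reads[s] for s in mapped_pct)
    / sum(total_reads.values())
)

print(f"unweighted mean:    {unweighted_mean:.1f}%")
print(f"read-weighted mean: {weighted_mean:.1f}%")
```

The unweighted mean reproduces the 56.0% quoted in the text; weighting by read count gives a slightly lower value because sample A, which has the most reads, has the lowest mapping rate.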

Functional annotation

Assembled contigs were annotated by predicting open reading frames using Prodigal (44) and RNAs using Infernal (45) and tRNAscan (46). Proteins were then compared using BLAST (47) to KEGG (48) and UniRef90 (49), and matches greater than 60 bits were reported. Reverse BLASTs were also conducted to identify reciprocal best BLAST hits. We also analyzed proteins using InterProScan (50) to identify significant motif signatures. Finally, the collection of annotations for each protein was ranked: reciprocal best BLAST hits (RBH) were given the highest rank, followed by BLAST hits to KEGG/UniRef90 with a bit score greater than 60. The next rank represents proteins that had only InterProScan matches. Our lowest rank comprises the hypothetical proteins, with only a prediction from Prodigal. Up to 47% of genes lack any functional clue. Notably, this is an underestimate, because proteins that contained a “domain of unknown function” were excluded from this count.

Phylogenetic analyses

A phylogenetic pipeline was established for the construction of single gene protein trees. Protein sequences were aligned using MUSCLE with default settings (37). Problematic regions of the alignment were removed using very liberal curation standards according to (51) using the program GBlocks (52). Best models of amino acid substitution for each protein alignment were estimated using ProtTest (53). Phylogenetic trees were generated using RAxML (38) with the PROTCAT setting for the rate model and the best model of amino acid substitution specified by ProtTest. Nodal support was estimated based on 1000 bootstrap replications using the rapid bootstrapping option implemented in RAxML.

Single gene phylogenetic analyses were performed using full-length RecA, DNA gyrase B, and RpoB amino acid sequences reconstructed from 32, 31, and 31 of our genomes, respectively (Figure S3). Reference sequences were the same across all 3 phylogenetic trees, with representative taxa identified by blasting our sequences to nr. We also included single genes for taxa represented in the 16S rRNA analysis (Figure S2). The homologous proteins in these 58 bacterial taxa were extracted using CoGe BLAST (54). These reference sequences were then used to construct single gene trees for full and nearly full-length proteins from the ACD dataset (Figure S8). We note the altered topology associated with the Thermus aquaticus and Deinococcus spp. RecA sequences (Figure S3A) is due to truncated sequence length (~62% of the average RecA length). The presence or absence of these reference sequences did not impact alignment or the phylogenetic placement of the ACD genomes. Figure 3A includes these RecA sequences to illustrate the relative placement of the ACD genomes using identical reference sequences across the protein markers.

In all of the protein trees constructed (only 3 trees are shown for simplicity), four monophyletic groups were identified amongst the CD divisions: BD1-5, PER, OD1, and OP11. The PER lineage, composed of 3 genomes, was always monophyletic but occupied phylogenetically discordant positions across the trees. It is more closely related to BD1-5 in the RpoB tree (93% support) and ribosomal marker gene trees (data not shown), and to OD1 in the RecA and DNA gyrase B trees (supported by 31% and 85% of trees, respectively). Reference organisms and gene accession numbers are included in Table S2.

Binning

Assignment of genome fragments > 2 kb to organism “bins” relied first on tetranucleotide frequency composition, then upon information about genome abundance in each of the three samples. Other information, including paired read linkage to binned contigs, phylogenetic analysis of genes, and BLAST match analysis for well-conserved genes, was used to assign unassigned fragments to bins or, in a few cases, to refine binning.

An emergent self-organizing map (ESOM) was used to analyze tetranucleotide frequencies (9). The website link to the ESOM is provided under Data sharing. The primary map structure was established using 5 kb fragments (all fragments > 10 kb were subdivided into 5 kb segments; see website link below for the ESOM that serves as the entry point into the online dataset). In a second phase of binning, fragments between 2 and 5 kb in length were projected onto the ESOM using their tetranucleotide frequency information. Fragments falling within clusters were assigned to that cluster.
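As a minimal sketch of the input preparation for such a map, the code below splits long fragments into 5 kb segments and computes a tetranucleotide frequency vector for each segment. It is not the authors' pipeline: the function names are hypothetical, and several details are assumptions not stated in the text (non-overlapping windows, the trailing remainder merged into the last window, no merging of reverse-complement tetramers, and tetramers containing ambiguous bases ignored).

```python
from collections import Counter
from itertools import product

# All 256 tetranucleotides in a fixed order, so every vector is comparable.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def split_fragment(seq, window=5000, split_above=10000):
    """Fragments longer than split_above are cut into window-sized segments.
    The remainder is appended to the last segment (an assumption)."""
    if len(seq) <= split_above:
        return [seq]
    n = len(seq) // window
    pieces = [seq[i * window:(i + 1) * window] for i in range(n)]
    pieces[-1] += seq[n * window:]
    return pieces


def tetra_freq(seq):
    """Return a 256-element list of tetranucleotide frequencies summing to 1
    (or all zeros if the sequence has no unambiguous tetramer)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)  # skips tetramers with N etc.
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]


# Toy usage: a 12 kb fragment becomes two segments, each described by a vector.
segments = split_fragment("ACGT" * 3000)
vectors = [tetra_freq(s) for s in segments]
```

Each vector would then be one input row for the ESOM; fragments with similar sequence composition end up close together on the map.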

To generate time series abundance information, we computed the fraction of reads contributing to each contig from the A, C, and D datasets for all genome fragments > 5 kb (read abundances were normalized to account for differences in the sizes of the A, C and D genomic datasets). The analysis indicated that some genomes were abundant only early in biostimulation and others showed different distribution patterns. Given that all fragments from a single genome should show the same time series abundance pattern, we coded each point in the ESOM with a value indicating the ratio of read abundance in the first vs. last sample (A/D). Peaks in the histogram of A/D values were used to assign color ranges. This assisted with the separation of adjacent map regions with similar tetranucleotide sequence composition (see 9). Across the 49 genome bins described here, the average length of the largest contig is 71.7 kb (see Table S2 for details).

Assignment of organism names to ACD bins

Phylogenetic analyses on single copy marker genes enabled us to cluster the organisms into phylum-level groups and sub-lineages (e.g., core genes from the 10 OD1-i genomes indicated a consistent clade; see Figure S3). Then, we used 16S rRNA gene sequence information, as follows. Given that assemblies of community metagenomic Illumina data fail in the 16S rRNA regions (due to high sequence conservation in this region), the binned contigs contain no more than one end of this gene. In some cases, we could obtain useful phylogenetic signal from these fragments and sometimes we could link the fragments to the 16S rRNA genes reconstructed via EMIRGE and recovered via traditional clone library analysis (see above) using paired read information. So long as the genome fragment containing the partial rRNA sequence was large enough to be confidently binned, direct cluster identification was possible.

Data sharing

Downloadable supplementary FASTA file includes:

- The 16S rRNA EMIRGE reconstructed sequences
- The consensus clone 16S rRNA sequences (dereplicated, chimera checked, and clustered to 97%). These sequences, in conjunction with SILVA 108, served as the database for the EMIRGE final reconstruction.
- PER-related 16S rRNA sequences
- All reconstructed ACD and reference protein sequences used in phylogenetic analyses presented in this manuscript.

All genomic and proteomic information is included in an open access format. Additional information on binning and the remaining genomic datasets not described here can be accessed via the website:

http://geomicrobiology.berkeley.edu/rifle/acd_ggkbase.html

Supplementary Text

Proteomic verification of alternative coding

Six genomes (ACD2, ACD3, ACD4, ACD49, ACD78, and ACD80) use genetic code 4 (Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code). For more information see: http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG4.
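To make the practical difference between the two codes concrete, the sketch below translates a short sequence under NCBI translation table 11 and table 4. The two tables differ only at TGA, which is a stop codon in code 11 and encodes tryptophan (W) in code 4. The codon tables are written out from the standard NCBI definitions; the function name and the toy sequence are made up for illustration and do not come from the ACD genomes.

```python
from itertools import product

# Codons in NCBI order (first base varies slowest, bases ordered T, C, A, G).
CODONS = ["".join(c) for c in product("TCAG", repeat=3)]

# Amino acid strings for NCBI translation tables 11 and 4; they differ only
# at TGA ('*' in table 11, 'W' in table 4).
AA_CODE_11 = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
AA_CODE_4 = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

TABLES = {
    11: dict(zip(CODONS, AA_CODE_11)),
    4: dict(zip(CODONS, AA_CODE_4)),
}


def translate(dna, table):
    """Translate in frame 0; incomplete trailing codons are dropped and
    codons with ambiguous bases become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(TABLES[table].get(dna[i:i + 3], "X") for i in range(0, usable, 3))


# Hypothetical gene fragment with an in-frame TGA codon.
toy_gene = "ATGTGAAAATGGTAA"
as_code_11 = translate(toy_gene, 11)  # TGA read as stop: ORF is cut short
as_code_4 = translate(toy_gene, 4)    # TGA read as tryptophan: ORF continues
```

Under code 11 the in-frame TGA terminates the reading frame after one residue, which is the kind of gene fragmentation that points to an alternative code.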

We inferred the use of alternative coding because predictions using the standard (code 1) or bacterial (code 11) codes resulted in highly fragmented genes with low coding density (Figure 2). In contrast, gene prediction using code 4 resulted in normal sized open reading frames, and many proteins similar to known proteins. We used proteomics to confirm the use of genetic code 4. To accomplish this, we evaluated coding of tryptophan in W-containing peptides that were confidently identified by mass spectrometry for the proteome predicted using code 4. Many peptides in which W was coded by TGG and TGA (normally a stop codon) were identified (see one example of TGA coding W in Figure 2, main text). Systematic analysis of codons for proteomically identified peptides confirmed the use of genetic code 4.

Hydrogenase analysis

All members of one OD1 sub-lineage, OD1-i, have at least one cytoplasmic 3b hydrogenase (Figure S4, Table S5) with the conserved and essential residues for the NiFe active center, metal coordination, and proton transfer (Figures S4, S5). These hydrogenases are generally more closely related to each other than to other known sequences and form a divergent Bacterial clade within the type 3b Archaeal sulfhydrogenases. ACD9 contains two type 3b hydrogenases, the second of which clades with the Bacterial, not Archaeal, type 3b hydrogenases.

Investigation of the type 3b sequence alignments revealed that some proteins have slight modifications at the hydrogenase active sites. ACD57 has threonine in place of glutamate at position 42 of the metal active site. In addition, we identified a highly conserved His at position 552 in some sequences. In the hydrogenase of Desulfovibrio vulgaris MF, the His at position 552 coordinates an extra metal site (55).

Type 4 membrane bound hydrogenases are encoded by six OD1 genomes, four of which belong to members of OD1-i. The sequences are 67% similar to each other and form a monophyletic group with putative membrane bound hydrogenases identified from genome annotations (not confirmed physiologically) from Methylococcus capsulatus strain Bath, Candidatus Nitrospira defluvii, and NC10 Dutch sediment bacterium. Our membrane hydrogenases also show sequence similarity to the well characterized Hyc and Hyf from E. coli (17).

The type 4 hydrogenases are targeted to the membrane, but they, and the others in their monophyletic group, lack key functional residues typical of classic NiFe hydrogenases. The type 4 hydrogenases have modifications from the classical NiFe active site (the N-terminal and C-terminal CxxC motifs, RxCGxCxxxH and DPCxxCxxH/R, respectively; Figure S4). Previously, active site variants have been noted in the type 4 Mbx hydrogenases from the Thermococcales (22). Biochemical and sequence analysis suggested that the Mbx variant hydrogenases were capable of membrane proton translocation and of sulfur, rather than proton, reduction. In addition to the sequence similarity to known proton-translocating hydrogenases, the gene neighbors of our group 4 hydrogenase large subunit encode membrane-spanning proteins with homology to E. coli group 3 and 4 hydrogenase and NADH dehydrogenase complex 1 transmembrane subunits responsible for proton translocation. Thus, despite the aberrant residues in NiFe catalytic sites, the OD1 group 4 hydrogenases may be capable of augmenting ATP synthesis by proton translocation.
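The classical NiFe active-site motifs named above translate directly into regular expressions, with each "x" standing for any residue. The sketch below shows how a protein sequence could be screened for them. It is only an illustration of the pattern matching: the function name and both toy sequences are invented, and the actual analysis in this work was based on sequence alignments (Figure S4), not on this code.

```python
import re

# N-terminal motif RxCGxCxxxH and C-terminal motif DPCxxCxxH/R,
# with "x" = any residue and H/R = histidine or arginine.
N_TERM_MOTIF = re.compile(r"R.CG.C...H")
C_TERM_MOTIF = re.compile(r"DPC..C..[HR]")


def find_nife_motifs(protein):
    """Return 0-based start positions of each classical motif in a protein."""
    return {
        "n_terminal": [m.start() for m in N_TERM_MOTIF.finditer(protein)],
        "c_terminal": [m.start() for m in C_TERM_MOTIF.finditer(protein)],
    }


# Toy sequences (hypothetical): one carrying both classical motifs, and a
# variant in which a cysteine in each motif has been replaced by serine.
classical = "MKRACGTCAAAHLLDPCAACGGHK"
variant = "MKRACGTSAAAHLLDPSAACGGHK"

classical_hits = find_nife_motifs(classical)
variant_hits = find_nife_motifs(variant)
```

A sequence like the variant would be reported as lacking the classical motifs, which is the situation described for the OD1 type 4 hydrogenases.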

The proteins required for NiFe hydrogenase maturation are the products of the hydrogenase pleiotropic (hyp) genes (hypA, hypB, hypC, hypD, hypE, and hypF). In E. coli, a single copy of each maturation gene is present in the genome, and together the products are responsible for the maturation of three active isoenzymes (56). Most commonly, the six proteins are responsible for nickel incorporation/maturation (HypA and HypD), nickel insertion (HypB), chaperone maturation (HypC), purine derivative binding (HypE), and CN/CO delivery (HypF). In E. coli, HypB, HypD, HypE, and HypF are involved in the maturation of hydrogenases 1, 2, and 3, whereas HypA and HypC act only in the hydrogenase 3 system.

We could identify five of the six genes in the hyp operon in many of the most complete OD1-i genomes (Table S5). Notably, all 49 genomes analyzed in this study lacked the hypB gene, an unexpected finding given that we were able to identify 7 HypB proteins from equally complete genomes belonging to members of the Bacteroidetes and Proteobacteria in the same dataset (87 genomes were recovered; only data for 49 are reported here). We searched for HypB homologs using gene identity and genome synteny (the consistent position of hypB near hypA in most characterized genomes), but we failed to identify candidate hypB genes. While HypB is generally regarded as a requirement for the GTP hydrolysis essential for nickel insertion, the effects of site-directed hypB mutations can be partially reversed by the addition of high concentrations of nickel ions to the culture medium (57).

The collective abundance and distribution of 3b and 4 hydrogenases from OP11 and OD1 lineages corroborate our hypothesis that these organisms are fermentative, and as such they must have enzymatic methods for disposing of reducing equivalents generated by the oxidation of carbon. Additionally, our identification and characterization of 17 full-length hydrogenases, many with all necessary residues for functionality, demonstrates the bounty of biotechnologically significant gene sequence information residing in the uncultivated majority of bacteria.

RuBisCO analysis

As further support for the phylogeny of hybrid II/III, we did not identify small RuBisCO subunits, as expected because only type I RuBisCO are composed of small and large subunits (23). We compared our new RuBisCO sequences to sequences of other type II, III, and putative type II/III to evaluate sequence conservation within and between groups (Table S1C). Pairwise amino-acid sequence comparisons indicate that the largest divergence within the new RuBisCO type II/III group is between the two PER and ACD80 genomes (55 and 56%). The ACD80 RuBisCO is most similar to MBR from M. burtonii (74%) and the other methanogenic Archaea. In contrast, the divergence between the type II/III RuBisCOs and type II is ~40% (to R. capsulatus type II CbbM) and between the type II/III and type III is ~33% (M. jannaschii type III RbcL). In comparison, sequence conservation for a given type ranges from 49% (type IV) to 85% (type 1B) (23). The low sequence conservation between the eight type II/III RuBisCOs and recognized form II and III RuBisCOs supports the phylogenetic inference that the II/III lineage constitutes an independent RuBisCO type divergent from types II and III.
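As a rough illustration of how pairwise values of the kind reported in Table S1C could be derived from an alignment, the Python sketch below computes percent identity between two aligned sequences. The supplement does not state the exact identity definition used (for example, how gapped columns were treated), so the gap handling, the function name `percent_identity`, and the toy sequences are assumptions for illustration, not the authors' procedure.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: columns with a gap ('-') in either sequence are ignored.
    Other conventions (e.g. dividing by the full alignment length) give
    different numbers.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip gapped columns
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical toy alignment, not taken from the RuBisCO data.
toy = {"seq_a": "MKV-LAG", "seq_b": "MKVALSG", "seq_c": "MRV-LAG"}
names = sorted(toy)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        value = percent_identity(toy[first], toy[second])
        print(f"{first} vs {second}: {value:.1f}%")
```

Applied to every pair in a group, the same function gives the within-group and between-group ranges that the paragraph above compares.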

A variant GAW sequence, as opposed to GGG, occurs in ACD80, with a G-to-A variant at position 423 and a G-to-W variant at position 424 (Figure S6). Notably, the MBR sequence also has a variant at the second glycine (serine, rather than the alanine found in ACD80) but is invariant with respect to the first and third glycines. It was previously noted in mutagenesis of R. rubrum RuBisCO that replacement of the second-position glycine with an alanine results in production of 1 pyruvate and 1 PGA, rather than the 2 PGA products (58). This misfiring may be less detrimental in ACD80 than in organisms whose RuBisCO functions in the CBB pathway, because ACD80 could still recover energy by fermentation (and substrate-level phosphorylation) of pyruvate. Moreover, modeling of ACD80 to RuBisCO 3A12 (type III, Thermococcus sp. KOD1) suggests that a tryptophan at the third position could significantly distort the P1 binding site; together with the already addressed perturbations caused by the alanine side chain in the second position, this could impact the functionality of the ACD80 RuBisCO. However, given that modeling suggests that the type III template accommodates the substitutions in ACD80 better than the type II template, it is possible that this RuBisCO may still function, albeit likely as a low-turnover, misfiring RuBisCO that produces 1 pyruvate and 1 PGA rather than 2 PGA.
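The motif comparison described above can be sketched in a few lines of Python. The fragments below are short stretches copied from the Figure S6 alignment around the glycine motif; the function name `motif_variants`, the fragment boundaries, and the motif-relative numbering (1 to 3, rather than the alignment positions 423 and 424 quoted in the text) are assumptions made for this illustration.

```python
REFERENCE_MOTIF = "GGG"  # conserved glycine triplet discussed in the text
MOTIF_START = 3          # 0-based offset of the motif within each fragment below

# 13-residue fragments taken from the Figure S6 alignment (same columns).
fragments = {
    "ACD80": "TTMGAWCHAHPGG",
    "MBR":   "TTMGSGVHSHPGG",
    "5RUB":  "LTAGGGAFGHIDG",
    "3A12":  "LQLGGGTLGHPDG",
}


def motif_variants(fragment, start=MOTIF_START, reference=REFERENCE_MOTIF):
    """Return (motif position, reference residue, observed residue) for each mismatch."""
    observed = fragment[start:start + len(reference)]
    if len(observed) != len(reference):
        raise ValueError("fragment is too short to contain the motif")
    return [
        (i + 1, ref, obs)
        for i, (ref, obs) in enumerate(zip(reference, observed))
        if ref != obs
    ]


for name, fragment in fragments.items():
    print(name, motif_variants(fragment))
```

Under these assumptions, ACD80 differs from the reference at motif positions 2 and 3 (A and W), MBR differs only at position 2 (S), and the type II and type III references show no substitutions.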

The bacterial RuBisCO identified here lacks support for CBB pathway functionality, and we suggest that it functions similarly to the archaeal (type III) RuBisCO. In the archaeal RuBisCO pathway, adenine is released from AMP and the phosphoribose moiety enters central carbon metabolism. Thus type III RuBisCO, and likely the type II/III RuBisCO from MBR, enable anaerobic bacteria to use AMP (when energy levels are low and/or intracellular AMP is in excess) for the production of 3-phosphoglycerate, which can be used for ATP generation as follows (23, 24):

AMP + Pi ---> A + Ribose 1,5-bisphosphate (DeoA)
Ribose 1,5-bisphosphate ---> Ribulose 1,5-bisphosphate (E2b2)
Ribulose 1,5-bisphosphate + CO2 ---> 3-Phosphoglycerate (RuBisCO)

Genome completion

To analyze genome completeness, we surveyed genes expected to be in single copy, based on those identified in (11). This analysis, and subsequent phylogenetic analyses of these marker genes, also aided in confirming the genomic binning (from the ESOM). Only genome bins that contained ≤1 genome/bin were included in the phyla-level genome completion summary (Table 1).
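A minimal sketch of this kind of single-copy marker survey is given below in Python. The marker names follow the row labels of Figure S8; the summary statistics (fraction of markers present, and any marker in more than one copy treated as a sign of multiple genomes in a bin) are assumptions about how such a survey could be scored, since the supplement does not give an explicit formula. The example bin is hypothetical and simply mirrors the pattern described for a near-complete bin lacking L11 and S9.

```python
from typing import Dict, List

RIBOSOMAL_L = [1, 2, 3, 4, 5, 6, 10, 11, 13, 14, 15, 16, 18]
RIBOSOMAL_S = [2, 3, 4, 7, 8, 9, 10, 11, 12, 13, 15, 17, 19]

# 30 markers, named after the Figure S8 row labels.
MARKERS: List[str] = (
    ["RecA", "DNA gyrase A", "DNA gyrase B", "RpoB"]
    + [f"L{n}" for n in RIBOSOMAL_L]
    + [f"S{n}" for n in RIBOSOMAL_S]
)


def summarize_bin(copy_counts: Dict[str, int], markers: List[str] = MARKERS) -> dict:
    """Summarize marker presence and copy number for one genome bin.

    Assumed scoring: completeness is the fraction of markers found at least
    once; markers found more than once are reported as possible evidence of
    more than one genome in the bin.
    """
    missing = [m for m in markers if copy_counts.get(m, 0) == 0]
    multi_copy = [m for m in markers if copy_counts.get(m, 0) > 1]
    return {
        "completeness": (len(markers) - len(missing)) / len(markers),
        "missing": missing,
        "multi_copy": multi_copy,
        "single_genome": not multi_copy,
    }


# Hypothetical bin: every marker once, except L11 and S9, which are absent.
example_bin = {m: 1 for m in MARKERS if m not in ("L11", "S9")}
summary = summarize_bin(example_bin)
print(f"completeness: {summary['completeness']:.1%}")
print("missing:", summary["missing"])
print("multi-copy:", summary["multi_copy"])
```

A bin passing the "≤1 genome/bin" criterion would, in this sketch, be one whose `multi_copy` list is empty.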

An interactive website with downloadable FASTA files for the data in Figures S3 and S8 can be accessed here: http://genegrabber.berkeley.edu/rifle_acd/genome_summaries/15-Wrightonetal_FigS8_GenomeCompletion

PER-related analyses

Given the novelty of the PER genomes (ACD28, ACD51, and ACD65), we have taken great care to control for chimeric sequences that may have resulted from assembly or binning procedures. We confirmed these genomes were not a synthetic composite of different phylogenetic lineages by controlling for assembly chimeras (addressed in “Assembly” section above). In addition, we regenerated a separate assembly for ACD28, a near-complete PER genome, using reads that map to the assembled scaffolds. In addition to the bioinformatic verification mentioned above, we visually confirmed the absence of chimeric assembly with manual curation (using the Consed program).

The binning of the PER genomes is extremely well defined in the ESOM analysis. The ESOM is presented on the website link provided under the “Data sharing” section above. The clearly defined bin (e.g. ACD28) indicates highly-distinct tetranucleotide and abundance patterns.

The PER genome bins (ACD28 and ACD51) contain only one copy of each of 30 conserved single copy genes (based on conserved markers in Raes et al and PhyloSift), strongly supporting the conclusion that these genomes are near-complete and free of contaminating contigs from outside genomes. ACD51 contains all the marker genes in single copy, while the ACD28 genome only has one copy of each gene but is missing 2 genes (L11 and S9) (Figure S8). While the binning and single copy gene analysis for the partially complete ACD65 is less coherent than the other PER genomes, this genome bin was used only to support statements pertaining to the physiology and phylogeny of the other PER genomes, and was not used as a basis for any independent analyses.

Phylogenetic analyses conducted with these marker genes (Figures S3 and S8) revealed that the three PER genomes remain monophyletic across a very large number of phylogenetically informative genes, including multiple 30S and 50S ribosomal proteins, DNA gyrase A, DNA gyrase B, rpoB, and recA. A subset of the most diverse topologies is shown in Figure S3 and addressed above in the section entitled “Phylogenetic analyses”. In addition to the well-supported and congruent phylogenetic signal, all PER genomes lack the alternate coding associated with the BD1-5 lineage genomes and share very similar metabolic strategies that are unique to the PER genomes (Figure 3B).

As part of our reassembly efforts, we recovered a partial 16S rRNA sequence (~327 bp) from the ACD28 genome. When we aligned this sequence to a database containing ACD EMIRGE and clone 16S rRNA sequences generated as part of this analysis, a single clone, ACD clone77 (1453 bp), shared 87% identity across the entire ACD28 fragment length. This clone accounted for 0.04% of the clone library 16S rRNA diversity. The near-full-length 16S ACD clone77 was aligned to Greengenes, RDP, and SILVA/ARB databases to recruit other related 16S rRNA sequences.

In all three databases ACD clone77 was identified as affiliating with phylogenetically “Unclassified” Bacterial sequences. The nearest isolate sequence in GenBank, a member of the Epsilonproteobacteria, Helicobacter cetorum MIT 00-718 (78% identity, 88% of 16S rRNA length, Accession CP0034791), had very low overall sequence similarity to the ACD PER sequences. The most closely related near-full-length (>1200 bp) 16S rRNA sequences were all from uncultivated clones, with the most similar sequences in the SILVA database from accession numbers FJ12516 (85%, 1095 bp), EU24609 (87%, 1090 bp), and HQ672998 (85%, 1086 bp, identified as a member of BD1-5 in ARB-SILVA), with additional uncultivated 16S rRNA sequences included from Greengenes and GenBank. These PER-related 16S rRNA sequences were recovered from a diversity of environments including submarine volcanic sediments, microbial mats, and an oxygen minimum zone in the Pacific Ocean.

In all phylogenetic trees based on maximum likelihood (RAxML, GTRCAT), inferred approximate-maximum-likelihood (FastTree, -gtr -nt), maximum parsimony, and evolutionary distance, the ACD PER sequences (ACD clone77 and the ACD28 fragment) and identified PER-related reference sequences formed a clade together, independent from all other phyla including the nearest neighboring phyla of BD1-5, OD1, and OP11 sequences (Figure S9). The 16S rRNA PER clade was strongly supported by high bootstrap values (from 89% in neighbor-joining to 100% in RAxML) and was clearly separated from the nearest phylum (BD1-5). Within the PER clade, RAxML, neighbor joining, and maximum parsimony trees demonstrated congruent topology (Figure S9). Members of the PER clade (>1000 bp) shared an average within-phylum sequence identity of 87% for sequences (>700 bp). The PER sequences only shared a 77-81% 16S rRNA gene sequence similarity to members of the BD1-5 phylum, and less than 80% sequence similarity to members of the more phylogenetically distant OD1, OP11, and Proteobacteria phyla.
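The identity comparison in this paragraph can be expressed as a simple threshold check, sketched here in Python. The 85% value is the 16S rRNA cutoff for distinguishing phyla that this supplement attributes to Hugenholtz et al. (27); the function name `summarize_identities` and the individual identity values are hypothetical, chosen only to fall within the ranges reported above (about 87% within the PER clade, 77-81% to BD1-5).

```python
from statistics import mean

PHYLUM_CUTOFF = 85.0  # percent 16S rRNA identity, cutoff cited in the text (27)


def summarize_identities(within, between, cutoff=PHYLUM_CUTOFF):
    """Compare within-clade and between-clade 16S rRNA identities to a cutoff.

    within:  percent identities among members of the candidate clade
    between: percent identities between the clade and its nearest neighbor
    """
    return {
        "mean_within": mean(within),
        "max_between": max(between),
        # True when even the closest outside sequence falls below the cutoff
        "distinct_at_cutoff": max(between) < cutoff,
    }


# Hypothetical values for illustration only.
per_within = [86.0, 87.0, 88.0]
per_vs_bd15 = [77.0, 79.0, 81.0]
print(summarize_identities(per_within, per_vs_bd15))
```

This check covers only the 16S rRNA criterion; the phylum-level inference in the text also rests on tree topology, bootstrap support, and the protein-coding gene phylogenies.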

The 16S rRNA and protein phylogenetic results show that the PER lineage is monophyletic in all analyses, with good bootstrap support. Hugenholtz et al. (27) recommended 85% 16S rRNA sequence identity as a cutoff for distinguishing new phyla. Incorporating this 16S rRNA criterion, and the additional support gleaned from numerous protein-coding phylogenetic analyses, the PER clade is inferred to represent a new phylum-level lineage in the domain Bacteria.

Previous detection of BD1-5

Prior 16S rRNA clone library analysis by our laboratory of biofilms from cold, near-neutral pH, sulfide-rich springs at Alum Rock, CA yielded BD1-5 sequences. A 1448 bp sequence (GenBank: GQ355005) is the closest representative (96%) in SILVA to the most abundant 16S rRNA sequence in our dataset.

Figure S1. Diagram indicating the change in solution chemistry (Fe(II), acetate, and sulfide concentrations) over the experiment.

Figure S4. Amino acid alignment of selected NiFe hydrogenase large subunits. Residues binding the Ni-Fe site (green shading), conserved residues surrounding the Ni-Fe site (blue background), amino acids coordinating Mg or Fe2+ (yellow background). A. Type 3b sulfhydrogenase reference sequences NP578623 Pyrococcus furiosus DSM3638, NP126548 Pyrococcus abyssi GE5, and YP184482 Thermococcus kodakarensis KOD1 are included for comparison. B. The N-terminal (black box) and C-terminal (red box) CXXC motifs involved in (NiFe) coordination in type 4b hydrogenases. Included for reference are classical type 4 HyfG and HycE hydrogenases from Escherichia coli strain K12, as well as a type 4 Mbh and 3b hydrogenase from P. furiosus DSM 3638 and variant Mbx hydrogenases from P. furiosus DSM 3638, P. abyssi GE5, and Thermotoga maritima.

A.

ACD63_24994_43975_9G0014_OD1 ---------------------------MRIKIDHIARTEGHIGFVSDIVKGDIKKAQLQTLEGARLFEGILSGRRYFEVGEIAQRICGVCPVVHCLDAIK 73
ACD9_12676_12654_9G0013_OD1_i MQIQIKIHTKYQIPAYWTGRQHTKYELMNIKINHIAKIEGHTGFMASVLQGDVKSAKFEVKEGVRLIEGILIGRHYKDMPVIAQRICGICPVVHNLTSIK 100
ACD58_25121_63638_11G0012_OD1 --------------------------MSLVEQHYITKIEGHGTLNINFRQ---CHAKLEIDEGERFFEALLVGRPYTDGPFITSRICGVCPVAHTLASIK 71
ACD9_288_4022_47G0001_OD1_i -----------------------MTNNIHVDVHHITRVEGHGNIKVDIENGVIKECNLAIVEAPRFFEAMVRGRHFDEATVVVSRICGICAVGHQLASLA 77
Grp3b_NP126548_Pyrococcus_abys ----------------------MRNLYIPITVDHIARVEGKGGVEIIVGDEGVKEVKLNIIEGPRFFEAITIGKKLEEALAIYPRICSFCSAAHKLTALE 78
Grp3b_NP578623_sulfhydrogenase ----------------------MKNLYLPITIDHIARVEGKGGVEIIIGDDGVKEVKLNIIEGPRFFEAITIGKKLEEALAIYPRICSFCSAAHKLTALE 78
Grp3b_YP184482_Thermococcus_ko ----------------------MKNVYLPITVDHIARVEGKGGVEIVVGDDGVKEVKLNIIEGPRFFEAITLGKKLDEALAIYPRICSFCSAAHKLTAVE 78

ACD63_24994_43975_9G0014_OD1 AVENAMGVDVSDETVLLRKMIMAGQMIQSHTLHLYFLSLPDFLDYRDDLKMVEDYPNRSKDIIKLRDFGNMIISLIGGRSVHPVSPKVGGFTKYPSKDAV 173
ACD9_12676_12654_9G0013_OD1_i AIENAMKIEVSDETKKLRKVMEHAQFIHSHALHLFFLSLADFLDIENDLQLVKEYPEHTKNAVKIREYGMDLIRTIGGRVVHPLTNEVGGFKKVPTSEEI 200
ACD58_25121_63638_11G0012_OD1 ALESALDIMIDDNTIRLRKILLASQIIQSHALHLFFLTLPDYIGLSSTLELHQTKPELFKIALSLKNLGDKITTIIGGRNVHPINLVVGGFSKIPSKNEL 171
ACD9_288_4022_47G0001_OD1_i ATEDAFGIVLSRQERLLRRLMNCGEFFESHVLHIYFLAVPDFVGAKSVLPLVKTHKDVVVQALKLKRVGHSIGSIIGGRLIHPTSLFPKAMTMIPTREQL 177
Grp3b_NP126548_Pyrococcus_abys AAEKAIGFTPREEIQALREVLYIGDMIESHALHLYLLVLPDYLGYSSPLKMVNEYKKELEIALKLKNLGSWMMDVLGSRAIHQENAILGGFGKLPSKETL 178
Grp3b_NP578623_sulfhydrogenase AAEKAVGFVPREEIQALREVLYIGDMIESHALHLYLLVLPDYRGYSSPLKMVNEYKREIEIALKLKNLGTWMMDILGSRAIHQENAVLGGFGKLPEKSVL 178
Grp3b_YP184482_Thermococcus_ko AAEKAIGFTPREEIQALREVLYIGDMIESHALHLYLLVLPDYLGYSGPLHMIDEYKKEMSIALDLKNLGSWMMDELGSRAIHQENAVLGGFGKLPDKSVL 178

ACD63_24994_43975_9G0014_OD1 LKLAREGNKLMSIVLDLSDMFNKLA-FPKFKRETEYIALKHKG-EYAVYSGSIASTSGTAVPVKKFVNSISEIQKPYDLVKSATRKGKP-YLVGALARIN 270
ACD9_12676_12654_9G0013_OD1_i TELITKGESVLPVVFELAAFFKTIK-TPNFSRPTEYISLKTSG-EYAIYDGNVASNKGLNIDTQQFENNFHEFQRPKEIVKRVQHDGKTSYMVGAIARIN 298
ACD58_25121_63638_11G0012_OD1 QEIKLACQNNLTFTQETVKLFASFN-YPKISNPVEYMSLVNGE-SYETYDGKVETSKNYIFDPINYKKEIIEKVKSYSSAKFSTHDGQP-MMVGGLARIS 268
ACD9_288_4022_47G0001_OD1_i EEIIIKLKDAQNELKQGVKTLKTLT-LPQFERETEYVSLRKKG-EYAINEGNIYSSDTGETSYKNYRDFTNEYLVDHSTAKRCKHNRSS-FMVGALARFN 274
Grp3b_NP126548_Pyrococcus_abys EEMKAKLRESLSLAEYTFELFAKLEQYREVEGEITHLAVKPRGDVYGIYGDYIKASDGEEFPSEDYKEHINEFVVEHSFAKHSHYKGKP-FMVGAISRVV 277
Grp3b_NP578623_sulfhydrogenase EKMKAELREALPLAEYTFELFAKLEQYSEVEGPITHLAVKPRGDAYGIYGDYIKASDGEEFPSEKYRDYIKEFVVEHSFAKHSHYKGRP-FMVGAISRVI 277
Grp3b_YP184482_Thermococcus_ko ENMKRRLKEALPKAEYTFELFTKLEQYEEVEGPITHIAVKPRNGVYGIYGDYLKASDGNEFPSEEYREHIKEFVVEHSFAKHSHYHGKP-FMVGAISRLV 277

ACD63_24994_43975_9G0014_OD1 VNSKELNDTARKFLDKSVIDLPDYNPFHNVVAQAIEIIHFIEEFQSLAREFTNA-KSNVPFVEVDIKPGKGVGAMEAPRGTLYHYYEIGSDGKVADCNII 369
ACD9_12676_12654_9G0013_OD1_i NNHEKLSINARGYLKSLDFNMPDYNPFHNVLYQMVEVMHCVEDSVKLLKELSHANLENAITKEYQIREGSAAAAIEAPRGTLYYWVDIDAKGYIKNVNII 398
ACD58_25121_63638_11G0012_OD1 LHHQYLNTKAKKTFNDLKIEMPSYNTFHNNLAQAIEMVHFLEEIIKLCEELINDQSYKSQVISYKLKSGHGIGAIEAPRGVLYHYYELDKKGIIKNCDII 368
ACD9_288_4022_47G0001_OD1_i NNHEYLSDGAKEIASDLGLKAPCYNSYMNTIAQFVELYHIADDTILAAQELLNMKVEVED-RKFPVKAGTGVGAVEVPRGLLYHEYTYDDDGYMKKANCV 373
Grp3b_NP126548_Pyrococcus_abys NNKDLLYGRAKDLYESHKELLKGTNPFANNLAQALELVYFIERAIDLIDEVLIKWPVKER-DKVEVRDGFGVSTTEAPRGILVYALKVE-NGRVAYADII 375
Grp3b_NP578623_sulfhydrogenase NNADLLYGKAKELYEANKDLLKGTNPFANNLAQALEIVYFIERAIDLLDEALAKWPIKPR-DEVEIKDGFGVSTTEAPRGILVYALKVE-NGRVSYADII 375
Grp3b_YP184482_Thermococcus_ko NNADTLYGRAKELYESYKDLLRSTNPFANNLAQALELVYFTERAIDLIDEALAKWPIRPR-DEVALKDGFGVSTTEAPRGVLVYALKVE-NGRVSYADII 375

ACD63_24994_43975_9G0014_OD1 TPTAQFLFHMEQDLNAYLPMLKDKKLS--LQKKEIRKLIRAYDPCISCATH------- 418
ACD9_12676_12654_9G0013_OD1_i TPTAQFLTHLEDDIAAFIPGILDKDEK--VMEKKMRAFIRAYDPCISCAVH------- 447
ACD58_25121_63638_11G0012_OD1 TPTAQNLTNIEEDATALLKEFKLDKLEGKVCIRELEMLIRAYDPCITCSVH------- 419
ACD9_288_4022_47G0001_OD1_i IPTGQNLNNVENDFRALLPTILDKSQE--EITLLLEMMVRAYDPCISCSAHILTVEFE 429
Grp3b_NP126548_Pyrococcus_abys TPTAFNLAMMEEHVRMMAEKHYNDDPE--RLKLLAEMVVRAYDPCISCSVHVVKL--- 428
Grp3b_NP578623_sulfhydrogenase TPTAFNLAMMEEHVRMMAEKHYNDDPE--RLKILAEMVVRAYDPCISCSVHVVRL--- 428
Grp3b_YP184482_Thermococcus_ko TPTAFNLAMMEQHVRMMAEKHYNDDPE--KLKLLAEMVVRAYDPCISCSVHVARL--- 428

B.

ACD9_36209_17157_9G0004 FIERLSGDMAASHALAYVQAVESIAKCDVPKRALMLRTLVCELERITMHI 269
ACD7_123.14940.77G0011 FIERLSGDASASHALAYAQAIEKISSCKVSQRAQIIRMIIAECERITMHV 286
ACD72_17889_5564_7G0005 LSERISGDTSFSHSLAFCQAVEKLTATDVSVRAKYLRIIFAELERIANHI 266
HyfG_gi|85675422|dbj|BAA16375. LSDRVCGICGFAHSVAYTNSVENALGIEVPQRAHTIRSILLEVERLHSHL 287
HycE_gi|85675542|dbj|BAE76798. LSDRVCGICGFAHSTAYTTSVENAMGIQVPERAQMIRAILLEVERLHSHL 285
MbxL_Pfur LLLRICVPEPDVPEAIYSMAVDEIIGWEVPERAQWIRTLVLEMARVTAYL 129
MbxL-like_Paby LLLRICVPESDVPEAIYSLAVDEIIGWEVPERAQWIRTTVLEMARVSAYL 132
MbxL-like_Tmar LIPRICVPEPDINEICYAMAIEKIAKVEVPERAQWIRMIVLELARIANHI 108
MbhL_Pfur LAERMCGICSFSHNHTYVRAVEEMAGIEVPERAEYIRVIVGELERIHSHL 112
Grp3b_NP578623_sulfhydrogenase IYPRICSFCSAAHKLTALEAAEKAVGFVPREEIQALREVLYIGDMIESHA 109

ACD9_36209_17157_9G0004 -LINKSLNLSYSGTDL---------------------------------- 519
ACD7_123.14940.77G0011 -LINKSLNLSYSGNDM---------------------------------- 536
ACD72_17889_5564_7G0005 -LINKSFSLSYTGNDL---------------------------------- 519
HyfG_gi|85675422|dbj|BAA16375. -LIIGSLDPCYSCTDRVTLVDVRKRQSKTVPYKEIERYGIDRNRSPLK-- 555
HycE_gi|85675542|dbj|BAE76798. -LIIGSLDPCYSCTDRMTVVDVRKKKSKVVPYKELERYSIERKNSPLK-- 569
MbxL_Pfur -AILMSLDNCPPDIDR---------------------------------- 391
MbxL-like_Paby -VILMSLDNCPPDIDR---------------------------------- 394
MbxL-like_Tmar -IWLATMDVCAPEIDR---------------------------------- 368
MbhL_Pfur -VAIASIDPCLSCTDRVAIVKEGKKVVLTEKDLLKLSIEKTKEINPNVKG 414
Grp3b_NP578623_sulfhydrogenase EMVVRAYDPCISCSVHVVRL------------------------------ 428

Figure S5. Phylogenetic tree of the NiFe hydrogenases. Sequences belonging to the 3b and 4 subgroups are color-coded: OD1-i (light purple), remaining OD1 (dark purple), and OP11 (green). Bootstrap values greater than 75 are shown as text. NiFe hydrogenase reference sequences not previously characterized but identified only by database searches are denoted as putative. A complete list of organisms and accession numbers can be found in Table S1.

Scale bar: 0.1

Tree tip labels:
Dethiobacter alkaliphilus AHT1, putative (ZP03728990)
Verrucomicrobiae bacterium DG1235, putative (ZP5056260)
ACD58 OD1
ACD63 OD1
ACD15 OD1-i
ACD67 OD1-i
ACD9 OD1-i
ACD8 OD1-i
ACD56 OD1-i
ACD14 OD1
ACD1 OD1-i
ACD5 OD1-i
ACD11 OD1-i
Pyrococcus furiosus DSM 3638, II (NP579061)
Pyrococcus furiosus DSM 3638, I (NP578623)
Pyrococcus abyssi GE5 (NP12654)
Thermococcus kodakarensis KOD1 (YP184482)
Thermococcus barophilus MP (YP004070736)
ACD57 OP11
ACD9 OD1-i, vII
Thermodesulfovibrio yellowstonii DSM 11347 (YP002249644)
Allochromatium vinosum DSM180, putative (YP3442786)
Methanosarcina barkeri (CAA76121)
Thermoanaerobacter tengcongensis MB4 (AAM23431)
Dehalococcoides sp BAV1 (ABQ17368)
Desulfovibrio gigas (AAP51029)
Methanospirillum hungatei JF1 (YP503186)
Clostridium thermocellum ATCC 27405 (YP0010394091)
Desulfosporosinus youngiae DSM 17734, putative (ZP09652926)
Candidatus Kuenenia stuttgartiensis, putative (CAJ72523)
ACD72 OD1
ACD14 OD1
Methylococcus capsula (YP113608)
ACD7 OD1-i
ACD9 OD1-i, vIII
Candidatus Nitrospira defluvii, putative (YP003799849)
NC10 bacterium Dutch sediment, putative (CBE69905)
Thermococcus sp AM4 (EEB73425)
Thermococcus onnurineus NA1 (ACJ15761)
Thermococcus sp AM4 (EEB72991)
Thermococcus onnurineus NA1 (ACJ16511)
Rhodopseudomonas palustris BisB18 (ABD90092)
Salmonella enterica ATCC 9150 (AAV785641)
Escherichia coli (CAA355501)
Pyrococcus furiosus DSM 3638 (NP579163)
Thermococcus kodakarensis KOD1 (YP184504)
Thermococcus onnurineus NA1 (YP002307980)
Pyrococcus abyssi GE5 (NP126404)
Carboxydothermus hydrogenoformans Z2901, CO-oxidizing (YP360647)
Rhodospirillum rubrum, CO-oxidizing (AAC451211)

Clade labels: Group 3b Hydrogenase; Group 3a Bacterial Hydrogenase; Group 3a Archaeal Hydrogenase; Group 3c Archaeal Hydrogenase; To Group 1 and 2 Hydrogenases; Sulf-hydrogenase (type 3b); Energy-generating (type 4)

Figure S6. Structural alignment of RuBisCO proteins to the large subunit of four forms of RuBisCO. RuBisCO and RLP reference sequences from type II/III M. burtonii (MBR), type I spinach (8RUC), type II R. rubrum (5RUB), type III T. kodakaraensis (3A12), and type 4 RuBisCO-like C. tepidum (1YKW) are included. The alignment was readjusted manually according to the three dimensional structure. Substrate binding and catalytic activity residues are shaded in light blue and yellow respectively, while conserved non-active site residues are shaded in red. 10 of the replaced active site residues in the form 4 RuBisCO-like protein (1YKW) are denoted by pink shading. Residues of form I RuBisCO that interact with S subunits and residues of form III RuBisCO involved in L2-L2 interactions are colored in blue and green, respectively. Secondary structure elements (β strands and α helices) of the form III RuBisCO are shown at the top of the sequences, with orange band indicating the extra loop present in ACD80 and MBR. The one non-conservative substrate-binding residue in ACD80 is shaded in pink, and the variant non-glycine residues (position 423) in ACD80 and MBR are shaded in black with white text.

Secondary structure: βB αB α0 βC
ACD80_3200.32411.19G0017 ---MDIKELRATLNSHQAAYVNLDLPN---PQNGEYMLCAFHLVPGEGLNFLQAACEVSAESSTGTNFLVRTET--PFSREMNSLVYKIDLERN------ 86
YP_566926_M. burtonii -MSLIYEDLVKSLDSKQQAYVDLKLPD---PTNGEFLLAVFHMIPGGDLNVLQAAAEIAAESSTGTNIKVSTET--AFSRTMNARVYQLDLERE------ 88
ACD51_48308.71846.10G0003 ---------------MQKEYLKLGFDP---IAAGNYMLVVFHLVPGEGRDLLDAASEVAAESSTGSNLTIGTAT--EFSKSMDALVYKIDEAKN------ 74
ACD65_203648.6442.5G0005 ---------------MQKEYINLKLNP---LKGGKYMLAVFHLVPKPGEDFLSCASEVASESSTGSNLRVGTAT--KFSDNLNAIVYKIDKKKN------ 74
5RUB_A_formII -------------MDQSSRYVNLALKEEDLIAGGEHVLCAYIMKPKAGYGYVATAAHFAAESSTGTNVEVCTTD--DFTRGVDALVYEVDEARE------ 79
8RUC_A_formI MSPQTETKASVEFKAGVKDYKLTYYTPEYETLDT-DILAAFRVSPQPGVPPEEAGAAVAAESSTGTWTTVWTDG-LTNLDRYKGRCYHIEPVAGEENQ-- 96
3A12_A_formIII ---------MVEKFDTIYDY---YVDKGYEPSKKRDIIAVFRVTPAEGYTIEQAAGAVAAESSTGTWTTLYPWYEQERWADLSAKAYDFHDMG--DG--- 84
1YKW_A_formIV ---------------MNAEDVKGFFASRESLDMEQYLVLDYYLESVG--DIETALAHFCSEQSTAQWKRVGVDEDFRLVHAAKVIDYEVIEELEQLSYPV 83

Secondary structure: βD αC βE αD αE β1
ACD80_3200.32411.19G0017 -----------LVWIAYPRRLFDR-----GGNVQNILTYIVGN-VLWMKQINALKLLDVRFPSSMLEQYDGPSYTLDDMRKYLDVYE---RPILGTIIKP 166
YP_566926_M. burtonii -----------LVWIAYPWRLFDR-----GGNVQNILTYIIGN-ILGMKEIQALKLMDIWFPPSMLEQYDGPSYTVDDMRKYLDVYD---RPILGTIVKP 168
ACD51_48308.71846.10G0003 -----------LVWIAYPVDIFDR-----GGNVQNILTYIVGN-VFGMADVKAIKALDCWFPPEMLKNYDGPYTTIGDMKKYLGIDGDA-RPVLGTIIKP 156
ACD65_203648.6442.5G0005 -----------LVWIAFPWKIFDR-----GGNVQNILTYVVGN-VFGMGDLSALKALDCWFPKEMLEHYDGPATTIHDLKKYLGVKG---RPVLGTIVKP 154
5RUB_A_formII -----------LTKIAYPVALFDRNITDGKAMIASFLTLTMGN-NQGMGDVEYAKMHDFYVPEAYRALFDGPSVNISALWKVLGRPEVDGGLVVGTIIKP 167
8RUC_A_formI ----------YICYVAYPLDLFEE------GSVTNMFTSIVGN-VFGFKALRALRLEDLRIPVAYVKTFQGPPHGIQVERDKLNKYG---RPLLGCTIKP 176
3A12_A_formIII ---------SWIVRIAYPFHAFEE------ANLPGLLASIAGN-IFGMKRVKGLRLEDLYFPEKLIREFDGPAFGIEGVRKMLEIKD---RPIYGVVPKP 164
1YKW_A_formIV KHSETGKIHACRVTIAHPHCNFGP-------KIPNLLTAVCGEGTYFTPGVPVVKLMDIHFPDTYLADFEGPKFGIEGLRDILNAHG---RPIFFGVVKP 173

Secondary structure: α1 β2 α2 β3 α3 β4
ACD80_3200.32411.19G0017 KMGLTSAEYAEVAYDFRVGGGDFVKNDEPQANQDFCPYDKMVKHIAEAMAKAVKETGHKKVHSFNVSAADYDTMISRCEMIKNSWMEAGSY-AFLIDGTT 265
YP_566926_M. burtonii KMGLTSAEYAEVCYDFWVGGGDFVKNDEPQANQDFCPYEKMVAHVKEAMDKAVKETGQKKVHSFNVSAADFDTMIERCEMITNAGFEPGSY-AFLIDGIT 267
ACD51_48308.71846.10G0003 KIGLKTDEFADVCYRFWKGGGDFVKFDEPQADQVFCPFEDAVKAIAKKMEQVRKETGKNKVMSFNISAADFMTMQKRAEIVMKY-MEKGSY-AFLVDGLT 254
ACD65_203648.6442.5G0005 KIGLKPKQFADVCYKFWKGGGDFVKFDEPQADQEFCPFKEAIDEIVKAMAKVEKETGKKKVMSINISAADFMTMQKRAEYVIKK-MKKGSY-AFLVDGLT 252
5RUB_A_formII KLGLRPKPFAEACHAFWLGG-DFIKNDEPQGNQPFAPLRDTIALVADAMRRAQDETGEAKLFSANITADDPFEIIARGEYVLETFGENASHVALLVDGYV 266
8RUC_A_formI KLGLSAKNYGRAVYECLRGGLDFTKDDENVNSQPFMRWRDRFLFCAEALYKAQAETGEIKGHYLNATAGTCEDMMKRAVFARELGVPIVMH-----DYLT 271
3A12_A_formIII KVGYSPEEFEKLAYDLLSNGADYMKDDENLTSPWYNRFEERAEIMAKIIDKVENETGEKKTWFANITADLLEMEQRLEVLA-DLGLKHAMV-----DVVI 258
1YKW_A_formIV NIGLSPGEFAEIAYQSWLGGLDIAKDDEMLADVTWSSIEERAAHLGKARRKAEAETGEPKIYLANITDEVDSLMEKHDVAVRNGANALLIN------ALP 267

Secondary structure: α4 β5 αF βF α5 β6 α6 βG βH
ACD80_3200.32411.19G0017 AGWMAVQTLRRKYPD--VFIHFHRAGHGAFTRPENPIGFTVLVLSKFARLAGASGIHTGTAGVGKMAGSPEEDITAAHNILNLVAEGH-----IFHQSRG 358
YP_566926_M. burtonii AGWMAVQTLRRRYPD--VFLHFHRAAHGAFTRQENPIGFSVLVLSKFARLAGASGIHTGTAGIGKMKGTPAEDVVAAHSIQYLKSPGH-----FFEQTWS 360
ACD51_48308.71846.10G0003 AGWTAVQTARRMWPD--VFLHFHRAGHGAMTREENPIGYTVEVLTKFGRLAGASGMHTGTAGIGKMAGDGDTDVRAAHLALDKVASGP-----FFEQDWG 347
ACD65_203648.6442.5G0005 AGWTAVQTARRMWPG--VFLHFHRAGHGAMTRPENPIGYTVPFMTKMGRLAGASGMHTGTAGIGKMEGSAKEDVMAAHHALFAKSEGD-----FFDQDWY 345
5RUB_A_formII AGAAAITTARRRFPD--NFLHYHRAGHGAVTSPQSKRGYTAFVHCKMARLQGASGIHTGTMGFGKMEGES-SDRAIAYMLTQDEAQGP-----FYRQSWG 358
8RUC_A_formI GGFTANTTLSHYCRDNGLLLHIHRAMHAVIDRQKN-HGMHFRVLAKALRLSGGDHIHSGTV-VGKLEGERDITLGFVDLLRDDYTEKDRSRGIYFTQSWV 369
3A12_A_formIII TGWGALRYIRDLAADYGLAIHGHRAMHAAFTRNPY-HGISMFVLAKLYRLIGIDQLHVGTAGAGKLEGGKWDVIQNARILRESHYKPDENDVFHLEQKFY 357
1YKW_A_formIV VGLSAVRMLSNYTQVP---LIGHFPFIASFSRMEK-YGIHSKVMTKLQRLAGLDAVIMPGFGDRVMTPEEEVLENVIECTKP---MGR------------ 348

Secondary structure: Extra loop β7 α7 β8 α8
ACD80_3200.32411.19G0017 TIPETDDDFIRDIQEDIAHHTVLQDDSRRAVKKCCPIISGGLNPTLLKPFIDVMGNVDFITTMGAWCHAHPGGTQKGATALVQSCEAYKAW--------- 449
YP_566926_M. burtonii KIMDTDKDVINLVNEDLAHHVILEDDSWRAMKKCCPIVSGGLNPVKLKPFIDVMENVDFITTMGSGVHSHPGGTQSGAKALVQACDAYLQG--------- 451
ACD51_48308.71846.10G0003 -----------------------------DMKPMCPIASGGLNPVLLKPFADVIGTVDFITTMGGGVHSHPSGTEKGAMALVQACEAWKQK--------- 409
ACD65_203648.6442.5G0005 -----------------------------GMKPMCPIASGGLNPILLKPFADVVGTTDFITTMGGGVHSHPGGTEKGAMALVQACDAWKKG--------- 407
5RUB_A_formII -----------------------------GMKACTPIISGGMNALRMPGFFENLGNANVILTAGGGAFGHIDGPVAGARSLRQAWQAWRDG--------- 420
8RUC_A_formI -----------------------------STPGVLPVASGGIHVWHMPALTEIFG-DDSVLQFGGGTLGHPWGNAPGAVANRVALEACVQARNEGRDLAR 439
3A12_A_formIII -----------------------------SIKAAFPTSSGGLHPGNIQPVIEALG-TDIVLQLGGGTLGHPDGPAAGARAVRQAIDAIMQG--------- 418
1YKW_A_formIV ------------------------------IKPCLPVPGGSDSALTLQTVYEKVGNVDFGFVPGRGVFGHPMGPKAGAKSIRQAWEAIEQG--------- 409

Secondary structure: αG αH
ACD80_3200.32411.19G0017 ---IDIHEYAKTHKELAQAIEFFEKNLNKDIKHAERNEIS--------- 486
YP_566926_M. burtonii ---MDIEEYAKDHKELAEAIEFYLNR----------------------- 474
ACD51_48308.71846.10G0003 ---IDMNEYAKTHAELGQAVEFYKEHVEYTKKYAGK------------- 442
ACD65_203648.6442.5G0005 ---ISIKEYAKNHKELAQAIGFYKEKVGYSKKYL--------------- 438
5RUB_A_formII ---VPVLDYAREHKELARAFESFPGDADQIYPGWRKALGVEDTRSALPA 466
8RUC_A_formI EGNTIIREATKWSPELAAACEVWKEIKFEFPAMDTV------------- 475
3A12_A_formIII ---IPLDEYAKTHKELARALEKWGHVTPV-------------------- 444
1YKW_A_formIV ---ISIETWAETHPELQAMVDQSLLKKQD-------------------- 435

Figure S7. Model structure of ACD80_3200.32411.19G0017 RuBisCO, including the additional loop structure colored in black and surrounded by a brown box. The predicted active site is surrounded by a blue box.

Figure S8.

Genome BinACD2BD1-5

ACD3BD1-5

ACD49BD1-5

ACD4BD1-5

ACD78BD1-5

ACD80Distant BD1-5

ACD18OD1

ACD41OD1

ACD43OD1

ACD58OD1

ACD63OD1

ACD66OD1

ACD68OD1

ACD76OD1

ACD81 (OD1), ACD83 (OD1); ACD11, ACD14, ACD15, ACD1, ACD56, ACD5, ACD67, ACD7, ACD8, ACD9 (OD1-i); ACD12, ACD13, ACD19, ACD22, ACD24, ACD25, ACD26, ACD27, ACD30, ACD31, ACD32, ACD36, ACD37, ACD38, ACD40, ACD48, ACD50, ACD52, ACD57, ACD61 (OP11); ACD28, ACD51, ACD65 (PER)

[Figure panel "Single-copy Gene Phylogenetic Markers": columns are the genomic bins; rows are RecA, DNA gyrase A, DNA gyrase B, RpoB, and ribosomal proteins L1, L2, L3, L4, L5, L6, L10, L11, L13, L14, L15, L16, L18, S2, S3, S4, S7, S8, S9, S10, S11, S12, S13, S15, S17, and S19; cells give the number of copies of each marker detected in each bin.]


Quantification of the single-copy phylogenetic marker genes used to estimate genome completeness and the number of genomes per bin for the 49 genomic bins associated with BD1-5, ACD80, OD1, OP11, and PER.
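As a rough illustration of how marker counts of this kind can be turned into the two estimates named in the caption, the sketch below treats completeness as the fraction of markers detected at least once and takes the median copy number of the detected markers as the number of genomes in the bin. Both summary rules, the function name `summarize_bin`, the shortened marker list, and the counts are assumptions made for this example; they are not the authors' procedure or data.

```python
from statistics import median

# Hypothetical, shortened marker list; the figure lists many more markers.
MARKERS = ["RecA", "GyrA", "GyrB", "RpoB", "L1", "L2", "S2", "S3"]


def summarize_bin(counts, markers=MARKERS):
    """Return (completeness, genomes_per_bin) for one genomic bin.

    counts maps marker name -> number of copies found in the bin.
    Completeness: fraction of markers seen at least once (assumed rule).
    Genomes per bin: median copy number among detected markers (assumed rule).
    """
    found = [counts.get(m, 0) for m in markers if counts.get(m, 0) > 0]
    completeness = len(found) / len(markers)
    genomes = median(found) if found else 0
    return completeness, genomes


# Made-up counts, for illustration only.
example = {"RecA": 1, "GyrA": 1, "GyrB": 1, "RpoB": 1, "L1": 2, "L2": 1}
completeness, genomes = summarize_bin(example)
print(f"completeness={completeness:.2f}, genomes per bin={genomes}")
```

Using the median rather than the mean keeps a single duplicated marker from inflating the genome count; other summary statistics would be equally defensible.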


Figure S9. Maximum likelihood 16S rRNA tree showing the phylogenetic position of the two PER 16S rRNA sequences recovered from ACD (blue text). These sequences cluster with reference sequences previously of “unknown” phylogenetic assignment according to Greengenes and RDP, forming a well-defined PER lineage. Bootstrap proportions for 500 samplings are reported (text), and the separate PER lineage (blue bracket) is strongly supported (>89%) in trees constructed using RAxML, FastTree, maximum parsimony, and neighbor-joining methods. Circles at nodes in the PER lineage indicate nodes recovered by all four treeing methods (closed circle) or by three treeing methods (open circle). All non-ACD PER sequences are from uncultivated bacterial clones; SILVA sequences with the highest identity to the ACD PER sequences are shown in bold. Beyond the PER-labeled phylum, the reference sequences from Figure S2 are included (data not shown).

[Figure S9 tree. Scale bar: 0.1. Bracketed clades: PER, BD1-5, OD1, and OP11. The two ACD PER tips are labeled "ACD28 326 bp fragment" and "ACD Clone 77 1453 bp". The root branch is labeled "To Bacteria (see Figure S2)".]


Table S1. Accession numbers for A. RecA, DNA gyrase B, and RpoB (Figure S3); B. NiFe large subunit hydrogenase (Figure S5); and C. RuBisCO (Figure 4, main text) used in the phylogenetic analyses.


A) Single Gene Phylogenetic Analysis

Organism Name | DNA gyrase B | recA | rpoB
Acetohalobium arabaticum Z-7288 DSM 5501 | YP003826625 | YP003828054 | YP003826771
Acidobacterium capsulatum ATCC 51196 | YP002755762 | ACO33353 | YP002755938
Acidovorax ebreus TPSY | YP002551492 | YP002554778 | YP002554675
Acinetobacter lwoffii SH145 | ZP06070882 | ZP06069410 | ZP06068382
Ammonifex degensii KC4 | YP003238033 | YP003238626 | YP003239477
Anaeromyxobacter dehalogenans 2CP | ZP02324675 | ZP02325928 | ZP02325295
Bacillus cereus strain Rock1 | ZP04242437 | ZP04240806 | ZP04237424
Bacillus subtilis subsp subtilis strain 168 | NP387887 | NP389576 | NP387988
Bacteroides vulgatus ATCC 8482 | YP001298699 | YP001299658 | YP001298139
Bacteroides finegoldii strain DSM 17565 | ZP05416642 | EEX44381 | ZP05416316
candidate division TM7 GTL1 | ZP01811183 | ZP01811340 | N/A
candidate division TM7 single cell TM7a | ZP02869697 | ZP02869274 | N/A
Chlamydophila pneumoniae strain TW 183 | NP876558 | NP877062 | NP876357
Chloroflexus aggregans DSM 9485 | YP002464957 | YP002463084 | YP002464011
Chloroflexus aurantiacus J-10-fl | YP001636597 | YP001634976 | YP001634496
Clostridium spiroforme DSM 1552 | ZP02868565 | ZP02867562 | ZP02868682
Coxiella burnetii strain CbuG Q212 | YP002302609 | YP002303465 | YP002304193
Dechloromonas aromatica strain RCB | YP283232 | YP287348 | YP283541
Dehalococcoides ethenogenes 195 | YP180759 | YP182301 | YP181345
Dehalococcoides sp strain BAV1 | YP001213474 | YP001214807 | YP001214041
Dehalococcoides sp strain GT | YP003461832 | YP003463211 | YP003462363
Deinococcus proteolyticus strain MRP | YP180759 | YP004255825 | YP004255565
Deinococcus radiodurans strain R1 | YP004256010 | NP296060 | NP294636
Delta proteobacterium strain MLMS | ZP01288026 | ZP01290064 | ZP01289862
Desulfococcus oleovorans Hxd3 | YP001527982 | YP001529150 | YP001528589
Desulfotalea psychrophila LSv54 | YP064385 | YP066694 | YP064853
Gallionella capsiferriformans ES-2 | YP003845812 | YP003846316 | YP003846358
Gemmatimonas aurantiaca T-27T | YP002759996 | YP002761429 | YP002760367
Geobacter bemidjiensis Bem DSM 16622 | YP002136833 | YP002140418 | YP002137743
Lactobacillus acidophilus strain ATCC 4796 | ZP04020395 | ZP04021352 | ZP04020985
Megasphaera micronuciformis F0359 | ZP07756792 | ZP07757834 | ZP07757468
Methylococcus capsulatus Bath | YP115417 | YP112918 | YP113541
Moorella thermoacetica ATCC 39073 | YP428891 | YP429173 | YP431293
Mycobacterium abscessus strain 47J26 | MAB47J26 | EHC00686 | EHB99868
Mycoplasma pneumoniae strain FH | ADK86799 | ADK86658 | ADK86844
Paludibacter propionicigenes WB4 DSM 17365 | YP004042808 | YP004041343 | YP004042849
Paracoccus denitrificans PD1222 | YP914226 | YP914405 | YP914555
Pedosphaera parvula Ellin514 | ZP03628627 | ZP03628143 | ZP03628347
Prevotella denticola strain F0289 | YP004328499 | YP004329430 | YP004328612
Rhodobacter capsulatus strain SB | YP003576176 | YP003577903 | YP003576463
Rhodoferax ferrireducens strain ATCC BAA | YP521296 | YP525178 | YP524827
Rhodopseudomonas palustris BisA53 | YP778946 | YP780523 | YP782507
Shewanella oneidensis strain MR 1 | NP715653 | NP718983 | NP715864
Spirochaeta thermophila DSM 6578 | AEJ62551 | AEJ61402 | AEJ60792
Staphylococcus aureus 04-02981 | ADC36220 | ADC37454 | ADC36730
Streptococcus pneumoniae 670-6B | YP003879223 | YP003880178 | YP003880192
Sulfuricurvum kujiense YK-1 DSM 16994 | YP004059093 | YP004058915 | YP004061130
Synechococcus elongatus PCC 6301 | YP172325 | YP171875 | YP173217
Thermincola potens JR | YP003638806 | YP003640040 | YP003639075
Thermoanaerobacter ethanolicus CCSD1 | ZP05492324 | ZP05491696 | ZP05493624
Thermobaculum terrenum YNP1 ATCC BAA-798 | YP003322805 | YP003321818 | ZP03857201
Thermus aquaticus Y51MC23 | ZP03497068 | ZP03495554 | ZP03495559
Treponema denticola ATCC 35405 | NP970619 | NP971482 | NP973020
Ureaplasma parvum U26 sv 14 ATCC 33697 | ZP02689822 | ZP02689824 | ZP02689961
Variovorax paradoxus strain EPS | YP004152337 | YP004158184 | YP004152975
Verrucomicrobium spinosum DSM 4136 | ZP02929173 | ZP02926079 | ZP02930291
Vibrio cholerae 2129 | ZP04419588 | ZP04419744 | ZP04416407
Waddlia chondrophila WSU 86-1044 | YP003709377 | YP003710317 | YP003708966


B) NiFe Hydrogenase, Large subunit

Organism name | Accession Number | Hydrogenase Type
Desulfotalea psychrophila LSv54 | YP064311 | 1
Desulfovibrio fructosovorans JJ | ZP07335471 | 1
Geobacter uraniireducens Rf4 | YP001229652 | 1
Anabaena variabilis ATCC29413 | YP325087 | 2a
Bradyrhizobium japonicum USDA110 | NP773583 | 2b
Nostoc punctiforme PCC73102 | AAC16277 | 2a
Rhodobacter capsulatus | AAC32033 | 2b
Rhodobacter sphaeroides KD131 | YP002526186 | 2b
Thiocapsa roseopersicina | AAX40740 | 2b
Methanococcus voltae | Q00404 | 3a
Methanopyrus kandleri AV19 | NP613553 | 3a
Pyrococcus abyssi GE5 | NP126548 | 3b
Thermococcus kodakarensis KOD1 | YP184482 | 3b
Thermodesulfovibrio yellowstonii DSM11347 | YP002249644 | 3b
Pyrococcus furiosus DSM3638 | NP578623 | 3b, I
Pyrococcus furiosus DSM3638 | NP579061 | 3b, II
Thermococcus barophilus MP | YP004070736 | 3b, II
Dethiobacter alkaliphilus AHT1 | ZP03728990 | Putative 3b
Allochromatium vinosum DSM | YP3442786 | Putative 3b
Verrucomicrobiae bacterium DG1235 | ZP5056260 | Putative 3b
Dehalococcoides ethenogenes 195 | YP181357 | 3c
Desulfotalea psychrophila LSv54 | YP06474 | 3c
Methanosphaera stadtmanae DSM | YP447369 | 3c
Methanothermobacter thermautotrophicus DeltaH | NP276262 | 3c
Syntrophobacter fumaroxidans MPOB | YP848058 | 3c
Dechloromonas aromatica RCB | YP284208 | 3d
Desulfotalea psychrophila LSv54 | YP65948 | 3d
Magnetospirillum magneticum AMB1 | YP422759 | 3d
Rhodobacter capsulatus | AAD38065 | 3d
Synechococcus sp PCC7002 | YP001733469 | 3d
Clostridium thermocellum ATCC27405 | YP0010394091 | 4
Desulfovibrio gigas | AAP51029 | 4
Methanosarcina barkeri | CAA76121 | 4
Methanospirillum hungatei JF1 | YP503186 | 4
Thermococcus kodakarensis KOD1 | YP184504 | 4
Pyrococcus furiosus DSM | NP579163 | 4
Pyrococcus abyssi GE5 | NP126404 | 4
Thermoanaerobacter tengcongensis MB4 | AAM23431 | 4
Rhodopseudomonas palustris BisB18 | ABD90092 | 4
Thermococcus onnurineus NA1 | ACJ15761 | 4
Thermococcus onnurineus NA1 | ACJ16511 | 4
Thermococcus onnurineus NA1 | YP2307980 | 4
Thermococcus sp AM4 | EEB72991 | 4
Thermococcus sp AM4 | EEB73425 | 4
Escherichia coli | CAA355501 | 4
Rhodospirillum rubrum | AAC451211 | 4
Salmonella enterica ATCC9150 | AAV785641 | 4
Dehalococcoides sp BAV1 | ABQ17368 | 4
Carboxydothermus hydrogenoformans Z2901 | YP360647 | 4
Desulfosporosinus youngiae DSM | ZP9652926 | Putative 4
Methylococcus capsulatus str | YP113608 | Putative 4
3799849 Candidatus Nitrospira | YP3799849 | Putative 4
Candidatus Kuenenia stuttgartiensis | CAJ72523 | Putative 4
NC10 bacterium Dutch sediment | CBE69905 | Putative 4


C) RuBisCO

Organism name | Accession Number | RuBisCO Form
Hydrogenophilus thermoluteolus | Q51856 | IA
Nitrobacter vulgaris | Q59613 | IA
Activated spinach | gi1827835 | IA
Anabaena variabilis ATCC 29413 | ABA23512 | IB
Nicotiana tabacum | P00876 | IB
Xanthobacter autotrophicus Py2 | YP001416820 | IC
Methylibium petroleiphilum PM1 | YP001020675 | IC
Aurantimonas sp. SI85-9A1 | AAB41464 | ID
Nitrosococcus oceani ATCC 19707 | ABA56859 | ID
Odontella sinensis | NP043654 | ID
Pleurochrysis carterae | Q08051 | ID
Nitrosospira multiformis ATCC 251966 | YP411385 | ID
Gonyaulax polyedr | AAA98748 | II
Symbiodinium sp. | AAG37859 | II
Thiomicrospira crunogena XCL-2 | ABB41020 | II
Hydrogenovibrio marinus | Q59462 | II
Magnetospirillum magnetotacticum AMB-1 | YP422059 | II
Rhodoferax ferrireducens T118 | YP522655 | II
Rhodopseudomonas palustris | AAN52766 | II
Rhodobacter capsulatus ATCC11166 | P50922 | II
Riftia pachyptila endosymbiont | AAC38280 | II
Methanocaldococcus jannaschii | AAB99239 | III
Methanosarcina acetivorans C2A | AAM07894 | III
Pyrococcus horikoshii OT3 | BAA30036 | III
Thermococcus kodakaraensis KOD1 | BAD86479 | III
Natronomonas pharaonis DSM 2160 | CAI49476 | III
Archaeglobus fulgidus DSM 4304 | NP070466 | III
Hyperthermus butylicus DSM 5456 | YP001012710 | III
Thermofilum pendens Hrk 5 | YP920628 | III
Staphylothermus hellenicus DSM 12710 | YP003669400 | III
Desulfurococcus fermentans DSM 16532 | ZP09027197 | III
Rhodospirillum rubrum | ABC22798 | IV-DeepYkr
Heliobacillus mobilis | ABH04879 | IV-DeepYkr
Rhodopseudomonas palustris CGA009 | CAE27610 | IV-DeepYkr
Halorhodospira halophila SL1 | YP001002057 | IV-DeepYkr
Rhodopseudomonas palustris BisB18 | YP532057 | IV-DeepYkr
Rhodopseudomonas palustris BisB5 | YP569369 | IV-DeepYkr
Alkalilimnicola ehrlichei MLHE-1 | YP742007 | IV-DeepYkr
Rhodopseudomonas palustris BisA53 | YP782588 | IV-DeepYkr
Mesorhizobium loti | BAB53192 | IV-Non-photo
Sinorhizobium meliloti 1021 | CAC48779 | IV-Non-photo
Bordetella bronchiseptica RB50 | CAE31534 | IV-Non-photo
Jannaschia sp. CCS1 | YP511005 | IV-Non-photo
Roseobacter sp. MED193 | ZP01056409 | IV-Non-photo
Fulvimarina pelagi HTCC2506 | ZP01438569 | IV-Non-photo
Chlorobium tepidum TLS1 | AAM72993 | IV-Photo
Chlorobium chlorochromatii CaD3 | ABB28892 | IV-Photo
Allochromatium vinosum | BAB44150 | IV-Photo
Rhodopseudomonas palustris BisB18 | YP530146 | IV-Photo
Bacillus cereus E33L | AAU16474 | IV-YkrW
Bacillus licheniformis ATCC 14580 | AAU23062 | IV-YkrW
Bacillus clausii KSM-K16 | BAD64310 | IV-YkrW
Bacillus subtilis subsp. subtilis str. 168 | CAB13232 | IV-YkrW
Methanosaeta concilii | YP004385218 | Putative II/III
Methanosalsum zhilinae DSM 4017 | YP004615354 | Putative II/III
Methanohalophilus mahii DSM 5219 | YP003542093 | Putative II/III
Methanococcoides burtonii DSM 6242 | YP566926 | II/III


Table S2. An overview of the 49 genome datasets. Abundance was based on genome coverage calculated directly from the k-mer coverage using the standard relationship from the Velvet manual, Ck = C*(L-k+1)/L, where Ck is the k-mer coverage, C the nucleotide coverage, L the read length, and k the k-mer length. Note that these values overestimate the true abundance by roughly a factor of two, because 55.3% of the reads did not go into the ACD genome assembly.

Organism | Phylum | Abundance | Bin Size | GC % | Contig # | Gene # | Longest contig
ACD49 | BD1-5 | 0.011 | 1.23 Mbp | 27.68% | 31 | 1173 | 259398
ACD80 | BD1-5 | 0.010 | 1.30 Mbp | 35.20% | 115 | 1351 | 202142
ACD4 | BD1-5 | 0.006 | 1.35 Mbp | 27.47% | 186 | 1583 | 37706
ACD3 | BD1-5 | 0.078 | 1.88 Mbp | 33.17% | 101 | 1784 | 79526
ACD78 | BD1-5 | 0.009 | 1.20 Mbp | 42.90% | 237 | 1652 | 27201
ACD2 | BD1-5 | 0.070 | 1.68 Mbp | 37.21% | 58 | 1585 | 155588
ACD18 | OD1 | 0.011 | 1.28 Mbp | 32.24% | 188 | 1474 | 43392
ACD1 | OD1 | 0.112 | 1.30 Mbp | 38.53% | 69 | 1345 | 107361
ACD15 | OD1 | 0.008 | 1.31 Mbp | 38.23% | 103 | 1473 | 62809
ACD41 | OD1 | 0.008 | 1.45 Mbp | 46.87% | 205 | 1756 | 32449
ACD7 | OD1 | 0.090 | 1.84 Mbp | 34.32% | 275 | 2237 | 66800
ACD14 | OD1 | 0.004 | 215.82 Kbp | 37.02% | 43 | 278 | 9563
ACD68 | OD1 | 0.003 | 261.22 Kbp | 41.33% | 69 | 353 | 11505
ACD83 | OD1 | 0.003 | 447.39 Kbp | 37.41% | 132 | 674 | 10118
ACD67 | OD1 | 0.003 | 579.50 Kbp | 40.92% | 111 | 760 | 20053
ACD66 | OD1 | 0.003 | 621.26 Kbp | 40.59% | 143 | 800 | 13553
ACD43 | OD1 | 0.003 | 869.49 Kbp | 42.14% | 150 | 1088 | 20320
ACD63 | OD1 | 0.004 | 897.55 Kbp | 40.65% | 137 | 1037 | 44013
ACD76 | OD1 | 0.009 | 900.45 Kbp | 43.26% | 84 | 974 | 65085
ACD56 | OD1 | 0.007 | 912.61 Kbp | 40.01% | 73 | 1005 | 52176
ACD81 | OD1 | 0.059 | 959.32 Kbp | 43.24% | 98 | 1091 | 98615
ACD8 | OD1 | 0.010 | 991.31 Kbp | 38.34% | 53 | 1028 | 89478
ACD5 | OD1 | 0.005 | 1.05 Mbp | 36.46% | 183 | 1352 | 55567
ACD9 | OD1 | 0.004 | 1.06 Mbp | 36.80% | 207 | 1371 | 21092
ACD11 | OD1 | 0.017 | 1.09 Mbp | 36.75% | 71 | 1174 | 357851
ACD72 | OD1 and OP11 | 0.003 | 1.20 Mbp | 37.01% | 287 | 1670 | 16832
ACD58 | OD1? | 0.005 | 1.22 Mbp | 33.67% | 158 | 1395 | 63676
ACD22 | OP11 | 0.008 | 1.09 Mbp | 37.60% | 128 | 1359 | 36255
ACD24 | OP11 | 0.009 | 1.11 Mbp | 36.61% | 300 | 1688 | 19831
ACD50 | OP11 | 0.004 | 1.18 Mbp | 38.58% | 211 | 1652 | 31886
ACD48 | OP11 | 0.003 | 1.21 Mbp | 40.05% | 338 | 1885 | 13609
ACD13 | OP11 | 0.010 | 1.37 Mbp | 40.92% | 125 | 1651 | 143676
ACD37 | OP11 | 0.004 | 1.51 Mbp | 36.98% | 320 | 2159 | 25148
ACD12 | OP11 | 0.006 | 1.70 Mbp | 33.30% | 380 | 2480 | 43707
ACD36 | OP11 | 0.003 | 343.41 Kbp | 43.05% | 84 | 515 | 13590
ACD31 | OP11 | 0.007 | 378.63 Kbp | 33.05% | 49 | 464 | 243834
ACD26 | OP11 | 0.007 | 379.77 Kbp | 29.69% | 86 | 555 | 112469
ACD32 | OP11 | 0.005 | 448.08 Kbp | 36.28% | 66 | 534 | 118836
ACD25 | OP11 | 0.003 | 603.41 Kbp | 43.53% | 132 | 825 | 14723
ACD57 | OP11 | 0.004 | 916.27 Kbp | 39.83% | 216 | 1398 | 19492
ACD52 | OP11 | 0.004 | 921.35 Kbp | 42.62% | 156 | 1260 | 30997
ACD30 | OP11 | 0.005 | 926.19 Kbp | 36.99% | 50 | 1107 | 112704
ACD38 | OP11 | 0.005 | 955.57 Kbp | 39.56% | 86 | 1167 | 55406
ACD40 | OP11 | 0.007 | 986.25 Kbp | 46.08% | 127 | 1366 | 65026
ACD61 | OP11 | 0.005 | 990.14 Kbp | 44.99% | 152 | 1314 | 43627
ACD19 | OP11 min OD1 | 0.007 | 1.42 Mbp | 33.72% | 244 | 2016 | 147681
ACD51 | PER | 0.005 | 1.46 Mbp | 41.43% | 156 | 1632 | 90706
ACD28 | PER | 0.006 | 1.68 Mbp | 44.80% | 207 | 1916 | 84311
ACD65 | PER | 0.003 | 751.66 Kbp | 42.89% | 172 | 1057 | 19890
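The coverage relationship quoted in the Table S2 caption can be inverted to recover nucleotide coverage from k-mer coverage, and the stated 55.3% of unassembled reads implies a correction factor of 1/(1 - 0.553), or about 2.24. The sketch below works through both steps. The read length, k-mer size, and k-mer coverage are hypothetical inputs, the function names are illustrative, and the rescaling assumes that the tabulated abundance was normalized by assembled reads only.

```python
def nucleotide_coverage(kmer_coverage, read_length, k):
    """Invert Ck = C * (L - k + 1) / L to obtain C from Ck."""
    if not 0 < k <= read_length:
        raise ValueError("k must satisfy 0 < k <= read_length")
    return kmer_coverage * read_length / (read_length - k + 1)


def corrected_abundance(abundance, unassembled_fraction=0.553):
    """Rescale an abundance that was computed from assembled reads only.

    Assumption: the value relative to all reads is smaller by the
    assembled fraction, here 1 - 0.553.
    """
    return abundance * (1.0 - unassembled_fraction)


# Hypothetical inputs: 100 bp reads, k = 31, k-mer coverage of 14.
c = nucleotide_coverage(kmer_coverage=14.0, read_length=100, k=31)
print(f"nucleotide coverage: {c:.1f}")
print(f"overestimate factor: {1 / (1 - 0.553):.2f}")

# ACD1 is listed with an abundance of 0.112 in Table S2.
print(f"ACD1 rescaled abundance: {corrected_abundance(0.112):.3f}")
```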


Table S4. List of genes and proteins featured in Figure 3 (main text). Comprehensive proteome identification is included in Table S3. Asterisks indicate proteins identified by mass spectrometry with lower confidence.


Gene No. | EC number or KEGG | Gene name | ACD genome, OD1-i (Fig 3A) | ACD proteome, OD1-i (Fig 3A) | ACD genome, PER (Fig 3B) | ACD proteome, PER (Fig 3B)
1 | 2.7.1.1 | hexokinase | yes | yes* | no | no
2 | 5.4.2.2 | phosphoglucomutase | yes | no | no | no
3 | 5.3.1.9 | glucose-6-phosphate isomerase | yes | yes* | yes | yes
4 | 2.7.1.11 | 6-phosphofructokinase | no | no | yes | no
5 | 4.1.2.13 | fructose-bisphosphate aldolase | yes | no | yes | yes
6 | 1.2.1.12 | glyceraldehyde 3-phosphate dehydrogenase | yes | yes | yes | yes
7 | 2.7.2.3 | phosphoglycerate kinase | yes | no | yes | no
8 | 5.4.2.1 | 2,3-bisphosphoglycerate-dependent phosphoglycerate mutase | yes | no | yes | no
9 | 4.2.1.11 | enolase | yes | yes | yes | yes
10 | 2.7.1.40 | pyruvate kinase | yes | yes | yes | yes
11 | 2.3.1.54 | pflB; formate acetyltransferase | no | no | yes | only ACD65
12 | 1.97.1.4 | pyruvate formate-lyase activating enzyme | yes | no | yes | no
13 | 1.2.7.1 | pyruvate ferredoxin oxidoreductase, alpha subunit | yes | yes* | no | no
14 | 1.2.7.1 | pyruvate ferredoxin oxidoreductase, beta subunit | yes | yes* | no | no
15 | 1.2.7.1 | pyruvate ferredoxin oxidoreductase, delta subunit | yes | no | no | no
16 | 1.2.7.1 | pyruvate ferredoxin oxidoreductase, gamma subunit | yes | no | no | no
17 | 6.2.1.13 | acetyl-CoA synthetase (ADP-forming) | yes | no | no | no
18 | 3.6.1.7 | acylphosphatase | no | no | yes | no
19 | 2.7.2.1 | acetate kinase | no | no | yes | yes*
20 | 2.3.1.8 | phosphate acetyltransferase, 2 copies | no | no | yes | no
21 | 1.1.1.28 | D-lactate dehydrogenase | yes | no | no | no
22 | 4.1.1.1 | pyruvate decarboxylase (acetaldehyde forming) | no | no | no | no
23 | 1.1.1.1 | alcohol dehydrogenase | yes | yes | no | no
24 | 4.1.1.31 | phosphoenolpyruvate carboxylase | only ACD8 | no | yes | no
25 | 1.1.1.38 | malate dehydrogenase (oxaloacetate-decarboxylating) | yes | no | yes | no
26 | 2.2.1.1 | transketolase | yes | no | yes | no
27 | 2.7.1.15 | ribokinase | yes | no | yes | no
28 | 5.4.2.2 | phosphoglucomutase | yes | no | no | no
29 | 2.7.6.1 | ribose-phosphate pyrophosphokinase | yes | no | no | no
30 | 3.6.3.14 | F-type H+-transporting ATPase subunit alpha | yes | yes | yes | yes
31 | 3.6.3.14 | F-type H+-transporting ATPase subunit beta | yes | yes | yes | yes
32 | 3.6.3.14 | F-type H+-transporting ATPase subunit gamma | yes | yes* | yes | no
33 | 3.6.3.14 | F-type H+-transporting ATPase subunit delta | yes | no | no | no
34 | 3.6.3.14 | F-type H+-transporting ATPase subunit epsilon | yes | no | yes | no
35 | 3.6.3.14 | F-type H+-transporting ATPase subunit c | yes | yes* | yes | no
36 | 3.6.3.14 | F-type H+-transporting ATPase subunit a | yes | yes | yes | no
37 | 3.6.3.14 | F-type H+-transporting ATPase subunit b | yes | yes | yes | yes
38 | 5.4.2.8 | phosphomannomutase | yes | no | yes | no
39 | 5.3.1.8 | mannose-6-phosphate isomerase | yes | no | no | no
40 | 2.4.1.25 | glycogen debranching enzyme | yes | no | no | no
41 | 2.4.1.21 | starch phosphorylase | yes | no | no | no
42 | 2.4.1.11 | glycogen(starch) synthase, 2 copies | yes | no | no | no
43 | NA | 4Fe-4S ferredoxin or polyferredoxin | yes | yes | yes | no
44 | 1.18.1.2 | Ferredoxin reductase-like C-terminal NADP-linked | yes | no | no | no
45 | 1.18.99.1 | Cytoplasmic type 3b NiFe hydrogenase, large and small subunit | yes | yes | no | no
46 | 1.18.99.1 | Membrane type 4 NiFe hydrogenase, large and small subunit | yes | no | no | no
47 | K04651 | hydrogenase accessory protein HypA | yes | no | no | no
48 | K04652 | hydrogenase accessory protein HypB | yes | no | no | no
49 | K04653 | hydrogenase expression/formation protein HypC | yes | no | no | no
50 | K04654 | hydrogenase expression/formation protein HypD | yes | yes* | no | no
51 | K04655 | hydrogenase expression/formation protein HypE | yes | yes | no | no
52 | K04656 | hydrogenase expression regulatory protein HypF | yes | yes | no | no
53 | NA | Pili subunits (pilB, pilM, pilC, pilT) | yes | pil B, C, M | yes | pilT
54 | NA | AMP phosphorylase, homolog Tk-DeoA | no | no | yes | no
55 | NA | Ribose-1,5-bisphosphate isomerase, homolog Tk-E2B2 | no | no | yes | no
56 | 4.1.1.39 | Ribulose-1,5-bisphosphate carboxylase, RuBisCO | no | no | yes | no
57 | 2.2.1.6 | acetolactate synthase I/II/III large subunit | no | no | yes | no
58 | 4.2.1.9 | dihydroxy-acid dehydratase | no | no | yes | no
59 | 2.6.1.42 | branched-chain amino acid aminotransferase | no | no | yes | no
60 | 2.5.1.54 | 3-deoxy-7-phosphoheptulonate synthase | no | no | yes | no
61 | 4.2.3.4 | 3-dehydroquinate synthase | no | no | yes | no
62 | 4.2.1.10 | 3-dehydroquinate dehydratase I | no | no | yes | no
63 | 1.1.1.25 | shikimate dehydrogenase | no | no | yes | no
64 | 2.7.1.71 | shikimate kinase | no | no | yes | no
65 | 2.5.1.19 | 3-phosphoshikimate 1-carboxyvinyltransferase | no | no | no | no
66 | 4.2.3.5 | chorismate synthase | no | no | yes | no
67 | 5.4.99.5 | chorismate mutase | no | no | yes | yes
68 | 4.2.1.51 | prephenate dehydratase | no | no | yes | no
69 | 2.6.1.1 | aspartate aminotransferase, cytoplasmic | no | no | yes | no
70 | 2.6.1.9 | histidinol-phosphate aminotransferase | no | no | yes | no
71 | K07636 | OmpR family, phosphate regulon sensor histidine kinase PhoR | no | no | yes | yes
72 | K07657 | OmpR family, phosphate regulon response regulator PhoB | no | no | yes | no
73 | K03324 | Na/Phosphate symporter | no | no | yes | yes


Table S5. Distribution of NiFe hydrogenases, FeFe hydrogenases, and hydrogenase maturation factors detected in the uncultivated phyla (OD1-i, OD1, and OP11). Grey shading denotes the presence of a single feature and white shading denotes absence; a number is given where more than one feature is present.

Column groups: FeFe; NiFe (1, 2a, 2b, 3a, 3b, 3c, 3d, 4); Hydrogenase maturation proteins (HypA, HypB, HypC, HypD, HypE, HypF)

FeFe 1 2a 2b 3a 3b 3c 3d 4 HypA HypB HypC HypD HypE HypF

All ACD Genomes 3 0 0 0 18 3 1 8 7 17 7 16 10 14 18
Candidate Divisions 0 0 0 0 17 0 0 5 3 8 0 10 7 7 8

OD1-i
ACD1
ACD5 1 1 1 1 1 1
ACD7 1 1 1
ACD8 1 1 1 1 1 1
ACD9 2 1 1 1 1 1 1
ACD11 1 1 1 1 1 1
ACD15 1 1 1 1 1 1 1
ACD56 1 1 1 1
ACD67 1 1 1 1

other OD1
ACD14 1 1
ACD58 1 1
ACD72 1 1
ACD63 1 1 1 1 1

OP11
ACD48 1 1 1
ACD57 1 1
ACD22 1
ACD25 1 1
ACD12 1


Captions for additional separate files: Figure S2, Figure S3, Table S3

Figure S2. Maximum likelihood phylogenetic tree of the 16S rRNA gene showing the placement of EMIRGE ACD sequences. The best BLAST hit, either a clone library cluster (cluster number for sequences clustered at 97% identity) or the accession number of a previously deposited sequence, is noted in parentheses next to the EMIRGE sequence. The histogram on the right indicates the relative abundance of each sequence recovered using EMIRGE across the three time points (maximum is 19%).

Figure S3. Phylogenetic trees constructed using conserved gene sequences from at least 31 ACD candidate division genomes. The same 58 reference sequences were collected from sequenced genomes in public databases. A. RecA, B. DNA gyrase B, C. RpoB. The BD1-5, PER, OD1, and OP11 taxa are denoted by colored brackets with text. One thousand bootstraps were performed, and nodal support greater than 80 is indicated. A complete list of organisms and accession numbers can be found in Table S1A.

Table S3. Summary of proteomic data for the A, C, and D samples. Spectral count and normalized spectral abundance factor (NSAF) values are reported for the two labs that made measurements on the same samples. Results for organisms from the same lineage are grouped, and within groups they are arranged approximately by protein abundance. Gold highlight flags proteins identified by two or more peptides.
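For readers unfamiliar with NSAF, the sketch below computes it from spectral counts using the usual definition, NSAF_i = (SpC_i / L_i) / sum over j of (SpC_j / L_j), where SpC is a protein's spectral count and L is its length in amino acids. The protein identifiers, counts, and lengths are invented for illustration and are not taken from Table S3; the function name `nsaf` is likewise only a label for this example.

```python
def nsaf(spectral_counts, lengths):
    """Normalized spectral abundance factor for each protein.

    spectral_counts: dict of protein id -> spectral count (SpC)
    lengths: dict of protein id -> protein length in amino acids (L)
    """
    # Spectral abundance factor: counts scaled by protein length.
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    if total == 0:
        raise ValueError("at least one protein needs a nonzero spectral count")
    # Normalize so that the values sum to 1 within the sample.
    return {p: value / total for p, value in saf.items()}


# Hypothetical proteins and values.
counts = {"protein_a": 30, "protein_b": 10, "protein_c": 5}
lengths = {"protein_a": 300, "protein_b": 200, "protein_c": 100}
for protein, value in nsaf(counts, lengths).items():
    print(f"{protein}: {value:.3f}")
```

Dividing by length before normalizing keeps long proteins, which yield more peptides, from appearing more abundant than short ones with the same copy number.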

Supplementary Download (FASTA_file.zip) Individual FASTA files for gene and protein sequences referenced in this manuscript.


References and Notes

1. M. Hess et al., Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science 331, 463 (2011).

2. D. Wu et al., Stalking the fourth domain in metagenomic data: Searching for, discovering, and interpreting novel, deep branches in marker gene phylogenetic trees. PLoS ONE 6, e18011 (2011).

3. G. W. Tyson et al., Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428, 37 (2004).

4. V. Iverson et al., Untangling genomes from metagenomes: Revealing an uncultured class of marine Euryarchaeota. Science 335, 587 (2012).

5. P. G. Falkowski, T. Fenchel, E. F. Delong, The microbial engines that drive Earth’s biogeochemical cycles. Science 320, 1034 (2008).

6. M. S. Rappé, S. J. Giovannoni, The uncultured microbial majority. Annu. Rev. Microbiol. 57, 369 (2003).

7. Materials and methods are available as supplementary materials on Science Online.

8. C. S. Miller, B. J. Baker, B. C. Thomas, S. W. Singer, J. F. Banfield, EMIRGE: Reconstruction of full-length ribosomal genes from microbial community short read sequencing data. Genome Biol. 12, R44 (2011).

9. G. J. Dick et al., Community-wide analysis of microbial genome sequence signatures. Genome Biol. 10, R85 (2009).

10. J. Raes, J. O. Korbel, M. J. Lercher, C. von Mering, P. Bork, Prediction of effective genome size in metagenomic samples. Genome Biol. 8, R10 (2007).

11. M. S. Elshahed et al., Metagenomic analysis of the microbial community at Zodletone Spring (Oklahoma): Insights into the genome of a member of the novel candidate division OD1. Appl. Environ. Microbiol. 71, 7598 (2005).

12. N. H. Youssef, P. C. Blainey, S. R. Quake, M. S. Elshahed, Partial genome assembly for a candidate division OP11 single cell from an anoxic spring (Zodletone Spring, Oklahoma). Appl. Environ. Microbiol. 77, 7804 (2011).

13. J. P. McCutcheon, B. R. McDonald, N. A. Moran, Origin of an alternative genetic code in the extremely small and GC-rich genome of a bacterial symbiont. PLoS Genet. 5, e1000565 (2009).

14. X. Mai, M. W. Adams, Purification and characterization of two reversible and ADP-dependent acetyl coenzyme A synthetases from the hyperthermophilic archaeon Pyrococcus furiosus. J. Bacteriol. 178, 5897 (1996).

15. R. T. Anderson et al., Stimulating the in situ activity of Geobacter species to remove uranium from the groundwater of a uranium-contaminated aquifer. Appl. Environ. Microbiol. 69, 5884 (2003).


16. L. Schöcke, B. Schink, Membrane-bound proton-translocating pyrophosphatase of Syntrophus gentianae, a syntrophically benzoate-degrading fermenting bacterium. Eur. J. Biochem. 256, 589 (1998).

17. P. M. Vignais, B. Billoud, Occurrence, classification, and biological function of hydrogenases: An overview. Chem. Rev. 107, 4206 (2007).

18. K. Ma, R. N. Schicho, R. M. Kelly, M. W. Adams, Hydrogenase of the hyperthermophile Pyrococcus furiosus is an elemental sulfur reductase or sulfhydrogenase: Evidence for a sulfur-reducing hydrogenase ancestor. Proc. Natl. Acad. Sci. U.S.A. 90, 5341 (1993).

19. D. J. van Haaster, P. J. Silva, P. L. Hagedoorn, J. A. Jongejan, W. R. Hagen, Reinvestigation of the steady-state kinetics and physiological function of the soluble NiFe-hydrogenase I of Pyrococcus furiosus. J. Bacteriol. 190, 1584 (2008).

20. R. Sapra, K. Bagramyan, M. W. W. Adams, A simple energy-conserving system: Proton reduction coupled to proton translocation. Proc. Natl. Acad. Sci. U.S.A. 100, 7545 (2003).

21. P. J. Silva et al., Enzymes of hydrogen metabolism in Pyrococcus furiosus. Eur. J. Biochem. 267, 6541 (2000).

22. T. Kanai et al., Distinct physiological roles of the three [NiFe]-hydrogenase orthologs in the hyperthermophilic archaeon Thermococcus kodakarensis. J. Bacteriol. 193, 3109 (2011).

23. F. R. Tabita, T. E. Hanson, S. Satagopan, B. H. Witte, N. E. Kreel, Phylogenetic and evolutionary relationships of RubisCO and the RubisCO-like proteins and the functional lessons provided by diverse molecular forms. Phil. Trans. R. Soc. B 363, 2629 (2008).

24. T. Sato, H. Atomi, T. Imanaka, Archaeal type III RuBisCOs function in a pathway for AMP metabolism. Science 315, 1003 (2007).

25. C. Briée, D. Moreira, P. López-García, Archaeal and bacterial community composition of sediment and plankton from a suboxic freshwater pond. Res. Microbiol. 158, 213 (2007).

26. S. Peura et al., Distinct and diverse anaerobic bacterial communities in boreal lakes dominated by candidate division OD1. ISME J. 6, 1640 (2012).

27. P. Hugenholtz, C. Pitulle, K. L. Hershberger, N. R. Pace, Novel division level bacterial diversity in a Yellowstone hot spring. J. Bacteriol. 180, 366 (1998).

28. J. K. Harris, S. T. Kelley, N. R. Pace, New perspective on uncultured bacterial phylogenetic division OP11. Appl. Environ. Microbiol. 70, 845 (2004).

29. I. Pagani et al., The Genomes OnLine Database (GOLD) v.4: Status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 40 (Database issue), D571 (2012).


30. K. H. Williams et al., Acetate availability and its influence on sustainable bioremediation of uranium-contaminated groundwater. Geomicrobiol. J. 28, 519 (2011).

31. M. S. Lipton et al., Global analysis of the Deinococcus radiodurans proteome by using accurate mass tags. Proc. Natl. Acad. Sci. U.S.A. 99, 11049 (2002).

32. R. T. Kelly et al., Chemically etched open tubular and monolithic emitters for nanoelectrospray ionization mass spectrometry. Anal. Chem. 78, 7796 (2006).

33. J. K. Eng, A. L. McCormack, J. R. Yates, An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976 (1994).

34. S. Kim, N. Gupta, P. A. Pevzner, Spectral probabilities and generating functions of tandem mass spectra: A strike against decoy databases. J. Proteome Res. 7, 3354 (2008).

35. B. Zybailov et al., Statistical analysis of membrane proteome expression changes in Saccharomyces cerevisiae. J. Proteome Res. 5, 2339 (2006).

36. R. C. Edgar, Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460 (2010).

37. R. C. Edgar, MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792 (2004).

38. A. Stamatakis, RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688 (2006).

39. I. Letunic, P. Bork, Interactive Tree Of Life (iTOL): An online tool for phylogenetic tree display and annotation. Bioinformatics 23, 127 (2007).

40. T. Z. DeSantis et al., Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069 (2006).

41. J. T. Simpson et al., ABySS: A parallel assembler for short read sequence data. Genome Res. 19, 1117 (2009).

42. D. R. Zerbino, E. Birney, Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821 (2008).

43. I. Sharon et al., Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization. Genome Res. (accepted) 10.1101/gr.142315.112 (2012).

44. D. Hyatt et al., Prodigal: Prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).

45. E. P. Nawrocki, D. L. Kolbe, S. R. Eddy, Infernal 1.0: Inference of RNA alignments. Bioinformatics 25, 1335 (2009).

46. T. M. Lowe, S. R. Eddy, tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955 (1997).

47. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, Basic local alignment search tool. J. Mol. Biol. 215, 403 (1990).

48. H. Ogata et al., KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 27, 29 (1999).

49. B. E. Suzek, H. Huang, P. McGarvey, R. Mazumder, C. H. Wu, UniRef: Comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23, 1282 (2007).

50. E. Quevillon et al., InterProScan: Protein domains identifier. Nucleic Acids Res. 33 (Web Server issue), W116 (2005).

51. D. Sassera et al., Phylogenomic evidence for the presence of a flagellum and cbb(3) oxidase in the free-living mitochondrial ancestor. Mol. Biol. Evol. 28, 3285 (2011).

52. G. Talavera, J. Castresana, Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst. Biol. 56, 564 (2007).

53. F. Abascal, R. Zardoya, D. Posada, ProtTest: Selection of best-fit models of protein evolution. Bioinformatics 21, 2104 (2005).

54. E. Lyons, M. Freeling, How to usefully compare homologous plant genes and chromosomes as DNA sequences. Plant J. 53, 661 (2008).

55. H. Ogata, P. Kellers, W. Lubitz, The crystal structure of the [NiFe] hydrogenase from the photosynthetic bacterium Allochromatium vinosum: Characterization of the oxidized enzyme (Ni-A state). J. Mol. Biol. 402, 428 (2010).

56. L. Casalot, M. Rousset, Maturation of the [NiFe] hydrogenases. Trends Microbiol. 9, 228 (2001).

57. N. Mehta, S. Benoit, R. J. Maier, Roles of conserved nucleotide-binding domains in accessory proteins, HypB and UreG, in the maturation of nickel-enzymes required for efficient Helicobacter pylori colonization. Microb. Pathog. 35, 229 (2003).

58. F. W. Larimer, M. R. Harpel, F. C. Hartman, Beta-elimination of phosphate from reaction intermediates by site-directed mutants of ribulose-bisphosphate carboxylase/oxygenase. J. Biol. Chem. 269, 11114 (1994).