Supporting Information
Husnik and McCutcheon 10.1073/pnas.1603910113
SI Materials and Methods
Symbiont Genome Assembly, Annotation, and Analyses. Endosymbiont genomes were closed into circular mapping molecules by a combination of PCR and Sanger sequencing. General Tremblaya primers for closing problematic regions, such as the duplicated rRNA operon, were designed to be applicable to most Tremblaya princeps species (Table S4). Given unclear GC skew in some of the species, the origin of replication was set to the same region as in already published Tremblaya and Moranella genomes to standardize comparative genomic analyses. Pilon v1.12 (86) and REAPR v1.0.17 (87) were used to diagnose and improve potential misassemblies, collapsed repeats, and polymorphisms. Genome annotations and reannotations [abbreviations combine Tremblaya princeps (TP) with species abbreviations such as PCIT; i.e., for TPPCIT, MEPCIT, and Tremblaya phenacola from Phenacoccus avenae (TPPAVE)] were carried out by the Prokka v1.10 pipeline (88) with the default discarding of ORFs overlapping tRNAs disabled. Our comparative data allowed us to reannotate many genes and pseudogenes previously annotated as hypothetical proteins and to uncover pseudogene remnants (Dataset S2A). The Tremblaya panproteome was curated manually with extensive use of MetaPathways v2.0 (89), PathwayTools v17.0 (90), and InterProScan v5.10 (91) and then used in Prokka as trusted proteins for annotation. This approach was used to obtain identical gene names for all seven Tremblaya genomes (TPPAVE, TPPCIT, TPMHIR, TPFVIR, TPPLON, TPPMAR, and TPTPER). tRNA and tmRNA regions were reannotated using the tFind.pl wrapper (bioinformatics.sandia.gov/software). Tremblaya pseudogenes were reannotated in the Artemis browser (92) based on a genome alignment of all Tremblaya genomes.
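The GC skew mentioned above can be illustrated with a minimal sketch. It assumes the common convention that skew is (G − C)/(G + C) per window and that, in many bacterial genomes, the minimum of the cumulative curve lies near the origin of replication. The function names, the window size, and the rotation helper are illustrative assumptions and are not taken from the tools used in this study.

```python
def cumulative_gc_skew(seq, window=1000):
    """Cumulative GC skew, (G - C) / (G + C), over non-overlapping windows."""
    seq = seq.upper()
    total = 0.0
    curve = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # Windows without G or C contribute zero skew.
        skew = (g - c) / (g + c) if (g + c) else 0.0
        total += skew
        curve.append(total)
    return curve


def rotate_circular(seq, new_start):
    """Rotate a circular sequence so that 0-based position new_start comes first."""
    if not seq:
        return seq
    new_start %= len(seq)
    return seq[new_start:] + seq[:new_start]
```

When the cumulative curve has no clear minimum, as reported here for some species, fixing the start coordinate to a region shared with published genomes (the approach described above) corresponds to calling `rotate_circular` with a position chosen by homology rather than by skew.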
Genomes of γ-proteobacterial symbionts were annotated as described for Tremblaya genomes, except that several approaches were used to assist in pseudogene annotation. Proteins split into two or more ORFs were joined into a single pseudogene feature. All proteins were then searched against the National Center for Biotechnology Information (NCBI) nonredundant protein database (NR), and their lengths were compared. If the endosymbiont protein was shorter than 60% of its 10 top hits, it was called a pseudogene unless it was known to be a bifunctional protein and at least one of its domains was intact. All intergenic regions were then screened by BlastX (e value 1e−4) against NR to reveal pseudogene remnants.
Multigene matrices of conserved orthologous genes for β-proteobacteria (49 genes) and Enterobacteriaceae (80 genes) were generated by the PhyloPhlAn package (93). Sequences of genes for 16S and 23S rRNA were downloaded from the NCBI nucleotide database and used for Tremblaya- and Sodalis-allied, species-rich phylogenies. All matrices were aligned by the MAFFT v6 L-INS-i algorithm (94). Ambiguously aligned positions were excluded by trimAl v1.2 (95) with the -automated1 flag set for likelihood-based phylogenetic methods. Maximum likelihood (ML) and Bayesian inference (BI) phylogenetic methods were applied to the single-gene and concatenated amino acid alignments. ML trees were inferred using RAxML 8.2.4 (96) under the LG + G model with the subtree pruning and regrafting tree search algorithm and 1,000 bootstrap pseudoreplicates. BI analyses were conducted in MrBayes 3.2.2 (97) under the LG + I + G model with 5 million generations [prset aamodel = fixed(lg), lset rates = invgamma ngammacat = 4, mcmcp checkpoint = yes ngen = 5,000,000]. Concatenated 16S–23S rRNA gene phylogenies for mealybug endosymbionts were inferred as above, except that the GTR + I + G model was used.
For BI analyses, a proportion of invariable sites (I) was estimated from the data, and heterogeneity of evolutionary rates was modeled by four substitution rate categories of the γ- (G) distribution with the γ-shape parameter (α) estimated from the data. Exploration of Markov chain Monte Carlo convergence and burn-in determination were performed in AWTY (ceb.csit.fsu.edu/awty) and Tracer v1.5 (evolve.zoo.ox.ac.uk). Additionally, concatenated protein and Dayhoff6 recoded datasets were analyzed under the CAT + GTR + G model in PhyloBayes MPI 1.5a (98). Posterior distributions obtained under four independent PhyloBayes runs were compared using the tracecomp and bpcomp programs, and runs were considered converged at a maximum discrepancy value <0.1 and a minimum effective size >100.
Tremblaya genomes were aligned using progressiveMauve v2.3.1 (99). Clusters of orthologous genes were generated using OrthoMCL v1.4 (100). Orthologs missed because of low homology (BLAST e value 1e−5) were curated with the help of identical gene order and annotations. All genomes were visualized as linear with links connecting positions of orthologous genes in Processing 3 (https://processing.org/). Additional figures were drawn or curated in Inkscape (https://inkscape.org/en/).
Contamination Screening and Filtering of Draft Mealybug Genomes. The presence of additional species, such as facultative symbionts, environmental bacteria, and contamination, in the genome data was visualized by Taxon-Annotated GC Coverage (TAGC; drl.github.io/blobtools/) plots (101, 102), and the tool was also used to extract contigs of two γ-proteobacterial symbionts from the Pseudococcus longispinus mealybug and Wolbachia sp. from the Maconellicoccus hirsutus mealybug. We confirmed that there were no other organisms present in our data at high coverage, except the expected endosymbionts.
Although there are now reliable methodologies to remove the majority of contamination from data sequenced using several independent libraries (102, 103), recognizing low-coverage contamination (in our case, mostly of bacterial, human, and plant origin) from single-library sequencing data can be problematic. Using the TAGC tool, we were able to recognize low-coverage Propionibacterium spp. and human contamination in several of the samples (megablast e value 1e−25) and plant contamination in the P. longispinus sample. These short sequences were filtered out, and all (nonsymbiont) contigs or scaffolds shorter than 200 bp and/or having coverage lower than 3× were also excluded from the total assemblies.
Draft Insect Genomes and HGTs. Endosymbiont contigs and PhiX contigs (from the PhiX spike-in of Illumina libraries) were excluded from assemblies, and insect genome assemblies were evaluated by QUAST v2.3 (104) for basic assembly statistics and by CEGMA v2.5 (105) and BUSCO v1.1 (106) with the Arthropoda dataset for gene completeness (Table S2). Because RNA-sequencing data needed to properly annotate the draft genomes were lacking, only preliminary gene predictions were carried out by unsupervised GeneMark-ES (107) runs to get exon structures for scaffolds with HGTs.
Horizontally transferred genes previously identified in the Planococcus citri genome were used as queries for BlastN, tBlastN, and tBlastX searches against custom databases made of scaffolds from individual species. Additionally, two approaches were used to minimize false negative results possibly caused by highly diverged and/or fragmented HGTs undetected by BLAST searches. First, nucleotide alignments of individual HGTs (see above) were used as Hidden Markov Model profiles in nhmmer (108) searches against scaffolds of individual assemblies. Second, BLAST databases were made out of all raw fastq reads and searched by tBlastN using protein HGTs from P. citri as queries.
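The length and coverage filter described in the contamination screening section can be sketched as follows. This is a minimal illustration, assuming each contig is already annotated with its length, coverage, and symbiont status; the `Contig` class and `keep_contig` function are hypothetical names and do not reproduce the authors' scripts.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    length: int              # contig or scaffold length in bp
    coverage: float          # mean coverage of the contig
    is_symbiont: bool = False


def keep_contig(contig, min_length=200, min_coverage=3.0):
    """Apply the stated rule: nonsymbiont contigs shorter than 200 bp
    and/or with coverage below 3x are excluded."""
    if contig.is_symbiont:
        # The rule is stated for nonsymbiont contigs only.
        return True
    return contig.length >= min_length and contig.coverage >= min_coverage


def filter_assembly(contigs):
    """Return the contigs that pass the length and coverage thresholds."""
    return [c for c in contigs if keep_contig(c)]
```

Symbiont contigs pass this particular filter untouched because the text applies the thresholds to nonsymbiont sequences; their later exclusion from the insect assemblies is a separate step.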
Husnik and McCutcheon www.pnas.org/cgi/content/short/1603910113 1 of 9


Lineage-specific candidates of HGT were detected as reported previously (9) using the NR database (downloaded March 17, 2015). We used stringent screening criteria: only genes present on long scaffolds containing insect genes or present in several mealybug genomes were considered as strongly supported HGT candidates here (Table S3). Moreover, all scaffolds of HGT candidates presented here were confirmed by mapping raw read data and manually examined for low-coverage regions and potential misassemblies created by the joining of low-coverage contigs of bacterial contaminants with bona fide insect contigs.

A multigene mealybug phylogeny was inferred as above using 419 concatenated protein sequences of the core eukaryotic proteins identified from six mealybug genomes by the CEGMA package. Phylogenetic trees for individual HGTs were inferred as reported previously (9), except that the workflow was implemented using the ETE3 Python Toolkit (109).

Microscopy. Whole-mealybug individuals stored in absolute ethanol were postfixed with 4% (vol/vol) paraformaldehyde in PBS for 1 h; dehydrated by 1-h incubations in 80%, 90%, and 100% (vol/vol) ethanol; cleared in xylene two times for 1 h each; and paraffin embedded overnight. Paraffin blocks were sectioned to 5–7 μm sections, deparaffinized in xylene two times for 5 min each, and then hydrated through a 100%, 85%, and 70% (vol/vol) ethanol series. Hybridization was done according to the work by van Leuven et al. (110). No-probe and RNase A controls were used to assess insect tissue autofluorescence. The following fluorochrome-labeled oligonucleotide probes targeting 16S rRNA were used for endosymbiont in situ hybridization of M. hirsutus [TPMHIR: 5′-Cy3-ATGCCACCCTTCCTCCCGAA-3′; Doolittlea endobia MHIR (DEMHIR): 5′-Cy5-CTTTCATTTTCTTCCCCGTT-3′] and Paracoccus marginatus [TPPMAR: ACGCCCYCCTTCATCCCGAA; Mikella endobia PMAR (MEPMAR): 5′-Cy5-TAATAACTTTCTTCCTTGCT-3′]. An Olympus FV 1000 IX Inverted Laser Scanning Confocal Microscope was used for imaging with 60× and 100× oil immersion lenses. Image postprocessing was done in Fiji v1.51a (111).
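The TPPMAR probe listed above contains the degenerate base Y (C or T under the IUPAC nucleotide code). As a minimal sketch, the helper functions below, which are illustrative and not part of the published workflow, expand a degenerate probe into its concrete variants and compute the reverse complement, that is, the DNA-alphabet sequence a probe would pair with.

```python
from itertools import product

# IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Complement table that also covers the degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def expand_degenerate(probe):
    """List every non-degenerate sequence encoded by an IUPAC probe."""
    choices = [IUPAC[base] for base in probe.upper()]
    return ["".join(combo) for combo in product(*choices)]


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (IUPAC codes allowed)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    for variant in expand_degenerate("ACGCCCYCCTTCATCCCGAA"):
        print(variant, reverse_complement(variant))
```

Because the probes target rRNA, the printed reverse complements correspond to the target region written in the DNA alphabet (T in place of U).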


Fig. S1. Supplementary phylogenetic trees. Values at nodes represent support from ML bootstrap pseudoreplicates. (A) Multigene ML phylogeny of Tremblaya within β-proteobacteria inferred from 49 concatenated protein sequences. (B) Zoomed-in Tremblaya ML phylogeny inferred from the 16S–23S rRNA alignment. (C) Multigene mealybug ML phylogeny inferred from 419 concatenated CEGMA protein sequences. (D) ML phylogeny of γ-proteobacterial symbionts inferred from the 16S–23S rRNA alignment. Clade labels A–G were adopted from the work by Thao et al. (43).


Fig. S2. Schematic diagrams of insect scaffolds containing HGTs involved in amino acid and B vitamin metabolism. Insect exons (predicted by GeneMark-ES) are color-coded as green rectangles and, when in close proximity to HGTs, annotated by their putative functions. Genes of bacterial origin are highlighted in yellow. (A) Genome localization of bioABD, ribAD, lysA, dapF, and tms HGTs confirming that they are present on insect scaffolds. Only the longest scaffold for each HGT is shown, because the scaffolds from different mealybug species share gene order. (B) Alignments of M. hirsutus, P. marginatus, and F. virgata scaffolds showing cysK acquisition after divergence of the Maconellicoccus clade and cysK duplication in F. virgata (also present in P. citri and P. longispinus) and riboflavin transporter duplication in P. marginatus.


Fig. S3. FISH confirming that intrabacterial symbionts reside inside Tremblaya cells in (A) M. hirsutus and (B) P. marginatus mealybugs. Tremblaya cells are in green, and γ-proteobacterial symbionts (DEMHIR and MEPMAR) are in red. (Scale bar: 10 μm.)


Table S1. Extended assembly metrics for draft mealybug genomes

| Assembly metric | MHIR | FVIR | PCIT (reassembly) | PLON | TPER | PMAR |
|---|---|---|---|---|---|---|
| Total assembly size (bp) | 163,044,544 | 304,570,832 | 377,829,872 | 284,990,201 | 237,582,518 | 191,208,351 |
| Total no. of scaffolds | 12,889 | 32,723 | 167,514 | 66,857 | 80,386 | 60,102 |
| No. of scaffolds ≥1,000 bp | 8,043 | 21,984 | 64,930 | 40,284 | 58,090 | 33,617 |
| Largest scaffold (bp) | 393,850 | 322,873 | 82,122 | 182,788 | 54,847 | 76,575 |
| N50 / N75 | 47,025 / 22,300 | 25,562 / 12,551 | 7,078 / 3,639 | 10,126 / 4,908 | 4,681 / 2,689 | 6,799 / 3,788 |
| G + C (%) | 35.3 | 34.2 | 34.3 | 33.7 | 31.5 | 36.1 |
| No. of Ns per 100 kbp | 97.8 | 20.7 | 152.6 | 26.2 | 8.8 | 34.1 |
| CEGMA complete (of 248) | 239 (96.37%) | 239 (96.37%) | 236 (95.16%) | 229 (92.34%) | 236 (95.16%) | 242 (97.58%) |
| CEGMA complete plus partial | 246 (99.19%) | 243 (97.98%) | 245 (98.79%) | 244 (98.39%) | 247 (99.60%) | 245 (98.79%) |
| BUSCOs Eukaryota (n = 429) | C:85% [D:7.4%], F:3.0%, M:11% | C:84% [D:5.1%], F:3.9%, M:11% | C:80% [D:6.9%], F:7.2%, M:11% | C:78% [D:3.4%], F:9.0%, M:12% | C:77% [D:4.1%], F:10%, M:12% | C:82% [D:5.8%], F:5.5%, M:11% |
| BUSCOs Arthropoda (n = 2,675) | C:76% [D:3.5%], F:14%, M:9.4% | C:76% [D:3.3%], F:13%, M:9.9% | C:71% [D:4.8%], F:16%, M:12% | C:70% [D:2.3%], F:16%, M:13% | C:66% [D:2.3%], F:16%, M:16% | C:72% [D:3.0%], F:15%, M:12% |

All values were calculated without endosymbiont and low-coverage contamination contigs. BUSCOs Arthropoda assessments for Acyrthosiphon pisum genome assembly as a reference: C:72% [D:6.1%], F:15%, M:12%. C, complete; D, duplicated; F, fragmented; M, missing.
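For readers unfamiliar with the N50 and N75 metrics reported in Table S1, the sketch below uses the standard definition: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least 50% (or 75%) of the assembly. The function name `nx` and the toy lengths are illustrative assumptions; the published values were produced by QUAST, not by this code.

```python
def nx(lengths, fraction):
    """Return the Nx statistic (e.g., fraction=0.5 for N50) of scaffold lengths."""
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that pushes the running total past the
        # threshold defines the statistic.
        if running >= threshold:
            return length
    return ordered[-1]


if __name__ == "__main__":
    toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
    print("N50:", nx(toy_scaffolds, 0.50))
    print("N75:", nx(toy_scaffolds, 0.75))
```

N75 is never larger than N50 for the same assembly, which matches every column of Table S1.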


Table S2. Insect scaffolds containing horizontally transferred genes

Gene category and HGT | Scaffold name, length, and k-mer coverage (merged k-mers): MHIR, FVIR, PCIT, PLON, TPER, PMAR

B vitamin metabolism
bioA: NODE_1095_43437_30.5427_ID_2189*; NODE_2692_29264_45.8233_ID_5383; NODE_1158_22396_21.4066_ID_2315; NODE_13454_6321_92.0554_ID_26907; NODE_5755_7749_42.2835_ID_11509; NODE_15638_3963_48.9007_ID_31275; NODE_3702_14332_30.9377_ID_7403
bioB: NODE_206_103677_26.441_ID_411*; NODE_1537_39445_37.3168_ID_3073; NODE_1118_22642_22.6156_ID_223; NODE_11460_7325_111.564_ID_22919; NODE_386_19325_14.7471_ID_771; NODE_1524_15917_46.9302_ID_3047
bioD: NODE_407_76514_32.3402_ID_813*; NODE_10823_7722_46.6698_ID_21645; NODE_17050_6003_28.5577_ID_34099; NODE_6031_11780_41.4639_ID_12061; NODE_6741_7177_24.074_ID_13481; NODE_21598_2996_39.4689_ID_43195
ribA: NODE_36_178330_31.6542_ID_71*; NODE_854_51798_32.6572_ID_1707; NODE_12118_7709_9.0878_ID_24235; NODE_10187_8129_45.4534_ID_20373; NODE_22346_3461_14.5056_ID_44691; NODE_1442_16334_42.4795_ID_2883
ribD: NODE_3471_11646_37.1715_ID_6941; NODE_4692_19496_33.1948_ID_9383*; NODE_22359_4879_37.4948_ID_44717; NODE_4881_13443_38.1156_ID_9761; NODE_10832_5498_46.801_ID_21663; NODE_9906_5436_52.0394_ID_19811
panC: NA; NODE_1895_35506_42.1294_ID_3789*; NA; NA; NA; NA

Amino acid metabolism
cysK: NA; NODE_1251_43541_36.8307_ID_2501*; NODE_5169_12355_8.40561_ID_10337; NODE_6319_11425_96.4325_ID_12637; NODE_5086_8195_13.8618_ID_10171; NODE_317_27801_43.9358_ID_633; NODE_1576_20002_20.754_ID_3151; NODE_28193_2829_70.8861_ID_56385; NODE_3332_15001_36.8971_ID_6663
dapF: NODE_2062_24285_26.5533_ID_4123*; NODE_5954_15883_38.1428_ID_11907; NODE_962_23955_17.9039_ID_1923; NODE_20465_4268_36.1901_ID_40929; NODE_6454_7335_15.475_ID_12907; NODE_28986_1694_175.113_ID_57971
lysA: NODE_59_148786_27.0847_ID_117; NODE_4_297799_35.7395_ID_7*; NODE_30394_3749_14.3249_ID_60787; NODE_8644_9224_44.8211_ID_17287; NODE_7424_6818_19.9689_ID_14847; NODE_1012_18919_46.8622_ID_2023
tms: NODE_1166_41617_26.8228_ID_2331*; NA; NODE_6634_10945_16.722_ID_13267; NODE_13050_6499_55.789_ID_26099; NODE_34438_2417_29.2671_ID_68875; NODE_8338_6146_45.9414_ID_16675; NODE_5474_3263_6.97353_ID_10947; NODE_7749_10066_15.9657_ID_15497; NODE_25746_3290_140.852_ID_51491; NODE_6297_7425_23.0654_ID_12593; NODE_4174_9644_62.5393_ID_8347; NODE_11115_8160_6.39149_ID_22229; NODE_5435_12567_33.0441_ID_10869; NODE_19614_3786_320.495_ID_39227; NODE_12227_4696_36.585_ID_24453; NODE_3006_17561_42.3735_ID_6011; NODE_34895_2381_28.3895_ID_69789

Peptidoglycan metabolism
murA: NA; NODE_460_66309_43.1341_ID_919*; NODE_11354_8054_20.7208_ID_22707; NODE_115_61230_35.6962_ID_229; NA; NA
murB: NA; NODE_12758_5717_33.0994_ID_25515 (possible pseudogene); NODE_369_31461_35.5337_ID_737*; NODE_20534_4254_49.7111_ID_41067; NA; NA
murC: NA; NA; NODE_1601_19897_15.255_ID_3201*; NODE_22793_3812_49.6747_ID_45585; NA; NA
murD: NA; NA; NODE_13782_7024_6.68302_ID_27563*; NODE_24019_3587_40.034_ID_48037; NA; NA
murE: NA; NA; NODE_6492_11057_8.87711_ID_12983*; NODE_17363_4962_30.2712_ID_34725; NA; NA
murF: NA; NA; NODE_594_27680_14.487_ID_1187*; NODE_4718_13704_42.3641_ID_9435; NA; NA
amiD: NODE_127_124060_24.8687_ID_253; NA; NODE_37192_2984_10.5592_ID_74383; NA; NA; NA
mltB: NA; NA; NODE_19703_5383_17.8893_ID_39405; NA; NA; NA
β-Lactamase: NA; NODE_5744_16411_32.7118_ID_11487; NODE_41741_2494_87.3403_ID_83481; NODE_4491_14129_34.1056_ID_8981; NODE_27744_2940_11.9002_ID_55487; NODE_1286_17279_50.7695_ID_2571; NODE_15550_3679_32.2136_ID_31099; NODE_9718_8869_20.8995_ID_19435; NODE_16462_5218_33.5475_ID_32923; NODE_14178_4665_12.505_ID_28355; NODE_19161_5497_22.6318_ID_38321; NODE_27195_3022_166.648_ID_54389; NODE_28052_2913_15.796_ID_56103; NODE_24154_3566_51.4341_ID_48307; NODE_6508_7297_15.3206_ID_13015; NODE_2155_20606_37.4645_ID_4309*
ddlB: NA; NODE_52_125214_31.955_ID_103*; NODE_2593_16610_22.7297_ID_5185; NODE_7871_9825_39.2015_ID_15741; NODE_29017_2831_20.6286_ID_58033; NA

Other
DUR1,2: NA; NA; NODE_1398_20965_17.3176_ID_2795* (both urea carboxylase and allophanate hydrolase); NODE_2264_20156_44.516_ID_4527 (only allophanate hydrolase); NA; NA
gshA: NA; NA; NODE_33435_3399_30.6965_ID_66869; NA; NA; NA
Type III effector: NA; NODE_4508_20239_32.5829_ID_9015; NODE_2326_17345_10.1749_ID_4651 (+ more than 10 other copies); NODE_935_28982_40.2817_ID_1869 (+ more than 10 other copies); NODE_174_23133_18.8075_ID_347 (+ more than 10 other copies); NODE_1751_14972_78.5305_ID_3501; NODE_932_49868_38.3909_ID_1863; NODE_31_48312_40.329_ID_61; NODE_955_49231_37.1248_ID_1909; NODE_4166_9650_48.8279_ID_8331; NODE_444_67489_35.7914_ID_887*; NODE_6268_7513_43.5062_ID_12535; NODE_1448_40490_35.3609_ID_2895
chitinase: NA; NA; NA; NA; NODE_1934_11960_15.4634_ID_3867; NODE_378_26435_39.2884_ID_755*
rlmI: NA; NA; NODE_8054_9863_19.3512_ID_16107; NA; NA; NA
AAA-ATPases: NODE_36_178330_31.6542_ID_71 (+ numerous other hits); NODE_854_51798_32.6572_ID_1707 (+ numerous other hits); NODE_3869_14076_36.1782_ID_7737 (+ numerous other hits); NODE_3376_16544_40.2929_ID_6751 (+ numerous other hits); NODE_4822_8396_19.8446_ID_9643 (+ numerous other hits); NODE_1442_16334_42.4795_ID_2883 (+ numerous other hits)

    Anky

    rinrepea

    tprotein

    (likelyopposite

    HGT

    direction;i.e

    .,from

    insectsto

    Wolbachia)

    NA

    NODE_

    942_

    4956

    4_41

    .794

    3_ID_1

    883

    (+numerousother

    hitsto

    anky

    rinproteins)

    NODE_

    1287

    _216

    00_2

    0.31

    77_ID_2

    573

    (+numerousother

    hitsto

    anky

    rinproteins)

    NODE_

    1876

    _218

    72_3

    8.68

    57_ID_3

    751

    (+numerousother

    hitsto

    anky

    rinproteins)

    NODE_

    2986

    _102

    56_2

    3.74

    63_ID_5

    971

    (+numerousother

    hitsto

    anky

    rinproteins)

    NODE_

    1130

    _180

    92_4

    7.76

    32_ID_2

    259

    (+numerousother

    hitsto

    anky

    rinproteins)

    NA,notap

    plicab

    le.

    *Longestscaffoldsforea

    choftheHGTcandidate.
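The scaffold identifiers listed above follow a regular pattern once their fragments are joined (for example, NODE_5744_16411_32.7118_ID_11487, optionally followed by the asterisk that marks the longest scaffold). The sketch below shows one way such identifiers could be split into fields for downstream filtering. The field names `length` and `coverage` are assumptions based on the typical layout of assembler output; the table itself does not state what the second and third numbers mean, and `parse_scaffold` is a hypothetical helper, not part of the authors' pipeline.

```python
import re
from typing import NamedTuple


class Scaffold(NamedTuple):
    node: int        # number following "NODE_"
    length: int      # ASSUMPTION: second field is the scaffold length
    coverage: float  # ASSUMPTION: third field is a coverage value
    ident: int       # number following "_ID_"
    longest: bool    # True when the identifier carries the "*" footnote mark


# NODE_<node>_<length>_<coverage>_ID_<ident> with an optional trailing "*"
_SCAFFOLD_RE = re.compile(r"^NODE_(\d+)_(\d+)_(\d+(?:\.\d+)?)_ID_(\d+)(\*?)$")


def parse_scaffold(name: str) -> Scaffold:
    """Split one scaffold identifier into its numeric fields."""
    match = _SCAFFOLD_RE.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognized scaffold identifier: {name!r}")
    node, length, coverage, ident, star = match.groups()
    return Scaffold(int(node), int(length), float(coverage), int(ident), star == "*")


if __name__ == "__main__":
    for raw in ("NODE_5744_16411_32.7118_ID_11487", "NODE_2155_20606_37.4645_ID_4309*"):
        print(parse_scaffold(raw))
```

Parsing into a named tuple keeps the footnote marker separate from the numeric identifier, so the starred scaffolds can be selected without string matching.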


  • Table S3. Overview of evidence that the HGTs are encoded on the insect genomes

    HGT | Phylogenetic origin (does not necessarily mean donor) | Present in several mealybug species and forms a single clade | Other bacterial genes on the scaffold | Insect genes on the insect scaffolds | Overall HGT evidence
    bioA | α-Proteobacteria: Rickettsiales | Yes | No | Yes | Strong support
    bioB | α-Proteobacteria: Rickettsiales | Yes | No | Yes | Strong support
    bioD | α-Proteobacteria: Rickettsiales | Yes | No | Yes | Strong support
    ribA | γ-Proteobacteria: Enterobacteriales | Yes | AAA-ATPase HGT | Yes | Strong support
    ribD | α-Proteobacteria: Rickettsiales | Yes | No | Yes | Strong support
    panC | β-Proteobacteria | No, only FVIR | No | Yes | Moderate support
    cysK | γ-Proteobacteria: Enterobacteriales | Yes | No | Yes | Strong support
    dapF | α-Proteobacteria: Rickettsiales | Yes | No | Yes | Strong support
    lysA | α-Proteobacteria: Rickettsiales | Yes | No | Yes | Strong support
    tms | γ-Proteobacteria or β-Proteobacteria | Yes | No | Yes | Strong support
    murA | γ-Proteobacteria: Enterobacteriales | Yes | No | Yes | Strong support
    murB | Bacteroidetes | Yes | No | Yes | Strong support
    murC | Bacteroidetes (PCIT); α-Proteobacteria: Rickettsiales (PLON) | Yes but different origin | No | Yes | Moderate support
    murD | α-Proteobacteria: Rickettsiales | Yes | No | Yes | Strong support
    murE | α-Proteobacteria: Rickettsiales | Yes | No | No | Moderate support
    murF | α-Proteobacteria: Rickettsiales | Yes | No | Yes | Strong support
    amiD | γ-Proteobacteria: Enterobacteriales (PCIT); α-Proteobacteria: Rickettsiales (MHIR) | Yes but different origin | No | Yes | Moderate support
    mltB | γ-Proteobacteria: Enterobacteriales | No, only PCIT | No | No | Weaker support
    b-Lactamase | γ-Proteobacteria: Enterobacteriales | Yes | No | Yes | Strong support
    ddlB | α-Proteobacteria: Rickettsiales | Yes | No | Yes | Strong support
    DUR1,2 | γ-Proteobacteria: Enterobacteriales | Yes but different origin | No | No | Moderate support
    gshA | γ-Proteobacteria: Enterobacteriales | No, only PCIT | No | No | Weaker support
    Type III effector | γ-Proteobacteria or β-Proteobacteria | Yes | No | Yes | Strong support
    chitinase | γ-Proteobacteria or β-Proteobacteria | Yes | No | Yes | Strong support
    rlmI | γ-Proteobacteria: Enterobacteriales | No, only PCIT | No | No | Weaker support
    AAA-ATPases | α-Proteobacteria: Rickettsiales | NA | ribA HGT | Yes | Moderate support
    Ankyrin repeat proteins | α-Proteobacteria: Rickettsiales | NA | No | Yes | Moderate support

    Related to Fig. 4 and Fig. S2. NA, not applicable.

    Table S4. Tremblaya primers

    Genome region | Forward primer | Reverse primer(s)
    leuA_fwd ↔ rpsO_rRNA_fwd_rev | CTAAGGGCTGAGGACGTTGG | CCCCTACGCAGCCTGTTTAT
    rpsO_rRNA_fwd_rev ↔ prs_rev | CCCCTACGCAGCCTGTTTAT | GGGTAGCTCAGCGGTAAGAG
    tRNA_Gly_fwd ↔ rsmH_rev | GCCTAGTGCAGGGATAGAAGG | CACTGAGGCTCTGAGTTGGC
    tRNA_Gly_fwd ↔ 23S_rRNA_rev1 | GCCTAGTGCAGGGATAGAAGG | CGTTGATAGGCTGGGTGTGT
    tRNA_Gly_fwd ↔ 23S_rRNA_rev2 | GCCTAGTGCAGGGATAGAAGG | AAGTTCCGACCTGCACGAAT
    argG_fwd ↔ rib_pseudo_rev | CCCTGGCCTATGCTTCTGAC | GGAGGTCAGATTCGAGGCAG
    ilvD_fwd ↔ hypothetical_protein_rev | ATAAGGAGGAGGGTGCCTGT | GTGATGGTGTTAGGTTGCGG

    These primers were used to close the duplicated rRNA operons and one additional region that broke the assemblies of five T. princeps genomes.
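As a quick sanity check on the primer sequences in Table S4, the sketch below tabulates the length and GC fraction of each primer. The sequences are transcribed from the table; the helper name `gc_fraction` and the choice of these two properties are illustrative assumptions, not something the authors report computing.

```python
# Primer pairs transcribed from Table S4: genome region -> (forward, reverse).
PRIMER_PAIRS = {
    "leuA_fwd ↔ rpsO_rRNA_fwd_rev": ("CTAAGGGCTGAGGACGTTGG", "CCCCTACGCAGCCTGTTTAT"),
    "rpsO_rRNA_fwd_rev ↔ prs_rev": ("CCCCTACGCAGCCTGTTTAT", "GGGTAGCTCAGCGGTAAGAG"),
    "tRNA_Gly_fwd ↔ rsmH_rev": ("GCCTAGTGCAGGGATAGAAGG", "CACTGAGGCTCTGAGTTGGC"),
    "tRNA_Gly_fwd ↔ 23S_rRNA_rev1": ("GCCTAGTGCAGGGATAGAAGG", "CGTTGATAGGCTGGGTGTGT"),
    "tRNA_Gly_fwd ↔ 23S_rRNA_rev2": ("GCCTAGTGCAGGGATAGAAGG", "AAGTTCCGACCTGCACGAAT"),
    "argG_fwd ↔ rib_pseudo_rev": ("CCCTGGCCTATGCTTCTGAC", "GGAGGTCAGATTCGAGGCAG"),
    "ilvD_fwd ↔ hypothetical_protein_rev": ("ATAAGGAGGAGGGTGCCTGT", "GTGATGGTGTTAGGTTGCGG"),
}


def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C bases in a primer sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    for region, (fwd, rev) in PRIMER_PAIRS.items():
        print(
            f"{region}: fwd {len(fwd)} nt (GC {gc_fraction(fwd):.2f}), "
            f"rev {len(rev)} nt (GC {gc_fraction(rev):.2f})"
        )
```

Note that the reverse primer of the first pair and the forward primer of the second pair are the same oligonucleotide, which matches its name, rpsO_rRNA_fwd_rev.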

    Dataset S1. Phylogenetic trees for individual HGTs

    Dataset S1

    Values at nodes represent support from ML bootstrap pseudoreplicates. Extremely short inner branches were extended by dashed lines for better legibility. a, bioA; b, bioD; c, bioB; d, ribA; e, tms; f, cysK; g, ribD; h, panC; i, DUR1 and DUR12; j, dapF; k, lysA; l, b-lact; m, chiA; n, amiD; o, ddlB; p, murA; q, murB; r, murC; s, murD; t, murE; u, murF.


    http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1603910113/-/DCSupplemental/pnas.1603910113.sd01.pdf

  • Dataset S2. Tremblaya gene information

    Dataset S2

    (A) Gene order, functional categories from Clusters of Orthologous Groups (COG), Enzyme Commission (E.C.) numbers, protein products, and gene abbreviations for all Tremblaya genomes. The Tremblaya phenacola PAVE inversion is designated by light yellow color, pseudogenes are in red, noncoding RNAs are in magenta (tRNAs are not shown), and hypothetical proteins are in blue. (B) Raw data to reproduce Fig. 3. There are two copies of leuA in TPTPER and two copies of aroDQ in DEMHIR. Only glyS is found in TPPAVE and not glyQ. 0, Missing gene; 1, found on the endosymbiont genome; 2, pseudogene; 3, HGT found on the insect genome; 4, insect gene.
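The numeric codes in part B of Dataset S2 (0 to 4) can be decoded programmatically when working with the raw data. The sketch below is a minimal illustration assuming each gene is represented as a sequence of integer codes, one per genome; the names `GeneState` and `summarize`, and the example row, are hypothetical and not taken from the dataset.

```python
from collections import Counter
from enum import IntEnum
from typing import Iterable


class GeneState(IntEnum):
    """Codes from the Dataset S2B legend."""
    MISSING = 0               # missing gene
    ENDOSYMBIONT = 1          # found on the endosymbiont genome
    PSEUDOGENE = 2            # pseudogene
    HGT_ON_INSECT_GENOME = 3  # HGT found on the insect genome
    INSECT_GENE = 4           # insect gene


def summarize(codes: Iterable[int]) -> Counter:
    """Count how often each state occurs; unknown codes raise ValueError."""
    return Counter(GeneState(code).name for code in codes)


if __name__ == "__main__":
    example_row = [1, 1, 2, 0, 3, 4, 1]  # hypothetical codes, not real data
    print(summarize(example_row))
```

Using an enumeration rather than bare integers means that a value outside the legend (for example, 5) fails loudly instead of being silently counted.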


    http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1603910113/-/DCSupplemental/pnas.1603910113.sd02.xls