Genomics of the capybara, two emblematic Colombian species
María José Gómez-Hughes¹, Santiago Herrera-Álvarez¹,², Andrew J. Crawford¹
¹Department of Biological Sciences, Universidad de los Andes, Bogotá, 111711, Colombia.
²Department of Ecology and Evolution, University of Chicago, Chicago, IL 60637, USA.
Abstract
Capybaras, which are native to South America, are not only the largest rodents in the world, but they
also have a number of other characteristics that make them unique. They are semi-aquatic, grazing
mammals and live in large groups where females engage in communal breeding. Males communally
defend the territory through scent-marking with a specialized gland called the morillo and with two
anal glands. Here we present the first genome assembly and annotation for the lesser capybara,
Hydrochoerus isthmius, as well as the first transcriptome assembly for the capybara, H. hydrochaeris,
both of which are comparable in completeness to previously published rodent genomes, and we
compare them with the previously published genome assembly for the capybara. We found evidence
of reductions in the effective population size of both species, as well as large regions of genomic
rearrangement relative to the guinea pig. Our phylogenetic analysis is consistent with previous phylogenies
reported for the suborder Hystricomorpha, but at the species level there is evidence that the capybara
is a paraphyletic species. We hope that this study contributes to conservation efforts for these
species, as well as to a better understanding of the characteristics that make them unique.
Resumen
Los chigüiros, nativos de América del Sur, no solamente son los roedores más grandes del mundo sino
que también tienen otras características que los hacen únicos. Son especies de mamíferos semiacuáticas
que pastan y viven en grandes grupos en los que las hembras crían comunitariamente a sus crías y los
machos defienden sus territorios mediante marcajes con el morillo, una glándula especializada, y dos
glándulas anales. Aquí presentamos el primer ensamblaje genómico del chigüiro menor,
Hydrochoerus isthmius, y el primer transcriptoma del chigüiro, H. hydrochaeris, como también
comparaciones con el genoma del chigüiro publicado anteriormente. Encontramos evidencia de
reducciones poblacionales de ambas especies, como también rearreglos genómicos en comparación
con el conejillo de indias. Nuestro análisis filogenético es consistente con análisis publicados
previamente para el suborden Hystricomorpha, pero hay evidencia para la parafilia del chigüiro.
Esperamos que este estudio contribuya a esfuerzos de conservación en estas especies, como también a
un mejor entendimiento de esas características que los hacen únicos.
Keywords: Hydrochoerus sp., chigüiro, population genomics, conservation genomics, 10X
genomics, genome assembly.
Ethics Statement:
Tissue samples of the lesser capybara and capybara were obtained under research and collecting
permit No. 1177 issued to the Universidad de los Andes by the Autoridad Nacional de Licencias
Ambientales (ANLA; National Authority of Environmental Permits). The anesthesia and euthanasia
protocols used were approved by the Universidad de los Andes’ Comité Institucional de Cuidado
y Uso de Animales de Laboratorio (CICUAL; approval number C.FUA_14-023).
Introduction
The Order Rodentia is the most diverse group of mammals in the world in terms of species and
ecological diversity as well as morphological variation (Samuels, 2009; Fabre et al., 2012). Rodents
comprise almost 40% of all mammal species (Burgin et al., 2018) and inhabit almost all terrestrial
biomes (Hafner & Hafner, 1988). There are currently five recognized suborders of rodents:
Sciuromorpha (dormice, mountain beavers, marmots, squirrels and squirrel-like rodents),
Castorimorpha (beavers, kangaroo rats and pocket gophers), Myomorpha (hamsters, jerboas, mice,
rats and mouse-like rodents), Anomaluromorpha (scaly-tailed squirrels and springhares), and
Hystricomorpha (chinchillas, guinea pigs, gundis, porcupines and others), which are composed of 33
families (Wilson & Reeder, 2005).
Among these groups, in the suborder Hystricomorpha and family Caviidae (Wilson & Reeder,
2005), are the capybaras (genus: Hydrochoerus). Capybaras are known for being the largest extant
rodents (Figure 1A; Moreira et al., 2013), for their semi-aquatic habits (Macdonald, 1981), and for being
social animals with communal breeding and communal defense (Macdonald, 1981). Capybaras are
also known for their distinctive feeding and scent-marking behaviors. Regarding feeding, capybaras
graze on both aquatic and terrestrial herbaceous vegetation, which undergoes multiple passes through their
digestive tracts via either regurgitation or coprophagy (Lord, 1994). Regarding scent marking,
capybaras possess two types of scent glands: the morillo, a protuberance that males develop
on top of their snouts and whose size can be predictive of dominance (Rosenfield et al., 2019), and two
anal glands. Scent marking is the most frequently observed social interaction in capybaras (Emilio & Macdonald, 1994).
There are currently two described species of capybaras: the capybara, H. hydrochaeris, and the
lesser capybara, H. isthmius (Mones, 1991). The capybara inhabits eastern Colombia, eastern Venezuela, the
Guianas, Ecuador, Peru, northeastern Argentina and Uruguay, whereas the lesser capybara inhabits Panama,
western Colombia and western Venezuela (Figure 1B; Reid, 2016; Delgado & Emmons, 2016). However, some
dispute exists as to whether there are two species of capybara or only one. Some authors still refer to
the lesser capybara as a subspecies (see Correa & Jorgenson, 2009 and Carrascal, Linares & Chacón,
2011), although it is classified as its own species in the database of mammalian taxonomy (Wilson & Reeder,
2005) and by the International Union for Conservation of Nature and Natural Resources (IUCN;
Delgado & Emmons, 2016). The position of the capybaras within the Hystricomorpha phylogeny is also disputed
(see Upham & Patterson, 2015; Álvarez, Arévalo, & Verzi, 2017; Rowe & Honeycutt, 2002).
Figure 1. (A) Relative size of the two species of capybara as compared to a 1.75 m tall human. The lesser
capybara (Hydrochoerus isthmius) is shown in purple on the left while the capybara (Hydrochoerus
hydrochaeris) is shown in blue on the right. (B) Geographic ranges of the capybara (blue) and the lesser
capybara (purple) according to the IUCN (Reid, 2016; Delgado & Emmons, 2016). (C) A picture showing a
male capybara with its morillo indicated by the red arrow. All images used were labeled for noncommercial use
with modifications from Wiki Commons.
Currently, the capybara is listed as a species of Least Concern by the IUCN (Reid, 2016), but
some concerns have arisen over the years and substantial population declines have been noted in this
species (Corriale & Herrera, 2014). The lesser capybara is listed as Data Deficient due to lack of
baseline studies on the status of these populations (Delgado & Emmons, 2016). This species has been
neglected in studies of conservation despite being harvested for meat, leather and fat (Pinheiro &
Moreira, 2013) and despite the threats to its native habitat (Aldana-Domínguez, Vieira-Muñoz &
Bejarano, 2013).
In this paper we present the first genome assembly for the lesser capybara and the first
transcriptome assembly for the capybara, as well as new analyses for the previously published
capybara genome assembly by Herrera-Álvarez et al. (2018). We compare demographic changes over
time of both species, compare synteny between the two capybara species and between each species
and the guinea pig (Cavia porcellus), evaluate distinctiveness of the capybaras as independent
monophyletic groups, assess the position of Hydrochoerus in the Hystricomorpha tree, and analyze
differentially expressed genes among tissues from the capybara.
Materials and Methods
Genome assembly of the capybara
Tissue from a wild-caught but captive-raised capybara (H. hydrochaeris), reportedly from
Bolivia, was donated by the San Diego Zoo’s Frozen Zoo to the 200 Mammalian Genomes Project
led by the Broad Institute, which then sequenced and assembled a draft genome using the sequencing
and assembly method of DISCOVAR de novo (Weisenfeld et al., 2014). This assembly was then
‘upgraded’ using Chicago libraries provided by Dovetail Genomics (Putnam et al., 2016) and financed
by Colciencias. Details on the final assembly can be found in Herrera-Álvarez et al. (2018).
Genome assembly of the lesser capybara
Tissue samples of the lesser capybara were collected from one juvenile H. isthmius from San
Juan del Carare, Santander, Colombia, on 22 June 2017. The specimen was subsequently accessioned into the
mammals collection of the Museo Historia Natural ANDES, Universidad de los Andes, Bogotá,
Colombia (field number AJC 7100, voucher number ANDES-M 2300). This sample was sequenced
using 10X Genomics linked-read technology in two lanes of an Illumina HiSeq X10. The resulting reads
were run through longranger v2.2.2 (10X Genomics) to estimate the genome size and heterozygosity, and
to process the barcodes.
These reads were assembled with Supernova v2.0.1 (Weisenfeld et al., 2017), an assembler
created by 10X Genomics that builds progressively larger contigs and applies its own trimming step
to create phased scaffolds from the reads. We used the pseudohap style of the mkoutput command so that
only one haplotype is represented in the resulting assembly.
To improve this assembly we used the following three tools. 1) Tigmint v1.1.2 was used to
produce an assembly that is both more contiguous and more correct by using the alignment of
linked reads to the draft assembly to identify and correct possible misassemblies (Jackman et al., 2018). 2)
ARCS v1.0.6 was used to add a further scaffolding step, using the information
in the linked reads to join the scaffolds most likely to be adjacent and thereby
create a more contiguous assembly (Yeo et al., 2017). 3) Sealer v2.0.2 was used to identify
intra-scaffold gaps in the draft assembly, search for flanking sequences, and then fill the gaps by
realigning the raw reads to the assembly (Paulino et al., 2015). Sealer navigates de Bruijn graphs via
Bloom filters built for each k value. We chose k values of 64, 80, 96, 112, and 128, since this
range could help us close gaps in areas of low coverage with the lower values and in
areas of high repetition with the higher values (Paulino et al., 2015).
Between each of the aforementioned steps, and at the end, we ran QUAST v5.0.2 (Gurevich
et al., 2013) to measure improvement in the quality metrics of the assembly. These metrics included (1)
contiguity statistics: number of contigs, length of the largest contig, and contig and scaffold Nx (the length in base
pairs such that all contigs of that length or longer together add up to at least x% of the length of the whole
assembly, e.g., N50), and (2) comparison to the domestic guinea pig reference genome assembly
(RefSeq accession number: GCF_000151735.1; release: Cavpor 3.0) in terms of GC content (%),
number of mismatches per 100 kilobases (Kb), and number of indels per 100 Kb (Gurevich et al.,
2013; Table 1).
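As an illustration of the Nx statistic defined above, the following minimal sketch computes N50 and N90 from a list of sequence lengths. It is not QUAST's implementation; the function name and the toy lengths are hypothetical and chosen only for the example.

```python
def nx(lengths, x=50):
    """Return Nx: the length L such that sequences of length >= L
    together cover at least x% of the total assembly length."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    threshold = sum(lengths) * x / 100
    running = 0
    # Walk from the longest sequence down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Hypothetical contig lengths (bp), for illustration only.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print("N50:", nx(toy_lengths, 50))
print("N90:", nx(toy_lengths, 90))
```

The same function applies to contigs or scaffolds; only the input list of lengths changes.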
Finally, to assess genome completeness, we ran BUSCO v3.0.2 using the Vertebrate dataset
(Waterhouse et al., 2018) and compared the percentage of BUSCO genes recovered against the
genome assemblies of other rodent species published in Ensembl (Table 2).
Transcriptome sequencing, assembly, and functional annotation
Transcriptomic data were obtained from eleven tissues representing two H. hydrochaeris
individuals of either sex (Table 3). Tissues were preserved in Nucleic Acid Preservation (NAP;
Camacho-Sánchez et al., 2013) buffer to avoid RNA degradation. RNA was extracted with standard
TRI Reagent® Solution (Ambion Inc., Austin, Texas, USA) and then cleaned using the RNeasy Plus
Mini Kit (Qiagen, Hilden, Germany) and diluted to a final volume of 30 μl with nuclease-free water.
The quantity of extracted RNA for library construction was measured with the Qubit® RNA HS Assay
kit. Complementary DNA (cDNA) libraries were constructed with the Illumina TruSeq v. 2 kit using
half reactions. The quality of the cDNA libraries was assessed using an Agilent 2100 BioAnalyzer and the Agilent
High Sensitivity DNA kit. The 11 libraries were barcoded and run together in paired-end mode on one
lane of an Illumina HiSeq 2000.
For the transcriptome assembly, we used all reads from the 11 sequenced libraries. We first
trimmed the reads with Trimmomatic v0.39 (Bolger, Lohse & Usadel, 2014) and filtered them with the
FASTX-Toolkit v0.0.14. We then normalized the reads based on median coverage and trimmed unreliable k-mers using the
khmer v1 digital normalization algorithm (Crusoe et al., 2015; Brown et al., 2012). We then used
Trinity to assemble the transcriptome from the remaining reads (Grabherr et al., 2011). To evaluate
the quality of the transcriptome assembly based on its completeness, we used Trinity scripts and BUSCO
v3.0.2 (Waterhouse et al., 2018).
To functionally annotate the transcriptome, we extracted the longest open reading frames and
predicted the most likely coding regions with Transdecoder v3.0.0 (Haas & Papanicolaou, 2015).
Then we used Trinotate to functionally annotate the predicted polypeptides and to create a database
for navigating these data (following Bryant et al., 2017). We used BLAST v2.9.0 to search for
homology hits against the UniProt Swiss-Prot database (UniProt Consortium, 2018), and identified the
Pfam protein family to which each transcript belonged (El-Gebali et al., 2018) using profile hidden
Markov models with HMMER v3.2.1 (Mistry et al., 2013). We predicted signal peptides using a deep
neural network approach with SignalP v5.0 (Armenteros et al., 2019), predicted transmembrane
protein domains using hidden Markov models with TMHMM v2.0 (Krogh et al., 2001), and assigned
inferred proteins to eggNOG functional categories (Huerta-Cepas et al., 2015) and to gene ontology
categories (GO; Ashburner et al., 2000) using BLAST v2.9.0.
Genomic repeat masking
The results of repeat masking and annotation of the capybara were taken from Herrera-Álvarez
et al. (2018), and a similar approach was taken for the lesser capybara. To repeat mask the
lesser capybara assembly, we used RepeatMasker v4.0.9_p2 (Smit, Hubley & Green, 2015), specifying
“rodentia” as the species to guide the masking using repeat evidence from other rodents. For the type
of masking, we chose a soft-masked approach, in which repeats are written in lowercase rather than replaced. This let us
see what the repeat subsequences were without them interfering with downstream analyses.
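To make the distinction concrete, the sketch below contrasts soft masking (repeats in lowercase, sequence retained) with hard masking (repeats replaced by N) and computes the masked fraction of a sequence. The function names and the toy sequence are hypothetical; this does not reproduce RepeatMasker itself.

```python
def masked_fraction(seq):
    """Fraction of bases that are soft-masked, i.e. written in lowercase."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.islower()) / len(seq)


def hard_mask(seq):
    """Convert a soft-masked sequence to a hard-masked one (repeats -> N)."""
    return "".join("N" if base.islower() else base for base in seq)


# Hypothetical soft-masked sequence: the lowercase run marks a repeat.
toy = "ACGTacgtacACGTACGTAC"
print(masked_fraction(toy))   # share of the sequence annotated as repeat
print(hard_mask(toy))         # the repeat sequence is no longer recoverable
print(toy.upper())            # soft masking can always be undone
```

Soft masking keeps the underlying bases, so tools that ignore case still see the full sequence while repeat-aware tools can skip the lowercase runs.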
Genome annotation and gene content
For annotating the lesser capybara genome, we selected three high-quality, annotated genome
assemblies from representative rodent species. We used the MAKER v2.31 pipeline (Holt & Yandell,
2011) based on proteomes from the guinea pig (Cavia porcellus, Cavpor 3.0), the house mouse (Mus
musculus; GRCm38.p6), and the common rat (Rattus norvegicus; Rnor6.0) to guide the annotation.
We downloaded the proteomes from Ensembl release 97 and clustered them with CD-HIT v4.6.1
into a single non-redundant file (Li, Jaroszewski, & Godzik, 2001). The capybara genome annotation
was taken from Herrera-Álvarez et al. (2018). We used the Swiss-Prot reviewed database (UniProt
Consortium, 2018) to add functional information to the annotations of both species based on homology hits found by
Blastp v2.9.0. Additionally, we used InterProScan v5.36-75 (Jones et al., 2014) to classify the
annotated genes into Pfam protein families (El-Gebali et al., 2018).
Microsatellites are a class of short tandem repeat (STR) motifs that are frequently used as
Mendelian markers in population genetic and kinship studies (Jarne & Lagoda, 1996). Here we define
STRs as six or more tandem copies of a dinucleotide motif, or five or more tandem copies of a motif
ranging from tri- to dodecanucleotide in length. To annotate microsatellites for both species, we used the MIcroSAtellite
identification tool (MISA-web; Beier et al., 2017).
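The STR definition above can be sketched with regular expressions. This is only an illustration of the thresholds (at least six copies for dinucleotides, at least five for motifs of 3 to 12 nucleotides); it is not how MISA is implemented, and the function names and toy sequence are hypothetical.

```python
import re


def min_repeats(unit_len):
    """Thresholds from the text: >= 6 copies for 2-nt units, >= 5 otherwise."""
    return 6 if unit_len == 2 else 5


def is_primitive(unit):
    """True if the unit is not itself a repeat of a shorter motif (e.g. 'ATAT')."""
    n = len(unit)
    return not any(n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n))


def find_strs(seq, min_unit=2, max_unit=12):
    """Return sorted (start, unit, copies) tuples for perfect tandem repeats."""
    seq = seq.upper()
    hits = []
    for k in range(min_unit, max_unit + 1):
        # A k-nt unit followed by at least (threshold - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats(k) - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if is_primitive(unit):
                hits.append((match.start(), unit, len(match.group(0)) // k))
    return sorted(hits)


# Hypothetical sequence with one (AT)7 and one (CAG)5 repeat.
toy = "GG" + "AT" * 7 + "CCT" + "CAG" * 5 + "TT"
for start, unit, copies in find_strs(toy):
    print(start, unit, copies)
```

Only perfect repeats are detected here; compound or interrupted microsatellites would need additional logic.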
Mitochondrial genome
We created a DNA database of the capybara and the lesser capybara genomes independently
and used the guinea pig mitochondrial genome (Accession number: NC_000884) as a query against
the database with Blastn v2.9.0. We then annotated the most probable scaffold to obtain the
mitochondrial genome sequence of each species using MITOS WebServer pipeline (Bernt et al.,
2013). To visualize the mitogenomes we used the CGView Server (Grant & Stothard, 2008).
Demography
We used the pairwise sequentially Markovian coalescent (PSMC; Li & Durbin, 2011) to infer
how the effective population size (Ne) may have changed over recent history. Briefly, this algorithm
reconstructs the distribution of times to the most recent common ancestor (TMRCA) along chromosomes
by examining the density of heterozygous sites (Li & Durbin, 2011). To do this, we first indexed the
assemblies, to make alignments faster and less computationally intensive, and mapped the raw reads
back to them with bwa v0.7.4 (Li & Durbin, 2009). We then sorted each alignment and
converted it from a BAM to a VCF file with SAMtools v1.8 (Li et al., 2009), called the SNPs and
indels with bcftools v1.8 (Li, 2011), and transformed this file to a fastq file with vcftools v4.2.0
(Danecek et al., 2011). We then used this file to estimate the parameters of the PSMC model, with 100
bootstrap replicates and a time-interval pattern of “4+25*2+4+6”, with psmc v0.6.5 (Li & Durbin,
2011).
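The string “4+25*2+4+6” follows the syntax of the pattern option of psmc, which groups atomic time intervals so that each group shares one population-size parameter. The sketch below expands the pattern to show how many intervals and free parameters it implies. The parser is written for this example and is not part of psmc.

```python
def parse_psmc_pattern(pattern):
    """Expand a pattern such as '4+25*2+4+6' into a list of group sizes.

    A term 'n*m' means n consecutive parameters that each span m atomic
    intervals; a bare term 'm' means one parameter spanning m intervals.
    """
    groups = []
    for term in pattern.split("+"):
        if "*" in term:
            count, size = (int(value) for value in term.split("*"))
            groups.extend([size] * count)
        else:
            groups.append(int(term))
    return groups


groups = parse_psmc_pattern("4+25*2+4+6")
print("atomic time intervals:", sum(groups))      # 4 + 50 + 4 + 6
print("free interval parameters:", len(groups))   # 1 + 25 + 1 + 1
```

Under this reading, the pattern used here defines 64 atomic intervals governed by 28 free interval parameters.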
Synteny between the capybaras and the common guinea pig
To identify genomic regions containing large rearrangements in the capybaras relative to the
guinea pig, we performed global pairwise alignments between the two capybara species, between the
capybara and the guinea pig, and between the lesser capybara and the guinea pig using bwa v0.7.4 (Li
& Durbin, 2009). To visualize these alignments, we drew links for the 100 Kb windows in which the sequences
aligned, connecting two half circles that each represent one of the assemblies, using Circos v0.69-8
(Krzywinski et al., 2009).
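One way to summarize alignments in 100 Kb windows is sketched below: each aligned position pair is assigned to a window on each assembly, and one link line is written per pair of windows. This is an assumption about how such a binning step could look, not the procedure actually used here. The scaffold names, coordinates, and the six-column link layout (name, start, end for each side) are illustrative.

```python
from collections import Counter

WINDOW = 100_000  # window size in bp, as in the text


def window_of(pos, size=WINDOW):
    """Return the (start, end) of the fixed-size window containing pos (0-based)."""
    start = (pos // size) * size
    return start, start + size - 1


def bin_alignments(alignments, size=WINDOW):
    """Count alignments per pair of windows.

    alignments: iterable of (name_a, pos_a, name_b, pos_b) tuples.
    """
    counts = Counter()
    for name_a, pos_a, name_b, pos_b in alignments:
        counts[(name_a, window_of(pos_a, size), name_b, window_of(pos_b, size))] += 1
    return counts


def to_links(counts, min_hits=1):
    """Format window pairs as 'a start end b start end' link lines."""
    lines = []
    for (a, (a0, a1), b, (b0, b1)), hits in sorted(counts.items()):
        if hits >= min_hits:
            lines.append(f"{a} {a0} {a1} {b} {b0} {b1}")
    return lines


# Hypothetical alignments between two assemblies.
toy = [
    ("capy_s1", 10_500, "cavia_s7", 250_000),
    ("capy_s1", 99_999, "cavia_s7", 200_000),
    ("capy_s1", 100_000, "cavia_s2", 5),
]
for line in to_links(bin_alignments(toy)):
    print(line)
```

Raising `min_hits` would keep only windows supported by several alignments, which can reduce visual noise in fragmented assemblies.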
Genetic diversity within Hydrochoerus
To estimate genomic divergence between northern and southern H. hydrochaeris relative to
H. isthmius, we ran a phylogenomic analysis of protein-coding sequences derived from genomic and
transcriptomic analyses, with the guinea pig as outgroup. To minimize possible problems with
paralogy, we analyzed only those genes included in the BUSCO Vertebrate orthologs dataset
(Waterhouse et al., 2018), which were obtained using BUSCO v3.0.2 from either the reference
genome assembly (H. isthmius, southern H. hydrochaeris from Bolivia) or the transcriptome (the de novo
assembly for northern H. hydrochaeris from the Colombian Llanos, and the guinea pig transcriptome;
Cavia porcellus; genome version: Cavpor3.0; accession number: GCA_000151735.1). All BUSCO
genes that were found complete in all four datasets were aligned independently with MAFFT v7.309
using a BLOSUM62 matrix (Katoh & Standley, 2013). Alignments were then trimmed with trimAl
v1.4 (Capella-Gutiérrez, Silla-Martínez & Gabaldón, 2009). To infer a species tree, we concatenated
the resulting alignments with FASconCAT-G v1.04 (Kück & Meusemann, 2010) and used IQ-TREE
v1.6.10 to select the best-fit model of substitution, for all genes, based on the corrected Akaike
information criterion (AICc) and to implement a partitioned likelihood analysis (Nguyen et al., 2014;
Chernomor, von Haeseler & Minh, 2016). Statistical support for relationships was estimated using
1000 non-parametric bootstraps for sites within partitions and 1000 likelihood ratio tests. The
resulting tree was visualized in iTOL (Letunic & Bork, 2019).
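The concatenation step that precedes a partitioned analysis can be sketched as follows: per-gene alignments are joined into one supermatrix while the coordinates of each gene are recorded as a partition. This is a simplified stand-in for what a tool such as FASconCAT-G does. The function name, gene names, taxa, and sequences are hypothetical.

```python
def concatenate(alignments, taxa):
    """Join per-gene alignments into a supermatrix with partition coordinates.

    alignments: dict mapping gene name -> dict of taxon -> aligned sequence.
    Returns (supermatrix, partitions), where partitions holds
    (gene, first_column, last_column) with 1-based inclusive coordinates.
    """
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for gene in sorted(alignments):
        aln = alignments[gene]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: aligned sequences differ in length")
        length = lengths.pop()
        for taxon in taxa:
            # A taxon missing from a gene is padded with gaps.
            pieces[taxon].append(aln.get(taxon, "-" * length))
        partitions.append((gene, start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Hypothetical two-gene, two-taxon example.
toy = {
    "geneA": {"sp1": "MKT", "sp2": "MRT"},
    "geneB": {"sp1": "LL-V", "sp2": "LLAV"},
}
matrix, parts = concatenate(toy, ["sp1", "sp2"])
print(matrix)
print(parts)
```

Keeping the partition coordinates is what allows a separate substitution model to be fitted to each gene in the downstream likelihood analysis.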
Phylogenomic analyses
To verify the position of Hydrochoerus in the Hystricomorpha phylogeny, we downloaded
from Ensembl all available proteomes of Hystricomorpha and included the mouse as an outgroup, for a
total of nine species (Table 2), and used OrthoFinder v2.3.3 to search for orthologs (Emms & Kelly,
2019). Then we extracted all single copy orthologs that were found in all nine species and performed
a pre-alignment quality filter with PREQUAL v1.02 to identify and filter non-homologous sequences
(Whelan, Irisarri & Burki, 2018). We performed a multiple sequence alignment with MAFFT v7.309
assuming a BLOSUM62 matrix for each orthologue independently (Katoh & Standley, 2013), and
then trimmed poorly aligned regions from the alignments with trimAl v1.4 to retain only the most
reliable alignments (Capella-Gutiérrez, Silla-Martínez & Gabaldón, 2009). To infer phylogenetic
relationships, we used two approaches. First, we concatenated all the alignments with FASconCAT-G
v1.04 (Kück & Meusemann, 2010) and constructed a maximum likelihood tree with RAxML v8.2.12,
using a GAMMA model for rate heterogeneity that estimates the alpha parameter, and 100 bootstraps
for statistical support (Stamatakis, 2014). Second, we estimated a species tree via a Bayesian approach
using MCMCTree implemented in PAML v4.9 (Yang, 2007). We discarded the first 2000 generations
of the Markov chain as burn-in and then sampled 20,000 trees, one every 20 iterations. The timetree
was calibrated by constraining the root to a temporal interval of 68 - 78 million years ago,
corresponding to the TMRCA of the Hystricomorpha group and the mouse (Hedges, Dudley &
Kumar, 2006). The sample of posterior trees was used to generate a Hessian matrix using CODEML
in PAML v4.9 and assuming a WAG+GAMMA model. The Hessian matrix was used to run
MCMCTree again to obtain a Bayesian consensus tree, which was visualized using the R package
MCMCTreeR (Puttick, 2019). As a check on the MCMCTree results, we used a second species-tree
approach that takes into account the individual history of each gene. We used RAxML v8.2.12 to
infer a maximum likelihood tree for each gene independently, also with a GAMMA model and 100
bootstraps per gene. The resulting gene trees were used as input to infer a species tree using NJst (Liu
& Yu, 2011).
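The selection of single copy orthologs present in all species can be sketched as a filter over a per-orthogroup gene-count table. The table layout assumed here (an "Orthogroup" column, one count column per species, and a "Total" column) is modelled on the gene-count tables that orthology tools typically write. The species labels and counts are hypothetical.

```python
import csv
import io


def single_copy_orthogroups(gene_count_tsv):
    """Return orthogroup IDs with exactly one gene in every species.

    gene_count_tsv: tab-separated text with an 'Orthogroup' column,
    one count column per species, and an optional 'Total' column.
    """
    reader = csv.DictReader(io.StringIO(gene_count_tsv), delimiter="\t")
    species = [col for col in reader.fieldnames if col not in ("Orthogroup", "Total")]
    return [
        row["Orthogroup"]
        for row in reader
        if all(int(row[sp]) == 1 for sp in species)
    ]


# Hypothetical gene-count table for three species.
toy_table = (
    "Orthogroup\tspA\tspB\tspC\tTotal\n"
    "OG0000000\t1\t1\t1\t3\n"   # kept: single copy everywhere
    "OG0000001\t2\t1\t1\t4\n"   # dropped: duplicated in spA
    "OG0000002\t1\t0\t1\t2\n"   # dropped: missing in spB
    "OG0000003\t1\t1\t1\t3\n"   # kept
)
print(single_copy_orthogroups(toy_table))
```

Requiring exactly one copy per species is strict, which is one reason the number of usable orthologs shrinks quickly as more species are added.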
Results
Sequencing and genome assembly of the lesser capybara
The sequencing of the lesser capybara, Hydrochoerus isthmius, resulted in a total of 1.751
billion reads, each 150 bp long. From these reads it was inferred that the lesser capybara genome
has a size of 2.7 Gb and a heterozygosity of 0.24% (Table 1). The Supernova assembly (step 1 of the
assembly process) had a total size of 2.5 Gb, counting only scaffolds ≥ 10,000 bp, and a scaffold
N50 of 694 Kb (Table 4). QUAST assembly statistics and their improvement
throughout the successive steps of the assembly process (see Materials and Methods) are
reported in Table 4. The final lesser capybara genome draft contained 18,502 contigs with a contig
N50 of 232 Kb and 7,702 scaffolds with a scaffold N50 of 787 Kb, and had a GC content of 40.01%.
As a measure of genome completeness we used the fraction of genes recovered in our
assemblies out of a total of 3,023 BUSCO genes in the vertebrate data set. For the lesser capybara
we recovered 2,563 genes that were assembled completely (84.8%) and 227 fragmented genes (7.5%),
with 233 genes missing (7.7%), making the lesser capybara assembly comparable to other rodent
genome assemblies in Ensembl (Figure 2).
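The completeness percentages follow directly from the gene counts given above. The short sketch below recomputes them; the function name is hypothetical and the rounding to one decimal place is a choice made for this example.

```python
def busco_percentages(complete, fragmented, missing):
    """Return the total gene count and each category as a percentage of it."""
    total = complete + fragmented + missing
    if total == 0:
        raise ValueError("no BUSCO genes counted")
    counts = {"complete": complete, "fragmented": fragmented, "missing": missing}
    return total, {name: round(100 * n / total, 1) for name, n in counts.items()}


# Counts reported in the text for the lesser capybara assembly.
total, pct = busco_percentages(complete=2563, fragmented=227, missing=233)
print(total)   # should equal the size of the vertebrate BUSCO set
print(pct)
```

Checking that the three categories sum to the size of the BUSCO data set is a quick sanity test on any reported summary.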
Figure 2. Percentage of Vertebrate BUSCO genes recovered in published rodent genome assemblies (plus
rabbit), as a measure of assembly completeness. The genome assemblies of H. hydrochaeris (Herrera-Álvarez
et al., 2018) and H. isthmius (this study) are indicated in bold inside the rectangle.
Transcriptome sequencing, assembly, and functional annotation
A total of 882.24 million reads with a length of 150 bp were sequenced from the 11
RNA-seq libraries made from Hydrochoerus hydrochaeris. After the quality filtering, normalization, and
trimming steps (see Materials and Methods), a total of 140 million reads were kept and subsequently
used for transcriptome assembly. The resulting transcriptome had a total of 994,100 transcripts
belonging to 768,228 genes. Transcriptome GC content was estimated to be 46.58%, with an N50 of 704
bp and an average length of 574.17 bp, taking into account only the longest isoform per gene.
Among the eggNOG functional categories, the most common were translation, ribosomal structure and
biogenesis (14.9%), followed by amino acid transport and metabolism (9.1%) and energy production
and conversion (8%; Figure 3A). The Pfam protein families most represented were immunoglobulin
V-set domain, immunoglobulin domain and zinc finger C2H2 type, with 18.2%, 8.5% and 5.4%,
respectively (Figure 3B). The three most represented GO terms were cellular nitrogen compound
metabolic process (5.1%), DNA metabolic process (3.7%) and biosynthetic process (3.2%; Figure
3C). Among the 768,228 genes, we predicted 273,824 coding regions, and of these 5.3% (14,553 of
273,824) were predicted to have signal peptides.
Figure 3. Capybara (Hydrochoerus hydrochaeris) transcriptome functional annotation. (A) Functional
categories from Eggnog mapping. (B) Protein families from Pfam. (C) 25 Gene ontology categories most
represented.
Genomic repeat masking
Lesser capybara: The repeats identified by RepeatMasker occupied 27.72% of the total
assembly. These repeats belonged mostly to the LINE class (51.9%; long interspersed
elements), followed by LTRs (17.2%; long terminal repeats) and SINEs (15.4%; short interspersed
elements) (Figure 4A-B). Almost half of the repetitive elements (47.08%) were LINEs from the
subclass LINE-1, which is consistent with what is reported for humans (45.55%; Lander et al., 2001),
mice (48.71%; Waterston et al., 2002) and other rodents (Figure 4C; Smit, Hubley & Green, 2015).
Figure 4. Frequency of classes (A) and subclasses (B) of repetitive elements within the lesser capybara
(Hydrochoerus isthmius) genome assembly. (A) SINEs: short interspersed elements, LINEs: long interspersed
elements, LTR: long terminal repeats, others: satellites, simple repeats, small RNA, and low complexity repeats.
(B) ALU/B1, B2 - B4, and MIRs: subclasses of SINEs; LINE1, LINE2, and L3/CR1: subclasses of LINEs;
ERVL, ERVL-MaLRs, ERV class I, and ERV class II: subclasses of LTR elements; hAT-Charlie, and TCMar-
Tigger: subclasses of DNA elements. (C) Frequency of subclasses of repetitive elements on different species of
rodents. Data for all but the two capybara species are reported in Smit, Hubley & Green (2015).
Genomic annotation and gene content
We annotated a total of 26,080 genes, 82% of which had an annotation edit distance (AED) < 0.5, indicating high
quality of the annotations. The higher number of genes annotated in the lesser capybara compared to
the capybara may be explained by the less fragmented assembly of the latter (scaffold N50: 787 Kb
and 12.2 Mb, respectively). More than half of the annotations were involved in cellular processes
(31.2%) and metabolic processes (20.1%), followed by biological regulation (15.3%) and localization
(11.1%; Figure 5).
Figure 5. Genes predicted in the lesser capybara (Hydrochoerus isthmius) genome annotation that are involved
in: (A) Biological processes, (B) cellular components, (C) molecular functions, and (D) Protein classes from
Pfam.
Microsatellites - A total of 509,265 and 718,560 microsatellites were found in the capybara
and lesser capybara assemblies, respectively. In both assemblies, microsatellites became less common
as the repeat unit length increased, although for some odd unit lengths (n = 3, 7,
11) units of length n + 1 had a higher count (Table 5; Figure 6).
Figure 6. Counts of microsatellites found in each genome assembly. The y axis represents the repeat unit length
in base pairs, while the color and size of each circle represents the total count of microsatellites with that unit
size found in each of the genome assemblies.
Mitochondrial genome
Capybara - The mitogenome assembly of the capybara, H. hydrochaeris, consisted of a
scaffold with three gaps. It contains two ribosomal RNA genes (12S and 16S), 21 transfer RNA
genes, and 13 protein coding genes (CDS). The assembly seemed to suggest a duplication of tRNA-W
and the deletion of tRNA-R.
Lesser capybara - The mitogenome of H. isthmius assembled here had a length of 16,525
base pairs with a GC content of 39.37%, and contained two ribosomal RNA genes (12S and 16S), 22
transfer RNA genes, and 13 protein coding genes (CDS). No major rearrangements or gains/losses
were found relative to the mammalian mitogenomes previously reported.
See Table 6 for the size and position of genes within the mitochondrial genomes of each species and
Figure 7 for a visual comparison of both mitochondrial genomes with that of the guinea pig.
Figure 7. Mitochondrial genome of the (A) capybara (Hydrochoerus hydrochaeris), (B) lesser capybara (H.
isthmius) and (C) guinea pig (Cavia porcellus), annotated with the MITOS web server. The mitogenome sequences
were found in a single scaffold in each of the capybara assemblies with BLAST v2.9.0, based on similarity to the
guinea pig (Cavia porcellus) mitochondrial genome. Image made with the CGView Server.
Demography
We fit a pairwise sequentially Markovian coalescent (PSMC) model to the genome assembly
of each capybara species to evaluate possible changes in effective population size in the recent past
(Figure 8). The capybara’s PSMC model suggested a relatively steady population size, mildly
fluctuating from Ne = 10,000 to 25,000. For the lesser capybara, on the other hand, a sudden
population expansion started ~500,000 years ago, peaked around 200,000 years ago, and crashed back down to roughly
Ne = 20,000 some 100,000 years ago (Figure 8).
Figure 8. Pairwise sequentially Markovian coalescence analysis (PSMC) of the capybara and lesser capybara
genome assemblies (in blue and red, respectively). Time goes from the present on the left towards the past on
the right on the x-axis, and the y-axis represents effective population size (Ne) in units of 10⁴.
Synteny between the capybaras and the common guinea pig
We aligned the capybaras’ whole genome assemblies against each other and each of them
independently against the guinea pig genome assembly (accession number GCA_000151735.1) to search for
regions of large genomic change. Between the capybaras and the guinea pig we found a major
unmatched region, where the guinea pig may have gained or rearranged a region or the capybaras
may have lost or rearranged it (red arrows in Figure 9A-B). Between the two capybara assemblies there were
no major rearrangements, although the lower contiguity of the lesser capybara assembly is noticeable
(Figure 9C). Due to the low contiguity of the assemblies used for this analysis, it is not possible to
determine whether the mismatches detected are due to one or multiple rearrangements, or in which specific
parts of the capybara genomes they are located.
Figure 9. Pairwise circos plots showing synteny on 100 kb windows between the capybara, Hydrochoerus
hydrochaeris, the lesser capybara, H. isthmius, and the guinea pig. Each species is represented by a color
(Capybara: blue, lesser capybara: orange, and guinea pig: green). (A) Comparison between the guinea pig, left,
and lesser capybara, right. (B) Comparison between the capybara, left, and the lesser capybara, right. (C)
Comparison between the guinea pig, left, and the capybara, right.
Genomic diversity within Hydrochoerus
We extracted BUSCO genes from the transcriptome of the capybara from the eastern Llanos
of Colombia, from the transcripts predicted in the gene annotations of the capybara (Bolivia) and
lesser capybara (western Colombia) genome assemblies, and from the guinea pig, used as an outgroup.
From these, we reconstructed a phylogenomic tree based on 2,325 concatenated genes to test the
following hypothesis: if H. hydrochaeris and H. isthmius are distinct species, both capybara samples
should form a clade to the exclusion of the lesser capybara. Instead, we found that the lesser capybara
was nested inside the capybara clade, being genetically closer to the Bolivian sample than to the
geographically more proximal Colombian Llanos sample, indicating a complex phylogeographic
history (Figure 10).
Figure 10. Simple phylogeographic analysis of the capybara (n = 2 localities) and the lesser capybara. For this
analysis, transcript samples from a capybara from Colombia and genetic samples from a capybara from Bolivia
and a lesser capybara from Colombia were used. Each sample is connected by a yellow line from the
phylogenetic tree to the region where it came from. The lesser capybara’s geographic range is shown in
purple and the capybara’s geographic range in blue.
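The concatenation step described above can be illustrated with a minimal sketch. It assumes that each gene has already been aligned and is held in memory as a mapping from sample name to aligned sequence; the function name, the sample labels, and the toy sequences are hypothetical and are not taken from the pipeline used in this study.

```python
from typing import Dict, List


def concatenate_alignments(genes: List[Dict[str, str]], taxa: List[str]) -> Dict[str, str]:
    """Concatenate per-gene alignments into a single supermatrix.

    Each element of `genes` maps a taxon name to its aligned sequence.
    All sequences within one gene must have the same length. A taxon that
    is missing from a gene is padded with gap characters for that gene.
    """
    parts: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    for gene in genes:
        lengths = {len(seq) for seq in gene.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a gene must be aligned to equal length")
        gene_length = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(gene.get(taxon, "-" * gene_length))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy data (hypothetical): two short "genes" for the four samples.
taxa = ["capybara_Llanos", "capybara_Bolivia", "lesser_capybara", "guinea_pig"]
genes = [
    {"capybara_Llanos": "ATGC", "capybara_Bolivia": "ATGC",
     "lesser_capybara": "ATGA", "guinea_pig": "ATTA"},
    {"capybara_Llanos": "GG-", "capybara_Bolivia": "GGA", "guinea_pig": "GCA"},
]
matrix = concatenate_alignments(genes, taxa)
```

Padding absent genes with gaps keeps every row of the supermatrix the same length, which is what tree-building programs generally expect from a concatenated alignment.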
Phylogenomic analysis
We inferred phylogenetic relationships among the two capybara species and the available
hystricomorph species based on proteomes in Ensembl, using the mouse as an outgroup. We found a
total of 508 single-copy orthologs present in all nine species, which we subsequently used in the
phylogenomic analysis. Within the suborder Hystricomorpha we found three distinct clades: one
composed of the Damara mole rat and the naked mole rat (Family: Bathyergidae), which diverged from
the rest approximately 56 million years ago (mya); a second containing the chinchilla (Family:
Chinchillidae) and the degu (Family: Octodontidae); and a third containing the guinea pigs and the
capybara species (Family: Caviidae). The latter two clades separated from each other
approximately 30 mya (Figure 11).
Figure 11. Phylogenetic relationships among rodent species using single-copy orthologs found in all nine
species. These orthologs were aligned, filtered for quality, and concatenated with FASconCAT-G v1.04. Then a
maximum likelihood tree was constructed with RAxML and species divergence times were estimated with
MCMCTree. All Hystricomorpha proteomes available on Ensembl, and the mouse proteome, were used as inputs. Blue
lines represent 95% credibility intervals around divergence times. Numbers on the upper x-axis represent
millions of years ago, and letters on the lower x-axis represent geological epochs (La.: Late Cretaceous; Pa.:
Paleocene; Eo.: Eocene; Ol.: Oligocene; Mi.: Miocene).
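The ortholog selection described above (groups with exactly one gene in every species) can be sketched as follows. This is an illustration under stated assumptions, not the orthology pipeline used here: the orthogroups are assumed to be available as a mapping from group name to (species, gene) pairs, and all names in the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def single_copy_orthogroups(
    orthogroups: Dict[str, List[Tuple[str, str]]], species: List[str]
) -> List[str]:
    """Return names of orthogroups with exactly one gene in every species."""
    required = set(species)
    kept = []
    for name, members in orthogroups.items():
        # Count how many genes each species contributes to this group.
        counts = Counter(sp for sp, _gene in members)
        if set(counts) == required and all(n == 1 for n in counts.values()):
            kept.append(name)
    return sorted(kept)


# Toy data (hypothetical): three orthogroups across three species.
species = ["mouse", "guinea_pig", "capybara"]
orthogroups = {
    "OG1": [("mouse", "m1"), ("guinea_pig", "g1"), ("capybara", "c1")],
    "OG2": [("mouse", "m2"), ("mouse", "m3"), ("guinea_pig", "g2"), ("capybara", "c2")],
    "OG3": [("mouse", "m4"), ("guinea_pig", "g3")],
}
selected = single_copy_orthogroups(orthogroups, species)
```

In the toy data only OG1 passes: OG2 has a duplicated mouse gene and OG3 lacks a capybara gene.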
Discussion
Genome and transcriptome assemblies and annotation
Here we report the first genome assembly for the lesser capybara as well as the first transcriptome
assembly for the capybara. As has been demonstrated previously, low-cost genome assemblies like the
ones provided by 10x Genomics are a powerful tool for understanding a species from a genomic
perspective (e.g., Armstrong et al., 2019; Kocher et al., 2018; Hulse-Kemp et al., 2018). Even if the
resulting assembly is not highly contiguous, these kinds of technologies allow one to infer a range of
biological processes.
Rapid population changes in the lesser capybara
PSMC models use coalescent times across heterozygous sites on a single diploid genome to infer
effective population size changes over time (Li & Durbin, 2011). Nadachowska‐Brzyska et al. (2016)
suggested that PSMC models are reliable only on genome assemblies with a mean coverage over 18X
and no more than 25% of missing data, thresholds that our capybara and lesser capybara genome
assemblies meet. The PSMC models reported here show that the effective population sizes of both
species have been declining (Figure 8), a trend that is more drastic in the lesser capybara.
Similar trends are seen in other large mammal species, which tend to need larger patches of
uninterrupted habitat to persist (Berger, 2017). Currently, the capybara’s habitat is under various threats
and is shrinking due to extensive changes in land use (Göpel et al., 2019). Another possible reason for
this trend in capybaras is the shift by local people from hunting them to breeding them for food
(da Rosa et al., 2019). Active breeding reduces the effective population size, leading to
problems such as inbreeding depression and higher genetic load (Wang, Santiago, & Caballero, 2016;
Hedrick & Garcia-Dorado, 2016).
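The two reliability criteria cited above can be written as a simple check. This is a minimal sketch: the function name is hypothetical, treating a coverage of exactly 18X as passing is an assumption, and the example values are illustrative rather than measurements from this study.

```python
def meets_psmc_thresholds(
    mean_coverage: float,
    missing_fraction: float,
    min_coverage: float = 18.0,
    max_missing: float = 0.25,
) -> bool:
    """Check the coverage and missing-data criteria for a PSMC input.

    `mean_coverage` is the mean read depth (in X) and `missing_fraction`
    is the proportion of sites without a usable genotype (0 to 1).
    Boundary handling (>= and <=) is an assumption of this sketch.
    """
    return mean_coverage >= min_coverage and missing_fraction <= max_missing


# Illustrative values only (hypothetical, not from this study).
ok = meets_psmc_thresholds(mean_coverage=50.0, missing_fraction=0.10)
low_depth = meets_psmc_thresholds(mean_coverage=12.0, missing_fraction=0.10)
too_sparse = meets_psmc_thresholds(mean_coverage=30.0, missing_fraction=0.40)
```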
Phylogenomic analyses place capybaras with other caviids
Our phylogenomic analysis places capybaras as sister to guinea pigs (Family: Caviidae; genus Cavia)
and the caviids as sister to a group composed of chinchillas and degus (Families: Chinchillidae
and Octodontidae). Together these families make up the Caviomorpha, and this clade is sister to
Phiomorpha, represented in this analysis by the family Bathyergidae, the mole rats. This topology
supports those found previously (Upham & Patterson, 2015; Álvarez, Arévalo, & Verzi, 2017).
Divergence times for the different groups match those reported by Álvarez, Arévalo, & Verzi
(2017) but differ from those reported by Upham & Patterson (2015). However, Upham &
Patterson (2015) used sequences from two mitochondrial genes and three nuclear genes, while
Álvarez, Arévalo, & Verzi (2017) used mitochondrial genes in addition to five nuclear genes. We are
highly confident in the results reported here because we used whole-genome assemblies to find
single-copy orthologs, applied quality-filtering steps, and took two different approaches, a
concatenated maximum likelihood tree and a neighbour-joining approach that treats each gene as
independent, to account for sources of phylogenetic inference error such as incomplete lineage
sorting.
Lesser capybara is nested inside capybara
Phylogenomic analyses indicate that the lesser capybara is more closely related to the capybara
sample from Bolivia than it is to the capybara from the Llanos of Colombia. Even though this result
suggests that the capybara (H. hydrochaeris) is a paraphyletic species, there is morphological
evidence showing that the clades are divergent (Mones, 1991). Additionally, given the small sample
size (n = 3), further population genetic studies coupled with morphological analyses should be carried
out. If the pattern seen here is real, it becomes even more concerning that some populations
have been artificially inbred through human breeding, since this can disrupt the
evolutionary processes that are isolating capybaras from eastern and western Colombia, which are
separated by the Andes mountain range.
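The paraphyly argument can be made concrete with a small monophyly test on a rooted tree. The sketch below represents the tree as nested tuples and encodes only the topology described in the text (guinea pig as outgroup, lesser capybara sister to the Bolivian capybara); the sample labels and function names are hypothetical, and branch lengths and support values are ignored.

```python
from typing import FrozenSet, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]


def tips(tree: Tree) -> FrozenSet[str]:
    """Return the set of tip labels below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    result: FrozenSet[str] = frozenset()
    for child in tree:
        result |= tips(child)
    return result


def is_monophyletic(tree: Tree, group: Set[str]) -> bool:
    """True if some node in the rooted tree has exactly `group` as its tips."""
    if tips(tree) == frozenset(group):
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Topology described in the text, rooted on the guinea pig.
toy_tree = ("guinea_pig", ("capybara_Llanos", ("capybara_Bolivia", "lesser_capybara")))

capybara_clade = is_monophyletic(toy_tree, {"capybara_Llanos", "capybara_Bolivia"})
bolivia_plus_lesser = is_monophyletic(toy_tree, {"capybara_Bolivia", "lesser_capybara"})
genus_clade = is_monophyletic(
    toy_tree, {"capybara_Llanos", "capybara_Bolivia", "lesser_capybara"}
)
```

Under this topology the two H. hydrochaeris samples do not form a clade on their own, while the genus as a whole does, which is the pattern expected for a paraphyletic species.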
Acknowledgements
This work was supported by Colciencias grant 1204-659-44334 (to AJC). Special thanks to the DCB
and the Facultad de Ciencias of the Universidad de los Andes for giving us access to the Magnus
cluster, on which all the computational analyses were run. Thanks to the Vice-president’s office of the
Universidad de los Andes for help with collecting and mobilization permits. Thanks to the members of the
Biom|ics lab, whose comments and help were invaluable for this project, and to Juanita Herrera, María
José Páramo and Diego Perico for their help in the field collecting samples. Thanks to Catalina
Palacios, Phil Morin, and Alejandro Reyes for their analytical help and advice. Finally, thanks to
Rachel Voyt and Melissa Hernández for their insightful comments on this manuscript.
Literature cited
Aldana-Domínguez, J., Vieira-Muñoz, M. I., & Bejarano, P. (2013). Conservation and use of the
capybara and the lesser capybara in Colombia. In Capybara (pp. 321-332). Springer, New York, NY.
Álvarez, A., Arévalo, R. L. M., & Verzi, D. H. (2017). Diversification patterns and size evolution in
caviomorph rodents. Biological Journal of the Linnean Society, 121(4), 907-922.
Armenteros, J. J. A., Tsirigos, K. D., Sønderby, C. K., Petersen, T. N., Winther, O., Brunak, S., ... &
Nielsen, H. (2019). SignalP 5.0 improves signal peptide predictions using deep neural networks.
Nature biotechnology, 37(4), 420.
Armstrong, E. E., Taylor, R. W., Prost, S., Blinston, P., van der Meer, E., Madzikanda, H., ... & Petrov,
D. (2019). Cost-effective assembly of the African wild dog (Lycaon pictus) genome using linked reads.
GigaScience, 8(2), giy124.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... & Harris, M. A.
(2000). Gene ontology: tool for the unification of biology. Nature genetics, 25(1), 25.
Beier, S., Thiel, T., Münch, T., Scholz, U., & Mascher, M. (2017). MISA-web: a web server for
microsatellite prediction. Bioinformatics, 33(16), 2583-2585.
https://doi.org/10.1093/bioinformatics/btx198
Berger, J. (2017). The science and challenges of conserving large wild mammals in 21st-century
American protected areas. In Science, Conservation, and National Parks (pp. 189-211).
Bernt, M., Donath, A., Jühling, F., Externbrink, F., Florentz, C., Fritzsch, G., ... & Stadler, P. F. (2013).
MITOS: improved de novo metazoan mitochondrial genome annotation. Molecular Phylogenetics and
Evolution, 69(2), 313-319.
Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence
data. Bioinformatics, 30(15), 2114-2120.
Brown, C. T., Howe, A., Zhang, Q., Pyrkosz, A. B., & Brom, T. H. (2012). A reference-free algorithm
for computational normalization of shotgun sequencing data. arXiv preprint arXiv:1203.4802.
Bryant, D. M., Johnson, K., DiTommaso, T., Tickle, T., Couger, M. B., Payzin-Dogru, D., ... &
Bateman, J. (2017). A tissue-mapped axolotl de novo transcriptome enables identification of limb
regeneration factors. Cell reports, 18(3), 762-776.
Burgin, C. J., Colella, J. P., Kahn, P. L., & Upham, N. S. (2018). How many species of mammals are
there?. Journal of Mammalogy, 99(1), 1-14.
Bushmanova, E., Antipov, D., Lapidus, A., Suvorov, V., & Prjibelski, A. D. (2016). rnaQUAST: a
quality assessment tool for de novo transcriptome assemblies. Bioinformatics, 32(14), 2210-2212.
Cahill, J. A., Soares, A. E., Green, R. E., & Shapiro, B. (2016). Inferring species divergence times
using pairwise sequential Markovian coalescent modelling and low-coverage genomic data.
Philosophical Transactions of the Royal Society B: Biological Sciences, 371(1699), 20150138.
Camacho‐Sanchez, M., Burraco, P., Gomez‐Mestre, I., & Leonard, J. A. (2013). Preservation of RNA
and DNA from mammal samples under field conditions. Molecular Ecology Resources, 13(4), 663-
673.
Capella-Gutiérrez, S., Silla-Martínez, J. M., & Gabaldón, T. (2009). trimAl: a tool for automated
alignment trimming in large-scale phylogenetic analyses. Bioinformatics, 25(15), 1972-1973.
Carrascal, J., Linares, J., & Chacón, J. (2011). Behavior of the Hydrochoerus hydrochaeris isthmius in
a productive system, department of Córdoba, Colombia. Revista MVZ Córdoba, 16(3), 2754-2764.
Chernomor, O., von Haeseler, A., & Minh, B. Q. (2016). Terrace aware data structure for
phylogenomic inference from supermatrices. Systematic biology, 65(6), 997-1008.
Correa, J. B., & Jorgenson, J. P. (2009). Aspectos poblacionales del cacó (Hydrochoerus hydrochaeris
isthmius) y amenazas para su conservación en el Nor-Occidente de Colombia. Mastozoología
neotropical, 16(1), 27-38.
Corriale, M. J., & Herrera, E. A. (2014). Patterns of habitat use and selection by the capybara
(Hydrochoerus hydrochaeris): a landscape‐scale analysis. Ecological research, 29(2), 191-201.
Crusoe, M. R., Alameldin, H. F., Awad, S., Boucher, E., Caldwell, A., Cartwright, R., ... & Fenton, J.
(2015). The khmer software package: enabling efficient nucleotide sequence analysis. F1000Research,
4.
da Rosa, P. P., Ávila, B. P., Costa, P. T., Fluck, A. C., Scheibler, R. B., Ferreira, O. G. L., & Gularte,
M. A. (2019). Analysis of the perception and behavior of consumers regarding capybara meat by means
of exploratory methods. Meat science, 152, 81-87.
Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., ... & McVean, G.
(2011). The variant call format and VCFtools. Bioinformatics, 27(15), 2156-2158.
Delgado, C. & Emmons, L. (2016). Hydrochoerus isthmius. The IUCN Red List of Threatened Species
2016: e.T136277A22189896. https://dx.doi.org/10.2305/IUCN.UK.2016-2.RLTS.T136277A22189896.en.
El-Gebali, S., Mistry, J., Bateman, A., Eddy, S. R., Luciani, A., Potter, S. C., ... & Sonnhammer, E. L.
L. (2018). The Pfam protein families database in 2019. Nucleic acids research, 47(D1), D427-D432.
Emilio, A. H., & Macdonald, D. W. (1994). Social significance of scent marking in capybaras. Journal
of Mammalogy, 75(2), 410-415.
Emms, D. M., & Kelly, S. (2019). OrthoFinder: phylogenetic orthology inference for comparative
genomics. BioRxiv, 466201.
Fabre, P. H., Hautier, L., Dimitrov, D., & Douzery, E. J. (2012). A glimpse on the pattern of rodent
diversification: a phylogenetic approach. BMC evolutionary biology, 12(1), 88.
Göpel, J., Schüngel, J., Schaldach, R., Stuch, B., & Löbelt, N. (2019). Assessing the effects of
agricultural intensification on natural habitats and biodiversity in Southern Amazonia. bioRxiv, 846709.
Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A., Amit, I., ... & Chen, Z.
(2011). Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature
biotechnology, 29(7), 644.
Grant, J. R., & Stothard, P. (2008). The CGView Server: a comparative genomics tool for circular
genomes. Nucleic acids research, 36(suppl_2), W181-W184.
Gurevich, A., Saveliev, V., Vyahhi, N., & Tesler, G. (2013). QUAST: quality assessment tool for
genome assemblies. Bioinformatics, 29(8), 1072-1075.
Haas, B., & Papanicolaou, A. (2015). TransDecoder (find coding regions within transcripts). GitHub.
https://github.com/TransDecoder/TransDecoder
Hafner, J. C., & Hafner, M. S. (1988). Heterochrony in rodents. In Heterochrony in Evolution (pp. 217-
235). Springer, Boston, MA.
Hedges, S. B., Dudley, J., & Kumar, S. (2006). TimeTree: a public knowledge-base of divergence
times among organisms. Bioinformatics, 22(23), 2971-2972.
Hedrick, P. W., & Garcia-Dorado, A. (2016). Understanding inbreeding depression, purging, and
genetic rescue. Trends in Ecology & Evolution, 31(12), 940-952.
Herrera-Álvarez, S., Karlsson, E., Ryder, O. A., Lindblad-Toh, K., & Crawford, A. J. (2018). How to
make a rodent giant: Genomic basis and tradeoffs of gigantism in the capybara, the world’s largest
rodent. BioRxiv, 424606. https://doi.org/10.1101/424606
Holt, C., & Yandell, M. (2011). MAKER2: an annotation pipeline and genome-database management
tool for second-generation genome projects. BMC bioinformatics, 12(1), 491.
Hulse-Kemp, A. M., Maheshwari, S., Stoffel, K., Hill, T. A., Jaffe, D., Williams, S. R., ... & Schatz, M.
C. (2018). Reference quality assembly of the 3.5-Gb genome of Capsicum annuum from a single
linked-read library. Horticulture research, 5(1), 1-13.
Huerta-Cepas, J., Szklarczyk, D., Forslund, K., Cook, H., Heller, D., Walter, M. C., ... & Jensen, L. J.
(2015). eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for
eukaryotic, prokaryotic and viral sequences. Nucleic acids research, 44(D1), D286-D293.
Jackman, S. D., Coombe, L., Chu, J., Warren, R. L., Vandervalk, B. P., Yeo, S., ... & Birol, I. (2018).
Tigmint: correcting assembly errors using linked reads from large molecules. BMC bioinformatics,
19(1), 393.
Jarne, P., & Lagoda, P. J. (1996). Microsatellites, from molecules to populations and back. Trends in
ecology & evolution, 11(10), 424-429.
Jones, P., Binns, D., Chang, H. Y., Fraser, M., Li, W., McAnulla, C., ... & Pesseat, S. (2014).
InterProScan 5: genome-scale protein function classification. Bioinformatics, 30(9), 1236-1240.
Katoh, K., & Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7:
improvements in performance and usability. Molecular biology and evolution, 30(4), 772-780.
Kocher, S. D., Mallarino, R., Rubin, B. E., Douglas, W. Y., Hoekstra, H. E., & Pierce, N. E. (2018).
The genetic basis of a social polymorphism in halictid bees. Nature communications, 9(1), 1-7.
Krogh, A., Larsson, B., Von Heijne, G., & Sonnhammer, E. L. (2001). Predicting transmembrane
protein topology with a hidden Markov model: application to complete genomes. Journal of molecular
biology, 305(3), 567-580.
Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., ... & Marra, M. A.
(2009). Circos: an information aesthetic for comparative genomics. Genome research, 19(9), 1639-
1645.
Kück, P., & Meusemann, K. (2010). FASconCAT: convenient handling of data matrices. Molecular
Phylogenetics and Evolution, 56(3), 1115-1118.
Lander, E., Linton, L., Birren, B., et al. (2001). Initial sequencing and analysis of the human genome.
Nature, 409, 860–921. doi:10.1038/35057062.
Letunic, I., & Bork, P. (2019). Interactive Tree Of Life (iTOL) v4: recent updates and new
developments. Nucleic acids research.
Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and
population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993.
Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform.
Bioinformatics, 25(14), 1754-1760.
Li, H., & Durbin, R. (2011). Inference of human population history from individual whole-genome
sequences. Nature, 475(7357), 493.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., ... & Durbin, R. (2009). The
sequence alignment/map format and SAMtools. Bioinformatics, 25(16), 2078-2079.
Li, W., Jaroszewski, L., & Godzik, A. (2001). Clustering of highly homologous sequences to reduce
the size of large protein databases. Bioinformatics, 17(3), 282-283.
Liu, L., & Yu, L. (2011). Estimating species trees from unrooted gene trees. Systematic biology, 60(5),
661-667.
Lord, R. D. (1994). A descriptive account of capybara behaviour. Studies on neotropical fauna and
environment, 29(1), 11-22.
Macdonald, D. W. (1981). Dwindling resources and the social behaviour of capybaras, (Hydrochoerus
hydrochaeris) (Mammalia). Journal of Zoology, 194(3), 371-391.
Mistry, J., Finn, R. D., Eddy, S. R., Bateman, A., & Punta, M. (2013). Challenges in homology search:
HMMER3 and convergent evolution of coiled-coil regions. Nucleic acids research, 41(12), e121-e121.
Mones, A. (1991). Monografía de la familia Hydrochoeridae (Mammalia: Rodentia).
Moreira, J. R., Alvarez, M. R., Tarifa, T., Pacheco, V., Taber, A., Tirira, D. G., ... & Macdonald, D. W.
(2013). Taxonomy, natural history and distribution of the capybara. In Capybara (pp. 3-37). Springer,
New York, NY.
Nadachowska‐Brzyska, K., Burri, R., Smeds, L., & Ellegren, H. (2016). PSMC analysis of effective
population sizes in molecular ecology and its application to black‐and‐white Ficedula flycatchers.
Molecular ecology, 25(5), 1058-1072.
Nguyen, L. T., Schmidt, H. A., von Haeseler, A., & Minh, B. Q. (2014). IQ-TREE: a fast and effective
stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular biology and
evolution, 32(1), 268-274.
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and
bias-aware quantification of transcript expression. Nature methods, 14(4), 417.
Paulino, D., Warren, R. L., Vandervalk, B. P., Raymond, A., Jackman, S. D., & Birol, I. (2015). Sealer:
a scalable gap-closing application for finishing draft genomes. BMC bioinformatics, 16(1), 230.
Pinheiro, M. S., & Moreira, J. R. (2013). Products and uses of capybaras. In Capybara (pp. 211-227).
Springer, New York, NY.
Putnam, N. H., O'Connell, B. L., Stites, J. C., Rice, B. J., Blanchette, M., Calef, R., ... & Haussler, D.
(2016). Chromosome-scale shotgun assembly using an in vitro method for long-range linkage. Genome
research, 26(3), 342-350.
Puttick, M. N. (2019). MCMCtreeR: functions to prepare MCMCtree analyses and visualize posterior
ages on trees. Bioinformatics, 35(24), 5321-5322.
Reid, F. (2016). Hydrochoerus hydrochaeris. The IUCN Red List of Threatened Species 2016:
e.T10300A22190005. https://dx.doi.org/10.2305/IUCN.UK.2016-2.RLTS.T10300A22190005.en.
Rosenfield, D. A., Nichi, M., Losano, J. D., Kawai, G., Leite, R. F., Acosta, A. J., ... & Pizzutto, C. S.
(2019). Field-testing a single-dose immunocontraceptive in free-ranging male capybara (Hydrochoerus
hydrochaeris): Evaluation of effects on reproductive physiology, secondary sexual characteristics, and
agonistic behavior. Animal reproduction science, 209, 106148.
Rowe, D. L., & Honeycutt, R. L. (2002). Phylogenetic relationships, ecological correlates, and
molecular evolution within the Cavioidea (Mammalia, Rodentia). Molecular Biology and Evolution,
19(3), 263-277.
Samuels, J. X. (2009). Cranial morphology and dietary habits of rodents. Zoological Journal of the
Linnean Society, 156(4), 864-888.
Smit, A. F. A., Hubley, R., & Green, P. (2015). RepeatMasker Open-4.0. 2013–2015.
Stamatakis, A. (2014). RAxML version 8: a tool for phylogenetic analysis and post-analysis of large
phylogenies. Bioinformatics, 30(9), 1312-1313.
Trillmich, F., Kraus, C., Künkele, J., Asher, M., Clara, M., Dekomien, G., ... & Sachser, N. (2004).
Species-level differentiation of two cryptic species pairs of wild cavies, genera Cavia and Galea, with a
discussion of the relationship between social systems and phylogeny in the Caviinae. Canadian
Journal of Zoology, 82(3), 516-524.
UniProt Consortium. (2018). UniProt: a worldwide hub of protein knowledge. Nucleic acids research,
47(D1), D506-D515.
Upham, N. S., & Patterson, B. D. (2015). Evolution of caviomorph rodents: a complete phylogeny and
timetree for living genera. Biology of caviomorph rodents: diversity and evolution. Buenos Aires:
SAREM Series A, 1, 63-120.
Waterhouse, R. M., Seppey, M., Simão, F. A., Manni, M., Ioannidis, P., Klioutchnikov, G., ... &
Zdobnov, E. M. (2017). BUSCO applications from quality assessments to gene prediction and
phylogenomics. Molecular biology and evolution, 35(3), 543-548.
Waterston, R. H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J. F., Agarwal, P., ... & Antonarakis,
S. E. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915),
520-562.
Wang, J., Santiago, E., & Caballero, A. (2016). Prediction and estimation of effective population size.
Heredity, 117(4), 193-206.
Weisenfeld, N. I., Kumar, V., Shah, P., Church, D. M., & Jaffe, D. B. (2017). Direct determination of
diploid genome sequences. Genome research, 27(5), 757-767. doi: 10.1101/gr.214874.116.
Weisenfeld, N. I., Yin, S., Sharpe, T., Lau, B., Hegarty, R., Holmes, L., ... & Nusbaum, C. (2014).
Comprehensive variation discovery in single human genomes. Nature genetics, 46(12), 1350.
Whelan, S., Irisarri, I., & Burki, F. (2018). PREQUAL: detecting non-homologous characters in sets of
unaligned homologous sequences. Bioinformatics, 34(22), 3929-3930.
Wilson, D. E., & Reeder, D. M. (Eds.). (2005). Mammal species of the world: a taxonomic and
geographic reference (Vol. 1). JHU Press.
Yang, Z. (2007). PAML 4: phylogenetic analysis by maximum likelihood. Molecular biology and
evolution, 24(8), 1586-1591.
Yeo, S., Coombe, L., Warren, R. L., Chu, J., & Birol, I. (2017). ARCS: scaffolding genome drafts with
linked reads. Bioinformatics, 34(5), 725-731.
Young, M. D., Wakefield, M. J., Smyth, G. K., & Oshlack, A. (2010). Gene ontology analysis for
RNA-seq: accounting for selection bias. Genome biology, 11(2), R14.
Tables
Table 1. Quality metrics reported by the software Supernova v2.0.1 before and after assembling the
lesser capybara genome (Weisenfeld et al., 2017).
Input statistics
Number of reads 1751.10 M
Mean read length after trimming 139.50 b
Raw coverage 84.29X
Effective read coverage 50.16X
Fraction of Q30 bases in read 2 75.26%
Median insert size 345.00 b
Fraction of proper read pairs 89.81%
Fraction of barcodes used 1
Estimated genome size 3.14 Gb
Genome repetitivity index 9.95%
High AT index 0.06%
GC content of assembly 40.04%
Dinucleotide content 1.23%
Weighted mean molecule size 36.91 Kb
Molecule count extending 10 kb on both sides 67.21
Mean distance between heterozygous SNPs 2.28 Kb
Fraction of reads that are not barcoded 6.97%
N50 reads per barcode 1.36 K
Fraction of reads that are duplicates 30.66%
Nonduplicate and phased reads 38.94%
Table 2. Rodent proteomes used for comparative analyses.
Common name Species Genome version Accession number
Algerian mouse Mus spretus SPRET_EiJ_v1 GCA_001624865.1
Alpine marmot Marmota marmota marmota marMar2.1 GCA_001458135.1
American beaver Castor canadensis C.can_genome_v1.0 GCA_001984765.1
Arctic ground squirrel Urocitellus parryii ASM342692v1 GCA_003426925.1
Brazilian guinea pig Cavia aperea CavAp1.0 GCA_000688575.1
Chinese hamster CriGri Cricetulus griseus CriGri_1.0 GCA_000223135.1
Daurian ground squirrel Spermophilus dauricus ASM240643v1 GCA_002406435.1
Degu Octodon degus OctDeg1.0 GCA_000260255.1
Golden Hamster Mesocricetus auratus MesAur1.0 GCA_000349665.1
Guinea Pig Cavia porcellus Cavpor3.0 GCA_000151735.1
Kangaroo rat Dipodomys ordii Dord_2.0 GCA_000151885.2
Lesser Egyptian jerboa Jaculus jaculus JacJac1.0 GCA_000280705.1
Long-tailed chinchilla Chinchilla lanigera ChiLan1.0 GCA_000276665.1
Mongolian gerbil Meriones unguiculatus MunDraft-v1.0 GCA_002204375.1
Damara mole rat Fukomys damarensis DMR_v1.0 GCA_000743615.1
Naked mole-rat Heterocephalus glaber HetGla_female_1.0 GCA_000247695.1
Squirrel Ictidomys tridecemlineatus SpeTri2.0 GCA_000236235.1
Prairie vole Microtus ochrogaster MicOch1.0 GCA_000317375.1
Ryukyu mouse Mus caroli CAROLI_EIJ_v1.1 GCA_900094665.2
Mouse Mus musculus GRCm38.p6 GCA_000001635.8
Shrew mouse Mus pahari PAHARI_EIJ_v1.1 GCA_900095145.2
Steppe mouse Mus spicilegus MUSP714 GCA_003336285.1
Upper Galilee mountains blind mole rat Nannospalax galili S.galili_v1.0 GCA_000622305.1
Rabbit Oryctolagus cuniculus OryCun2.0 GCA_000003625.1
Northern American deer mouse Peromyscus maniculatus bairdii HU_Pman_2.1 GCA_003704035.1
Rat Rattus norvegicus Rnor_6.0 GCA_000001895.4
Table 3. Tissues sequenced for the transcriptomic analysis.
Individuals sampled
Species Coll. Number Location (Lat, Lon) Sex
Capybara (Hydrochoerus hydrochaeris) AJC 05614 05.8106°, -70.9718° Male
Capybara (Hydrochoerus hydrochaeris) AJC 05615 05.8106°, -70.9718° Female (gravid)
Tissues sampled
1. Heart
2. Brain
3. Kidney
4. Testes
5. Ovaries
6. Morillo
7. Anal gland
8. Fetal tissue
9. Bone marrow
10. Thyroid gland
11. Pancreas
Table 4. QUAST assembly statistics for the different steps taken during the assembly.
Assembly step Supernova Tigmint ARCS + LINKS Sealer (Final version)
QUAST analysis Scaffolds Contigs Scaffolds Contigs Scaffolds Contigs Scaffolds Contigs
# contigs (>= 0 bp) 29608 - 29762 - 28300 - 28300 -
# contigs (>= 1000 bp) 29608 - 29679 - 28217 - 28217 -
# contigs (>= 5000 bp) 16923 - 16982 - 15548 - 15551 -
# contigs (>= 10000 bp) 13322 50406 13367 50406 11991 50406 11994 29315
# contigs (>= 25000 bp) 10095 37807 10140 37807 9046 37807 9043 23474
# contigs (>= 50000 bp) 8503 25211 8543 25211 7702 25211 7702 18502
Largest contig 20810282 1295922 14115789 1295922 14115789 1295922 14116249 2052147
GC (%) 40.01 40.01 40.01 40.01 40.01 40.01 40.02 40.01
Reference GC (%) 39.95 39.95 39.95 39.95 39.95 39.95 39.95 39.95
N50 694764 116657 692348 116667 787285 116613 787090 232449
NG50 993541 156066 988664 156066 1101859 156066 1101324 308863
N75 344971 63185 344508 63189 389945 63157 390021 125489
NG75 649192 107821 645404 107821 726481 107821 725900 216001
L50 1465 9763 1483 9762 1328 9768 1328 4962
LG50 788 5738 803 5738 732 5738 732 2927
L75 3421 20802 3445 20801 3048 20815 3048 10522
LG75 1650 10992 1670 10992 1493 10992 1494 5567
# misassemblies 0 0 0 0 0 0 0 0
# unaligned mis. contigs 3846 6666 3865 6666 3685 6666 3687 5814
# unaligned contigs 7699 + 5623 part; 44170 + 13918 part; 7720 + 5647 part; 44146 + 13918 part; 6719 + 5272 part; 44374 + 13922 part; 6717 + 5277 part; 21537 + 10144 part
# N's per 100 kbp 523.76 0 492.54 0 496.33 0 474.79 0.11
# indels per 100 kbp 402.9 403.6 402.9 403.6 402.79 403.56 403.03 403.38
Complete BUSCO (%) 84.16 81.19 84.16 81.19 84.16 81.19 84.16 82.84
Partial BUSCO (%) 2.97 4.29 2.97 4.29 3.3 4.62 3.3 4.29
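For readers unfamiliar with the contiguity statistics in Table 4, the sketch below shows how Nx and Lx values (for example N50 and L50) and their genome-size-based counterparts (NG50, LG50) are conventionally defined. It is an illustration of the definitions, not QUAST's implementation; the function name and the toy sequence lengths are hypothetical.

```python
from typing import List, Optional, Tuple


def nx_lx(
    lengths: List[int], fraction: float = 0.5, genome_size: Optional[int] = None
) -> Tuple[int, int]:
    """Return (Nx, Lx) for a list of contig or scaffold lengths.

    Nx is the length of the shortest sequence in the smallest set of
    longest sequences that together cover `fraction` of the total
    assembly length; Lx is the number of sequences in that set.
    If `genome_size` is given, the target is a fraction of that size
    instead of the assembly length, which yields NGx and LGx.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    target = fraction * total
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("assembly is shorter than the requested fraction of the genome size")


# Toy assembly (hypothetical lengths, total 300).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = nx_lx(toy_lengths)                       # half of the assembly length
n75, l75 = nx_lx(toy_lengths, fraction=0.75)        # three quarters of the assembly
ng50, lg50 = nx_lx(toy_lengths, genome_size=400)    # half of an assumed genome size
```

Because NG50 is computed against the estimated genome size rather than the assembly length, it differs from N50 whenever the assembly is shorter or longer than the genome size estimate.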
Table 5. Microsatellites found in the capybara and lesser capybara genome assemblies.
Capybara - Hydrochoerus hydrochaeris
Unit size Number of SSRs
2 318056
3 60663
4 97980
5 25910
6 5210
7 319
8 721
9 187
10 126
11 8
12 85
Lesser capybara - Hydrochoerus isthmius
Unit size Number of SSRs
2 445243
3 87656
4 138686
5 38745
6 6562
7 493
8 727
9 163
10 151
11 17
12 117
Table 6. Capybara and lesser capybara mitogenome annotations.
Capybara - Hydrochoerus hydrochaeris
Name Feature Start Stop Strand
trnF tRNA 332 398 -
trnP tRNA 1756 1825 +
trnT tRNA 1832 1898 -
cob CDS 1908 3041 -
trnE tRNA 3049 3117 +
nad6 CDS 3130 3642 +
nad5 CDS 3657 5456 -
trnL1 tRNA 5457 5526 -
trnS1 tRNA 5526 5584 -
trnH tRNA 5588 5656 -
nad4 CDS 5667 7034 -
nad4l CDS 7031 7324 -
trnW tRNA 7326 7394 -
nad3 CDS 7397 7735 -
trnG tRNA 7742 7810 -
cox3_b CDS 7812 8033 -
cox3_a CDS 8039 8590 -
atp6 CDS 8596 9270 -
atp8 CDS 9246 9431 -
trnK tRNA 9433 9499 -
cox2-0 CDS 9506 10063 -
cox2-1 CDS 10062 10181 -
trnD tRNA 10183 10251 -
trnS2 tRNA 10259 10327 +
cox1_b CDS 10337 11872 -
trnY tRNA 11879 11947 +
trnC tRNA 11950 12016 +
trnN tRNA 12055 12127 +
trnA tRNA 12129 12197 +
trnW tRNA 12200 12269 -
nad2 CDS 12350 13297 -
trnM tRNA 13313 13381 -
trnQ tRNA 13384 13454 +
trnI tRNA 13452 13520 -
nad1 CDS 13528 14478 -
trnL2 tRNA 14482 14556 -
rrnL rRNA 14558 16125 -
trnV tRNA 16124 16192 -
rrnS rRNA 16190 16947 -
Lesser capybara - Hydrochoerus isthmius
Name Feature Start Stop Strand
trnP tRNA 1038 1107 +
trnT tRNA 1114 1181 -
cob CDS 1191 2324 -
trnE tRNA 2332 2400 +
nad6 CDS 2413 2925 +
nad5 CDS 2940 4739 -
trnL1 tRNA 4740 4809 -
trnS1 tRNA 4809 4867 -
trnH tRNA 4871 4939 -
nad4 CDS 4950 6317 -
nad4l CDS 6314 6607 -
trnR tRNA 6609 6677 -
nad3 CDS 6680 7018 -
trnG tRNA 7025 7093 -
cox3 CDS 7095 7877 -
atp6 CDS 7883 8557 -
atp8 CDS 8530 8718 -
trnK tRNA 8720 8786 -
cox2 CDS 8793 9470 -
trnD tRNA 9472 9540 -
trnS2 tRNA 9547 9615 +
cox1_b CDS 9625 10368 -
cox1_a CDS 10365 11159 -
trnY tRNA 11166 11234 +
trnC tRNA 11237 11303 +
trnN tRNA 11343 11415 +
trnA tRNA 11417 11485 +
trnW tRNA 11488 11557 -
nad2 CDS 11565 12584 -
trnM tRNA 12600 12668 -
trnQ tRNA 12671 12741 +
trnI tRNA 12739 12807 -
nad1 CDS 12815 13765 -
trnL2 tRNA 13769 13843 -
rrnL rRNA 13845 15413 -
trnV tRNA 15412 15480 -
rrnS rRNA 15478 16431 -
trnF tRNA 16431 16497 -
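Several adjacent features in Table 6 share coordinates, for example atp8 with atp6 and nad4 with nad4l. A minimal sketch for computing such overlaps from the Start and Stop columns follows; it assumes the coordinates are 1-based and inclusive, which is not stated in the table, and the function name is hypothetical.

```python
def overlap_bp(a_start: int, a_stop: int, b_start: int, b_stop: int) -> int:
    """Number of positions shared by two features.

    Coordinates are assumed to be 1-based and inclusive (an assumption
    of this sketch). Returns 0 when the features do not overlap.
    """
    return max(0, min(a_stop, b_stop) - max(a_start, b_start) + 1)


# Coordinates copied from Table 6.
capybara_atp = overlap_bp(8596, 9270, 9246, 9431)    # atp6 vs atp8, capybara
lesser_atp = overlap_bp(7883, 8557, 8530, 8718)      # atp6 vs atp8, lesser capybara
capybara_nad4 = overlap_bp(5667, 7034, 7031, 7324)   # nad4 vs nad4l, capybara
lesser_nad4 = overlap_bp(4950, 6317, 6314, 6607)     # nad4 vs nad4l, lesser capybara
```

Under the stated coordinate assumption, the atp6/atp8 overlap is 25 bp in the capybara and 28 bp in the lesser capybara, and the nad4/nad4l overlap is 4 bp in both annotations.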