Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Advance Publication by J-STAGE
Genes & Genetic Systems
Received for publication: December 15, 2016
Accepted for publication: January 29, 2017
Published online: March 24, 2017
1
An artificial intelligence approach fit for tRNA gene studies in the era of big sequence data
Yuki Iwasaki1,†, Takashi Abe2,*, Kennosuke Wada1, Yoshiko Wada1, and Toshimichi Ikemura1,*
1 Department of Bioscience, Nagahama Institute of Bio-Science and Technology, 1266 Tamura-Cho, Nagahama, Shiga 526-0829, Japan 2 Department of Information Engineering, Faculty of Engineering, Niigata University, 2-8050, Ikarashi, Nishi-ku, Niigata 950-2181, Japan
* Corresponding authors
Takashi Abe
Department of Information Engineering, Faculty of Engineering, Niigata University, 2-8050, Ikarashi, Nishi-ku, Niigata-ken 950-2181, Japan
Tel: +81-25-262-6745
Fax: +81-25-262-6745
Email: [email protected]
Toshimichi Ikemura
Department of Bioscience, Nagahama Institute of Bio-Science and Technology, 1266 Tamura-Cho, Nagahama, Shiga-ken 526-0829, Japan
Tel: +81-749-64-8100
Fax: +81-749-64-8140
E-mail: [email protected]
†Present Address: National Institute of Genetics, 1111 Yata, Mishima, Shizuoka 411-8540, Japan
Running head: AI approach for tDNA studies
Key words: tRNA gene, metagenome, oligonucleotide composition, big data, phylogenetic marker
2
ABSTRACT
Unsupervised data mining capable of extracting a wide range of knowledge from big data
without prior knowledge or particular models is a timely application in the era of big
sequence data accumulation in genome research. By handling oligonucleotide compositions
as high-dimensional data, we have previously modified the conventional self-organizing map
(SOM) for genome informatics and established BLSOM, which can analyze more than ten
million sequences simultaneously. Here, we develop BLSOM specialized for tRNA genes
(tDNAs) that can cluster (self-organize) more than one million microbial tDNAs according to
their cognate amino acid solely depending on tetra- and pentanucleotide compositions. This
unsupervised clustering can reveal combinatorial oligonucleotide motifs that are responsible
for the amino acid-dependent clustering, as well as other functionally and structurally
important consensus motifs, which have been evolutionarily conserved. BLSOM is also
useful for identifying tDNAs as phylogenetic markers for special phylotypes. When we
constructed BLSOM with ‘species-unknown’ tDNAs from metagenomic sequences plus
‘species-known’ microbial tDNAs, a large portion of metagenomic tDNAs self-organized
with species-known tDNAs, yielding information on microbial communities in environmental
samples. BLSOM can also enhance accuracy in the tDNA database obtained from big
sequence data. This unsupervised data mining should become important for studying
numerous functionally unclear RNAs obtained from a wide range of organisms.
INTRODUCTION
Compilation of tRNA sequences and genes was originally established by Sprinzl and
coworkers (Sprinzl et al., 1978; Sprinzl and Vassolenko, 2005) and updated (tRNAdb;
http://trnadb.bioinf.uni-leipzig.de/) (Jühling et al., 2009). Using tRNAscan-SE, Genomic
tRNA Database (GtRNAdb; http://lowelab.ucsc.edu/GtRNAdb/) has been constructed for
complete and near complete genomes (Chan and Lowe, 2009). In addition, genomic
organization of eukaryotic tRNAs has been extensively studied, showing complex lineage-
specific variability (Bermudez-Santana et al., 2010). For both completely and partially
sequenced genomes, as well as vast numbers of metagenomic sequences from wide varieties
of environmental and clinical samples, we have constructed and updated a large-scale
database of tRNA genes (tDNAs) “tRNADB-CE” (http://trna.ie.niigata-u.ac.jp) (Abe et al.,
2011). Metagenomic sequences have attracted broad scientific and industrial interest, and
3
even short sequences obtained with new-generation sequencers (e.g., Sequence Read Achieve
in NCBI, http://www.ncbi.nlm.nih.gov/Traces/sra/) contain numerous full-length tDNAs
because of their short length. The number of tDNAs compiled in tRNADB-CE is already
large (1.7 million genes) and will undoubtedly increase more rapidly in the future. For
efficient knowledge discovery from such big data, new tools are important for promoting
promising research on genes for tRNAs and other RNAs including wide varieties of function-
unclear RNAs.
In the era of big sequence data obtained from high-throughput DNA sequencers, it
becomes important to establish an unsupervised data mining capable of extracting a wide
range of knowledge without prior knowledge, hypotheses, or particular models from
numerous genomic sequences (e.g., tDNA sequences) covering a wide range of species, for
which experimental studies other than DNA sequencing have been poorly done. Various
unsupervised data mining methods, such as K-means clustering and Fuzzy Art (Forgy, 1965;
Carpenter et al., 1991; Hastie et al., 2009), have been developed; and we have previously
developed an unsupervised clustering method “BLSOM (Batch-Learning Self-Organizing-
Map)” (Kanaya et al., 2001; Abe et al., 2003, 2005; Kikuchi et al., 2015), which can analyze
more than ten million genomic sequences simultaneously and allows acquisition of a wide
range of knowledge from big sequence data. For example, BLSOM with oligonucleotide (e.g.,
tetranucleotide) composition could cluster genomic sequence fragments (e.g., 1-kb sequences)
according to phylotype, and thus succeeded in phylogenetical classification of a large number
of metagenomic sequences (Uchiyama et al., 2005; Uehara et al., 2011; Nakao et al., 2013).
Oligonucleotides, such as penta- and hexanucleotides, often represent motif sequences
responsible for sequence-specific protein binding such as transcription-factor binding, and
their occurrences should differ from occurrences expected from mononucleotide composition
in each genome and among genomic portions within one genome. An analysis of the human
genome with pentanucleotide BLSOM has unexpectedly found evident enrichment of many
kinds of transcription-factor-binding motifs in pericentric heterochromatin regions (Iwasaki
et al., 2013), showing that BLSOM effectively detects characteristic, combinatorial
occurrences of functional motif oligonucleotides in genomic sequences with no prior
knowledge.
BLSOM is suitable for actualizing high-performance parallel-computing, and thus for the
analysis of a high dimensional big data. Here, we have tested its usefulness for data mining
4
from a large amount of tDNA sequences. Actually, BLSOM for tDNAs can reveal
combinatorial oligonucleotide motifs responsible for their amino acid-dependent clustering,
and BLSOM for species-unknown tDNAs obtained from metagenomic sequences plus
species-known microbial tDNAs can give information on microbial communities in
environmental samples.
MATERIALS AND METHODS
tDNA sequences To enhance the completeness and accuracy of tDNAs compiled in tRNADB-CE, three
computer programs, tRNAscan-SE (Lowe and Eddy, 1997), ARAGORN (Laslett and
Canback, 2004), and tRNAfinder (Kinouchi and Kurokawa, 2006) were used in combination,
since their algorithms partially differed and rendered somewhat different results. tDNAs
found concordantly by three programs were stored in tRNADB-CE and discordant cases
among programs were manually checked by experts in tRNA experimental fields (Abe et al.,
2011). The present study has constructed BLSOMs for tri-, tetra-, and pentanucleotide
compositions in tDNAs in tRNADB-CE (Version 7.0: Last update, 2014/01/25). Since a
portion of tDNAs lack the terminal CCA sequence, and one purpose of this study is to
establish a strategy for using tDNAs as phylogenetic markers, the CCA terminus sequence
has been excluded from BLSOM analyses in the present study. However, the inclusion of
CCA sequence for BLSOM analyses of tDNAs may enhance the amino acid-dependent
clustering for tDNAs containing the CCA sequence; the high level of CCA terminus has been
found on RNA-seq data (Findeiss et al., 2011) and importance of CCA terminus in the
genomic tag has been proposed (Weiner and Maizels, 1987). Therefore, the combinatorial use
of CCA-plus and -minus analyses may provide new additional information.
BLSOM
SOM (Self-Organizing Map) is an unsupervised clustering algorithm that nonlinearly maps
high-dimensional vectorial data onto a two-dimensional array of lattice points; i.e., a flexible
net that is spread into the multi-dimensional "data cloud" (Kohonen, 1982; Kohonen et al.,
1996). We previously modified the conventional SOM for genome informatics to make the
5
learning process and resulting map independent of the order of data input on the basis of
batch-learning SOM; BLSOM (Kanaya et al., 2001). The initial vectors were defined by
principal component analysis (PCA) instead of random values.
The frequency of each pentanucleotide obtained from vectorial data representing each
lattice point on Penta in Fig. 1A is calculated and normalized with the level expected from
the mononucleotide composition calculated from vectorial data representing the lattice point.
The observed/expected ratio is illustrated in red (overrepresented), blue (underrepresented),
and white (moderately represented) according to Abe et al. (2003) (Fig. 1D). Since there are
1,024 (= 45) different types of pentanucleotides, the occurrence of one type of
pentanucleotides in one tDNA is primarily 0 or 1. Thus, red and blue primarily show the
presence and absence of each pentanucleotide.
Parts-BLSOM
Computer programs used to find tDNAs can divide each tDNA sequence into the following
structural parts: 5’ side of acceptor-stem, D-arm, anticodon-arm, variable-arm (V-arm), T-
arm, 3’ side of acceptor-stem, and CCA-terminus. BLSOM, in which oligonucleotides found
in different structural parts are differentially counted, is designated Parts-BLSOM (for the
correspondence with the secondary cloverleaf structure, see Supplementary Fig. S1). Since
the CCA terminus sequence has been excluded from the present BLSOM analysis, PartsTri
and -Tetra treat 384 (=64 x 6) and 1536 (=256 x 6) variables, respectively. PartsPenta was not
included because of the V-arm’s short length for several amino acids; e.g., shorter than 5-nt
for Glu, Cys, and Gly. BLSOM programs can be obtained from our web site
(http://bioinfo.ie.niigata-u.ac.jp/?BLSOM). Distances of weight vectors between
neighbouring lattice points on BLSOM can be visualized as black levels with a U-matrix
method (Ultsch, 1993), and this provides information about similarity of oligonucleotide
composition in local areas on BLSOM (Iwasaki et al., 2013).
RESULTS
Oligonucleotide BLSOMs for bacterial tDNAs
6
Each tRNA has characteristic, combinatorial occurrences of various motif oligonucleotides
required to fulfil its function (e.g., binding to proper enzymes and rRNAs) and structural
formation (L-shaped form). To examine the usefulness of BLSOM for efficient knowledge
discovery from massive numbers of tDNAs, we have conducted BLSOMs for tri-, tetra-, and
pentanucleotide compositions in approximately 0.4 million tDNAs from more than 7000
bacterial genomes that are categorized as “Reliable tRNAs” in tRNADB-CE (Tri, Tetra, and
Penta in Fig. 1A, respectively); archaeal and fungal tDNAs will be analyzed later. Lattice
points containing tDNAs belonging to one amino acid are indicated in the color representing
the amino acid and those belonging to multiple amino acids are indicated in black. Most
lattice points, especially on Tetra and Penta, are colored, showing tDNAs to be separated
(self-organized) primarily by amino acid. Table 1 presents percentages of tDNAs located at
colored pure lattice points (i.e. lattice points containing tDNAs of one amino acid), showing
the amino acid-dependent clustering to be higher on Tetra and Penta than on Tri. Importantly,
the high level of amino acid-dependent clustering was obtained with no information other
than oligonucleotide composition. Thus, BLSOM should pick out characteristic combinations
of motif sequences required for proper recognition by various enzymes, such as aminoacyl-
tRNA synthetase (aaRS), and of sequences supporting proper L-shaped structures.
Fig. 1B marks lattice points containing tDNAs of individual amino acids separately on
Penta (for other amino acids, including selenocysteine, see Supplementary Fig. S2). tDNAs
for one amino acid form one or a few major territories, as well as many tiny satellite-type
spots. It should be noted that the major territory locating at the bottom of the Met panel is
composed solely of initiator Met tDNAs, but the upper major territory is composed of both
elongator Met tDNAs and Ile tDNAs containing anticodon CAT, which is enzymatically
converted to read ATA codon. When considering the biological significance of minor
territories and tiny satellites, the number of tDNAs in each lattice point is important. Hence
the vertical bar in Fig. 1C presents the number of Gly tDNAs (for other amino acids, see
Supplementary Fig. S2). Lattice points in two major Gly territories apparently contain many
tDNAs, and one tiny satellite located away from major territories also has multiple tDNAs
(arrowed in Fig. 1C), which represent 34 Gly-GCC tDNAs with two base differences and
derived from 6 Chlamydiae species listed in the Figure Legend. Similar satellite-type peaks
are observed for other amino acids (Supplementary Fig. S2) and represent isoacceptors of
various species primarily belonging to one phylogenetic family, which often differ in
7
sequence for few bases; examples of multiple alignment of their sequences are presented
along with their phylotypes in Supplementary Fig. S3; Arg-CGC for four Borrelia species,
Asn-GTT for six Thermotogae species. These types of noncanonical tDNAs should be
candidates for molecular phylogenetic markers representing a specific phylotype.
BLSOM clusters tDNAs according to amino acid, solely depending on oligonucleotide
composition, and visualizes major, minor, and noncanonical rare tDNAs. BLSOM is an
unsupervised clustering algorithm and allows us to explore causative factors responsible for
the amino acid-dependent clustering (self-organization) and to compare causative factors
pointed out by BLSOM with known molecular mechanisms experimentally proven for a
limited number of model organisms, such as the mechanisms reviewed by Marck and
Grosjean (2002). Importantly, we can address the following well-timed questions in the era of
big sequence data accumulation: to what range of phylotypes, can a certain known molecular
mechanism (e.g., sequence motifs recognized by an aaRS) be applied, and what types of
alternative mechanisms can be expected for other phylotypes? Since experimental studies
have been poorly conducted for most genomes recently sequenced, this in silico
characterization has become increasingly important, and BLSOM has powerful visualization
functions useful for addressing these questions.
Visualization of combinatorial occurrences of functionally important oligonucleotides
Functionally and structurally important sequences in tRNAs have been stably maintained
throughout evolution. Therefore, a wide range of species has the very closely related motifs,
while sequences outside the motifs have differentiated significantly. For example, to identify
cognate isoacceptors from a pool of tRNAs sharing a similar L-shaped structure, aaRS
recognizes a relatively small number of nucleotides as RNA code (Schimmel et al., 1993) and
identity element (Normanly and Abelson, 1989; Ibba and Söll, 2000; Ardell, 2010).
Oligonucleotide sequences maintained stably in a large number of isoacceptors from a wide
range of bacteria should be responsible and diagnostic for their amino acid-dependent
clustering (self-organization) on BLSOM. This prediction has been proven by using the
BLSOM capability to visualize diagnostic oligonucleotides contributing to self-organization,
as explained in MATERIAL AND METHODS. Red and blue in Fig. 1D and E show the
presence and absence of the respective pentanucleotide on Penta in Fig. 1A. Transitions
8
between red and blue for various pentanucleotides often coincide with borders between
territories of different amino acids, and Fig. 1D shows three examples of pentanucleotides
observed mainly in major territories of one amino acid and most likely related to identity
elements. The major red zone of ATAGA corresponds to the major territory of Arg in Fig. 1B;
this pentanucleotide exists in D-arm in a major portion of bacterial Arg tDNAs. The major
red zones of GATAA and CATAA correspond to major territories of Ile and Met tDNAs
listed in Fig. 1B; these pentanucleotides exist in anticodon-arm in these isoacceptors. In fact,
anticodon- and D(dihydroU)-arms have been reported to contain highly significant identity
elements; e.g., Ardell (2010) has systematically compiled identity determinants in
Proteobacteria. While a major type of identity element for one amino acid has been well
conserved in sequence throughout evolution, mechanisms for aaRS to recognize cognate
tRNAs seem to have diverged at a certain level, even among bacterial species (Marck and
Grosjean, 2002; Ardell, 2010). tDNAs with minor, noncanonical identity elements can be
detected by identifying tDNAs located outside red zones of the pentanucleotide representing
a canonical identity element. Such tDNAs should become phylogenetic markers for a
restricted phylogenetic lineage, and real examples will be mentioned later.
We next examine canonical-type oligonucleotides present in a large portion of bacterial
tDNAs. For example, the functionally important and well-conserved heptanucleotide
GGTTCGA in the Tѱ(pseudoU)C-arm (abbreviated as T-arm) is observed for a large
majority of bacterial tDNAs (Marck and Grosjean, 2002; Ardell, 2010) and therefore, three
constituent pentanucleotides of this heptanucleotide are observed in a major portion of lattice
points on Penta, as colored in red in Fig. 1E (TTCGA): two other pentanucleotides give
similar (but not identical) results. Small blue areas contain tDNAs that lack the consensus
sequence. Actually, the aforementioned Chlamydophila Gly-GCC tDNAs (arrowed in Fig.
1C) differ from the canonical heptanucleotide in T-arm at two bases and thus are located in a
blue zone (arrowed in Fig. 1E) for all three constituent pentanucleotides. This is one reason
that Chlamydophila Gly-GCC tDNAs form a satellite, odd peak apart from the Gly major
territory in Fig. 1C and shows that tDNAs with noncanonical sequences in T-arm will
become one type of candidates for phylogenetic markers. The occurrence levels of the three
pentanucleotides for each amino acid are presented in Supplementary Fig. S4. These
pentanucleotides are observed in more than 70% of bacterial tDNAs of Ala, Arg, Asn, Ile,
Lys, Met, Phe, Thr, and Val, but almost no tDNAs of Cys, Gln, Glu, Leu, Ser, and Try. For
9
Asp, Gly, His, Pro, and Trp, a portion of tDNAs have these pentanucleotides. Some tDNAs
belonging to the last five amino acids may become candidates for phylogenetic markers after
clarifying their existence/non-existence according to the phylogenetic group.
BLSOM specialized for tDNA research: Parts-BLSOM
Tri, Tetra, and Penta in Fig. 1 can cluster amino acid-specific tDNAs with no information
other than oligonucleotide composition and predict functionally and/or structurally important
motifs, as exemplified in Fig. 1D and E. This is a favorable feature of unsupervised data
mining. These BLSOMs, however, do not take into consideration the fundamental
characteristic of tRNAs that functionally and structurally important motifs exist in a specific
part of the molecule. Thus, an oligonucleotide that happens to have the same sequence to a
functional motif (e.g., identity element) but is located outside the functional site, cannot be
discriminated from the real functional element. Actually, even outside the characteristic
territories for one amino acid, there are small red zones for the pentanucleotide related to the
identity element of the amino acid (Fig. 1D), and these pentanucleotides have often been
found outside the identity element in noncognate tRNAs. We next add the following
information about tRNA molecules other than oligonucleotide composition and examine the
usefulness of this change.
Here, we construct a new BLSOM, in which oligonucleotides found in six different
structural parts are differentially counted. Tri- and tetranucleotide BLSOMs with 384 (= 64 x
6) and 1536 (= 256 x 6) variables have been constructed for bacterial tDNAs, as described in
Material and Methods; PartsTri and PartsTetra, respectively. Table 1 shows that the level of
amino acid-dependent clustering on PartsTetra (Fig. 2A) and PartsTri (data not shown) is
slightly higher than on the previous Penta for most amino acids. Fig. 2B shows that amino
acid-dependent clustering is simpler and minor territories and satellite spots are less evident
than in Fig. 1B and Supplementary S2B, supporting the view that the level of amino acid-
dependent self-organization has increased in Parts-BLSOMs (Table 1). Similar results have
been obtained for PartsTri (Supplementary Fig. S5). Fig. 2C presents the 3D view of the
number of Arg or Met tDNAs in each lattice point on PartsTri. The merit of Parts-BLSOM is
not only the slight increase of amino-acid dependent clustering, but this can provide the
10
following strategies for instructive knowledge discovery. The result of U-matrix listed in Fig.
2D will be explained later.
Strategies for revealing functionally and/or structurally important domains
Functionally important sequences, such as identity elements, have been experimentally
proven to differ often in sequence location among tRNAs belonging one isoacceptor group
and additionally among phylogenetic groups (Hou and Schimmel, 1988; McClain and Foss,
1988; Marck and Grosjean, 2002; Ardell, 2010). For an in silico prediction of their locations,
we have constructed PartsTri and PartsTetra, in which one of six functional parts is omitted.
The percentages of pure lattice points (i.e., lattice points containing tDNAs only of one amino
acid) on PartsTetra are listed (Fig. 3A). When omitting the anticodon-arm, the separation
level reduces significantly (< 75%) for His, Lys, and Trp, and less significantly for Arg, Asn,
Gln, Ile, Met, Phe, Pro, Thr, and Val. This reduction should reflect the contribution level of
each part for amino acid-dependent clustering and is consistent with locations of identity
elements experimentally proven for various model organisms (Normanly and Abelson, 1989;
Ibba and Söll, 2000; Marck and Grosjean, 2002; Ardell, 2010).
An alternative analysis is tri- and tetranucleotide BLSOM (Tri and Tetra) constructed
separately for each part. Fig. 3B lists the percentages of pure lattice points for each amino
acid on Tetra for each part. The anticodon-arm gives a good separation for all amino acids (>
90%), but D-arm gives a good separation (> 90%) only for Leu, Ser, and Tyr, and a
significant level of separation (> 40%) for Arg, Asp, Cys, Glu, and Pro; T-arm gives a
significant level (> 40%) for Asn, His, and Pro. These contribution levels are again consistent
with the results for identity elements experimentally proven for model bacteria. V-arm gives
a good separation (> 80%) only for Leu, Ser, and Tyr, showing that not only the sequence,
but also the size of each part should significantly affect their amino acid-dependent clustering.
The reason why both D- and V-arms give the high contribution to this separation for the class
II bacterial tRNAs (Leu, Ser, and Tyr) should relate to the finding that the D-loop plays a key
role in recognizing cognate tRNAs among the class II tRNAs (Asahara et al., 1993, 1998).
BLSOM for species-known plus species-unknown tDNAs
11
A big source for finding tDNAs is the massive number of metagenomic sequences derived
from a wide range of environmental and clinical samples, which have been compiled in
International DNA Data Banks (INSDC: DDBJ/ENA/NCBI). Since metagenomic sequences
have attracted broad scientific, industrial, and medical interests, tRNADB-CE has included
tDNAs obtained from metagenomic sequences (abbreviated as metagenomic tDNAs); and
these metagenomic tDNAs are analyzed here with BLSOM. Since metagenomic sequences
should be derived not only from bacteria, but also archaea and fungi, we have constructed
PartsTetra with species-unknown metagenomic tDNAs plus species-known bacterial,
archaeal, and fungal tDNAs; 0.6 million tDNAs in total (Both in Fig. 4A).
Species-unknown metagenomic and species-known microbial tDNAs are visualized
separately in Metagenome and Known in Fig. 4A. Amino acid-dependent clustering is
apparent, but separation patterns are more complex than those for species-known bacterial
tDNAs listed in Fig. 1A, and more black lattice points appear in Fig. 4A than 1A. A major
portion of black lattice points are observed for metagenomic tDNAs (Metagenome in Fig. 4A)
while a minor portion is observed also for species-known tDNAs (Known in Fig. 4A). Fig.
4B separately marks lattice points containing metagenomic and species-known tDNAs of Ala
and Asn (for other amino acids, see Supplementary Fig. S5). Detailed inspection of the
species-known tDNAs belonging to black lattice points in Fig. 4A has revealed these to be
primarily archaeal and fungal tDNAs, showing that BLSOM has separated archaeal and
fungal tDNAs from bacterial tDNAs and a significant level of metagenomic tDNAs should be
derived from archaea and fungi. The observation that a large portion of archaeal and fungal
tDNAs are located in black lattice points indicates that their self-organization depends largely
on their sequence characteristics distinct from bacterial tDNAs, rather than distinctions
between amino acids. To study amino acid-dependent clustering of archaeal and fungal
tDNAs, BLSOMs have to be constructed only for archaeal and fungal tDNAs.
The vertical bar in Fig. 4C presents the number of metagenomic and species-known
tDNAs of Ala and Asn. Fig. 4C shows, more clearly than Fig. 4B, that the metagenomic
tDNAs located apart from the major territories and primarily representing archaeal and fungal
tDNAs are more abundant than species-known tDNAs. Furthermore, locations of very high
peaks in the major territories differ between metagenomic and species-known tDNAs. This
can be more clearly shown by the following analysis on tDNAs of each amino acid.
12
BLSOM for each amino acid
Dick et al. (2009) has successfully applied the U-matrix method (Ultsch, 1993) of an
oligonucleotide-SOM to the phylogenetic clustering of environmental metagenomic
sequences. The U-matrix listed in Fig. 2D visualizes the dissimilarity level of oligonucleotide
composition between neighbouring lattice points as a blackness level; dark black lines turn
out to correspond primarily to borders between different amino-acid territories, showing clear
dissimilarity of oligonucleotide compositions in tDNAs responding to different amino acids.
In addition, even within a major territory of one amino acid, many partitions surrounded by
pale black lines are observed and appear to be primarily attributable to phylogenetic
differences. To investigate the phylotype-dependent separation in more detail, we have
constructed a BLSOM for each amino acid for species-known plus species-unknown tDNAs
(Fig. 5). On the All panel in Fig. 5A, lattice points containing Leu tDNAs from only bacterial,
archaeal, fungal, or metagenomic sequences are colored in blue, red, green, or gray,
respectively; those containing tDNAs from more than one category are marked in black. On
the Bacteria, Archaea, and Fungi panel, lattice points containing metagenomic tDNAs (gray)
plus bacterial, archaeal, or fungal tDNAs are separately colored, as described for the All
panel. A large portion of lattice points on the Bacteria panel are marked in black, showing
that many metagenomic tDNAs are clustered (self-organized) with known bacterial tDNAs,
predicting phylogenetic attribution of metagenomic tDNAs. On the Bacteria panel, some
clear gray contiguous areas contain metagenomic (but not bacterial) tDNAs, and some gray
areas contain archaeal and fungal tDNAs (black on the Archaea or Fungi panel), providing
phylogenetic attribution of these metagenomic tDNAs. On the U-matrix panel in Fig. 5A,
many white or pale black areas are surrounded by dark black circles. White and pale black on
the U-matrix shows similar oligonucleotide compositions between neighbouring lattice points,
i.e., between tDNAs located in neighbouring lattice points. Therefore, metagenomic tDNAs
within a white and pale black zone surrounded by a dark black circle can be phylogenetically
assigned by referring to species-known tDNAs colocalizing in this zone, as described by Dick
et al. (2009). A significant portion of metagenomic tDNAs is also located apart from species-
known tDNAs (gray contiguous areas in the All panel), showing these tDNAs to be derived
mainly from poorly studied genomes and thus pointing out the environmental sources
harbouring novel genomes.
13
The vertical bar in Fig. 5B presents the number of species-known and metagenomic
tDNAs of three amino acids. Locations of even very high peaks differ between species-
known and metagenomic tDNAs. Then, we focus on very high peaks observed for
metagenomic tDNAs. In the case of Ala, the highest peak (marked as No. 1) contains 1205
metagenomic tDNAs primarily obtained from marine samples plus two tDNAs of Candidatus
Pelagibacter (an oceanic carbon recycling bacteria); the peak No. 2 contains 171 tDNAs
primarily obtained from hot spring samples plus 2 tDNAs of Synechococcus sp. JA-3-3Ab
(Cyanobacteria bacterium Yellowstone A-Prime); and the peak No. 3 contains 159 marine
metagenomic tDNAs but no species-known tDNAs. In the case of Pro, No. 1 contains 244
marine metagenomic tDNAs plus 1 Candidatus tDNA; No. 2 contains 152 marine
metagenomic tDNAs plus 9 Chlorobi tDNAs; and No. 3 contains 287 marine tDNAs but no
species-known tDNAs. In the case of Leu, No. 1 contains 196 marine metagenomic tDNAs
plus 4 Chlorobi tDNAs; No. 2 and 3 contains 316 and 309 marine metagenomic tDNAs but
no species-known tDNAs. Results for each amino acid can provide phylogenetic information
of metagenomic sequences and assign novel tDNAs with new types of sequence
characteristics. Importantly, tDNAs that form peaks composed of multiple tDNAs should be
reliable tDNAs even though these have noncanonical characteristics. This can specify a large
number of new types of tDNAs, which have been poorly characterized. The analysis of big
sequence data, such as those obtained from metagenomic samples, can provide this type of
novel information.
DISCUSSION
Characteristics of unsupervised data mining
The present in silico findings obtained from more than 7000 bacterial genomes, for most of
which molecular studies other than DNA sequencing have been poorly done, can be
connected with molecular mechanisms that have been experimentally proven for a limited
number of model organisms; e.g., those reviewed by Marck and Grosjean (2002). In their
review, 50 genomes were selected to avoid an overrepresentation of phylogenetically too
closely related organisms and span the widest range of living areas; over 4000 tDNAs were
extracted, analyzed, and compared. In our study, as is typical for big data analyze s, we have
used all available bacterial data without particular filtration processes and thus included
14
genomes of closely related bacteria, including those of different strains of one species; in
total, 0.4 million tDNAs from more than 7000 bacterial genomes. These two distinct analyses
are complementary to each other, and BLSOM can clarify what range of species the
experimentally proven mechanisms are applicable to and can point out inapplicable
phylotypes.
Phylogenetic markers useful for short metagenomic sequences When searching for a certain genome of particular interest (e.g., for industrial usability) by
surveying a massive number of short metagenomic sequences, phylogenetic marker tDNAs
should become very useful because of their short length. Our group has started to search for
tDNAs that are useful as phylogenetic markers, especially for rare genomes, and has planned
to publish such markers in tRNADB-CE. The present study shows that the BLSOM with
species-known tDNAs plus species-unknown metagenomic tDNAs (Fig. 4) can provide a tool
for studying a microbial community in an ecosystem. When analysing a dataset composed
mainly of sequences shorter than 100 bp, this strategy is useful since conventional
phylogenetic tree methods cannot be properly applied to most short sequences; i.e. it is
impossible to construct reliable phylogenetic trees for most of these short sequences. If the
dataset is composed mainly of sequences longer than 500 bp, BLSOMs with tri- and
tetranucleotide compositions in all genomic fragments should be more suitable than tDNA-
BLSOM, because all genomic sequences are informative (Abe et al., 2005;Nakao et al., 2013).
It should also be mentioned that horizontal gene transfers between different species are a
general characteristic of microbial genomes. Therefore, we may not find phylogenetic
markers with 100% accuracy, because informatics methods, including sequence homology
searches, most likely assign the horizontally transferred genes to the donor, but not the
recipient genome. When noncanonical tDNAs are found concurrently in restricted members
of phylogenetically distant groups (e.g., different classes and families), the genes may
represent horizontally transferred genes or products of convergent evolution. Phylogenetic
marker tDNAs must be used, taking these into consideration.
Strategies for enhancing the accuracy of a large-scale tRNA database
15
As described in Materials and Methods, to enhance accuracy in compiling tDNAs in
tRNADB-CE, three computer programs have been used in combination, since their
algorithms partially differ and render somewhat different results (Abe et al., 2011). For tDNA
candidates predicted by only one or two computer programs, experts in tRNA experimental
researches have manually checked. Searching for the minimum anticodon set for a
completely sequenced genome (Osawa, 1995; Marck and Grosjean, 2002) is an important
check process (Abe et al., 2011). Another manual check applicable even to partially
sequenced genomes is to examine whether the candidates have been found iteratively in
closely related species; this process has become increasingly useful because the genomes of
many closely related species and even of different strains belonging to one species have been
sequenced. When the same or almost the same noncanonical sequences have been found
repeatedly, the tDNAs are included in the Reliable tRNA category (Abe et al., 2011), basing
on the knowledge that functionally important genes have been stably maintained throughout
evolution. Furthermore, accumulation of a large number of metagenomic sequences has
progressively increased the reliability of this verification strategy, by pointing to promising
characteristics of big data.
For creating a large-scale and high-quality database, it becomes important to find errors
that have slipped into the database, including those caused by DNA sequencing errors. As
mentioned above, tDNAs found concordantly by all three programs have been stored in
tRNADB-CE after brief anticodon checking. While this automatic compilation is
indispensable for surveying a huge number of genomic sequences accumulated in INSDC, a
new strategy for identifying erroneous cases is required for quality enhancement. Orphan
tDNAs located apart from correspondent amino acid territories on BLSOM may be
candidates for erroneous cases, because the present data mining method has pointed out their
sequence irregularity. Such cases should be manually checked by experts, even though three
computer programs have concordantly assigned them. In contrast, if tiny spots apart from
their correspondent major territories harbour multiple tDNAs (e.g., peaks in Fig. 1C, 2C, 4C,
and 5B), especially derived from phylogenetically related species, they should represent real
tDNAs, even though they have noncanonical sequence characteristics. These tDNAs will
become phylogenetic markers with high specificity for the respective phylotypes. In
tRNADB-CE, such tDNAs will be noted in a column that has been provided for comments on
each tDNA (Abe et al., 2011).
16
When constructing tRNADB-CE, we have encounted a significant number of cases where
three programs predicted different functional segmentation for one tDNA candidate although
the programs concordantly assign it as tDNA. This may indicate a demerit of Parts-BLSOM
because it requires the information concerning the functional segmentation. When analysing
metagenomic sequences, which should include various novel genomes, the ordinary BLSOM
may be useful because this does not require this prior information. Undoubtedly, their
combinatorial use should be a better choice, and this will predict the proper functional
segmentation of the tDNA.
CONCLUSION
Unsupervised data mining, which can extract a wide range of knowledge from big data
without prior knowledge or particular models, is well-timed in the era of big sequence data
accumulation in genome research. Importantly, the unsupervised data mining such as
BLSOM can provide the least expected knowledge. In addition, to gain a wide range of
knowledge efficiently from big data, it is important to view all data simultaneously on one
map and to focus on a specific data category by using strong visualization powers.
Oligonucleotide BLSOM, which can analyze more than ten million sequences at once, is
suitable for unveiling novel knowledge hidden within big sequence data, providing a timely
tool for a wide range of genome researches, which have been prompted by the remarkable
progress of high-throughput sequencing technology.
CONFLICT OF INTERESTS
The authors declare that there is no conflict of interests regarding the publication of this paper.
ACKNOWLEDGEMENT
We thank Drs. Hachiro Inokuchi and Yuko Yamada in Nagahama Institute of Bio-Science
and Technology and Dr. Akira Muto in Hirosaki University for valuable discussions. We also
thank Mr. Yudai Miyake in Nagahama Institute of Bio-Science and Technology for
preliminary BLSOM analyses of tDNAs. This work was supported by a Grant-in-Aid for
17
Scientific Research (C: no. 23500371 and 26330334) and a Grant-in-Aid for Publication of
Scientific Research Results (no. 15HP7006), from the Ministry of Education, Culture, Sports,
Science and Technology, Japan, and by a NIG Collaborative Research Program (A).
REFERENCES
Abe, T., Ikemura, T., Sugahara, J., Kanai, A., Ohara, Y., Uehara, H., Kinouchi, M., Kanaya, S., Yamada, Y., Muto, A., et al. (2011) tRNADB-CE 2011: tRNA gene database curated manually by experts. Nucleic Acids Res. 39, D210-D213.
Abe, T., Kanaya, S., Kinouchi, M., Ichiba, Y., Kozuki, T., and Ikemura, T. (2003) Informatics for unveiling hidden genome signatures. Genome Res. 13, 693-702.
Abe, T., Sugawara, H., Kinouchi, M., Kanaya, S., and Ikemura, T. (2005) Novel phylogenetic studies of genomic sequence fragments derived from uncultured microbe mixtures in environmental and clinical samples. DNA Res. 12, 281-290.
Ardell, D.H. (2010) Computational analysis of tRNA identity. FEBS Lett. 584, 325-333.
Asahara, H., Himeno, H., Tamura, K., Hasegawa T, Watanabe, K., and Shimizu, M. (1993) Recognition nucleotides of Escherichia coli tRNALeu and its elements facilitating discrimination from tRNASer and tRNATyr. J.Mol. Biol. 231, 219-229.
Asahara, H., Nameki, N., and Hasegawa, T. (1998) In vitro selection of RNAs aminoacylated by Escherichia coli leucyl-tRNA synthetase. J.Mol. Biol. 283, 605-618.
Bermudez-Santana, C., Attolini, C.S., Kirsten, T., Engelhardt, J., Prohaska, S.J., Steigele, S., and Stadler, P.F. (2010) Genomic organization of eukaryotic tRNAs. BMC Genomics 11, 270.
Carpenter, G.A., Grossberg, S., and Rosen, D.B. (1991) Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Netw. 4, 759-771.
Chan, P.P., and Lowe, T.M. (2009) GtRNAdb: a database of transfer RNA genes detected in genomic sequence. Nucleic Acids Res. 37, D93-D97.
Dick, G.J., Andersson, A.F., Baker, B.J., Simmons, S.L., Thomas, B.C., Yelton, A.P., and Banfield, J.F. (2009) Community-wide analysis of microbial genome sequence signatures. Genome Biol. 10, R85.
Findeiss, S., Langenberger, D., Stadler, P.F., and Hoffmann, S. (2011) Traces of post-transcriptional RNA modifications in deep sequencing data. Biol. Chem. 392, 305-313.
Forgy, E.W. (1965) Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21, 768-769.
18
Hastie, T., Tibshirani, R., and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second Edition). Springer-Verlag, New York.
Hou, Y.M., and Schimmel, P. (1988) A simple structural feature is a major determinant of the identity of a transfer RNA. Nature 12, 140-145.
Ibba, M., and Söll, D. (2000) Aminoacyl-tRNA synthesis. Annu. Rev.Biochem., 69, 617-650.
Iwasaki, Y., Wada, K., Wada, Y., Abe, T., and Ikemura, T. (2013) Notable clustering of transcription-factor-binding motifs in human pericentric regions and its biological significance. Chromosome Res. 21, 461-474.
Jühling, F., Mörl, M., Hartmann, R.K., Sprinzl, M., Stadler, P.F., and Pütz, J. (2009) tRNAdb 2009: compilation of tRNA sequences and tRNA genes. Nucleic Acids Res. 37, D159-D162.
Kanaya, S., Kinouchi, M., Abe, T., Kudo, Y., Yamada, Y., Nishi, T., Mori, H., and Ikemura, T. (2001) Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM) - characterization of horizontally transferred genes with emphasis on the E. coli O157 genome. Gene 276, 89-99.
Kikuchi, A., Ikemura, T., and Abe, T. (2015) Development of self-compressing BLSOM for comprehensive analysis of big sequence data. Biomed Res. Int. 2015, 506052.
Kinouchi, M., and Kurokawa, K. (2006) tRNAfinder: A software system to find all tRNA genes in the DNA sequence based on the cloverleaf secondary structure. J. Comput. Aided Chem. 7, 116-126.
Kohonen, T. (1982) Self-organized formation of topologically correct feature maps. Biol. Cybern. 43, 59-69.
Kohonen, T., Oja, E., Simula, O., Visa, A., and Kangas, J. (1996) Engineering applications of the self-organizing map. Proc. IEEE 84, 1358-1384.
Laslett, D., and Canback, B. (2004) ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 32, 11-16.
Lowe, T.M., and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955-964.
Marck, C., and Grosjean, H. (2002) tRNomics: Analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and Bacteria reveals anticodon-sparing strategies and domain-specific features. RNA 8, 1189-1232.
McClain, W.H., and Foss, F. (1988) Changing the identity of a tRNA by introducing a G-U wobble pair near the 3' acceptor end. Science 240, 793-796.
19
Nakao, R., Abe, T., Nijhof, A.M., Yamamoto, S., Jongejan, F., Ikemura, T., and Sugimoto, C. (2013) A novel approach, based on BLSOMs (Batch Learning Self-Organizing Maps), to the microbiome analysis of ticks. ISME J. 7, 1003-1015.
Normanly, J., and Abelson, J. (1989) tRNA Identity. Annu. Rev.Biochem. 58, 1029-1049.
Osawa, S. (1995) Evolution of the Genetic Code. Oxford University Press, New York.
Schimmel, P., Giegé, R., Moras, D., and Yokoyama, S. (1993) An operational RNA code for amino acids and possible relationship to genetic code. Proc. Natl. Acad. Sci. USA 90, 8763-8768.
Sprinzl, M., Grüter, F., and Gauss, D.H. (1978) Collection of published tRNA sequences. Nucleic. Acids Res. 5, r15-r27.
Sprinzl, M., and Vassilenko, K.S. (2005) Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res. 33, D139-D140.
Uchiyama, T., Abe, T., Ikemura, T., and Watanabe, K. (2005) Substrate-induced gene-expression screening of environmental metagenome libraries for isolation of catabolic genes. Nat. Biotechnol. 23, 88-93.
Uehara, H., Iwasaki, Y., Wada, C., Ikemura, T., and Abe, T. (2011) A novel bioinformatics strategy for searching industrially useful genome resources from metagenomic sequence libraries. Genes Genet. Syst. 86, 53-66.
Ultsch, A. (1993) Self organized feature maps for monitoring and knowledge acquisition of a chemical process. In Proc. ICANN’93 Int. Conf. on Artificial Neural Networks (eds.: S. Gielen and B. Kappen) pp. 864-–867. Springer, London
Weiner, A.M., and Maizels, N. (1987) tRNA-like structures tag the 3' ends of genomic RNA molecules for replication: implications for the origin of protein synthesis. Proc. Natl. Acad. Sci. USA 84, 7383-7387.
20
Table 1. The percentages of tDNAs located at colored pure lattice points on BLSOMs
Tri Tetra Penta PartsTri PartsTetraAla 0.900 0.965 0.958 0.991 0.984 Arg 0.769 0.930 0.956 0.969 0.971 Asn 0.831 0.938 0.965 0.980 0.994 Asp 0.943 0.985 0.988 0.986 0.990 Cys 0.813 0.945 0.951 0.967 0.982 Gln 0.863 0.946 0.945 0.912 0.967 Glu 0.945 0.979 0.975 0.983 0.983 Gly 0.892 0.967 0.981 0.983 0.988 His 0.788 0.939 0.910 0.955 0.957 Ile 0.888 0.975 0.938 0.951 0.975
Leu 0.884 0.973 0.984 0.987 0.983 Lys 0.773 0.929 0.959 0.958 0.971 Met 0.864 0.958 0.956 0.966 0.981 Phe 0.833 0.983 0.983 0.993 0.989 Pro 0.818 0.935 0.958 0.979 0.989 Ser 0.925 0.952 0.963 0.983 0.987 Thr 0.782 0.918 0.901 0.969 0.976 Trp 0.827 0.896 0.917 0.908 0.919 Tyr 0.907 0.979 0.988 0.988 0.988 Val 0.864 0.949 0.941 0.974 0.972
21
LEGENDS TO FIGURES
Fig. 1. Oligonucleotide-BLSOM for bacterial tDNAs. (A) BLSOM for tri-, tetra-, and
pentanucleotide compositions (Tri-, Tetra-, and Penta). Lattice points containing tDNAs of
multiple amino acids are indicated in black, and those containing tDNAs of a single amino
acid are colored as follows: Ala (■), Arg (■), Asn (■), Arp (■), Cys (■), Gln (■), Glu (■), Gly
(■), His (■), Ile (■), Leu (■), Lys (■), Met (■), Phe (■), Pro (■), Ser (■), Thr (■), Trp (■),
Tyr (■), and Val (■). (B) Lattice points containing tDNAs of individual amino acids on Penta
in Fig. 1A are visualized separately with the color used there. (C) Number of Gly tDNAs in
each lattice point on Penta is presented by the height of the vertical bar. Lattice points
containing multiple tDNAs, but not one or a few tDNAs, turn out to be detectable. A satellite
single bar (arrowed) located between two major Gly territories is composed of 34 Gly-GCC
tDNAs belonging to 6 Chlamydophila species: C. caviae, C. felis, C. muridarum, C.
pneumonia, and C. trachomatis. (D) Examples of diagnostic pentanucleotides responsible for
amino-acid dependent clustering on Penta in Fig. 1A. The occurrence of each pentanucleotide
for each lattice point was calculated and normalized with occurrence expected from the
mononucleotide composition for the respective lattice point (Abe et al., 2005). This
observed/expected ratio is indicated in color; red (overrepresented), blue (underrepresented),
and blank (intermediate). (E) An example of pentanucleotides, TTCGA, observed for most
bacterial tDNAs. Lattice points are marked as described in B.
Fig. 2. Parts-BLSOM for bacterial tDNAs. (A) PartsTetra. Lattice points are marked as
described in Fig. 1A. (B) Lattice points containing tDNAs of individual amino acids on
PartsTetra are marked as described in Fig. 1B. On the Met panel, initiator Met tDNAs are
mostly located at the lower left part; elongator Met tDNAs and Ile tDNAs containing
anticodon CAT in DNA sequence are not separated from each other. (C) 3D-view. Number
of Arg and Met tDNAs in each lattice point is presented by the height of the vertical bar. (D)
The distances of vectorial data between neighbouring lattice points on PartsTetra are
visualized as blackness levels with a U-matrix method as described by Iwasaki et al. (2013).
Fig. 3. Contribution level of each functional part for amino-acid dependent clustering. (A)
The percentages of pure lattice points for each amino acid on PartsTetra, in which one
functional part is omitted, is presented by a vertical bar colored as follows: no omission (■),
omission of 5’ acceptor (■), D-arm (■), anticodon-arm (■), V-arm (■), T-arm (■), and 3’
22
acceptor (■). (B) The percentage of pure lattice points on Tetra for all parts and each part is
presented for each amino acid by a vertical bar colored as follows: 5’ acceptor (■), D-arm (■),
anticodon-arm (■), V-arm (■), T-arm (■), and 3’ acceptor (■).
Fig. 4. PartsTetra for metagenomic plus species-known microbial tDNAs. (A) Both; Lattice
points are marked for both types of tDNAs as described in Fig. 1A. Metagenome or Known;
lattice points containing only metagenomic or species-known microbial tDNAs are marked as
described in Fig. 1A. (B) Lattice points containing metagenomic or species-known microbial
tDNAs of Ala and Asn are marked as described in Fig. 1B. (C) 3D-view. Number of
metagenomic and species-know tDNAs of Ala and Asn in each lattice point on PartsTetra is
presented by the height of the vertical bar.
Fig. 5. PartsTetra for metagenomic plus species-known microbial tDNAs of one amino acid.
(A) 2D-view. All; lattice points containing tDNAs derived from only bacterial, archaeal,
fungal, or metagenomic sequences are colored in blue, red, green, or gray, respectively, and
those containing tDNAs from more than one category are marked in back. Bacteria, Archaea,
or Fungi panel; lattice points containing metagenomic tDNAs plus bacterial, archaeal, or
fungal tDNAs are separately colored as described for the All panel. (B) 3D-view. Number of
metagenomic and species-known tDNAs in each lattice point on PartsTetra is presented by
the height of the vertical bar.
Fig. 1
Tri Tetra PentaA
Gly
ATAGA CATAAGATAA
Arg Ile Met C Penta
D Penta
B Penta
E PentaTTCGA
Fig. 2
ProMetArg
AsnAlaBPartsTetraA
D U-matrixC Arg Met
(A)
(B)
0
20
40
60
80
100
Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
All parts omission of 5’ acceptor omission of D-armomission of Anticodon-arm ommision of V-arm omission of T-armomission of 3'acceptor
(%)
(%)
Fig. 3
0
20
40
60
80
100
Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
5' acceptor D-arm Anticodon-arm V-arm T-arm 3'acceptor
A
B
Fig. 4KnownMetagenomeBothA
AsnAla AsnAlaMetagenome KnownB Metagenome Known
AlaAlaMetagenome KnownC AsnAsnMetagenome Known
U-matrixFungiBacteria ArchaeaAllLeu
Fig. 5A
Pro Meta
Pro Known
Ala Meta
Ala Known
B
1
23
Leu Meta
Leu Known
123
12
3
LEGENDS TO SUPPLEMENTARY FIGURES
Supplementary Fig. S1. Secondary cloverleaf structure of tRNAAla (Escherichia coliK-12)
Supplementary Fig. S2. Visualization of lattice points containing tDNAs of one amino acid on Penta. (A) Penta listed in Figure 1A. (B) Lattice points containing tDNAs of one amino acid on Penta are visualized as described in Fig. 2A. (C) Number of tDNAs in each lattice point on Penta is presented for individual amino acids by the height of the vertical bar.
Supplementary Fig. S3. Multiple sequence alignment of tDNAs belonging to a sharp peak in a satellite spot located away from the correspondent major territories. Multiple sequence alignment was conducted with ClustalW supported by DDBJ. (A) An example of one base difference. (B) An example of a few base difference. Class/family and species for each tRNA are listed; for details, see tRNADB-CE.
Supplementary Fig. S4. Occurrence level of pentanucleotide of interest in bacterial tDNAs of one amino acid on Penta. Pentanucleotides located in oligonucleotides specified in Fig. 3C are analysed.
Supplementary Fig. S5. Visualization of lattice points containing tDNAs of individual amino acids on PartsTri. Lattice points containing tDNAs of individual amino acids on PartsTri are visualized separately with the colour used in Fig. 1A.
5’
5' side of acceptor-stem
D-arm
Anticodon-arm
Variable-arm (V-arm)
T-arm
3' side of acceptor-stem
3’Fig. S1
Fig. S2
SecGlyGlu
GlnCysPentaA B
C ArgGlyGlu
C11102389-Arg-GCG GTGTCCATAGCTCAGTTGGATAGAGCGTTAGATTGCGATTCTTAAGGTCGGAGGTTCAAGC11102523-Arg-GCG GTGTCCATAGCTCAGTTGGATAGAGCGTTAGATTGCGATTCTTAAGGTCGGAGGTTCAAGC11102556-Arg-GCG GTGTCCATAGCTCAGTTGGATAGAGCGTTAGATTGCGATTCTTAAGGTCGGAGGTTCAAGC002902-Arg-GCG GTGTCCATAGCTCAGTTGGATAGAGCGTTAGATTGCGATTCTTAAGGTCGGAGGTTCAAGC09101517-Arg-GCG GTGTCCATAGCTCAGTTGGATAGAGCGTTAGATTGCGATTCTTAAGGTCGGAGGTTCAAGC11101558-Arg-GCG GTGTCCATAGCTCAGTTGGATAGAGCGTTAGATTGCGATTCTTAAGGTCGGAGGTTCAAGC001499-Arg-GCG GTGTCCATAGCTCAGTTGGATAGAGCGTTAGATTGCGATTCTTAAGGTCGGAGGTTCAAG
************************************************************
C11102389-Arg-GCG TCCTCTTGGACACGAAA Spirochaetes, Borrelia bissettii DN127C11102523-Arg-GCG TCCTCTTGGACACGAAA Spirochaetes, Borrelia burgdorferi JD1C11102556-Arg-GCG TCCTCTTGGACACGAAA Spirochaetes, Borrelia burgdorferi N40C002902-Arg-GCG TCCTCTTGGACACGAAA Spirochaetes, Borrelia garinii PBiC09101517-Arg-GCG TCCTCTTGGACACGAAA Spirochaetes, Borrelia burgdorferi ZS7C11101558-Arg-GCG TCCTCTTGGACACGGAA Spirochaetes, Borrelia afzelii PKoC001499-Arg-GCG TCCTCTTGGACACGGAA Spirochaetes, Borrelia afzelii PKo
************** **
C08004967-Asn-GTT TCCGGCGTAGCTCAACCGGCAGAGCGGGTGGCTGTTAACCACTAGGTTGGGGGTTCGAGTC11123141-Asn-GTT TCCGGCGTAGCTCAACCGGCAGAGCGGGTGGCTGTTAACCACTAGGTTGGGGGTTCGAGTC08004966-Asn-GTT TCCGGCGTAGCTCAACCGGCAGAGCGGGTGGCTGTTAACCACTAGGTTGGGGGTTCGAGTC08011006-Asn-GTT TCCGGCGTAGCTCAACCGGTAGAGCGGGTGGCTGTTAACCACTAGGTTGGGGGTTCGAGTC025681-Asn-GTT TCCGGCGTAGCTCAATAGGCAGAGCGGGTGGCTGTTAACCACTAGGTTGGGGGTTCGAGTC09110479-Asn-GTT TCCGGCGTAGCTCAATAGGCAGAGCGGGTGGCTGTTAACCACTAGGTTGGGGGTTCGAGTC09110480-Asn-GTT TCCGGCGTAGCTCAATAGGCAGAGCGGGTGGCTGTTAACCACTAGGTTGGGGGTTCGAGTC025683-Asn-GTT TCCGGCGTAGCTCAATAGGCAGAGCGGGTGGCTGTTAACCACTAGGTTGGGGGTTCGAGT
*************** ** ****************************************
C08004967-Asn-GTT CCCTCCGCCGGAGCCA Thermotogae, Fervidobacterium nodosum Rt17C11123141-Asn-GTT CCCTCCGCCGGAGCCA Thermotogae, Thermotoga thermarum DSM 5069C08004966-Asn-GTT CCCTCCGCCGGAGCCA Thermotogae, Fervidobacterium nodosum Rt17C08011006-Asn-GTT CCCTCCGCCGGAGCCA Thermotogae, Thermosipho lettingae TMOC025681-Asn-GTT CCCTCCGCCGGAGCCA Thermotogae, Thermosipho melanesiensis BI429C09110479-Asn-GTT CCCTCCGCCGGAGCCA Thermotogae, Thermosipho africanus TCF52BC09110480-Asn-GTT CCCTCCGCCGGAGCCA Thermotogae, Thermosipho africanus TCF52BC025683-Asn-GTT CCCTCCGCCGGAGCCA Thermotogae, Thermosipho melanesiensis BI429
****************
Fig. S3 A
B
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
GGTTC GTTCG TTCGA TAGCT AGCTC GCTCA
Fig. S4
ValTyrTrpThr
SerProPheMet
LysLeuIleHis
GlyGluGlnCys
AspAsnArgAlaFig. S5