Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
THE UNIVERSITY OF CHICAGO
VISUAL REPRESENTATION LEARNING IN GENOMICS
A DISSERTATION SUBMITTED TO
THE FACULTY OF THE DIVISION OF THE THESIS DIVISION
IN CANDIDACY FOR THE DEGREE OF
TYPE OF DEGREE
DEPARTMENT OF THESIS DEPARTMENT
BY
THESIS AUTHOR
CHICAGO, ILLINOIS
GRADUATION DATE
Copyright c© 2020 by Thesis Author
All Rights Reserved
TABLE OF CONTENTS
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1 PROBLEM STATEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.1 Genomic Islands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.1.1 Background and motivation . . . . . . . . . . . . . . . . . . . . . . . 22.1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.1.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Operons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.2.1 Background and Related Work . . . . . . . . . . . . . . . . . . . . . 62.2.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 METHODS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.1 From Genes to Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.2 Why Images? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.3 Deep Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4 SUMMARY OF WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
iii
LIST OF FIGURES
3.1 Examples of images generated from the compare region viewer. Each directedarrow represents a gene color coded to match its functionality. The first rowis the genome neighborhood of the focus gene (red), and the subsequent rowsrepresent anchored regions from similar genomes sorted by their phylogeneticdistances to the query genome. . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Example of an image generated by our service to be fed as input to the neuralnetwork. Each arrow represents a single gene. Each row captures the area ofinterest in a genome. The query genome is the top row. The rest of the rows aregenomes selected by evolutionary distance. The query gene pair are color-codedwith blue, and the inter-genetic distance between them is colored red. The restof the genes share the same color if they belong to the same family. . . . . . . . 12
3.3 Architecture of the ResNet18 Neural Network used to predict operons. . . . . . 14
iv
LIST OF TABLES
v
ACKNOWLEDGMENTS
vi
ABSTRACT
The advent of available data on the rise, and the recent breakthroughs in hardware and soft-
ware technologies over the past years, helped Machine Learning(ML) establish itself as the
golden standard across many domains, with its applications’ performance beating previous
state of the art methods. One of the areas where this is directly noticeable and that receives
significant attention justified by great improvements in the field, is computer vision. In this
thesis, We demonstrate that even problems that are not intuitively identified as part of the
computer vision domain, can leverage the computer vision technologies by being mapped
into the domain space. In addition, and through a technique called transfer learning, deep
neural networks that are powering the advances in computer vision applications can be used
in areas that suffer from having limited datasets and lack of ground-truth data, which is
the cornerstone for any supervised learning task. Transfer learning works by training these
deep models first on large - even if irrelevant- datasets, and then retraining(usually the last
layers) on the limited datasets corresponding to the application of interest. For example, a
deep neural network can first be trained on Imagenet, a well-known dataset that contains
more than a million images, and then re-trained on any other image dataset over which
classification is to be performed. This approach has been shown to be superior over training
a model starting with random weights. It also alleviates the risk of overfitting and makes
it possible to use deep models when the datasets are extremely limited. Even with genomic
data being constantly on the rise, having meaningful and labeled data is a constant chal-
lenge, which makes supervised learning applications on some bioinformatics problems more
infeasible. There are numerous problems that suffer from limited datasets in bioinformatics,
we demonstrate the effectiveness of transfer learning on two of these problems, after mapping
them to computer vision tasks. We further show that the results we achieve are better than
those reported by state of the art methods, that rely mostly on tabular data. The two prob-
lems we present are identifying genomic islands, and operons, in bacterial genomes. These
are two well-known genome annotation problems, known for being experimentally costly and
vii
to suffer from the lack of extensive verified data-sets. We conclude this research with a con-
jecture of why these problems perform better when represented as image-related problems,
than their tabular counterparts, even in cases where the underlying data is supposedly the
same.
viii
CHAPTER 1
PROBLEM STATEMENT
In this research, we demonstrate the effectiveness and superiority of using computer vision
technologies on visual representations of tabular data that may not be intuitively repre-
sented as images. Such an advantage may owe its favor in part to the superior models that
become accessible once the underlying problem is translated into the domain of computer
vision. These deep models can be utilized through a technique called transfer learning. Such
an approach becomes very important in areas where verified datasets are limited, even if
the problems may not intuitively be those of computer vision. Biology offers a myriad of
such problems. In this thesis, we focus on two problems: identifying genomic islands and
identifying operons in prokaryotic genomes. We present our methods (Shutter Island for
Genomic Islands, Operon Hunter for operons), that use a deep learning model widely used
in computer vision, to detect genomic islands and operons respectively. The intrinsic value
of deep learning methods lies in their ability to generalize. Via transfer learning, the model
is pre-trained on a large generic dataset and then re-trained on images that we generate to
represent genomic fragments. Using manually verified reference datasets, we demonstrate
that this image-based approach generalizes better and is superior to the existing tools. We
argue that this approach could also be applied to other problems in the domain where data
is lacking, and end with a discussion of why mapping tabular data into images could be
producing better results.
1
CHAPTER 2
INTRODUCTION
2.1 Genomic Islands
2.1.1 Background and motivation
Horizontal gene transfer is the main source of adaptability for bacteria, through which genes
are obtained from different sources including bacteria, archaea, viruses, and eukaryotes. This
process promotes the rapid spread of genetic information across lineages, typically in the form
of clusters of genes referred to as genomic islands (GIs). Different types of GIs exist, often
classified by the content of their cargo genes or their means of integration and mobility. A
genomic island (GI) then is a cluster of genes that is typically between 10 kb and 200 kb in
length and has been transferred horizontally [11].
Horizontal gene transfer (HGT) may contribute to anywhere between 1.6% and 32.6% of
the genomes [12, 13, 14, 15, 16, 17, 18, 19, 20]. This percentage implies that a major factor
in the variability across bacterial species and clades can be attributed to GIs [21]. Thus,
they impose an additional challenge to our ability to reconstruct the evolutionary tree of
life. The identification of GIs is also important for the advancement of medicine, by helping
develop new vaccines and antibiotics[22, 23] or cancer therapies[24]. For example, knowing
that PAIs can carry many pathogenicity genes and virulence genes [25, 26, 27], researchers
found that potential vaccine candidates resided within PAIs [28]. Various computational
methods have been devised to detect different types of GIs, but no single method currently
is capable of detecting all GIs.
2.1.2 Related Work
While early methods focused on manual inspection of disrupted genomic regions that may
resemble GI attachment sites [29] or show unusual nucleotide content [30, 31], the most
2
recent computational methods fall into two broad categories: methods that involve sequence
composition and methods that involve comparative genomics [1]. They both focus on one or
more of the features that make GIs distinct, such as compositional bias, mobility elements,
and transfer RNA (tRNA) hotspots [5, 7, 11, 25, 26, 32, 33, 34]. We discuss some of these
features in more detail; they are listed by decreasing order of importance in [1, 35].
• One of the most important features of GIs is that they are sporadically distributed;
that is, they are found only in certain isolates from a given strain or species.
• Since GIs are transferred horizontally across lineages and since different bacterial lin-
eages have different sequence compositions, measures such as GC content or, more gen-
erally, oligonucleotides of various lengths (usually 2–9 nucleotides) are being used [36,
37, 38, 39, 40, 41, 42]. Codon usage is a well-known metric, which is the special case
of oligonucleotides of length 3.
• Since the probability of having outlying measurements decreases as the size of the
region increases, tools usually use cut-off values for the minimum size of a region (or
gene cluster) to be identified as a GI.
• Another type of evidence comes not from the attachment sites but from what is in
between, since some genes (e.g., integrases, transposases, phage genes) are known to
be associated with GIs [25].
• In addition to the size of the cluster, evidence from mycobacterial phages [43] suggests
that the size of the genes themselves is shorter in GIs than in the rest of the bacterial
genome. Different theories suggest that this may confer mobility or packaging or
replication advantages [8].
• Some GIs integrate specifically into genomic sites such as tRNA genes, introducing
flanking direct repeats. Thus, the presence of such sites and repeats may be used as
evidence for the presence of GIs [45, 46, 47].
3
Other research suggests that the directionality of the transcriptional strand and the protein
length are key features in GI prediction [8]. The available tools focus on one or more of the
mentioned features.
Islander works by first identifying tRNA and transfer-messenger RNA genes and their
fragments as endpoints to candidate regions, then disqualifying candidates through a set of
filters such as sequence length and the absence of an integrase gene [3]. IslandPick identifies
GIs by comparing the query genome with a set of related genomes selected by an evolutionary
distance function [44]. It uses Blast and Mauve for the genome alignment. The outcome
heavily depends on the choice of reference genomes selected. Phaster uses BLAST against
a phage-specific sequence database (the NCBI phage database and the database developed
by Srividhya et al. [48]), followed by DBSCAN [49] to cluster the hits into prophage regions.
IslandPath-DIMOB considers a genomic fragment to be an island if it contains at least one
mobility gene, in addition to 8 or more consecutive open reading frames with di-nucleotide
bias [50]. SIGI-HMM uses the Viterbi algorithm to analyze each gene’s most probable
codon usage states, comparing it against codon tables representing microbial donors or highly
expressed genes, and classifying it as native or non-native accordingly [51]. PAI-IDA uses the
sequence composition features, namely, GC content, codon usage, and dinucleotide frequency,
to detect GIs [52]. Alien Hunter (or IVOM) uses k-mers of variable length to perform its
analysis, assigning more weight to longer k-mers [38]. Phispy uses random forests to classify
windows based on features that include transcription strand directionality, customized AT
and GC skew, protein length, and abundance of phage words [8]. Phage Finder classifies 10
kb windows with more than 3 bacteriophage-related proteins as GIs [53]. IslandViewer is an
ensemble method that combines the results of three other tools—SIGI-HMM, IslandPath-
DIMOB, and IslandPick—into one web resource [54].
4
2.1.3 Challenges
No single tool is able to detect all GIs in all bacterial genomes [31]. Methods that narrow
their search to GIs that integrate under certain conditions, such as into tRNAs, miss out
on the other GIs. Similarly, not all GI regions exhibit atypical nucleotide content [22, 55].
Evolutionary events such as gene loss and genomic rearrangement [5] present more challenges.
Also, highly expressed genes (e.g., genes in ribosomal protein operons), or those having an
island host and donor that belong to the same or closely related species, or the fact that
amelioration would pressure even genes from distantly related genomes to adapt to the host
over time would lead to the host and the island exhibiting similar nucleotide composition [56]
and subsequently to false negatives [23]. Tools that use windows face difficulty in adjusting
their size: small sizes lead to large statistical fluctuation, whereas larger sizes result in low
resolution [57].
For comparative genomics methods, the outcomes depend strongly on the choice of
genomes used in the alignment process. Very distant genomes may lead to false positives,
and very close genomes may lead to false negatives. In general, the number of reported GIs
may differ across tools, because one large GI is often reported as a few smaller ones or vice
versa, also making it harder to detect end points and boundaries accurately, even with the
use of hidden Markov models by tools such as AlienHunter and SIGI-HMM.
Furthermore, no reliable GI dataset exists that can validate the predictions of all these
computational methods [38]. Although several databases exist, they usually cover only
specific types of GIs [Islander, PAIDB, ICEberg], which would flag as false positives any
extra predictions made by those tools. Moreover, as Nelson et al. state, “The reliability of
the databases has not been verified by any convincing biological evidence” [6].
5
2.2 Operons
2.2.1 Background and Related Work
In prokaryotes, contiguous genes are often arranged into operons. These operons usually
include metabolically or functionally related genes that are often grouped together on the
same strand and transcribed in the same polycistronic messenger RNA [58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68]. Genes in the same operon share a common promoter and terminator[69].
We reserve the definition of Operons as those including at least two genes, distinguishing
them from Transcription Units, which may include one or more genes[1]. It is estimated
that more than 50% of bacterial genes form operons[70]. Since operons are often formed
by genes that are related functionally or metabolically, predicting operons could help us
predict higher levels of gene organization and regulatory networks[58, 59, 65, 66, 67, 68, 71].
Such predictions could help annotate gene functions[72], aid in drug development[73], and
antibiotic resistance[74].
Different methods use different features of operons to make their predictions. Some meth-
ods focus on promoters and terminators and use Hidden Markov Models(HMMs) to make
their predictions[75, 76, 77], others rely on gene conservation information[78], while others
leverage functional relatedness between the genes instead[58, 79]. Perhaps the most promi-
nent features used in operon prediction are transcription direction and inter-genetic distances
[58, 66, 70, 81, 82, 84, 83, 85, 86, 87, 88, 89, 91]. The idea that adjacent genes that are con-
served across multiple genomes are likely to be co-transcribed makes gene conservation an-
other important feature[67, 92]. In addition to HMMs, different machine learning(ML) tech-
nologies are deployed for the predition of operons, such as neural networks[81, 82], support
vector machines [83], and decision tree-based classification[84]. Other tools utilize Bayesian
probabilities [85, 86, 87], genetic algorithms[88], and graph-theoretic techniques[86, 79].
Many tools perform their predictions on gene pairs, and then assemble their predictions
6
that are made on an entire genome into operons by combining contiguous pairs that are
labeled as operonic.[58, 60]. These tools are often benchmarked using two genomes whose
operons have been studied extensively, E. coli and B. subtilis[60].
Among the tools that predict operons, we focus on two methods. The first is reported
to have the highest accuracy among operon prediction tools[64]. It is based on an arti-
ficial neural network that uses inter-genetic distance and protein functional relationships.
The method uses gene neighborhood, fusion, co-occurrence and co-expression in addition to
protein-protein interaction and literature mining to generate a score that can be used as
input to their network[62, 64]. The predictions made by this method were compiled into
what is called the Prokaryotic Operon Database(ProOpDB)[65]. The second tool is called
the Database of Prokaryotic Operons (DOOR) and was ranked as the best operon predictor
among 14 other predictors[59], and was also reported by another study to have the second
highest accuracy after ProOpDB[64]. Their algorithm uses a combination of a non-linear
(decision-tree based) classifier and a linear (logistic function-based classifier) depending on
the number of experimentally validated operons that could be used in the training[59].
2.2.2 Challenges
Before delving into our method, it’s worth pointing out some of the challenges that un-
dermine operon prediction. Predictors that rely on features such as gene conservation or
functional assignment are limited in the sense that they could not be applied to genomes or
genomic fragments that lack these features. So while such predictors might perform well on
gene pairs that include the necessary features, their performance might drop considerably
when considering the whole genome[58]. Moreover, even though most methods validate their
results by comparing their predictions to experimentally verified operons, the fact that the
experimentally verified datasets are not available for all prokaryotes or that the datasets
7
used vary between studies poses extra challenges when trying to compare prediction meth-
ods. Brouwer et al tried to compare several methods on a uniform dataset and noticed
significant gap in the metrics reported. The drop was even higher when considering full
operons rather than gene pairs[64, 93]. Finally, Some methods include the testing dataset in
the training dataset which leads to reported accuracies that are significantly higher, whereas
the information flow would not be easily and readily transferable to other genomes[66].
8
CHAPTER 3
METHODS
3.1 From Genes to Images
PATRIC (the Pathosystems Resource Integration Center) is a bacterial bioinformatics re-
source center that we are part of (https://www.patricbrc.org)[94]. It provides researchers
with the tools necessary to analyze their private data and to compare it with public data.
PATRIC recently surpassed the 200,000 publicly sequenced genomes mark, ensuring that
enough genomes are available for effective comparative genomics studies. PATRIC provides
a compare region viewer service, where a query genome is aligned against a set of other
related genomes anchored at a specific focus gene. The service finds other genes that are of
the same family as the query gene and then aligns their flanking regions accordingly. Such
graphical representations are appealing because they help users visualize the genomic areas
of interest. In the resulting plots, genomic islands should appear as gaps in alignment as
opposed to conserved regions. We replicated the service by implementing an offline version
forked from the production user interface, which is necessary for computational efficiency
and for consistency in the face of any future interface changes.
Figure 2.1 shows sample visualizations of different genomic fragments belonging to the
two classes. Each row represents a region in a genome, with the query genome being the top
row. Each arrow represents a single gene, capturing its size and strand directionality. Colors
represent functionality. The red arrow represents the query gene, at which the alignment
with the rest of the genomes is anchored. The remaining genes share the same color if they
belong to the same family; they are colored black if they are not found in the query genome’s
focus region. Some colors are reserved for key genes: green for mobility genes, yellow for
tRNA genes, and blue for phage related genes.
Figures 2.1a and 2.1b are examples of a query genome with a non-conserved neighbor-
hood. The focus peg lacks alignments in general or is being aligned with genes from other
9
genomes with different neighborhoods, containing genes with different functionalities from
those in the query genome (functionality is color coded). In contrast, Figures 2.1c and 2.1d
show more conserved regions, which are what we expect to see in the absence of GIs.
Figure 3.1: Examples of images generated from the compare region viewer. Each directedarrow represents a gene color coded to match its functionality. The first row is the genomeneighborhood of the focus gene (red), and the subsequent rows represent anchored regionsfrom similar genomes sorted by their phylogenetic distances to the query genome.
10
3.2 Why Images?
Representing genomic fragments as images makes it easier to leverage the powerful machine
learning (ML) technologies that have become the state of art in solving computer vision
problems. Algorithms based on deep neural networks have proven to be superior in competi-
tions such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRS) [95]. Deep
learning is the process of training neural networks with many hidden layers. The depth of
these networks allows them to learn more complex patterns and higher-order relationships,
at the cost of being more computationally expensive and requiring more data to work effec-
tively. Improvements in such algorithms have been translated to improvements in a variety
of domains reliant on computer vision tasks [96].
The images generated by PATRIC capture many of the most important GI features
mentioned earlier—the sporadic distribution of islands, the protein length, functionality,
and strand directionality—using color-coded arrows of various sizes.
For operons, the images generated are almost the same as the ones we generated for
genomic slands. The notable differences are the region size, that we reduced to 5(kilo base
pairs)KBp since the immediate vicinity is more relevant when dealing with operons. The
arrows and the distance between them are scaled on each row to reflect the actual length and
distance between the genes. Colors still represent functionality. The blue arrows are reserved
to represent the query gene pair. The rest of the genes are still color-coded by their family
membership. The families used in the coloring process are PATRIC’s Global Pattyfams that
are generated by mapping signature k-mers to protein functionality, using non-redundant
protein databases built per genus before being transferred across genera[97]. In addition, we
highlight the inter-genetic distance between the query gene pair by filling the gap as a red
rectangle, and removed the arrows from all rows except the first row representing the query
genome, since that is the only directionality we care about. This way, the generated images
capture most of the prominent operon features mentioned earlier, such as gene conservation,
functionality, strand direction, size, and inter-genetic distance. Figure 3.1 shows an example
11
of how the generated images look like.
Figure 3.2: Example of an image generated by our service to be fed as input to the neuralnetwork. Each arrow represents a single gene. Each row captures the area of interest in agenome. The query genome is the top row. The rest of the rows are genomes selected byevolutionary distance. The query gene pair are color-coded with blue, and the inter-geneticdistance between them is colored red. The rest of the genes share the same color if theybelong to the same family.
3.3 Deep Transfer Learning
Training deep models over a limited dataset puts the model at the risk of overfitting. One
way around this problem is using a technique referred to as transfer learning[98]. In transfer
learning, a model does not have to be trained from scratch. Instead, the idea is to retrain
a model that has been previously trained on a related task. The newly retrained model
should then be able to transfer its existing knowledge and apply it to the new task. This
approach affords the ability to reuse models that have been trained on huge amounts of data,
while adding the necessary adjustments to make them available to work with more limited
datasets, adding a further advantage to our approach of representing the data visually.
In our approach (Shutter Island), we use Google’s Inception V3 architecture that has
12
been previously trained on ImageNet. The Inception V3 architecture is a 48-layer-deep
convolutional neural network [96]. Training such a deep network on a limited dataset such
as the one available for GIs is unlikely to produce good results. ImageNet is a database that
contains more than a million images belonging to more than a thousands categories [95].
The ImageNet project runs the ILSVRC dataset annually. The Inception V3 model reaches
a 3.5% top-5 error rate on the 2012 ILSVRC dataset, where the winning model that year
had a 15.3% error rate. Thus, a model that was previously trained on ImageNet is already
good at feature extraction and visual recognition. To make the model compatible with the
new task, the top layer of the network is retrained on our GI dataset, while the rest of the
network is left intact, a strategy that is more powerful than starting with a deep network
with random weights.
For Operon Hunter, in addition to the mentioned changes in the generated images,
some changes were also introduced on the training part. Namely, we switched to using
the FastAI[99] framework and tested the performance of the different available pre-trained
models on our dataset. We observed the best performance when using the ResNet18 model,
whose architecture is displayed in Figure 3.2. We also found better results when we changed
the mode of transfer learning. Specifically, instead of only re-training the last layer of the
network while freezing the weights in the previous layers, we noticed better results when the
previous layers’ weights were unfrozen.
In addition to using powerful technologies and extensive data, this approach may add an
extra advantage over whole-genome alignment methods because performing the alignment
over each gene may provide a higher local resolution and aid in resisting evolutionary effects
such as recombination and others that may have happened after the integration and that
usually affect GI detection efforts.
13
Figure 3.3: Architecture of the ResNet18 Neural Network used to predict operons.
14
CHAPTER 4
SUMMARY OF WORK
So far, we successfully mapped two problems from biology to qualify as computer vision
problems. To do that, we implemented tools that represent genomic fragments as images,
with genes as arrows and different rows capturing genome alignments. We successfully re-
trained a well known computer vision deep learning neural network on the resulting image
datasets related to these problems, and demonstrated the superiority of this approach over
the existing state of the art methods that rely on tabular data. To do that, we successfully
curated the appropriate datasets and devised extensive validation methods to perform a uni-
form comparison across other tools, which was needed especially given the existing seemingly
random nature of validation in the literature. We are currently exploring the benefits of this
image-based approach when compared to tabular-based ones, especially in situations where
the former is counter-intuitive and hope to come up with a conjecture of what may be the
reason for the superior performance.
15
REFERENCES
[1] Langille, M., Hsiao, W., Brinkman, F. (2010) Detecting genomic islands using bioinfor-matics approaches. Nature Reviews Microbiology, 8(5), 373–382.
[2] Hacker, J., et al. (1990) Deletions of chromosomal regions coding for fimbriae andhemolysins occur in vitro and in vivo in various extraintestinal Escherichia coli isolates.Microb. Pathog, 8, 213—225.
[3] Hudson, C., Lau, B., Williams, K. (2014) Islander: a database of precisely mappedgenomic islands in tRNA and tmRNA genes. Nucleic Acids Research, 43(D1), D48–D53.
[4] Barondess, J.J., Beckwith, J. (1990) A bacterial virulence determinant encoded by lyso-genic coliphage lambda. Nature, 346, 871—874.
[5] Dobrindt, U., Hochhut, B., Hentschel, U., Hacker, J. (2004) Genomic islands inpathogenic and environmental microorganisms. Nat. Rev. Microbiol., 2, 414—424.
[6] Lu, B., Leong, H. (2016) Computational methods for predicting genomic islands in mi-crobial genomes. Computational And Structural Biotechnology Journal, 14, 200–206.
[7] Juhas, M., van der Meer, J.R., Gaillard, M., Hood, D.W., et al. (2009) Genomic islands:tools of bacterial horizontal gene transfer and evolution. FEMS Microbiol. Rev., 33,376—3793.
[8] Akhter, S., Aziz, R., Edwards, R. (2012) PhiSpy: a novel algorithm for finding prophagesin bacterial genomes that combines similarity- and composition-based strategies. NucleicAcids Research, 40(16), e126–e126.
[9] Fogg, P., Colloms, S., Rosser, S., Stark, M., Smith, M. (2014) New applications for phageintegrases. Journal of Molecular Biology, 426(15), 2703–2716.
[10] Hambly, E., Suttle, C.A.. (2005) The viriosphere, diversity, and genetic exchange withinphage communities. Curr. Opin. Microbiol., 8, 444-–50.
[11] Hacker, J., Kaper, J.B. (2000) Pathogenicity islands and the evolution of microbes.Annu. Rev. Microbiol., 54, 641—679.
[12] Liu, H., Zhu, J. (2010) Analysis of the 3’ ends of tRNA as the cause of insertion sites offoreign DNA in Prochlorococcus. Journal Of Zhejiang University SCIENCE B, 11(9),708–718.
[13] Nelson, K.E., et al. (1999) Evidence for lateral gene transfer between archaea and bac-teria from genome sequence of Thermotoga maritime. Nature, 399(6734), 323-329.
[14] Garcia-Vallve, V.S., Romeu, A., Palau, J. (2000) Horizontal gene transfer in bacterialand archaeal complete genomes. Genome Res., 10(11), 1719–1725.
16
[15] Koonin, E.V., Makarova, K.S., Aravind, L. (2001) Horizontal gene transfer in prokary-otes: quantification and classification. Annu. Rev. Microbiol., 55(1), 709-742.
[16] Nakamura, Y., Itoh, T., Matsuda, H., Gojobori, T., (2004) Biased biological functionsof horizontally transferred genes in prokaryotic genomes. Nat. Genet., 36(7), 760–766.
[17] Choi, I.G., Kim, S.H., (2007) Global extent of horizontal gene transfer. PNAS, 104(11),4489–4494.
[18] Casjens, S. (2003) Prophages and bacterial genomics: what have we learned so far?Mol. Microbiol., 49, 277—300.
[19] Casjens, S., Palmer, N., van Vugt, R., Huang, W.M., Stevenson,B., Rosa, P., Lathigra,R., Sutton, G., Peterson, J., Dodson, R.J. et al. (2000) A bacterial genome in Fux: thetwelve linear and nine circular extrachromosomal DNAs in an infectious isolate of theLyme disease spirochaete Borrelia burgdorferi. Mol. Microbiol., 35, 490-–516.
[20] Canchaya, C., Proux, C., Fournous, G., Bruttin, A., Brussow, H. (2003) Prophagegenomics. Microbiol. Mol. Biol. Rev., 67, 238-–276.
[21] Arndt, D., Grant, J., Marcu, A., Sajed, T., Pon, A., Liang, Y., & Wishart, D. (2016)PHASTER: a better, faster version of the PHAST phage search tool. Nucleic AcidsResearch, 44(W1), W16–W21.
[22] Zhou, Y., Liang, Y., Lynch, K., Dennis, J., & Wishart, D. (2011) PHAST: a fast phagesearch tool. Nucleic Acids Research, 39, W347–W352.
[23] Coates, A.R., Hu, Y. (2007) Novel approaches to developing new antibiotics for bacterialinfections. Br. J. Pharmacol., 152, 1147—1154.
[24] Bar, H., Yacoby, I., Benhar,I. (2008) Killing cancer cells by targeted drug-carrying phagenanomedicines. BMC Biotechnol., 8, 37.
[25] Hacker, J., Blum-Oehler, G., Muhldorfer, I., Tschape, H. (1997) Pathogenicity islands ofvirulent bacteria: structure, function and impact on microbial evolution. Mol. Microbiol.,23, 1089—1097.
[26] Schmidt H, Hensel M. (2004) Pathogenicity Islands in bacterial pathogenesis. Clin.Mcrobiolog. Rev., 17, 14—56.
[27] Ho Sui, S.J,, Fedynak, A., Hsiao, W.W.L., Langille, M.G.I., Brinkman, F.S.L. (2009)The association of virulence factors with genomic islands. PLoS One, 4, e8094.
[28] Moriel, D.G., Bertoldi, I., Spagnuolo, A., Marchi, S., Rosini, R., et al. (2010) Identifi-cation of protective and broadly conserved vaccine antigens from the genome of extrain-testinal pathogenic Escherichia coli. Proc. Natl. Acad. Sci. U S A, 107, 9072-–9077.
[29] Fouts, D.E. (2004) Bacteriophage bioinformatics. In Fraser, C.M., Read, T.D., Nelson,K.E. (eds), Microbial Genomes. Humana Press Inc., Totowa, NJ, pp. 71—91.
17
[30] Nicolas, P., Bize, L., Muri, F., Hoebeke, M., Rodolphe,F ., Ehrlich, S.D., Prum, B.,Bessieres,P. (2002) Mining Bacillus subtilis chromosome heterogeneities using hiddenMarkov models. Nucleic Acids Research, 30, 1418—1426.
[31] Srividhya, K., Alaguraj, V., Poornima, G., Kumar, D., Singh, G.P., Raghavenderan, L.,Katta, M., Mehta, P., Krishnaswamy, S. (2007) Identification of prophages in bacterialgenomes by dinucleotide relative abundance difference. PLoS One, 2, e1193.
[32] Gogarten, J.P., Townsend, J.P.. (2005) Horizontal gene transfer, genome innovation andevolution. Nat. Rev. Microbiol., 3, 679-–687.
[33] Soucy, S.M., Huang, J., Gogarten, J.P. (2015) Horizontal gene transfer: building theweb of life. Nat. Rev. Genet., 16, 472–482.
[34] Hacker, J., Bender, L., Ott, M., Wingender, J., Lund, B., et al. (1990) Deletions ofchromosomal regions coding for fimbriae and hemolysins occur in vitro and in vivo invarious extra intestinal Escherichia coli isolates. Microb. Pathog., 8, 213-–225.
[35] Vernikos, G. S., Parkhill, J. (2008) Resolving the structural features of genomic islands:a machine learning approach. Genome Res., 18, 331—342.
[36] Greub, G., Collyn, F., Guy, L., Roten, C.A. (2004) A genomic island present along thebacterial chromosome of the Parachlamydiaceae UWE25, an obligate amoebal endosym-biont, encodes a potentially functional F-like conjugative DNA transfer system. BMCMicrobiology, 4(1), 48
[37] Lawrence, J.G., Ochman, H. (1998) Molecular archaeology of the Escherichia coligenome. Proceedings of the National Academy of Sciences, 95(16), 9413—9417.
[38] Vernikos, G. S., Parkhill, J. (2006) Interpolated variable order motifs for identificationof horizontally acquired DNA: revisiting the Salmonella pathogenicity islands. Bioinfor-matics, 22, 2196—2203.
[39] Karlin, S., Mrazek, J. & Campbell, A. M. (1998) Codon usages in different gene classesof the Escherichia coli genome. Mol. Microbiol., 29, 1341—1355.
[40] Karlin, S. (2001) Detecting anomalous gene clusters and pathogenicity islands in diversebacterial genomes. Trends Microbiol., 9, 335—343.
[41] Sandberg, R., et al. (2001) Capturing whole-genome characteristics in short sequencesusing a naive Bayesian classifier. Genome Res., 11, 1404—1409.
[42] Tsirigos, A., Rigoutsos, I. (2005) A new computational method for the detection ofhorizontal gene transfer events. Nucleic Acids Research, 33, 922—933.
[43] Hatfull, G.F., Jacobs-Sera, D., Lawrence, J.G., Pope, W.H., Russell, D.A., Ko, C.C.,Weber, R.J., Patel, M.C., Germane, K.L., Edgar, R.H., et al. (2010) Comparative ge-nomic analysis of 60 mycobacteriophage genomes: genome clustering, gene acquisition,and gene size. J. Mol. Biol., 397, 119—143.
18
[44] Langille, M.G., Hsiao, W.W., Brinkman, FS. (2008) Evaluation .of genomic island pre-dictors using a comparative genomics approach. BMC Bioinformatics, 9, 329.
[45] Williams, K.P. (2002) Integration sites for genetic elements in prokaryotic tRNA andtmRNA genes: sublocation preference of integrase subfamilies. Nucleic Acids Res., 30,866-–875.
[46] Reiter, W.D., Palm, P., Yeats, S. (1989) Transfer RNA genes frequently serve as inte-gration sites for prokaryotic genetic elements. Nucleic Acids Research, 17, 1907-–1914.
[47] Bellanger, X., Payot, S., Leblond-Bourget, N., Guedon, G. (2014) Conjugative andmobilizable genomic islands in bacteria: evolution and diversity. FEMS Microbiol. Rev.,38, 720—760.
[48] Srividhya, K.V., Rao, G.V., Raghavenderan, L., Mehta, P., Prilusky,J., Manicka,S.,Sussman, J.L., Krishnaswamy, S. (2006) Database and comparative identification ofprophages. In: Huang,D-S, Li,K and Irwin,GW (eds). Intelligent Control and Automa-tion, Lecture Notes in Control and Information Sciences. Springer, Berlin, Vol. 344, pp.863—868.
[49] Ester, M., Kriegel, H., Sander, J., Xu, X. (1996) A density-based algorithm for discov-ering clusters in large spatial databases with noise. In: KDD-1996 Proceedings AAAIPress, Menlo Park, pp. 226—231.
[50] Hsiao, W., Wan, I., Jones, S.J., et al. (2003) IslandPath: aiding detection of genomicislands in prokaryotes. Bioinformatics, 19(3), b418—420.
[51] Waack, S., Keller, O. Asper, R., et al. (2006) Score-based prediction of genomic islandsin prokaryotic genomes using hidden Markov models. BMC Bioinformatics, 7, 142.
[52] Tu, Q., Ding, D. (2003) Detecting pathogenicity islands and anomalous gene clustersby iterative discriminant analysis. FEMS Microbiol. Lett., 221, 269—275.
[53] Fouts, D. (2006) Phage Finder: automated identification and classification of prophageregions in complete bacterial genome sequences. Nucleic Acids Res., 34, 5839-–5851.
[54] Langille, M.G., Brinkman, F. IslandViewer: an integrated interface for computationalidentification and visualization of genomic islands. Bioinformatics, 25, 664-–665.
[55] Nelson, K.E., Weinel, C., Paulsen, I.T., Dodson, R.J., Hilbert, H., Martins dos San-tos, V.A., Fouts, D.E., Gill, S.R., Pop, M., Holmes, M. et al. (2002) Complete genomesequence and comparative analysis of the metabolically versatile Pseudomonas putidaKT2440. Environ. Microbiol., 4, 799—808.
[56] Lawrence, J. G., Ochman, H. (1997) Amelioration of bacterial genomes: rates of changeand exchange. J. Mol. Evol., 44, 383—397.
[57] Zhang, R., Zhang, C.T. (2004) A systematic method to identify genomic islands andits applications in analyzing the genomes of Corynebacterium glutamicum and Vibriovulnificus CMCP6 chromosome I. Bioinformatics, 20(5), 612—622.
19
[58] Romero,P.R. and Karp,P.D. (2004) Using functional and organizational informationto improve genome-wide computational prediction of transcription units on pathway-genome databases. Bioinformatics, 20, 709–717.
[59] Mao F, Dam P, Chou J, et al (2009) . DOOR: a database for prokaryotic operons.Nucleic Acids Res 2009;37:459–463.
[60] Haller R, Kennedy M, Arnold N, Rutherford R (2010). The transcriptome of Mycobac-terium tuberculosis. Appl Microbiol Biotechnol. 2010; 86:1–9.
[61] Pelly, S., Winglee, K., Xia, F., Stevens, R. L., Bishai, W. R., & Lamichhane, G. (2016).REMap: operon map of M. tuberculosis based on RNA sequence data. Tuberculosis, 99,70-80.
[62] Taboada, B., Estrada, K., Ciria, R., & Merino, E. (2018). Operon-mapper: a web serverfor precise operon identification in bacterial and archaeal genomes. Bioinformatics, 34(23), 4118-4120.
[63] Price, M. N., Huang, K. H., Alm, E. J., & Arkin, A. P. (2005). A novel method foraccurate operon predictions in all sequenced prokaryotes. Nucleic acids research, 33 (3),880-892.
[64] Zaidi, S. S. A., & Zhang, X. (2016). Computational operon prediction in whole-genomesand metagenomes. Briefings in functional genomics, 16 (4), 181-193.
[65] Taboada, B., Ciria, R., Martinez-Guerrero, C. E., & Merino, E. (2011). ProOpDB:Prokaryotic Operon Data Base. Nucleic acids research, 40 (D1), D627-D631.
[66] Taboada, B., Verde, C., & Merino, E. (2010). High accuracy operon prediction methodbased on STRING database scores. Nucleic acids research, 38 (12), e130-e130.
[67] Bergman, N. H., Passalacqua, K. D., Hanna, P. C., & Qin, Z. S. (2007). Operon predic-tion for sequenced bacterial genomes without experimental information. Appl. Environ.Microbiol., 73 (3), 846-854.
[68] Fortino, V., Smolander, O. P., Auvinen, P., Tagliaferri, R., & Greco, D. (2014). Tran-scriptome dynamics-based operon prediction in prokaryotes. BMC bioinformatics, 15 (1),145.
[69] Fran B, Perrin D, Monod J, et al (1960). The operon : a group of genes whose expressionis coordinated by an operator. J Bacteriol;29:1727–9.
[70] Okuda S, Kawashima S, Kobayashi K, et al (2007). Characterization of relationshipsbetween transcriptional units and operon structures in Bacillus subtilis and Escherichiacoli. BMC Genomics;8:48.
[71] Hodgman TC (2000). A historical perspective on gene/protein functional assignment.Bioinformatics;16:10–5.
20
[72] Joon M, Bhatia S, Pasricha R, et al (2010). Functional analysis of an intergenic non-coding sequence within mce1 operon of M. tu- berculosis. BMC Microbiol;10:128.
[73] Wang S, Wang Y, Liang Y, et al (2007). A multi-approaches-guided genetic algorithmwith application to operon prediction. Artif Intell Med;41:151–9.
[74] Pantosti A, Sanchini A, Monaco M (2007). Mechanisms of antibiotic resistance inStaphylococcus aureus. Future Microbiol;2: 323–34.
[75] Yada,T., Nakao,M., Totoki,Y. and Nakai,K. (1999) Modeling and predicting transcrip-tional units of Escherichia coli genes using hidden Markov models. Bioinformatics, 15:987–993.
[76] Craven,M., Page,D., Shavlik,J., Bockhorst,J. and Glasner,J. (2000) A probabilisticlearning approach to whole-genome operon pre- diction. Proc. Conf. Intell. Syst. Mol.Biol., 8: 116–127.
[77] Tjaden,B., Haynor,D.R., Stolyar,S., Rosenow,C. and Kolker,E. (2002) Identifying oper-ons and untranslated regions of transcripts using Escherichia coli RNA expression anal-ysis. Bioinformatics, 18 (Suppl. 1): S337–S344.
[78] Ermolaeva,M.D., White,O. and Salzberg,S.L. (2001) Predic- tion of operons in microbialgenomes. Nucleic Acids Res., 29: 1216–1221.
[79] Zheng,Y., Szustakowski,J.D., Fortnow,L., Roberts,R.J. and Kasif,S. (2002) Computa-tional identification of operons in microbial genomes. Genome Res., 12: 1221–1230.
[80] Ogata,H., Fujibuchi,W., Goto,S. and Kanehisa,M. (2000) A heur- istic graph comparisonalgorithm and its application to detect functionally related enzyme clusters. Nucleic AcidsRes., 28: 4021–4028.
[81] Chen,X., Su,Z., Xu,Y. and Jiang,T. (2004) Computational prediction of operons inSynechococcus sp. WH8102. Genome Inform., 15, 211–222.
[82] Tran,T.T., Dam,P., Su,Z., Poole,F.L., Adams,M.W., Zhou,G.T. and Xu,Y. (2007)Operon prediction in Pyrococcus furiosus. Nucleic Acids Res., 35, 11–20.
[83] Zhang,G.Q., Cao,Z.W., Luo,Q.M., Cai,Y.D. and Li,Y.X. (2006) Operon predictionbased on SVM. Comput. Biol. Chem., 30, 233–240.
[84] Dam,P., Olman,V., Harris,K., Su,Z. and Xu,Y. (2007) Operon prediction using bothgenome-specific and general genomic information. Nucleic Acids Res., 35, 288–298.
[85] Bockhorst,J., Craven,M., Page,D., Shavlik,J. and Glasner,J. (2003) A Bayesian networkapproach to operon prediction. Bioinformatics, 19, 1227–1235.
[86] Edwards,M.T., Rison,S.C., Stoker,N.G. and Wernisch,L. (2005) A universally applica-ble method of operon map prediction on minimally annotated genomes using conservedgenomic context. Nucleic Acids Res., 33, 3253–3262.
21
[87] Westover,B.P., Buhler,J.D., Sonnenburg,J.L. and Gordon,J.I. (2005) Operon predictionwithout a training set. Bioinformatics, 21, 880–888.
[88] Jacob,E., Sasikumar,R. and Nair,K.N. (2005) A fuzzy guided genetic algorithm foroperon prediction. Bioinformatics, 21, 1403–1407.
[89] Salgado,H., Moreno-Hagelsieb,G., Smith,T.F. and Collado- Vides,J. (2000) Operonsin Escherichia coli: genomic analyses and predictions. Proc. Natl Acad. Sci. USA, 97,6652–6657.
[90] Okuda,S., Kawashima,S., Kobayashi,K., Ogasawara,N., Kanehisa,M. and Goto,S.(2007) Characterization of relationships between transcriptional units and operon struc-tures in Bacillus subtilis and Escherichia coli. BMC Genomics, 8, 48.
[91] Yan,Y. and Moult,J. (2006) Detection of operons. Proteins, 64, 615–628.
[92] Overbeek, R., M. Fonstein, M. D’Souza, G. D. Pusch, and N. Maltsev. (1999). The useof gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. USA 96:2896–2901.
[93] Brouwer RWW, Kuipers OP, Van Hijum SAFT (2008). The relative value of operonpredictions. Brief Bioinform:367–75.
[94] Wattam, A.R, ZDavis, J.J., Assaf, R., Boisvert, S., Bun, T., Conrad, N., Dietrich, E.M.,Disz, T., Gabbard, J.L., Gerdes, S., Henry, C.S., Kenyon, R.W., Machi, D., Mao, C.,Nordberg, E.K., Olsen, G.J., Murphy-Olson, D.E., Olson, R., Overbeek, R., Parrello,B., Pusch, G.D., Shukla, M., Vonstein, V., Warren, A., Xia, F., Yoo, H., Stevens, R.L.(2017) Improvements to PATRIC, the all-bacterial Bioinformatics Database and AnalysisResource Center. Nucleic Acids Research 45(D1), D535–D542.
[95] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.. (2015) ImageNet large scale visualrecognition challenge. IJCV.
[96] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z. (2016) Rethinking theInception Architecture for Computer Vision. The IEEE Conference on Computer Visionand Pattern Recognition (CVPR), 2818–2826.
[97] Davis JJ, Gerdes S, Olsen GJ, Olson R, Pusch GD, Shukla M, Vonstein V, WattamAR and Yoo H (2016) PATtyFams: Protein Families for the Microbial Genomes in thePATRIC Database. Front. Microbiol. 7:118. doi: 10.3389/fmicb.2016.00118
[98] How to Retrain an Image Classifier for New Categories - TensorFlow Hub — TensorFlow.(2018) Retrieved from https://www.tensorflow.org/hub/tutorials/image retraining
[99] FastAI — FastAI. (2018). Retrieved from https://docs.fast.ai/index.html
22