1 EBA Manual
Evolutionary Breakpoint Analyser (EBA)
Identification of evolutionary breakpoint regions and their classification using
phylogenetic relationships.
User manual
Jitendra Narayan and Denis M. Larkin
Institute of Biological, Environmental and Rural Sciences (IBERS),
Aberystwyth University
Penglais, Aberystwyth
Ceredigion
SY23 1PW
Version 1.1 (09/06/2015)
Contents
Chapter 1
First steps
1.1 Introduction
1.2 Installation
1.2.1 System Requirements
1.2.2 Obtaining the tool
1.2.3 Installing the EBA framework
1.3 Mandatory Perl Modules
1.3.1 Install Perl Modules Manually
1.4 GenBank Taxonomy Database
1.5 Sample Data
Chapter 2
How to use EBA
2.1 The script EBA.pl
2.1.1 Mandatory parameters
2.1.2 Optional parameters
2.2 Input file formats
2.2.1 Homologous Synteny Blocks (HSB) files
2.2.2 Classification file
2.2.3 Chromosome size file
2.2.4 BetaScore file
2.3 Folder setting
2.3.1 Resolution Folders
2.3.2 Location of all resolutions folders
Chapter 3
Result and Intermediate Files
3.1 Intermediate files and folder
3.1.1 EBA_OutFiles Directory
3.1.2 EBA_ImageFiles Directory
3.1.3 ResultFiles Directory
3.1.4 Viz_EHFiles Directory
Chapter 4
Merging all resolutions
4.1 Merging resolution with prime
4.2 Final merged reuse
4.3 Final merged graph
4.4 Final merged classification pie chart
4.5.3 Final merged breakpoint line graph
4.5 Other graphs
4.5.1 Breakpoint line graph
4.5.2 Breakpoint histogram
Chapter 5
Point to remember
Chapter 1
First steps
1.1 Introduction
The Evolutionary Breakpoint Analyser (EBA) implements methods for identifying evolutionary
breakpoint regions (EBRs) and classifying them by their phylogenetic relationships at different
resolutions. This manual gives an overview of the organization, installation and configuration of
the program.
EBA is written in Perl and is distributed under the GNU GPL License or the Artistic License
2.0. For more information, see http://www.gnu.org/copyleft/gpl.html and
http://opensource.org/licenses/artistic-license-2.0
1.2 Installation
1.2.1 System Requirements
EBA is implemented in Perl and has been tested under Windows, Linux and Mac OS. It has the
following requirements:
Windows, Linux or Mac OS
Perl
Perl modules (see Section 1.3)
1.2.2 Obtaining the tool
The Evolutionary Breakpoint Analyser (EBA) tool can be downloaded from the Aberystwyth University
or BioinformaticsOnline (BOL) websites:
www.bioinformaticsonline.com/EBA
1.2.3 Installing the EBA framework
Before installing and running EBA, check that Perl and the required Perl modules (see Section 1.3)
are available on your system. Perl is normally preinstalled on Mac OS and Linux. You can check
by typing perl -v at the command prompt, which prints a message like the following if Perl is
installed:
jitendra@jitendra-OptiPlex-390[EBAv1.1] perl -v
This is perl 5, version 14, subversion 2 (v5.14.2) built for x86_64-linux-gnu-thread-multi (with 80 registered patches, see perl -V
for more detail)
Copyright 1987-2011, Larry Wall
Perl may be copied only under the terms of either the Artistic License or the GNU General Public License, which may be found
in the Perl 5 source kit.
Complete documentation for Perl, including FAQ lists, should be found on this system using "man perl" or "perldoc perl". If
you have access to the Internet, point your browser at http://www.perl.org/, the Perl Home Page.
If you need more information, visit http://www.perl.org/. If your system does not show a message
like the one above, Perl is not installed; go to http://www.perl.org/get.html and follow the download
and installation instructions. On a Windows machine you can also try Strawberry Perl
(http://strawberryperl.com/) or CitrusPerl (http://www.citrusperl.com/).
To install the EBA framework, obtain the compressed EBA archive from the website mentioned
above (see Section 1.2.2) and decompress it in the desired directory:
> cp -i EBAv1.1.tar.gz <directory>
> cd <directory>
> tar xvf EBAv1.1.tar.gz
where <directory> is the path to the desired directory (for example, /usr/local/). After decompressing
the archive, a directory named EBA containing all the required scripts can be found inside
<directory>.
1.3 Mandatory Perl Modules
To install and run the EBA framework, the following mandatory Perl modules must be installed
on your operating system:
File::Path
Math::Round
List::Compare
GD
GD::Graph
GD::Text
GD::Image
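Before running EBA it can be useful to confirm that every module in this list loads. The following is a minimal sketch in Python (not part of EBA), assuming a perl executable is on the PATH; the function names build_check_command and missing_modules are hypothetical helpers introduced only for this illustration. It asks Perl to load each module with the -M switch and reports those that fail.

```python
import shutil
import subprocess

# Module names copied from the list above.
REQUIRED_MODULES = [
    "File::Path", "Math::Round", "List::Compare",
    "GD", "GD::Graph", "GD::Text", "GD::Image",
]


def build_check_command(module):
    """Return the argv that asks Perl to load one module and exit."""
    return ["perl", f"-M{module}", "-e", "1"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the modules Perl cannot load (all of them if Perl is absent)."""
    if shutil.which("perl") is None:
        return list(modules)
    missing = []
    for module in modules:
        result = subprocess.run(
            build_check_command(module),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            missing.append(module)
    return missing
```

Calling missing_modules() returns an empty list when all the modules are installed; any names it returns are the ones still to be installed by one of the methods described in this section.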
The easiest way to install a Perl module is to use CPAN, the repository of Perl modules. It has an
interactive shell, which is used in the steps below. To make sure the installation works
system-wide, call CPAN with sudo (from any directory).
$ sudo cpan
If this is the first time you are running CPAN, answer yes to all the configuration questions. Then
install the modules listed above with the install command:
cpan$ install File::Path
cpan$ install Math::Round
cpan$ install List::Compare
On Mac OS, follow these steps for GD:
GD is the last module and usually the most laborious to install. Ideally GD should be built with
all of its supporting libraries, which have to be installed first. The steps below cover the most
common dependencies. libjpeg and libpng are usually required by GD; FreeType, zlib and LibGD
are sometimes also required if they are not already available. Start by creating a temporary
working directory for the downloads:
$ mkdir tempDir
$ cd tempDir
Install libjpeg:
tempDir$ curl -O http://www.ijg.org/files/jpegsrc.v8d.tar.gz
tempDir$ tar -xzvf jpegsrc.v8d.tar.gz
tempDir$ cd jpeg-8d
tempDir/jpeg-8d$ ./configure
tempDir/jpeg-8d$ make
tempDir/jpeg-8d$ sudo make install
Install libpng:
tempDir/jpeg-8d$ cd ..
tempDir$ curl -O ftp://ftp.simplesystems.org/pub/libpng/png/src/libpng16/libpng-1.6.2.tar.gz
tempDir$ tar -xzvf libpng-1.6.2.tar.gz
tempDir$ cd libpng-1.6.2
tempDir/libpng-1.6.2$ ./configure
tempDir/libpng-1.6.2$ make
tempDir/libpng-1.6.2$ sudo make install
At this point the required dependencies should be in place. Now download LibGD and check whether the configuration so far is sufficient. You can download the GD library from its website; if the website is down, you can get the source from Bitbucket and install it with the same kind of commands:
tempDir$ curl -O https://bitbucket.org/libgd/gd-libgd/downloads/libgd-2.1.0-rc2.tar.gz
tempDir$ tar -xzvf libgd-2.1.0-rc2.tar.gz
tempDir$ cd libgd-2.1.0-rc2
tempDir/libgd-2.1.0-rc2/src$ ./configure
If everything goes well, the end of the configuration run shows something like this:
** Configuration summary for gd 2.1.0:
Support for PNG library: yes
Support for JPEG library: yes
Support for Freetype 2.x library: yes
Support for Fontconfig library: yes
Support for Xpm library: no
Support for pthreads: yes
If the summary does not report yes for the libraries you need, check FreeType (freetype-2.5.0.1) and Fontconfig on your system; you have to download, configure, make and install them in the same way as libpng and libjpeg. If everything is OK, build and install the library:
tempDir/pierrejoye-gd-libgd-5551f61978e3/src$ make
tempDir/pierrejoye-gd-libgd-5551f61978e3/src$ sudo make install
The last step is to install GD in Perl. Installing it with CPAN usually fails on OS X, but you can try it first in case it works on your system:
$ perl -MCPAN -e 'get GD'
If that is not successful, install it manually. Go to the CPAN website, download the latest Perl GD implementation and install it with commands similar to those above:
tempDir$ curl -O http://www.cpan.org/authors/id/L/LD/LDS/GD-2.49.tar.gz
(if curl fails, copy and paste the URL into your browser)
tempDir$ tar -xzvf GD-2.49.tar.gz
tempDir$ cd GD-2.49
tempDir/GD-2.49$ perl Makefile.PL
tempDir/GD-2.49$ make
tempDir/GD-2.49$ sudo make install
The installation is now complete and you can test the program.
Another easy way to obtain many of these libraries is Fink, which is similar in nature to the CPAN installer but installs common GNU utilities. Fink is available from <http://sourceforge.net/projects/fink/>.
On Linux, follow these steps:
sudo apt-get install libgd2-xpm-dev
sudo cpan GD
sudo cpan GD::Graph
If you need to install an additional module, search the Perl packages with
sudo apt-cache search perl <Module Name>
If the module is available, the search returns the name of the package that provides it; you can then
install that package by name, following the same steps as above.
On Windows, follow these steps:
C:\Documents and Settings\Owner>ppm
PPM interactive shell (2.2.0) - type 'help' for available commands.
PPM> install GD
Install package 'GD?' (y/N): y
Installing package 'GD'...
Downloading http://ppm.ActiveState.com/PPMPackages/5.6plus/MSW. ...
Installing C:\Perl\site\lib\auto\GD\GD.bs
Installing C:\Perl\site\lib\auto\GD\GD.dll
Installing C:\Perl\site\lib\auto\GD\GD.exp
Installing C:\Perl\site\lib\auto\GD\GD.lib
Installing C:\Perl\html\site\lib\GD.html
Installing C:\Perl\site\lib\GD.pm
Installing C:\Perl\site\lib\qd.pl
Installing C:\Perl\site\lib\auto\GD\autosplit.ix
PPM>
You can install all the modules mentioned above using CPAN, or install them manually. Perl module
installation instructions are available at http://www.cpan.org/modules/INSTALL.html. Manual
installation follows the rules described below (see Section 1.3.1).
If you are not very familiar with the command line, you can use the graphical interface of the Perl
Package Manager (PPM) to install the GD module. If you are using ActiveState Perl, type ppm at the
command prompt to open a graphical interface for Perl module installation. Search for GD and
GD-Graph and install them by clicking the install button (Fig. 1.3.1).
Fig. 1.3.1: Perl package manager interface to install GD, GD-Graph and other supporting Perl modules.
Some modules are not available as PPM packages or have reported errors on the PPM build system;
this does not mean they do not work. In ActiveState Perl you can also use CPAN, as in
cpan <Module name>. Either method (PPM or CPAN) can be used to install any supporting Perl
module. More detail is available at http://www.activestate.com/blog/2010/10/how-install-cpan-modules-activeperl.
1.3.1 Install Perl Modules Manually
1.3.1.1 Download Perl module
Go to the CPAN website (http://www.cpan.org/) and search for the module that you wish to
download. In this example, we search for, download and install the GD::Graph Perl module. We have
downloaded GDGraph-1.47.tar.gz to /home/download:
> cd /home/download
> gzip -d GDGraph-1.47.tar.gz
> tar xvf GDGraph-1.47.tar
> cd GDGraph-1.47
1.3.1.2 Build the perl module
> perl Makefile.PL
Checking if your kit is complete...
Looks good
Writing Makefile for GDGraph-1.47.tar.gz
> make
> make test
1.3.1.3 Install the perl module
> make install
Perl modules typically depend on several other modules. Chasing all these dependencies one by one
can be a painful and tedious task, so the manual method is not recommended, although it is useful
when working offline. We recommend the CPAN method of installation, which downloads all
dependencies automatically. Note that the CPAN method works only if the system/server is
connected to the Internet.
1.4 GenBank Taxonomy Database
The EBA program uses the GenBank taxonomy database to classify the studied species and convert
the result into the evolutionary breakpoint classification format (see 2.2.2). Users should therefore
download the latest version of the GenBank taxonomy database from the NCBI server. The
compressed dataset is located at ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz; uncompress it
and move the uncompressed taxdump folder to the EBA root.
> tar -zxvf taxdump.tar.gz
> mv -i taxdump /EBA
By doing all these steps, the installation of EBA is concluded and the package is ready to use (See Chapter
2 for instructions on how to use the script EBA.pl).
1.5 Sample Data
For users who are new to EBA, we provide sample input files in a SAMPLE_DATA folder for testing
purposes. The folder also illustrates the expected file and folder structure.
You can test the installation by running the following command on the command line:
EBA.pl -n 6 -d SAMPLE_DATA -r Gallus_gallus -t 20 -p 100
It generates all the intermediate files and the final results in the ResultFiles folder. If you get any
errors when running EBA on the test data, please contact us at [email protected] or [email protected]
Chapter 2
How to use EBA
2.1 The script EBA.pl
The evolutionary breakpoint framework, EBA, contains one main script called EBA.pl. This script
coordinates the whole process of breakpoint identification and classification. The following
sections describe the mandatory and optional flags used when calling the program.
The script EBA.pl has the following command line:
perl EBA.pl [-n | --number] <number> [-d | --directory] <dir> [-r | --reference] <refName> [-p | --prime] <prime> [-t | --threshold] <number>
It has mandatory parameters and a set of optional parameters.
2.1.1 Mandatory parameters
[-n | --number] <number>: The number of target species in the analysis. The reference species is
not counted. For example, a user working with 7 species (including the reference species) should
provide --number 6 or -n 6.
[-d | --directory] <dir>: All the homologous synteny block (HSB) files are expected to be in a single
data directory (see 2.3). The data directory can have any name, so users can name it according to
their requirements.
[-r | --reference] <refName>: The scientific name of the reference species. The first and second
parts of the name should be separated by a space or an underscore.
[-p | --prime] <prime>: The name of the primary resolution. The resolution name should be
numeric (see 2.3).
[-t | --threshold] <number>: The threshold value used for breakpoint filtration when calculating
reuse breakpoints amongst species. The value is usually 20, but it may vary.
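The rule for -n (total species minus the reference) is easy to get wrong, so here is a small illustrative sketch in Python that assembles the mandatory part of the command line. It is not part of EBA: build_eba_command is a hypothetical helper, the default threshold of 20 simply follows the typical value mentioned above, and replacing spaces in the reference name with underscores is a convenience choice to avoid shell quoting.

```python
def build_eba_command(total_species, data_dir, reference, prime, threshold=20):
    """Assemble the argv for EBA.pl from the five mandatory parameters.

    total_species counts every species in the study, reference included;
    -n receives that count minus one.
    """
    if total_species < 2:
        raise ValueError("need the reference plus at least one target species")
    if not str(prime).isdecimal():
        raise ValueError("the primary resolution name must be numeric")
    return [
        "perl", "EBA.pl",
        "-n", str(total_species - 1),
        "-d", data_dir,
        "-r", reference.replace(" ", "_"),  # underscore form avoids quoting
        "-p", str(prime),
        "-t", str(threshold),
    ]


# Seven species in total (six targets plus the reference).
print(" ".join(build_eba_command(7, "SAMPLE_DATA", "Gallus gallus", 100)))
# perl EBA.pl -n 6 -d SAMPLE_DATA -r Gallus_gallus -p 100 -t 20
```

The returned list can be passed to subprocess.run from the EBA root directory if you prefer to launch the script from Python.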
2.1.2 Optional parameters
[Optional] Add these options if required, or to override the default behaviour.
[-b | --beta]: The beta score is required to calculate the final classification scores. With the -b or
--beta option, EBA calculates the beta score for each species automatically. Without it, users must
provide their own beta scores in the “BetaScore” file at the root (see 2.2.4). Providing your own
scores is useful if you are not satisfied with EBA’s beta scores or do not have enough resolutions
to calculate them.
[-c | --classify]: The phylogenetic relationship of all the species should be provided in evolutionary
breakpoint format in the “classification.eba” file at the root, if it is not calculated using the EBA tool
(see 2.2.2). Users who rely on the GenBank-based taxonomy classification can classify the species
without using the -c or --classify option.
[-i | --increase] <number>: Sometimes the breakpoints need to be enlarged before breakpoint
overlaps are calculated. This parameter increases the breakpoint size by <number> bp on both
sides. For example, to increase the breakpoint size by 1000 bp in total, provide -i 500 or
--increase 500.
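The arithmetic behind the increase parameter can be checked with a short worked example. This Python sketch is only an illustration of the "on both sides" rule; widen_breakpoint is a hypothetical name, and clamping the start at 0 is an assumption, since the manual does not describe how EBA treats intervals near the start of a chromosome.

```python
def widen_breakpoint(start, end, increase):
    """Extend a breakpoint interval by `increase` bp on each side."""
    if increase < 0:
        raise ValueError("increase must be non-negative")
    # Clamping at 0 is an assumption made for this sketch.
    return max(0, start - increase), end + increase


new_start, new_end = widen_breakpoint(10_000, 12_000, 500)
added = (new_end - new_start) - (12_000 - 10_000)
print(new_start, new_end, added)  # 9500 12500 1000
```

With an increase of 500, the 2000 bp interval grows to 3000 bp, that is, by 1000 bp in total.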
[-l | --lineage]: Users who are not interested in lineage-specific breakpoints can provide the -l or
--lineage flag, which excludes the lineage from the final table.
[-x | --exclude]: Users who are not interested in classification groups that are defined by only one
species can exclude those groups with the -x or --exclude parameter.
[-k | --keep]: The EBA tool produces intermediate files that can be used for later verification. To
keep them, use -k or --keep.
[-v | --validate]: If you are using EBA for the first time, it is recommended to switch validation on
with -v or --validate. Leaving it off saves some running time.
[-e | --engrave]: Use this flag to print the breakpoint coordinates of each target species in the final
classification file. By default, EBA prints only the EBR classifications with their corresponding
coordinates.
[-logfile] <filename>: Writes all the command line messages to a single file named <filename>. The
file has the .log extension and can be found in the EBA root directory.
2.2 Input file formats
The input file format is a crucial part of any pipeline, so we have tried to keep the input file
formats as simple as possible. This section explains all the file formats that the EBA framework
accepts as input.
2.2.1 Homologous Synteny Blocks (HSB) files
The HSB files contain information about valid homologous synteny blocks between the reference
(RE) and target (TA) genomes. An HSB file is a TSV (tab-separated values) file that holds the RE and
TA chromosomes, block coordinates, orientation and assembly status. The file should not contain a
header, and only one tab-separated HSB is allowed per row. Each HSB is defined by the following
columns:
Name      Definition
Column1   Reference (RE) name
Column2   Reference (RE) chromosome number
Column3   Start coordinate of the RE HSB
Column4   End coordinate of the RE HSB
Column5   Chromosome name of the target (TA) species
Column6   Start coordinate of the TA HSB
Column7   End coordinate of the TA HSB
Column8   Strand of the block
Column9   Target (TA) species name
Column10  Target (TA) chromosome [Scaffolds or Chromosome]
The HSB file should contain real homologous synteny block information, in which the start
coordinate of each HSB is smaller than its end coordinate. The HSBs of a single species must not
overlap each other.
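The constraints above (ten tab-separated columns, start smaller than end, no overlapping blocks) can be checked before a file is handed to EBA. The Python sketch below is one possible pre-check and not the validation EBA itself performs. The function names are hypothetical, overlaps are tested on the reference coordinates only, and two blocks that merely share a boundary coordinate are treated as non-overlapping; both choices are assumptions.

```python
def parse_hsb_line(line, line_no=0):
    """Split one HSB row into named fields and check its coordinates."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"line {line_no}: expected 10 columns, found {len(fields)}")
    block = {
        "ref_chr": fields[1],
        "ref_start": int(fields[2]),
        "ref_end": int(fields[3]),
        "tgt_chr": fields[4],
        "tgt_start": int(fields[5]),
        "tgt_end": int(fields[6]),
        "strand": fields[7],
        "tgt_species": fields[8],
    }
    if block["ref_start"] >= block["ref_end"] or block["tgt_start"] >= block["tgt_end"]:
        raise ValueError(f"line {line_no}: start must be smaller than end")
    return block


def find_overlaps(blocks):
    """Return adjacent pairs of blocks that overlap on the same reference chromosome.

    After sorting by start coordinate, any overlap in the file shows up
    between at least one pair of neighbours, so an empty result means the
    blocks are non-overlapping.
    """
    ordered = sorted(blocks, key=lambda b: (b["ref_chr"], b["ref_start"]))
    return [
        (prev, cur)
        for prev, cur in zip(ordered, ordered[1:])
        if prev["ref_chr"] == cur["ref_chr"] and cur["ref_start"] < prev["ref_end"]
    ]


def read_hsb_file(path):
    """Parse every non-empty row of an HSB file."""
    with open(path) as handle:
        return [
            parse_hsb_line(line, number)
            for number, line in enumerate(handle, start=1)
            if line.strip()
        ]
```

A typical use is find_overlaps(read_hsb_file(path)), where path points to one HSB file; an empty list means no overlapping blocks were found.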
The TA species name should be the binomial (scientific) name: the first part identifies the genus to
which the species belongs, and the second part identifies the species within the genus. For example,
humans belong to the genus Homo and, within this genus, to the species Homo sapiens. Homo
sapiens should be written in HSB files as homo_sapiens; in other words, the genus and species
names should be separated by an underscore (_). The TA species name is not case sensitive.
The EBA script determines the assembly status of the target genome from the TA chromosome
column (Column10) of the HSB file. If the target genome is not completely assembled into
chromosomes, enter “Scaffolds”; if the TA genome is well assembled and you rely on the assembly,
enter “Chromosome”. The value is not case sensitive, so it can be written in capital, small or mixed
letters. A genome can be assembled into either chromosomes or scaffolds; mixing both values in
the HSB file of a single species is not allowed.
Additionally, the HSB file of each species must be named according to the following pattern:
<NameOfSpecies>_<ResolutionNumber>_<OtherInfo>.txt
For example, the 100 Kb resolution HSB file for cattle (reference) and pig (target) must be named
pig_100_final.txt
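The naming pattern can be checked with a regular expression. The following Python sketch is an illustration only; parse_hsb_filename is a hypothetical helper, and it assumes that the resolution is the first all-digit component between underscores, so that species names containing an underscore (such as homo_sapiens) are still recognised.

```python
import re

# <NameOfSpecies>_<ResolutionNumber>_<OtherInfo>.txt
HSB_NAME = re.compile(r"^(?P<species>.+?)_(?P<resolution>\d+)_(?P<other>.+)\.txt$")


def parse_hsb_filename(filename):
    """Return (species, resolution, other_info) for a well-formed HSB file name."""
    match = HSB_NAME.match(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not follow the HSB naming pattern")
    return match["species"], int(match["resolution"]), match["other"]


print(parse_hsb_filename("pig_100_final.txt"))  # ('pig', 100, 'final')
```

Because EBA also derives the resolution from the folder name (see 2.3), the parsed resolution can be compared with the name of the enclosing folder to catch misplaced files.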
2.2.2 Classification file
The EBA framework needs phylogenetic classification information for all species used in the
current analysis. The EBA tool can either calculate the standard GenBank classification or use your
own classification. If you choose to calculate the classification of the studied species with EBA, it
automatically generates a file in the required format; otherwise you need to provide a classification
file named “classification.eba” at the root of the program. Each line of the classification file has the
following fields:
Name         Definition
GroupName    Name of the order or phylogenetic node.
Separator    Equals sign (=)
SpeciesName  Names of all species that constitute the group.
Note: If the GroupName is Lineage, the SpeciesName should be left blank. Comments can be added to
the classification file; they should start with a hash sign (#).
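A reader for this format might look like the Python sketch below. It is an assumption-laden illustration, not the parser used by EBA: parse_classification is a hypothetical name, and because the manual does not state how several species names are delimited within SpeciesName, the text after the equals sign is kept as one unsplit string.

```python
def parse_classification(lines):
    """Map each group name to the raw species text that follows '='."""
    groups = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, separator, species = line.partition("=")
        if not separator:
            raise ValueError(f"missing '=' separator: {line!r}")
        groups[name.strip()] = species.strip()
    return groups
```

The function accepts any iterable of lines, so an open file handle for classification.eba can be passed directly. A Lineage entry yields an empty string, matching the rule that its SpeciesName is left blank.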
2.2.3 Chromosome size file
The EBA tool uses a chromosome-based approach to calculate classification scores. To do so, it
needs the chromosome sizes of the reference (RE) genome. The RE chromosome size file is a
tab-separated, two-column file in which the first column contains the RE chromosome name and
the second column contains its maximum size:
Name     Definition
Column1  The reference chromosome name
Column2  Maximum size of the reference chromosome
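As an illustration of the two-column layout, the following Python sketch reads chromosome sizes into a dictionary. The helper name parse_chromosome_sizes is hypothetical, and skipping blank lines is an assumption made for convenience.

```python
def parse_chromosome_sizes(lines):
    """Return {chromosome name: size in bp} from tab-separated lines."""
    sizes = {}
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # tolerate blank lines (an assumption)
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, found {len(fields)}")
        name, size = fields
        sizes[name] = int(size)
    return sizes
```

The chromosome names used as keys should match the RE chromosome column of the HSB files.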
2.2.4 BetaScore file
The beta score is the probability of a missed breakpoint at a given resolution. EBA calculates this
score only when datasets for at least three or four resolutions are available for the analysis.
However, EBA can be used with fewer resolutions if you provide your own estimated beta scores. In
our analyses, we generally found an average score of 0.02 for missed breakpoints in each species.
The BetaScore file has the following format:
Name     Definition
Column1  Resolution number of the target species.
Column2  Scientific name of the target species; the first and second parts should be separated by an underscore (_).
Column3  Beta score of the species.
Note: The resolution and the TA name are not separated by a tab; they are joined by a colon (:).
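Read together with the note, each line appears to consist of a resolution:species key, a tab, and the score. The Python sketch below parses lines on that reading; the exact layout is an interpretation of the table and note rather than something confirmed by the EBA source, and parse_beta_scores is a hypothetical helper. The range check reflects the fact that the beta score is a probability.

```python
def parse_beta_scores(lines):
    """Return {(resolution, species): beta} from BetaScore-style lines."""
    scores = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        key, value = line.split("\t")
        resolution, species = key.split(":", 1)
        beta = float(value)
        if not 0.0 <= beta <= 1.0:
            raise ValueError(f"beta score out of range in line: {line!r}")
        scores[(int(resolution), species)] = beta
    return scores
```

For example, a line holding resolution 100, species gallus_gallus and score 0.02 would be stored under the key (100, "gallus_gallus").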
The beta scores of all studied species are plotted for all studied resolutions, and the graph is saved
in the root EBA directory as “betaScore.gif” for visual analysis.
2.3 Folder setting
The data folder, which contains all the HSB files, must be organized in a way that the EBA script
can recognize. Keep the following rules in mind before running EBA on a new dataset.
2.3.1 Resolution Folders
The name of each resolution folder must be a number: the resolution at which its data were
analysed. For example, a user working with four resolutions, namely 50, 100, 300 and 500 Kb, needs
to create one directory named after each resolution number and store all the HSB files of that
resolution in it.
2.3.2 Location of all resolutions folders
All the resolution folders (named after the studied resolutions) should be kept inside the main input
data directory (project folder). This is mandatory because the EBA script looks for the HSB files of
each resolution in a separate folder. EBA determines the resolution of the HSB files from the name
of their folder.
Fig. 3.0: The folder hierarchy for the EBA tool. Each resolution folder name should be numeric
and match the resolution number. Only three resolution folders are shown in the image, but any
number of resolution folders can be used. The executable script EBA.pl should be at the root,
together with mandatory files such as “chr_size.txt” and “classification.eba”. For classification, the
taxonomy database must be unpacked at the root.
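The folder rules can be verified before a run. The Python sketch below lists the numeric sub-folders of a data directory and checks that the primary resolution passed with -p has a folder of its own. It is only an illustration; the three function names are hypothetical and nothing here is taken from the EBA code.

```python
import os


def numeric_names(names):
    """Keep only all-digit names and sort them by numeric value."""
    return sorted((name for name in names if name.isdecimal()), key=int)


def find_resolution_folders(data_dir):
    """List the resolution folders (numeric sub-directories) of data_dir."""
    subdirs = [
        entry for entry in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, entry))
    ]
    return numeric_names(subdirs)


def has_prime_folder(data_dir, prime):
    """Return True if the primary resolution has a folder of its own."""
    return str(prime) in find_resolution_folders(data_dir)
```

For the four-resolution example in 2.3.1, find_resolution_folders would be expected to return the folder names 50, 100, 300 and 500 in that order, ignoring any non-numeric folders.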
Chapter 3
Result and Intermediate Files
3.1 Intermediate files and folder
The EBA tool writes all output inside each resolution directory, in the following directories:
EBA/DataDirectory/ResolutionDirectory/EBA_OutFiles
EBA/DataDirectory/ResolutionDirectory/EBA_ImageFiles
EBA/DataDirectory/ResolutionDirectory/VizEHFiles
EBA/DataDirectory/ResolutionDirectory/ResultFiles
This section describes each of these outputs.
3.1.1 EBA_OutFiles Directory
The EBA_OutFiles directory contains all the intermediate files generated during breakpoint analysis.
Each file name ends with an .eba[N] extension and follows this pattern:
<NameOfSpecies>_*.eba[N]
where [N] is a number from 0 to 8 that indicates the order of the intermediate file.
Note: The EBA_OutFiles folder is deleted automatically when the EBA program completes. To keep
the intermediate folders and files, add the -k or --keep flag when running the program.
3.1.2 EBA_ImageFiles Directory
After breakpoint identification and analysis for all studied species, the script generates image and
data files for future use. These images and their data files are written to the EBA_ImageFiles
directory and follow this naming pattern:
NameOfImage.gif
NameOfImage.data
Each image file has one corresponding data file, which has the same name with the .data
extension.
The EBA_ImageFiles directory contains the following image files:
EBR_classification_fractions_<resolution name>.gif and .data
EBR_density_chromosomes_<resolution name>.gif and .data
Fraction_EBRs_and_gaps_<resolution name>.gif and .data
Number_unique_resuse_EBRs_<resolution name>.gif and .data
Pie_chart_classification_EBRs_<resolution name>.gif and .data
Pie_chart_unclassified_EBRs_<resolution name>.gif and .data
The *.data files contain the underlying data, so any other program can be used to redraw the same
images in a different style.
3.1.2.1 Cumulated Bar Graph
The cumulated bar graph shows the numbers of real and pseudo breakpoints in each studied
species. The definitions and rules for real and pseudo breakpoints are given in the supplementary
section of the paper.
Figure 3.1: Example cumulative bar graph for real and pseudo breakpoint data. The red bar indicates
the number of real breakpoints. The stacked red bar indicates the pseudo breakpoints detected in
that species.
3.1.2.2 Cumulated Stacked Bar Final Graph
The final stacked bar graph represents the total number of EBRs identified for a single classification
group or species.
Figure 3.2: The red bar represents the total number of EBRs found to be reused in that lineage or order,
whereas the green stacked bar denotes the unique EBRs in each species or group. The graph is sorted by
species and group name.
3.1.2.3 Final classification line graph
The final classification for each resolution is represented as a line graph, in which the breakpoints of
each classification group and lineage are plotted against the reference chromosomes. The line graph is
saved in the EBA_ImageFiles folder as “EBR_density_chromosomes_<resolution name>.gif”.
3.1.3 ResultFiles Directory
The ResultFiles directory contains the final result files of the breakpoint analysis. The results are stored
in two different files:
Result_<resolution name>.final and ResultReuse_<resolution name>.final
3.1.3.1 Result_<resolution name>.final file format
This file contains the final results of the process of breakpoint identification and classification for all
breakpoints. It is a TSV file which has the breakpoint coordinates, scores, species, and decisions of all
breakpoints. It has the following columns:
Name | Definition
+Chromosome | Name of the reference (RE) chromosome.
+Widest EBR interval start(bp) | Start coordinate of the widest possible breakpoint interval that overlaps with all studied species.
+Widest EBR interval end(bp) | End coordinate of the widest possible breakpoint interval that overlaps with all studied species.
+Narrowest EBR interval start(bp) | Start coordinate of the narrowest breakpoint interval, calculated from all the narrowest overlapping breakpoint regions.
+Narrowest EBR interval end(bp) | End coordinate of the narrowest breakpoint interval, calculated from all the narrowest overlapping breakpoint regions.
+Species containing the EBR | Scientific names of all the species that overlap at the given reference breakpoint region. The species names are separated by a colon (:).
+Classification group:assignment score | Classification scores for all possible order and lineage groups. Each order/lineage name is separated from its score by a colon (:), while the classification groups themselves are separated by spaces within the column.
+Highest probability classification | Name of the most probable breakpoint classification with its corresponding score.
+Second highest probability classification | Second-best breakpoint classification. The classification name and score are separated by a colon (:).
+Ratio between the first and second probabilities | Ratio between the first and second breakpoint classification scores.
+No. species in the dataset | Total number of studied species.
+No. species with gap | Number of species with a gap overlapping the breakpoint region.
+No. species with the EBR | Total number of species with a breakpoint overlapping the specific RE break region.
Percentage informative species | Percentage of species used in the classification.
Error probability | Error probability for each classification.
Note: Species names may appear in this file if the -e|--engrave flag is given on the command line when
running the program. The species names are dynamic and depend on the number of species used.
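As an illustration of how this file might be read downstream, here is a minimal Python sketch; it is not part of EBA. It assumes that the on-disk column order matches the table above, that the file is tab-separated with no header line, and that any extra columns (for example those added by the engrave flag) come last. The names `RESULT_COLUMNS`, `parse_group_scores`, and `read_result_rows`, and the values in the example row, are all made up.

```python
import csv
import io

# Column keys in the order the table above lists them (assumed on-disk order).
RESULT_COLUMNS = [
    "chromosome", "widest_start", "widest_end", "narrowest_start", "narrowest_end",
    "species", "group_scores", "best_classification", "second_classification",
    "ratio", "n_species", "n_species_gap", "n_species_ebr",
    "pct_informative", "error_probability",
]

def parse_group_scores(field):
    """Split 'GroupA:0.9 GroupB:0.1' into {'GroupA': 0.9, 'GroupB': 0.1}."""
    scores = {}
    for pair in field.split():
        name, _, score = pair.rpartition(":")
        scores[name] = float(score)
    return scores

def read_result_rows(handle):
    """Read tab-separated rows into dictionaries keyed by RESULT_COLUMNS."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) < len(RESULT_COLUMNS):
            continue  # skip blank or incomplete lines
        row = dict(zip(RESULT_COLUMNS, fields))
        row["species"] = row["species"].split(":")
        row["group_scores"] = parse_group_scores(row["group_scores"])
        row["extra"] = fields[len(RESULT_COLUMNS):]  # any additional columns
        rows.append(row)
    return rows

# Made-up example row, for illustration only.
example = (
    "chr1\t100\t900\t300\t500\tBos taurus:Sus scrofa\t"
    "Cetartiodactyla:0.9 Ruminantia:0.1\tCetartiodactyla:0.9\tRuminantia:0.1\t"
    "9\t10\t1\t2\t90\t0.01\n"
)
rows = read_result_rows(io.StringIO(example))
```

If the real file does start with a header line, that line would need to be skipped before the rows are parsed.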
3.1.3.2 ResultReuse_<res>.final file format
The occurrences of reuse breakpoints amongst species are calculated from the final classification file. The
reuse file stores the final classification and the reuse, uncertain, and unique breakpoint information along
with the scores. The format of the reuse file is as follows:
Name | Definition
+Reference genome name | Name of the reference species. It can be modified according to user requirements.
+Reference chromosome | Chromosome name of the reference species.
+Final classification | Final classification name of the RE breakpoint region.
+Narrowest EBR interval start(bp) | Start coordinate of the reference breakpoint.
+Narrowest EBR interval end(bp) | End coordinate of the reference breakpoint.
+Classification | The breakpoint classification decision, which can be Unique, Reuse, or Uncertain.
+Scores | Classification score for the most probable breakpoint.
+Ratio between the first and second classification probabilities | Ratio between the first and second best classification scores for a single breakpoint region.
+No. species in the dataset | Total number of species used in breakpoint classification and analysis.
+No. species with gap | Total number of gap regions.
+No. species with the EBR | Number of breakpoints overlapping at the specific RE breakpoint region.
+Percentage informative species | Total percentage of species used to calculate the breakpoint scores.
+Ratio between the first and second classification probabilities adjusted for reuse EBRs | New ratio calculated for reuse cases using a certain set of rules.
Note: The reuse, unique, and uncertain categories are discussed in the supplementary information file.
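A quick tally of the Unique, Reuse, and Uncertain decisions can be a useful sanity check on this file. The following Python sketch is an illustration only and is not part of EBA. It assumes a tab-separated file without a header line whose column order matches the table above; `CLASSIFICATION_COLUMN`, `count_decisions`, and the example lines are hypothetical.

```python
from collections import Counter

# 0-based position of the "Classification" column, assuming the on-disk
# column order matches the table above (not confirmed by the manual).
CLASSIFICATION_COLUMN = 5

def count_decisions(lines):
    """Count how often each classification decision occurs."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > CLASSIFICATION_COLUMN:
            counts[fields[CLASSIFICATION_COLUMN]] += 1
    return counts

# Made-up lines, truncated after the "Scores" column for brevity.
example_lines = [
    "Homo sapiens\tchr1\tGroupA\t300\t500\tUnique\t0.9\n",
    "Homo sapiens\tchr1\tGroupB\t800\t950\tReuse\t0.7\n",
    "Homo sapiens\tchr2\tGroupA\t100\t200\tUnique\t0.8\n",
]
decision_counts = count_decisions(example_lines)
```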
3.1.4 Viz_EHFiles Directory
Evolution Highway (EH)-friendly visualization files are generated so that the results can be confirmed
visually. All the files in this folder are saved under their species name. The files follow this naming
pattern:
<species_name>_viz_table.viz
Each file contains the breakpoint overlap information for all the species with respect to the reference
genome. The format of the *.viz files is as follows:
Name | Definition
+Column1 | Scientific name of the target (TA) species.
+Column2 | Chromosome name of the reference (RE) species.
+Column3 | Number of species with an overlapping breakpoint.
+Column4 | RE start coordinate.
+Column5 | RE end coordinate.
Note: These files are useful for looking through species-specific breakpoint regions and checking the
overlapping species.
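The five-column layout maps naturally onto a small record type. The Python sketch below is illustrative only and is not part of EBA; it assumes the columns are tab-separated (the manual does not state the delimiter), and `VizRecord`, `parse_viz_line`, and the example line are hypothetical.

```python
from typing import NamedTuple

class VizRecord(NamedTuple):
    target_species: str          # Column1
    ref_chromosome: str          # Column2
    n_overlapping_species: int   # Column3
    ref_start: int               # Column4
    ref_end: int                 # Column5

def parse_viz_line(line, sep="\t"):
    """Parse one line of a *.viz file into a VizRecord."""
    species, chrom, n, start, end = line.rstrip("\n").split(sep)
    return VizRecord(species, chrom, int(n), int(start), int(end))

# Made-up line, for illustration only.
record = parse_viz_line("Bos taurus\tchr1\t3\t1000\t2000\n")
```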
Chapter 4
Merging all resolutions
4.1 Merging resolution with prime
All the studied resolutions are merged to create a new hypothetical resolution. The merge uses a
user-defined prime resolution (see 2.1.1) and creates a “Result_Merge.final” file in the root folder. The
EBA script takes the prime resolution and adds new EBR information to the already existing break
regions by checking the upper and lower resolutions. New EBR information (which includes break
coordinates and gap info) is added only if the specific break region is present in at least two different
resolutions. The file format of the “Result_Merge.final” file is the same as that of the
Result_<resolution name>.final file (see 3.1.3.1).
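The "present in at least two different resolutions" rule can be illustrated with a small interval-overlap check. This is a simplified Python sketch and not EBA's actual merging code: it ignores the prime resolution and gap information, treats a break region as a (chromosome, start, end) tuple, and counts two regions as the same when they share at least one base. The names `overlaps` and `supported_regions` and the example data are hypothetical.

```python
def overlaps(a, b):
    """True if regions a and b are on the same chromosome and share a base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def supported_regions(resolutions, min_support=2):
    """Return regions seen in at least `min_support` resolutions.

    `resolutions` maps a resolution name to a list of (chrom, start, end).
    A region's own resolution counts towards its support.
    """
    kept = []
    for regions in resolutions.values():
        for region in regions:
            support = sum(
                1 for other in resolutions.values()
                if any(overlaps(region, o) for o in other)
            )
            # Keep the first representative of each overlapping cluster.
            if support >= min_support and not any(overlaps(region, k) for k in kept):
                kept.append(region)
    return kept

# Made-up data: three resolutions, one region shared by two of them.
example = {
    "100": [("chr1", 100, 200), ("chr2", 50, 60)],
    "300": [("chr1", 150, 250)],
    "500": [],
}
merged = supported_regions(example)
```

In this example only the chr1 region is kept, because the chr2 region appears in a single resolution.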
4.2 Final merged reuse
A final “ResultResue.final” file is created, which contains the unique, uncertain, and reuse breakpoint
information. The file format is the same as that of the ResultReuse_<resolution name>.final files (see 3.1.3.2).
4.3 Final merged graph
The merge folder also contains the same kinds of graphs as each resolution, plus one additional graph
named “Unique_reuse_EBRs_all_resolutions.gif”.
The final data, with all breakpoint classification information, are shown graphically in
“Unique_reuse_EBRs_all_resolutions.gif”.
Figure 4.1: The red bar represents the total number of EBRs found to be reused in a specific lineage or
order, whereas the green stacked bar denotes the unique EBRs in each species or group. The graph is
sorted by species and group name.
4.4 Final merged classification pie chart
The final EBR classification is plotted as a pie chart named “EBR_classification_fractions.gif”, in which
uncertain EBRs are shown in red, reused EBRs in green, and unique EBRs in cyan.
4.5.3 Final merged breakpoint line graph
The final breakpoint classification data are represented as a line graph, which shows the number of
breakpoints per chromosome for a specific classification group or lineage. The image file is named
“EBR_denisity_chromosomes.gif”.
4.5 Other graphs
4.5.1 Breakpoint line graph
The number of breakpoints in each species for all resolutions is plotted as a line graph, which displays
the number of breakpoints in each species with respect to the reference chromosomes.
Figure 4.2: The number of breakpoint occurrences in each species, represented as a line graph. The
Y axis represents the rate of breakpoint occurrences irrespective of chromosome size, and the X axis
shows the names of the reference chromosomes.
4.5.2 Breakpoint histogram
The comparative breakpoint numbers are calculated for all studied resolutions, which is useful for
comparative visual analysis of the breakpoints in each species at the studied resolutions. This histogram is
named “No_unclassified_EBRs_per_resolution.gif”.
Figure 4.3: The number of breakpoints present in each species at all studied resolutions (three
resolutions in the case above). The Y axis represents the number of breakpoints, whereas the X axis
represents the studied resolutions.
Chapter 5
Points to remember
Some important points about the EBA framework:
1. The list of homologous synteny blocks (HSBs) for a single species should not contain duplicates in the
HSB files. In addition, the end coordinate of each HSB should always be greater than its start coordinate.
2. Double-check the GenBank taxonomy classification file, as the taxonomy of some species is still
doubtful and needs improvement or human intervention. Moreover, users who modify the
“classification.eba” file should type species names carefully; any duplication or typo can cause errors in
the final EBR classification.
3. If users do not have sufficient resolution to calculate the betaScore using the EBA script, they are
advised to use the default score of 0.02 in the betaScore file, which implies a 2% chance that a breakpoint
is missed in a given species (see 2.2.4).
4. The EBA script uses a user-defined threshold value [-t | --threshold] (see 2.1.2) to calculate reuse
breakpoints amongst species and groups. We strongly recommend a threshold value of 20, but users
are advised to check for themselves which threshold suits their input data and species taxonomy.
5. Species names should always be scientific names when GenBank taxonomy data are used for EBA
classification.
6. Mandatory parameters:
[-n | --number] <number> provide the number of species that you are going to compare.
[-d | --directory] <dir> name of the directory containing the list of HSBs.
[-r | --reference] <refname> name of the reference species [must be a scientific name].
[-p | --prime] <prime> provide the primary resolution name [the resolution name should be numeric].
[-t | --threshold] <number> threshold value for reuse breakpoint filtration.
Try -h for more detail.
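The two HSB checks in point 1 (no duplicated blocks, end coordinate greater than start) are easy to automate before running EBA. The following Python sketch is an illustration only and is not part of EBA. Since the HSB file layout is described elsewhere in this manual, the sketch works on already-parsed (chromosome, start, end) tuples; the name `check_hsbs` and the example blocks are hypothetical.

```python
def check_hsbs(hsbs):
    """Return a list of problems found in (chrom, start, end) tuples."""
    problems = []
    seen = set()
    for hsb in hsbs:
        chrom, start, end = hsb
        if end <= start:
            problems.append(f"end not greater than start: {hsb}")
        if hsb in seen:
            problems.append(f"duplicate block: {hsb}")
        seen.add(hsb)
    return problems

# Made-up blocks: one duplicate and one with reversed coordinates.
example_hsbs = [("chr1", 1, 100), ("chr1", 1, 100), ("chr2", 500, 400)]
hsb_problems = check_hsbs(example_hsbs)
```

An empty list from `check_hsbs` means that neither problem was found in the blocks supplied.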