BioPerl – An Overview

Gloria Rendon

April 2006

University of Illinois at Urbana-Champaign

BioPerl Overview

• BioPerl is a…
– Biology toolkit of modules for Bioinformatics, Genetics, Life Sciences
– Framework to do Computational Biology
– Object-oriented flavor of Perl plus an extensive Bioinformatics library
– Collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications

• BioPerl is NOT…
– A set of ready-to-use programs, like many commercial packages and free web-based interfaces
– A suitable language for all aspects of Computational Biology; it is not suited to high-precision, fast, intensive numeric data analysis (e.g. simulations, modeling, probabilities)
– A strongly typed language, which means minimal time is spent on tasks such as error-checking and consistency of the data
– A visually oriented language; it has poor GUI capabilities for code development

A very brief History

• BioPerl, the open source group of volunteers dedicated to the development of this toolkit, is 10 years old
• BioPerl, the “stable” core toolkit, is four years old (release date 2002); it contained modules for sequence manipulation, for accessing databases in a range of data formats, and for running and parsing the results of various molecular biology programs including Blast, clustalw, TCoffee, genscan, ESTscan and HMMER
• Latest version is 1.4, release date 2003; it contains core and extensions, additional libraries, repositories, compiled programs
• Future release is 1.5, release date not yet set; it is planned to contain GUI capabilities, persistence capabilities, client-server CORBA-compliant capabilities, and process pipelining capabilities

Software Requirements

• Core BioPerl requires Perl (version 5.0 or above) to be already installed
• Minimal and complete versions of BioPerl exist; they are found online, packaged as “bundles”, and both include the Core package
• Custom installation: you can pick and choose which modules to install; among the most commonly downloaded ones are:
– For accessing remote databases: File-Temp-0.09 and IO-String-1.01
– For accessing Ace databases: AcePerl-1.68
– For remote Blast searches: libwww-perl-5.48, Digest-MD5-2.12, HTML-Parser-3.13, libnet-1.073, MIME-Base64-2.11, URI-1.09, IO-String-1.216
– For XML parsing: libxml-perl-2.30, XML-Twig-2.02, Soap-Lite-0.52, XML-DOM-1.37, expat-1.95.1
• Even though developers strive to produce independent modules, interdependencies are sometimes unavoidable, so make sure you have installed all the necessary modules on the host system. For more current and additional information on external modules required by bioperl, check http://bioperl.org/Core/external.shtml

Software Requirements…

• Bioperl also uses several C programs for sequence alignment and local blast searching. To use these features of bioperl you will need an ANSI C or GNU C compiler as well as the actual programs, available from sources such as:
– for Smith-Waterman alignments: bioperl-ext-0.6 from http://bioperl.org/Core/external.shtml
– for clustalw alignments: ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalW/
– for tcoffee alignments: http://igs-server.cnrsmrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.html
– for local blast searching: ftp://ftp.ncbi.nlm.nih.gov/blast/server/current_release/
– for EMBOSS applications: http://www.hgmp.mrc.ac.uk/Software/EMBOSS/download.html

Installation

• Locate the package(s) on the network at: http://search.cpan.org/
• Download
• Decompress and unpack the file archive
• Create a makefile (perl Makefile.PL)
• Run “make”, “make test”, and “make install” for every module
• The CPAN module can also be used to install all of the modules in a single step as a “bundle” of modules, Bundle::BioPerl, e.g.

    $> perl -MCPAN -e shell
    cpan> install Bundle::BioPerl
    <installation details....>
    cpan> install B/BI/BIRNEY/bioperl-1.0.tar.gz
    <installation details....>
    cpan> quit

• The process described above is for the UNIX OS.
• The minimal package should also work under NT, Windows and Mac OS X; however, it has not been widely tested.
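
After installing, it can help to confirm which modules Perl can actually find. The following is a minimal sketch, not part of the original slides: module_version is a hypothetical helper name, and the list of modules checked is only an example.

```perl
use strict;
use warnings;

# Hypothetical helper: try to load a module at run time.
# Returns its version string, 'unknown' if it loads but sets no $VERSION,
# or undef if it cannot be loaded at all.
sub module_version {
    my ($module) = @_;
    ( my $file = "$module.pm" ) =~ s{::}{/}g;    # Bio::Seq -> Bio/Seq.pm
    return unless eval { require $file; 1 };
    my $version = $module->VERSION;
    return defined $version ? "$version" : 'unknown';
}

# Example list only; add whichever modules your scripts depend on.
for my $module (qw(Bio::Seq Bio::SeqIO IO::String File::Temp)) {
    my $version = module_version($module);
    printf "%-12s %s\n", $module,
        defined $version ? "installed (version $version)" : 'NOT installed';
}
```

A module reported as NOT installed is either missing or outside Perl's module search path (@INC).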

BioPerl, the Core toolkit

• Much of bioperl is focused on sequence manipulation.
– Accessing sequence data from local and remote databases
– Transforming formats of database/file records
– Manipulating individual sequences
– Searching for "similar" sequences
– Creating and manipulating sequence alignments
– Searching for genes and other structures on genomic DNA
– Developing machine readable sequence annotations

A one-minute crash course on Object-Oriented Languages

• A bundle is a collection of modules that could be related
• Each module is composed of one or more classes
• A class is the blueprint of an object
• An object contains a data part and a methods part.
• Data is visible to the object only (encapsulation).
• The methods act on the data and can be private or public.
• Objects interact with each other through method invocation only.
• Inheritance is one of the many relationships that can exist between classes, i.e. the ISA relationship
• Other common relationships used in Bioperl:
– xxxxx and xxxxxIO, where the latter class is the IO wrapper of the former class xxxxx
– xxxxx and xxxxxI, where the latter class is the Interface of the former class xxxxx
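
The ideas above can be sketched in plain Perl, without BioPerl. The class names Sequence and DnaSequence are made up for this illustration and are not BioPerl classes. Note that in Perl, encapsulation of hash-based objects is a convention (callers use the accessor methods) rather than something the language enforces.

```perl
use strict;
use warnings;

# A class is a package; an object is a blessed reference holding the data.
package Sequence;

sub new {
    my ( $class, %args ) = @_;
    my $self = { id => $args{id}, seq => uc $args{seq} };
    return bless $self, $class;
}

# Public accessor methods: callers use these instead of the hash itself.
sub id         { $_[0]{id} }
sub seq        { $_[0]{seq} }
sub seq_length { length $_[0]{seq} }

# Inheritance: a DnaSequence ISA Sequence and adds one method of its own.
package DnaSequence;
our @ISA = ('Sequence');

sub revcom {
    my ($self) = @_;
    ( my $rc = reverse $self->seq ) =~ tr/ACGT/TGCA/;
    return $rc;
}

package main;

my $dna = DnaSequence->new( id => 'demo', seq => 'atgc' );
print join( ' ', $dna->id, $dna->seq, $dna->seq_length, $dna->revcom ), "\n";
print "a DnaSequence is a Sequence\n" if $dna->isa('Sequence');
```

The new, id, seq and seq_length methods are inherited from Sequence; only revcom is defined in the subclass.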

BioPerl Class Diagram

Source: http://bioperl.org/wiki/Class_Diagram

A closer look at the class diagram…

• Bio::DB
• Bio::Seq
• Bio::Index
• Bio::Align
• Bio::Search
• Bio::Graphics
• Bio::Biblio
• Bio::Structure
• Bio::Variation
• Bio::LiveSeq

All class diagrams shown here follow UML conventions

Documentation concerns

• Few class diagrams have been published

• Even fewer dataflow diagrams exist

• But, every single class in Bioperl has a POD

• Last resort: look at the code itself.

[Remember, Bioperl is open source]

The Class POD

• Source: actual code
• Download: link to repos.
• Name: class name
• Synopsis: usage
• Description: textual
• Appendix: list of methods
• Author
• Contributors
• Feedback

Sequence classes in BioPerl

Sequence in BioPerl

• Seq is the central sequence object in bioperl. Most common sequence manipulations can be performed with Seq.
• RichSeq objects store additional annotations beyond those used by standard Seq objects.
• SeqWithQuality objects are used to manipulate sequences with quality data, like those produced by phred.
• PrimarySeq object is basically a “stripped down” version of Seq.
• LocatableSeq object might be more appropriately called an “AlignedSeq” object. It is a Seq object which is part of a multiple sequence alignment. It has “start” and “end” positions indicating from where in a larger sequence it may have been extracted.
• LargeSeq object is a special type of Seq object used for handling very long (e.g. > 100 MB) sequences.
• LiveSeq addresses the problem of features whose location on a sequence changes over time. This can happen, for example, when sequence feature objects are used to store gene locations on newly sequenced genomes: locations which can change as higher quality sequencing data becomes available.
• SeqI objects are Seq “interface objects” (see section II.4 and Bio). They are used to ensure bioperl’s compatibility with other software packages.

Sequence Creation, Retrieval and Access

1. de novo:

    use Bio::Seq;
    my $seq1 = Bio::Seq->new( -seq  => 'ATGAGTAGTAGTAAAGGTA',
                              -id   => 'my seq',
                              -desc => 'this is a new Seq' );

2. from a file:

    use Bio::SeqIO;

    # through file IO functions
    my $seqin = Bio::SeqIO->new( -file   => 'seq.fasta',
                                 -format => 'fasta' );
    my $seq3 = $seqin->next_seq();

    # through file handles
    my $inseq  = Bio::SeqIO->newFh( -file   => '<seqs.sp',
                                    -format => 'swiss' );
    my $outseq = Bio::SeqIO->newFh( -file   => '>seqs.fasta',
                                    -format => 'fasta' );
    print $outseq $_ while <$inseq>;

3. from a remote database

    use Bio::DB::GenBank;
    my $gb = Bio::DB::GenBank->new();

    # each of these two calls returns a Seq object
    my $seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
    my $seq2 = $gb->get_Seq_by_acc('AF303112');

    # this call returns a SeqIO object
    my $seqio = $gb->get_Stream_by_batch( [ qw(J00522 AF303112 2981014) ] );

Bioperl supports sequence data retrieval from the Genbank, Genpept, RefSeq, Swissprot, and EMBL databases.


4. from a local database

– Before accessing sequences from local sequence data files, they have to be made Bioperl-readable by indexing them with Bio::Index or Bio::DB::Fasta.
– The following sequence data formats are supported by Bio::Index: Genbank, Swissprot, Pfam, EMBL and Fasta.
– Once the set of sequences has been indexed using Bio::Index, individual sequences can be accessed using syntax very similar to that described above for accessing remote databases.

    use Bio::Index::Fasta;                 # using fasta file format

    # step 1: build the index (arguments: index file name, then fasta files)
    my $Index_File_Name = shift;
    my $inx = Bio::Index::Fasta->new( -filename   => $Index_File_Name,
                                      -write_flag => 1 );
    $inx->make_index(@ARGV);

    # step 2: fetch by sequence id (ids, not file names)
    my @ids = ();                          # fill in the ids to retrieve
    foreach my $id (@ids) {
        my $seq = $inx->fetch($id);        # returns a Bio::Seq object
    }

5. format conversion with SeqIO

– SeqIO can read a stream of sequences, located in a single file or in multiple files, in a number of formats: Fasta, EMBL, GenBank, Swissprot, PIR, GCG, SCF, phd/phred, Ace, or raw (plain sequence).
– Once the sequence data has been read in with SeqIO, it is available to bioperl in the form of Seq objects.
– Moreover, the Seq objects can then be written to another file (again using SeqIO) in any of the supported data formats, making data converters simple to implement, for example:

    use Bio::SeqIO;
    my $in  = Bio::SeqIO->new( -file   => "inputfilename",
                               -format => 'Fasta' );
    my $out = Bio::SeqIO->new( -file   => ">outputfilename",
                               -format => 'EMBL' );
    while ( my $seq = $in->next_seq() ) {
        $out->write_seq($seq);
    }

Yet another view of the Seq class

Bioperl Features [ex: XML tags Bioperl features]

Sequences and Annotations

Annotations, de novo

    use Bio::SeqFeature::Generic;
    use Bio::SeqIO;

    my $in  = Bio::SeqIO->newFh( -file => $ARGV[0] );
    my $out = Bio::SeqIO->newFh();

    my $seq  = <$in>;
    my $feat = Bio::SeqFeature::Generic->new(
        -start   => 10,
        -end     => 100,
        -strand  => -1,
        -primary => 'repeat',
        -source  => 'repeatmasker',
        -score   => 1000,
        -tag     => {
            new      => 1,
            author   => 'someone',
            sillytag => 'this is silly!',
        },
    );

    $seq->add_SeqFeature($feat);

    print $out $seq;

Sequences and Locations

    use Bio::Location::Fuzzy;
    my $fuzzylocation = Bio::Location::Fuzzy->new( -start    => '<30',
                                                   -end      => 90,
                                                   -loc_type => '.' );

• A Location object is like an index within a range.
• It is designed to be associated with a SeqFeature object to indicate where on a larger structure (e.g. a chromosome or contig) the feature can be found.
• It was implemented as a separate object, rather than as a simple index on a range, because
– Some objects have multiple locations or sub-locations (e.g. a gene’s exons may have multiple start and stop locations)
– In unfinished genomes, the precise locations of features are not known with certainty.
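
The first reason (multiple sub-locations) can be sketched with a split location. This example is not from the original slides; the exon coordinates and the 'CDS' tag are made up for illustration, and it assumes BioPerl is installed.

```perl
use strict;
use warnings;
use Bio::Location::Simple;
use Bio::Location::Split;
use Bio::SeqFeature::Generic;

# A gene with two exons, modeled as one split location.
# Coordinates are invented for this illustration.
my $exons = Bio::Location::Split->new();
$exons->add_sub_Location(
    Bio::Location::Simple->new( -start => 100, -end => 250, -strand => 1 ) );
$exons->add_sub_Location(
    Bio::Location::Simple->new( -start => 400, -end => 520, -strand => 1 ) );

# Attach the location to a feature.
my $gene = Bio::SeqFeature::Generic->new( -primary_tag => 'CDS' );
$gene->location($exons);

# Walk the sub-locations, then print the whole location in
# feature-table notation.
for my $exon ( $gene->location->sub_Location ) {
    printf "exon: %d..%d\n", $exon->start, $exon->end;
}
print 'feature-table string: ', $gene->location->to_FTstring, "\n";
```

A single start/end pair could not represent this feature, which is why the location is an object in its own right.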

Other commonly used Seq methods

- The following methods return string values:

    $seqobj->desc();               # a description of the sequence
    $seqobj->display_id();         # the human-readable id of the sequence
    $seqobj->seq();                # string of sequence
    $seqobj->subseq(5,10);         # part of the sequence as a string
    $seqobj->accession_number();   # when there, the accession number
    $seqobj->alphabet();           # one of 'dna', 'rna', 'protein'
    $seqobj->primary_id();         # a unique id for this sequence

- The following methods return an array of Bio::SeqFeature objects:

    $seqobj->top_SeqFeatures;      # the 'top level' sequence features
    $seqobj->all_SeqFeatures;      # all sequence features

- The following methods return new sequence objects, but do not transfer features across:

    $seqobj->trunc(5,10);          # truncation from 5 to 10 as new object
    $seqobj->revcom;               # reverse complements sequence
    $seqobj->translate;            # translation of a DNA sequence from start/end
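
A short sketch of how the string-returning and object-returning methods differ in use. It reuses the sequence from the de novo example; the id 'demo_seq' is made up, and BioPerl is assumed to be installed.

```perl
use strict;
use warnings;
use Bio::Seq;

my $seqobj = Bio::Seq->new( -seq      => 'ATGAGTAGTAGTAAAGGTA',
                            -id       => 'demo_seq',
                            -alphabet => 'dna' );

# length and subseq return plain values
print 'length:    ', $seqobj->length,          "\n";
print 'subseq:    ', $seqobj->subseq( 5, 10 ), "\n";

# trunc, revcom and translate each return a new sequence object,
# so ->seq is called on the result to get at the string
print 'trunc:     ', $seqobj->trunc( 5, 10 )->seq, "\n";
print 'revcom:    ', $seqobj->revcom->seq,         "\n";
print 'translate: ', $seqobj->translate->seq,      "\n";
```

Because the results of revcom and translate are objects, calls can be chained, e.g. $seqobj->revcom->translate.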

Basic sequence statistics

• SeqStats object provides methods for obtaining the molecular weight of the sequence as well as the number of occurrences of each of the component residues (bases for a nucleic acid or amino acids for a protein). For nucleic acids, SeqStats also returns counts of the number of codons used. For example:

    use Bio::Tools::SeqStats;
    my $seq_stats   = Bio::Tools::SeqStats->new( -seq => $seqobj );
    my $weight      = $seq_stats->get_mol_wt();      # ref to [lower, upper] bounds
    my $monomer_ref = $seq_stats->count_monomers();  # hash ref: residue => count
    my $codon_ref   = $seq_stats->count_codons();    # for DNA sequence

• The SeqWords object is similar to SeqStats and provides methods for calculating frequencies of “words” (e.g. tetramers or hexamers)
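
A minimal SeqWords sketch, assuming BioPerl is installed. The short sequence and the word length of 3 are chosen only to keep the output small; they are not from the original slides.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::Tools::SeqWords;

my $seqobj   = Bio::Seq->new( -seq => 'ATGAGTAGTAGT', -alphabet => 'dna' );
my $seq_word = Bio::Tools::SeqWords->new( -seq => $seqobj );

# count_words() steps through the sequence in non-overlapping windows
# and returns a hash reference: word => number of occurrences
my $counts = $seq_word->count_words(3);

for my $word ( sort keys %$counts ) {
    print "$word\t$counts->{$word}\n";
}
```

For tetramers or hexamers, pass 4 or 6 as the word length.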

More on Format conversion with AlignIO

• AlignIO is the bioperl object for data conversion of alignment files.
• AlignIO is patterned on the SeqIO object and shares most of SeqIO’s features.
• AlignIO currently supports INPUT in the following formats: fasta, mase, stockholm, prodom, selex, bl2seq, clustalw, msf/gcg, water (from EMBOSS, see III.3.6), needle (from EMBOSS, see III.3.6)
• AlignIO supports OUTPUT in these formats: fasta, mase, selex, clustalw, msf/gcg.
• One significant difference between AlignIO and SeqIO is that AlignIO handles IO for only a single alignment at a time (SeqIO.pm handles IO for multiple sequences in a single stream). Syntax for AlignIO is almost identical to that of SeqIO:

    use Bio::AlignIO;
    my $in  = Bio::AlignIO->new( -file   => "inputfilename",
                                 -format => 'fasta' );
    my $out = Bio::AlignIO->new( -file   => ">outputfilename",
                                 -format => 'pfam' );
    while ( my $aln = $in->next_aln() ) {
        $out->write_aln($aln);
    }

• The only difference is that here, the returned object reference, $aln, is to a SimpleAlign object rather than a Seq object.
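
To see what a SimpleAlign object holds, one can also be built by hand. This is a sketch with two made-up aligned sequences, assuming BioPerl is installed; in practice the object would normally come from next_aln() as above.

```perl
use strict;
use warnings;
use Bio::LocatableSeq;
use Bio::SimpleAlign;

# Two invented aligned sequences; '-' marks a gap.
# -end counts residues only, so the gapped sequence ends at 5.
my $aln = Bio::SimpleAlign->new();
$aln->add_seq( Bio::LocatableSeq->new(
    -id => 'seq1', -seq => 'ATG-CA', -start => 1, -end => 5 ) );
$aln->add_seq( Bio::LocatableSeq->new(
    -id => 'seq2', -seq => 'ATGGCA', -start => 1, -end => 6 ) );

# The alignment length is the number of columns, gaps included.
print 'alignment length: ', $aln->length, "\n";

# Each member of the alignment is a LocatableSeq object.
for my $seq ( $aln->each_seq ) {
    printf "%-6s %s (%d..%d)\n",
        $seq->display_id, $seq->seq, $seq->start, $seq->end;
}
```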

SimpleAlign

• The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions (i.e. properties).
• In order to filter columns by properties, you have to extract the columns yourself, filter them and reconstruct the new sequences.
• The following example filters gap columns.

    use strict;
    use Bio::AlignIO;

    my $in  = Bio::AlignIO->new( -file => $ARGV[0], -format => 'clustalw' );
    my $out = Bio::AlignIO->newFh( -fh => \*STDOUT, -format => 'clustalw' );
    my $aln = $in->next_aln();

    # create a list containing all columns
    my @aln_cols = ();
    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split( "", $seq->seq() ) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1/ we create a list containing all the columns without any gap
    my $gapchar     = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col (@aln_cols) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2/ now we replace the old gapped list with the new ungapped one
    my @seq_strs = ();
    foreach my $col (@no_gap_cols) {
        my $colnr = 0;
        foreach my $chr ( split( "", $col ) ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }
    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq( shift @seq_strs );
    }
    print $out $aln;

Search and Analysis of Similar Sequences

• Bioperl offers a number of modules to facilitate running Blast, both locally and remotely, as well as to parse the often voluminous reports produced by Blast.

• Note, Bioperl itself does not have an internal library for running Blast; instead, it calls the necessary program and then manipulates its results internally

StandAloneBlast

• The module Bio::Tools::Run::StandAloneBlast offers the ability to wrap local calls to blast from within perl.

• All of the currently available options of NCBI Blast (eg PSIBLAST, PHIBLAST, bl2seq) are available from within the bioperl StandAloneBlast interface.

• Of course, to use StandAloneBlast, one needs to have NCBI blast installed locally, as well as one or more blast-readable databases.

• Basic usage of the StandAloneBlast.pm module is simple. Initially, a local blast “factory object” is created, then the supported blast executables can be run.

    # local BLAST
    use Bio::Seq;
    use Bio::Tools::Run::StandAloneBlast;

    # step one, creating the factory
    my @params  = ( 'program' => 'blastn', 'database' => 'ecoli.nt' );
    my $factory = Bio::Tools::Run::StandAloneBlast->new(@params);

    # step two, the input seq is entered
    my $input = Bio::Seq->new( -id  => "test query",
                               -seq => "ACTAAGTGGGGG" );
    my $blast_report = $factory->blastall($input);

    # step three, accessing parts of the blast report
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";
    }

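RemoteBlast

• Bioperl supports remote execution of blasts at NCBI by means of the RemoteBlast object.
• A skeleton script to run a remote blast might look as follows:

    # remote BLAST
    use Bio::Tools::Run::RemoteBlast;

    # step 1: query submission
    open( my $out, '>', 'ecoliblastseqs.txt' ) or die "cannot open: $!";
    my $remote_blast = Bio::Tools::Run::RemoteBlast->new(
        -prog => 'blastp', -data => 'ecoli', -expect => '1e-10' );
    my $r = $remote_blast->submit_blast("t/data/ecolitst.fa");

    # step 2: results retrieval and storage
    while ( my @rids = $remote_blast->each_rid ) {
        foreach my $rid (@rids) {
            my $rc = $remote_blast->retrieve_blast($rid);
            if ( !ref($rc) ) {                 # not ready yet, or failed
                $remote_blast->remove_rid($rid) if $rc < 0;
                sleep 5;
                next;
            }
            my $result = $rc->next_result;     # $rc is a SearchIO report
            while ( my $hit = $result->next_hit ) {
                print $out $hit->name, "\n";   # store the hit names
            }
            $remote_blast->remove_rid($rid);
        }
    }
    close $out;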

Parsing Similarity Search Reports

• Bioperl can parse the reports of more search programs than it can run.

• Bioperl has objects to parse and/or search BLAST, PSIBLAST and FASTA reports; they include Search.pm, SearchIO.pm, BPlite.pm and Blast.pm (for parsing Blast reports). A future release will incorporate support for HMMer and GenScan, among others.

Parsing a Blast Report

    use Bio::SearchIO;
    my $blast_report = Bio::SearchIO->new( -format => 'blast',
                                           -file   => $ARGV[0] );
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "E: ", $hsp->evalue(),
                  " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Other Parsers

• Bioperl has a family of parsers that work in a slightly different way than the previous one:
– The report belongs to a different class, the Bio::Tools::BPlite class, which has a different set of methods for getting the information.
– A factory has to be created first, and Bioperl then applies the parameters of the search to it
• This family of parsers includes: BPlite, BPpsilite, BPbl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'  => 'blastp',
        'database' => 'swissprot' );
    my $blast_report = $factory->blastall($query);

    # nextSbjct() and nextHSP() are Bio::Tools::BPlite methods
    while ( my $subject = $blast_report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join( "\t", $hsp->P, $hsp->percent, $hsp->score ), "\n";
        }
    }

BioPerl Overview, Part 2

• recap

• creating databases

• accessing databases with OBDA

• relational databases, SQL and others

• closing

Creating your own databases: 1. By storing results of searches as flat files

• The RemoteBlast and StandAloneBlast scripts from Part 1 apply here: run the search from Bioperl and save the results to a local file.

Creating databases: 2. Mirroring databases on your local system

• Basically follow each site’s instructions on downloading and setting up the database locally.

Creating databases: 3. By using a database management system like Oracle, SQL, Access, Postgres, etc.

• Again, follow each package’s instructions on downloading and setting up the database locally.

• Note: this is a nice tutorial on relational databases for biologists: http://www.cs.virginia.edu/papers/ismb02_sql.pdf

The OBDA Registry System

• OBDA stands for Open Biological Database Access.
• The OBDA System was designed so that one could use the same application code to access data from all three of the database types (flat file, local relational database, remote database) by simply changing a few lines in a configuration file. This makes application code more portable and easier to maintain.
• The core of the OBDA System is a Database Registry.
• This registry is a combination of both local and site-wide configuration files which define one or more databases and the methods used to access them.
• The registry is platform-independent and is used for specifying how BioPerl programs find sequence databases.

Source: http://www.bioperl.org/wiki/HOWTO:OBDA

[Figure: Accessing data via OBDA Registry System. Labels: Your Application (“You are here”), OBDA registry system, Local flat file, Remote DB, Local relational DB, CORBA server, Local DB]

The OBDA Registry System

• Note: Accessing data via the OBDA system is optional in BioPerl. One can easily access sequence data via the usual database-format-specific modules such as Bio::Index::Fasta or Bio::DB::Fasta.

Source: http://www.bioperl.org/wiki/HOWTO:OBDA

[Figure: Accessing data without the OBDA Registry System. Labels: Your Application (“You are here”), Local flat file, Remote DB, Local relational DB, CORBA server, Local DB]

Setup of the OBDA Registry

• The OBDA registry itself is a small text file
• By convention, the name of the file is seqdatabase.ini
• One such file may look like this:

    VERSION=1.00

    [embl]
    protocol=biofetch
    location=http://www.ebi.ac.uk/cgi-bin/dbfetch
    dbname=embl

    [swissprot]
    protocol=biofetch
    location=http://www.ebi.ac.uk/cgi-bin/dbfetch
    dbname=swall

    [refseq]
    protocol=biofetch
    location=http://www.ebi.ac.uk/cgi-bin/dbfetch
    dbname=refseq


The general format is:

    [database-name]
    tag=value
    tag=value
    …

Protocol   Tag(s)      Description
--------   ---------   -----------
flat       location    path to the database dir
           dbname      name of database dir
                       * config.dat generated during indexing must be here
biofetch   location    base URL for the web service
                       * http://ebi.ac.uk/cgi-bin/biofetch
           dbname      name of the database
biosql     location    host:port
           dbname      database name
           driver      [sqlserver|postgres|oracle|access|csv|informix|odbc|rdb]
           user        username
           passwd      password
           biodbname   database name
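
To make the [database-name] / tag=value layout concrete, here is a small plain-Perl sketch that reads registry-style text into a nested hash. It is for illustration only: parse_registry is a hypothetical helper, not part of BioPerl, and this is not how Bio::DB::Registry itself reads the file. The two sample entries are taken from these slides.

```perl
use strict;
use warnings;

# Hypothetical helper: parse registry-style text into
# { database-name => { tag => value, ... }, ... }
sub parse_registry {
    my ($text) = @_;
    my ( %databases, $current );
    for my $line ( split /\n/, $text ) {
        $line =~ s/^\s+|\s+$//g;                        # trim whitespace
        next if $line eq '' or $line =~ /^VERSION=/;    # skip blanks, version line
        if ( $line =~ /^\[(.+)\]$/ ) {                  # [database-name]
            $current = $1;
        }
        elsif ( defined $current and $line =~ /^([^=]+)=(.*)$/ ) {    # tag=value
            $databases{$current}{ lc $1 } = $2;
        }
    }
    return \%databases;
}

my $registry = parse_registry(<<'END');
VERSION=1.00

[embl]
protocol=biofetch
location=http://www.ebi.ac.uk/cgi-bin/dbfetch
dbname=embl

[genbank]
protocol=flat
location=/usr/share/biodb
dbname=genbank
END

for my $name ( sort keys %$registry ) {
    my $db = $registry->{$name};
    print "$name: protocol=$db->{protocol} location=$db->{location}\n";
}
```

Each section name becomes the symbolic database name that application code later asks the registry for.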

1. Create the text file seqdatabase.ini as just explained
2. Copy the file to one of these standard locations:
       $HOME/.bioinformatics/seqdatabase.ini
       /etc/bioinformatics/seqdatabase.ini
3. Modify the search path, if needed, by setting this environment variable:
       OBDA_SEARCH_PATH=/home/yourdir/;http://foo.org/
4. If applicable, “install” the local databases; skip this step if you plan to use biofetch only
5. Write code inside your application to use the registry

    use Bio::DB::Registry;

    ...

    my $registry = Bio::DB::Registry->new;
    my $db       = $registry->get_database('embl');
    my $seq      = $db->get_Seq_by_acc("J02231");
    print $seq->seq, "\n";

Notes:

• $registry is an object of type Bio::DB::Registry
• $db is an object of type Bio::DB::RandomAccessI
• $seq is an object of type Bio::Seq
• Details of the location of the embl database and its access method are not specified here but in the seqdatabase.ini file

Special Case 1: installing local database files - flat files

• A flat file is a local file of sequences (e.g. fasta, or a local copy of embl, swissprot, etc.)
• These files have to be indexed before they can be used by the OBDA system.
• A small script will index the flat file for you
• The resulting index file will be called config.dat and there will be one for each flat file that has been indexed
• The index itself is an object of type Bio::DB::Flat

Special case 1...

• For example, the following command will create an index where:
– the index will be written to /usr/share/biodb/<symbolic_name_of_db>,
– the symbolic name of the database is genbank,
– the indexing scheme is flat,
– the format of the source database file is fasta,
– and the source itself is data/*.fa, a group of files ending in .fa

    bioflat_index.pl -c -l /usr/share/biodb -d genbank -i flat -f fasta data/*.fa

• The corresponding entry in the seqdatabase.ini file will look like this:

    VERSION=1.00
    …
    [genbank]
    protocol=flat
    location=/usr/share/biodb
    dbname=genbank

Special Case 2: adaptors for relational databases [BioSql]

• Relational databases such as SQL, Oracle, Postgres, etc. require the use of adaptor objects written specifically for the OBDA system.
• Refer to the documentation of the specific one you use for more details.
• An example with SQL follows:

    # $dbadp (the database adaptor), $acc, $ver and $primary_tag are
    # assumed to have been set up earlier; that code is not shown here
    my $adp = $dbadp->get_object_adaptor("Bio::SeqI");
    my $seq = Bio::Seq->new( -accession_number => $acc,
                             -namespace        => 'swissprot',
                             -version          => $ver );
    my $dbseq = $adp->find_by_unique_key($seq);
    my $feat  = Bio::SeqFeature::Generic->new(
        -primary_tag => $primary_tag,
        -strand      => 1,
        -start       => 100,
        -end         => 10000,
        -source_tag  => 'blat' );
    $dbseq->add_SeqFeature($feat);
    $dbseq->store;

Closing …

• Just scratched the surface
• Not covered here from Core Bioperl:
- creating sequence alignments (ClustalW)
- displaying alignment results (SimpleAlign)
- XML: auto web form generation
- SQL: persistent BioPerl
- other data structures: trees, maps, etc.
• Hundreds of modules and applications

Other Links

• BioPerl wiki: http://bioperl.org/wiki/Main_Page
• Relational Databases for Biologists: http://www.cs.virginia.edu/papers/ismb02_sql.pdf
• Bioperl Tutorial: http://bioperl.org/Core/Latest/bptutorial.html
• OBDA homepage: http://obda.open-bio.org/
• BioSQL discussion group: http://lists.open-bio.org/pipermail/biosql-l/2003-July/thread.html#404
