BioPerl Overview

BioPerl – An Overview

Gloria Rendon

April 2006

University of Illinois at Urbana-Champaign

• BioPerl is a…
  – Biology toolkit of modules for Bioinformatics, Genetics, Life Sciences
  – Framework to do Computational Biology
  – Object-oriented flavor of Perl plus an extensive Bioinformatics library
  – Collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications

• BioPerl is NOT
  – A set of ready-to-use programs, like many commercial packages and free web-based interfaces
  – A suitable language for all aspects of Computational Biology; it is not suitable for high-precision, fast, intensive numeric data analysis (e.g. simulations, modeling, probabilities)
  – A strongly typed language, which means minimal time is spent on tasks such as error checking and enforcing consistency of the data
  – A visually oriented language; it has poor GUI capabilities for code development

A very brief History

• BioPerl, the open source group of volunteers dedicated to the development of this language, is 10 years old
• BioPerl, the “stable” core language, is four years old (release date 2002); it contained modules for sequence manipulation, accessing databases using a range of data formats, and executing and parsing the results of various molecular biology programs including Blast, clustalw, TCoffee, genscan, ESTscan and HMMER.
• Latest version is 1.4 (release date 2003); it contains core and extensions, additional libraries, repositories, compiled programs
• Future release is 1.5 (release date unknown); it contains GUI capabilities, persistence capabilities, client-server CORBA-compliant capabilities, and process pipelining capabilities.

Software Requirements

• Core BioPerl requires Perl [versions 5.0 and above] to be already installed
• Minimal and complete versions of BioPerl exist; they are found online, packaged as “bundles”, and both subsume the Core package
• Custom installation: you can pick and choose which modules to install; among the most commonly downloaded ones are:
  – For accessing remote databases, you will also need: File-Temp-0.09 and IO-String-1.01
  – For accessing Ace database, you will also need: AcePerl-1.68
  – For remote Blast searches: libwww-perl-5.48 Digest-MD5-2.12 HTML-Parser-3.13 libnet-1.073 MIME-Base64-2.11 URI-1.09 IO-String-1.216
  – For xml parsing: libxml-perl-2.30 XML-Twig-2.02 Soap-Lite-0.52 XML-DOM-1.37 expat-1.95.1
• Even though developers strive to produce independent modules, interdependencies are sometimes unavoidable. So, make sure you have installed all the necessary modules on the host system. For more current and additional information on external modules required by bioperl, check http://bioperl.org/Core/external.shtml
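One way to check which of these external modules are already present is to try loading each one. The following is a minimal sketch in plain Perl; the helper name module_available and the mapping from distribution names (such as IO-String) to package names (such as IO::String) are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return true if the named module can be loaded on this system.
sub module_available {
    my ($module) = @_;
    # Only accept plain module names such as Foo::Bar before passing the string to eval.
    return 0 unless $module =~ /\A\w+(?:::\w+)*\z/;
    return eval "require $module; 1" ? 1 : 0;
}

# Package names assumed to correspond to the distributions listed on the slide.
my @wanted = qw(File::Temp IO::String LWP::UserAgent Digest::MD5
                HTML::Parser MIME::Base64 URI XML::Twig);

for my $module (@wanted) {
    printf "%-16s %s\n", $module, module_available($module) ? 'ok' : 'MISSING';
}
```

Any module reported as MISSING can then be installed before the BioPerl features that depend on it are used.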

Software Requirements…

• Bioperl also uses several C programs for sequence alignment and local blast searching. To use these features of bioperl you will need an ANSI C or Gnu C compiler as well as the actual program available from sources such as:
  • for Smith-Waterman alignments: bioperl-ext-0.6 from http://bioperl.org/Core/external.shtml
  • for clustalw alignments: ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalW/
  • for tcoffee alignments: http://igs-server.cnrsmrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.html
  • for local blast searching: ftp://ftp.ncbi.nlm.nih.gov/blast/server/current_release/
  • for EMBOSS applications: http://www.hgmp.mrc.ac.uk/Software/EMBOSS/download.html

Installation

• Locate the package(s) on the network at: http://search.cpan.org/
• Download
• Decompress and remove the file archive
• Create a makefile (perl Makefile.PL)
• Run “make”, “make test”, and “make install” for every module
• The CPAN module can also be used to install all of the modules in a single step as a “bundle” of modules, Bundle::BioPerl, e.g.

    $> perl -MCPAN -e shell
    cpan> install Bundle::BioPerl
    <installation details....>
    cpan> install B/BI/BIRNEY/bioperl-1.0.tar.gz
    <installation details....>
    cpan> quit

• The process described above is for the UNIX OS.
• The minimal package should also work under NT, Windows, Mac OS X; however, it has not been widely tested.

BioPerl, the Core toolkit

• Much of bioperl is focused on sequence manipulation.
  – Accessing sequence data from local and remote databases
  – Transforming formats of database/file records
  – Manipulating individual sequences
  – Searching for "similar" sequences
  – Creating and manipulating sequence alignments
  – Searching for genes and other structures on genomic DNA
  – Developing machine readable sequence annotations

A one-minute crash course on Object-Oriented Languages

• A bundle is a collection of modules that could be related
• Each module is composed of one or more classes
• A class is the blueprint of an object
• An object contains a data part and a methods part.
• Data is visible to the object only (encapsulation).
• The methods act on the data and could be private or public.
• Objects interact with each other through method invocation only.
• Inheritance is one of the many relationships that can exist between classes, i.e. the ISA relationship
• Other common relationships used in Bioperl:
  – xxxxx and xxxxxIO, where the latter class is the IO wrapper of the former class xxxxx (e.g. Seq and SeqIO)
  – xxxxx and xxxxxI, where the latter class is the Interface of the former class xxxxx (e.g. Seq and SeqI)
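The ideas above can be sketched in a few lines of plain Perl. The classes SketchSeqI, SketchSeq and SketchRichSeq are hypothetical stand-ins invented for this illustration and are not BioPerl classes; note also that Perl enforces encapsulation by convention only, since the data hash remains reachable from outside the object.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical interface class: it only names the methods an implementation must offer.
package SketchSeqI;
sub seq        { die "seq() must be implemented by a subclass\n" }
sub seq_length { die "seq_length() must be implemented by a subclass\n" }

# Hypothetical implementation class: SketchSeq ISA SketchSeqI.
package SketchSeq;
our @ISA = ('SketchSeqI');

sub new {
    my ($class, %args) = @_;
    # The data part lives in a hash; by convention only methods touch it.
    my $self = { _seq => defined $args{-seq} ? $args{-seq} : '' };
    return bless $self, $class;
}

sub seq {
    my $self = shift;
    $self->{_seq} = shift if @_;    # public accessor: get or set the data
    return $self->{_seq};
}

sub seq_length {
    my $self = shift;
    return length( $self->seq );
}

# Hypothetical subclass that adds a description: SketchRichSeq ISA SketchSeq.
package SketchRichSeq;
our @ISA = ('SketchSeq');

sub new {
    my ($class, %args) = @_;
    my $self = $class->SUPER::new(%args);    # reuse the parent constructor
    $self->{_desc} = defined $args{-desc} ? $args{-desc} : '';
    return $self;
}

sub desc { my $self = shift; return $self->{_desc} }

package main;

my $seq = SketchRichSeq->new( -seq => 'ATGAGTAGTAGTAAAGGTA', -desc => 'toy record' );
print $seq->seq_length, "\n";
print $seq->isa('SketchSeqI') ? "implements SketchSeqI\n" : "does not implement SketchSeqI\n";
```

The named-argument style (-seq, -desc) mirrors the constructor calls shown elsewhere in these slides.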


BioPerl Class Diagram

Source: http://bioperl.org/wiki/Class_Diagram

A closer look at the class diagram…

• Bio::DB
• Bio::Seq
• Bio::Index
• Bio::Align
• Bio::Search
• Bio::Graphics
• Bio::Biblio
• Bio::Structure
• Bio::Variation
• Bio::LiveSeq

All class diagrams shown here follow UML conventions

Documentation concerns

• Few class diagrams have been published
• Even fewer dataflow diagrams exist
• But every single class in Bioperl has a POD (Plain Old Documentation, Perl's embedded documentation format)
• Last resort: look at the code itself. [Remember, Bioperl is open source]

The Class POD

Source: actual code
Download: link to repos.
Name: class name
Synopsis: usage
Description: textual
Appendix: list of methods
Author
Contributors
Feedback


Sequence classes in BioPerl

Sequence in BioPerl

• Seq is the central sequence object in bioperl. Most common sequence manipulations can be performed with Seq.
• RichSeq objects store additional annotations beyond those used by standard Seq objects
• SeqWithQuality objects are used to manipulate sequences with quality data, like those produced by phred
• PrimarySeq object is basically a “stripped down” version of Seq.
• LocatableSeq object might be more appropriately called an “AlignedSeq” object. It is a Seq object which is part of a multiple sequence alignment. It has “start” and “end” positions indicating from where in a larger sequence it may have been extracted.
• LargeSeq object is a special type of Seq object used for handling very long (e.g. > 100 MB) sequences.
• LiveSeq addresses the problem of features whose location on a sequence changes over time. This can happen, for example, when sequence feature objects are used to store gene locations on newly sequenced genomes: these locations can change as higher quality sequencing data becomes available.
• SeqI objects are Seq “interface objects” (see section II.4 and Bio). They are used to ensure bioperl’s compatibility with other software packages.

Sequence Creation, Retrieval and Access

1. de novo:

    use Bio::Seq;
    my $seq1 = Bio::Seq->new( -seq  => 'ATGAGTAGTAGTAAAGGTA',
                              -id   => 'my seq',
                              -desc => 'this is a new Seq' );

2. from a file:

    use Bio::SeqIO;

    # through file IO functions
    my $seqin = Bio::SeqIO->new( -file => 'seq.fasta', -format => 'fasta' );
    my $seq3  = $seqin->next_seq();

    # through file handles
    my $inseq  = Bio::SeqIO->newFh( -file => '<seqs.sp',    -format => 'swiss' );
    my $outseq = Bio::SeqIO->newFh( -file => '>seqs.fasta', -format => 'fasta' );
    print $outseq $_ while <$inseq>;

Sequence Creation, Retrieval and Access

3. from a remote database

    use Bio::DB::GenBank;

    # create the database handle
    $gb = new Bio::DB::GenBank();

    # each of these two lines returns a Seq object
    $seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
    $seq2 = $gb->get_Seq_by_acc('AF303112');

    # this line returns a SeqIO object
    $seqio = $gb->get_Stream_by_batch( [ qw(J00522 AF303112 2981014) ] );

Bioperl supports sequence data retrieval from the Genbank, Genpept, RefSeq, Swissprot, and EMBL databases.


Sequence Creation, Retrieval and Access

4. from a local database
  – Before accessing sequences from local sequence data files, they have to be made Bioperl-readable by indexing them with Bio::Index or Bio::DB::Fasta.
  – The following sequence data formats are supported by Bio::Index: Genbank, Swissprot, Pfam, EMBL and Fasta.
  – Once the set of sequences has been indexed using Bio::Index, individual sequences can be accessed using syntax very similar to that described above for accessing remote databases.

    use Bio::Index::Fasta;   # using fasta file format
    $Index_File_Name = shift;
    $inx = Bio::Index::Fasta->new( -filename   => $Index_File_Name,
                                   -write_flag => 1 );
    $inx->make_index(@ARGV);   # remaining arguments: the fasta files to index

    # @ids is a placeholder for the list of sequence IDs to look up
    foreach $id (@ids) {
        $seq = $inx->fetch($id);   # Returns Bio::Seq object
    }

Sequence Creation, Retrieval and Access

5. format conversion with SeqIO
  – SeqIO can read a stream of sequences, located in a single file or in multiple files, in a number of formats: Fasta, EMBL, GenBank, Swissprot, PIR, GCG, SCF, phd/phred, Ace, or raw (plain sequence).
  – Once the sequence data has been read in with SeqIO, it is available to bioperl in the form of Seq objects.
  – Moreover, the Seq objects can then be written to another file (again using SeqIO) in any of the supported data formats, making data converters simple to implement, for example:

    use Bio::SeqIO;
    $in  = Bio::SeqIO->new( '-file'   => "inputfilename",
                            '-format' => 'Fasta' );
    $out = Bio::SeqIO->new( '-file'   => ">outputfilename",
                            '-format' => 'EMBL' );
    while ( my $seq = $in->next_seq() ) {
        $out->write_seq($seq);
    }


Yet another view of the Seq class


Bioperl Features [ex: XML tags Bioperl features]


Sequences and Annotations

Annotations, de novo

    use Bio::SeqFeature::Generic;
    use Bio::SeqIO;

    $in  = Bio::SeqIO->newFh( -file => $ARGV[0] );
    $out = Bio::SeqIO->newFh();

    $seq  = <$in>;
    $feat = new Bio::SeqFeature::Generic(
        -start   => 10,
        -end     => 100,
        -strand  => -1,
        -primary => 'repeat',
        -source  => 'repeatmasker',
        -score   => 1000,
        -tag     => {
            new      => 1,
            author   => 'someone',
            sillytag => 'this is silly!'
        }
    );

    $seq->add_SeqFeature($feat);

    print $out $seq;


Sequences and Locations

    my $fuzzylocation = new Bio::Location::Fuzzy( -start    => '<30',
                                                  -end      => 90,
                                                  -loc_type => '.' );

• A Location object is like an index within a range.
• It is designed to be associated with a SeqFeature object to indicate where on a larger structure (e.g. a chromosome or contig) the feature can be found.
• It was implemented as a separate object, rather than as a simple index on a range, because
  - Some objects have multiple locations or sub-locations (e.g. a gene’s exons may have multiple start and stop locations)
  - In unfinished genomes, the precise locations of features are not known with certainty.
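To see why a single start/end pair is not enough, consider a feature made of several sub-locations. The sketch below uses plain Perl data structures, not the Bio::Location classes; the exon coordinates and the helper names location_span and location_length are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A split location sketched as a list of [start, end] pairs (1-based, inclusive).
# The coordinates are invented for this example.
my @exons = ( [ 100, 250 ], [ 400, 520 ], [ 900, 1000 ] );

# Overall extent: smallest start and largest end over all sub-locations.
sub location_span {
    my @ranges = @_;
    my ($start) = sort { $a <=> $b } map { $_->[0] } @ranges;
    my ($end)   = sort { $b <=> $a } map { $_->[1] } @ranges;
    return ( $start, $end );
}

# Number of positions actually covered, which is smaller than the span
# whenever there are gaps between the sub-locations.
sub location_length {
    my $total = 0;
    $total += $_->[1] - $_->[0] + 1 for @_;
    return $total;
}

my ( $start, $end ) = location_span(@exons);
printf "span %d..%d, covered %d\n", $start, $end, location_length(@exons);
```

A single range would report only the span and lose the information about which positions inside it belong to the feature.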

Other commonly used Seq methods

- The following methods return string values

    $seqobj->desc();               # a description of the sequence
    $seqobj->display_id();         # the human-readable id of the sequence
    $seqobj->seq();                # string of sequence
    $seqobj->subseq(5,10);         # part of the sequence as a string
    $seqobj->accession_number();   # when there, the accession number
    $seqobj->alphabet();           # one of 'dna', 'rna', 'protein'
    $seqobj->primary_id();         # a unique id for this sequence

- The following methods return an array of Bio::SeqFeature objects

    $seqobj->top_SeqFeatures       # The 'top level' sequence features
    $seqobj->all_SeqFeatures       # All sequence features

- The following methods return new sequence objects, but do not transfer features across:

    $seqobj->trunc(5,10)           # truncation from 5 to 10 as new object
    $seqobj->revcom                # reverse complements sequence
    $seqobj->translate             # translation of a DNA sequence from start/end
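The coordinates passed to subseq and trunc are 1-based and inclusive, which differs from Perl's own 0-based substr. The following plain-Perl sketch shows the same arithmetic on a bare string; the helper names subseq_1based and revcom_dna are hypothetical, and the code illustrates the convention rather than reproducing BioPerl's implementation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Substring using 1-based, inclusive coordinates, as in subseq(5,10).
sub subseq_1based {
    my ( $seq, $start, $end ) = @_;
    return substr( $seq, $start - 1, $end - $start + 1 );
}

# Reverse complement of a DNA string (unambiguous bases only).
sub revcom_dna {
    my ($seq) = @_;
    my $rc = scalar reverse $seq;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

my $dna  = 'ATGAGTAGTAGTAAAGGTA';
my $part = subseq_1based( $dna, 5, 10 );
print "bases 5 to 10:      $part\n";
print "reverse complement: ", revcom_dna($part), "\n";
```

Positions 5 to 10 therefore cover six bases, not five.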

Basic sequence statistics

• SeqStats object provides methods for obtaining the molecular weight of the sequence as well as the number of occurrences of each of the component residues (bases for a nucleic acid or amino acids for a protein). For nucleic acids, SeqStats also returns counts of the number of codons used. For example:

    use Bio::Tools::SeqStats;
    $seq_stats   = Bio::Tools::SeqStats->new( -seq => $seqobj );
    $weight      = $seq_stats->get_mol_wt();
    $monomer_ref = $seq_stats->count_monomers();
    $codon_ref   = $seq_stats->count_codons();   # for DNA sequence

• The SeqWords object is similar to SeqStats and provides methods for calculating frequencies of “words” (e.g. tetramers or hexamers)
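What these counts mean can be shown on a bare string. The sketch below is plain Perl, not the SeqStats or SeqWords code; the helper names are hypothetical, and counting overlapping words is a choice made for this illustration that may differ from how SeqWords counts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count occurrences of each residue in a sequence string.
sub count_monomers_sketch {
    my ($seq) = @_;
    my %count;
    $count{$_}++ for split //, uc $seq;
    return \%count;
}

# Count overlapping words of a given length (e.g. 4 for tetramers).
sub count_words_sketch {
    my ( $seq, $size ) = @_;
    my %count;
    $seq = uc $seq;
    for my $i ( 0 .. length($seq) - $size ) {
        $count{ substr( $seq, $i, $size ) }++;
    }
    return \%count;
}

my $dna      = 'ATGAGTAGTAGTAAAGGTA';
my $monomers = count_monomers_sketch($dna);
my $words    = count_words_sketch( $dna, 3 );

print "$_: $monomers->{$_}\n" for sort keys %$monomers;
print "AGT occurs $words->{AGT} times\n";
```

Both helpers return a hash reference keyed by residue or word, which is the same general shape as the references returned by the methods above.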


More on Format conversion with AlignIO

• AlignIO is the bioperl object for data conversion of alignment files.
• AlignIO is patterned on the SeqIO object and shares most of SeqIO’s features.
• AlignIO currently supports INPUT in the following formats: fasta, mase, stockholm, prodom, selex, bl2seq, clustalw, msf/gcg, water (from EMBOSS, see III.3.6), needle (from EMBOSS, see III.3.6)
• AlignIO supports OUTPUT in these formats: fasta, mase, selex, clustalw, msf/gcg.
• One significant difference between AlignIO and SeqIO is that AlignIO handles IO for only a single alignment at a time (SeqIO.pm handles IO for multiple sequences in a single stream). Syntax for AlignIO is almost identical to that of SeqIO:

    use Bio::AlignIO;
    $in  = Bio::AlignIO->new( '-file'   => "inputfilename",
                              '-format' => 'fasta' );
    $out = Bio::AlignIO->new( '-file'   => ">outputfilename",
                              '-format' => 'pfam' );
    while ( my $aln = $in->next_aln() ) {
        $out->write_aln($aln);
    }

• The only difference is that here, the returned object reference, $aln, is to a SimpleAlign object rather than a Seq object.


SimpleAlign


• The SimpleAlign class contains methods to select sequences or columns, but it can not filter alignments by functions (i.e. properties)

• In order to filter columns by properties, you have to extract the columns by yourself, filter them and reconstruct the new sequences.

• The following example filters gap columns.

    use strict;
    use Bio::AlignIO;

    my $in  = Bio::AlignIO->new( -file => $ARGV[0], -format => 'clustalw' );
    my $out = Bio::AlignIO->newFh( -fh => \*STDOUT, -format => 'clustalw' );
    my $aln = $in->next_aln();

    # create a list containing all columns
    my @aln_cols = ();
    foreach my $seq ( $aln->each_alphabetically() ) {
        my $colnr = 0;
        foreach my $chr ( split( "", $seq->seq() ) ) {
            $aln_cols[$colnr] .= $chr;
            $colnr++;
        }
    }

    # then do the work: we want to eliminate all the columns containing gaps
    # 1/ we create a list containing all the columns without any gap
    my $gapchar     = $aln->gap_char();
    my @no_gap_cols = ();
    foreach my $col (@aln_cols) {
        next if $col =~ /\Q$gapchar\E/;
        push @no_gap_cols, $col;
    }

    # 2/ we rebuild one ungapped string per sequence from the remaining columns
    my @seq_strs = ();
    foreach my $col (@no_gap_cols) {
        my $colnr = 0;
        foreach my $chr ( split( "", $col ) ) {
            $seq_strs[$colnr] .= $chr;
            $colnr++;
        }
    }

    # now we replace the old gapped sequences with the new ungapped ones
    foreach my $seq ( $aln->each_alphabetically() ) {
        $seq->seq( shift @seq_strs );
    }
    print $out $aln;


Search and Analysis of Similar Sequences

• Bioperl offers a number of modules to facilitate running Blast, both locally and remotely, as well as to parse the often voluminous reports produced by Blast.

• Note, Bioperl itself does not have an internal library for running Blast; instead, it calls the necessary program and then manipulates its results internally

StandAloneBlast

• The module Bio::Tools::Run::StandAloneBlast offers the ability to wrap local calls to blast from within perl.
• All of the currently available options of NCBI Blast (e.g. PSIBLAST, PHIBLAST, bl2seq) are available from within the bioperl StandAloneBlast interface.
• Of course, to use StandAloneBlast, one needs to have NCBI Blast installed locally, as well as one or more blast-readable databases.
• Basic usage of the StandAloneBlast.pm module is simple. Initially, a local blast “factory object” is created; then the supported blast executables can be run through it.

    # local BLAST
    use Bio::Seq;
    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # step one, creating the factory
    @params  = ( 'program' => 'blastn', 'database' => 'ecoli.nt' );
    $factory = Bio::Tools::Run::StandAloneBlast->new(@params);

    # step two, the input seq are entered
    $input        = Bio::Seq->new( '-id' => "test query", '-seq' => "ACTAAGTGGGGG" );
    $blast_report = $factory->blastall($input);

    # step three, accessing parts of the blast report
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";
    }

RemoteBlast

• Bioperl supports remote execution of blasts at NCBI by means of the RemoteBlast object.
• A skeleton script to run a remote blast might look as follows:

    # remote BLAST
    use Bio::Tools::Run::RemoteBlast;

    # step 1: query submission
    $remote_blast = Bio::Tools::Run::RemoteBlast->new(
        '-prog' => 'blastp', '-data' => 'ecoli', '-expect' => '1e-10' );
    $r = $remote_blast->submit_blast("t/data/ecolitst.fa");

    # step 2: results retrieval and storage
    while ( @rids = $remote_blast->each_rid ) {
        foreach $rid (@rids) {
            $rc = $remote_blast->retrieve_blast($rid);
            if ( !ref($rc) ) {
                # not a report yet: a negative code means an error, otherwise keep waiting
                $remote_blast->remove_rid($rid) if $rc < 0;
                sleep 5;
            }
            else {
                # the report is ready: store it as a flat file
                $remote_blast->save_output("ecoliblastseqs.txt");
                $remote_blast->remove_rid($rid);
            }
        }
    }

Parsing Similarity Search Reports

• Bioperl can parse the reports of more search programs than it can run.
• Bioperl objects that parse and/or search BLAST, PSIBLAST and FASTA reports include Search.pm, SearchIO.pm, BPlite.pm and Blast.pm (for parsing Blast reports). A future release will incorporate support for HMMer and GenScan, among others.

Parsing a Blast Report

    use Bio::SearchIO;

    my $blast_report = new Bio::SearchIO( '-format' => 'blast',
                                          '-file'   => $ARGV[0] );
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(), "\n";
        while ( my $hsp = $hit->next_hsp() ) {
            print "E: ", $hsp->evalue(),
                  " frac_identical: ", $hsp->frac_identical(), "\n";
        }
    }

Other Parsers

• Bioperl has a family of parsers that work in a slightly different way than the previous one:
  – The report belongs to a different class, the Bio::Tools::BPlite class, which has a different set of methods for getting the information.
  – A factory has to be created first; the parameters of the search are then applied to it.
• This family of parsers includes: BPLite, BPpsilite, BPbl2seq

    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    my $Seq_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
    my $query  = $Seq_in->next_seq();

    my $factory = Bio::Tools::Run::StandAloneBlast->new(
        'program'  => 'blastp',
        'database' => 'swissprot' );

    # nextSbjct() and nextHSP() are BPlite-style methods, so this assumes
    # that blastall() hands back a BPlite report rather than a SearchIO object
    my $blast_report = $factory->blastall($query);

    while ( my $subject = $blast_report->nextSbjct() ) {
        print $subject->name(), "\n";
        while ( my $hsp = $subject->nextHSP() ) {
            print join( "\t", $hsp->P, $hsp->percent, $hsp->score ), "\n";
        }
    }

BioPerl Overview, Part 2

• recap

• creating databases

• accessing databases with OBDA

• relational databases, SQL and others

• closing


Creating own databases: 1. By Storing results of Searches as flat files

• Bioperl offers a number of modules to facilitate running Blast, both locally and remotely, as well as to parse the often voluminous reports produced by Blast.

• Note, Bioperl itself does not have an internal library for running Blast; instead, it calls the necessary program and then manipulates its results internally

RemoteBlast

• Bioperl supports remote execution of blasts at NCBI by means of the RemoteBlast object.
• A skeleton script to run a remote blast might look as follows:

    # remote BLAST
    use Bio::Tools::Run::RemoteBlast;

    # step 1: query submission
    $remote_blast = Bio::Tools::Run::RemoteBlast->new(
        '-prog' => 'blastp', '-data' => 'ecoli', '-expect' => '1e-10' );
    $r = $remote_blast->submit_blast("t/data/ecolitst.fa");

    # step 2: results retrieval and storage
    while ( @rids = $remote_blast->each_rid ) {
        foreach $rid (@rids) {
            $rc = $remote_blast->retrieve_blast($rid);
            if ( !ref($rc) ) {
                # not a report yet: a negative code means an error, otherwise keep waiting
                $remote_blast->remove_rid($rid) if $rc < 0;
                sleep 5;
            }
            else {
                # the report is ready: store it as a flat file
                $remote_blast->save_output("ecoliblastseqs.txt");
                $remote_blast->remove_rid($rid);
            }
        }
    }

StandAloneBlast

• The module Bio::Tools::Run::StandAloneBlast offers the ability to wrap local calls to blast from within perl.
• All of the currently available options of NCBI Blast (e.g. PSIBLAST, PHIBLAST, bl2seq) are available from within the bioperl StandAloneBlast interface.
• Of course, to use StandAloneBlast, one needs to have NCBI Blast installed locally, as well as one or more blast-readable databases.
• Basic usage of the StandAloneBlast.pm module is simple. Initially, a local blast “factory object” is created; then the supported blast executables can be run through it.

    # local BLAST
    use Bio::Seq;
    use Bio::SeqIO;
    use Bio::Tools::Run::StandAloneBlast;

    # step one, creating the factory
    @params  = ( 'program' => 'blastn', 'database' => 'ecoli.nt' );
    $factory = Bio::Tools::Run::StandAloneBlast->new(@params);

    # step two, the input seq are entered
    $input        = Bio::Seq->new( '-id' => "test query", '-seq' => "ACTAAGTGGGGG" );
    $blast_report = $factory->blastall($input);

    # step three, accessing parts of the blast report
    my $result = $blast_report->next_result;
    while ( my $hit = $result->next_hit() ) {
        print "\thit name: ", $hit->name(),
              " significance: ", $hit->significance(), "\n";
    }

Creating databases: 2. Mirroring databases on your local system

• Basically follow each site’s instructions on downloading and setting up the database locally.

Creating databases: 3. By using a database management system like Oracle, SQL, Access, Postgres, etc.

• Again, follow each package’s instructions on downloading and setting up the database locally.

• Note: this is a nice tutorial on relational databases for biologists:

http://www.cs.virginia.edu/papers/ismb02_sql.pdf

The OBDA Registry System

• OBDA stands for Open Biological Database Access.
• The OBDA System was designed so that one could use the same application code to access data from all three of the database types (flat files, relational databases, and remote databases) by simply changing a few lines in a configuration file. This makes application code more portable and easier to maintain.
• The core of the OBDA System is a Database Registry.
• This registry is a combination of both local and site-wide configuration files which define one or more databases and the access methods to use to access them.
• The registry is platform-independent and is used for specifying how BioPerl programs find sequence databases.

Source: http://www.bioperl.org/wiki/HOWTO:OBDA

[Figure: Accessing data via OBDA Registry System. Diagram labels: Your Application ("You are here"), OBDA registry system, Local flat file, Remote DB, Local relational DB, CORBA server, Local DB.]

The OBDA Registry System

• Note: Accessing data via the OBDA system is optional in BioPerl. One can easily access sequence data via the usual database-format-specific modules such as Bio::Index::Fasta or Bio::DB::Fasta

Source: http://www.bioperl.org/wiki/HOWTO:OBDA

[Figure: Accessing data without the OBDA Registry System. Diagram labels: Your Application ("You are here"), Local flat file, Remote DB, Local relational DB, CORBA server, Local DB.]

Setup of the OBDA Registry

• The OBDA registry itself is a small text file
• By convention, the name of the file is seqdatabase.ini
• One such file may look like this:

    VERSION=1.00

    [embl]
    protocol=biofetch
    location=http://www.ebi.ac.uk/cgi-bin/dbfetch
    dbname=embl

    [swissprot]
    protocol=biofetch
    location=http://www.ebi.ac.uk/cgi-bin/dbfetch
    dbname=swall

    [refseq]
    protocol=biofetch
    location=http://www.ebi.ac.uk/cgi-bin/dbfetch
    dbname=refseq

Setup of the OBDA Registry

The general format is:

    [database-name]
    tag=value
    tag=value

    Protocol   Tag(s)      Description
    --------   ---------   -----------------------------------------------------
    flat       location    path to the database dir
               dbname      name of database dir
                           * config.dat generated during indexing must be here
    biofetch   location    base URL for the web service
                           * http://ebi.ac.uk/cgi-bin/biofetch
               dbname      name of the database
    biosql     location    host:port
               dbname      database name
               driver      [sqlserver|postgres|oracle|access|csv|informix|odbc|rdb]
               user        username
               passwd      password
               biodbname   database name
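To make the [database-name] / tag=value layout concrete, here is a small plain-Perl sketch that reads such text into a hash of hashes. It is not the parser BioPerl uses; the function name parse_registry_sketch is hypothetical, and the sample entry reuses values shown on the earlier slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse registry text made of "[database-name]" headers followed by tag=value lines.
sub parse_registry_sketch {
    my ($text) = @_;
    my ( %db, $current );
    for my $line ( split /\n/, $text ) {
        $line =~ s/^\s+|\s+$//g;                            # trim whitespace
        next if $line eq '' or $line =~ /^VERSION=/i;       # skip blanks and the version line
        if ( $line =~ /^\[(.+)\]$/ ) {                      # start of a new database section
            $current = $1;
            $db{$current} = {};
            next;
        }
        if ( defined $current and $line =~ /^([^=]+?)\s*=\s*(.*)$/ ) {
            $db{$current}{ lc $1 } = $2;                    # tags are stored in lower case
        }
    }
    return \%db;
}

my $registry = parse_registry_sketch(<<'INI');
VERSION=1.00

[embl]
protocol=biofetch
location=http://www.ebi.ac.uk/cgi-bin/dbfetch
dbname=embl
INI

for my $name ( sort keys %$registry ) {
    print "$name uses protocol $registry->{$name}{protocol}\n";
}
```

Each section name becomes a key, so an application can look up a database by its symbolic name and read its protocol and location from there.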

Setup of the OBDA Registry

1. create the text file seqdatabase.ini as just explained
2. copy the file to one of these standard locations:
     $HOME/.bioinformatics/seqdatabase.ini
     /etc/bioinformatics/seqdatabase.ini
3. modify the search path by adding this env variable:
     OBDA_SEARCH_PATH=/home/yourdir/;http://foo.org/
4. if applicable, “install” the local databases; skip this step if you plan to use biofetch only
5. write code inside your application to use the registry
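The locations named in steps 2 and 3 can be gathered into one list of candidate registry files. The sketch below is plain Perl with a hypothetical function name; the order of the candidates, and the idea that each OBDA_SEARCH_PATH entry has seqdatabase.ini appended to it, are assumptions made for illustration rather than documented BioPerl behavior.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the list of places where a seqdatabase.ini file could be looked up.
# Takes the environment as a hash so that it can be called with %ENV.
sub registry_candidates_sketch {
    my (%env) = @_;
    my @candidates;

    # Entries in OBDA_SEARCH_PATH are separated by ';' and may be directories or URLs.
    if ( defined $env{OBDA_SEARCH_PATH} ) {
        for my $entry ( grep { length } split /;/, $env{OBDA_SEARCH_PATH} ) {
            ( my $base = $entry ) =~ s{/+\z}{};    # drop trailing slashes
            push @candidates, "$base/seqdatabase.ini";
        }
    }

    # The two standard locations listed in step 2.
    push @candidates, "$env{HOME}/.bioinformatics/seqdatabase.ini" if defined $env{HOME};
    push @candidates, '/etc/bioinformatics/seqdatabase.ini';
    return @candidates;
}

print "$_\n" for registry_candidates_sketch(%ENV);
```

Running it with the environment variable from step 3 set shows both the per-user and the site-wide locations next to the extra search path entries.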

    use Bio::DB::Registry;
    ...
    $registry = Bio::DB::Registry->new;
    $db  = $registry->get_database('embl');
    $seq = $db->get_Seq_by_acc("J02231");
    print $seq->seq, "\n";

Notes:

$registry is an object of type Bio::DB::Registry

$db is an object of type Bio::DB::RandomAccessI

$seq is an object of type Bio::Seq

Details of the location of the embl database and its access method are not specified here but in the seqdatabase.ini file

Special Case 1: installing local database files - flat files

• A flat file is a local file of sequences (e.g. fasta, or a local copy of EMBL, Swissprot, etc.)
• These files have to be indexed before they can be used by the OBDA system.
• A small script will index the flat file for you
• The resulting index file will be called config.dat and there will be one for each flat file that has been indexed
• The index itself is an object of type Bio::DB::Flat

Special case 1...

• For example, the following command will create an index where:
  – the index will be written to /usr/share/biodb/<symbolic_name_of_db>,
  – the symbolic name of the database is genbank,
  – the indexing scheme is flat,
  – the format of the source database file is fasta,
  – and the source is data/*.fa, a group of files ending in .fa

    bioflat_index.pl -c -l /usr/share/biodb -d genbank -i flat -f fasta data/*.fa

• The corresponding entry in the seqdatabase.ini file will look like this:

    VERSION=1.00
    …
    [genbank]
    protocol=flat
    location=/usr/share/biodb
    dbname=genbank

Special Case 2: adaptors for relational databases [BioSql]

• Relational databases such as SQL, Oracle, Postgres, etc. require the use of adaptor objects written specifically for the OBDA system.
• Refer to the documentation of the specific one you use for more details.
• An example with SQL follows:

    # $dbadp is assumed to be a database adaptor created earlier;
    # $acc, $ver and $primary_tag are assumed to be set by the caller
    my $adp = $dbadp->get_object_adaptor("Bio::SeqI");
    my $seq = Bio::Seq->new( -accession => $acc,
                             -namespace => 'swissprot',
                             -version   => $ver );
    my $dbseq = $adp->find_by_unique_key($seq);
    my $feat  = new Bio::SeqFeature::Generic(
        -primary_tag => $primary_tag,
        -strand      => 1,
        -start       => 100,
        -end         => 10000,
        -source_tag  => 'blat' );
    $dbseq->add_SeqFeature($feat);
    $dbseq->store;

Closing …

• Just scratched the surface
• Not covered here from Core Bioperl:
  - creating sequence alignments (ClustalW)
  - displaying alignment results (SimpleAlign)
  - XML: auto web form generation
  - SQL: persistent BioPerl
  - other data structures: trees, maps, etc.

• Hundreds of modules and applications

Other Links

• BioPerl wiki: http://bioperl.org/wiki/Main_Page
• Relational Databases for Biologists: http://www.cs.virginia.edu/papers/ismb02_sql.pdf
• Bioperl Tutorial: http://bioperl.org/Core/Latest/bptutorial.html
• OBDA homepage: http://obda.open-bio.org/
• BioSQL discussion group: http://lists.open-bio.org/pipermail/biosql-l/2003-July/thread.html#404