BioJava Core API

BioJava Core API. Java for Bioinformatics? Cross platform means develop on one platform deploy on any. Widely accepted industry standard. Lots of support


BioJava Core API


Java for Bioinformatics?

Cross-platform: develop on one platform, deploy on any.

Widely accepted industry standard.

Lots of support libraries for modern technologies (XML, Web Services, JDBC).

Scales well from small programs to industrial-strength, enterprise-sized ones.


Java for Bioinformatics?

Object oriented.

Rapid development due to:
  Very strict types
  Simple clear syntax
  Exception handling and recovery
  Cross platform
  Extensive class library
  Code reuse


What is BioJava?

A collection of Java objects that represent and manipulate biological data

Not a program, rather a programming library

Open source (LGPL): open for all development, even commercial. Not ‘sticky’ or ‘viral’.


What is BioJava?

Collection of objects to assist bioinformatics research

Started at EBI/Sanger in 1998 by Matthew Pocock and Thomas Down

25+ developers have contributed (5 core)


What is BioJava?

BioJava has acquired 1100+ classes, 130,000+ lines of code.

Uses CVS version control, JUnit testing and ANT builds.

It now has a fairly stable API. 76 packages!


Where is BioJava?

Home page: www.biojava.org

BioJava in Anger: http://www.biojava.org/docs/bj_in_anger/

Mailing lists: [email protected] [email protected]

Nightly builds: http://www.derkholm.net/autobuild/


Obtaining BioJava

Download: http://www.biojava.org/download/
  Get binaries, source and docs

biojava-live (requires cvs):

    cvs -d :pserver:[email protected]:/home/repository/biojava login
    (password is ‘cvs’)
    cvs -d :pserver:[email protected]:/home/repository/biojava checkout biojava-live
    cvs update -Pd


Compiling biojava-live

Requires the ANT build tool http://jakarta.apache.org/ant/

The ANT tool will use build.xml to:
  Arrange source code
  Compile source
  Make jar file
  Make Java docs
  Build demos
  Build and run tests

Change to biojava-live; type ant

Unit testing requires JUnit http://junit.sourceforge.net/


Setting up BioJava

Put the following JAR files on your class path:

  biojava.jar
  bytecode-0.92.jar
  commons-cli.jar
  commons-collections-2.1.jar
  commons-dbcp-1.1.jar
  commons-pool-1.1.jar


Object-Oriented Patterns and BioJava Design


BioJava Design

Uses some reasonably “advanced” concepts:
  Design by Interface
  Protected or private constructors
  Factory classes and methods
  Flyweight/Singleton objects


Interfaces Hide Implementation

In BioJava there are several implementations of the Distribution interface.

Any can be legally returned by a method that returns a Distribution (the returning method may even return different ones depending on the situation).

Any can be legally used as an argument to a method that requires a Distribution.

All are guaranteed to contain a minimal set of common methods.


Flyweight and Singleton Objects

A Singleton is a class with only one instance and only one access point.

A Singleton will need a private constructor and may be static (e.g. AlphabetManager).

A Flyweight object uses sharing to support large numbers of fine-grained objects efficiently.

For example in BioJava there is only ever one instance of the DNA Symbol “A”. A sequence of A’s is really just a list of pointers to that one object.
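
The flyweight idea above can be sketched in plain Java. This is a minimal illustration, not BioJava's implementation: the class `FlyweightSketch`, the nested type `Sym` and the method names are hypothetical and chosen only for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FlyweightSketch {

    /** A hypothetical symbol type; NOT BioJava's Symbol interface. */
    static final class Sym {
        // One shared instance per (alphabet, token) pair.
        private static final Map<String, Sym> POOL = new HashMap<>();

        private final String alphabet;
        private final char token;

        // Private constructor: instances can only be obtained through of(...).
        private Sym(String alphabet, char token) {
            this.alphabet = alphabet;
            this.token = token;
        }

        /** Returns the single shared instance for this alphabet and token. */
        static synchronized Sym of(String alphabet, char token) {
            return POOL.computeIfAbsent(alphabet + ":" + token,
                    key -> new Sym(alphabet, token));
        }

        @Override
        public String toString() {
            return alphabet + ":" + token;
        }
    }

    /** Builds a "sequence" as a list of references to the shared instances. */
    static List<Sym> sequence(String alphabet, String text) {
        List<Sym> list = new ArrayList<>();
        for (char c : text.toCharArray()) {
            list.add(Sym.of(alphabet, c));
        }
        return list;
    }

    public static void main(String[] args) {
        List<Sym> polyA = sequence("DNA", "aaaa");
        // Every position refers to the same object.
        System.out.println(polyA.get(0) == polyA.get(3));
        // The same token in another alphabet is a different object.
        System.out.println(Sym.of("DNA", 'a') == Sym.of("RNA", 'a'));
    }
}
```

Because equal symbols are the same object, identity comparison with `==` is enough, and a long run of one residue costs one reference per position rather than one object per position.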


Factory and Static methods

Sometimes it is useful to prevent a user from directly constructing an object via a constructor:
  If the construction is complex.
  If the choice of the optimal implementation is best left to the API developer.
  If important resources are best protected from end users, e.g. Singletons/Flyweights.

Rather than instantiating the object via its constructor, a static method or Factory object is used.
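
A minimal plain-Java sketch of this pattern follows. All names (`FactorySketch`, `SymbolStore`, `ArrayStore`, `EmptyStore`) are hypothetical and are not BioJava classes; the point is only that the caller sees an interface while the factory method decides which implementation to hand back.

```java
public class FactorySketch {

    /** Hypothetical interface; callers only ever see this type. */
    interface SymbolStore {
        char symbolAt(int index); // 1-based position
        int length();
    }

    /** General-purpose implementation backed by a char array. */
    private static final class ArrayStore implements SymbolStore {
        private final char[] data;

        private ArrayStore(String s) {
            this.data = s.toCharArray();
        }

        public char symbolAt(int index) {
            return data[index - 1];
        }

        public int length() {
            return data.length;
        }
    }

    /** Implementation for the empty sequence; a single shared instance. */
    private static final class EmptyStore implements SymbolStore {
        private static final EmptyStore INSTANCE = new EmptyStore();

        private EmptyStore() {
        }

        public char symbolAt(int index) {
            throw new IndexOutOfBoundsException("empty store");
        }

        public int length() {
            return 0;
        }
    }

    /** Factory method: the API developer, not the caller, picks the implementation. */
    public static SymbolStore create(String s) {
        if (s.isEmpty()) {
            return EmptyStore.INSTANCE;
        }
        return new ArrayStore(s);
    }

    public static void main(String[] args) {
        SymbolStore store = create("acgt");
        System.out.println(store.length());
        System.out.println(create("") == create(""));
    }
}
```

The constructors are private, so the only way in is `create`, which leaves the library free to change or add implementations later without breaking callers.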


Examples

Static method: FiniteAlphabet dna = DNATools.getDNA();

Static field: DistributionFactory df = DistributionFactory.DEFAULT;

Factory method: Distribution d = df.createDistribution(dna);


Two Levels of BioJava

Macro type programming:
  Tools classes (SeqIOTools, DistributionTools etc.).
  Static methods for common tasks.

Full programming:
  Lots of customizations and ‘plug and play’ possible.
  More exposure to the sharp edges of the API.
  Less documentation.


Alphabets, Symbols and Sequences


Symbols

In BioJava the DNA residue “A” is an object.

In Bioperl “A” would be a String.

The “A” object is part of the sequence, not the sequence.

“A” from DNA is not equal to “A” from RNA or “A” from Protein.


Why not Strings?

DNA A != RNA A != Protein A

For Strings, "A".equals("A") is always true.

The DNA Alphabet also contains K, Y, W, S, R, M, B, D, H, V, N


Why not Strings?

The object Y contains C and T; the String “Y” doesn’t contain anything.

Translation HashMaps with Strings are flawed:
  In BioJava, GGN translates to Gly.
  With Strings, GGN maps to null.

A fully redundant String-to-String HashMap translation table requires 4096 keys!
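
The difference between a symbol that "contains" other symbols and a bare String can be sketched with bit masks. This is an illustration only, not how BioJava stores ambiguity; the class `AmbiguitySketch` and its methods are hypothetical. The 4096 figure is read here as 16 x 16 x 16, assuming 16 possible codes per codon position (the four bases, the eleven ambiguity codes and a gap); that reading is an assumption.

```java
public class AmbiguitySketch {
    // One bit per unambiguous base.
    static final int A = 1, C = 2, G = 4, T = 8;

    /** Maps an IUPAC nucleotide code to a bit mask of the bases it can stand for. */
    static int mask(char code) {
        switch (Character.toLowerCase(code)) {
            case 'a': return A;
            case 'c': return C;
            case 'g': return G;
            case 't': return T;
            case 'r': return A | G;
            case 'y': return C | T;
            case 's': return C | G;
            case 'w': return A | T;
            case 'k': return G | T;
            case 'm': return A | C;
            case 'b': return C | G | T;
            case 'd': return A | G | T;
            case 'h': return A | C | T;
            case 'v': return A | C | G;
            case 'n': return A | C | G | T;
            default:
                throw new IllegalArgumentException("not an IUPAC nucleotide code: " + code);
        }
    }

    /** True if every base matched by 'inner' is also matched by 'outer'. */
    static boolean contains(char outer, char inner) {
        return (mask(outer) & mask(inner)) == mask(inner);
    }

    public static void main(String[] args) {
        System.out.println(contains('y', 'c'));   // the symbol y covers c
        System.out.println(contains('y', 'a'));   // but not a
        System.out.println("Y".contains("C"));    // the String "Y" knows nothing about C
        System.out.println(16 * 16 * 16);         // keys in a fully redundant codon table
    }
}
```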


Symbols are Canonical

DNATools.a() == DNATools.a();
  There is only one instance of ‘a’.

DNATools.a().equals(DNATools.a());

ProteinTools.a() != DNATools.a();

Even on remote JVMs!
  During serialization, Alphabet indexing is transient and ‘reconnected’ via readResolve() methods.


Alphabets

A set of Symbols.

Alphabets can be infinite:
  DoubleAlphabet, IntegerAlphabet

Some Alphabets have a finite number of Symbols:
  DNA, RNA etc.

Alphabet and FiniteAlphabet interfaces.


org.biojava.bio.symbol.Alphabet

boolean contains(Symbol s) Returns whether or not this Alphabet contains the symbol.

List getAlphabets() Return an ordered List of the alphabets which make up a compound alphabet.

Symbol getAmbiguity(java.util.Set syms) Get a symbol that represents the set of symbols in syms.

Symbol getGapSymbol() Get the 'gap' ambiguity symbol that is most appropriate for this alphabet

String getName() Get the name of the alphabet.

Symbol getSymbol(java.util.List rl) Get a symbol from the Alphabet which corresponds to the specified ordered list of symbols.

SymbolTokenization getTokenization(java.lang.String name) Get a SymbolTokenization by name. 

void validate(Symbol s) Throws a precanned IllegalSymbolException if the symbol is not contained within this Alphabet.


org.biojava.bio.symbol.FiniteAlphabet

In addition to the previous methods

void addSymbol(Symbol s) Adds a symbol to this Alphabet

Iterator iterator() Retrieve an Iterator over the Symbols in this Alphabet. 

void removeSymbol(Symbol s) Remove a symbol from this alphabet.

int size() The number of symbols in the alphabet.


The Default Alphabets

DNA (a, c, g, t)
RNA (a, c, g, u)
PROTEIN (all amino acids including ‘Sel’)
PROTEIN-TERM (all PROTEIN plus “*”)
STRUCTURE (PDB structure symbols)
Alphabet of all integers (infinite Alphabet)
  Can generate SubIntegerAlphabets
Alphabet of all doubles (infinite Alphabet)


Getting the common Alphabets

    import org.biojava.bio.symbol.*;
    import java.util.*;
    import org.biojava.bio.seq.*;

    public class AlphabetExample {
      public static void main(String[] args) {
        Alphabet dna, rna, prot;

        //get the DNA alphabet by name
        dna = AlphabetManager.alphabetForName("DNA");

        //get the RNA alphabet by name
        rna = AlphabetManager.alphabetForName("RNA");

        //get the Protein alphabet by name
        prot = AlphabetManager.alphabetForName("PROTEIN");
        //get the protein alphabet that includes the * termination Symbol
        prot = AlphabetManager.alphabetForName("PROTEIN-TERM");

        //get those same Alphabets from the Tools classes
        dna = DNATools.getDNA();
        rna = RNATools.getRNA();
        prot = ProteinTools.getAlphabet();
        //or the one with the * symbol
        prot = ProteinTools.getTAlphabet();
      }
    }


SymbolLists are made of Symbols

org.biojava.bio.symbol.SymbolList

A sequence of Symbols from the same Alphabet.

Uses biological coordinates from 1 to length (cf. String, from 0 to length-1).
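
The coordinate difference is easy to get wrong, so here is a small plain-Java sketch of the conversion. The class `BioCoordinates` and its methods are hypothetical helpers operating on ordinary Strings; they only mimic the 1-based, end-inclusive convention described above.

```java
public class BioCoordinates {

    /** Character at a 1-based position (position 1 is the first residue). */
    static char symbolAt(String seq, int index) {
        if (index < 1 || index > seq.length()) {
            throw new IndexOutOfBoundsException("position " + index
                    + " is outside 1.." + seq.length());
        }
        return seq.charAt(index - 1);
    }

    /** Region from start to end, both 1-based and both inclusive. */
    static String subStr(String seq, int start, int end) {
        if (start < 1 || end > seq.length() || start > end) {
            throw new IndexOutOfBoundsException("bad range " + start + ".." + end);
        }
        // String.substring is 0-based with an exclusive end,
        // so only the start needs shifting.
        return seq.substring(start - 1, end);
    }

    public static void main(String[] args) {
        String dna = "atgccg";
        System.out.println(symbolAt(dna, 1));
        System.out.println(subStr(dna, 1, 3));
        System.out.println(subStr(dna, 4, 6));
    }
}
```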


Doesn’t this waste memory?

A SymbolList is not really a List of Symbol Objects.

Rather a List of Object references.

Still a bit heavier than a char[] but not serious.

[Figure: the sequence AACGTGGGTTCCAACT drawn as references to the four Symbol objects A, C, G and T]


The Bigger Picture

[Figure: the sequence AACGTGGGTTCCAACT and the Symbol objects A, C, G and T, together with AlphabetManager and its “DNA” and “Protein” alphabets]


The SymbolList interface

void edit(Edit edit) Apply an edit to the SymbolList as specified by the edit object.

Alphabet getAlphabet() The alphabet that this SymbolList is over.

Iterator iterator() An Iterator over all Symbols in this SymbolList.

int length() The number of symbols in this SymbolList.

String seqString() Stringify this symbol list.

SymbolList subList(int start, int end) Return a new SymbolList for the symbols start to end inclusive.

String subStr(int start, int end) Return a region of this symbol list as a String.

Symbol symbolAt(int index) Return the symbol at index, counting from 1.

List toList() Returns a List of symbols.


String to SymbolList

    import org.biojava.bio.seq.*;
    import org.biojava.bio.symbol.*;

    public class StringToSymbolList {
      public static void main(String[] args) {
        try {
          //create a DNA SymbolList from a String
          SymbolList dna = DNATools.createDNA("atcggtcggctta");

          //create a RNA SymbolList from a String
          SymbolList rna = RNATools.createRNA("auugccuacauaggc");

          //create a Protein SymbolList from a String
          SymbolList aa = ProteinTools.createProtein("AGFAVENDSA");
        }
        catch (IllegalSymbolException ex) {
          //this will happen if you use a character in one of your strings
          //that is not an accepted IUB character for that Symbol.
          ex.printStackTrace();
        }
      }
    }


SymbolList to String

    import org.biojava.bio.symbol.*;

    public class SymbolListToString {
      public static void main(String[] args) {
        SymbolList sl = null;

        //code here to instantiate sl

        //convert sl into a String
        String s = sl.seqString();
      }
    }


The Sequence Interface

A Sequence is a SymbolList with more information.

In addition to the methods of Annotatable and SymbolList:

String getName() The name of this sequence.

String getURN() A Uniform Resource Identifier (URI) which identifies the sequence represented by this object.

Also implements FeatureHolder which allows addition of Feature Objects.


Quickly generate a Sequence

    import org.biojava.bio.seq.*;
    import org.biojava.bio.symbol.*;

    public class StringToSequence {
      public static void main(String[] args) {
        try {
          //create a DNA sequence with the name dna_1
          Sequence dna = DNATools.createDNASequence("atgctg", "dna_1");

          //create an RNA sequence with the name rna_1
          Sequence rna = RNATools.createRNASequence("augcug", "rna_1");

          //create a Protein sequence with the name prot_1
          Sequence prot = ProteinTools.createProteinSequence("AFHS", "prot_1");
        }
        catch (IllegalSymbolException ex) {
          //an exception is thrown if you use a non IUB symbol
          ex.printStackTrace();
        }
      }
    }


More Complex Symbols and Alphabets


Ambiguity Symbols

Ambiguous or Fuzzy data is a fact of life, especially with sequencing.

DNA traces can contain symbols such as n, r, w, v, h, k, y etc.

In BioJava DNA symbols a, c, g, t are AtomicSymbols.

Ambiguous symbols like y are BasisSymbols.


BasisSymbols

A BasisSymbol may be represented as a list of one or more Symbols.

BasisSymbol extends Symbol.

Ambiguity Symbols are always BasisSymbols.

getSymbols() The list of symbols that this symbol is composed from.


AtomicSymbols

AtomicSymbols are not ambiguous.

They cannot be further divided into Symbols that are valid members of the parent Alphabet.

In the case of compound Alphabets they can be divided into valid Symbols from component Alphabets.


AtomicSymbols

The AtomicSymbol interface extends BasisSymbol but adds no new methods, only behaviour contracts.

AtomicSymbol instances guarantee that getMatches() returns an Alphabet containing just that Symbol and each element of the List returned by getSymbols() is also atomic.


Atomic and Basis

[Figure: the sequence AATW over the “DNA” alphabet held by AlphabetManager; A and T are labelled AtomicSymbols, W is labelled BasisSymbol]


Translating Ambiguity

BioJava handles translation of ambiguity very smoothly.

DNA ‘n’ = [a, c, g, t]
  Transcribes to RNA ‘n’ = [a, c, g, u]
  ggn translates to Gly
  agn translates to [Ser, Arg]

Most protein ambiguities have no ‘token’ and are printed as ‘X’.
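
The two translations quoted above can be reproduced with a small plain-Java sketch that expands an ambiguous codon into every concrete codon it matches and collects the resulting amino acids. This is not BioJava's mechanism; `AmbiguousTranslation` is a hypothetical class, its codon table is deliberately partial (only the gg* and ag* codons of the standard genetic code), and only a few ambiguity codes are supported.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class AmbiguousTranslation {

    // Partial table: only the gg* and ag* codons of the standard genetic code.
    private static final Map<String, String> TABLE = new HashMap<>();
    static {
        for (char b : "acgt".toCharArray()) {
            TABLE.put("gg" + b, "Gly");
        }
        TABLE.put("aga", "Arg");
        TABLE.put("agg", "Arg");
        TABLE.put("agc", "Ser");
        TABLE.put("agt", "Ser");
    }

    /** Bases a code can stand for (only the codes needed for this sketch). */
    static String expand(char code) {
        switch (code) {
            case 'n': return "acgt";
            case 'r': return "ag";
            case 'y': return "ct";
            case 'a':
            case 'c':
            case 'g':
            case 't': return String.valueOf(code);
            default:
                throw new IllegalArgumentException("unsupported code: " + code);
        }
    }

    /** All amino acids a possibly ambiguous codon can translate to, sorted by name. */
    static Set<String> translate(String codon) {
        Set<String> result = new TreeSet<>();
        for (char x : expand(codon.charAt(0)).toCharArray()) {
            for (char y : expand(codon.charAt(1)).toCharArray()) {
                for (char z : expand(codon.charAt(2)).toCharArray()) {
                    String concrete = "" + x + y + z;
                    String aminoAcid = TABLE.get(concrete);
                    if (aminoAcid == null) {
                        throw new IllegalArgumentException(
                                "codon not in the partial table: " + concrete);
                    }
                    result.add(aminoAcid);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(translate("ggn")); // a single amino acid
        System.out.println(translate("agn")); // an ambiguous result
    }
}
```

A plain String-keyed map would return null for "ggn"; expanding the ambiguity first is what lets the unambiguous answer fall out.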


CrossProduct Alphabets

A CrossProductAlphabet is a combination of two or more Alphabets.

Any type of CrossProductAlphabet is possible

  Dimers (DNA x DNA)
  Codon (DNA x DNA x DNA)
  Conditional ((DNA x DNA) x DNA)
  Mixed ((DNA x DNA x DNA) x PROTEIN)
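
To make the size of such alphabets concrete, here is a plain-Java sketch that enumerates a cross product of small alphabets given as lists of Strings. `CrossProductSketch` is a hypothetical helper, not BioJava's CrossProductAlphabet; it only shows that the number of atomic combinations is the product of the component sizes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CrossProductSketch {

    /** All ordered combinations taking one symbol from each component alphabet. */
    static List<String> crossProduct(List<List<String>> alphabets) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (List<String> alphabet : alphabets) {
            List<String> next = new ArrayList<>();
            for (String prefix : result) {
                for (String symbol : alphabet) {
                    next.add(prefix + symbol);
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> dna = Arrays.asList("a", "c", "g", "t");
        // Dimers: 4 x 4 combinations.
        System.out.println(crossProduct(Arrays.asList(dna, dna)).size());
        // Codons: 4 x 4 x 4 combinations.
        System.out.println(crossProduct(Arrays.asList(dna, dna, dna)).size());
    }
}
```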


Finite and Compound Alphas

[Figure: A, C, G and T labelled as DNA AtomicSymbols; the sequence [AAC][GTG]GGTTCCAACT; ACA and GTG labelled as (DNA x DNA x DNA) AtomicSymbols; GNG labelled as a (DNA x DNA x DNA) BasisSymbol]


What are they good for?

Codon Symbols (DNA x DNA x DNA).

Many analysis classes such as Count and Distribution use Symbol as an argument.
  A hexamer can be an AtomicSymbol.

Phred is DNA x Integer.

1st and higher order Markov Models use CrossProductAlphabets.


How do I make a CrossProductAlphabet?

    import java.util.*;
    import org.biojava.bio.seq.*;
    import org.biojava.bio.symbol.*;

    public class CrossProduct {
      public static void main(String[] args) {

        //make a CrossProductAlphabet from a List
        List l = Collections.nCopies(3, DNATools.getDNA());
        Alphabet codon = AlphabetManager.getCrossProductAlphabet(l);

        //get the same Alphabet by name
        Alphabet codon2 =
            AlphabetManager.generateCrossProductAlphaFromName("(DNA x DNA x DNA)");

        //show that the two Alphabets are canonical
        System.out.println(codon == codon2);
      }
    }


Making Triplet Views on a SymbolList

    import org.biojava.bio.seq.*;
    import org.biojava.bio.symbol.*;

    public class CodonView {
      public static void main(String[] args) {
        try {
          //make a DNA SymbolList
          SymbolList dna = DNATools.createDNA("atgcccgcgtaa");
          System.out.println("Length of dna " + dna.length());

          //get a Codon View (window size of three)
          SymbolList codons = SymbolListViews.windowedSymbolList(dna, 3);
          System.out.println("Length of codons " + codons.length());

          //get a Triplet View
          SymbolList triplets = SymbolListViews.orderNSymbolList(dna, 3);
          System.out.println("Length of triplets " + triplets.length());
        }
        catch (Exception ex) {
          ex.printStackTrace();
        }
      }
    }
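
The two views differ in how the window moves. The plain-Java sketch below works on Strings and assumes that the codon view uses non-overlapping windows while the triplet view slides one position at a time; `WindowSketch` and its method names are hypothetical, and rejecting a length that is not a multiple of the window size is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowSketch {

    /** Non-overlapping windows; the length must be a multiple of the window size. */
    static List<String> windowed(String seq, int size) {
        if (seq.length() % size != 0) {
            throw new IllegalArgumentException(
                    "length " + seq.length() + " is not a multiple of " + size);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < seq.length(); i += size) {
            out.add(seq.substring(i, i + size));
        }
        return out;
    }

    /** Overlapping windows that advance one position at a time. */
    static List<String> orderN(String seq, int order) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + order <= seq.length(); i++) {
            out.add(seq.substring(i, i + order));
        }
        return out;
    }

    public static void main(String[] args) {
        String dna = "atgcccgcgtaa";
        System.out.println(windowed(dna, 3)); // length / 3 windows
        System.out.println(orderN(dna, 3));   // length - 3 + 1 windows
    }
}
```

Under that assumption a 12-residue sequence gives 4 codon windows and 10 sliding triplets.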


Getting a Symbol for a Codon

    import java.util.*;
    import org.biojava.bio.seq.*;
    import org.biojava.bio.symbol.*;

    public class MakeATG {
      public static void main(String[] args) {
        //make a CrossProductAlphabet from a List
        List l = Collections.nCopies(3, DNATools.getDNA());
        Alphabet codon = AlphabetManager.getCrossProductAlphabet(l);

        //get the codon made of atg
        List syms = new ArrayList(3);
        syms.add(DNATools.a());
        syms.add(DNATools.t());
        syms.add(DNATools.g());

        Symbol atg = null;
        try {
          atg = codon.getSymbol(syms);
        }
        catch (IllegalSymbolException ex) {
          //used Symbol from Alphabet that is not a component of codon
          ex.printStackTrace();
        }
        System.out.println("Name of atg: " + atg.getName());
      }
    }


Breaking a Codon into its Parts

    import java.util.*;
    import org.biojava.bio.seq.*;
    import org.biojava.bio.symbol.*;

    public class BreakingComponents {
      public static void main(String[] args) {
        //make the 'codon' alphabet
        List l = Collections.nCopies(3, DNATools.getDNA());
        Alphabet alpha = AlphabetManager.getCrossProductAlphabet(l);

        //get the first symbol in the alphabet
        Iterator iter = ((FiniteAlphabet)alpha).iterator();
        AtomicSymbol codon = (AtomicSymbol)iter.next();
        System.out.print(codon.getName() + " is made of: ");

        //break it into a list of its components
        List symbols = codon.getSymbols();
        for (int i = 0; i < symbols.size(); i++) {
          if (i != 0)
            System.out.print(", ");
          Symbol sym = (Symbol)symbols.get(i);
          System.out.print(sym.getName());
        }
      }
    }


Basic Sequence Operations


Getting a section of a SymbolList

symbolAt(int i) Returns a Symbol.

subList(int min, int max) Returns a SymbolList.

subStr(int min, int max) Returns the subsection tokenized to a String.


Transcription

In BioJava DNA sequences and RNA sequences are from different Alphabets. To convert between them:

    //make a DNA SymbolList
    SymbolList dna = DNATools.createDNA("atgccgaatcgtaa");

    //convert it to RNA
    SymbolList rna = DNATools.toRNA(dna);

    //just to prove it worked
    System.out.println(rna.seqString()); //augccgaaucguaa

    //biological transcription (i.e. copy and reverse strand)
    rna = DNATools.transcribeToRNA(dna);  //5' atgccgaatcgtaa 3'
    System.out.println(rna.seqString());  //5' uuacgauucggcau 3'


Reverse Complement

    import org.biojava.bio.symbol.*;
    import org.biojava.bio.seq.*;

    public class ReverseComplement {
      public static void main(String[] args) throws Exception {
        SymbolList forward = DNATools.createDNA("atcgctagcgatcg");

        //two step
        SymbolList reverse = SymbolListViews.reverse(forward);
        SymbolList revc1 = DNATools.complement(reverse);

        //one step
        SymbolList revc2 = DNATools.reverseComplement(forward);

        //test for equivalence
        System.out.println(revc1.equals(revc2));
      }
    }
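
For readers who want to check the strand arithmetic by hand, the operation can be sketched on plain Strings. `RevCompSketch` is a hypothetical class that handles only the unambiguous bases a, c, g and t; it is a teaching aid, not a substitute for the Symbol-based methods above.

```java
public class RevCompSketch {

    /** Watson-Crick complement of a single unambiguous DNA base. */
    static char complement(char base) {
        switch (base) {
            case 'a': return 't';
            case 't': return 'a';
            case 'c': return 'g';
            case 'g': return 'c';
            default:
                throw new IllegalArgumentException("not an unambiguous DNA base: " + base);
        }
    }

    /** Reads the sequence backwards, complementing each base. */
    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            sb.append(complement(dna.charAt(i)));
        }
        return sb.toString();
    }

    /** Reverse complement written in the RNA alphabet (t replaced by u). */
    static String transcribe(String dna) {
        return reverseComplement(dna).replace('t', 'u');
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("atgccgaatcgtaa"));
        System.out.println(transcribe("atgccgaatcgtaa"));
    }
}
```

Applying `reverseComplement` twice returns the original sequence, which is a handy sanity check.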


Translation

RNATools contains the “Universal” RNA to Protein TranslationTable.

Standard procedure is to transcribe DNA to RNA and then translate.


Translation Example

    import org.biojava.bio.symbol.*;
    import org.biojava.bio.seq.*;

    public class Translate {
      public static void main(String[] args) {
        try {
          //create a DNA SymbolList
          SymbolList symL = DNATools.createDNA("atggccattgaatga");

          //transcribe to RNA (same call as on the Transcription slide)
          symL = DNATools.toRNA(symL);

          //translate to protein
          symL = RNATools.translate(symL);

          //prove that it worked
          System.out.println(symL.seqString());
        }
        catch (Exception ex) {
          ex.printStackTrace();
        }
      }
    }


Sequence I/O


Don’t ever write another Parser

If you can avoid it!

BioJava supports:
  Genbank, GenPept, RefSeq, EMBL, SwissProt, PDB, Fasta, ABI, LocusLink, Unigene (requires Java 1.4)
  GAME, AGAVE
  Blast, Fasta, HMMER (models and results), BlastXML, MEME, Phred
  OBDA, BioIndex, BioSQL, DAS, GFF, XFF
  Ensembl (with biojava-ensembl package)
  StAX/Tag value
  RMI and Serialization


Simple I/O

Most of BioJava’s simpler I/O operations are conveniently wrapped up behind static methods from the SeqIOTools class.

SeqIOTools can read and write:
  Fasta (protein or DNA)
  EMBL
  GenBank (flat file and XML)
  SwissProt
  GenPept
  MSF (protein or DNA)
  Fasta Alignments


SeqIOTools Reader Methods

    SequenceIterator i = SeqIOTools.readGenbank(br);
    SequenceIterator i = SeqIOTools.readGenpept(br);
    SequenceIterator i = SeqIOTools.readSwissprot(br);
    SequenceIterator i = SeqIOTools.readEmbl(br);
    etc…

    SequenceIterator i = (SequenceIterator)
        SeqIOTools.fileToBiojava("fasta", "dna", br);

    Alignment a = (Alignment)
        SeqIOTools.fileToBiojava("MSF", "rna", br);


Features, Locations, Annotations


Features and Annotations

Sequence data often comes with added information about the various properties of the sequence (Genbank, SwissProt etc).

BioJava divides this information into global properties (Annotations) and Localized properties (Features).


Annotatable

Annotatable is a “mix-in” interface that indicates the implementing object contains an Annotation object.

It defines one method. Annotation getAnnotation();


Annotations

org.biojava.bio.Annotation

Annotations are used for global properties:
  Species, accession number, xrefs, date, publication.

Key-value maps:
  Key and value are objects but are almost always Strings.

Annotation.EMPTY_ANNOTATION:
  static convenience class
  good place holder, avoids null pointer exceptions
  immutable


Annotation API

Map asMap() Return a map that contains the same key/values as this Annotation. 

boolean containsProperty(java.lang.Object key) Returns whether the property is defined.

Object getProperty(java.lang.Object key) Retrieve the value of a property by key. 

Set keys() Get a set of key objects. 

void removeProperty(java.lang.Object key) Delete a property 

void setProperty(java.lang.Object key, java.lang.Object value) Set the value of a property.


FeatureHolder

FeatureHolder is another “mix-in” interface which allows the implementing object to hold Features.

Sequence implements FeatureHolder.

Features are created by FeatureHolders.

FeatureHolders can be filtered.


FeatureHolder methods

boolean containsFeature(Feature f) Check if the feature is present in this holder.

int countFeatures() Count how many features are contained.

Feature createFeature(Feature.Template ft)  Create a new Feature, and add it to this FeatureHolder.

Iterator features()  Iterate over the features in no well defined order.

FeatureHolder filter(FeatureFilter filter)  Query this set of features using a supplied FeatureFilter. 

FeatureHolder filter(FeatureFilter fc, boolean recurse)  Return a new FeatureHolder that contains all of the children of this one that passed the filter fc.

FeatureFilter getSchema() Return a schema-filter for this FeatureHolder.

void removeFeature(Feature f)  Remove a feature from this FeatureHolder.


Features are Annotatable

Features implement Annotatable.
  Can hold an annotation.
  Global annotations of a Feature: /note: /db_xref: etc.


Features may be nested

Features implement FeatureHolder!

Therefore Features may hold nested Features.

c.f. the AWT Menu is a MenuItem.

e.g. a gene has exons and introns.

Filtering can be recursive.

A Feature cannot hold itself (directly or indirectly).
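The nesting rule can be sketched in plain Java. This is only an illustration of the idea, assuming a hypothetical NestedFeatureSketch class that plays the role of both Feature and FeatureHolder; it does not use BioJava, and the intermediate "mRNA" level is invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for a Feature that is also a FeatureHolder.
// This is NOT the BioJava class; names and behaviour are assumptions.
final class NestedFeatureSketch {
    final String type;
    final List<NestedFeatureSketch> children = new ArrayList<>();
    NestedFeatureSketch parent; // null for a top-level feature

    NestedFeatureSketch(String type) {
        this.type = type;
    }

    // Add a child, refusing any nesting that would make a feature
    // hold itself directly or through its ancestors.
    void add(NestedFeatureSketch child) {
        for (NestedFeatureSketch f = this; f != null; f = f.parent) {
            if (f == child) {
                throw new IllegalArgumentException("a feature cannot hold itself");
            }
        }
        if (child.parent != null) {
            throw new IllegalStateException("feature already has a parent");
        }
        child.parent = this;
        children.add(child);
    }

    // Collect matching features; with recurse == false only the direct
    // children are tested, otherwise every descendant is tested.
    List<NestedFeatureSketch> filter(Predicate<NestedFeatureSketch> p, boolean recurse) {
        List<NestedFeatureSketch> out = new ArrayList<>();
        for (NestedFeatureSketch c : children) {
            if (p.test(c)) {
                out.add(c);
            }
            if (recurse) {
                out.addAll(c.filter(p, true));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        NestedFeatureSketch gene = new NestedFeatureSketch("gene");
        NestedFeatureSketch mrna = new NestedFeatureSketch("mRNA");
        gene.add(mrna);
        mrna.add(new NestedFeatureSketch("exon"));
        mrna.add(new NestedFeatureSketch("intron"));
        mrna.add(new NestedFeatureSketch("exon"));

        Predicate<NestedFeatureSketch> isExon = f -> f.type.equals("exon");
        System.out.println("direct exons: " + gene.filter(isExon, false).size());
        System.out.println("nested exons: " + gene.filter(isExon, true).size());
    }
}
```

Walking up the parent chain before adding is one simple way to enforce the "cannot hold itself" rule; the exons are only found when the filter is applied recursively.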


Location API

Locations are objects that specify a minimum and maximum bound on a region of sequence.

Contains some useful methods, particularly getMin() and getMax().

Many methods have been deprecated and are now delegated to LocationTools.

LocationTools is the best place to get new instances of a Location.

PointLocation, RangeLocation, CircularLocation, CompoundLocation.


LocationTools

static boolean areEqual(Location locA, Location locB)    Return whether two locations are equal.

static boolean contains(Location locA, Location locB)     Return true iff all indices in locB are also contained by locA.

static Location flip(Location loc, int len)     Flips a location relative to a length.

static Location intersection(Location locA, Location locB)      Return the intersection of two locations.

static CircularLocation makeCircularLocation(int min, int max, int seqLength)      A simple method to generate a RangeLocation wrapped in a CircularLocation

static Location makeLocation(int min, int max)      Return a contiguous Location from min to max.

static boolean overlaps(Location locA, Location locB)      Determines whether the locations overlap or not.

static Location subtract(Location x, Location y)    Subtract one location from another.

static Location union(java.util.Collection locs)      The n-way union of a Collection of locations.

static Location union(Location locA, Location locB)      Return the union of two locations.
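To make the semantics of a few of these operations concrete, here is a minimal plain-Java sketch on closed, 1-based integer ranges. RangeSketch is a hypothetical class written for this example, not part of BioJava; in particular the flip formula and the use of null for an empty result are assumptions, so consult the LocationTools Javadoc for the real behaviour.

```java
// Hypothetical closed interval [min, max] in 1-based sequence coordinates.
// It imitates the behaviour described for LocationTools; it is not BioJava code.
final class RangeSketch {
    final int min;
    final int max;

    RangeSketch(int min, int max) {
        if (min > max) {
            throw new IllegalArgumentException("min must not exceed max");
        }
        this.min = min;
        this.max = max;
    }

    // True iff every index in b is also in a.
    static boolean contains(RangeSketch a, RangeSketch b) {
        return a.min <= b.min && b.max <= a.max;
    }

    // True iff the two ranges share at least one index.
    static boolean overlaps(RangeSketch a, RangeSketch b) {
        return a.min <= b.max && b.min <= a.max;
    }

    // Shared indices, or null when there are none
    // (assumption: null stands in for an empty location).
    static RangeSketch intersection(RangeSketch a, RangeSketch b) {
        if (!overlaps(a, b)) {
            return null;
        }
        return new RangeSketch(Math.max(a.min, b.min), Math.min(a.max, b.max));
    }

    // Mirror a range within a sequence of length len (assumed formula).
    static RangeSketch flip(RangeSketch r, int len) {
        return new RangeSketch(len - r.max + 1, len - r.min + 1);
    }

    @Override
    public String toString() {
        return min + ".." + max;
    }

    public static void main(String[] args) {
        RangeSketch a = new RangeSketch(3, 8);
        RangeSketch b = new RangeSketch(6, 12);
        System.out.println("overlaps:     " + overlaps(a, b));
        System.out.println("intersection: " + intersection(a, b));
        System.out.println("flip(3..8,19): " + flip(a, 19));
    }
}
```

In this sketch 3..8 and 6..12 intersect in 6..8, and flipping 3..8 within a sequence of length 19 gives 12..17.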


Location Example

import org.biojava.bio.symbol.*;
import org.biojava.bio.seq.*;

public class SpecifyRange {
  public static void main(String[] args) {
    try {
      //make a RangeLocation specifying the residues 3-8
      Location loc = LocationTools.makeLocation(3,8);
      //print the location
      System.out.println("Location: "+loc.toString());

      //make a SymbolList
      SymbolList sl = RNATools.createRNA("gcagcuaggcggaaggagc");
      System.out.println("SymbolList: "+sl.seqString());

      //get the SymbolList specified by the Location
      SymbolList sym = loc.symbols(sl);
      System.out.println("Symbols specified by Location: "+sym.seqString());
    }
    catch (IllegalSymbolException ex) {
      //illegal symbol used to make sl
      ex.printStackTrace();
    }
  }
}


Filtering Features

FeatureHolders have a filter method that accepts a FeatureFilter as an argument.

Features that are accepted by the FeatureFilter are returned as a new FeatureHolder.

Filtering may be done recursively so that nested Features are subjected to the same FeatureFilter.


FeatureFilters

FeatureFilter is an interface that specifies one method: boolean accept(Feature f)

There are 26 implementations of FeatureFilter in BioJava available as inner classes of the FeatureFilter interface.

Most commonly used are ByType, BySource, StrandFilter, OverlapsLocation, ContainedByLocation.

Also boolean logic filters: And, Or, Not
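The way single-method filters compose with boolean logic can be sketched without BioJava. Everything below (FilterSketch, Feat, Filter and the factory methods) is a hypothetical stand-in modelled on the accept(Feature f) idea, not the library's own FeatureFilter classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of composable one-method filters.
// All names here are assumptions made for the example.
final class FilterSketch {

    // Hypothetical minimal feature: a type plus a closed 1-based range.
    static final class Feat {
        final String type;
        final int min;
        final int max;

        Feat(String type, int min, int max) {
            this.type = type;
            this.min = min;
            this.max = max;
        }
    }

    // One-method filter interface, modelled on accept(Feature f).
    interface Filter {
        boolean accept(Feat f);
    }

    static Filter byType(String type) {
        return f -> f.type.equals(type);
    }

    static Filter overlaps(int min, int max) {
        return f -> f.min <= max && min <= f.max;
    }

    // Boolean logic filters built from other filters.
    static Filter and(Filter a, Filter b) {
        return f -> a.accept(f) && b.accept(f);
    }

    static Filter or(Filter a, Filter b) {
        return f -> a.accept(f) || b.accept(f);
    }

    static Filter not(Filter a) {
        return f -> !a.accept(f);
    }

    // Return the accepted features as a new list, leaving the input untouched.
    static List<Feat> filter(List<Feat> in, Filter flt) {
        List<Feat> out = new ArrayList<>();
        for (Feat f : in) {
            if (flt.accept(f)) {
                out.add(f);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Feat> feats = Arrays.asList(
            new Feat("exon", 10, 50),
            new Feat("intron", 51, 90),
            new Feat("exon", 91, 120));

        // Exons that do not overlap positions 1-60.
        Filter query = and(byType("exon"), not(overlaps(1, 60)));
        for (Feat f : filter(feats, query)) {
            System.out.println(f.type + " " + f.min + ".." + f.max);
        }
    }
}
```

Because each combinator returns another filter, arbitrarily complex queries can be built from small pieces and passed to a single filter call.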


Analysis and Distributions


Distributions and Counts

The Distribution and Count interfaces are from the org.biojava.bio.dist package.

Counts are maps from AtomicSymbols to counts.

Distributions are maps from Symbols to frequencies.


Distributions

Distributions are central to analysis.

Map Symbols to frequencies.

Can be trained, or weights can be set.

Used heavily in the dp (dynamic programming) package: HMM transitions and emissions.

Many implementations; frequently used are: SimpleDistribution, OrderNDistribution, UniformDistribution.


Distribution API

Alphabet getAlphabet() The alphabet from which this spectrum emits symbols.

Distribution getNullModel() Retrieve the null model Distribution that this Distribution recognizes.

double getWeight(Symbol s) Return the probability that Symbol s is emitted by this spectrum.

void registerWithTrainer(DistributionTrainerContext dtc) Register this distribution with a training context.

Symbol sampleSymbol() Sample a symbol from this state's probability distribution.

void setNullModel(Distribution nullDist) Set the null model Distribution that this Distribution recognizes.

void setWeight(Symbol s, double w) Set the probability or odds that Symbol s is emitted by this state.


DistributionFactory

Generally a Distribution is created using a DistributionFactory.

The DistributionFactory interface contains a static inner class called DEFAULT that implements DistributionFactory

DistributionFactory df = DistributionFactory.DEFAULT;
Distribution d = df.createDistribution(dna.getAlphabet());


Distribution Training

Distributions can be trained on observed sequences using a DistributionTrainerContext.

One or more Distributions can be registered with the DTC.

//register the Distributions with the trainer
dtc.registerDistribution(dnaDist);


DistributionTrainerContext

A DistributionTrainer is assigned to each registered Distribution by the DTC.

If unusual training behaviour is required you can register your own DistributionTrainer at the same time.

The DTC can also add pseudocounts if needed.

Ambiguities are automagically handled. Counts are split according to the null model.
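What "splitting counts" and "adding pseudocounts" amount to can be sketched in plain Java. This is a simplified illustration, not BioJava's trainer: it assumes a uniform null model (so an ambiguous symbol is shared equally among the bases it could be), supports only a hypothetical subset of ambiguity codes (y, s, n), and uses invented names throughout.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration of count-based training with ambiguity splitting.
// Assumes a uniform null model; all names are made up for this example.
final class TrainingSketch {
    static final String BASES = "acgt";

    // Expand a symbol to the bases it may stand for
    // (only a small, hypothetical subset of ambiguity codes).
    static String expand(char c) {
        switch (c) {
            case 'y': return "ct";
            case 's': return "cg";
            case 'n': return "acgt";
            default:
                if (BASES.indexOf(c) < 0) {
                    throw new IllegalArgumentException("unsupported symbol: " + c);
                }
                return String.valueOf(c);
        }
    }

    // Count symbols, splitting each ambiguous one equally among its members,
    // then normalise the counts into weights that sum to 1.
    static Map<Character, Double> train(String seq, double pseudocount) {
        Map<Character, Double> counts = new LinkedHashMap<>();
        for (char b : BASES.toCharArray()) {
            counts.put(b, pseudocount);
        }
        for (char c : seq.toCharArray()) {
            String members = expand(c);
            double share = 1.0 / members.length(); // uniform split
            for (char m : members.toCharArray()) {
                counts.merge(m, share, Double::sum);
            }
        }
        double total = 0.0;
        for (double v : counts.values()) {
            total += v;
        }
        if (total == 0.0) {
            throw new IllegalArgumentException("nothing to train on");
        }
        Map<Character, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<Character, Double> e : counts.entrySet()) {
            weights.put(e.getKey(), e.getValue() / total);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Same sequence as the training example, without pseudocounts.
        System.out.println(train("atcgctagcgtyagcntatsggca", 0.0));
    }
}
```

With a non-uniform null model the shares would be proportional to the null model's weights for the member symbols rather than equal.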


Training Example

//make a DNA SymbolList
SymbolList dna = DNATools.createDNA("atcgctagcgtyagcntatsggca");

//get a DistributionTrainerContext
DistributionTrainerContext dtc = new SimpleDistributionTrainerContext();

//make the Distribution
Distribution dnaDist =
    DistributionFactory.DEFAULT.createDistribution(dna.getAlphabet());

//register the Distribution with the trainer
dtc.registerDistribution(dnaDist);

//add one count for each symbol in the sequence (positions are 1-based)
for(int j = 1; j <= dna.length(); j++){
  dtc.addCount(dnaDist, dna.symbolAt(j), 1.0);
}

//train the Distribution
dtc.train();


setWeight() Example

FiniteAlphabet a = DNATools.getDNA();
Distribution d = DistributionFactory.DEFAULT.createDistribution(a);

//set the weight of each symbol
d.setWeight(DNATools.a(), 0.3);
d.setWeight(DNATools.c(), 0.2);
d.setWeight(DNATools.g(), 0.2);
d.setWeight(DNATools.t(), 0.3);


DistributionTools

DistributionTools holds static methods for creating and manipulating Distributions.

Tasks include:

Equal emission spectra?

Shannon entropy, information, KL distance.

Generate biased sequences.

Make a Distribution[] from an Alignment (each Distribution represents one position in an Alignment).

Average two or more Distributions.

Randomize a Distribution.

Make a Distribution from a Count.
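For readers unfamiliar with the quantities named above, the following plain-Java sketch computes Shannon entropy and the Kullback-Leibler (KL) divergence, both in bits, for weights held in arrays. It shows the underlying arithmetic only; EntropySketch and its methods are hypothetical names and are not the DistributionTools API.

```java
// Plain-Java illustration of Shannon entropy and KL divergence in bits.
// Not the DistributionTools API; names are assumptions for this example.
final class EntropySketch {

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // H(p) = -sum_i p_i * log2(p_i), with 0 * log(0) taken as 0.
    static double entropy(double[] p) {
        double h = 0.0;
        for (double pi : p) {
            if (pi > 0.0) {
                h -= pi * log2(pi);
            }
        }
        return h;
    }

    // D(p || q) = sum_i p_i * log2(p_i / q_i); terms with p_i == 0 contribute 0.
    // If q_i == 0 where p_i > 0 the result is infinite.
    static double klDivergence(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("distributions differ in size");
        }
        double d = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) {
                d += p[i] * log2(p[i] / q[i]);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        // Weights for a, c, g, t as in the setWeight() example.
        double[] biased = {0.3, 0.2, 0.2, 0.3};
        double[] uniform = {0.25, 0.25, 0.25, 0.25};
        System.out.println("H(biased)  = " + entropy(biased));
        System.out.println("H(uniform) = " + entropy(uniform));
        System.out.println("D(biased || uniform) = " + klDivergence(biased, uniform));
    }
}
```

For a four-symbol alphabet the uniform distribution has the maximum entropy of 2 bits, and the divergence of any distribution from uniform equals 2 bits minus its entropy.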


Serialization of Distributions

Distributions are Serializable: write to and read from binary; RMI.

XMLDistributionWriter: write any Distribution to a stream in XML format.

XMLDistributionReader: SAXParser; read any Distribution from an XML stream.


XML Output

<?xml version="1.0" ?>

<Distribution type="Distribution">

  <alphabet name="DNA" />

  <weight sym="adenine" prob="0.32178516910737204" />

  <weight sym="cytosine" prob="0.04596199299395364" />

  <weight sym="guanine" prob="0.1405504188012911" />

  <weight sym="thymine" prob="0.4917024190973832" />

</Distribution>


What Else??

Dynamic Programming (HMMs)

Bibliography

Alignments

Blast and Fasta parsing


What Else??

BioSQL support

GUI components

Chromatograms

Molecular Biology (pI, mass, restriction enzymes)

Molecular Structure