Experiences with logic programming in bioinformatics

Experiences using logic programming in bioinformatics

Chris Mungall
Berkeley Bioinformatics and Ontologies Group (http://berkeleybop.org)
Lawrence Berkeley National Laboratory

ICLP 2009

Outline
• Biology and biological data integration: a brief introduction
• Obol: First experiences applying LP
• Blipkit: a reusable bioinformatics developer’s toolkit
  – Modular structure
  – I/O and relational database connectivity
• Some applications of Blipkit and LP
  – Genes and genomics
  – Phenotype matching
  – Web applications
• Conclusions
• Where next? Some recommendations for the LP community

The promise and challenges of biological research

• Why study biological systems?
  – Because they’re fascinating
  – Improve health
  – Improve the environment
• BUT: Biology is hard
  – Biological systems are extremely diverse
  – Biology deals with phenomena at multiple levels of granularity
  – There is a deluge of data
• Bioinformatics
  – Biology as an information science
  – Computational methods vital to understanding

Diversity of biological systems

Biology in the small: Molecules
[Figure: DNA; RNA pseudoknot]

Cells and organismal biology
[Figure: neuron with labels myelin sheath, cell nucleus, Schwann cell, axon terminal, node of Ranvier, soma, dendrite, axon]
[Figure: gastrulation, from blastula to gastrula]

Ecosystems

bio-databases
• 1200 Biological Databases published in Nucleic Acids Research
  – many more unpublished
  – many of these are database federations (e.g. Ensembl)
• Heterogeneous systems
  – Storage mechanism:
    • Relational
    • XML
    • Flat files
    • Ad-hoc, semi-structured, natural language
  – Limited APIs
    • lack of standards
    • limited query expressivity
• Poorly integrated
  – Limited integration beyond identifier cross-references
  – Users must manually integrate
  – Bioinformatics runs on perl glue

[Figure labels: genes, metabolic pathways, fruit flies, tumors, mutants]

Data interrogation and discovery

• Sample of tasks
  – Find mutations in regions upstream of neurotransmitter-producing genes
  – Find drug targets or animal models for neurodegenerative diseases
  – What biological pathways are enriched in high acidity environments?
• Answering each of these is difficult
  – Manual aggregation from lots of databases
  – Various kinds of inference required

OBO: Open Biological Ontologies
[Figure: ontologies arranged by type (Object, Process) and by granularity, from small to large: chemical, macromolecule, cellular, organismal, population and environmental]

Obol: First experience with LP in bioinformatics

• Problem
  – Many existing bio-ontologies were in fact more like terminologies
  – Basic axioms, is_a hierarchies
  – Deeper logical structure implicit in terms
    • Long noun phrases, recursively composed
    • “regulation of transcription during G1 phase of mitotic cell cycle”
• Existing solutions (2004)
  – Take advantage of semi-controlled syntax of terms
  – Parse using ad-hoc regular expressions
    • Influence of perl in bioinformatics!
    • But context-free grammars (at least) were required

A better solution: Definite Clause Grammars

• Obol: A collection of domain specific DCGs
• Significant improvement over perl RegExs
  – Declarative
  – More expressive
  – Integration with simple reasoning
  – Bi-directional:
    • can be used for term generation from logical expressions

Example process grammar
process(P) --> regulation(P) | specification(P) | transcription(P) | ...
process(P and during(W)) --> process(P),[during],process(W).
process(P and part_of(W)) --> process(P),[of],process(W).
regulation( regulates(P) ) --> [regulation,of],process(P).
specification( specifies(C) ) --> [specification, of], cell(C).
cell(C and part_of(O)) --> organ(O),cell(C).

“regulation of transcription during G1 phase of mitotic cell cycle”

regulates(transcription) and during(g1_phase and part_of(mitosis))

“regulation of transcription from RNA polymerase II promoter involved in ventral spinal cord interneuron specification”

regulates(transcription and has_signal(rna_pol_ii)) and part_of(specifies(interneuron and part_of(ventral_spinal_cord)))
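The grammar idea can be tried out with a cut-down DCG. This is a minimal sketch for SWI-Prolog, not the Obol grammar itself: the `core//1` nonterminal, the tiny vocabulary, and the priority chosen for the `and` operator are all assumptions made for illustration. The rules are written without left recursion so that plain Prolog (no tabling) can run them.

```prolog
% "and" is not a standard operator; the priority 400 is an assumption.
:- op(400, xfy, and).

% A process is a core process, optionally qualified by "during" or "of".
% The head argument carries the logical expression being built.
process(P and during(W))  --> core(P), [during], process(W).
process(P and part_of(W)) --> core(P), [of], process(W).
process(P)                --> core(P).

% Hypothetical vocabulary covering one example term.
core(regulates(P))  --> [regulation, of], core(P).
core(transcription) --> [transcription].
core(g1_phase)      --> [g1, phase].
core(mitosis)       --> [mitotic, cell, cycle].
```

Calling `phrase(process(P), [regulation, of, transcription, during, g1, phase, of, mitotic, cell, cycle])` binds `P` to the expression shown for the first example above. Because the expression sits in the rule heads, the same rules can also be run with `P` bound and the token list unbound, which illustrates the bi-directional use mentioned earlier.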

Implementations
• Obol v1 : 2005
  – XSB
  – DCGs + tabling (Earley / chart parsing)
  – Basic ontology reasoning (tabling to avoid cycles)
  – Integration into java editing environment (XSB interprolog)
• Obol v2 : 2006
  – Port to SWI-Prolog
  – Web interface
  – Earley algorithm implementation
  – Backward chaining for simple reasoning
  – Forward chaining for full reasoning
• Obol v2.5 : 2007
  – Reversion to plain DCGs
    • careful construction to avoid cycles
• Current
  • Obol java
• Obol v3 : 2009
  • In progress
  • OWL-Centric
  • Built on Thea2
http://wiki.geneontology.org/index.php/Obol

Results

• Obol grammars applied successfully to generate axioms for multiple ontologies
  – particularly the Gene Ontology
  – Still used frequently
• Lessons learned
  – Small amount of basic LP goes a long way
    • LP techniques not widely known in bioinformatics
  – Different LP systems have different strengths
    • Choosing between them is hard – and frustrating

Could LP prove as successful in the wider bioinformatics arena?

– Rule-based analysis pipelines
  • prolog > make
– Integration of ontology reasoning and database queries
  • prolog > datalog > sql
– Pathways
  • graphs, ASP
– Genomics
  • Linear transformations, CLP
– Phylogenetics
  • operations on trees

Toolkit Paradigm: BioPerl
• http://www.bioperl.org/
  – Established 1990s
• Collaborative
  – Open Source, svn repository
  – No funding, all voluntary
• Modular
  – Namespaces
  – Interrelated
• Separation of I/O from models
  – Parsers
  – Writers
  – SQL database bindings
• Publication:
  – The BioPerl toolkit, Stajich et al, Genome Research 2002
  – 1044 citations (google scholar)
• Spinoffs:
  • biojava
  • biopython
  • bioruby
  • bioocaml
  • …
• Parent org
  • open bioinformatics foundation
• Issues
  • object oriented
  • perl!

blipkit: biological programming toolkit

• A general purpose reusable library
  – Takes care of ‘plumbing’: parsing, writing, interface
• Deductive database + application framework
• Highly modular: one package per domain
  – ontologies
  – genomes
  – structures
  – phylogeny and evolution
  – phenotypes
  – systems biology
• SWI-Prolog specific
• http://blipkit.org

Anatomy of a blip domain package
• Model(s) of the domain
  – dependencies to other domain modules
  – extensional and intensional predicates
• I/O
  – parsers/writers for small subset of bioinformatics file formats
    • DCGs or external perl
  – translators for common XML schemas
  – Native prolog serialization of model ‘for free’
• Web UI
• Bridges
  – Relational
  – Other prolog models
  – Ontology models

Domain model modules
• A model consists of extensional + intensional predicates
• Extensional predicates
  – Unit clauses / facts: asserted and/or compiled from fact files
  – Akin to relational tables
• Intensional predicates
  – Declarative: No I/O side effects
• Prolog has no built in extensional/intensional distinction
  – All clauses treated equally
    • Facts conventionally declared dynamic/1 and multifile/1
  – Some metamodeling is useful
    • Easy to roll own
  – A standard metamodel module would be useful
    • optional type system + relational DDL style constraints
    • Works as documentation

Example from systems biology model
:- module(sb_db,
          [ reaction_product/2,
            reaction_reactant/2,
            reaction_modifier/2,
            derivation_link/3,
            …]).

:- use_module(bio(dbmeta)). % metamodel

%% reaction_product(?R,?P) is nondet
% relation between a biochemical reaction and a molecular constituent produced in the reaction
:- extensional(reaction_product/2).

%% reaction_reactant(?R,?P) is nondet
% relation between a biochemical reaction and a molecular constituent that is consumed in the reaction
:- extensional(reaction_reactant/2).

%% reaction_modifier(?R,?P) is nondet
% relation between a biochemical reaction and a molecular constituent that plays a role in the process but is unmodified
:- extensional(reaction_modifier/2).

% --- INTENSIONAL PREDICATES ---

%% derivation_link(?Input,?Output,?Via)
% two species directly linked via a connecting
% reaction (excludes modifiers)
derivation_link(Input,Output,R):-
    reaction_reactant(R,Input),
    reaction_product(R,Output).

%...[snip]…
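The "easy to roll own" metamodel can be illustrated with a few lines of SWI-Prolog. This is a sketch under stated assumptions, not the `bio(dbmeta)` module: the `extensional/1` defined here, the bookkeeping predicate `extensional_predicate/1`, and the two facts are all hypothetical stand-ins.

```prolog
% Hypothetical metamodel: records which predicates are extensional
% and declares them dynamic so facts can be asserted or loaded later.
:- dynamic extensional_predicate/1.

extensional(Name/Arity) :-
    dynamic(Name/Arity),
    assertz(extensional_predicate(Name/Arity)).

:- extensional(reaction_reactant/2).
:- extensional(reaction_product/2).

% Illustrative facts, made up for this sketch.
reaction_reactant(hexokinase_reaction, glucose).
reaction_product(hexokinase_reaction, glucose_6_phosphate).

% Intensional predicate, as on the slide.
derivation_link(Input, Output, R) :-
    reaction_reactant(R, Input),
    reaction_product(R, Output).
```

With the metadata recorded as ordinary facts, a query such as `extensional_predicate(P)` lists the table-like predicates of the model, which is the kind of information a relational bridge or a documentation generator would need.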

Integrating with relational databases

• Most biological data stored in relational databases
  – Many provide open SQL ports for distributed queries
• RDBs scale well with large quantities of data
  – …but RDBs lack necessary deductive capabilities
• Expressivity Hierarchy
  – FOL
  – Pure prolog
  – Datalog
  – Relational Model
• Using prolog with RDBs should be easy… right?

sql_compiler
• Given a mapping to a relational schema:
  – rewrites prolog terms as SQL queries
  – Used in conjunction with db connectivity module
• History
  – Draxler, 1992
  – Source forked, modified versions available with various prologs
• Blip includes extensions to
  – Rewrite sub-optimal queries
  – Rewrite non-recursive prolog clauses
  – Integrate with SWI ODBC

Example query rewriting

program:
derivation_link(Input,Output,R):-
    reaction_reactant(R,Input),
    reaction_product(R,Output).

+ schema metadata:
relation(reac_in,2).
attribute(1,reac_in,reac_id,int).
attribute(2,reac_in,input_id,int).
relation(reac_out,2).
attribute(1,reac_out,reac_id,int).
attribute(2,reac_out,output_id,int).

+ mapping:
reaction_reactant(R,P) <- reac_in(R,P).
reaction_product(R,P) <- reac_out(R,P).

program rewriting:
?- sqlbind(sb_db:all, mydb).

call goal:
?- derivation_link(X,Y,R).

query rewriting (result sent to the database via odbc.pl):
SELECT reac_in.input_id, reac_out.output_id, reac_in.reac_id
FROM reac_in, reac_out
WHERE reac_in.reac_id=reac_out.reac_id;
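The rewriting step can be sketched for the simplest case, a conjunction of goals that share variables. This toy translator is not Draxler's sql_compiler or the blip extension: `maps_to/3`, `goals_sql/2` and the helper predicates are hypothetical names, and the sketch assumes every goal maps to a table and that at least one variable is shared (otherwise the WHERE clause would be empty).

```prolog
% Hypothetical mapping: predicate name -> table and column names,
% with columns listed in argument order.
maps_to(reaction_reactant, reac_in,  [reac_id, input_id]).
maps_to(reaction_product,  reac_out, [reac_id, output_id]).

% goals_sql(+Goals, -SQL): translate a list of goals into one SELECT.
goals_sql(Goals, SQL) :-
    maplist(goal_columns, Goals, Tables, PairLists),
    append(PairLists, Pairs),
    split_pairs(Pairs, [], Selected, Joins),
    atomic_list_concat(Selected, ', ', SelectPart),
    atomic_list_concat(Tables, ', ', FromPart),
    atomic_list_concat(Joins, ' AND ', WherePart),
    format(atom(SQL), 'SELECT ~w FROM ~w WHERE ~w',
           [SelectPart, FromPart, WherePart]).

% Pair each argument variable with its qualified column name.
goal_columns(Goal, Table, Pairs) :-
    Goal =.. [Name|Args],
    maps_to(Name, Table, Columns),
    maplist(qualify(Table), Args, Columns, Pairs).

qualify(Table, Var, Column, Var-Qualified) :-
    format(atom(Qualified), '~w.~w', [Table, Column]).

% The first occurrence of a variable becomes a SELECT column;
% each later occurrence becomes an equality (join) condition.
split_pairs([], _, [], []).
split_pairs([Var-Col|Rest], Seen, Selected, Joins) :-
    (   member(V-First, Seen), V == Var
    ->  format(atom(Join), '~w=~w', [First, Col]),
        Joins = [Join|Joins1],
        Selected = Selected1,
        Seen1 = Seen
    ;   Selected = [Col|Selected1],
        Joins = Joins1,
        Seen1 = [Var-Col|Seen]
    ),
    split_pairs(Rest, Seen1, Selected1, Joins1).
```

For example, `goals_sql([reaction_reactant(R, In), reaction_product(R, Out)], SQL)` joins the two tables on `reac_id` because `R` occurs in both goals. A real compiler also has to handle constants, negation, aggregates and query optimisation, none of which is attempted here.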

Obtaining data from web services
• Many large bioinformatics data providers provide RESTful APIs
  – NCBI
  – caBIG
• SWI libraries used
  – http_client
  – sgml (for parsing XML payloads)
• XML -> Models
  – Direct translation of sgml too low level
  – XSLT-inspired prolog template-oriented processing language
• Application:
  – ontology enhanced search term expansion
    • E.g. “find all genes implicated in neurodegenerative disease”
    • ‘parkinsons’ OR ‘alzheimers’ OR …
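Search term expansion itself needs only a transitive closure over the ontology. The following sketch assumes a tiny, acyclic, made-up `is_a/2` hierarchy and builds the OR-ed query string locally; the predicate names are hypothetical and no web service is called.

```prolog
% Hypothetical mini-ontology (acyclic), for illustration only.
is_a(parkinsons_disease, neurodegenerative_disease).
is_a(alzheimers_disease, neurodegenerative_disease).
is_a(early_onset_alzheimers_disease, alzheimers_disease).

% Transitive closure of is_a/2.
subclass_of(X, Y) :- is_a(X, Y).
subclass_of(X, Y) :- is_a(X, Z), subclass_of(Z, Y).

% expanded_query(+Term, -Query): the term and all of its subclasses,
% sorted and joined with OR.
expanded_query(Term, Query) :-
    findall(Sub, subclass_of(Sub, Term), Subs),
    sort([Term|Subs], Terms),
    atomic_list_concat(Terms, ' OR ', Query).
```

The resulting atom from `expanded_query(neurodegenerative_disease, Q)` could then be passed as the search string to whatever query interface the data provider offers.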

Applications of Blipkit and LP techniques

• Genomics and DNA sequences
  – Deduction of implicit information
  – Consistency checking of genome datasets
• Phenotype matching
  – Finding similarities of mutational effects

Genome inference

• Deluge of genomic data
  – Cost per genome decreasing
  – Soon we will all know our genome sequence
  – But what does it mean?
• Effective use of genomics data relies on deductive inference
  – Many rules are logical: genome calculus
  – Currently encoded using ad-hoc imperative code
  – Probabilistic inference also useful
    • But must be built on top of the logical inference

DNA
• Human chromosome 1: 247m base pairs, 4220 genes
• Entire genome: 3 × 10^9 base pairs, 20k genes
[Figure: base pairs T–A and G–C]

Gene expression: transcription, splicing, translation

Transcription

A subsequence of a DNA sequence is transcribed to an RNA sequence.
- regulated by sequences called promoters and enhancers

Splicing

Zero or more subsequences (introns) of the RNA sequence are spliced out. The remaining sequences (exons) are joined together at splice sites.
- guided by splice site sequences
- combinatorial possibilities

Translation

A subsequence of the RNA sequence (the Coding Sequence Region -- CDS) is translated using a genetic translation table.
- {A,C,G,U}x3 → Amino Acid
- Not all RNAs are coding
[Figure: transcript with exon 1, exon 2, exon 3, divided into 5’ (upstream) UTR, CDS, 3’ (downstream) UTR]

Formalization of gene expression

• Genome calculus
  – operations on linear sequences
    • subsequence, join, translate
  – Certain sequence types are entailed by other sequences
• Calculus is surprisingly conserved across all life
  – but biology is fuzzy and full of exceptions
    • Archaea utilize different translation table
    • Nematodes add trans-splicing
    • Mammalian introns are huge
    • Many genes are co-transcribed
    • Viral genes overlap in different translation frames…

Genomics databases

• Genome databases are important for
  – biomedicine
  – understanding evolution at a molecular level
• Problem: genome databases are incomplete
  – stating all implicit features leads to redundancy
  – integration and complex queries difficult
  – ad-hoc rules embedded in imperative code
• Problem: genome databases are inconsistent
  – Different interpretation of gene, exon, UTR etc

Solution: Sequence Ontology + Deductive Database

• The Sequence Ontology standardizes sequence terms
  – Additional axioms are being added
    • Encoding genome calculus
    • Genome relations based on Allen Interval Algebra
• Can be used in conjunction with a deductive genome database
  – consistency checking
    • does this genome dataset make sense?
  – inference and querying
    • what entities are present in region X?

Sequence relationship predicates based on Allen Interval Algebra
• no recursion
• conjunction of binary terms
• uses arithmetic (for efficiency)
Extensions:
• strands
• circular genomes

upstream_of(X,Y) :-
    has_end(X,XE),
    has_start(Y,YS),
    XE < YS.

[Figure: exon1 exon2 exon3 exon4 exon5, left to right]

?- upstream_of(X,exon3).
X=exon1 ;
X=exon2
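The interval predicate can be run against a handful of facts. In this sketch the `exon_span/3` table and its coordinates are made up, and `has_start/2` and `has_end/2` are derived from it; strands and circular genomes are ignored.

```prolog
% exon_span(Exon, Start, End): hypothetical coordinates on one strand.
exon_span(exon1,  100,  200).
exon_span(exon2,  300,  400).
exon_span(exon3,  500,  600).
exon_span(exon4,  700,  800).
exon_span(exon5,  900, 1000).

has_start(X, S) :- exon_span(X, S, _).
has_end(X, E)   :- exon_span(X, _, E).

% X is upstream of Y if X ends before Y starts.
upstream_of(X, Y) :-
    has_end(X, XE),
    has_start(Y, YS),
    XE < YS.
```

Here `upstream_of(X, exon3)` enumerates the exons that end before exon3 starts, while `upstream_of(exon3, X)` enumerates the exons that lie downstream of it.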

Intron-exon inference
intron( i(T,S,E) ) :-
    exon(X1), exon(X2),
    has_end(X1,S,T), has_start(X2,E,T),
    \+ (( exon(X3), contained_by(X3,T),
          starts_after_start_of(X3,X1),
          ends_before_end_of(X3,X2) )).
• function terms as arguments
• possibility of recursion through negation

[Figure: exon1 and exon2 on transcript t1]

exon(exon1). exon(exon2).
has_end(exon1,1000,t1).
has_start(exon2,2000,t1).
?- intron(X).
X = i(t1,1000,2000)
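A self-contained variant of the intron rule shows the role of the negated subgoal. This is a simplified sketch rather than the rule set used in the talk: it assumes a single hypothetical `exon/4` table holding transcript, start and end, and it treats any exon lying strictly between two others as blocking the intron.

```prolog
% exon(Id, Transcript, Start, End): hypothetical coordinates.
exon(exon1, t1,  100, 1000).
exon(exon2, t1, 2000, 2500).
exon(exon3, t1, 3000, 3600).

% An intron spans the gap between the end of one exon and the start
% of a later exon on the same transcript, provided that no other exon
% of that transcript lies inside the gap.
intron(i(T, S, E)) :-
    exon(_, T, _, S),
    exon(_, T, E, _),
    S < E,
    \+ ( exon(_, T, S3, E3),
         S3 > S,
         E3 < E ).
```

With these facts, `intron(X)` yields `i(t1,1000,2000)` and `i(t1,2500,3000)`, but not the span from exon1 to exon3, because exon2 lies inside it. The inferred introns are function terms, not named individuals, which is the point made about unnamed individuals on the OWL slide.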

OWL implementation

• Many axioms cannot be expressed in OWL
  – Interval relations: no arithmetic in OWL
    • option 1: use SWRL
    • option 2: enumerate all base pairs and use property chain axioms
  – Cannot infer properties of unnamed individuals
    • E.g. introns from exons
  – Cyclic structures cannot be described
    • Requires Description Graph extension
• Open World Assumption
  – useful for semantic web
  – CWA is more convenient for genomics

Deductive database implementation

• Methods:
  – Convert sequence ontology OWL->DLP via Thea2
  – Manually edit
  – Add rules that cannot be expressed in OWL
• Tested on XSB and Yap
  – requires tabling
• Results
  – Currently scales to small regions
  – more debugging required
  – difficult to eliminate unstratified negation

Disjunctive datalog implementation

• Adds:
  – Constraints
  – Disjunctions in rule heads
• Implementation
  – DLV-Complex : allows functions in arguments
  – Program written from scratch: Rules must be ‘safe’
• Results
  – Scales over small regions
  – Useful for detecting inconsistencies in data
• More research needed
  – More efficient programs
  – Use of relational database backend
  – Further exploration of ASP semantics
    • Genomic rules have many exceptions

Prolog implementation

• Removes:
  – rules that cause cycles with backtracking
• Implementation
  – Optional use of Nested Containment List library (C + SWI FLI)
• Results
  – Results can be incomplete due to missing rules
    • E.g. intron :- exon, but not exon :- intron
    • Ruleset can be tailored for dataset
  – Scales over medium sized datasets

Hybrid Prolog-Relational implementation

• Uses same program as prolog implementation
• Relational database stores facts (extensional)
  – can be distributed
• Uses sql_compiler + mappings to genomics databases
  – Ensembl
  – Chado
• Non-recursive prolog rules dynamically translated to complex SQL
• Recursive subclass rules translated
  – by query compiler using UNIONs
  – precomputed and stored in relational database
• Scales to full genomes

LP for genomics: conclusions

• No one paradigm is perfect
  – Many axioms cannot be expressed in OWL
    • but tools are good
  – Disjunctive Datalog good for consistency checking in small regions
  – More research required on efficiency of tabling solution, ASPs
  – WAM solution most efficient
  – Manually rewriting programs is tedious!
• Hybrid solutions useful
  – RDBs for asserted facts

Application: match.com for diseases

• Organisms have phenotypes
  – characteristics under the control of the genes of that organism
• Related genes can have similar phenotypic effects
  – even when the least common ancestor of the genes dates back 500m years
• Finding these genes can help understand
  – disease
  – evolution

Semantic Similarity

• Given a collection of
  – features $F = \{f_1, f_2, \ldots\}$
  – attributes $A = \{a_1, a_2, \ldots\}$
  – feature-attribute mappings:
    • $a(f) \subseteq A$ for each $f \in F$ (a relation over $F \times A$)
• For any feature pair $x, y$, calculate:
  – Jaccard coefficient
    • $|a(x) \cap a(y)| \,/\, |a(x) \cup a(y)|$
  – maximum IC
    • $IC(a) = -\log_2 p(a)$
    • $\mathrm{maxIC}(x,y) = \max\{IC(a) : a \in a(x) \cap a(y)\}$

SWI-Prolog implementation
• Uses GMP
  – normal prolog programs have unbounded integer arithmetic
  – allows fast bitwise implementations of set intersection/union
• Encode feature attribute lists as integers
  – $m : A \to \{0, \ldots, |A|-1\}$
  – $a_i(f) = \sum_{a \in a(f)} 2^{m(a)}$
• Set intersection and union computed using bitwise and/or
  – Fast implementation of Jaccard coefficient
  – J is popcount(A1 /\ A2) / popcount(A1 \/ A2)
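The bit-vector encoding can be sketched in a few lines of SWI-Prolog. The attribute names and bit positions in `attribute_index/2` are made up, and `attributes_bitvector/2` and `jaccard/3` are hypothetical helper names; `popcount` is the SWI-Prolog arithmetic function that counts the 1 bits of an integer.

```prolog
% attribute_index(Attribute, BitPosition): a hypothetical mapping m.
attribute_index(atrophied,           0).
attribute_index(dystrophic,          1).
attribute_index(decreased_size,      2).
attribute_index(abnormal_morphology, 3).

% Encode an attribute list as an integer with one bit per attribute.
attributes_bitvector(Attributes, Vector) :-
    foldl(set_bit, Attributes, 0, Vector).

set_bit(Attribute, V0, V) :-
    attribute_index(Attribute, I),
    V is V0 \/ (1 << I).

% Jaccard coefficient: |intersection| / |union|, computed bitwise.
% Fails when both attribute sets are empty.
jaccard(V1, V2, J) :-
    Union is popcount(V1 \/ V2),
    Union > 0,
    J is popcount(V1 /\ V2) / Union.
```

For instance, encoding `[atrophied, decreased_size, abnormal_morphology]` and `[dystrophic, decreased_size, abnormal_morphology]` gives two vectors that share two of four attributes, so `jaccard/3` reports a coefficient of one half. Because integers are unbounded, the same code works unchanged when the number of attributes exceeds the machine word size.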

Similarity metrics + reasoning

• Attributes are description logic class expressions
  – rarely exact matches across species

a(human1): dystrophic ∩ ∃quality_of.arm_muscle
a(zebrafish7): atrophied ∩ ∃quality_of.pectoral_fin_muscle

The two expressions are not equal (≠), so a(human1) ∩ a(zebrafish7) = {}

Use reasoning to find subsumer

• Find Least Common Ancestor expression
  – typically class expression, not named class

a(human1): dystrophic ∩ ∃quality_of.arm_muscle
a(zebrafish7): atrophied ∩ ∃quality_of.pectoral_fin_muscle

a*(human1) ∩ a*(zebrafish7) = { decreased_size ∩ ∃quality_of.muscle_of_upper_limb }

Implementation: Uses Thea2

• Thea2 is a prolog package for OWL2
  – http://github.com/vangelisv/thea
  – reads/writes
    • RDF/XML
    • OWL-XML
    • Native prolog form
    • Description Logic Programs (DLPs)
• Reasoning strategies
  – Prolog
  – DL reasoners (via JPL/OWLAPI)
  – SQL DB + forward chaining

Phenotype matching: Results

• Proof of concept on 10 human disease genes– publication forthcoming

• Currently applying to neurodegenerative diseases

• Funding to extend to all Mendelian diseases

Web Applications

• http://berkeleybop.org/obo
• Web interface to Open Bio Ontologies
  – Implemented in perl + SWI-Prolog
• Prototype for future development
  – SWI-Prolog
  – Production version in perl and/or java
http://obofoundry.org/obo

Experiences using LP for bioinformatics: conclusions

• A little bit of LP goes a long way
  – The theory-application gap is largely untapped
• A variety of LP paradigms are useful
  – ASP, datalog, DLs, prolog, ILP, …
  – Interoperation can be hard!
• LP for ‘real world’ applications
  – It is possible!
  – Declarative approach arguably superior
  – Web/database applications are a sweet spot
  – We need to show more success stories
  – ..and to dispel myths

Recommendation: make it easier for users

• Documentation:
  – Unify community knowledge in a single wiki
  – Create a general LP mail list
  – c.f. OWL/SemWeb community
• Tools:
  – Program analysis
  – Lint-like tool for tabled prologs, ASP
  – Visualization
• Libraries
  – CPAN for Prolog

Recommendation: make it open-source

• Why
  – Encourages collaboration
  – Bioinformaticians love open source
  – The people who fund bioinformaticians love open source
  – Open source can still generate revenue
• How
  – Deposit code in open source code repositories
    • github, sourceforge, googlecode, etc
  – Embrace Web 2.0
    • blog it, put it on a wiki

Recommendation: interoperate with RDBs

• Why?
  – RDBs and LP should be a natural match
  – Application developers are conservative and familiar with RDBs
  – lightweight in-memory embedded RDBs are becoming more popular
• How:
  – Hide LP systems behind pseudo-SQL interface
    • SQL queries and DDL translated behind the scenes. cf sql_compiler
    • Users can use native LP syntax and semantics as they feel comfortable
  – Embed LP systems directly in RDBs
    • E.g. PostgreSQL extensions
  – Improve prolog->SQL interfaces
    • Common API c.f. JDBC (Java), DBI (Perl)

Recommendation: A unified API to all LP systems

• Use case:
  – calling LP system from host language (java, perl, ruby, even other prolog)
• Problem:
  – No standardization amongst APIs
• Analogous problem:
  – RDB APIs
  – Solved: a 20th century problem
• Proposal:
  – Common REST interface
  – Single interface per host language

Interoperation between LP systems
• LP systems (ILP, ASP, Prolog, …) differ in whether they accept:
  – Foo(x).
  – 'Foo'(x).
  – 'foo bar'(x).
  – foo('x y').
  – foo("x y").
• Non-prolog systems should:
  – Adhere to ISO standard for intersection with pure prolog
    • Or at least provide ISO mode
• Also:
  – ISO Common Logic
  – W3C RIF

Future directions

• Scalable LP
• Probabilistic + logic modeling
  – CLP(Bayes)
  – PRISM

Robot scientist

The Automation of Science

King et al., Science, 3 April 2009: 85-89. DOI: 10.1126/science.1165620

http://news.bbc.co.uk/2/hi/science/nature/7979113.stm



Acknowledgments

• Vangelis Vassiliadis (Thea)
• Stephen Veitch (intervaldb)
• Christoph Draxler (sql_compiler)
• Jan Wielemaker + SWI Mail list
• Paulo Moura
• Vítor Santos Costa + Yap developers
• Terrence Swift + XSB developers
