Automated Exploration of Bioinformatics Spaces

Simon Colton

Computational Bioinformatics Laboratory

Purpose of the Talk

To make you aware of another tool which may have some potential for use in the Metalog project
To get feedback on this potential
To briefly describe two other projects

The Substructure Server

Old-style approach to using machine learning (ML) for predictive toxicology
  – What do the positives have in common that the negatives do not?
  – For chemicals, using ILP is possibly like using a sledgehammer to crack a nut
Substructures are often the answer (e.g., mutagenesis)
  – Substructure server looks explicitly for substructures
Vehicle for me to understand ML in predictive toxicology and server-client technology
  – May even be of some use one day
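The question "what do the positives have in common that the negatives do not?" can be sketched in a few lines. This is a minimal FIND-S-style illustration, not the server's actual Prolog routine: it assumes each molecule has already been reduced to a set of substructure labels, and the function name `find_s` and the example labels are hypothetical.

```python
def find_s(positives, negatives):
    """Return the most specific hypothesis covering every positive,
    plus a flag saying whether it also excludes every negative.

    Each molecule is modelled (as an assumption) as a set of substructure labels.
    """
    # Most specific generalisation: the substructures shared by all positives.
    hypothesis = set.intersection(*(set(p) for p in positives))
    # Consistent if the hypothesis is non-empty and no negative contains all of it.
    consistent = bool(hypothesis) and not any(hypothesis <= set(n) for n in negatives)
    return hypothesis, consistent


# Hypothetical toy data: labels are illustrative, not real assay results.
positives = [
    {"nitro", "benzene", "amine"},
    {"nitro", "benzene"},
    {"nitro", "benzene", "chloro"},
]
negatives = [{"benzene", "amine"}, {"chloro"}]

hypothesis, consistent = find_s(positives, negatives)
print(sorted(hypothesis), consistent)
```

The sketch only handles exact, noise-free data; a real substructure search would also need to enumerate candidate substructures from the molecular graphs.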

Substructure Server Development

Team
  – Simon Colton
    – Prolog machine learning routine (FIND-S)
  – Saravanan Anandathiyagar
    – Server technology
  – Laurence Darby
    – Distributing the process over our Linux farm
    – Gives roughly a 5 times speed-up
  – A.N.Other masters student (TBA)
    – Front end (Babel)
    – Back end (Molgen, etc.)

Old-Style Predictive Toxicology

Reason 1:
  – Using only chemistry, attributes etc.
    – Not using biochemical pathways
Reason 2:
  – Using predictive machine learning
    – Not using descriptive machine learning

Predictive Induction in Bioinformatics

Interesting problem found
  – Interesting from a biochemistry perspective
  – Interesting from a computer science perspective
Packaged as prediction/classification
  – Turned into positives and negatives
  – Much work done to shoe-horn into a prediction task
Reason(s) learned why positives are positive
  – Almost guaranteed that any answer found will be interesting, because the problem is interesting

Generating Hypotheses

Predictive machine learning produces hypotheses of the form:
  – A → Toxic
  – Toxic → C
  – B → Toxic
  – D → ¬Toxic
  – etc.
With any luck, A, B or C will be interesting in their own right
  – And enter the biochemistry literature!

But what if…

There was an interesting relationship
  – Between a concept and a subset of the positives.
Isn’t this interesting?
Examples:
  – A → Toxic & B
  – C → ¬Toxic & D & E

Predictive versus Descriptive Learning

Predictive learning
  – You know what you are looking for
  – You just don’t know what it looks like
Descriptive learning
  – You don’t know what you are looking for
  – But you want to find something interesting
Eventually:
  – You don’t even know you are looking for something

Descriptive Induction

Not as goal directed as predictive induction
Same background information given
  – Perhaps no categorisation into pos & neg
A theory is produced which contains:
  – Examples
  – Concepts which categorise/describe sets of examples
  – Hypotheses which relate concepts
  – Explanations which explain the hypotheses
For instance:
  – Acid + Base → Salt + Water
Tools are supplied so that
  – The user can extract interesting parts of the theory

The HR System in 3 Slides

Concept formation
  – Starts with background info like Progol
  – Builds new concepts from old ones
    – Using one of 15 production rules (composition, instantiation, counting, matching, etc.)
    – Unary or binary
    – Many settings for how concept formation occurs
  – Derives examples & definition of concepts
Heuristic search (if user specifies)
  – Uses a best first search
    – 20+ measures of interestingness for concepts/conjectures
    – Chooses to build new concepts from best old ones
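The loop described above (pick the best old concept, apply a production rule, score the result) can be sketched as follows. This is only an illustration under stated assumptions, not HR's implementation: it uses a single binary rule (composition as conjunction of example sets), and the `interestingness` measure and all names are hypothetical stand-ins for HR's production rules and its 20+ measures.

```python
import heapq
import math
from itertools import count


def interestingness(examples, universe):
    """Hypothetical measure: concepts covering about half the objects score highest."""
    return 1.0 - 2.0 * abs(len(examples) / len(universe) - 0.5)


def form_concepts(background, universe, steps):
    """Best-first concept formation with one binary 'compose' rule (an assumption)."""
    concepts = dict(background)      # concept name -> frozenset of examples
    order = count()                  # tie-breaker so the heap never compares names
    agenda = [(-interestingness(ex, universe), next(order), name)
              for name, ex in concepts.items()]
    heapq.heapify(agenda)
    for _ in range(steps):
        if not agenda:
            break
        _, _, best = heapq.heappop(agenda)   # most interesting unexpanded concept
        for other in list(concepts):
            if other == best:
                continue
            name = "(" + " & ".join(sorted((best, other))) + ")"
            if name in concepts:
                continue
            examples = concepts[best] & concepts[other]   # compose = conjunction
            concepts[name] = examples
            heapq.heappush(
                agenda, (-interestingness(examples, universe), next(order), name))
    return concepts


universe = frozenset(range(1, 21))
background = {
    "even": frozenset(n for n in universe if n % 2 == 0),
    "square": frozenset(n for n in universe if math.isqrt(n) ** 2 == n),
    "multiple_of_3": frozenset(n for n in universe if n % 3 == 0),
}
theory = form_concepts(background, universe, steps=2)
for name in sorted(theory):
    print(name, sorted(theory[name]))
```

Each new concept carries both a definition (its name) and its derived examples, which is what the conjecture-making step needs as input.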

The HR System in 3 Slides

Conjecture Making
  – “Proper” induction!
  – Notices patterns in examples for concepts
    – Newly formed concept has no examples
      – Makes a non-existence conjecture
    – Two concepts have exactly the same examples
      – Makes an equivalence conjecture
    – One concept’s examples are a subset of another’s
      – Makes an implication conjecture
  – Extracts simpler hypotheses from empirical ones
  – Able to make “near-conjectures”
    – Patterns don’t have to be exact
    – User specifies a tolerance level
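The three conjecture types and the tolerance-based near-conjectures can be illustrated by comparing example sets directly. The sketch below is an assumption about how such a comparison might look, not HR's code: `make_conjectures`, the tuple format, and the definition of tolerance (fraction of the left-hand concept's examples allowed to be exceptions) are all hypothetical.

```python
import math


def make_conjectures(concepts, tolerance=0.0):
    """Compare example sets and return (kind, lhs, rhs) tuples.

    `tolerance` is taken here (as an assumption) to be the fraction of
    lhs examples allowed to fall outside rhs in a near-implication.
    """
    conjectures = []
    names = sorted(concepts)
    for name in names:
        if not concepts[name]:                       # no examples at all
            conjectures.append(("non-existence", name, None))
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if not concepts[a] or not concepts[b]:
                continue
            if concepts[a] == concepts[b]:           # identical example sets
                conjectures.append(("equivalence", a, b))
                continue
            for x, y in ((a, b), (b, a)):
                ex, ey = concepts[x], concepts[y]
                if ex <= ey:                         # subset of examples
                    conjectures.append(("implication", x, y))
                elif len(ex - ey) / len(ex) <= tolerance:
                    conjectures.append(("near-implication", x, y))
    return conjectures


def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]


universe = range(1, 31)
concepts = {
    "odd": {n for n in universe if n % 2 == 1},
    "prime": {n for n in universe if len(divisors(n)) == 2},
    "square": {n for n in universe if math.isqrt(n) ** 2 == n},
    "odd_tau": {n for n in universe if len(divisors(n)) % 2 == 1},
    "even_prime_above_2": {n for n in universe
                           if n > 2 and n % 2 == 0 and len(divisors(n)) == 2},
}
conjectures = make_conjectures(concepts, tolerance=0.15)
for c in conjectures:
    print(c)
```

On this toy data the near-implication "prime implies odd" has a single exception (2), which is exactly the kind of pattern an exact subset test would miss.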

The HR System in 3 Slides

Generating explanations
  – User supplies a set of axioms
  – HR appeals to a third party theorem prover
    – And a third party model generator (Otter/MACE)
  – To attempt to prove/disprove
    – That the hypothesis follows from the axioms
Sometimes, explanations are interesting
  – In domains such as group theory
    – Explanations are proofs of theorems
Sometimes, explanations show that a hypothesis is dull
  – Anything provable by the theorem prover is trivial

Extreme(!) Theory Formation

All my best examples are from maths
Given only one concept:
  – How to divide two integers
HR finds the conjecture
  – Odd refactorable numbers are squares
Invented concepts:
  – Odd, square, refactorable, (even, tau, …)
Made concept of odd refactorables
  – Noticed the examples are a subset of the examples for square numbers
No proof supplied (I proved this one)
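The conjecture can be checked empirically in the same way HR noticed it: build the examples of each concept and test the subset relation. The sketch below assumes the standard definitions (tau(n) is the number of divisors of n, and n is refactorable when tau(n) divides n); the search bound of 2000 is arbitrary, and the check is evidence, not a proof.

```python
import math


def tau(n):
    """Number of positive divisors of n."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)


def is_refactorable(n):
    """n is refactorable when its number of divisors divides n."""
    return n % tau(n) == 0


def is_square(n):
    return math.isqrt(n) ** 2 == n


LIMIT = 2000  # arbitrary bound for the empirical check
odd_refactorables = [n for n in range(1, LIMIT + 1)
                     if n % 2 == 1 and is_refactorable(n)]
counterexamples = [n for n in odd_refactorables if not is_square(n)]

print(odd_refactorables)
print("counterexamples:", counterexamples)
```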

What HR Can Deliver

HR generates hypotheses like Progol
  – But there are too many
  – Require filters to prune dull ones
Some concepts might be interesting aside from their relation to toxicity
HR points out interesting examples
  – E.g., a molecule has the only occurrence of a particular sub-molecule
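Flagging a molecule that has the only occurrence of a sub-molecule amounts to counting how many molecules contain each sub-molecule. A minimal sketch, assuming each molecule is given as a set of sub-molecule labels; the function name `unique_occurrences` and the toy data are hypothetical, and this is not how HR itself represents examples.

```python
from collections import Counter


def unique_occurrences(molecules):
    """Map each molecule to the sub-molecules that occur in no other molecule."""
    # Count, for every sub-molecule, the number of molecules containing it.
    counts = Counter(sub for subs in molecules.values() for sub in set(subs))
    flagged = {}
    for name, subs in molecules.items():
        unique = sorted(s for s in set(subs) if counts[s] == 1)
        if unique:
            flagged[name] = unique
    return flagged


# Hypothetical toy data: labels are illustrative only.
molecules = {
    "m1": {"benzene", "nitro"},
    "m2": {"benzene", "amine"},
    "m3": {"benzene", "amine", "epoxide"},
}
flagged = unique_occurrences(molecules)
print(flagged)
```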

Interesting New Angle

Anomaly detection
First experiments in analysis of Bach chorale melodies
  – Which ones were different to the rest
    – Not necessarily breaking rules
    – Could be: something occurring more often
  – “Parsimony outlier” measure of interestingness
Hope to try this with metabolic pathways
  – Give me 30 pathways
    – I’ll give you reasons why each is unique
  – Give me an invented pathway
    – I’ll show you possible reasons it’s wrong…

What I Need

Objects of interest
  – Pathways
Background concepts
  – Ways to describe the pathways
Axioms
  – What we know is true about pathways
Measures of interestingness
  – Essential to separate the wheat from the chaff
  – Evolve over time as we use HR together

Future for my Work

Form theories about biochemical data
Domain of interest
  – Pathways
Technical problems
  – Enabling HR to work with probabilistic information (not yet possible)
  – Enabling HR to work with larger datasets
  – Understanding pathways!

The Amaze Database

Bioinformatics MSc. Project
  – Organised by Marek Sergot
Challenge
  – To resurrect the Amaze database
    – Of biochemical pathways
    – EBI originally, now Université libre de Bruxelles
  – To get hold of data, put into a database, put a front-end onto this, etc.
  – And write translation routines
    – So that we can get at the information
This is a resource we should use
  – Please let me know your requirements
