Keynote talk at the KR4HC workshop at Artificial Intelligence in Medicine Europe, Verona, 2009
The Changing Nature of Biomedical Research: Semantic e-Science
Robert Stevens
BioHealth Informatics Group
University of Manchester
Introduction
• (Modern bio-molecular) Science
• E-Science
• Semantics and science
• Semantic e-Science
Ernest Rutherford
“All science is either physics or stamp collecting”
Image: http://en.wikipedia.org/wiki/File:Ernest_Rutherford2.jpg
Mathematical Sciences
Laws in Biology
Charles Darwin
Image: http://en.wikipedia.org/wiki/File:Charles_Darwin_01.jpg
On The Origin of Species - 1859
Central Dogma
Image: http://cellbio.utmb.edu/CELLBIO/DNA-RNA.jpg
Classic and Modern Biology
Genotype Phenotype
Modern biology
Classic biology
Speed of sequencing
• First human genome
– 10+ years to produce
– Cost $500 million
– Huge international effort
• Now done in 10 weeks
– (for $399)
– http://tinyurl.com/genomecost
– http://www.23andme.com
1000+ databases
• according to Nucleic Acids Research
PubMed: 2 papers per minute
• ~700,000 individual papers
• Grows at 2 papers per minute
(see http://blogs.bbsrc.ac.uk for details)
Biology now has lots of facts
Lots of catalogues
Genome
Proteome
Transcriptome
Interactome
Metabolome
PHENOME
Creating Woods, not Trees
Genes
Proteins
Pathways
Interactions
Literature
Complex Machines
Virtual Organism
…. from biological facts, we make a system that is some model of a real organism
Networks of Chemicals
Image: http://genome-www.stanford.edu/rap_sir/images/Web_FigF_RAP1_glycolysis.gif
Systems within Systems
Image: http://www.ehponline.org/members/2007/10373/fig1.jpg
UniProt: a protein database?
Navigating the Web of Knowledge in Bioinformatics
Bioinformatics Experiments are Data pipelines
Resources/Services
Investigate the evolutionary relationships between proteins
Protein sequences
Multiple sequence alignment
Query
[Peter Li]
My data
My tool
Linking together data resources
Hypo Science – the routine for the many
Hyper Science – big projects, big science
The In Silico Experiment
• We can mine these data for possible hypotheses
• “what are the genes that are involved in some disease phenotype?”
• Correlate genes in QTL with differentially regulated genes in microarray via pathways; query the literature base with these genes, pathways and phenotype; …
• Resulting facts form some hypothesis: A co-ordinated set of SNPs increase cholesterol biosynthesis in macrophage, while delaying apoptosis of these cells; increased super-oxide production aids tolerance to trypanosomiasis in cattle
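The correlation step described above (genes in a QTL and differentially regulated genes, linked via pathways) can be sketched as a set intersection. This is only a minimal illustration: the function names, gene names and pathway names are all placeholders, and a real analysis would draw the gene-to-pathway mapping from a pathway database.

```python
from typing import Dict, Iterable, Set


def pathways_for(genes: Iterable[str], gene_to_pathways: Dict[str, Set[str]]) -> Set[str]:
    """Collect every pathway that contains at least one of the given genes."""
    found: Set[str] = set()
    for gene in genes:
        found |= gene_to_pathways.get(gene, set())
    return found


def candidate_pathways(
    qtl_genes: Iterable[str],
    regulated_genes: Iterable[str],
    gene_to_pathways: Dict[str, Set[str]],
) -> Set[str]:
    """Pathways touched by both the QTL genes and the differentially regulated genes."""
    return pathways_for(qtl_genes, gene_to_pathways) & pathways_for(
        regulated_genes, gene_to_pathways
    )


# Placeholder gene and pathway names, not results from any study.
gene_to_pathways = {
    "geneA": {"pathway_1", "pathway_2"},
    "geneB": {"pathway_2"},
    "geneC": {"pathway_3"},
}
shared = candidate_pathways({"geneA"}, {"geneB", "geneC"}, gene_to_pathways)
```

The pathways left in `shared` are the ones worth taking to the literature in the next step.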
How bioinformatics was done
Integrating data sets
• Slave labour
• Collections of Scripts
• Warehouses
• Applications
– Galaxy
– Gaggle
– Integr8
– Ensembl
– …..
• Workflows!
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta
Workflows: E. Science laboris
• Data preparation and analysis pipelines.
• Data preparation pipelines
• Data integration pipelines
• Data analysis pipelines
• Data annotation pipelines
• Warehouse population refreshing
• Data and text mining
• Knowledge extraction.
• Parameter sweeps over simulations/computations
• Model building and verification
• Knowledge management and model population
• Hypothesis generation and modelling
• A workflow is a specification.
• A workflow management system (WfMS) is the machinery for coordinating the execution of (scientific) services and linking together (scientific) resources.
• Handles cross-cutting concerns like: error handling, service invocation, data movement, data streaming, data provenance tracking, process auditing, execution monitoring, security access, and so on.
• Agile software development
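To make the split between specification and machinery concrete, here is a minimal sketch of a workflow as a list of steps and a toy engine that enacts it while recording provenance. The `Step` and `Engine` classes are hypothetical and purely illustrative; they are not the API of any real workflow system.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    """One node in the workflow specification (hypothetical structure)."""
    name: str
    func: Callable[..., Any]
    inputs: List[str]  # names of the values this step consumes
    output: str        # name under which the result is stored


@dataclass
class Engine:
    """Toy enactment engine: runs steps in order and records provenance."""
    provenance: List[Dict[str, Any]] = field(default_factory=list)

    def run(self, steps: List[Step], data: Dict[str, Any]) -> Dict[str, Any]:
        values = dict(data)  # copy so the caller's inputs are not mutated
        for step in steps:
            args = [values[name] for name in step.inputs]
            try:
                result = step.func(*args)
                status = "ok"
            except Exception as exc:  # error handling is a cross-cutting concern
                result, status = None, f"failed: {exc}"
            self.provenance.append(
                {"step": step.name, "inputs": step.inputs,
                 "output": step.output, "status": status}
            )
            if status != "ok":
                break  # stop at the first failing step
            values[step.output] = result
        return values


# The specification: two services chained by named data items.
workflow = [
    Step("clean", lambda seqs: [s.strip().upper() for s in seqs], ["raw"], "clean"),
    Step("lengths", lambda seqs: [len(s) for s in seqs], ["clean"], "lengths"),
]

engine = Engine()
results = engine.run(workflow, {"raw": [" acgt", "ttga \n"]})
```

The steps know nothing about logging or failure handling; the engine supplies both, which is the sense in which those concerns are cross-cutting.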
Workflows: E. Science laboris
Enactment Engine
My data
My tool
Workflow Execution Engine
Workflow execution engine
Local desktop and remote server
Implicit iteration over large data collections
Nested workflows
Automated data flow
Event history log and data provenance tracking
Within-workflow programming
Extensibility points for plug-ins
Graphical workbench
For Professionals
Plug-in architecture
Incorporate new service without coding.
Services as they are.
Access to local and remote resources and analysis tools
Re-Design
Rewritten
• Comparing resistant vs. susceptible strains – Microarrays
• Mapping quantitative traits – Classical genetics QTL
• Integrated Microarray data, genomic sequences, pathway data, literature mining.
Trypanosomiasis Study
Paul Fisher et al., Nucleic Acids Research, 2007, 35(16)
Genotype to Pathway
Created by Paul Fisher
Pathway to Phenotype
Created by Paul Fisher
• Eliminated user bias and premature filtering
• The scale and complexity of data and literature.
• Systematic data analysis
• Data analysis provenance
• Manageable amount of output data for biologists to interpret and verify
• Data driven science
“Looking where others hadn’t”
“make sense of this data” -> “does this make sense?”
http://www.youtube.com/watch?v=Y6_Kz5L010g
Transferring Characteristics
Uncharacterised protein
Tra1 La2 La3
High similarity transfer characteristics
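A minimal sketch of transferring characteristics by similarity follows. It assumes a crude string-similarity score from Python's `difflib` as a stand-in for a real sequence alignment score; the threshold, sequences and annotations are invented for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; a stand-in for an alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def transfer_annotation(
    query: str,
    characterised: Dict[str, Tuple[str, str]],
    threshold: float = 0.8,  # assumed cut-off, not taken from the talk
) -> Optional[Tuple[str, str, float]]:
    """Return (protein name, annotation, score) of the best hit at or above threshold."""
    best: Optional[Tuple[str, str, float]] = None
    for name, (sequence, annotation) in characterised.items():
        score = similarity(query, sequence)
        if score >= threshold and (best is None or score > best[2]):
            best = (name, annotation, score)
    return best


# Made-up sequences and annotations, for illustration only.
known = {
    "protA": ("MKTAYIAKQR", "kinase"),
    "protB": ("GGGSSSPPPL", "transporter"),
}
hit = transfer_annotation("MKTAYIAKQK", known)
```

If no characterised protein is similar enough, the function returns `None` and the query stays uncharacterised.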
… A Fact Based Discipline
• Rather than laws captured in mathematics….
• We have lots of facts: the discipline’s knowledge
• Rather than “calculating” what a protein does, we investigate and write it down
• Equivalent to writing down the trajectories of all thrown objects and not doing ballistics!
• To do biology one needs “the knowledge”
Heterogeneity
• 28 ways to format the representations of a biological sequence
• Though one way to represent the bases or amino acids…
• Different words same concept
• Different concepts same words
• Different and implicit data schema
An Identity Crisis
• Database entries have identifiers unique within their database
• The type of entity described in an entry doesn’t have an identifier
• Different entries about the same type talk about it differently
• How do we know when an entry in one DB talks about the same thing as another entry in another DB?
• That’s the skill of a bioinformatician
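One way to picture the identity problem is as grouping entries that cross-reference each other, directly or transitively. The sketch below uses a small union-find structure for that; the database names and identifiers are placeholders rather than real records, and in practice finding reliable cross-references is the hard part.

```python
from typing import Dict, List, Set, Tuple

Entry = Tuple[str, str]  # (database name, identifier local to that database)


def group_equivalent(xrefs: List[Tuple[Entry, Entry]]) -> List[Set[Entry]]:
    """Group entries linked, directly or transitively, by cross-references."""
    parent: Dict[Entry, Entry] = {}

    def find(e: Entry) -> Entry:
        parent.setdefault(e, e)
        while parent[e] != e:
            parent[e] = parent[parent[e]]  # path halving keeps trees shallow
            e = parent[e]
        return e

    for a, b in xrefs:
        parent[find(a)] = find(b)  # merge the two groups

    groups: Dict[Entry, Set[Entry]] = {}
    for e in list(parent):
        groups.setdefault(find(e), set()).add(e)
    return list(groups.values())


# Placeholder database names and identifiers, not real records.
xrefs = [
    (("db_one", "P001"), ("db_two", "gene-17")),
    (("db_two", "gene-17"), ("db_three", "X9")),
    (("db_one", "P002"), ("db_three", "X4")),
]
groups = group_equivalent(xrefs)
```

Each resulting group is a set of entries that, according to the cross-references supplied, talk about the same thing.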
Categories and Category Labels
GO:0000368
U2-type nuclear mRNA 5' splice site recognition
spliceosomal E complex formation
spliceosomal E complex biosynthesis
spliceosomal CC complex formation
U2-type nuclear mRNA 5'-splice site recognition
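The relationship between one category and its many labels can be sketched as a lookup from label to identifier. The identifier and labels below are the ones listed on the slide; the function names and the case-insensitive matching are assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Identifier and labels exactly as listed on the slide.
CATEGORY_LABELS: Dict[str, List[str]] = {
    "GO:0000368": [
        "U2-type nuclear mRNA 5' splice site recognition",
        "spliceosomal E complex formation",
        "spliceosomal E complex biosynthesis",
        "spliceosomal CC complex formation",
        "U2-type nuclear mRNA 5'-splice site recognition",
    ],
}


def build_label_index(categories: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every label (lower-cased) to the identifier of its category."""
    index: Dict[str, str] = {}
    for identifier, labels in categories.items():
        for label in labels:
            index[label.lower()] = identifier
    return index


def lookup(label: str, index: Dict[str, str]) -> Optional[str]:
    """Return the category identifier for a label, or None if it is unknown."""
    return index.get(label.strip().lower())


LABEL_INDEX = build_label_index(CATEGORY_LABELS)
term = lookup("Spliceosomal E complex formation", LABEL_INDEX)
```

Whichever label a database or author happens to use, the identifier is the stable name for the category.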
The Role of Knowledge
• A lot of facts
• Perhaps organised into a system
• No equivalent of “laws of mechanics” – we can’t do this biology with mathematics
• Or at least not without knowing what the numbers mean...
• This is why we’ve been using ontologies!
Uses of Ontology in Bioinformatics
Post-Genomic Biology
• Fly, mouse, yeast, worm all have their own terminologies
• I want to compare genomes
• How?
• The genomic sequence is easily dealt with computationally and comparisons are easy
• This is not true of the annotations or knowledge of those sequences
• Need a common understanding
Annotation of Data
• Big effort to create controlled vocabularies using ontologies
• A huge annotation effort – describe the entities in DB with terms from ontologies
• The Gene Ontology (http://www.geneontology.org)
• The Open Biomedical Ontologies Consortium
GO in Analysis
• Microarray analysis was one of the original visions for GO
• Modulated genes cluster around functional attributes of their proteins
• GO is also used in, for example, semantic similarity; text analysis; etc.
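One common way to ask whether modulated genes cluster around a GO term is an over-representation test based on the hypergeometric distribution. The talk does not name a specific method, so treat this as an assumed illustration; the function name and the example counts are invented.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the term of interest."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Invented counts: 1000 genes on the array, 50 annotated with the term,
# 20 modulated genes, 5 of which carry the term.
p = enrichment_p_value(population=1000, annotated=50, sample=20, hits=5)
```

A small `p` suggests the term is over-represented among the modulated genes; with many terms tested, a multiple-testing correction would also be needed.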
Biocatalogue content screenshot
Shield users and applications from service interoperability and incompatibility plumbing.
Turn your app into a service
Service providers Not only web services
How a bioinformatician assumes stuff should work
Pettifer, University of Manchester
inside
A collection of interactive tools for analysing protein sequence and structure
http://utopia.cs.manchester.ac.uk/
Semantic Descriptions of All
• Not just bio-entities in data
• The laboratory experiments by which they were generated
• The protocols for their analysis
• The services for their analysis
Semantic Integration
• Same identifiers mean integration and interoperation
• Most workflows are hobbled by syntactic and semantic heterogeneity
• Syntactic integration (Bio2RDF)
• Semantic integration via ontologies and naming schemes
• Enables better e-Science through semantic science
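The point that shared identifiers give integration almost for free can be sketched as a join on term identifiers. The record names and the `TERM:0000001` identifier are placeholders; `GO:0000368` is the identifier shown earlier in the talk, used here purely as an example key.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def join_on_term(*sources: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Merge (record, term identifier) pairs from several sources by term."""
    merged: Dict[str, List[str]] = defaultdict(list)
    for source in sources:
        for record, term in source:
            merged[term].append(record)
    return dict(merged)


# Placeholder records from two hypothetical sources annotated with term identifiers.
microarray = [("probe_12", "GO:0000368"), ("probe_40", "TERM:0000001")]
literature = [("paper_7", "GO:0000368")]
integrated = join_on_term(microarray, literature)
```

Without the common identifier, the same join would need label matching or hand curation for every pair of sources.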
Fact Management
• When “stamp collecting” we’re collecting facts
• Biology is a fact management activity
• Knowing what these facts mean is very important
• Science is performed on data and the semantics of data enable us to do science
• Semantic e-Science
Summary
• The nature of modern biology gives it interesting knowledge (fact) management issues
• It is a knowledge-based discipline
• Not unique, but often extreme
• Ontologies seen as one component in management (but not a panacea)
• E-Science gives infrastructure for management; semantics enable analysis
• Actually, very light use of semantics
More Acknowledgements
• Phil Lord
• Simon Jupp
• Carole Goble