Providing an environment where every data-driven researcher will thrive

Professor Carole Goble
carole.goble@manchester.ac.uk
University of Manchester, UK

• Pipelines
– Scientific workflows over (web) services
– Data pipelines, model population and validation, simulation sweeps
– Distributed, federated datasets and analyses combined with local datasets and analysis
– Opening up resources.

• e-Laboratories
– Crowd-sourcing, group curating and sharing/reusing scientific assets.
– Web 2.0 and Semantic Web.
– Social networking, community content, collaborative filtering
– Sharing and exchanging “Research Objects”
– Opening up capabilities and capacity.

• Pan European collaboration.
• Systems Biology of Microorganisms
• 13 projects, 91 institutes
– Different research outcomes
– A cross-section of microorganisms, incl. bacteria, archaea and yeast.
• Record and describe the dynamic molecular processes occurring in microorganisms by computerized mathematical models.
– Modellers meet experimentalists
• Pool research capacities, data, models and know-how.
• Retrospectively.

http://www.sysmo.net

BaCell-SysMO, COSMIC, SUMO, KOSMOBAC, SysMO-LAB, PSYSMO, Valla, MOSES, TRANSLUCENT, STREAM, SulfoSYS + two more

Data-driven
• Multiple ‘omics
– genomics, transcriptomics
– proteomics, metabolomics
• Images
• Reaction Kinetics
• Models
• Data sets + experiments + models
– SBML, Agent-based, Mechanics based
• Analysis of data

Systems biology workflows in MCISB

• High throughput experimental methods

• Public data sets (e.g. EBI)
• Web Services
• ~ 1400 NAR January Issue

• Little databases
• Lab books
• Spreadsheets
• Private and Shared.
• Proliferation
• Derived data
• Long tail.

Little Data

My Datasets, My Analytics

Big Data, Group Science, Data services

“Little” Data, “Local” Science

Publish, Access

Massive decentralisation – wikis, sticks, spreadsheets

Massive centralisation – commons, clouds, curated core facilities

Tremendous fragility
Digital Dust in Data Tombs

Picking Pain Points. Keeping it Real.
• Project Directors
– Data remains with us under our control.
– We control who sees what.
– Just enough exchange.
• SysMO PALs
– Spreadsheets.
– Yellow Pages.
– Standard Operating Procedures.

An education

Modellers vs Experimentalists
Computational thinking
Systems thinking

Gray’s Laws (modified)
• Working Now, Working to working
– Gateways and ramps
– Jam today, jam tomorrow
– Just enough, just in time
– Work with what you got already
• 20 questions
– Is there any group generating kinetic data?
– Is this data available?
– Who is working with which organism?
– What methods are being used to determine enzyme activity?
– Under which experimental conditions are my partners working for the measurement of glucose concentration?

Help people search for and find stuff

Data, Services, Processes, Models, Software, Experts

SysMO SEEK Assets Catalogue. Archive. Social Network. Sharing Space. Gateway.
• Yellow Pages
– People. Expertise. Projects. Institutions. Facilities. Studies.
• Data
– Experimental data sets and analysed results.
– Gateway to data stores – SABIO-RK, ‘omics
• Models
– Store. Stimulate. Publish. Curate.
– Gateway to COPASI, JWS Online, BioModels.
• Processes
– Laboratory protocols – Standard Operating Procedures
– Bioinformatics analyses – computational workflows – Taverna
– Model population and validation – workflows – Taverna
– Gateway to myExperiment, MolMeth, OpenWetWare….

Interlinking ASSETS CATALOGUE

Linking data to process

Standard Operating Procedures
Models
Software
Provenance
The Lab Book
Retrospective method reconstruction
The myth of reproducible science

• Scientists willing to share methods and protocols.

• SOPs an early win.

• Defined standard metadata model based on Nature Protocols.

• Seeded.

Linking data with stuff
• Research Objects for packaging and exchanging Assets
– Workflows linked to models linked to data linked to SOPs
– Encapsulate community standards
– Mixed resources: External and central.
– Trust
– “Preservation Packet”
– Bechhofer et al 2010 forthcoming in The Future of The Web for Collaborative Science 2010.
• SBRML
– Systems Biology Results Markup Language
– To tie to the SBML
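The idea of a Research Object as a package of linked assets (workflows linked to models linked to data linked to SOPs, with a mix of external and central resources) can be sketched roughly as below. This is only an illustration under stated assumptions, not the Research Object model described by Bechhofer et al.: the `Asset` class, the `build_manifest` helper, the field names and all example locations are hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Asset:
    """One packaged item; `location` may be a central or an external address."""
    id: str
    kind: str        # e.g. "workflow", "model", "data", "sop" (hypothetical labels)
    location: str
    links: list = field(default_factory=list)  # ids of the assets this one points to


def build_manifest(title, assets):
    """Bundle assets into a JSON-serialisable manifest, rejecting dangling links."""
    known = {asset.id for asset in assets}
    for asset in assets:
        for target in asset.links:
            if target not in known:
                raise ValueError(f"{asset.id} links to unknown asset {target}")
    return {"title": title, "assets": [asdict(asset) for asset in assets]}


# Hypothetical example: a workflow linked to a model linked to data linked to an SOP.
assets = [
    Asset("wf1", "workflow", "https://example.org/workflows/1", ["m1"]),
    Asset("m1", "model", "models/glycolysis.xml", ["d1"]),
    Asset("d1", "data", "data/kinetics.csv", ["sop1"]),
    Asset("sop1", "sop", "sops/enzyme_assay.txt"),
]
manifest = build_manifest("Example research object", assets)
manifest_json = json.dumps(manifest, indent=2)
```

Checking links at packaging time is one simple way to keep a package self-consistent before it is exchanged; a real packaging format would also need to carry identity, versioning and trust information.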

At the coal-face

The Spreadsheet.
The Content Management Systems.
Legacy assets are assets.
Metadata ramps.

The Content Management System

• Lightweight and flexible. Low take-on, hidden operations costs. Knowledgeable Civilians. Looks nice.

• Anarchy amenable.

Spreadsheets

• Template distribution
• Template mapping

SysMOLab

Everyone wants metadata. No one wants to collect it.

Standards mayhem
Metadata millstones
Most data is thrown away.

Metadata for my sake
Metadata compliance by stealth
Preparation for publishing

CIMR Core Information for Metabolomics Reporting
MIABE Minimal Information About a Bioactive Entity
MIACA Minimal Information About a Cellular Assay
MIAME Minimum Information About a Microarray Experiment
MIAME/Env MIAME / Environmental transcriptomic experiment
MIAME/Nutr MIAME / Nutrigenomics
MIAME/Plant MIAME / Plant transcriptomics
MIAME/Tox MIAME / Toxicogenomics
MIAPA Minimum Information About a Phylogenetic Analysis
MIAPAR Minimum Information About a Protein Affinity Reagent
MIAPE Minimum Information About a Proteomics Experiment
MIARE Minimum Information About a RNAi Experiment
MIASE Minimum Information About a Simulation Experiment
MIENS Minimum Information about an ENvironmental Sequence
MIFlowCyt Minimum Information for a Flow Cytometry Experiment
MIGen Minimum Information about a Genotyping Experiment
MIGS Minimum Information about a Genome Sequence
MIMIx Minimum Information about a Molecular Interaction Experiment
MIMPP Minimal Information for Mouse Phenotyping Procedures
MINI Minimum Information about a Neuroscience Investigation
MINIMESS Minimal Metagenome Sequence Analysis Standard
MINSEQE Minimum Information about a high-throughput SeQuencing Experiment
MIPFE Minimal Information for Protein Functional Evaluation
MIQAS Minimal Information for QTLs and Association Studies
MIqPCR Minimum Information about a quantitative Polymerase Chain Reaction experiment
MIRIAM Minimal Information Required In the Annotation of biochemical Models
MISFISHIE Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments
STRENDA Standards for Reporting Enzymology Data
TBC Tox Biology Checklist

BioPAX: Biological Pathways Exchange http://www.biopax.org/
FuGE Functional Genomics Experiment
MGED: Microarray Experimental Conditions
MIBBI: Minimum Information for Biological and Biomedical Investigations http://www.mibbi.org/index.php/MIBBI_portal

Minimum Information Models

63%47%

Just Enough Results Model
• Harvest standards e.g. MIAME (MIBBI.org)
• Analyse consortium schemas and spreadsheets

• JERMs for each data type – microarray, metabolomics, proteomics ....

• Map project data sources to JERMs.

• Distribute JERM spreadsheet templates
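Mapping a project's own spreadsheet columns onto a shared "just enough" field set might look roughly like the sketch below. It is a minimal illustration, not the SysMO-DB implementation: the field names in `JERM_FIELDS`, the column headers in `PROJECT_MAPPING`, the sample row and the helper `map_to_jerm` are all hypothetical.

```python
import csv
import io

# Hypothetical "just enough" field set for one data type (not an official JERM).
JERM_FIELDS = ["organism", "condition", "measurement", "unit"]

# Hypothetical mapping from one project's spreadsheet headers to the shared fields.
PROJECT_MAPPING = {
    "Species": "organism",
    "Growth cond.": "condition",
    "Value": "measurement",
    "Units": "unit",
}


def map_to_jerm(csv_text, mapping, fields=JERM_FIELDS):
    """Rename a project's columns to the shared field names.

    Columns without a mapping are dropped; shared fields the project
    does not supply are left as empty strings.
    """
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        mapped = {name: "" for name in fields}
        for local_name, value in record.items():
            target = mapping.get(local_name)
            if target in mapped:
                mapped[target] = (value or "").strip()
        rows.append(mapped)
    return rows


# The project sheet has an unmapped "Notes" column and no "Units" column.
SAMPLE = "Species,Growth cond.,Value,Notes\nS. cerevisiae,aerobic,4.2,run 3\n"
jerm_rows = map_to_jerm(SAMPLE, PROJECT_MAPPING)
```

Leaving missing fields empty rather than rejecting the sheet fits the "just enough" approach: legacy spreadsheets can be registered as they are, and the gaps show where a distributed template would ask for more.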

“I only want to collect and share just enough results”

JERM Spreadsheet Templates

Controlled vocabulary plug in

• RDF for ripping, mashing and comparing spreadsheets.
• A little semantics goes a long way

Reward curation

Local curation at the point of capture – ISA-TAB for ‘omics.
Centralised curation – SBML, CellML, SBO
Automated curation.
Which data is worth curating?

• Blue-Collar Science.

• Curator Credit
• Curator Career
• Funding.
• Personal and institutional visibility

• Scholarly citation metrics

• Federate workloads

• Unpopular with the big data providers.

www.biocurators.org

Commons-based Quality Control.

Progressive Curation: “lazy evaluation” metadata

Just enough, Just in time
Jam today and Jam tomorrow

Gain vs Pain chart: Very BAD; Good, but Unlikely; Just right

Sensitive sharing. Collaborate to compete
Good reasons not to.

Just enough just in time sharing.
Data kept at host.
Registered centrally through harvesting.
Pre-Publication sharing vs Publication

Competitive advantage. Academic vanity.

Adoption. Reputation.

Scrutiny. Being scooped.

Misinterpretation. Reputation. Legal issues.

Rewards

Risks

Nature 461, 145 (10 September 2009) | doi:10.1038/461145a

Access Permissions

Just Enough Sharing

Reusing myExperiment

Reward sharing and reusing not reinventing.

Technically. Culturally. Institutionally.

Credit and Risk Mitigation.

Attribution. Trust.

Credit

Reward and Provenance

Reusing myExperiment

Some pretty key things
• Data citation
• Stable and shared ids and names
– A nightmare.
– Sharednames.org
– Biosharing.org
• Versioning and Provenance
– Models, software, data sets
– Ensembl web service doesn’t report version number.

Data commons, Data havens

For data after the project has ended.

For the common good or me.
Tidy and untidy data.

Beth’s Provenance Objects

Bio2RDF

Access and availability of data and data analysis resources

Web services underpin the ESFRI ELIXIR programme.
Interfaces that are understandable and stable.

Designed for people too.
No access, no tools, no point (Keith Haines)

Deposition to community databanks that minimise pain.

What is it?

Is it working?

Data analysis, model population and data pipelining ramps.

Crossing the adoption chasm
There is a world of complexity for data preparation, processing and analysis
Science Informatics Sweatshops.
E-Laboratories. Workflows. Portals.
Pre-cooked processes and process templates. Pre-cooked interfaces.
Training.

Lymphoma Prediction Workflow

Wei Tan, Univ. Chicago

Ack. Juli Klemm, Xiaopeng Bian, Rashmi Srinivasa (NCI), Jared Nedzel (MIT)

Workflow steps: Microarray from tumor tissue; Microarray preprocessing; Lymphoma prediction

caArray, GenePattern

Use gene-expression patterns associated with two lymphoma types to predict the type of an unknown sample.

myExperiment Communities

• Supermarket shoppers

• Tool builders

• Trainers and Trainees

Drop and Compute

Local folder synchronised and shared via cloud

Condor job submitted by drag and drop

Results appear in Dropbox

Ian Cottam
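The drag-and-drop submission pattern above can be sketched as a small polling loop over the synchronised folder. This is a rough illustration under stated assumptions, not how the Drop and Compute service is actually built: the `.sub` suffix convention and the helpers `find_new_jobs` and `submit_command` are hypothetical, and the sketch only builds the `condor_submit` command line rather than running it.

```python
import tempfile
from pathlib import Path


def find_new_jobs(drop_dir, seen):
    """Return submit files that appeared in drop_dir since the last poll."""
    current = {path.name for path in Path(drop_dir).glob("*.sub")}
    new = sorted(current - seen)
    seen |= current  # remember what has been noticed (updates the caller's set)
    return new


def submit_command(drop_dir, name):
    """Build the command for one job; condor_submit is HTCondor's submit tool."""
    return ["condor_submit", str(Path(drop_dir) / name)]


# A temporary directory stands in for the cloud-synchronised folder.
with tempfile.TemporaryDirectory() as drop_dir:
    seen = set()
    first_poll = find_new_jobs(drop_dir, seen)    # nothing dropped yet
    (Path(drop_dir) / "sweep.sub").write_text("executable = run.sh\nqueue\n")
    second_poll = find_new_jobs(drop_dir, seen)   # the dropped file is noticed
    third_poll = find_new_jobs(drop_dir, seen)    # already seen, not repeated
    commands = [submit_command(drop_dir, name) for name in second_poll]
```

In a running service each command would be handed to something like `subprocess.run`, with the job's output directed back into the shared folder so that results appear on the user's machine through the same synchronisation.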

Bashing against local IT

NO – you can’t access that datastore / run your analysis.

Joined up thinking.

Data + Publications
Data trapped in documents
Supplemental information
Text mining
Text mining workflows
Text mining to find method and controls

Reflect. Elsevier Challenge Winner 2009

Manual and Auto-mark up [Oscar-3]

Do not underestimate the power of Interactive Visualisation and Browsing
Pre-cooked complex queries.
Navigation.
With my data.
At the click of a button.

• Distributed Annotation Service
• Upload and overlay my data

SysMO summary
• Providing an environment where every data-driven researcher will thrive
• Reality is messy.
– Extreme Technology Determinism vs Voluntarist Sociocultural shaping
• Extreme and continuous partnership with users.
– Act Local Think Global
• Agile development environment facilitated stream of features to tackle pain points.
– Leverage other e-Laboratories, Maintaining scientists’ buy-in.
• Socio-Political Axis dominates the Technical Axis.
– Collaboration evolutions, Confidence in exchange.

Six Action Plan Areas

Coordination, Sustainability, Interoperability, Adoption, Capacity, Data

Capacity building of our skills base

• Influence training and capacity building programmes.

• Promote training for young and mid-career researchers and research technologists.

• Enable mixed skilled research teams to include research and information technologists.

• Value and reward highly skilled research and information technologists within HE institutions with a career structure.

Data Silo culture

Funding silos

Discipline silos

Academic Credit and Risk Mitigation for sharing, curating, and reusing not reinventing

Data and Software are free like puppies are free

University of Stellenbosch, South Africa
University of Manchester, UK

Jacky Snoep

EML Research gGmbH, Germany

Isabel Rojas

University of Manchester, UK

Olga Krebs

Wolfgang Müller

Sergejs Aleksejevs

Carole Goble

Stuart Owen

Katy Wolstencroft

Finn Bacal

Links
• myGrid Project
– http://www.mygrid.org.uk
• SysMO-DB
– http://www.sysmo-db.org
• myExperiment
– http://www.myexperiment.org
• Taverna
– http://www.taverna.org.uk
• JWS Online
– http://jjj.biochem.sun.ac.za/
• SABIO-RK
– http://sabio.villa-bosch.de/
