Providing an environment where every data-driven researcher will thrive
Professor Carole Goble, carole.goble@manchester.ac.uk
University of Manchester, UK


Page 1:

Providing an environment where every data-driven researcher will thrive

Professor Carole Goble, carole.goble@manchester.ac.uk
University of Manchester, UK

Page 2:

• Pipelines
  – Scientific workflows over (web) services
  – Data pipelines, model population and validation, simulation sweeps
  – Distributed, federated datasets and analyses combined with local datasets and analysis
  – Opening up resources.

• e-Laboratories
  – Crowd-sourcing, group curating and sharing/reusing scientific assets.
  – Web 2.0 and Semantic Web.
  – Social networking, community content, collaborative filtering
  – Sharing and exchanging “Research Objects”
  – Opening up capabilities and capacity.

Page 3:

• Pan European collaboration.
• Systems Biology of Microorganisms
  – 13 projects, 91 institutes
  – Different research outcomes
  – A cross-section of microorganisms, incl. bacteria, archaea and yeast.
• Record and describe the dynamic molecular processes occurring in microorganisms by computerized mathematical models.
  – Modellers meet experimentalists
• Pool research capacities, data, models and know-how.
• Retrospectively.

http://www.sysmo.net

Projects: BaCell-SysMO, COSMIC, SUMO, KOSMOBAC, SysMO-LAB, PSYSMO, Valla, MOSES, TRANSLUCENT, STREAM, SulfoSYS, + two more

Page 4:

Data-driven
• Multiple ‘omics
  – genomics, transcriptomics
  – proteomics, metabolomics
• Images
• Reaction Kinetics
• Models
• Data sets + experiments + models
  – SBML, Agent-based, Mechanics based
• Analysis of data

Page 5:

Systems biology workflows in MCISB

Page 6:

• High throughput experimental methods
• Public data sets (e.g. EBI)
• Web Services
• ~ 1400 NAR January Issue

Little Data
• Little databases
• Lab books
• Spreadsheets
• Private and Shared.
• Proliferation
• Derived data
• Long tail.

Page 7:

[Diagram labels: My Datasets; My Analytics; Big Data, Group Science, Data services; “Little” Data, “Local” Science; Publish; Access]

Page 8:

Massive decentralisation – wikis, sticks, spreadsheets

Massive centralisation – commons, clouds, curated core facilities

Tremendous fragility
Digital Dust in Data Tombs

Page 9:

Picking Pain Points. Keeping it Real.
• Project Directors
  – Data remains with us under our control.
  – We control who sees what.
  – Just enough exchange.
• SysMO PALs
  – Spreadsheets.
  – Yellow Pages.
  – Standard Operating Procedures.

Page 10:

An education

Modellers vs Experimentalists
Computational thinking
Systems thinking

Page 11:

Gray’s Laws (modified)
• Working Now, Working to working
  – Gateways and ramps
  – Jam today, jam tomorrow
  – Just enough, just in time
  – Work with what you got already
• 20 questions
  – Is there any group generating kinetic data?
  – Is this data available?
  – Who is working with which organism?
  – What methods are being used to determine enzyme activity?
  – Under which experimental conditions are my partners working for the measurement of glucose concentration?

Page 12:

Help people search for and find stuff

Data
Services
Processes
Models
Software
Experts

Page 13:

SysMO SEEK Assets Catalogue. Archive. Social Network. Sharing Space. Gateway.
• Yellow Pages
  – People. Expertise. Projects. Institutions. Facilities. Studies.
• Data
  – Experimental data sets and analysed results.
  – Gateway to data stores – SABIO-RK, ‘omics
• Models
  – Store. Simulate. Publish. Curate.
  – Gateway to COPASI, JWS Online, BioModels.
• Processes
  – Laboratory protocols – Standard Operating Procedures
  – Bioinformatics analyses – computational workflows – Taverna
  – Model population and validation – workflows – Taverna
  – Gateway to myExperiment, MolMeth, OpenWetWare….

[Side label: Interlinking ASSETS CATALOGUE]

Page 14:

Page 15:

Linking data to process

Standard Operating Procedures
Models
Software
Provenance
The Lab Book
Retrospective method reconstruction
The myth of reproducible science

Page 16:

• Scientists willing to share methods and protocols.

• SOPs an early win.

• Defined standard metadata model based on Nature Protocols.

• Seeded.

Page 17:

Linking data with stuff
• Research Objects for packaging and exchanging Assets
  – Workflows linked to models linked to data linked to SOPs
  – Encapsulate community standards
  – Mixed resources: External and central.
  – Trust
  – “Preservation Packet”
  – Bechhofer et al 2010 forthcoming in The Future of The Web for Collaborative Science 2010.
• SBRML
  – Systems Biology Results Markup Language
  – To tie to the SBML
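
The packaging idea on this slide can be illustrated with a minimal sketch. It is not the Research Object model from Bechhofer et al., only one possible reading of "workflows linked to models linked to data linked to SOPs" with mixed external and central resources. The class names, relation names and example.org URIs are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Asset:
    kind: str        # e.g. "workflow", "model", "data", "sop"
    uri: str         # where the asset lives (hypothetical URIs below)
    external: bool   # True if held outside the central catalogue


@dataclass
class ResearchObject:
    title: str
    assets: dict = field(default_factory=dict)   # name -> Asset
    links: list = field(default_factory=list)    # (source, relation, target)

    def add(self, name: str, asset: Asset) -> None:
        self.assets[name] = asset

    def link(self, source: str, relation: str, target: str) -> None:
        # Refuse links that point at assets missing from the package.
        for name in (source, target):
            if name not in self.assets:
                raise KeyError(f"unknown asset: {name}")
        self.links.append((source, relation, target))

    def to_json(self) -> str:
        # Serialise the whole package so it can be exchanged as one unit.
        return json.dumps(asdict(self), indent=2)


ro = ResearchObject("Glucose uptake study")
ro.add("workflow", Asset("workflow", "http://example.org/workflows/1", True))
ro.add("model", Asset("model", "http://example.org/models/7", False))
ro.add("data", Asset("data", "http://example.org/data/42", False))
ro.add("sop", Asset("sop", "http://example.org/sops/3", False))
ro.link("workflow", "populates", "model")
ro.link("model", "fitted_to", "data")
ro.link("data", "produced_by", "sop")
```

Keeping the links inside the package, rather than in each asset, is one way to let external and central resources sit side by side without changing either.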

Page 18:

At the coal-face

The Spreadsheet.
The Content Management Systems.
Legacy assets are assets.
Metadata ramps.

Page 19:

The Content Management System

• Lightweight and flexible. Low take-on, hidden operations costs. Knowledgeable Civilians. Looks nice.

• Anarchy amenable.

Page 20:

Spreadsheets

• Template distribution
• Template mapping

SysMOLab

Page 21:

Everyone wants metadata. No one wants to collect it.

Standards mayhem
Metadata millstones
Most data is thrown away.

Metadata for my sake
Metadata compliance by stealth
Preparation for publishing

Page 22:

Minimum Information Models

CIMR Core Information for Metabolomics Reporting
MIABE Minimal Information About a Bioactive Entity
MIACA Minimal Information About a Cellular Assay
MIAME Minimum Information About a Microarray Experiment
MIAME/Env MIAME / Environmental transcriptomic experiment
MIAME/Nutr MIAME / Nutrigenomics
MIAME/Plant MIAME / Plant transcriptomics
MIAME/Tox MIAME / Toxicogenomics
MIAPA Minimum Information About a Phylogenetic Analysis
MIAPAR Minimum Information About a Protein Affinity Reagent
MIAPE Minimum Information About a Proteomics Experiment
MIARE Minimum Information About a RNAi Experiment
MIASE Minimum Information About a Simulation Experiment
MIENS Minimum Information about an ENvironmental Sequence
MIFlowCyt Minimum Information for a Flow Cytometry Experiment
MIGen Minimum Information about a Genotyping Experiment
MIGS Minimum Information about a Genome Sequence
MIMIx Minimum Information about a Molecular Interaction Experiment
MIMPP Minimal Information for Mouse Phenotyping Procedures
MINI Minimum Information about a Neuroscience Investigation
MINIMESS Minimal Metagenome Sequence Analysis Standard
MINSEQE Minimum Information about a high-throughput SeQuencing Experiment
MIPFE Minimal Information for Protein Functional Evaluation
MIQAS Minimal Information for QTLs and Association Studies
MIqPCR Minimum Information about a quantitative Polymerase Chain Reaction experiment
MIRIAM Minimal Information Required In the Annotation of biochemical Models
MISFISHIE Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments
STRENDA Standards for Reporting Enzymology Data
TBC Tox Biology Checklist

BioPAX: Biological Pathways Exchange http://www.biopax.org/
FuGE Functional Genomics Experiment
MGED: Microarray Experimental Conditions
MIBBI: Minimum Information for Biological and Biomedical Investigations http://www.mibbi.org/index.php/MIBBI_portal

63% 47%

Page 23:

Just Enough Results Model
• Harvest standards e.g. MIAME (MIBBI.org)
• Analyse consortium schemas and spreadsheets
• JERMs for each data type – microarray, metabolomics, proteomics ....
• Map project data sources to JERMs.
• Distribute JERM spreadsheet templates

“I only want to collect and share just enough results”

Page 24:

JERM Spreadsheet Templates

Controlled vocabulary plug in

• RDF for ripping, mashing and comparing spreadsheets.
• A little semantics goes a long way
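
As a rough illustration of "ripping" a template-conformant spreadsheet into RDF and then comparing two of them, here is a minimal sketch that uses only the Python standard library and emits N-Triples. The namespace, the column names and the idea of treating each column header as a property are assumptions for illustration; the slides do not say how the SysMO tooling does this.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace; the slides do not give real JERM URIs.
JERM_NS = "http://example.org/jerm/"


def escape_literal(value: str) -> str:
    """Escape a string for use inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def rows_to_ntriples(csv_text: str, id_column: str) -> list:
    """Emit one triple per non-empty cell, keyed on the row's id column."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{JERM_NS}sample/{quote(row[id_column])}>"
        for column, cell in row.items():
            if column is None or column == id_column or not cell:
                continue
            predicate = f"<{JERM_NS}{quote(column.strip())}>"
            triples.append(f'{subject} {predicate} "{escape_literal(cell)}" .')
    return triples


def predicates(triples: list) -> set:
    """Collect the property URIs used in a list of N-Triples lines."""
    return {t.split(" ")[1] for t in triples}


def shared_properties(a: list, b: list) -> set:
    """Compare two converted spreadsheets by the properties both use."""
    return predicates(a) & predicates(b)


lab_a = "sample_id,organism,glucose_mM\ns1,S. cerevisiae,2.5\n"
lab_b = "sample_id,organism,ph\nx9,B. subtilis,7.0\n"
triples_a = rows_to_ntriples(lab_a, "sample_id")
triples_b = rows_to_ntriples(lab_b, "sample_id")
common = shared_properties(triples_a, triples_b)
```

Once both labs' sheets are triples over a shared vocabulary, finding what they have in common is a set intersection, which is the sense in which a little semantics goes a long way.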

Page 25:

Page 26:

Reward curation

Local curation at the point of capture – ISA-TAB for ‘omics.
Centralised curation – SBML, CellML, SBO
Automated curation.
Which data is worth curating?

Page 27:

• Blue-Collar Science.
• Curator Credit
• Curator Career
• Funding.
• Personal and institutional visibility
• Scholarly citation metrics
• Federate workloads
• Unpopular with the big data providers.

www.biocurators.org

Page 28:

Commons-based Quality Control.

Page 29:

Progressive Curation: “lazy evaluation” metadata

Just enough, Just in time
Jam today and Jam tomorrow

[Chart: Gain against Pain, with regions labelled “Very BAD”, “Good, but Unlikely” and “Just right”]
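
One way to read "lazy evaluation" metadata is that a field is only demanded at the stage where it is first needed. The sketch below illustrates that reading. The three stages and every field name are hypothetical, chosen only to show the mechanism, and are not taken from SysMO.

```python
# Hypothetical stages and field names, for illustration only.
STAGES = ["capture", "share", "publish"]
REQUIRED_BY_STAGE = {
    "capture": ["title", "creator"],
    "share": ["organism", "sop"],
    "publish": ["units", "licence"],
}


def missing_metadata(record: dict, stage: str) -> list:
    """Return the fields still needed before the record can reach `stage`.

    Requirements accumulate: publishing needs everything asked for at
    capture and share as well.
    """
    needed = []
    for earlier in STAGES[: STAGES.index(stage) + 1]:
        needed.extend(REQUIRED_BY_STAGE[earlier])
    return [name for name in needed if not record.get(name)]


record = {"title": "Glucose pulse", "creator": "A. Researcher"}
at_capture = missing_metadata(record, "capture")
at_share = missing_metadata(record, "share")
at_publish = missing_metadata(record, "publish")
```

The researcher pays a small cost at capture and is only asked for more when sharing or publishing makes the extra fields worth supplying.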

Page 30:

Sensitive sharing.
Collaborate to compete
Good reasons not to.

Just enough just in time sharing.
Data kept at host.
Registered centrally through harvesting.
Pre-Publication sharing vs Publication

Page 31:

Competitive advantage.
Academic vanity.

Adoption.
Reputation.

Scrutiny.
Being scooped.

Misinterpretation.
Reputation.
Legal issues.

[Axis labels: Rewards, Risks]

Nature 461, 145 (10 September 2009) | doi:10.1038/461145a

Page 32:

Access Permissions

Just Enough Sharing

Reusing myExperiment

Page 33:

Reward sharing and reusing not reinventing.

Technically. Culturally. Institutionally.

Credit and Risk Mitigation.

Page 34:

Attribution.
Trust.

Credit

Reward and Provenance

Reusing myExperiment

Page 35:

Some pretty key things
• Data citation
• Stable and shared ids and names
  – A nightmare.
  – Sharednames.org
  – Biosharing.org
• Versioning and Provenance
  – Models, software, data sets
  – Ensembl web service doesn’t report version number.

Page 36:

Data commons, Data havens

For data after the project has ended.

For the common good or me.
Tidy and untidy data.

Beth’s Provenance Objects

Bio2RDF

Page 37:

Access and availability of data and data analysis resources

Web services underpin the ESFRI ELIXIR programme.
Interfaces that are understandable and stable.
Designed for people too.
No access, no tools, no point (Keith Haines)

Deposition to community databanks that minimise pain.

Page 38:

What is it?

Is it working?

Page 39:

Data analysis, model population and data pipelining ramps.

Crossing the adoption chasm
There is a world of complexity for data preparation, processing and analysis
Science Informatics Sweatshops.
E-Laboratories. Workflows. Portals.
Pre-cooked processes and process templates. Pre-cooked interfaces.
Training.

Page 40:

Lymphoma Prediction Workflow

Use gene-expression patterns associated with two lymphoma types to predict the type of an unknown sample.

[Workflow diagram. Steps: microarray from tumor tissue; microarray preprocessing; lymphoma prediction. Services: caArray, GenePattern]

Wei Tan, Univ. Chicago
Ack. Juli Klemm, Xiaopeng Bian, Rashmi Srinivasa (NCI), Jared Nedzel (MIT)

Page 41:

myExperiment Communities

• Supermarket shoppers

• Tool builders

• Trainers and Trainees

Page 42:

Drop and Compute

Local folder synchronised and shared via cloud

Condor job submitted by drag and drop

Results appear in Dropbox

Ian Cottam
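
The drag-and-drop submission described on this slide could work along the following lines. This is a guess at the mechanism, not a description of Ian Cottam's implementation: it assumes job descriptions arrive as `*.sub` files in the synchronised folder and that HTCondor's `condor_submit` command is available on the machine that polls the folder. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def default_submit(job_file: Path) -> None:
    # Assumption: condor_submit is on PATH and takes a submit description file.
    subprocess.run(["condor_submit", job_file.name],
                   cwd=job_file.parent, check=True)


def submit_new_jobs(drop_dir: Path, seen: set, submit=default_submit) -> list:
    """Submit every *.sub file in drop_dir that has not been handled yet.

    `seen` holds the names already submitted, so the function can be called
    repeatedly (for example once a minute) without resubmitting a job.
    """
    submitted = []
    for job_file in sorted(drop_dir.glob("*.sub")):
        if job_file.name in seen:
            continue
        submit(job_file)
        seen.add(job_file.name)
        submitted.append(job_file.name)
    return submitted
```

Because the jobs run with the drop folder as working directory, any output files they write land in the same synchronised folder, which is how results could "appear in Dropbox" without a separate download step.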

Page 43:

Bashing against local IT

NO – you can’t access that datastore / run your analysis.
Joined up thinking.

Page 44:

Data + Publications
Data trapped in documents
Supplemental information
Text mining
Text mining workflows
Text mining to find method and controls

Page 45:

Reflect. Elsevier Challenge Winner 2009

Page 46:

Manual and Auto-mark up [Oscar-3]

Page 47:

Do not underestimate the power of Interactive Visualisation and Browsing
Pre-cooked complex queries.
Navigation.
With my data.
At the click of a button.

Page 48:

• Distributed Annotation Service
• Upload and overlay my data

Page 49:

SysMO summary
• Providing an environment where every data-driven researcher will thrive
• Reality is messy.
  – Extreme Technology Determinism vs Voluntarist Sociocultural shaping
• Extreme and continuous partnership with users.
  – Act Local Think Global
• Agile development environment facilitated stream of features to tackle pain points.
  – Leverage other e-Laboratories, Maintaining scientists’ buy-in.
• Socio-Political Axis dominates the Technical Axis.
  – Collaboration evolutions, Confidence in exchange.

Page 50:

Six Action Plan Areas

Coordination
Sustainability
Interoperability
Adoption
Capacity
Data

Page 51:

Capacity building of our skills base

• Influence training and capacity building programmes.

• Promote training for young and mid-career researchers and research technologists.

• Enable mixed skilled research teams to include research and information technologists.

• Value and reward highly skilled research and information technologists within HE institutions with a career structure.

Page 52:

Data Silo culture

Funding silos

Discipline silos

Page 53:

Academic Credit and Risk Mitigation

for sharing, curating, and reusing not reinventing

Page 54:

Data and Software is free like puppies are free

Page 55:

University of Stellenbosch, South Africa
University of Manchester, UK

Jacky Snoep

EML Research gGmbH, Germany

Isabel Rojas

University of Manchester, UK

Olga Krebs

Wolfgang Müller

Sergejs Aleksejevs

Carole Goble

Stuart Owen

Katy Wolstencroft

Finn Bacall

Page 56:

Links
• myGrid Project
  – http://www.mygrid.org.uk
• SysMO-DB
  – http://www.sysmo-db.org
• myExperiment
  – http://www.myexperiment.org
• Taverna
  – http://www.taverna.org.uk
• JWS Online
  – http://jjj.biochem.sun.ac.za/
• SABIO-RK
  – http://sabio.villa-bosch.de/