Systems Biology Data Dissemination Working Group 25FEB2015


Systems Biology Data Dissemination Working Group

25FEB2015

Goal of the SysBio DDWG

• Coordinate approaches for sharing and dissemination of systems biology project data within the program and to the broader infectious disease and systems biology research communities

• Share best practices in data management

• Leverage external data resources for data dissemination, including the NIAID Bioinformatics Resource Centers

DDWG Activities

• Monthly 1 hr conference calls
• Annual workshops
• Membership
– Representatives from FluDyNeMo, Fluomics, OMICS-LHV, Omics4TB, MaHPIC
– Representatives from EuPathDB, PATRIC, ViPR/IRD
– Representatives from DMID
• Co-Chairs
– Michelle Craft, OMICS-LHV
– Richard Scheuermann, ViPR/IRD

Workshop Agenda

A. SysBio data management best practices (Michelle Craft, presenter)

* Data Management Best Practice Highlights

* Overview of Data Carpentry and Software Carpentry

B. SysBio center plans for project websites (Richard Scheuermann, moderator)

* Presentation of highlights by each center (5 minutes each)

* Discussion of general internal data sharing strategies and short term public dissemination plans

* Discussion of long term dissemination plans

C. Relevant public data archives (Jessie Kissinger, presenter)

* Which existing public archives could be used for long term dissemination of SysBio data

* What SysBio data types are not currently supported by public data archives

* Discussion of long term dissemination plans

D. Transcriptomic data derived from RNA-seq (Brian Aevermann, presenter)

* Determine if new transcriptomic (meta)data needs to be captured for new SysBio program

* Determine which aspects of RNA-seq data are not covered by current microarray data support

* Decide how to support data processing (meta)data – structured data fields vs free text protocols

* Determine which RNA-seq data should be disseminated and where

Best Practices Overview

• File Management
– Descriptive Names
– Metadata
– Sensitive Data
– Data Versions

• File Content
– Rows vs Columns
– Spreadsheet Mistakes
– File Formats

• Working with Data
– Find useful tools
– Quality control, data manipulation
– Software and Analysis Versions

Courtesy of Michelle Craft

Project Websites

• Informational content using content management systems, e.g. WordPress, Drupal

• Data sharing portal
– Within consortium
– Public

Previous Data Submission Workflows

[Figure: data submission workflow. Data categories: Study metadata; Experiment metadata; Primary results; Analysis metadata; Processed data matrix; Free text metadata; Host factor biosets. Resources: GEO/PeptideAtlas/SRA/MetaboLights; ViPR/IRD/PATRIC; Systems Biology sites. Arrow labels: submission; pointer.]

Experimental metadata: Study, Subject, Biological Sample, Experiment, Bioset
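The experimental metadata levels listed above (Study, Subject, Biological Sample, Experiment, Bioset) can be pictured as nested records. The following is a minimal sketch only: the nesting order, all field names, and the example values are assumptions made for illustration and are not taken from the OBX or Systems Biology data standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bioset:
    """Derived data produced from an experiment (hypothetical fields)."""
    bioset_id: str
    description: str


@dataclass
class Experiment:
    experiment_id: str
    assay_type: str
    biosets: List[Bioset] = field(default_factory=list)


@dataclass
class BiologicalSample:
    sample_id: str
    sample_type: str
    experiments: List[Experiment] = field(default_factory=list)


@dataclass
class Subject:
    subject_id: str
    organism: str
    samples: List[BiologicalSample] = field(default_factory=list)


@dataclass
class Study:
    study_id: str
    title: str
    subjects: List[Subject] = field(default_factory=list)


def count_biosets(study: Study) -> int:
    """Walk Study -> Subject -> Biological Sample -> Experiment -> Bioset."""
    return sum(
        len(experiment.biosets)
        for subject in study.subjects
        for sample in subject.samples
        for experiment in sample.experiments
    )


# Hypothetical example values, for illustration only.
study = Study(
    study_id="STUDY-1",
    title="Example study",
    subjects=[
        Subject(
            subject_id="SUBJ-1",
            organism="ferret",
            samples=[
                BiologicalSample(
                    sample_id="SAMP-1",
                    sample_type="Nasal Wash",
                    experiments=[
                        Experiment(
                            experiment_id="EXP-1",
                            assay_type="RNA-seq",
                            biosets=[Bioset("BIO-1", "Example derived gene list")],
                        )
                    ],
                )
            ],
        )
    ],
)
print(count_biosets(study))
```

Nesting each level inside its parent keeps every Bioset traceable to the sample and subject it came from. A relational layout with foreign keys would serve the same purpose.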

Data standards background

• Ontology for Biomedical Investigations (OBI)
– Bjoern Peters and The OBI Consortium. “Ontology for Biomedical Investigations”. Available from Nature Precedings (2009).
– Ryan R Brinkman, et al. “Modeling biomedical experimental processes with OBI”. Journal of Biomedical Semantics (2010).

• OBX data standard
– Developed for ImmPort using OBI structure and implemented in a relational database
– Y. Megan Kong, et al. “Toward an Ontology-Based Framework for Clinical Research Databases”. J Biomed Inform (2011).

• Systems Biology data standard
– Derived from OBX/ImmPort and extended to capture data transformations and derived data (Biosets)

[Figure: Serial Challenge Timeline, showing Sequential Sampling Studies and Serial/Longitudinal Studies. Time axes in days, with tick labels -2, 0, 3, 5, 8 and 0, 1, 3, 5, 8, 14. Strain label: A/California/07/2009.]

Courtesy of Elodie Ghedin

[Figure: Serial Challenge Timeline with specimen collection, showing Sequential Sampling Studies and Serial/Longitudinal Studies. n=4 ferrets at each time point. Time axes in days, with tick labels -2, 0, 3, 5, 8 and 0, 1, 3, 5, 8, 14. Specimen types collected: Nasal Wash, Serum, FACS Whole Blood, Blood in RNAlater, Bronchial Lavage, Lungs.]

Courtesy of Elodie Ghedin

Experiment Types

Generalized Experiment Workflow

[Figure: workflow diagram across time points T1 to T5. Nodes: subject organism; treatment agent; treatment process; treated organism; specimen isolation 1 to 3; isolated specimen 1 to 3; omics assay 1 to 3; primary data 1 to 3; data transformation 1 to 3; processed data 1 to 3; physical assessment; assessment data; sacrifice process; sacrificed organism.]


Courtesy of Adolfo Garcia-Sastre
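The Generalized Experiment Workflow above links materials, processes, and data. One way to make that linkage explicit is to record each step as a (process, input, output) triple and trace any data item back to the organism it came from. This is a minimal sketch: the arrows of the original diagram are not available here, so the edges below are an assumed reading of one specimen branch, and the treatment agent input is left out for simplicity.

```python
from typing import Dict, List, Tuple

# Each step is (process, input, output). These edges are an assumed reading
# of the diagram for one specimen branch, not a confirmed model.
STEPS: List[Tuple[str, str, str]] = [
    ("treatment process", "subject organism", "treated organism"),
    ("specimen isolation 1", "treated organism", "isolated specimen 1"),
    ("omics assay 1", "isolated specimen 1", "primary data 1"),
    ("data transformation 1", "primary data 1", "processed data 1"),
]


def trace_provenance(entity: str, steps: List[Tuple[str, str, str]]) -> List[str]:
    """Return the chain from `entity` back to its earliest known input."""
    # Map every output to the process that made it and that process's input.
    produced_by: Dict[str, Tuple[str, str]] = {
        output: (process, source) for process, source, output in steps
    }
    chain = [entity]
    while chain[-1] in produced_by:
        process, source = produced_by[chain[-1]]
        chain.extend([process, source])
    return chain


lineage = trace_provenance("processed data 1", STEPS)
print(" <- ".join(lineage))
```

Storing the steps as triples rather than free text means the link between a processed data set and its source specimen can be checked programmatically.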

Typical RNA-seq Data Processing Workflow

[Figure: workflow diagram. Data products: Raw data (fastq*); Mapped reads (SAM/BAM); Assembled transcripts (SAM/BAM); Transcript abundance values (text*); Differentially expressed genes (text*). Processing steps: TopHat analysis; Cufflinks analysis; Scaling and normalization (cuffMerge); Differential expression analysis (edgeR). Reference input: Ref Genome, fasta (version), ENSEMBL version. Data archiving targets: SRA Record; GEO; BRC.]
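As an illustration of capturing data processing (meta)data in structured fields rather than free-text protocols, the workflow steps above could be recorded as typed records and checked automatically. This is a sketch under stated assumptions: the step order follows the conventional TopHat/Cufflinks sequence rather than the original diagram's arrows, the field names are hypothetical, and tool versions are left empty because the source does not give them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProcessingStep:
    name: str                    # step label from the workflow
    tool: str                    # software named for the step
    tool_version: Optional[str]  # not given in the source; submitter must supply
    input_data: str
    output_data: str


# Step order is an assumption; versions are deliberately left unset.
PIPELINE: List[ProcessingStep] = [
    ProcessingStep("TopHat analysis", "TopHat", None,
                   "Raw data (fastq)", "Mapped reads (SAM/BAM)"),
    ProcessingStep("Cufflinks analysis", "Cufflinks", None,
                   "Mapped reads (SAM/BAM)", "Assembled transcripts (SAM/BAM)"),
    ProcessingStep("Scaling and normalization", "cuffMerge", None,
                   "Assembled transcripts (SAM/BAM)",
                   "Transcript abundance values (text)"),
    ProcessingStep("Differential expression analysis", "edgeR", None,
                   "Transcript abundance values (text)",
                   "Differentially expressed genes (text)"),
]


def find_breaks(pipeline: List[ProcessingStep]) -> List[str]:
    """Return names of steps whose input is not the previous step's output."""
    return [
        step.name
        for previous, step in zip(pipeline, pipeline[1:])
        if step.input_data != previous.output_data
    ]


def missing_versions(pipeline: List[ProcessingStep]) -> List[str]:
    """Return tools that still lack a recorded software version."""
    return [step.tool for step in pipeline if step.tool_version is None]


print("Chain breaks:", find_breaks(PIPELINE))
print("Tools missing a version:", missing_versions(PIPELINE))
```

With structured fields, gaps such as an unrecorded software version can be flagged before submission. A free-text protocol would need manual review to find the same gap.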

Future Directions

• Finalize core generic (meta)data modules for treatments, specimen sampling, organism assessments, omics assays, data processing

• Determine if additional assay-specific data fields are needed

• Decide which results data should be captured for public dissemination

• Decide which public data archives should be used

• Ensure appropriate linkage between related data