GigaDB explained Christopher I Hunter International Training Workshop on Big Data 11-Mar-2015

GigaDB explained

Christopher I Hunter

International Training Workshop on Big Data

11-Mar-2015

Presentation contents

• GigaDB introduction
• Data types hosted
• Anatomy of a dataset DOI
• Navigate GigaDB site
• Search tool
• Submission tool
• The extensible metadata schema

--------------------- Coffee Break ---------------------

• ISA tools introduction
• ISA-Tab as an exchange format
• ISA in action

GigaScience Database

Giga-overview

• GigaDB hosts biological data (any type of data related to, or used in, biological studies)

• Primarily associated with the BMC journal, GigaScience

• Funded by BGI-Research and China National GeneBank

• Currently ~160 datasets available
• Genomic datasets represent the majority of data (~70%)
• ~90% of all data is from BGI (or partner) studies
• But there are 13 different data types represented
• All manually curated

Data types

• Various nucleotide data types:
– Genomic, Transcriptomic, Metagenomic, Epigenomic, Genome mapping
• Mass spectrometry:
– Proteomics, Metabolomics, MS-Imaging
• Software & Workflows
• Other:
– Imaging, Neuroscience, Network analysis

Navigating the GigaDB website

• Home page
• Dataset DOI page
• Data download options
• Search tool
• Submission:
– Who should submit to GigaDB
– How to submit data

Anatomy of a GigaDB entry

• All relevant information is held together in packets called Datasets

• Each dataset has a stable DOI page

• If required there can be a hierarchy of datasets
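Because each dataset sits behind a stable DOI, a link to it can be built mechanically from the identifier. The following is a minimal sketch, assuming the public doi.org resolver; the function and class names are hypothetical, and the example DOI (10.5524/100003) is the macaque dataset shown in the submission workflow slides.

```python
from typing import NamedTuple


class Doi(NamedTuple):
    prefix: str  # registrant prefix, e.g. "10.5524"
    suffix: str  # item identifier chosen by the registrant, e.g. "100003"


def parse_doi(doi: str) -> Doi:
    """Split a DOI such as '10.5524/100003' into prefix and suffix."""
    prefix, sep, suffix = doi.strip().partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a valid DOI: {doi!r}")
    return Doi(prefix, suffix)


def doi_url(doi: str) -> str:
    """Return the resolver URL for a DOI (assumes the public doi.org resolver)."""
    parsed = parse_doi(doi)
    return f"https://doi.org/{parsed.prefix}/{parsed.suffix}"


print(doi_url("10.5524/100003"))
```

The DOI stays the same even if the hosting site is reorganised, which is why it is the identifier to use when citing a dataset.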

• Title
• Study type(s)
• Image
• Citation
• Description
• Funders
• Links to Google Scholar and EuroPMC to see who has cited this dataset
• Email submitter
• Link to manuscript
• Links to external resources

Cont.

• Samples used in the study

• Files listed as part of the study

• History of dataset changes

• Social media links

• Links to other datasets of similar nature

Downloading the data

FTP
• Conventional/easy to use
• Can pull files individually from the web page
• 1 or multiple files using the Unix command line
• Speed = up to 1 Mb/sec
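As an alternative to the Unix command line, several files can be pulled in one go from a script. This is a rough sketch using Python's standard ftplib; the host name and file paths in the usage comment are placeholders, not real GigaDB locations, and `download_files` is a hypothetical helper.

```python
from ftplib import FTP
from pathlib import Path
from typing import Iterable, List


def download_files(ftp: FTP, remote_paths: Iterable[str], dest_dir: Path) -> List[Path]:
    """Download each remote file into dest_dir over an already open FTP connection."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for remote in remote_paths:
        local = dest_dir / Path(remote).name
        with open(local, "wb") as handle:
            # RETR streams the file in binary mode; each chunk is passed to handle.write
            ftp.retrbinary(f"RETR {remote}", handle.write)
        saved.append(local)
    return saved


# Usage (host and paths are placeholders):
#
# with FTP("ftp.example.org") as ftp:
#     ftp.login()  # anonymous login
#     download_files(ftp, ["/pub/dataset/readme.txt"], Path("downloads"))
```

The connection is passed in rather than opened inside the function, so one login can be reused for many files.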

Downloading the data

Aspera
• Requires plugin download
• Only available to use via web-app
• 1 or multiple files
• Speed = up to 100 Mb/sec
– (e.g. up to 100x faster than FTP)
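To put the two quoted speeds in perspective, here is a small worked calculation. The rates are the best-case "up to" figures from the slides; the dataset size is a made-up example, expressed in the same "Mb" unit as the rates so the unit cancels out.

```python
def transfer_seconds(size_mb: float, rate_mb_per_sec: float) -> float:
    """Best-case transfer time: size divided by rate (same size unit in both)."""
    if rate_mb_per_sec <= 0:
        raise ValueError("rate must be positive")
    return size_mb / rate_mb_per_sec


FTP_RATE = 1.0       # "up to 1 Mb/sec" from the slide
ASPERA_RATE = 100.0  # "up to 100 Mb/sec" from the slide

size = 50_000.0  # hypothetical dataset size, in the same unit as the rates

ftp_hours = transfer_seconds(size, FTP_RATE) / 3600
aspera_minutes = transfer_seconds(size, ASPERA_RATE) / 60
print(f"FTP: {ftp_hours:.1f} h, Aspera: {aspera_minutes:.1f} min")
# prints "FTP: 13.9 h, Aspera: 8.3 min"
```

Real transfers will be slower than these figures, since both rates are upper bounds.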

Search tool

• Search for the term “genome” in the search bar at the top of any dataset page:

Search tool

Key to the result icons: GigaDB datasets, Samples, Files

It will only display files that contain matches to the search term

Submitting data to GigaDB

• All data submitted to GigaDB must be fully consented for public release
• Where appropriate, data should be submitted to established public archives first (e.g. INSDC)

• At present we only host data associated with GigaScience journal articles, or by prior approval by the Editors of GigaScience.

• Potential submitters should approach the editors and database curators by email to discuss possible inclusion.

Submission Workflow

• Submission: submitter submits an Excel spreadsheet or uses the online wizard
• Validation checks:
– Fail: submitter is provided with an error report
– Pass: dataset is uploaded to GigaDB
• Curator Review: curator contacts submitter with DOI citation and to arrange file transfer (and resolve any other questions/issues)
• Files: submitter provides files by FTP or Aspera
• DOI assigned: DataCite XML file is generated and registered with DataCite
• Public GigaDB dataset: curator makes dataset public (can be set as a future date if required), e.g. DOI 10.5524/100003, Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011)
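The validation step of the workflow (fail: error report to the submitter; pass: dataset uploaded) might look roughly like the sketch below. The slides do not list the actual checks, so the required fields and the email test here are purely hypothetical.

```python
from typing import Dict, List

# Hypothetical required fields; the real checks are not listed in the slides.
REQUIRED_FIELDS = ("title", "description", "submitter_email")


def validate_submission(metadata: Dict[str, str]) -> List[str]:
    """Return a list of error messages; an empty list means the checks passed."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not metadata.get(field, "").strip():
            errors.append(f"missing required field: {field}")
    email = metadata.get("submitter_email", "")
    if email and "@" not in email:
        errors.append("submitter_email does not look like an email address")
    return errors


def process_submission(metadata: Dict[str, str]) -> str:
    """Mirror the two branches of the workflow diagram."""
    errors = validate_submission(metadata)
    if errors:
        # Fail branch: the error report goes back to the submitter
        return "FAIL: " + "; ".join(errors)
    # Pass branch: the dataset is uploaded and moves on to curator review
    return "PASS: dataset uploaded to GigaDB"


print(process_submission({"title": "Example", "description": "",
                          "submitter_email": "someone@example.org"}))
```

Collecting all errors before reporting, rather than stopping at the first, lets the submitter fix everything in one round.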

Submission

• Once approved there are two options for submitting metadata:
– offline using an Excel spreadsheet
– online using the wizard
• Soon to be a third option (ISA-tab)

Online vs Offline

Online (wizard):
• Guided
• Good for a few large samples
• Allows greater addition of linking

Offline (Excel spreadsheet):
• Limited documentation
• Best for large numbers of samples/files

Submission wizard

Register, log in, go to your profile page:

Add all the links to related data

Add all Sample metadata

Manual check / curate

• After metadata has been submitted it is checked by a curator

• A private upload area is assigned and user can upload data files by Aspera or FTP

Mint the DOI

• Once all the files and metadata are stored and linked appropriately we will mint the DOI with our partners at DataCite.
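As an illustration of the kind of record that is registered with DataCite, here is a simplified sketch. The element names (identifier, creators, titles, publisher, publicationYear) follow the DataCite metadata schema, but this version omits the namespace and other mandatory parts, so it is not schema-valid; the function name is hypothetical, and the creator and publisher values in the usage example are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import List


def build_datacite_xml(doi: str, title: str, creators: List[str],
                       publisher: str, year: int) -> str:
    """Build a simplified DataCite-style record (no namespace, not schema-validated)."""
    resource = ET.Element("resource")
    identifier = ET.SubElement(resource, "identifier", identifierType="DOI")
    identifier.text = doi
    creators_el = ET.SubElement(resource, "creators")
    for name in creators:
        creator = ET.SubElement(creators_el, "creator")
        ET.SubElement(creator, "creatorName").text = name
    titles = ET.SubElement(resource, "titles")
    ET.SubElement(titles, "title").text = title
    ET.SubElement(resource, "publisher").text = publisher
    ET.SubElement(resource, "publicationYear").text = str(year)
    return ET.tostring(resource, encoding="unicode")


record = build_datacite_xml(
    doi="10.5524/100003",
    title="Genomic data from the crab-eating macaque/cynomolgus monkey "
          "(Macaca fascicularis)",
    creators=["Placeholder, Author"],   # placeholder, not the real author list
    publisher="GigaScience Database",   # assumed publisher name
    year=2011,
)
print(record)
```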

Publish the dataset

• Publication date = date on which the DOI is released to the public.
• Immediately added to the GigaDB RSS feed.
• Any other promotion of datasets is done in conjunction with manuscript publication

Behind the scenes

The extensible metadata schema

• Spectrum of data being hosted is very broad
• Database needs to:
– Be structured, but allow wide variety
– Be able to incorporate multiple standards
– Utilise ontologies
– Link to multiple external sources

The GigaDB schema looks like this:

Just the Dataset tables

dataset: id, submitter_id, image_id, identifier, title, description, dataset_size, ftp_site, upload_status, excelfile, excelfile_md5, publication_date, modification_date, publisher_id, token

Store wide variety of attributes

attribute: id, attribute_name, definition, model, structured_comment_name, value_syntax, allowed_units, occurrence, ontology_link, note
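One way to picture the two tables is as simple record types. The sketch below uses the column names from the slides; the Python types, the defaults and the per-field comments are assumptions, since the slides do not give column types, and the example values are invented.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Dataset:
    id: int
    submitter_id: int
    image_id: Optional[int]
    identifier: str                 # assumed to hold the DOI suffix, e.g. "100003"
    title: str
    description: str
    dataset_size: int               # unit assumed to be bytes
    ftp_site: str
    upload_status: str
    excelfile: Optional[str] = None
    excelfile_md5: Optional[str] = None
    publication_date: Optional[date] = None
    modification_date: Optional[date] = None
    publisher_id: Optional[int] = None
    token: Optional[str] = None


@dataclass
class Attribute:
    id: int
    attribute_name: str
    definition: str = ""
    model: str = ""                     # standard or checklist the attribute comes from
    structured_comment_name: str = ""
    value_syntax: str = ""
    allowed_units: str = ""
    occurrence: str = ""
    ontology_link: str = ""
    note: str = ""


# Invented example rows, for illustration only
example = Dataset(id=1, submitter_id=7, image_id=None, identifier="100003",
                  title="Example dataset", description="Illustrative record",
                  dataset_size=0, ftp_site="ftp://ftp.example.org/",
                  upload_status="Published")
depth = Attribute(id=1, attribute_name="depth", model="MIxS", allowed_units="m")
print(example.identifier, depth.attribute_name)
```

Keeping attributes in their own table, rather than as dataset columns, is what lets new attribute names be added without changing the schema.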

Checklists

• Different things are important in different experiment types

• Various communities have standard checklists they try to adhere to

• GigaDB can leverage those different checklists and integrate them where possible.

MIxS

• Genomic Standards Consortium (GSC)
– Minimum Information about any (x) Sequence
http://gensc.org/projects/mixs-gsc-project/
• It includes:
– A set of core descriptors for sequence data
– A set of measurements and observations describing the environment of the sample
– Goes beyond the minimum, by defining ~370 attributes that could be used.
• It is hoped that the adoption of this standard will elevate the quality, accessibility and utility of information that can be collected.
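Checking a sample against a checklist can be sketched as a simple lookup. The three names below are intended as an illustrative subset of MIxS-style terms only; the checklist name, the helper function and the sample values are hypothetical, and the MIxS specification itself should be consulted for the real lists.

```python
from typing import Dict, Iterable, List

# Illustrative subset only; the full MIxS checklists define many more terms.
CHECKLISTS: Dict[str, List[str]] = {
    "example-core": ["collection_date", "geo_loc_name", "lat_lon"],
}


def missing_attributes(sample: Dict[str, str], checklist: Iterable[str]) -> List[str]:
    """Return the checklist terms that are absent or empty in the sample."""
    return [name for name in checklist if not sample.get(name)]


sample = {"collection_date": "2011-01-01", "geo_loc_name": "", "depth": "10"}
print(missing_attributes(sample, CHECKLISTS["example-core"]))
# prints ['geo_loc_name', 'lat_lon']
```

Attributes outside the checklist (here `depth`) are simply ignored, which matches the idea of a minimum rather than an exhaustive set.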

SRA & PX

• The Sequence Read Archive (SRA) and the ProteomeXchange (PX) also both provide specific terms (attributes) that we can map to.

Other checklists

• We are able to include attributes from any model or standard and link that from the attributes table

• So if there is a recommended standard for a particular data-type we can incorporate it.

Ontologies

• Units
• Taxonomy
• Any that are defined in standards
• Common ones in use:
– DOID - Disease ontology
– EFO - Experimental Factor ontology
– SO - Sequence ontology
– UBERON - cross-species ontology of anatomical structures
– ENVO - ENVironment Ontology
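Ontology terms are commonly written as PREFIX:ID, so the prefix alone tells you which ontology a term belongs to. A minimal sketch using the prefixes listed above; the function name is hypothetical and the term ID in the example is a placeholder, not a real term.

```python
from typing import Tuple

ONTOLOGIES = {
    "DOID": "Disease ontology",
    "EFO": "Experimental Factor ontology",
    "SO": "Sequence ontology",
    "UBERON": "cross-species ontology of anatomical structures",
    "ENVO": "ENVironment Ontology",
}


def identify_term(term_id: str) -> Tuple[str, str]:
    """Split 'PREFIX:ID' and return (ontology name, local ID)."""
    prefix, sep, local_id = term_id.strip().partition(":")
    if not sep or not local_id:
        raise ValueError(f"expected PREFIX:ID, got {term_id!r}")
    try:
        return ONTOLOGIES[prefix.upper()], local_id
    except KeyError:
        raise ValueError(f"unknown ontology prefix: {prefix!r}") from None


print(identify_term("SO:0000000"))  # placeholder ID
# prints ('Sequence ontology', '0000000')
```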

Future developments

• Develop an Application Programming Interface (API)
– Including support for ISA format import and export
• Improve dataset DOI display pages
– Include experiment information
• Improve submission wizard
– Include bulk upload tables
• Add ontology look-up automatically
• Integrate other tools

That’s it for GigaDB.

Thanks for listening!

Any Questions?

Next up, ISA tools.

ISA tools

Christopher I Hunter

International Training Workshop on Big Data

11-Mar-2015

What is ISA?

Investigation

Study (and/or Sample)

Assay

What is ISA-tab?

• ISA-tab is a general purpose, domain agnostic, flexible format to describe multi-omic experiments.

• It can be used as a submission format to some archives and there is a suite of tools for conversion into other common formats.
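An ISA-tab archive is a set of tab-delimited text files, conventionally named with an i_, s_ or a_ prefix for investigation, study and assay files. The sketch below classifies file names by that convention and reads a small study table with Python's csv module; the helper names, file names and table contents are invented for illustration.

```python
import csv
import io
from typing import Dict, List

PREFIX_TO_KIND = {"i_": "investigation", "s_": "study", "a_": "assay"}


def classify_isa_file(filename: str) -> str:
    """Guess the ISA-tab file type from its name prefix."""
    for prefix, kind in PREFIX_TO_KIND.items():
        if filename.startswith(prefix) and filename.endswith(".txt"):
            return kind
    return "other"


def read_table(text: str) -> List[Dict[str, str]]:
    """Read a tab-delimited study or assay table into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Invented study table: two samples derived from one source
STUDY_EXAMPLE = "Source Name\tSample Name\nsource1\tsample1\nsource1\tsample2\n"

rows = read_table(STUDY_EXAMPLE)
print(classify_isa_file("s_study.txt"), [row["Sample Name"] for row in rows])
# prints study ['sample1', 'sample2']
```

Note that `read_table` suits the study and assay files, which are column-oriented tables; the investigation file is organised in labelled sections and would need a different reader.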

What are ISA-tools?

• A suite of tools based on the ISA-tab format
• Developed and maintained by a team at Oxford University, UK.
• The main tool of interest here is the ISA-creator

Live demo

Don’t panic, I have screen shots if it all goes wrong!

The Ontology lookup function

ISA Validation tool

• By default it only checks that ISA-tab files are correctly formed

• Can be configured to check against checklists

ISA converter tool

• The development team actively work on new converter tools

• And are always happy to work with domain experts to make more

The ISA team

• Susanna Sansone
• Philippe Rocca-Serra
• Alejandra Gonzalez-Beltran

http://www.isa-tools.org/

That’s it for ISA.

Thanks for listening!

Any Questions?

Next up, GigaScience software and workflows.