Workshop - finding and accessing data - Cambridge August 22 2016

We are always looking for data

Finding and accessing human genomic data for research

Cambridge, 22nd August 2016

Slides will be made available online

Tweets welcome #CamFindData

Outline of the day

- Data sources and data access (Charlotte)
- Case study: University of Cambridge
- Coffee break
- Introduction to Repositive (Fiona)
- Hands-on session: searching for data
- Round up and closure

On-line tools used during the workshop

To ask questions during the presentation and answer questions:

go to slido.com

enter event code: 1641

To leave feedback on the workshop:

http://tinyurl.com/feedback220816

We are on Twitter: @glyn_dk, @repositiveio, @DNAdigest, @CamOpenData

1. What data are you looking for?

Join at slido.com with the event code #1641

This workshop will focus on finding and accessing human genomic data.

… why would you be looking for genomic data for your research?

How much data do you need to publish a paper?

2001: 1 human genome

2012: 1000 Genomes (1092 genomes, since increased to ~2500)

2015: UK10K & deCODE (>100k individuals); Cancer Genome Atlas (~11,000 genomes); ExAC consortium (65,000 exomes)

?

Case studies

Raquel, PhD Student, London, UK.

Researching genes associated with rare eye disorders.

Problems:
- Doesn’t know where to look for data.
- Doesn’t know if data even exists.

“I gave up on finding the data - it was very time consuming and not proving fruitful – so I started focusing more on generating my own data.”

Mahantesh, Academic Researcher, Taipei, Taiwan.

Studying pharmacogenomics in cardiovascular epidemiology.

Problems:
- Needs lots of data.
- Knows it exists but struggles with getting access to it.

“Often it’s very hard to get the required number of cases and controls to carry out research in public health and epidemiology.”

Jana, Company Biocurator, Zurich, Switzerland.

Biocurating microarray and RNA-Seq data.

Problems:
- Needs lots of data.
- Lots of data out there but hard to filter down to ‘useful/relevant’ data.

“Many repositories don’t list the metadata details I need to know if a dataset is useful to me, I can waste a lot of time searching.”

What can I do?

PRO TIPS:

Involve a statistician early on in your study design!

Include more reference data in your analysis

Search for collaborators who have the data you need

Tell your colleagues and peers what type of data you have in your lab

Use external sources of data….

Large amounts of data, but not accessible

[Chart: WGS data available in public repos, 2006 to 2015, exponential growth rate]

- 80+ PB sequenced every year
- ≈0.5 PB sequence available

Under-utilised data has huge potential for medical research

2. Data resources from around the world

Public repositories

• some you apply for access, especially if data contains clinical info or whole genome PID

• some are open access: GEO, SRA, PGP, OpenSNP, GigaDB, …

• some are consented for general research use, some have specific consent
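For the open access repositories hosted at NCBI (GEO and SRA), a search can also be scripted. The following is a minimal sketch, assuming the public NCBI E-utilities `esearch` endpoint and the Entrez database names `gds` (GEO DataSets) and `sra`. The helper names `build_esearch_url` and `search` are hypothetical, and none of this was part of the workshop material.

```python
import json
import urllib.parse
import urllib.request

# NCBI E-utilities search endpoint (an assumption of this sketch, not workshop material).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=5):
    """Build an esearch URL; db is an Entrez database name such as 'gds' or 'sra'."""
    query = urllib.parse.urlencode(
        {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    )
    return f"{ESEARCH}?{query}"


def search(db, term, retmax=5):
    """Run the search (needs network access) and return (hit count, record IDs)."""
    with urllib.request.urlopen(build_esearch_url(db, term, retmax)) as resp:
        result = json.load(resp)["esearchresult"]
    return int(result["count"]), result["idlist"]
```

A call such as `search("gds", "rare eye disorder AND Homo sapiens[Organism]")` (an illustrative search term) would return the number of matching GEO records and the first few record IDs. Restricted repositories such as dbGaP and EGA still require the application process described later.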

How many data sources?

How many sources of human genomics data do you know about?

Hundreds of data sources… but they aren’t easy to find!

First 30 data sources listed here: http://dx.doi.org/10.1371/journal.pbio.1002418

[Chart: number of data sources over time]

Jan-15   10
Mar-15   25
Jun-15   33
Sep-15   35
Dec-15  102
Mar-16  174
Jun-16  239

DATA is fragmented

Data sources across the globe

Geolocation of 278 data sources analysed.

Found by tracking IP address of the source.

These include:

Public Repositories

Universities

Companies

BioBanks

Research consortiums

It may be confusing

Data source content

Assay Types

Dedicated to…

More information about data sources

… in our recent paper:

http://tinyurl.com/plos-biology-repositive

3. Getting access to Restricted data

Benefits:
• Strict governance
• Individuals are protected
• Review of consent
• Applicant signs for full responsibility for governance

Disadvantages:
• No control of data once access is given
• High barrier for access – too high?

Data accessibility

Can download the data straight away or after logging in.

Need to apply for access to the data.

Has both Open and Restricted access data within one repository.

Access type of 225 sampled data sources.

Often a long process

Bottlenecks:
• Finding relevant and usable data
• Getting authorisation to access data
• Formatting data
• Storing and moving data

We studied the problem with qualitative interviews followed by a survey of researchers in human genetics

T. A. van Schaik et al., The need to redefine genomic data sharing: a focus on data accessibility, Applied & Translational Genomics, 2014, doi:10.1016/j.atg.2014.09.013

Often a long process

Researchers spend months trying to find and access genomic data, and often choose not to access data at all

[Flowchart: dbGaP application process]

Decision points (Yes/No): NIH / eRA Commons login; Organisation registered with eRA; Organisation has DUNS number

Steps: Write research proposal, Submit proposal, Access granted, Find/Download/Decrypt data, Science…

Durations shown on the chart, in order: + 2-3 days, + 1-2 weeks, + 1 week, + 1-2 days, + 1-4 weeks, + 1-2 days

PRO Tip: If you use human genomic data, apply for the GRU datasets in dbGaP: one application gives access to all the GRU datasets.

Blog post: http://blog.repositive.io/how-to-successfully-apply-for-access-to-dbgap/

[Flowchart: EGA application process]

Decision point (Yes/No): Sanger eDAM Account

Steps: Write research proposal, Submit proposal, Access granted, Find/Download/Decrypt data, Science…

Durations shown on the chart, in order: + 1 hour, + 1-2 days, + 2-7 days, + 1-2 days

Blog post: http://blog.repositive.io/how-to-successfully-apply-for-access-to-ega/
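To compare the two application processes, the durations on the dbGaP and EGA charts can be added up. The sketch below makes two assumptions: that every duration on a chart applies one after another (the longest path, for example an applicant with no login yet), and that a week is seven days. Because the charts are flattened in this text, the durations are listed in chart order without being tied to specific steps. The names `DBGAP`, `EGA`, `total_hours` and `describe` are illustrative.

```python
# Durations in hours as (shortest, longest), in the order they appear on each chart.
DAY, WEEK = 24, 7 * 24

DBGAP = [(2 * DAY, 3 * DAY), (1 * WEEK, 2 * WEEK), (1 * WEEK, 1 * WEEK),
         (1 * DAY, 2 * DAY), (1 * WEEK, 4 * WEEK), (1 * DAY, 2 * DAY)]
EGA = [(1, 1), (1 * DAY, 2 * DAY), (2 * DAY, 7 * DAY), (1 * DAY, 2 * DAY)]


def total_hours(steps):
    """Sum the shortest and the longest duration over all steps."""
    return sum(lo for lo, _ in steps), sum(hi for _, hi in steps)


def describe(name, steps):
    """Format the summed range in whole days."""
    lo, hi = total_hours(steps)
    return f"{name}: about {lo / DAY:.0f} to {hi / DAY:.0f} days"


print(describe("dbGaP", DBGAP))  # dbGaP: about 25 to 56 days
print(describe("EGA", EGA))      # EGA: about 4 to 11 days
```

Under these assumptions the full dbGaP path takes roughly one to two months, while the EGA path takes under two weeks. Applicants who already have a login or a registered organisation skip some steps and will be faster.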

• Postdoctoral researcher at University of Cambridge Medical School
• Working on genetic inheritance and cancer
• Using NGS data and bioinformatics
• After searching for data online she decided to apply for:
  • 2 dbGaP datasets
  • 3 EGA datasets

Cambridge specific Case Study

Blog post: pending… will be on http://blog.repositive.io/

The Research Operations Office - will help you with the contracts (DTAs) and signatures.

• Has a designated individual who processes all dbGaP applications, as they all abide by NIH legal restrictions and regulations about how to handle the data once granted access.
• For EGA applications, each DTA must get processed separately because there is no consensus for the ‘contracts’ between each dataset.

The nominated IT director - will be specific to your department.

• They will need to confirm you can support the requirements of the DTA.
• If the head of your departmental IT is not happy to sign – the head of IT for the University will be able to sign it off.

Top Tips: Be prepared…

• Think about your storage space!
• Think about what sort of analysis and processing you are going to do with the data once you do have it. After such a long process, the approval could be too quick!!
• Designate time!
• Understand what you need before you start the application process!
• You only have 1 year!

4. Not all data is restricted

Applying for access to restricted data is a hard and time consuming process.

Think about using open access data!

Make the (research) world a better place by sharing in return

Best practices: Share in return!

• If you expect data to be available to you – you have to make your data available too!
• Encourage collaborations: power by numbers

1. Get credit – publish and make your data available
2. Give credit – cite data sources
3. Understand consent – for all uses of clinical data

Best practices

• Use all available tools to make your life easier:
  • Data publications: visibility and citations for your data, e.g. GigaScience and Scientific Data
  • Figshare, Zenodo, Dryad for sharing open access data
  • PhenomeCentral, Matchmaker Exchange for rare disease research
  • Repositive for finding data across repositories and making your own data discoverable

Best practices: use the tools

• Digital consent: towards automatic processing of applications
• Dynamic consent and power to the patient, e.g. PatientsKnowBest
• Privacy-preserving access to datasets: preserving control and governance with data custodian, lower barrier for access

What the future holds

Workshop: Finding and accessing human genomic data for research

Fiona Nielsen – August 22nd 2016

We are always looking for data

Genetics, cancer, rare disease research

We need access to the right data at the right time

DNA interpretation requires lots of data

Data is not easy to find and access

FRAGMENTED: Poor visibility of available genomic data

ADMIN BURDEN: Huge overhead to manage data access

BAD CULTURE: Lack of data sharing habits in research culture

We are enabling best practices

MAKE DATA DISCOVERABLE

SIMPLIFY WORKFLOWS

CONTRIBUTE TO COMMUNITY

DNAdigest and Repositive – Connecting the world of genomic data
http://www.tinyurl.com/plos-biology-repositive

Connecting the world of genomic data

Live demo http://discover.repositive.io

Team 2 minute presentation

1. Introduction: What data did you try to find and why? Have you tried to search for this data before?

2. Methods: The 5 main steps you took on Repositive to try and find this data.

3. Results: Did you find the data on Repositive? What challenges did you encounter?

4. Conclusion: Sum up your experience in 1 sentence.

Tell us your thoughts: @repositiveio, @glyn_dk

And read more on http://repositive.io

Bugs and feedback to: Charlotte at Repositive.io

Thank you!
