Making Research Data Discoverable and Usable (It’s the metadata, stupid!)
Anita de Waard, VP Research Data Collaborations
[email protected]
http://researchdata.elsevier.com/

Talk at OHSU, September 25, 2013


DESCRIPTION

Presentation on research data management at Oregon Health & Science University (OHSU)


Page 1: Talk at OHSU, September 25, 2013

Making Research Data Discoverable and Usable (It’s the metadata, stupid!)

Anita de Waard, VP Research Data Collaborations

[email protected]

http://researchdata.elsevier.com/

Page 2: Talk at OHSU, September 25, 2013

Research data is the ‘new hotness’…

Roles and needs wrt Research Data:

Gov
Funding bodies
University management
Researchers
Librarians
Data Repositories

§ Share research outputs
§ Demonstrate impact to public
§ Data availability drives growth

§ Demonstrate impact
§ Guarantee permanence, discoverability
§ Avoid fraud

§ Generate, track outputs
§ Comply with mandates
§ Ensure availability

§ Archive, track, curate
§ Support researcher/institution

§ Archive
§ Add curation
§ Allow reuse

§ Derive credit
§ Comply with mandates
§ Discover and use
§ Cite/acknowledge

Todd Vision, DataDryad, OAI8, 6/23/13: “We need to find a way to keep Dryad funded, and would love to hear your ideas about doing that.”

Phil Bourne, Associate Vice Chancellor, UCSD, 4/13: “We are thinking about the university as a digital enterprise.”

Mike Huerta, Ass. Director NLM O of Health Info at NIH, 6/13: “Today, the major public product of science are concepts, written down in papers. But tomorrow, data will be the main product of science…. We will require scientists to track and share their data at least as well, if not better, than they are sharing their ideas today.”

Mara Saule, Dean University Libraries/CIO, UVM, 5/13: “We need to do something about data.”

Nathan Urban, PI Urban Lab, CMU, 3/13: “If we can share our data, we can write a paper that will knock everybody’s socks off!”

Barbara Ransom, NSF Program Director Earth Sciences, 2/13: “We’re not going to spend any more money for you to go out and get more data! We want you first to show us how you’re going to use all the data we paid y’all to collect in the past!”

Page 3: Talk at OHSU, September 25, 2013

Research data management today:

Using antibodies and squishy bits, Grad Students experiment and enter details into their lab notebook.
The PI then tries to make sense of their slides, and writes a paper.
End of story.

Page 4: Talk at OHSU, September 25, 2013

Research today (in biology) is often quite insular:

[Diagram: the cycle Prepare, Observe, Analyze, Ponder, Communicate, shown twice]

Page 5: Talk at OHSU, September 25, 2013

But life is VERY complicated:

http://en.wikipedia.org/wiki/File:Duck_of_Vaucanson.jpg

• Interspecies variability: A specimen is not a species
• Gene expression variability: Knowing genes is not knowing how they are expressed
• Microbiome: An animal is an ecosystem
• Systems biology: A whole is more than the sum of its parts

Reductionist science does not work for living systems!

Page 6: Talk at OHSU, September 25, 2013

What if the data were connected?

[Diagram: two Prepare, Analyze, Communicate cycles linked through shared Observations]

Across labs, experiments: track reagents and how they are used

Page 7: Talk at OHSU, September 25, 2013

What if the data were connected?

[Diagram: two Prepare, Analyze, Communicate cycles linked through shared Observations]

Compare outcome of interactions with these entities

Page 8: Talk at OHSU, September 25, 2013

What if the data were connected?

[Diagram: two Prepare, Analyze, Communicate cycles linked through shared Observations, plus a Think step]

Build a ‘virtual reagent spectrogram’ by comparing how different entities interacted in different experiments
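As a rough illustration of what "connected" data could mean here, the sketch below groups observations from different labs by the reagent they used, so that outcomes can be compared side by side. Everything in it is an assumption made for illustration: the record fields, the lab and reagent names, and the outcomes are invented, and the slides do not specify any data format.

```python
from collections import defaultdict

# Hypothetical observation records; labs, reagents and outcomes are invented.
observations = [
    {"lab": "Lab A", "experiment": "exp-1", "reagent": "antibody-X", "outcome": "strong signal"},
    {"lab": "Lab B", "experiment": "exp-7", "reagent": "antibody-X", "outcome": "no signal"},
    {"lab": "Lab B", "experiment": "exp-8", "reagent": "cell-line-Y", "outcome": "viable"},
]


def group_by_reagent(records):
    """Collect (lab, experiment, outcome) tuples under each shared reagent."""
    by_reagent = defaultdict(list)
    for rec in records:
        by_reagent[rec["reagent"]].append(
            (rec["lab"], rec["experiment"], rec["outcome"])
        )
    return dict(by_reagent)


profile = group_by_reagent(observations)
for reagent, uses in profile.items():
    print(reagent, uses)
```

The grouping only works if every lab records the same identifier for the same reagent, which is the metadata problem the rest of the talk is about.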

Page 9: Talk at OHSU, September 25, 2013

Where research data goes now:

> 50 My Papers
2 M scientists
2 My papers/year

Majority of data (90%?) is stored on local hard drives

Dryad: 7,631 files
Dataverse: 0.6 My
Institutional Repositories

Some data (8%?) stored in large, generic data repositories

MiRB: 25 k
PetDB: 1.5 k
TAIR: 72.1 k
PDB: 88.3 k
SedDB: 0.6 k

A small portion of data (1-2%?) stored in small, topic-focused data repositories

1. How do we get researchers to curate, store and share their data?
2. How do we ensure long-term sustainability for high-end repositories?
3. What role do libraries/institutions play?

Page 10: Talk at OHSU, September 25, 2013

1.1. An attempt to get researchers to curate (but only partially share!) their data:

de Waard, A., Burton, S. et al., 2013

Page 11: Talk at OHSU, September 25, 2013

• In 220 publications only 40% of antibodies, 40% of cell lines and 25% of constructs can be manually identified (Vasilevsky et al, submitted)

• Proposal (with NIH/NIF and Force11 Group):
– Adding minimal data standards
– Tool extracts likely reagents / resources
– User interface asks author to confirm or select
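The extract-then-confirm flow in the proposal above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the tool the proposal describes: the regular expression, the `Candidate` class, the function names, and the sample text with its vendors and catalog numbers are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern: "<Vendor>, cat# <ID>" or "<Vendor>, catalog no. <ID>".
# Real methods sections are far less regular than this.
CATALOG_PATTERN = re.compile(
    r"(?P<vendor>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*), "
    r"cat(?:alog)?\.? ?(?:#|no\.?) ?"
    r"(?P<catalog>[A-Za-z0-9-]+)"
)


@dataclass
class Candidate:
    vendor: str
    catalog: str
    confirmed: bool = False


def extract_candidates(methods_text):
    """Return likely reagent mentions found in a methods paragraph."""
    return [
        Candidate(m.group("vendor"), m.group("catalog"))
        for m in CATALOG_PATTERN.finditer(methods_text)
    ]


def confirm(candidates, accepted_catalogs):
    """Mark the candidates the author accepted; the rest stay unconfirmed."""
    for c in candidates:
        c.confirmed = c.catalog in accepted_catalogs
    return [c for c in candidates if c.confirmed]


# Made-up example text; vendors and catalog numbers are placeholders.
text = (
    "Slices were stained with a primary antibody (Acme Bio, cat# AB-1001) "
    "and a secondary antibody (Example Labs, catalog no. EX-2002)."
)
candidates = extract_candidates(text)
accepted = confirm(candidates, {"AB-1001"})
```

Keeping the author in the loop means the extractor can afford to over-suggest: a wrong candidate costs one click to reject, while a missed reagent stays unidentifiable.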

1.2. What to do in the meantime?

[Figure labels: 49 publications, 193 publications, 76 publications, 214 publications, 210 publications]

Page 12: Talk at OHSU, September 25, 2013

Pilot project with IEDA:
– Build a database for lunar geochemistry
– Write joint report on building repository, curation, costs and challenges

2.1. How can research databases become long-term sustainable?

Page 13: Talk at OHSU, September 25, 2013

2.2. Cost recovery questionnaire:

With WDS/RDA WG:
• Planning survey of cost recovery models for research databases
• Input/inspiration: ICPSR Sloan-funded project ‘Sustaining Domain Repositories for Digital Data’
• Developing overarching funding model:

Page 14: Talk at OHSU, September 25, 2013

2.3. A first stab at a model:

[Diagram labels: Data producer or sponsor; Data user; Flow of funds (twice); Access: Closed, Limited, Limited Academic Use/Limited, Public; Private store; Conclave; Service Collaboration; Subscription content; Commercial overlay; Data publication; "&"; "or"]

Examples:
ICPSR, CERN-LHC
KEGG, GeoFacets, Reaxys
Many small operations, e.g. try-db.org, plhdb.org
Dryad, arXiv, PDB
Commercial and institutional storage

DRAFT - CC-BY-NC 2013, Todd Vision & Anita de Waard

Page 15: Talk at OHSU, September 25, 2013

3.1. Where do institutional repositories fit in?

Repository | Advantages | Disadvantages
Local data repository | Easy! No one steals your data. | No one sees it. Not compliant with requirements.
Generic data repository | Not very hard to do. Have complied! | Data can’t be easily reused. Credit?
Institutional Repository | Can use existing IR? Tracking and compliance checks. | Data can’t easily be reused. Credit?
Domain-specific data repository | Data can be reused. Credit! | Lot of work for curators. Long-term sustainable?

[Arrow labels alongside the table: "Effort, Reuse, Credit, Compliance"; "Habit, Ease, Privacy, Control"; "Higher quality metadata"]

Page 16: Talk at OHSU, September 25, 2013

3.2. The poor researcher:

Funding Agency: Generic Data Repository
University: Institutional Data Repository
Collaborators: Local Data Repository
Domain of study: Domain-Specific Data Repository

AND THEY ALL WANT DIFFERENT METADATA!!!!
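One common way to ease this burden is a metadata crosswalk: the researcher describes the dataset once, and each recipient's field names are derived from that single record. The sketch below is only an illustration under assumptions; the target names, the field names on both sides, and the sample record are hypothetical and do not come from any real repository schema.

```python
# One record the researcher fills in once (all values are placeholders).
record = {
    "title": "Example dataset",
    "creator": "A. Researcher",
    "award_id": "GRANT-0000",
    "keywords": ["placeholder"],
}

# Hypothetical crosswalks: source field name -> field name each target expects.
CROSSWALKS = {
    "funding_agency": {"title": "project_title", "award_id": "award_number"},
    "institutional_repository": {"title": "item_title", "creator": "author"},
    "domain_repository": {
        "title": "dataset_name",
        "creator": "investigator",
        "keywords": "subject_terms",
    },
}


def export_for(target, source_record):
    """Rename the fields a target asks for; skip fields the record lacks."""
    mapping = CROSSWALKS[target]
    return {
        target_field: source_record[source_field]
        for source_field, target_field in mapping.items()
        if source_field in source_record
    }


exports = {target: export_for(target, record) for target in CROSSWALKS}
```

A crosswalk only renames what is already captured. It does not help when one recipient needs information, such as domain-specific detail, that the researcher never recorded in the first place.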

Page 17: Talk at OHSU, September 25, 2013

3.3. Possible pilot project:

[Diagram labels: Domain repository (twice); IR; Data; Metadata; "Metadata: What data was stored/viewed" (twice)]

• Interview institutions
• Normalize reporting data
• Talking to
– IQSS, Harvard
– ICPSR, U Mich
– DataDryad, UNC
– Pangaea, Germany

Page 18: Talk at OHSU, September 25, 2013

3.4. Institutional Pilot study:

• Planning series of interviews at key institutions:
– What role do libraries/institutions play wrt research data management?
– What tools/metadata standards are used?
– What aspects of data deposition is the Research Office/IR/Institution interested in?
– How does this compare with what scientists want and do in their labs?

• Outcomes:
– Share knowledge (within institution);
– Write joint report (anonymised)
– Establish joint plan of action

Page 19: Talk at OHSU, September 25, 2013

Elsevier Research Data Services:

• 2013/2013: Series of pilots, reviews, and reports:
– With CMU: Data/metadata entry and sharing
– With IEDA: Repository creation: feasibility study & report
– With RDA: Cost of Data Repositories questionnaire
– With series of institutes: Interviews re. role of institution

• Main questions:
– What are key needs?
– Can we play a role: skillsets, partnerships?
– Is there a (transparent) business model for this?

• Principles:
– Collaboration is tailored to partner’s needs, using local resources;
– Collaboration plan is MoU/Service-Level Agreement;
– At all times, all data, reports and software are open and shared.

Page 20: Talk at OHSU, September 25, 2013

In summary:

1. If researchers start to curate and share their data…
2. And research databases become long-term sustainable…
3. And libraries, data repositories and grid infrastructures start to work together…

We might enable a knowledge infrastructure that allows us to jointly tackle the questions of life!

Page 21: Talk at OHSU, September 25, 2013

Many questions remain:

? What carrots and sticks will make researchers share their data?
? How do we create interoperable metadata layers?
? What role would the institution/library play?
? What are sustainable models, moving forward?
? Is there a place for publishers, in all this?

Page 22: Talk at OHSU, September 25, 2013

Thank you! Collaborations and discussions gratefully acknowledged:

• CMU: Nathan Urban, Shreejoy Tripathy, Shawn Burton, Ed Hovy
• UCSD: Phil Bourne, Brian Shoettlander, David Minor, Declan Fleming, Ilya Zaslavsky
• NIF: Maryann Martone, Anita Bandrowski
• MSU: Brian Bothner
• OHSU: Melissa Haendel, Nicole Vasilevsky
• California Digital Library: Carly Strasser, John Kunze, Stephen Abrams
• Columbia/IEDA: Kerstin Lehnert, Leslie Hsu
• ICPSR: George Altman, Mary Vardigan
• CNI: Clifford Lynch
• Harvard: Michael Kurtz, Chris Erdmann
• MIT: Micah Altman
• UVM: Mara Saule
• RDA: Simon Hodson, Michael Diepenbroek

 

Page 23: Talk at OHSU, September 25, 2013

Your questions?

Anita de Waard, VP Research Data Collaborations, Elsevier Research Data Services (VT)

[email protected]

http://researchdata.elsevier.com/