Maryann E. Martone, Ph.D., University of California, San Diego. INCF Neuroinformatics Short Course, Stockholm, August 2013

Databases and Ontologies: Where do we go from here?

International Neuroinformatics Short Course on Neuroinformatics 2013; introduction to databases, ontologies and the Neuroscience Information Framework

Page 1: Databases and Ontologies:  Where do we go from here?

Maryann E. Martone, Ph.D., University of California, San Diego

INCF Neuroinformatics Short Course, Stockholm, August 2013

Page 2: Databases and Ontologies:  Where do we go from here?

• Introduction
• Introduction to the Neuroscience Information Framework
• Structured information: data, databases
• Federating neuroscience-relevant databases
• Information frameworks
• Ontologies
• What can we do with information in the NIF?
• Conclusions

Page 3: Databases and Ontologies:  Where do we go from here?

[Diagram labels: Scholar; Library; Publisher]

FORCE11.org: Future of Research Communications and e-Scholarship

Page 4: Databases and Ontologies:  Where do we go from here?

[Diagram labels: Scholar; Consumer; Libraries; Data Repositories; Code Repositories; Community databases/platforms; OA; Curators; Social Networks; Peer Reviewers]

[Diagram labels, research outputs: Narrative; Workflows; Data; Models; Multimedia; Nanopublications; Code]

Page 5: Databases and Ontologies:  Where do we go from here?

http://neuinfo.org

Page 6: Databases and Ontologies:  Where do we go from here?

• NIF’s mission is to maximize the awareness of, access to, and utility of research resources produced worldwide to enable better science and promote efficient use
  – NIF unites neuroscience information without respect to domain, funding agency, institute or community
  – NIF is like a “PubMed” for all biomedical resources and a “PubMed Central” for databases
  – Makes them searchable from a single interface
  – Practical and cost-effective; tries to be sensible
  – Learned a lot about effective data sharing

The Neuroscience Information Framework is an initiative of the NIH Blueprint consortium of institutes   http://neuinfo.org

Page 7: Databases and Ontologies:  Where do we go from here?

We’d like to be able to find:
• What is known:
  – What are the projections of hippocampus?
  – Is GRM1 expressed in cerebral cortex?
  – What genes have been found to be upregulated in chronic drug abuse in adults?
  – What animal models have similar phenotypes to Parkinson’s disease?
  – What studies used my polyclonal antibody against GABA in humans?
• What is not known:
  – Connections among data
  – Gaps in knowledge

A framework makes it easier to address these questions

Page 8: Databases and Ontologies:  Where do we go from here?
Page 9: Databases and Ontologies:  Where do we go from here?

Neuroscience is unlikely to be served by a few large databases, as the genomics and proteomics communities are

• Whole brain data (20 µm microscopic MRI)
• Mosaic LM images (1 GB+)
• Conventional LM images
• Individual cell morphologies
• EM volumes & reconstructions
• Solved molecular structures

No single technology serves these all equally well. Multiple data types; multiple scales; multiple databases

Page 10: Databases and Ontologies:  Where do we go from here?

• Data warehouse: may contain data from diverse sources; schemas are integrated. Data are “cleaned” to fit a unified data model. One database to rule them all...

• Data federation: a virtual database that stores data definitions and not the data itself. The virtual database will have information about the location of the data. When a single call is made to a virtual database, the technology ensures multiple calls to underlying databases and is also responsible for meaningfully aggregating the returned result sets.

From Wikipedia and http://www.infosysblogs.com/oracle/2010/01/data_federation_a_potent_subst_1.html
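The federation idea above can be illustrated with a minimal Python sketch. It is not NIF's implementation: the two sources, their names (`anatomy_db`, `expression_db`) and their rows are made up for illustration, and each "database" is just an in-memory list. What it shows is the pattern the definition describes: one call to the virtual layer fans out into one call per underlying source, and the result sets are merged with their origin recorded.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Source = Callable[[str], List[Record]]


def make_source(name: str, rows: List[Record]) -> Source:
    """Wrap in-memory rows so they behave like one remote, searchable database."""
    def search(term: str) -> List[Record]:
        needle = term.lower()
        return [
            dict(row, source=name)  # tag each hit with where it came from
            for row in rows
            if any(needle in value.lower() for value in row.values())
        ]
    return search


def federated_search(term: str, sources: List[Source]) -> List[Record]:
    """One call in, one call per underlying source out, results merged."""
    merged: List[Record] = []
    for search in sources:
        merged.extend(search(term))
    return merged


# Hypothetical sources with made-up rows, for illustration only.
anatomy_db = make_source("anatomy_db", [{"structure": "Hippocampus"}])
expression_db = make_source(
    "expression_db",
    [
        {"gene": "GRM1", "region": "hippocampus"},
        {"gene": "GRM1", "region": "cerebral cortex"},
    ],
)

hits = federated_search("hippocampus", [anatomy_db, expression_db])
```

The data stay in the sources; the federation layer only knows how to reach them and how to combine what comes back.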

Page 11: Databases and Ontologies:  Where do we go from here?

Subject 473
• Species: mouse (string)
• Age: 50 days (integer)
• Age category: adult
• Protocol: 2

Relational database: data model; data types; formal query language

“Mice (aged 50 days) were perfused with 4% paraformaldehyde and brains were sectioned at a thickness of 50 µm. Sections were labeled using antibodies against calbindin and imaged on a Zeiss confocal microscope.”

Free text: entity recognition; natural language processing
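To make the contrast concrete, here is a small Python sketch. The table layout and column names are assumptions chosen to mirror the Subject 473 record, and the regular expression is a deliberately naive stand-in for real entity recognition. The structured record can be queried with a formal language (SQL), while the same fact has to be dug out of the sentence by pattern matching.

```python
import re
import sqlite3

# Structured side: typed columns and a formal query language.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subject ("
    "id INTEGER PRIMARY KEY, species TEXT, age_days INTEGER, "
    "age_category TEXT, protocol INTEGER)"
)
conn.execute("INSERT INTO subject VALUES (473, 'mouse', 50, 'adult', 2)")
mouse_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM subject WHERE species = ? AND age_days >= ?", ("mouse", 50)
    )
]

# Free-text side: the age must be recognized inside a sentence.
methods_text = (
    "Mice (aged 50 days) were perfused with 4% paraformaldehyde and "
    "brains were sectioned at a thickness of 50 um."
)
match = re.search(r"aged\s+(\d+)\s+days", methods_text)
age_from_text = int(match.group(1)) if match else None
```

The pattern only works for this exact phrasing; a sentence such as "seven-week-old mice" would need a different rule, which is why free text calls for entity recognition and natural language processing rather than a query.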

Page 12: Databases and Ontologies:  Where do we go from here?

∞  

What is easily machine processable and accessible

What is potentially knowable

What is known: literature, images, human knowledge

Unstructured; natural language processing, entity recognition, image processing and analysis; paywalls; communication

Abstracts vs full text vs tables etc.

Page 13: Databases and Ontologies:  Where do we go from here?


• A portal for finding and using neuroscience resources
• A consistent framework for describing resources
• Provides simultaneous search of multiple types of information, organized by category
• Supported by an expansive ontology for neuroscience
• Utilizes advanced technologies to search the “hidden web”

UCSD, Yale, Cal Tech, George Mason, Washington Univ

Literature

Database Federation

Registry  

Page 14: Databases and Ontologies:  Where do we go from here?
Page 15: Databases and Ontologies:  Where do we go from here?

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice

Page 16: Databases and Ontologies:  Where do we go from here?

• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines

NIF Registry
• Requires no special skills
• Site map available for local hosting

NIF Data Federation
• DISCO interop
• Requires some programming skill
• Open Source Brain < 2 hr

Low barrier to entry; incremental refinement

Page 17: Databases and Ontologies:  Where do we go from here?

NIF  was  designed  to  be  populated  rapidly  with  progressive  refinement  

Page 18: Databases and Ontologies:  Where do we go from here?

Databases  come  in  many  shapes  and  sizes  

• Primary data:
  – Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data:
  – Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD); gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data:
  – Claims and assertions about the meaning of data, e.g., gene upregulation/downregulation, brain activation as a function of task
• Registries:
  – Metadata
  – Pointers to data sets or materials stored elsewhere
• Data aggregators:
  – Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source:
  – Data acquired within a single context, e.g., Allen Brain Atlas

Researchers are producing a variety of information artifacts using a multitude of technologies

Page 19: Databases and Ontologies:  Where do we go from here?

• Data: values of qualitative or quantitative variables, belonging to a set of items... often the results of measurements (Wikipedia)
• Metadata: “data about data”
  • Structural metadata: the design and specification of data structures; more properly called “data about the containers of data” (Wikipedia)
    • e.g., image size, bit depth, integer vs string
  • Descriptive metadata: individual instances of application data, the data content; “data about data content”
    • e.g., creator, subject
• Data type: the form of the data for the purposes of data operations
• Data integration: combining data residing in different sources and providing users with a unified view of these data

“Metadata are data” - Wikipedia

Page 20: Databases and Ontologies:  Where do we go from here?

[Chart: number of federated databases and number of federated records (millions) over time; axis ticks run from 0 to 250 and, on a logarithmic scale, from 0.01 to 1000; time axis from 6-12 to 5-17]

NIF searches the largest collation of neuroscience-relevant data on the web

DISCO

Page 21: Databases and Ontologies:  Where do we go from here?

•  Long  tail  data:    large  numbers  of  small  data  sets  

http://en.wikipedia.org/wiki/Long_tail

Page 22: Databases and Ontologies:  Where do we go from here?

Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”

• Query expansion: synonyms and related concepts
• Boolean queries
• Data sources categorized by “data type” and level of nervous system
• Common views across multiple sources
• Tutorials for using the full resource when getting there from NIF
• Link back to record in original source
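The synonym expansion shown on this slide can be sketched in a few lines of Python. This is an illustration only, not NIF's query engine: the `SYNONYMS` table is a hand-written stand-in for an ontology lookup, holding just the hippocampus synonyms quoted above, and `expand_query` is a hypothetical helper name.

```python
# Stand-in for an ontology lookup: preferred label -> synonyms from the slide.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def expand_query(term: str) -> str:
    """Turn one search term into a Boolean OR over the term and its synonyms."""
    variants = [term] + SYNONYMS.get(term.lower(), [])
    # Multi-word variants are quoted so they are searched as phrases.
    quoted = [f'"{v}"' if " " in v else v for v in variants]
    return " OR ".join(quoted)


expanded = expand_query("Hippocampus")
```

A term with no known synonyms passes through unchanged, so the expansion never narrows what the user asked for.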

Page 23: Databases and Ontologies:  Where do we go from here?

Connects  to  

Synapsed  with  

Synapsed  by  

Input  region  

innervates  

Axon innervates

Projects to

Cellular contact

Subcellular contact

Source site

Target site

Each resource implements a different, though related, model; in many cases, systems are complex and difficult to learn

Page 24: Databases and Ontologies:  Where do we go from here?
Page 25: Databases and Ontologies:  Where do we go from here?

• Current web is designed to share documents
  – Documents are unstructured data
• Much of the content of digital resources is part of the “hidden web”
• Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.

Page 26: Databases and Ontologies:  Where do we go from here?

Even  Google  needs  a  knowledge  framework  

Page 27: Databases and Ontologies:  Where do we go from here?

Knowledge in space and spatial relationships (the “where”)

Knowledge in words, terminologies and logical relationships (the “what”)

Page 28: Databases and Ontologies:  Where do we go from here?

Purkinje  Cell  

Axon  Terminal  

Axon

Dendritic Tree

Dendritic Spine

Dendrite

Cell body

Cerebellar cortex

There is little obvious connection between data sets taken at different scales using different microscopies without an explicit representation of the biological objects that the data represent

Page 29: Databases and Ontologies:  Where do we go from here?

• NIF covers multiple structural scales and domains of relevance to neuroscience
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology

NIFSTD

[Diagram labels: Organism; NS Function; Molecule; Investigation; Subcellular structure; Macromolecule; Gene; Molecule Descriptors; Techniques; Reagent; Protocols; Cell; Resource; Instrument; Dysfunction; Quality; Anatomical Structure]

Page 30: Databases and Ontologies:  Where do we go from here?

[Diagram: Brain has a Cerebellum, which has a Purkinje Cell Layer, which has a Purkinje cell; a Purkinje cell is a neuron]

• Ontology: an explicit, formal representation of concepts and the relationships among them within a particular domain that expresses human knowledge in a machine readable form
• Branch of philosophy: a theory of what is
• e.g., Gene ontologies

Page 31: Databases and Ontologies:  Where do we go from here?

• Express neuroscience concepts in a way that is machine readable
  – Synonyms, lexical variants
  – Definitions
• Provide means of disambiguation of strings
  – Nucleus part of cell; nucleus part of brain; nucleus part of atom
• Rules by which a class is defined, e.g., a GABAergic neuron is a neuron that releases GABA as a neurotransmitter
• Properties
  – Support reasoning
• Provide universals for navigating across different data sources
  – Semantic “index”
  – Link data through relationships, not just one-to-one mappings
• Provide the basis for concept-based queries to probe and mine data
• Establish a semantic framework for landscape analysis

Mathematics, computer code or Esperanto
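As a rough illustration of "rules by which a class is defined", the GABAergic neuron rule can be written as a membership test. This is a plain Python sketch, not OWL and not NIFSTD's encoding: the record layout (`is_a`, `neurotransmitters`) is assumed, and apart from the Purkinje neuron entry, which reflects a statement made elsewhere in this deck, the records are placeholders.

```python
def is_gabaergic_neuron(entity: dict) -> bool:
    """Rule from the slide: a neuron that releases GABA as a neurotransmitter."""
    return (
        entity.get("is_a") == "neuron"
        and "GABA" in entity.get("neurotransmitters", ())
    )


# Illustrative records; "hypothetical neuron A" is a placeholder, not real data.
entities = {
    "Purkinje neuron": {"is_a": "neuron", "neurotransmitters": ("GABA",)},
    "hypothetical neuron A": {"is_a": "neuron", "neurotransmitters": ("glutamate",)},
    "GABA": {"is_a": "molecule"},
}

gabaergic = sorted(
    name for name, record in entities.items() if is_gabaergic_neuron(record)
)
```

Nobody has to tag the Purkinje neuron as "GABAergic" by hand; membership follows from its properties, which is what lets a reasoner classify new entries automatically.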

Page 32: Databases and Ontologies:  Where do we go from here?


Aligns sources to the NIF semantic framework

Page 33: Databases and Ontologies:  Where do we go from here?
Page 34: Databases and Ontologies:  Where do we go from here?

birnlex_1741   Brodmann.10  

Explicit mapping of database content helps disambiguate non-unique and custom terminology

Page 35: Databases and Ontologies:  Where do we go from here?

birnlex_1204   CA3  

Page 36: Databases and Ontologies:  Where do we go from here?

•  Search  Google:    GABAergic  neuron  

•  Search  NIF:    GABAergic  neuron  

– NIF automatically searches for types of GABAergic neurons

Types  of  GABAergic  neurons  

Neuroscience Information Framework – http://neuinfo.org
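One way to picture "NIF automatically searches for types of GABAergic neurons" is a walk down a subclass hierarchy. The sketch below is an assumption about the general technique, not NIF's code: the `SUBCLASS_OF` table is hand-written, and the entries marked hypothetical are placeholders rather than real ontology classes.

```python
# child class -> parent class; a tiny hand-written stand-in for an ontology.
SUBCLASS_OF = {
    "GABAergic neuron": "neuron",
    "Purkinje cell": "GABAergic neuron",
    "hypothetical interneuron type": "GABAergic neuron",       # placeholder
    "hypothetical interneuron subtype": "hypothetical interneuron type",  # placeholder
}


def descendants(term: str) -> list:
    """Collect every class below `term`, depth first."""
    found = []
    for child, parent in SUBCLASS_OF.items():
        if parent == term:
            found.append(child)
            found.extend(descendants(child))
    return found


def expand_to_subclasses(term: str) -> list:
    """Search terms for a concept: the concept itself plus all of its subtypes."""
    return [term] + descendants(term)


search_terms = expand_to_subclasses("GABAergic neuron")
```

A plain string search for "GABAergic neuron" would miss records that only mention "Purkinje cell"; the expanded term list picks them up.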

Page 37: Databases and Ontologies:  Where do we go from here?

Equivalence classes; restrictions

Arbitrary but defensible

• Neurons classified by
  • Circuit role: principal neuron vs interneuron
  • Molecular constituent: parvalbumin neurons, calbindin neurons
  • Brain region: cerebellar neuron
  • Morphology: spiny neuron
• Molecule roles: drug of abuse, anterograde tracer, retrograde tracer
• Brain parts: circumventricular organ
• Organisms: non-human primate, non-human vertebrate
• Qualities: expression level
• Techniques: neuroimaging

Page 38: Databases and Ontologies:  Where do we go from here?

What  genes  are  upregulated  by  drugs  of  abuse  in  the  adult  mouse?  (show  me  the  data!)  

Morphine  Increased  expression  

Adult  Mouse  

Page 39: Databases and Ontologies:  Where do we go from here?

• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (human fMRI)
  • Avian Brain Connectivity Database (bird)
• Total: 1800 unique brain terms (excluding Avian)
  • Number of exact terms used in > 1 database: 42
  • Number of synonym matches: 99
  • Number of 1st order partonomy matches: 385

Page 40: Databases and Ontologies:  Where do we go from here?

• Realism vs conceptualism
• Controlled vocabularies vs taxonomies vs ontology?
• How do I name classes?
• Shared vs custom ontologies
• Single vs multiple inheritance
• RDF vs OWL?
• Top down vs bottom up: heavy weight vs light weight ontologies
• Should I encode everything in my ontology?

Many schools of thought about ontologies: their construction and use

Page 41: Databases and Ontologies:  Where do we go from here?

• Controlled vocabularies: prescribed list of terms or headings, each one having an assigned meaning
• Lexicon/Thesaurus: vocabularies + their lexical properties, e.g., synonyms, lexical variants
• Taxonomy: monohierarchical classification of concepts, as used, for example, in the classification of biological organisms, built on the “is a” relationship
• Ontology: specification of the concepts of a domain and their relationships, structured to allow computer processing and reasoning

http://www.willpowerinfo.co.uk/glossary.htm   Mike Bergman

Page 42: Databases and Ontologies:  Where do we go from here?

• Identity:
  – Entities are uniquely identifiable
  – Name is a meaningless numerical identifier (URI: Uniform Resource Identifier)
  – Any number of human readable labels can be assigned to it
• Definition:
  – Genera: is a type of (cell, anatomical structure, cell part)
  – Differentia: “has a”; a set of properties that distinguish among members of that class
  – Can include necessary and sufficient conditions
• Implementation: how is this definition expressed
  – Depending on the nature of the concept or entity and the needs of the information system, we can say more or fewer things
  – Different languages can express different things about the concept that can be computed upon
    • OWL (W3C standard), RDF

birnlex_1362   CA2

CHEBI_29108   CA2

NIF follows OBO Foundry best practices for naming and defining classes
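The two CA2 entries show why identity is carried by the identifier rather than the label. A minimal Python sketch, using only identifier and label pairs that appear in this deck (the helper names are hypothetical, and reading the prefix as the source vocabulary is an assumption made for illustration):

```python
# identifier -> human readable label, as listed on the slides.
LABELS = {
    "birnlex_1362": "CA2",
    "CHEBI_29108": "CA2",
    "birnlex_1204": "CA3",
    "birnlex_1741": "Brodmann.10",
}


def ids_for_label(label: str) -> list:
    """All identifiers that share a label; more than one means the string is ambiguous."""
    return sorted(ident for ident, text in LABELS.items() if text == label)


def vocabulary(identifier: str) -> str:
    """Assumed convention: the prefix before the underscore names the source vocabulary."""
    return identifier.split("_", 1)[0]


ca2_ids = ids_for_label("CA2")
ca2_vocabularies = [vocabulary(ident) for ident in ca2_ids]
```

The string "CA2" alone cannot tell the two entities apart, but a record annotated with `birnlex_1362` or `CHEBI_29108` is unambiguous.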

Page 43: Databases and Ontologies:  Where do we go from here?

• XML: Extensible Markup Language: markup language for data. XML itself is not very much concerned with meaning. XML nodes don't need to be associated with particular concepts, and the XML standard doesn't indicate how to derive a fact from a document.
• RDF: Resource Description Framework: a general method to decompose knowledge into small pieces, with some rules about the semantics, or meaning, of those pieces. What sets RDF apart from XML is that RDF is designed to represent knowledge in a distributed world. That RDF is designed for knowledge, and not data, means RDF is particularly concerned with meaning.
  – Small pieces are called “triples”: subject, predicate, object
  – Purkinje neuron (S) has neurotransmitter (P) GABA (O)
• RDFS: a method of specifying metadata about properties/characteristics of things and classes of things such that inference can be carried out (conceptualized in RDF)
• OWL (Web Ontology Language): a more complex (and more powerful) extension of RDFS
• SPARQL: a query language designed for RDF (similar to how SQL was designed for relational databases)

http://answers.semanticweb.com/questions/15215/whats-the-difference-between-using-rdfsowl-versus-xml   http://www.rdfabout.com/intro/#Introducing%20RDF
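The triple idea can be tried out without any RDF tooling. The sketch below stores triples as plain Python tuples and matches them against a pattern in which `None` plays the role of a variable, loosely in the spirit of a SPARQL triple pattern. It is an illustration under those assumptions, not real RDF (no URIs, no serialization), and the three statements are taken from this deck.

```python
# (subject, predicate, object) statements taken from the slides.
triples = {
    ("Purkinje neuron", "has neurotransmitter", "GABA"),
    ("Purkinje cell", "is a", "neuron"),
    ("mouse", "is a", "rodent"),
}


def match(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None means 'any value here'."""
    return sorted(
        t
        for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


releases_gaba = match(predicate="has neurotransmitter", obj="GABA")
all_is_a = match(predicate="is a")
```

The matcher never needs to know what "has neurotransmitter" means; it only compares strings, which is the point the RDF introduction quoted in this deck makes about RDF tools.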

Page 44: Databases and Ontologies:  Where do we go from here?

Relational model
• Mouse has age 50 days
• Protocol uses instrument confocal microscope
• A confocal imaging protocol is a protocol that uses instrument confocal microscope

RDF: “The computer doesn't need to know what has actually means in English for this to be useful. It is left up to the application writer to choose appropriate names for things (confocal microscope) and to use the right predicates (uses, has). RDF tools are ignorant of what these names mean, but they can still usefully process the information.” - http://www.rdfabout.com/intro/#Introducing%20RDF

May link to other information, e.g., mouse is a rodent

Page 45: Databases and Ontologies:  Where do we go from here?

The thalamus projects to the cortex in mammals
• Universal: allValuesFrom: if a mammal has a cortex and a thalamus, then the thalamus must project to the cortex
• Existential: someValuesFrom: the thalamus projects to the cortex in at least one member of the class mammal
• Disjointness: owl:disjointWith: a member of one class cannot simultaneously be an instance of a specified other class: reptiles are disjoint from mammals

W3C OWL guide: www.w3.org/TR/2004/REC-owl-guide-20040210/

Restrictions placed on classes allow us to reason over the ontology and check for consistency
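A consistency check based on disjointness can be sketched as follows. This is plain Python standing in for what an OWL reasoner does, under the assumption that class membership is given as a simple mapping; the individuals `animal_1` and `animal_2` are hypothetical.

```python
# Pairs of classes declared disjoint, as in the slide's example.
DISJOINT = [frozenset({"Mammal", "Reptile"})]


def inconsistencies(memberships: dict) -> list:
    """Report individuals asserted to belong to two disjoint classes at once."""
    problems = []
    for individual in sorted(memberships):
        classes = memberships[individual]
        for pair in DISJOINT:
            if pair <= classes:  # both disjoint classes are asserted
                problems.append((individual, tuple(sorted(pair))))
    return problems


# Hypothetical assertions: animal_2 has been curated inconsistently.
assertions = {
    "animal_1": {"Mammal"},
    "animal_2": {"Mammal", "Reptile"},
}

found = inconsistencies(assertions)
```

Without the disjointness axiom both records would be accepted silently; with it, the contradictory curation of `animal_2` is flagged for review.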

Page 46: Databases and Ontologies:  Where do we go from here?


Page 47: Databases and Ontologies:  Where do we go from here?

1. Look brain region up in NeuroLex
2. Look up cells contained in the brain region
3. Find those cells that are known to project out of that brain region
4. Look up the neurotransmitters for those cells
5. Determine whether those neurotransmitters are known to be excitatory or inhibitory
6. Report the projection as excitatory or inhibitory, and report the entire chain of logic with links back to the wiki pages where they were made
7. Make sure user can get back to each statement in the logic chain to edit it if they think it is wrong

Stephen Larson   CHEBI:18243
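The steps above can be sketched as a small lookup chain. This is a hedged illustration, not the NeuroLex implementation: the four dictionaries are hand-written stand-ins for wiki lookups, "hypothetical local cell" is a placeholder, and the excitatory/inhibitory table is an assumption supplied for the example.

```python
# Hand-written stand-ins for the knowledge base lookups (steps 1 to 5).
CELLS_IN_REGION = {"Cerebellar cortex": ["Purkinje cell", "hypothetical local cell"]}
PROJECTS_OUT = {"Purkinje cell": True, "hypothetical local cell": False}
NEUROTRANSMITTER = {"Purkinje cell": "GABA", "hypothetical local cell": "GABA"}
EFFECT = {"GABA": "inhibitory", "glutamate": "excitatory"}  # assumed for illustration


def classify_projections(region: str) -> list:
    """Label each outgoing projection and keep the chain of statements behind it."""
    reports = []
    for cell in CELLS_IN_REGION.get(region, []):            # step 2
        if not PROJECTS_OUT.get(cell, False):                # step 3
            continue
        transmitter = NEUROTRANSMITTER.get(cell)             # step 4
        effect = EFFECT.get(transmitter, "unknown")          # step 5
        chain = [                                            # step 6: the logic trail
            f"{region} contains {cell}",
            f"{cell} projects out of {region}",
            f"{cell} releases {transmitter}",
            f"{transmitter} is {effect}",
        ]
        reports.append({"cell": cell, "effect": effect, "chain": chain})
    return reports


reports = classify_projections("Cerebellar cortex")
```

Keeping the chain alongside the conclusion is what makes step 7 possible: a user who disagrees with the result can see which individual statement to correct.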

Page 48: Databases and Ontologies:  Where do we go from here?

Brain  

Cerebellum  

Cortex  

Cerebellar  Purkinje  cell  

Purkinje  neuron  

Purkinje  cell  soma  

Purkinje  cell  layer    

Cerebellar  cortex  

IP3  

Cerebellum  

• To create the linkages requires mapping
• Mapping is usually incomplete and not always possible
• Can’t take advantage of others’ work

Gross anatomy ontology   Cell centered anatomy ontology

Reuse identifiers rather than recreate them

Page 49: Databases and Ontologies:  Where do we go from here?

• “The trouble is that if I make up all of my own URIs, my RDF document has no meaning to anyone else unless I explain what each URI is intended to denote or mean. Two RDF documents with no URIs in common have no information that can be interrelated.”
• NIF favors reuse of identifiers rather than mapping
• Creating ontologies to be used as common building blocks: modularity and low semantic overhead are important

http://www.rdfabout.com/intro/#Introducing%20RDF
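The quoted point can be checked mechanically: two documents can only be interrelated through the identifiers they share. In this Python sketch the three "documents" and their lab-specific identifiers are hypothetical; only `CHEBI:16595` (listed for IP3 elsewhere in this deck) is a reused community identifier.

```python
def identifiers(document: list) -> set:
    """Every identifier that appears anywhere in a list of triples."""
    return {term for triple in document for term in triple}


def shared_identifiers(doc_a: list, doc_b: list) -> set:
    """Identifiers through which two documents can be linked."""
    return identifiers(doc_a) & identifiers(doc_b)


# Hypothetical documents from three labs.
doc_lab1 = [("lab1:cell_x", "lab1:contains", "CHEBI:16595")]
doc_lab2 = [("lab2:assay_7", "lab2:measures", "CHEBI:16595")]  # reuses the identifier
doc_lab3 = [("lab3:assay_9", "lab3:measures", "lab3:IP3")]     # invents its own

linkable = shared_identifiers(doc_lab1, doc_lab2)
unlinkable = shared_identifiers(doc_lab1, doc_lab3)
```

Labs 1 and 2 can be joined on the shared identifier with no extra work, while lab 3's private term would first need a mapping, which is the overhead identifier reuse avoids.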

Page 50: Databases and Ontologies:  Where do we go from here?

Cerebellum  Purkinje  cell  soma  

Cerebellum  Purkinje  cell  dendrite  

Cerebellum  Purkinje  cell  axon  

(Cell  part  ontology)  

Cerebellum  granule  cell  layer    (Anatomy  ontology)  

Cerebellum  Purkinje  cell  layer  

Cerebellum  molecular  layer  

Has  part  

Has  part  

Has  part  

Is  part  of  

Is  part  of  

Is  part  of  

Calbindin   IP3  (CHEBI:16595)  

Cerebellum  Purkinje  neuron  (Cell  Ontology)  

Cerebellar  cortex  

Has  part  Has  part  

Has  part  

Page 51: Databases and Ontologies:  Where do we go from here?

• Neuroscience Information Framework
  – NIFSTD available for download
  – Ontoquest web services
  – NIF annotation services and mapping tools available
  – Neurolex available via SPARQL endpoint
• Bioportal: collection of > 300 ontologies covering many domains
  – Automated mapping between ontologies
  – Annotation services
  – Web services for access
• OBO Foundry: http://www.obofoundry.org/
  – Collection of community ontologies designed according to OBO Foundry principles
• Protégé Ontology editor: editing tool for constructing ontologies. Excellent short course available for Protégé/OWL.
• Program on Ontologies of Neural Structures (INCF): CUMBO, Neurolex Wiki, Scalable Brain Atlas

You can enhance your tools and annotation with community ontologies

Page 52: Databases and Ontologies:  Where do we go from here?

http://neurolex.org   Larson et al., Frontiers in Neuroinformatics, in press

• Semantic MediaWiki
• Provides a simple interface for defining the concepts required
• Light weight semantics
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based:
  • Anyone can contribute their terms, concepts, things
  • Anyone can edit
  • Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience

Demo D03

• 200,000 edits
• 150 contributors

Page 53: Databases and Ontologies:  Where do we go from here?

Red Links: information is missing (or misspelled)

Page 54: Databases and Ontologies:  Where do we go from here?

• Neurolex provides an on-line computable index for expressing models in semantic terms, and linking to other knowledge and data
• INCF task forces are contributing knowledge
• Neuroscience knowledge in the web

Builds a knowledge base by cross-modular relations and links to data

Page 55: Databases and Ontologies:  Where do we go from here?

Once terms have been proposed and vetted by the neuroscience community, NIF feeds them back to general ontologies to enrich coverage of neuroscience

Page 56: Databases and Ontologies:  Where do we go from here?

Because their pages have static URLs, wikis are searchable by Google

Page 57: Databases and Ontologies:  Where do we go from here?

• INCF Project
  – Neuron Registry
  – > 30 experts worldwide
  – Fill out neuron pages in Neurolex Wiki
  – Led by Dr. Gordon Shepherd

[Chart: number of red links (total, easy fixes, hard fixes; axis 0 to 300) for soma location, dendrite location and axon location]

Social networks and community sites let us learn things from the collective behavior of contributors

INCF Knowledge Space

Page 58: Databases and Ontologies:  Where do we go from here?

• Of the ~4000 columns that NIF queries, ~1300 map to one of our core categories:
  – Organism
  – Anatomical structure
  – Cell
  – Molecule
  – Function
  – Dysfunction
  – Technique
• 30-50% of NIF’s queries autocomplete
• When NIF combines multiple sources, a set of common fields emerges
  – Basic information models/semantic models exist for certain types of entities

Biomedical science does have a conceptual framework, but we don’t place undue importance on it; it must tie to data

Page 59: Databases and Ontologies:  Where do we go from here?
Page 60: Databases and Ontologies:  Where do we go from here?

• NIF can be used to survey the data landscape
• Analysis of NIF shows multiple databases with similar scope and content
• Many contain partially overlapping data
• Data “flows” from one resource to the next
  – Data is reinterpreted, reanalyzed or added to
• Is duplication good or bad?

NIF is trying to make it easier to work with diverse data

Page 61: Databases and Ontologies:  Where do we go from here?

NIF is in a unique position to answer questions about the neuroscience landscape

Where are the data?

[Chart: brain region (Brain, Cerebral cortex, Striatum, Hypothalamus, Olfactory bulb) versus data source]

Page 62: Databases and Ontologies:  Where do we go from here?

∞  

What is easily machine processable and accessible

What is potentially knowable

What is known: literature, images, human knowledge

Unstructured; natural language processing, entity recognition, image processing and analysis; communication

“Known unknowns vs unknown unknowns”

Open world meets closed world

Page 63: Databases and Ontologies:  Where do we go from here?

Comprehensive  and  unbiased?  

We  know  a  lot  about  some  things  and  less  about  others;    some  of  NIF’s  sources  are  comprehensive;    others  are  highly  biased  

But... NIF has > 2M antibodies, 338,000 model organisms, and 3 million microarray records

Page 64: Databases and Ontologies:  Where do we go from here?

Neocortex  

Olfactory  bulb  

Neostriatum  

Cochlear  nucleus  

All  neurons  with  cell  bodies  in  the  same  brain  region  are  grouped  together  

Properties in Neurolex

Page 65: Databases and Ontologies:  Where do we go from here?

NIF is in a unique position to answer questions about the neuroscience landscape

Where are the data?

[Chart: brain region (Brain, Cerebral cortex, Striatum, Hypothalamus, Olfactory bulb) versus data source, with funding]

Page 66: Databases and Ontologies:  Where do we go from here?

• Requires account in MyNIF
• Still a work in progress, i.e., it breaks a lot
• If you are interested, contact us!

Vadim Astakhov, Kepler Workflow Engine

Page 67: Databases and Ontologies:  Where do we go from here?

• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
• Analysis:
  • 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  • 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  • Results for 1 gene were opposite in DRG and Gemma
  • 45 did not have enough information provided in the paper to make a judgment

Relatively simple standards would make life easier
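The baseline problem described above is mechanical once it is stated explicitly. The following Python sketch is an illustration of that normalization step only, not the analysis actually performed: the function names are hypothetical, and it assumes each database reports a simple "up" or "down" direction relative to its own baseline.

```python
FLIP = {"up": "down", "down": "up"}


def relative_to_saline(direction: str, baseline: str) -> str:
    """Re-express a change against a saline baseline.

    A result reported against a chronic morphine baseline points the
    opposite way, so it is inverted; a saline-baseline result is kept.
    """
    return direction if baseline == "saline" else FLIP[direction]


def consistent(gemma_direction: str, drg_direction: str) -> bool:
    """Compare a Gemma claim with a DRG claim after aligning their baselines."""
    gemma = relative_to_saline(gemma_direction, "chronic morphine")
    drg = relative_to_saline(drg_direction, "saline")
    return gemma == drg


agree = consistent("down", "up")     # opposite raw labels, same finding
disagree = consistent("up", "up")    # same raw labels, opposite findings
```

If both sources recorded their baseline in a standard field, this alignment could be automated instead of being rediscovered by reading the paper, which is the sense in which relatively simple standards would make life easier.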

Page 68: Databases and Ontologies:  Where do we go from here?
Page 69: Databases and Ontologies:  Where do we go from here?

47/53 major preclinical published cancer studies could not be replicated

• “The scientific community assumes that the claims in a preclinical study can be taken at face value - that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case.”
• Getting data out sooner in a form where they can be exposed to many eyes and many analyses may allow us to expose errors and develop better metrics to evaluate the validity of data

Begley and Ellis, 29 March 2012, Nature, vol. 483, p. 531

Page 70: Databases and Ontologies:  Where do we go from here?

NIF favors a hybrid, tiered, federated system

• Domain knowledge
  – Ontologies
• Claims, models and observations
  – Virtuoso RDF triples
  – Model repositories
• Data
  – Data federation
  – Spatial data
  – Workflows
• Narrative
  – Full text access

[Diagram labels: Neuron; Brain part; Disease; Organism; Gene; Technique; People]

[Diagram statements: Caudate projects to Snpc; Grm1 is upregulated in chronic cocaine; Betz cells degenerate in ALS]

NIF provides the tentacles that connect the pieces: a new type of entity for 21st century science

Page 71: Databases and Ontologies:  Where do we go from here?

• Several powerful trends should change the way we think about our data: One → Many
  – Many data
    • Generation of data is getting easier → shared data
    • Data space is getting richer: more –omes every day
    • But... compared to the biological space, still sparse
  – Many eyes
    • Wisdom of crowds
    • More than one way to interpret data
  – Many algorithms
    • Not a single way to analyze data
  – Many analytics
    • “Signatures” in data may not be directly related to the question for which they were acquired but tell us something really interesting

Are you exposing or burying your work?

Page 72: Databases and Ontologies:  Where do we go from here?

• You (and the machine) have to be able to find it
  – Accessible through the web
  – Structured or semi-structured
  – Annotations
• You (and the machine) have to be able to use it
  – Data type specified and in an actionable form
• You (and the machine) have to know what the data mean
  – Semantics
  – Context: experimental metadata
  – Provenance: where did they come from

Reporting neuroscience data within a consistent framework helps enormously, but the frameworks need not be onerous

Page 73: Databases and Ontologies:  Where do we go from here?

A  data  sharing  snafu  in  3  acts  

Page 74: Databases and Ontologies:  Where do we go from here?

http://force11.org

Page 75: Databases and Ontologies:  Where do we go from here?

Jeff Grethe, UCSD, Co-Investigator, Interim PI

Amarnath Gupta, UCSD, Co-Investigator

Anita  Bandrowski,  NIF  Project  Leader  

Gordon  Shepherd,  Yale  University  

Perry  Miller  

Luis  Marenco  

Rixin  Wang  

David  Van  Essen,  Washington  University  

Erin  Reid  

Paul  Sternberg,  Cal  Tech  

Arun  Rangarajan  

Hans  Michael  Muller  

Yuling  Li  

Giorgio  Ascoli,  George  Mason  University  

Sridevi  Polavarum  

Fahim  Imam  

Larry  Lui  

Andrea  Arnaud  Stagg  

Jonathan  Cachat  

Jennifer  Lawrence  

Svetlana  Sulima  

Davis  Banks  

Vadim  Astakhov  

Xufei  Qian  

Chris  Condit  

Mark  Ellisman  

Stephen  Larson  

Willie  Wong  

Tim  Clark,  Harvard  University  

Paolo  Ciccarese  

Karen Skinner, NIH, Program Officer (retired)

Jonathan  Pollock,  NIH,  Program  Officer  

And  my  colleagues  in  Monarch,  dkNet,  3DVC,  Force  11