The Ontrez project at NCBO
Nigam Shah
nigam@stanford.edu
Public data repositories
• Around 1100 databases in the NAR’s 2008 database issue.
• High-throughput gene expression data in repositories such as GEO, SMD, ArrayExpress
• Clinical trial repositories such as caBIG, TrialBank, ClinicalTrials.gov
• Guideline repositories such as www.guideline.gov
• Image repositories such as BIRN
• Observational studies such as Framingham, NHANES, AMCIS.
Database annotation
• Ontology-based annotation is not as widespread as desired
• Most annotation is still free text
• Possible reasons:
  1. Lack of a one-stop shop for bio-ontologies
  2. Lack of tools to annotate experimental data
     • Manual: Phenote
     • Automatic: ?
  3. Lack of a sustainable mechanism to create ontology-based annotations
Different kinds of annotations
[Figure: one dataset with the labels "low level result", "summary result", "annotation" and "metadata"]
Dataset title: Chronic Bladder Overdistension
Dataset description: Expression profiling of cultured bladder smooth muscle cells subjected to repetitive mechanical stimulation for 4 hours. Chronic overdistension results in bladder wall thickening, associated with loss of muscle contractility. Results identify genes whose expression is altered by mechanical stimuli.
Statements shown:
• ELMO1 expression is altered by mechanical stimuli
• Other experiments ...
• ELMO1 associated_with actin cytoskeleton organization and biogenesis
Annotations as assertions
• Annotation = an assertion declaring a relationship between a biomedical entity and a type in an ontology.
  • e.g. p53 <associated_with> cell death
• Annotations tell us what the biologists believe to be true (in particular or in general)
• Most annotations are based on particular observations and are generalized during interpretation by a biologist/curator.
• Semantics of annotations are not always declared a priori (e.g. associated_with, involves)
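An assertion of this kind can be written down as a small record. The sketch below is only an illustration of the entity/relation/type shape described above; the class name, field names and the `scope` field are assumptions made for this example, not part of Ontrez.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """An assertion relating a biomedical entity to a type in an ontology."""
    entity: str    # e.g. a gene or protein symbol
    relation: str  # e.g. "associated_with"; its semantics may be undeclared
    term: str      # label of the ontology type
    # Hypothetical field: whether the claim rests on a particular
    # observation or has been generalized by a curator.
    scope: str = "general"

    def as_text(self) -> str:
        """Render the assertion in the 'entity <relation> term' form."""
        return f"{self.entity} <{self.relation}> {self.term}"


p53 = Annotation("p53", "associated_with", "cell death")
elmo1 = Annotation(
    "ELMO1",
    "associated_with",
    "actin cytoskeleton organization and biogenesis",
)
print(p53.as_text())
print(elmo1.as_text())
```

Keeping the relation as a free string mirrors the point that relation semantics are not always declared a priori.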
Annotations as ‘Meta-data’
• Metadata: the text description accompanying a dataset in a database.
• Metadata annotations should be machine processed (and indexed using ontologies) because
  • The volume is orders of magnitude greater than that of the summary results
  • These annotations do not state any biological fact
    • Hence they don’t need a curator to create them
  • These annotations are to be used to LOCATE datasets accurately as soon as they are available in a public repository
    • We cannot afford to have a curation bottleneck
High level goal
• Process the metadata annotations to automatically tag the ‘elements’ in public repositories with as many ontology terms as possible.
• For example, the description of GEO dataset 906:
  • Expression profiling of cultured bladder smooth muscle cells subjected to repetitive mechanical stimulation for 4 hours. Chronic overdistension results in bladder wall thickening, associated with loss of muscle contractility. Results identify genes whose expression is altered by mechanical stimuli.
• Gets tagged with:
  • Expression, Expression of bladder, bladder, smooth, bladder muscle, muscle, smooth muscle, cells, mechanical, mechanical stimulation, stimulation, Chronic, results, bladder overdistension, associated, associated with, with, loss, genes, altered
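The tagging step above can be sketched as a dictionary lookup over word n-grams. This is a minimal illustration, not the recognizer Ontrez uses: the `TERMS` set is a hypothetical hand-picked dictionary (a real one would be built from ontology term labels), and `tag` and `max_words` are names invented for this example.

```python
import re

# Hypothetical mini-dictionary of ontology term labels (lower-cased).
TERMS = {
    "bladder",
    "smooth muscle",
    "muscle",
    "mechanical stimulation",
    "stimulation",
    "genes",
}

DESCRIPTION = (
    "Expression profiling of cultured bladder smooth muscle cells subjected "
    "to repetitive mechanical stimulation for 4 hours. Chronic overdistension "
    "results in bladder wall thickening, associated with loss of muscle "
    "contractility. Results identify genes whose expression is altered by "
    "mechanical stimuli."
)


def tag(text, terms, max_words=4):
    """Return every dictionary term found as an exact word n-gram, in text order.

    Overlapping matches are all kept (e.g. "smooth muscle" and "muscle").
    Punctuation is ignored, so n-grams may span sentence boundaries.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    found = []
    for start in range(len(words)):
        for length in range(1, max_words + 1):
            if start + length > len(words):
                break
            phrase = " ".join(words[start:start + length])
            if phrase in terms and phrase not in found:
                found.append(phrase)
    return found


print(tag(DESCRIPTION, TERMS))
```

Exact phrase lookup would not produce tags such as "bladder muscle", which is not contiguous in the description, so the actual recognizer presumably does more than this sketch.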
Tagging [annotating] with ontology terms
Querying the annotation index
WHAT NEW SCIENCE DO WE ENABLE?
New Science enabled
• Nature Biotechnology study on image features and gene expression
• Correlation between protein and gene expression for cancer classification
• Correlating gene expression and drug effect information for predicting drug efficacy
• Training and testing image processing algorithms
Decoding global gene expression programs in liver cancer by noninvasive imaging
Eran Segal, Claude B Sirlin, Clara Ooi, Adam S Adler, Jeremy Gollub, Xin Chen, Bryan K Chan, George R Matcuk, Christopher T Barry, Howard Y Chang & Michael D Kuo
Nature Biotechnology 25, 675-680 (2007). Published online: 21 May 2007
Correlation of protein and gene expression for the stratification of breast cancer patients
There are 20 other diseases for which this is possible!

Disease                       GEO samples  TMAD samples
Acute myeloid leukemia        366          3
Malignant melanoma            47           43
B-cell lymphoma               133          27
Prostate cancer               47           15
Renal carcinoma               34           185
Carcinoma squamous            105          175
Multiple myeloma              225          169
Clear cell carcinoma          34           63
Renal cell carcinoma          34           9
Breast carcinoma              3            1277
Hepatocellular carcinoma      80           163
Carcinoma lung                91           66
Cutaneous malignant melanoma  38           41
T-cell lymphoma               29           31
Lymphoblastic lymphoma        29           30
Uterine fibroid               10           19
Medulloblastoma               46           9
Clear cell sarcoma            35           8
Leiomyosarcoma                24           5
Mesothelioma                  54           5
Kaposi's sarcoma              4            3
Cardiomyopathy                14           2
Dilated cardiomyopathy        14           2
TMAD incorporates the NCI Thesaurus ontology for searching tissues in the cancer domain. Image processing researchers can extract images and scores for training and testing classification algorithms.
Current status of the prototype

Resource                 Elements  File size (Kb)  Direct annotations  Closure annotations  Total 'useful' annotations
PubMed                   10164     13461           187686              681973               857459
ArrayExpress             2751      2880            143134              484758               619133
ClinicalTrials.gov       43918     8379            1206939             6792430              5217115
Gene Expression Omnibus  546       163             16494               100984               116234
ARRS GoldMiner           1155      494             53082               290935               340915
TOTAL                    58534     25377           1607335             8351080              7150856
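In the totals above, closure annotations outnumber direct annotations by roughly five to one. The sketch below shows one way such annotations could arise, under the assumption that "closure" means propagating each direct annotation to the ancestors of its term along is_a links. The `IS_A` fragment, the element identifier and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical is_a fragment (child -> parents); not from a real ontology release.
IS_A = {
    "metastatic melanoma": ["melanoma"],
    "melanoma": ["neoplasm"],
}


def ancestors(term, is_a):
    """Collect every term reachable from `term` by following is_a links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def closure_annotations(direct, is_a):
    """Given direct (element, term) pairs, return the inferred pairs only."""
    inferred = set()
    for element, term in direct:
        for ancestor in ancestors(term, is_a):
            inferred.add((element, ancestor))
    return inferred - direct


direct = {("element-1", "metastatic melanoma")}
closure = closure_annotations(direct, IS_A)
print(sorted(closure))
```

With an index built this way, a query for a general term such as "melanoma" would also retrieve elements tagged only with its more specific descendants.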
Ontrez: Target resources
[Figure: grid of target resource types with example item counts per disease]
Resource types shown: Papers; Datasets; Guidelines; Clinical Trials; Treatments; Drugs; Phenotype; Animal models; Alleles and Genotype; mRNA expression; Protein expression; GWAS; RCT reports; Trial description; text; images; Genes; Variations
Example counts as shown:
Metastatic Melanoma    3330  7  76
Invasive Melanoma      237   1  1
Melanoma in situ       314   1  2
Spindle Cell Melanoma  47    0  0
Where can we go?
• Become a service for ‘annotating’ biomedical text.
  – People send us text, we send back recognized concepts (maybe even relationships)
  – Given a set of concepts, we provide a similarity metric between them
  – Both these services can be plugged into a variety of community and collaborative annotation tools
• Become ‘the one stop shop’ for finding items across a wide variety of resources …
  – Integrate on the ‘disease’ dimension. Gene cards exist, disease cards don’t
  – Focus on approx. 15 resources in the next year.
  – PDB and PLoS are interested
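The slides do not say which similarity metric is intended. As one possible choice, the sketch below scores two concepts by the Jaccard overlap of their is_a lineages (the term itself plus all its ancestors). The `IS_A` hierarchy and the function names are hypothetical and serve only as an illustration.

```python
# Hypothetical is_a fragment (child -> parents), for illustration only.
IS_A = {
    "metastatic melanoma": ["melanoma"],
    "melanoma in situ": ["melanoma"],
    "melanoma": ["neoplasm"],
    "renal carcinoma": ["carcinoma"],
    "carcinoma": ["neoplasm"],
}


def lineage(term, is_a):
    """Return the term together with all of its is_a ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def similarity(a, b, is_a):
    """Jaccard overlap of the two lineages: shared terms / all terms."""
    la, lb = lineage(a, is_a), lineage(b, is_a)
    return len(la & lb) / len(la | lb)


print(similarity("metastatic melanoma", "melanoma in situ", IS_A))
print(similarity("metastatic melanoma", "renal carcinoma", IS_A))
```

Under this assumed metric, sibling terms score higher than terms that share only a distant ancestor, and a term compared with itself scores 1.0.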
Research questions - 1

Ontology                           Genes/Proteins  Diseases  Drugs  body parts  developmental stages  Pathways  processes  genetic markers
SNOMEDCT                           ..              X         ..     ..          ..                    ..        ..         ..
RxNORM                             ..              ..        X      ..          ..                    ..        ..         ..
INOH                               ..              ..        ..     ..          ..                    X         ..         ..
NCIT                               ..              X         ..     ..          ..                    ..        ..         ..
Gene Ontology (BP)                 ..              ..        ..     ..          ..                    ..        X          ..
FMA                                ..              ..        ..     X           ..                    ..        ..         ..
Cell type Ontology                 ..              ..        ..     ..          ..                    ..        ..         ..
Mammalian Phenotype                ..              ..        X      ..          ..                    ..        ..         ..
Mouse anatomy and development      ..              ..        ..     X           X                     ..        ..         ..
Zebrafish anatomy and development  ..              ..        ..     X           X                     ..        ..         ..
Research questions - 2

Tool or method                     Genes/Proteins  Diseases  Drugs  body parts  developmental stages  Pathways  processes  genetic markers
GATE                               ..              ..        ..     ..          ..                    ..        ..         ..
UMLS-Query                         ..              ..        ..     ..          ..                    ..        ..         ..
mgrep                              ..              ..        ..     ..          ..                    ..        ..         ..
MetaMap                            ..              ..        ..     ..          ..                    ..        ..         ..
UPenn (conditional random fields)  ..              ..        ..     ..          ..                    ..        ..         ..
Language Modeling methods          ..              ..        ..     ..          ..                    ..        ..         ..
Credits and collaborations
• Clement Jonquet
• Nipun Bhatia
• Manhong Dai
• Fan Meng
• Brian Athey
• Mark Musen