Linking Diseases and Genes through Informatics Knowledge Bases and Ontologies
Joyce A. Mitchell, Ph.D.
National Library of Medicine
University of Missouri
3
Research Goals
Investigating methods of connecting the disease and genomic information.
Overall goals are to:
– Overcome difficulties traversing multiple information resources
– Examine coverage of Unified Medical Language System® (UMLS®), Gene Ontology™ (GO), LocusLink-OMIM
– Develop methods to use ontologies more effectively
– Present data in an understandable manner
4
Background – UMLS
NLM developed, maintains
Purpose: facilitate retrieval & integration of information from multiple biomedical sources
Interrelates 60 biomedical terminologies
– MeSH, SNOMED, Read Codes, ICD, etc.
– No vocabulary focused on molecular biology
1.5 million English terms; 800,000 concepts; 134 Semantic Types; 54 Semantic Relationships
5
Background – Gene Ontology
GO Consortium developed, maintains
Purpose:
– Promotes cross-species methodologies for functional comparisons
– Allows annotation of molecular information on genes, gene products
– “an essential start to creating a shared language of biology” **
Focused on
– molecular function (5626 terms)
– biological processes (4677 terms)
– cellular components (1077 terms)
Two semantic relations (is-a and part-of)
**Genome Research 2001; 11:1425-33.
6
Background - LocusLink
Curated, gene-centered resource of National Center for Biotechnology Information (NLM)
Gene names, gene product names, gene product functions, and reference sequences (DNA, RNA, protein)
Associates phenotype (diseases) to the genotype via Online Mendelian Inheritance in Man (OMIM)
Online links to major bioinformatics knowledge bases and the literature
7
Specific Questions
This study looked at coverage in UMLS of:
1. 1244 genes associated with human diseases
2. 1702 diseases associated with the genes
3. 11,380 Gene Ontology terms
4. 38,832 genes/gene products in GO database (141,071 names)
5. Associations of genes and their functions in UMLS
6. Representation of gene function in GO compared to the UMLS
8
Methods
LocusLink query:
– human genes whose sequence is known and associated with disease (1244 loci)
LocusLink data:
– Genes/gene products (official names, synonyms, symbols)
– Phenotypes (diseases) (1702 diseases)
GO data:
– all concepts (ontology terms), excluding obsolete terms (11,380 terms)
– Gene products from all species (134,646 unique names, 38,832 genes)
9
Methods
LocusLink and GO terms mapped to UMLS concepts
– normalization used
– mappings constrained by semantic type
LocusLink loci studied for relationships in UMLS
– Gene/GP – phenotype
– Gene/GP – molecular function
– Gene/GP – biological process
– Gene/GP – cellular component
For specific genes, compared annotations in GO to representation in UMLS
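The mapping step above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the `normalize` function is a simplified stand-in for lexical normalization (lowercasing, dropping punctuation, sorting words), and the index contents and concept identifiers (`C_DEMO_1` and so on) are made up for the example.

```python
import re

def normalize(term: str) -> str:
    """Lowercase, drop punctuation, and sort words so word order is ignored.

    Simplified stand-in for lexical normalization; an assumption for this sketch.
    """
    words = re.findall(r"[a-z0-9]+", term.lower())
    return " ".join(sorted(words))

# Hypothetical miniature index: normalized string -> [(concept id, semantic type)].
# The concept ids are placeholders, not real UMLS identifiers.
UMLS_INDEX = {
    normalize("Cystic Fibrosis"): [("C_DEMO_1", "Disease or Syndrome")],
    normalize("Dystrophin"): [("C_DEMO_2", "Amino Acid, Peptide, or Protein")],
    normalize("Mitotic spindle"): [("C_DEMO_3", "Cell Component")],
}

def map_term(term: str, allowed_types: set) -> list:
    """Return concept ids whose normalized string matches and whose
    semantic type is in the allowed set for this kind of term."""
    candidates = UMLS_INDEX.get(normalize(term), [])
    return [cid for cid, stype in candidates if stype in allowed_types]

# A disease name matches despite word order and punctuation differences.
print(map_term("Fibrosis, Cystic", {"Disease or Syndrome"}))
# The same string is rejected when the semantic type constraint does not fit.
print(map_term("Cystic Fibrosis", {"Cell Component"}))
```

Constraining by semantic type keeps a string match from counting when the matched concept is the wrong kind of thing, for example a gene symbol that happens to coincide with an unrelated abbreviation.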
10
Results - 1
For 1244 genes from LocusLink
– 18% found in the UMLS

Name type             Found   Mapped/Total
Official gene name    20%     244/1244
Official gene symbol  16%     200/1244
Alias symbol          15%     394/2669
Gene product          18%     266/1460
Preferred product     18%     266/1460
Alias protein         24%     339/1425
11
Results - 2
For 1702 phenotypes (diseases) corresponding to 1244 genes
– 34% found in the UMLS (575/1702)
Most frequent single gene diseases covered
– Huntington Disease
– Cystic Fibrosis
– Marfan Syndrome
– Phenylketonuria
– Achondroplasia
12
Results - 3
GO terms found in MeSH: 2764 terms
GO terms found in SNOMED: 1366 terms

GO branch                   Found   Mapped/Total
GO terms found overall      27%     3062/11,380
Molecular function          44%     2435/5626
Biological process          5%      256/4677
Cellular component          35%     370/1077
13
Results - 4
For 134,646 unique gene names in GO database

Name type   Found   Mapped/Total
Full name   11%     4392/38,832
Symbol      2%      1167/60,381
Synonym     6%      1964/35,433
14
Results - 5
LocusLink – UMLS Relationship Categories found overall: 72%

Genes & gene products and   Found   Related/Total
Phenotype                   64%     754/1182
M. Function                 85%     1192/1409
B. Process                  61%     762/1240
C. Component                76%     841/1107
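The percentages in this table can be recomputed from the counts. The sketch below assumes the overall figure is the pooled ratio of all found relationships to all candidate pairs; that assumption is not stated on the slide, but it is consistent with the reported 72%.

```python
# (found, total) counts copied from the Results - 5 table.
counts = {
    "Phenotype": (754, 1182),
    "M. Function": (1192, 1409),
    "B. Process": (762, 1240),
    "C. Component": (841, 1107),
}

def pct(found: int, total: int) -> int:
    """Coverage as a whole-number percentage."""
    return round(100 * found / total)

per_category = {name: pct(f, t) for name, (f, t) in counts.items()}

# Pooled overall coverage (assumed definition of the "overall" figure).
found_all = sum(f for f, _ in counts.values())
total_all = sum(t for _, t in counts.values())
overall = pct(found_all, total_all)

print(per_category)
print(found_all, total_all, overall)
```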
15
Results - 5
Type of Relationship
– Associative 613
– Co-occurrence 3353
– Hierarchical 1168

G/GP and       Assoc   Co-oc   Hier
Phenotype      275     724     5
M. Function    206     1069    933
B. Process     57      737     147
C. Component   75      823     83
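As a quick consistency check, the per-category counts can be summed by column and compared with the three relationship-type totals. This is only a sketch over the numbers on the slide; the variable names are illustrative.

```python
# Rows: (associative, co-occurrence, hierarchical) counts per category.
rows = {
    "Phenotype": (275, 724, 5),
    "M. Function": (206, 1069, 933),
    "B. Process": (57, 737, 147),
    "C. Component": (75, 823, 83),
}

# zip(*...) transposes the rows into three columns, one per relationship type.
assoc, cooc, hier = (sum(col) for col in zip(*rows.values()))

print(assoc, cooc, hier)
```

The column sums agree with the stated totals of 613, 3353 and 1168. Row sums exceed the number of related pairs reported earlier (for example 1004 relationships for 754 gene-phenotype pairs), which suggests that a single pair can be linked by more than one type of relationship.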
18
[Figure: Gene Ontology annotation of MERL_HUMAN under the three GO branches (Cellular Component, Biological Process, Molecular Function). Node labels: Cell; Membrane; Intracellular; Cytoplasm; Plasma Membrane; Cytoskeleton; Cell growth and/or maintenance; Cell Proliferation; Negative control of cell proliferation; Obsolete; Structural Protein; Tumor Suppressor.]
19
[Figure: Proteins hierarchy. Node labels: Proteins; Neoplasm Proteins; Cell Cycle Proteins; Proteins by Body Part; Tumor Suppressor Proteins; Membrane Proteins; Growth Suppressor Proteins; Neurofibromin 2; Merlin, Drosophila.]
21
Best & Worst Mappings
Best mapping categories
– Molecular function (GO) 44%
– Cellular component (GO) 35%
– Phenotype (LL) 34%
Worst mapping categories
– Gene synonym (GO) 6%
– Biological process (GO) 5%
– Gene symbol (GO) 2%
22
Only 34% of diseases?
In OMIM-LL, diseases are subdivided by genetic causes but not in UMLS
E.g. Limb Girdle Muscular Dystrophy (LGMD)
– LGMD is represented in UMLS
– A SNOMED term
– In MeSH it is an entry term for muscular dystrophies
– MeSH notes for MD: A general term for a group of inherited disorders which are characterized by progressive degeneration of skeletal muscles (ed, 2000)
23
Limb Girdle Muscular Dystrophy – genetic types
LGMD type   Gene Name     LGMD type   Gene Name
1A          Myotilin      2C          Sarcoglycan-gamma
1B          Lamin A/C     2D          Sarcoglycan-alpha
1C          Caveolin-3    2E          Sarcoglycan-beta
1D          Unknown       2F          Sarcoglycan-delta
2A          Calpain-3     2G          Telethonin
2B          Dysferlin     2H          TRIM32
                          2I          Fukutin-related protein
24
Only 5% of Biological Processes?
Only 256 of the biological processes mapped to terms in UMLS.
In GO, processes are elaborated & organism specific
Example:
UMLS
– Mitotic spindle
GO
– Mitotic spindle assembly
– Mitotic spindle assembly (sensu Saccharomyces)
– Mitotic spindle assembly (sensu Fungi)
– Mitotic spindle checkpoint
– Mitotic spindle elongation
– Mitotic spindle orientation
– Mitotic spindle positioning
– Mitotic spindle positioning and orientation
25
Why so few gene names and synonyms mapped?
Official gene names have metadata and comments.
– dystrophin (muscular dystrophy, Duchenne and Becker types), includes DXS143, DXS164, DXS206, DXS230, DXS239, DXS268, DXS269, DXS270, DXS272
No single source has all names and synonyms
GO synonym field contains IPI number for well known genes, does not match UMLS (useful cross reference but not a synonym)
Symbols are short acronyms and match poorly
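To illustrate why embedded metadata blocks exact matching, the sketch below strips the trailing "includes" clause and parenthetical comments from an official gene name. This is one possible preprocessing step, shown under the assumption that such cleanup would help; the study does not report having done it, and the function name `strip_metadata` is hypothetical.

```python
import re

def strip_metadata(official_name: str) -> str:
    """Reduce an official gene name to its core term.

    Drops a trailing ", includes ..." clause first, then any
    parenthetical comments. Hypothetical helper for illustration only.
    """
    name = re.split(r",\s*includes\b", official_name, maxsplit=1)[0]
    name = re.sub(r"\s*\([^)]*\)", "", name)
    return name.strip()

raw = ("dystrophin (muscular dystrophy, Duchenne and Becker types), "
       "includes DXS143, DXS164, DXS206")

# The full string would not match a vocabulary entry for the protein,
# while the stripped core term is a plausible candidate for lookup.
print(strip_metadata(raw))
```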
26
Summary 1
UMLS needs improvement in molecular biology domain but has considerable content:
– 27% of GO concepts map
– 34% of single gene diseases
– Existing UMLS terms come primarily from MeSH and SNOMED
Overall, positive mapping for 13,000 terms