Rheinische Friedrich-Wilhelms-Universität Bonn Institute of Computer Science III Disease Information and Semantic Web Master’s Thesis Supervisor: Prof. Sören Auer, Heiner OberKampf Turan Gojayev München, December 13, 2014


  • Rheinische Friedrich-Wilhelms-Universität Bonn
    Institute of Computer Science III

    Disease Information and Semantic Web

    Master's Thesis

    Supervisor: Prof. Sören Auer, Heiner OberKampf

    Turan Gojayev

    München, December 13, 2014

  • Declaration of Authorship

    I hereby certify that this thesis has been composed by me and is based on my own work, unless stated otherwise. No other person's work has been used without due acknowledgement in this thesis. All references and verbatim extracts have been quoted, and all sources of information, including graphs and data sets, have been specifically acknowledged.

    München, December 13, 2014 Turan Gojayev

  • Contents

    1 Introduction
        1.1 Motivation and Background
        1.2 Problem definition
        1.3 Related Work
        1.4 Approach
        1.5 Structure of the thesis

    2 Basics
        2.1 Semantic Web
            2.1.1 Triples, Ontologies, Reasoners
            2.1.2 SPARQL

    3 BioPortal
        3.1 The structure of dataset on BioPortal
            3.1.1 Ontology Repository
            3.1.2 Ontology Metadata
            3.1.3 Mappings

    4 Disease Ontology, Symptom Ontology and UMLS as a starting point.
        4.1 Human-Disease Ontology
        4.2 Symptom Ontology
        4.3 Unified Medical Language System (UMLS)

    5 Diseases and Symptoms
        5.1 Disease Information
        5.2 Symptom Information
        5.3 Data Overlap

    6 Disease-Symptom Relationships
        6.1 UMLS group ontologies
        6.2 Non-UMLS ontologies
        6.3 Data Model

    7 Disease and Symptom Graphs
        7.1 Disease Graph
            7.1.1 Default approach
            7.1.2 Adapted approach
        7.2 Symptom Graph
            7.2.1 Default approach
            7.2.2 Adapted approach

    8 Summary

    Bibliography

    List of Figures

    List of Tables

    Listings

  • 1 Introduction

    1.1 Motivation and Background

    Since ancient times people have observed changes in health and recorded this information on surfaces of different materials. Over time, using these recorded pieces of knowledge, they gradually learned to understand these changes and the different causes that lead to them, as well as how to cure them. Records showing the changes that lead to a certain state of health play a vital role in spotting problems and their sources. With the evolution of science, the means of storing this knowledge have been replaced by new technology, and its accuracy and volume have increased considerably. Today's clinical data contain knowledge about thousands of different types of diseases, symptoms, body parts, etc. There have also been many attempts to organize this information in a useful manner. However, for various reasons scientists do not always agree on which terms to use, which results in the existence of many vocabularies in the same domain with huge overlap. Thus, it is very important to relate these vocabularies to each other. With the representation of scientific vocabularies in the Semantic Web, making these connections becomes straightforward.

    BioPortal [1], the world's largest ontology repository for biomedicine, contains more than 400 ontologies and more than 6 million classes that define the terms in them. It also stores millions of mappings between terms of different vocabularies. Nevertheless, the ontologies cover different fields of the biomedical domain, and therefore it is not possible, for instance, to search only in data about diseases. In other words, the data is not arranged around specific concepts. Having all data sorted by concept can help users or applications find the required data much more easily. In particular, we are interested in identifying the disease and symptom data within the BioPortal ontologies.

    Furthermore, relationships between data of different types of concepts can be very useful. For example, one might be interested in all the symptoms that indicate the disease pneumonia, or in all the body parts where the disease cancer might occur. The absence of categorization of the data is the main reason why it is difficult to look for this kind of knowledge. If we had data organized around the concepts of disease and symptom, we could look for the connections between them and declare the connecting properties as subproperties of a general one, which could be called has_symptom and would point from diseases to their symptoms. With the help of links between predicates that carry semantic information, we can query symptoms for diseases without knowing the exact property that is used in a specific ontology. Subproperty relations between different predicates used in triples of BioPortal ontologies do exist, but only for properties that carry either lexical information or a definition. For instance, different predicates used for storing labels of classes are linked to a common skos:prefLabel predicate using rdfs:subPropertyOf. Another example is the usage of the skos:definition predicate as a representative for the properties that give a definition of the classes. These relations are very useful for querying the ontologies with a common query form.
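    The idea of querying through a general property can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this thesis: the triples, the prefixes ex1: and ex2: and the predicates ex1:shows, ex2:presents_with and ex2:related_to are invented for illustration, and only has_symptom and rdfs:subPropertyOf are taken from the text.

```python
SUB_PROPERTY_OF = "rdfs:subPropertyOf"

# Toy triples. The ontology-specific predicates are hypothetical;
# only has_symptom is named in the text.
TRIPLES = [
    ("ex1:shows", SUB_PROPERTY_OF, "has_symptom"),
    ("ex2:presents_with", SUB_PROPERTY_OF, "has_symptom"),
    ("ex1:pneumonia", "ex1:shows", "ex1:cough"),
    ("ex2:pneumonia", "ex2:presents_with", "ex2:fever"),
    ("ex2:pneumonia", "ex2:related_to", "ex2:bronchitis"),
]


def subproperties_of(triples, general):
    """Return the general property plus every property declared below it."""
    found = {general}
    changed = True
    while changed:  # repeat so that chains of subproperties are followed
        changed = False
        for s, p, o in triples:
            if p == SUB_PROPERTY_OF and o in found and s not in found:
                found.add(s)
                changed = True
    return found


def pairs_via(triples, general):
    """Collect (subject, object) pairs stated with the general property
    or with any of its subproperties."""
    props = subproperties_of(triples, general)
    return sorted((s, o) for s, p, o in triples if p in props)


symptom_pairs = pairs_via(TRIPLES, "has_symptom")
```

    The caller asks only for has_symptom and never needs to know ex1:shows or ex2:presents_with; the ex2:related_to triple is left out because it is not declared as a subproperty.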

    1.2 Problem definition

    As mentioned in the previous section, in spite of the presence of vocabularies from many different fields on BioPortal, the data is not categorized. The second problem, which partially depends on the first, concerns the relationships between different types of data. One difficulty in identifying the type of data is that, for instance, the disease and symptom data sets partially overlap. One can try to identify diseases, symptoms and relationships between them without solving the first problem, just by analysing the predicates used in ontologies. However, more than 2600 distinct properties are used in BioPortal ontologies, and going through this list and guessing whether the subjects and objects corresponding to a predicate are about diseases or symptoms would take a lot of time. Moreover, the semantics of predicates are imprecise. For instance, related_to in MEDLINEPLUS is used for disease-disease, disease-symptom and symptom-symptom triples. Therefore, solving the first problem reduces the amount of work required to solve the second.

    1.3 Related Work

    The Unified Medical Language System (UMLS) [2], a system incorporating the main vocabularies of the biomedical domain, defines semantic types and relationships between types. The ontologies that are part of UMLS, such as Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) [3], MedDRA [4], etc., contain a large number of classes having semantic type disease or symptom. There has also already been work on arranging the data in biomedical ontologies around the concept of disease [5]. Human-Disease Ontology (DO) is an ongoing project that intends to create a single structure for the classification of disease which unifies the representation of disease among the many and varied terminologies and vocabularies into a relational ontology that permits inference and reasoning of the relationships between disease terms and concepts. DO contains more than 8600 disease classes, and its terms have extensive references to Medical Subject Headings (MeSH) [6], International Classification of Diseases (ICD) [7], SNOMED CT and other very prominent medical ontologies. In addition, DO contains relations to Foundational Model of Anatomy (FMA), Human Phenotype Ontology (HP) [8], Symptom Ontology (SYMP) [9] and other ontologies that contain knowledge about disease attributes. This knowledge is not in a structured format, but rather in textual definitions. For instance, there are 388 distinct diseases containing information about symptoms in the definition (in total 777 such definitions). Moreover, SYMP was designed around the symptom concept by the same author.

    Generic Human Disease Ontology (GHDO) [10] is proposed as a model with four dimensions of data: disease types, symptoms, causes and treatments. This ontology is designed in such a way that disease types may be divided into sub-types, and causes for diseases can have two main branches (genetic and environmental). Also, for each disease there are different treatments and symptoms indicating it. Nonetheless, no ontology has been published from the proposed model. In yet another work [11], one more ontology model for storing disease and symptom relationships is proposed, but the actual work and results are left for the future. The authors of [12] try to relate DO and SYMP by finding the links between diseases and symptoms. They propose an algorithm for linking classes, but it assumes that one can already get symptoms for a selected disease from a health website, a server or a database, and as a result they have symptoms for 11 diseases. Also, [13] proposes a Disease-Symptom Ontology model, but it contains only a few manually entered relationships between diseases and symptoms.

    Our main goal is not to have related classes mapped to each other, but rather to understand the semantics of the data on BioPortal, where ontologies contain vocabularies from various fields; in this way, our task differs from general ontology alignment [14].

    The ontologies that are part of UMLS and contain classes with semantic types disease and symptom in many cases also have relations between them. To map different disease-symptom specific predicates to the same common superproperty, we have to understand the semantics of those relations. As we will see in later chapters, there are many properties connecting diseases to symptoms; however, only a few of these predicates can be used for linking diseases to their specific symptoms.

    1.4 Approach

    We can see from the related work that there are already ontologies on BioPortal whose data is centered around the disease (DO) or symptom (SYMP) concept, or which contain both types of knowledge (UMLS ontologies). UMLS group ontologies also contain relationships between these two types of data. Furthermore, there is a large number of mappings between classes on BioPortal. However, all this knowledge has not been analysed as a whole, and we try to address this issue in the current work.

    In this thesis we try to integrate disease and symptom related data, as well as their relationships, by analysing BioPortal ontologies. We select disease and symptom datasets as a starting point and then, using them and existing BioPortal mappings, we retrieve more disease and symptom information. Some parts of the resulting datasets overlap, showing that the notions of disease and symptom are not precise. We try to separate these knowledge bases as much as possible. With this data in hand, we look for the connections between these two datasets and try to find predicates that link diseases to their specific symptoms. In addition, we link these predicates to a common property using rdfs:subPropertyOf in order to make querying the resulting ontology easier.

    Since we retrieve a large number of classes using mappings, we create a graph consisting of classes as nodes and mappings as edges between them. We do this for both disease and symptom data. We also assume that these mappings are correct and that linked classes represent the same disease on the disease graph or the same symptom on the symptom graph. Thus, we find the connected components of those graphs and treat each of them as a single disease or symptom class. This also increases the number of disease classes linked to symptom classes.
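    The grouping step described above can be sketched as follows. This is a minimal illustration using a depth-first search over an undirected graph, not the procedure applied in Chapter 7; the class identifiers in MAPPINGS are invented, and treating every mapping as an undirected edge is an assumption of the sketch.

```python
from collections import defaultdict


def connected_components(mappings):
    """Group classes linked by mappings; each group is treated as one concept."""
    neighbours = defaultdict(set)
    for source, target in mappings:
        neighbours[source].add(target)  # mappings are treated as undirected edges
        neighbours[target].add(source)

    seen = set()
    components = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            component.add(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(sorted(component))
    return components


# Invented class identifiers, for illustration only.
MAPPINGS = [
    ("DO:a", "MESH:a"),
    ("MESH:a", "SNOMED:a"),
    ("DO:b", "ICD:b"),
]
components = connected_components(MAPPINGS)
```

    In this toy input DO:a and SNOMED:a are never mapped to each other directly, yet they end up in the same component because both are mapped to MESH:a. This is exactly the effect that makes the correctness of the mappings a critical assumption.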

    Using this approach we try to arrange data around specific concepts in a repository with many ontologies. Although we consider only disease and symptom data, we believe that one could repeat the same process for other concepts as well, for instance body parts. One of the key points is the selection of the correct starting data.

    1.5 Structure of the thesis

    In the next chapter we briefly describe Semantic Web technologies.

    In Chapter 3 we explain what BioPortal is, what functionality and what kind of data it offers, and how the data is structured on it.

    In Chapter 4 we describe what knowledge we have at the beginning of our work and which information we select as a starting point to retrieve more relevant data. In particular, we talk about the Unified Medical Language System, which plays an important role in fetching the required data and in grouping it.

    Chapter 5 starts with a description of how we combine BioPortal mappings with the data we select as a starting point in order to retrieve more data. We give an overview of all the knowledge that we acquire by this method and explain how we define the core part and the potential part of the data. We then face the problem of data overlap between disease related and symptom related resources, and describe how we separate them.

    In Chapter 6 we try to find the connection between disease and symptom data. We do this for both the core and the potential parts of the disease and symptom related knowledge. We analyse the triples, select the predicates that connect these data, and decide which of those triples to keep in our data model. From those triples we also choose disease-symptom specific knowledge that represents the symptoms occurring with a given disease. We find which properties indicate this sort of data, define a predicate hasSymptom and use it to store this information in our data model.


    In Chapter 7 we show how we build a graph out of disease and symptom data. We then group pieces of data into clusters in order to have similar data packed together. We do this separately for the disease and symptom graphs in two different ways, and we discuss the differences between those approaches.


  • 2 Basics

    2.1 Semantic Web

    The World Wide Web (WWW) is a web of data. At the time of its creation it was mainly intended for human consumption. The development of technology has led to the point where the WWW has become a web of data not only for humans, but also for applications. However, the way the data is represented was meant for human users and thus is not very appropriate for applications. For example, when there is a link to another resource on a web page, the context surrounding the Uniform Resource Locator (URL) gives a user an idea of the meaning of the link. Yet not all applications have text analysis facilities that help them understand the semantics of this connection.

    The Semantic Web, in turn, adds meaning to the content. It is a web of data described and linked in ways to establish context or semantics that adhere to defined grammar and language constructs [15]. Nonetheless, it is not a substitute for the WWW, but rather an extension of it through standardized semantics.

    2.1.1 Triples, Ontologies, Reasoners

    Triples, or statements, can be considered the foundational units of the Semantic Web. A triple gets its name from the number of components it contains. Each triple states a fact and consists of a subject, a predicate and an object. The subject of the triple is the resource that the statement describes. The object of the triple is the resource, blank node or literal value (such as a string, number or date) that the statement relates to the subject. The predicate provides the relationship between subject and object. Triples can define the structure of the information, limits on that structure, instance data, etc. A set of such triples is called a Resource Description Framework (RDF) [16] graph. RDF is a general framework and can be considered a grammar that defines how to represent any information on the Web.

    Resources are identified by Uniform Resource Identifiers (URIs), which provide a mechanism to identify resources on the Web uniquely. The difference between URIs and URLs is that a URI does not always refer to a physical resource on the Web, whilst a URL can always be dereferenced.

    Figure 2.1 visualises a piece of data from Human-Disease Ontology. Blue rectangles represent classes, and light red rounded rectangles show string values that carry the type of information specified by the labels on the elbow connectors. In total, four triples are represented in the picture. They all have the same subject, the URI http://purl.obolibrary.org/obo/DOID_10652. Four different predicates relate this URI to a label for the class it represents, subclass information, a synonym for the label of the class and a cross-reference to another ontology. Two objects, "Alzheimer's disease" and "Dementia of the Alzheimer's type", carry string values and thus are literals. The other two objects are resources.
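    The distinction between literal and resource objects can be made concrete with a small sketch in plain Python. The subject URI and the two string values are taken from the text; the predicate names (ex:label and so on), the two resource objects and the assignment of each string to the label or the synonym are placeholders, since the figure itself is not reproduced here.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])
Literal = namedtuple("Literal", ["value"])  # marks an object as a plain value

SUBJECT = "http://purl.obolibrary.org/obo/DOID_10652"

# Predicates and resource objects below are hypothetical stand-ins.
TRIPLES = [
    Triple(SUBJECT, "ex:label", Literal("Alzheimer's disease")),
    Triple(SUBJECT, "ex:subClassOf", "ex:SomeParentClass"),
    Triple(SUBJECT, "ex:synonym", Literal("Dementia of the Alzheimer's type")),
    Triple(SUBJECT, "ex:crossReference", "ex:SomeExternalClass"),
]

# Split the objects into literals (values) and resources (identifiers).
literals = [t.object.value for t in TRIPLES if isinstance(t.object, Literal)]
resources = [t.object for t in TRIPLES if not isinstance(t.object, Literal)]
```

    A resource object could itself appear as the subject of further triples, whereas a literal cannot; this is why the two kinds of object are kept apart.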

    All these data are represented and stored using the Web Ontology Language (OWL) [17]. It is intended for use by applications that process Web documents, rather than for presenting them to human users. The current version of this language is OWL 2 [18].

    An ontology is simply a collection of triples that define different concepts, their relationships and constraints. It can be compared to a database in the relational world. Ontologies can contain data from one specific domain or can be a hybrid of several different fields. There are many rich ontologies that can be used by applications directly or in an adapted manner. Of course, an application can also create an ontology from scratch, but usually applications make use of existing ontologies to link data to well-known, commonly used data sources. Ontologies are stored using one of the serialization formats (Turtle [19], N-Triples [20], RDF/XML [21] or others).

    Figure 2.1: Example data from DO

    The data in the Semantic Web can be split into two parts: stated and inferred. We have briefly described what can be stated in ontologies. But what adds semantics to the Semantic Web is the possibility of inferring additional information from the stated data. Predicates used in triples can be considered functions that have specific domain and range constraints, as well as properties that let us infer knowledge that is not explicitly stated. This inference is done by a piece of software called a reasoner.
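    As a minimal sketch of what such an inference looks like, the following plain Python function applies the RDFS domain and range rules once to a list of stated triples. It is a toy illustration and not a reasoner: all identifiers with the ex: prefix are invented, and a property is assumed to have at most one domain and one range.

```python
def infer_types(triples):
    """Apply the RDFS domain and range rules once and return the new triples."""
    domains = {s: o for s, p, o in triples if p == "rdfs:domain"}
    ranges = {s: o for s, p, o in triples if p == "rdfs:range"}
    inferred = set()
    for s, p, o in triples:
        if p in domains:  # the subject must belong to the domain class
            inferred.add((s, "rdf:type", domains[p]))
        if p in ranges:  # the object must belong to the range class
            inferred.add((o, "rdf:type", ranges[p]))
    return inferred - set(triples)


# Stated data; every ex: name is hypothetical.
STATED = [
    ("ex:hasSymptom", "rdfs:domain", "ex:Disease"),
    ("ex:hasSymptom", "rdfs:range", "ex:Symptom"),
    ("ex:condition1", "ex:hasSymptom", "ex:finding1"),
]
inferred = infer_types(STATED)
```

    Nothing in STATED says that ex:condition1 is a disease or that ex:finding1 is a symptom; both facts follow only from the domain and range declared for the predicate.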

    2.1.2 SPARQL

    SPARQL plays the same role for RDF that SQL plays for relational databases. It is a query language designed for querying RDF databases. SPARQL queries can include one or more triple patterns in which the subject, predicate and/or object can be variables. These queries are sent to SPARQL endpoints [22]. At the endpoint, the triple patterns in the query are matched against the triples stored in the specified RDF graphs.

    Listing 2.1 shows an example of a SPARQL query. This query returns all the triples in the graph http://bioportal.bioontology.org/ontologies/DOID that have the URI http://purl.obolibrary.org/obo/DOID_10652 as their subject. Four of these triples are shown in Figure 2.1.

    Listing 2.1: SPARQL query example

    SELECT *
    FROM <http://bioportal.bioontology.org/ontologies/DOID>
    WHERE {
        <http://purl.obolibrary.org/obo/DOID_10652> ?p ?o
    }
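    How a single triple pattern with variables is matched against stored triples can be sketched in a few lines of plain Python. This only illustrates the matching idea and is far from a SPARQL engine; the function name match and the sample predicates are hypothetical, and only the subject URI comes from the listing.

```python
def match(pattern, triples):
    """Return one binding dict per triple that matches the pattern.

    Pattern terms starting with '?' are variables; all others must be equal
    to the corresponding component of the triple.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.get(term, value) != value:
                    break  # same variable already bound to another value
                binding[term] = value
            elif term != value:
                break  # constant term does not match
        else:
            results.append(binding)
    return results


SUBJECT = "http://purl.obolibrary.org/obo/DOID_10652"

# Hypothetical stored triples.
STORE = [
    (SUBJECT, "ex:label", "Alzheimer's disease"),
    (SUBJECT, "ex:synonym", "Dementia of the Alzheimer's type"),
    ("ex:other", "ex:label", "something else"),
]
rows = match((SUBJECT, "?p", "?o"), STORE)
```

    Each row binds ?p and ?o for one stored triple whose subject is the given URI, which mirrors the result table a SELECT * query would return.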


  • 3 BioPortal

    BioPortal is a Web-based application that gives its users easy access to the ontologies it contains. At the time of writing, with 402 ontologies and 6,062,730 classes in those ontologies, it is likely the world's most comprehensive repository of ontologies in the biomedical domain. With functions that enable users to browse, find and filter ontologies, search for specific terms within those ontologies, submit new ones and explore the mappings among them, BioPortal is one of the many tools that NCBO [23] offers.

    Ontologies in BioPortal cover various fields of biomedicine. For instance, Human-Disease Ontology contains information about human diseases, Human Phenotype Ontology about phenotypic features encountered in human hereditary and other diseases, and Symptom Ontology about symptoms, while Protein Ontology [24] provides an ontological representation of protein-related entities by explicitly defining them and showing the relationships between them. Besides the diversity of their domains, the ontologies in BioPortal also differ in size, expressivity and quality.

    Three main formats are used for storing the ontologies:

    1. OBO, the text file format used by OBO-Edit [25], the open source, platform-independent application for viewing and editing ontologies.

    2. OWL, which is a W3C recommendation for representing ontologies on the Semantic Web.

    3. RRF, the format mainly used by the US National Library of Medicine to distribute the vocabularies that constitute the UMLS.

    BioPortal allows its users to publish and review ontologies, and to browse through them, through their classes or through the mappings between ontologies via the interface. Once an ontology is selected, one can view the metrics calculated on that ontology, e.g. the number of classes, individuals, properties, classes without definition, etc. Moreover, users who build applications that use ontologies from BioPortal can register this information. This lets other people see which current projects use those ontologies, and whether those ontologies are relevant for the projects they might be considering creating.

    Through the RESTful API available at http://data.bioontology.org/, BioPortal lets users search for given query terms, annotate a text with classes from any of the ontologies, and retrieve the different resources that are stored in BioPortal. Moreover, the SPARQL endpoint [26] allows users to make more complicated queries that can be better adjusted to specific projects.
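    A term search through the REST API might be prepared as in the sketch below, which only builds the request URL and does not contact the service. The base URL is the one given above; the /search path and the q, ontologies and apikey parameters reflect our reading of the public API documentation rather than anything stated in this thesis and should be checked against it. The helper name search_url and the key YOUR_API_KEY are placeholders.

```python
from urllib.parse import urlencode

BASE_URL = "http://data.bioontology.org"


def search_url(term, api_key, ontologies=None):
    """Build a term-search URL; path and parameter names are assumptions."""
    params = {"q": term, "apikey": api_key}
    if ontologies:
        # restrict the search to the given ontology acronyms
        params["ontologies"] = ",".join(ontologies)
    return f"{BASE_URL}/search?{urlencode(params)}"


url = search_url("pneumonia", "YOUR_API_KEY", ["DOID"])
```

    The resulting URL could then be fetched with any HTTP client; the response format is not covered by this sketch.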

    3.1 The structure of dataset on BioPortal

    The data on BioPortal consists of three essential parts as specified in [27]:

    Ontologies

    Metadata

    Mappings

    3.1.1 Ontology Repository

    The essential part of the data in BioPortal is contained in the actual ontologies that are uploaded by the users. Several versions of an ontology can be kept in the repository. There are many ontologies with thousands or even tens of thousands of classes. The predicates used in the ontologies by their authors also cover a very broad range. At the time of writing, we have found 2657 distinct predicates on BioPortal. Some of those predicates are mapped to common properties (by means of the rdfs:subPropertyOf predicate), which makes the query process easier. For instance, the predicates that stand for preferred labels of the terms are mapped to skos:prefLabel, and the properties that stand for synonyms of the terms are mapped to skos:altLabel. These subproperty declarations are saved in a "globals" graph, and one can make use of them by querying that graph.

    3.1.2 Ontology Metadata

    BioPortal uses a specifically designed ontology for storing metadata information. It imports a number of other ontologies and includes classes to describe an ontology itself, its versions, metadata properties about the ontology, creators of an ontology, and user-contributed content such as notes, reviews, mappings and views [28]. The two main entities in the metadata are meta:VirtualOntology and omv:Ontology. A meta:VirtualOntology represents a container for all versions of an ontology, and an omv:Ontology represents a particular ontology version [27]. Figure 3.1 from [27] describes the connections between these two elements.


    Figure 3.1: Metadata: Virtual Ontologies and Version Ontologies. [27]

    3.1.3 Mappings

    The mappings on BioPortal are stored in a separate graph. These mappings can be uploaded by users separately from the ontologies, which lets all users add mappings between existing ontologies. Each mapping is created between two classes and contains information such as the target class, target ontology, source class, source ontology, relation type, etc. In later sections we will show how we use this information for our purposes.

    There are several sources for the mappings on BioPortal:

    1. Lexical Mappings (LOOM [29]) - these are created by software, based on a notion of similarity between preferred labels or between preferred and alternative labels. Any labels with no more than 3 characters are excluded.

    2. CUI Mappings from UMLS - mappings based on the Concept Unique Identifier (CUI) from the UMLS network.

    3. User submitted Mappings (REST) - mappings that are created manually by users.

    4. URI-based Mappings - mappings between classes with the same URI in different ontologies.

    5. Xref OBO Mappings - mappings that are created based on the OBO xref property.

    6. CUI Mappings from no UMLS - mappings based on CUIs from ontologies that are not part of UMLS.
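    The mapping record and the lexical source can be illustrated with a short sketch. It is not LOOM: the text does not specify the similarity notion, so the sketch assumes exact equality after lower-casing and removing non-alphanumeric characters, and it treats preferred and alternative labels alike. The field names of Mapping follow the list of mapping attributes given above, while the relation type skos:closeMatch, the origin field and all class identifiers are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    source_class: str
    source_ontology: str
    target_class: str
    target_ontology: str
    relation_type: str
    origin: str  # which of the mapping sources produced the record


def normalise(label):
    """Lower-case a label and drop everything except letters and digits."""
    return "".join(ch for ch in label.lower() if ch.isalnum())


def lexical_mappings(classes_a, ontology_a, classes_b, ontology_b):
    """Map classes whose labels are equal after normalisation.

    classes_x maps a class identifier to its list of labels.
    Labels with no more than 3 characters are ignored.
    """
    index = {}
    for class_id, labels in classes_b.items():
        for label in labels:
            if len(label) > 3:
                index.setdefault(normalise(label), set()).add(class_id)
    found = set()
    for class_id, labels in classes_a.items():
        for label in labels:
            if len(label) <= 3:
                continue
            for target in index.get(normalise(label), ()):
                found.add(Mapping(class_id, ontology_a, target, ontology_b,
                                  "skos:closeMatch", "LOOM"))
    return sorted(found, key=lambda m: (m.source_class, m.target_class))


# Invented classes and labels.
A = {"A:1": ["Alzheimer's disease"], "A:2": ["Flu"]}
B = {"B:9": ["Alzheimers Disease"], "B:7": ["flu"]}
mappings = lexical_mappings(A, "A", B, "B")
```

    The pair A:2 and B:7 is not mapped although the labels agree, because both labels fall under the three-character exclusion.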


  • 4 Disease Ontology, Symptom Ontology and UMLS as a starting point.

    As mentioned in Chapter 3, BioPortal contains around 400 ontologies, and the domains of those ontologies cover such subjects as anatomy, phenotype description, experimental conditions, health, etc. Since disease and symptom information within the BioPortal repository is the main focus of this thesis, we are interested in those ontologies that can be related to one or both of them.

    There are a number of ontologies that store knowledge about diseases. However, users have to be familiar with them beforehand in order to look up the terms, definitions or any other kind of data about diseases they might be interested in. Therefore, it is important to have an overview of this information. In this chapter we describe some of these ontologies, and in the next chapters we show how we make use of them and of BioPortal mappings to combine disease and symptom related knowledge and retrieve it in a simple way.

    4.1 Human-Disease Ontology

    Human-Disease Ontology represents a comprehensive knowledge base of inherited, developmental and acquired diseases. It integrates disease and medical vocabularies through the usage of cross-mappings and the integration of MeSH, ICD, NCI's thesaurus, SNOMED CT and OMIM disease specific terms and identifiers. The DO is utilized for disease annotation by major biomedical databases (e.g., Array Express, NIF, IEDB), as a standard representation of human disease in biomedical ontologies (e.g., IDO, Cell line ontology, NIFSTD ontology, Experimental Factor Ontology, Influenza Ontology), and as an ontological cross-mappings resource between DO, MeSH and OMIM (e.g., GeneWiki). DO has been incorporated into open source tools (e.g., Gene Answers, FunDO) to connect gene and disease biomedical data through the lens of human disease.

    At the time of writing it contains 8681 disease classes, 2260 of which have textual definitions annotated with disease attributes such as symptom, phenotype, anatomical location, etc. The has_symptom property is used in triples to annotate the textual definitions with symptom information; only the definitions of 388 distinct classes are annotated with this predicate.

    4.2 Symptom Ontology

    The Symptom Ontology was developed as part of the Gemina project [9]. It is built around the concept of a symptom as "a perceived change in function, sensation or appearance reported by a patient indicative of a disease". SYMP is organized primarily by body regions, with a branch for general symptoms. In July 2008 the Symptom Ontology was submitted to the OBO Foundry for inclusion and review and was adopted. It continues to undergo active development to incorporate the Basic Formal Ontology structure.

    4.3 Unified Medical Language System (UMLS)

    UMLS, started in 1986 by the US National Library of Medicine, is a system for integrating major vocabularies and standards from the biomedical domain, such as SNOMED CT, MeSH, ICD, LOINC, RxNorm and several others. UMLS consists of three sources called the Metathesaurus, the Semantic Network and the SPECIALIST lexicon.

    The Metathesaurus is a huge vocabulary that contains 1 million unique concepts about biomedicine with 5 million concept names from more than 100 terminologies, classifications and thesauri, and more than 17 million relationships between concepts. The Metathesaurus is organized by concept (meaning). Each concept is given a unique identifier (CUI) and can have several names, since these concepts might come from different vocabularies; the identifiers are designed for linking all these names to the same thing. CUIs are assigned permanently and might change only if it is discovered that several CUIs actually represent the same concept.
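    Because every name of a concept carries the same CUI, classes from different vocabularies can be linked simply by grouping them on this identifier. The sketch below shows this grouping in plain Python; it is only an illustration of the principle, and the class identifiers and CUI values are invented.

```python
from collections import defaultdict
from itertools import combinations


def cui_links(class_cuis):
    """Return (class_a, class_b, cui) for every pair of classes sharing a CUI."""
    by_cui = defaultdict(list)
    for class_id, cui in class_cuis:
        by_cui[cui].append(class_id)
    links = []
    for cui in sorted(by_cui):
        # every unordered pair within one CUI group denotes the same concept
        for a, b in combinations(sorted(by_cui[cui]), 2):
            links.append((a, b, cui))
    return links


# Invented identifiers; real CUIs are assigned by UMLS.
CLASS_CUIS = [
    ("MESH:x1", "C0000001"),
    ("SNOMED:y1", "C0000001"),
    ("ICD:z1", "C0000001"),
    ("MESH:x2", "C0000002"),
]
links = cui_links(CLASS_CUIS)
```

    Three classes with the same CUI yield three pairwise links, while the class that is alone in its group yields none.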

    The Semantic Network provides a categorization of the concepts that appear in the Metathesaurus and of the relationships between them. It consists of semantic types and semantic relations. Each concept is assigned at least one semantic type. Semantic types are the nodes in the semantic network, and the relations are the links between them. A portion of this network is depicted in Figure 4.1.

    There are semantic types for biologic functions, organisms, anatomical structures, clinical findings, etc. In total there are 133 semantic types and 54 semantic relationships defined.


    Figure 4.1: A Portion of the UMLS Semantic Network: Relations (source: http://www.ncbi.nlm.nih.gov/books/NBK9679/figure/ch05.F3/?report=objectonly)


  • 5 Diseases and Symptoms

    In Chapter 4 we have discussed Human-Disease Ontology, Symptom Ontology andUMLS Metathesaurus and Semantic Network. Now we use these sources of informationand BioPortal mappings to get an overview of the disease and symptom informationavailable at BioPortal. The method we apply in this chapter does not guarantee togather all the disease and symptom related knowledge, that can be found on BioPortal,but rather focuses on the data that is located at the neighborhood of the selectedontologies. Neighborhood of the ontologies should be understood as a data that can bereached via the mappings from the classes of this ontology.

    First, we select those ontologies in BioPortal that we are sure contain the required knowledge. In the biomedical domain, classes and expressions play an important role in ontologies, in contrast to the instance data that dominates in other domains. Therefore, we retrieve the classes relevant to us in the selected ontologies and define them as core classes. Then, for each core class, we request the mappings that have this class as their source. We store the targets of these mappings as potential classes, together with the sources of the mappings and the ontologies they come from. Later on, we decide which of them to keep, based on other facts that appear in the course of the analysis.
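The collection step described above can be sketched as follows. This is a minimal illustration on toy data, not the code used for the thesis: the class identifiers, the tuple layout of a mapping and the function name `collect_potential` are hypothetical, and excluding targets that are themselves core classes is an assumption.

```python
from collections import defaultdict

# Hypothetical toy data: (source_class, target_class, target_ontology) mappings.
MAPPINGS = [
    ("DO:1", "SNOMEDCT:10", "SNOMEDCT"),
    ("DO:1", "MESH:20", "MESH"),
    ("DO:2", "MESH:20", "MESH"),
    ("X:9", "MESH:30", "MESH"),  # source is not a core class, so it is ignored
]


def collect_potential(core_classes, mappings):
    """Return {potential_class: {"sources": set, "ontologies": set}}."""
    potential = defaultdict(lambda: {"sources": set(), "ontologies": set()})
    for source, target, ontology in mappings:
        # Assumption: a target that is already a core class is not stored again.
        if source in core_classes and target not in core_classes:
            potential[target]["sources"].add(source)
            potential[target]["ontologies"].add(ontology)
    return dict(potential)


potential = collect_potential({"DO:1", "DO:2"}, MAPPINGS)
for cls, info in sorted(potential.items()):
    print(cls, sorted(info["sources"]), sorted(info["ontologies"]))
```

Keeping the mapping sources next to each potential class is what later makes it possible to drop a potential class when all core classes that led to it are removed.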

    5.1 Disease Information

    BioPortal contains ontologies from diverse fields of the biomedical domain. We are mainly interested in those that contain data about diseases and/or symptoms. For the disease information we select the Human-Disease Ontology (DO) and those ontologies that are filtered as the UMLS group on BioPortal. Since DO contains knowledge only about diseases, we simply consider each OWL class in DO a disease. In addition, 73334 classes in 138 ontologies can be reached from the classes of DO via mappings.


    To retrieve diseases from the UMLS group ontologies, we use the semantic type T047, which stands for "Disease or Syndrome", and the predicate hasSTY (Listing 5.1).

    Listing 5.1: Retrieval of disease classes from UMLS ontologies

    prefix owl: <http://www.w3.org/2002/07/owl#>
    select distinct ?s
    from <...>
    where {
        ?s a owl:Class ;
           <...> <...> .
    }
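A query of this shape can be assembled per ontology with a small helper. The sketch below is only illustrative: the function name `build_sty_query` is hypothetical, and the graph, predicate and semantic-type IRIs are passed in as parameters. The example.org values are placeholders, not the IRIs used in the thesis.

```python
def build_sty_query(graph_iri, has_sty_iri, semantic_type_iri):
    """Build a SPARQL query selecting OWL classes with a given semantic type.

    All three IRIs are supplied by the caller; none of them is taken from
    the thesis.
    """
    return (
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        "SELECT DISTINCT ?s\n"
        f"FROM <{graph_iri}>\n"
        "WHERE {\n"
        "    ?s a owl:Class ;\n"
        f"       <{has_sty_iri}> <{semantic_type_iri}> .\n"
        "}"
    )


# Placeholder IRIs under example.org, purely illustrative.
query = build_sty_query(
    "http://example.org/graph/ONTOLOGY",
    "http://example.org/umls/hasSTY",
    "http://example.org/umls/sty/T047",
)
print(query)
```

Swapping the last argument for the IRI of another semantic type would give the corresponding query for that type.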

    There are 31 ontologies filtered as the UMLS group on BioPortal, but only the ones listed in Table 5.1 contain classes with semantic type T047. Table 5.1 also shows the number of distinct classes in those ontologies that have semantic type T047, the number of classes that can be reached from these classes via mappings, and the number of ontologies those classes appear in.

    As discussed before, we define the classes with semantic type T047 and the ones in DO as core disease classes. The classes that we get via the mappings we call potential classes. The core disease classes of each UMLS group ontology are distinct, but there are many overlaps among the potential disease classes.

    Acronym      #core_classes  #connected_classes  #connected_ontologies
    SNOMEDCT     45511          102281              162
    MDR          24557          82203               139
    RCD          22914          92695               134
    ICD10CM      16047          62918               121
    OMIM         7766           35641               133
    ICD9CM       7632           58630               100
    MeSH         5808           70895               153
    ICD10        4913           40434               98
    NDFRT        2824           48279               120
    ICPC2P       2101           26501               127
    CRISP        1146           32276               162
    COSTART      946            33075               150
    WHO-ART      880            25961               103
    LOINC        678            3673                108
    MEDLINEPLUS  416            19988               125
    ICPC         220            3604                68
    AIR          189            2430                55

    Table 5.1: UMLS ontologies that contain classes with semantic type T047 ("Disease or Syndrome") ("core classes"), number of core classes, number of classes mapped from core classes, number of ontologies the mapped classes are located in.


    By applying this method to all ontologies in Table 5.1 we get 247683 classes in 219 ontologies. Of these classes, 123736 have at least one mapping to another class. These mappings come from different mapping sources, as mentioned in Section 3.1.3. In many cases the same mapping originates from several of the mapping sources.

    Once we have the disease information, we can create an overview of the ontologies that shows the connections (mappings) between them. We stress again that we do not claim to have all the disease related knowledge on BioPortal, but rather the data that resides in the neighborhood of the selected ontologies. An overview of the ontologies containing disease information and the connections between them is depicted in Figure 5.1, created with the graph visualization tool Gephi [30].

    Each node in the graph represents a different ontology on BioPortal. The size of a node is proportional to the number of disease classes (core and potential) found in that ontology. The colors, ranging from red to blue, represent the degree of the node (the number of mappings that include classes of the ontology) in the graph.

    5.2 Symptom Information

    The other kind of data we are interested in is the symptom information on BioPortal. We use the same method that was applied in Section 5.1 to fetch the required knowledge. Instead of DO, however, this time we use the Symptom Ontology (SYMP) as an ontology that contains only classes about symptoms. SYMP contains 936 distinct classes, and 7105 classes from 119 ontologies are targets of mappings from these classes. We also use the UMLS semantic type T184 ("Sign or Symptom") to obtain further symptoms from the UMLS group ontologies. The number of symptoms found in those ontologies is shown in Table 5.2.

    As in the case of diseases, we define the classes that we get by using the semantic type and the ones in SYMP as core symptoms. The rest we consider potential symptom classes. In total we find 34088 symptom classes in 161 ontologies, and 11882 of them are mapped to at least one other class. Figure visualizes the proportion of core symptom classes in each of the ontologies selected as a starting point for symptoms.

    The overview of the ontologies with symptom classes and the mappings between them is depicted in Figure 5.2.

    5.3 Data Overlap

    In the previous sections of this chapter we showed how we obtain disease and symptom information from BioPortal and visualized the relative portions of the classes for each UMLS group ontology. The core sets of classes are unique for each ontology in the context of disease or symptom information. Nonetheless, some of the disease and symptom classes overlap. The disease set contains 247683 distinct classes, of which 153223 appear in the core. For symptom information, the core contains 14971 of the 34088 classes in the set.

    Figure 5.1: Disease Ontologies Graph

    The overlap between the classes means that the notions of "Disease" and "Symptom" are not well separated. There are four distinguishable cases, labeled A, B, C and D in Figure 5.3.

    The occurrence of a class in the core of the disease set means that it is either from DO or has semantic type T047. If a class is found in the core of the symptom set, it is either from SYMP or has semantic type T184. The 471 classes that reside in the intersection of the two cores (A in Figure 5.3) come from the UMLS group ontologies and have both semantic types T047 and T184, meaning that they can be considered both diseases and symptoms. At this point we asked our medical domain expert to label each of those classes as either a disease or a symptom. Of the 471 classes, 189 were labeled as disease and 234 as symptom; for 48 classes our expert could not decide, and we therefore labeled those as both disease and symptom. Since these 471 classes appear in the core of both sets, we remove the ones labeled as disease by the expert from the core of the symptom set, and the ones labeled as symptom from the core of the disease set. We keep the 48 undecided classes in both sets, so they are considered a symptom and a disease at the same time.
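The treatment of case A amounts to two set differences. The following sketch uses hypothetical toy identifiers and a hypothetical function name; it only illustrates the rule stated above, under the assumption that the expert labels are available as a dictionary.

```python
def resolve_core_overlap(disease_core, symptom_core, expert_label):
    """Split classes found in both cores according to expert labels.

    expert_label maps a class to "disease", "symptom" or "both".
    Returns new (disease_core, symptom_core) sets.
    """
    overlap = disease_core & symptom_core
    labelled_symptom = {c for c in overlap if expert_label[c] == "symptom"}
    labelled_disease = {c for c in overlap if expert_label[c] == "disease"}
    # Classes labelled "both" are left untouched in both cores.
    return disease_core - labelled_symptom, symptom_core - labelled_disease


# Hypothetical toy identifiers.
disease_core = {"d1", "x1", "x2", "x3"}
symptom_core = {"s1", "x1", "x2", "x3"}
labels = {"x1": "disease", "x2": "symptom", "x3": "both"}

new_disease, new_symptom = resolve_core_overlap(disease_core, symptom_core, labels)
print(sorted(new_disease), sorted(new_symptom))
```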


    Figure 5.2: Symptom Ontologies Graph

    Figure 5.3: Data overlap between disease and symptom data


    Acronym      #core_classes  #connected_classes  #connected_ontologies
    SNOMEDCT     4482           11645               132
    MDR          3453           8449                92
    RCD          2142           9800                108
    ICPC2P       1080           5385                81
    ICD10CM      533            6200                71
    OMIM         358            4144                73
    ICD9CM       324            3742                72
    LOINC        288            1262                83
    WHO-ART      228            10147               90
    MeSH         218            7377                93
    ICD10        213            2759                61
    ICPC         196            1203                55
    COSTART      192            12027               103
    NDFRT        138            3592                76
    CRISP        79             3511                98
    AIR          64             426                 47
    MEDLINEPLUS  47             3093                89

    Table 5.2: UMLS ontologies that contain classes with semantic type T184 ("Sign or Symptom") ("core classes"), number of core classes, number of classes mapped from core classes, number of ontologies the mapped classes are located in.

    We stored the core classes together with the mappings in which they play the role of source. This means that the targets of those mappings were included among the potential classes of that set. Thus, when we remove classes from the core, we also have to remove the potential classes that were targets of mappings from the removed core classes. We have to make sure, however, that these potential classes were used only in mappings with the classes we are going to remove. If they also appear in mappings with other core classes, we keep them. Removing the 234 classes from the disease set leaves us with 244474 classes, and removing the 189 classes from the symptom set leaves us with 31202 classes.
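One way to implement this rule is to recompute the potential classes from the remaining core classes, so that a potential class survives exactly when at least one remaining core class still maps to it. The sketch below assumes the mappings are available as a dictionary from core class to mapping targets; the names and toy data are hypothetical.

```python
def remove_core_classes(core, targets_of, to_remove):
    """Remove core classes and the potential classes that only they reach.

    core       -- set of core classes
    targets_of -- dict: core class -> set of mapping targets (potential classes)
    to_remove  -- core classes to drop
    Returns (remaining_core, remaining_potential).
    """
    remaining_core = core - to_remove
    remaining_potential = set()
    for source in remaining_core:
        remaining_potential |= targets_of.get(source, set())
    return remaining_core, remaining_potential - remaining_core


core = {"c1", "c2"}
targets_of = {"c1": {"p1", "p2"}, "c2": {"p2", "p3"}}

remaining_core, remaining_potential = remove_core_classes(core, targets_of, {"c1"})
print(sorted(remaining_core), sorted(remaining_potential))
```

In the toy data, `p1` disappears together with `c1`, while `p2` is kept because `c2` also maps to it.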

    A class that appears in the core of one set and among the potential classes of the other can be removed from the set that contains it only as a potential class, since it is there merely because of the mappings. We assume that the predicate hasSTY is a stronger indication of the origin of a class than mappings. The removal of the core classes mentioned above also changes the number of potential classes. After that process, 4713 classes in the core of the disease set are also potential symptom classes (C in Figure 5.3), and 1918 classes in the core of the symptom set are also potential disease classes (B in Figure 5.3). Thus, we can remove 4713 classes from the symptom set and 1918 classes from the disease set.

    For the last case, where potential classes of the two sets overlap (D in Figure 5.3), we are left with 5139 classes. To disambiguate the type of these classes, we retrieve the classes from BioPortal that they are mapped to. We regard a class as a disease if it is mostly mapped to classes in the core of the disease set, and as a symptom otherwise. This way we consider 2847 classes a disease and 2292 classes a symptom. After deleting these classes from the respective other set, we have 240264 classes in the disease set and 23642 classes in the symptom set. Since the separation process does not change the sets drastically, we do not present new overviews of the connections between ontologies for the selected data.
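The reported totals are consistent with the following bookkeeping, which combines the numbers given for cases B, C and D. It is a plain arithmetic check of the figures in the text, not part of the original pipeline.

```python
# Set sizes after the expert-based clean-up of the core overlap (case A).
disease, symptom = 244474, 31202

# Cases B and C: core membership wins over membership via mappings.
disease -= 1918   # core symptom classes that were potential disease classes
symptom -= 4713   # core disease classes that were potential symptom classes

# Case D: 5139 shared potential classes, split by majority of mappings.
to_disease, to_symptom = 2847, 2292
assert to_disease + to_symptom == 5139
disease -= to_symptom   # classes judged symptoms leave the disease set
symptom -= to_disease   # classes judged diseases leave the symptom set

print(disease, symptom)  # 240264 23642
```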


    6 Disease-Symptom Relationships

    One of the important questions and motivations for us was the identification of disease-symptom relationships and their retrieval from the ontologies in BioPortal. In Chapter 5 we showed how we obtain classes and identify them as a "Disease" or as a "Symptom". Now we can use this information to find the connections between them. Without this knowledge, we would have to analyse all the data on BioPortal, which would take too much time.

    One could suggest finding the predicates that ontologies use to connect diseases to symptoms by analysing each predicate separately. However, more than 2600 distinct properties are used in BioPortal ontologies. Moreover, some predicate names consist of just a URI, which makes it difficult to decide whether a property is used to connect diseases and symptoms or not.

    Having the disease and symptom classes at hand, we look for direct connections between them in the ontologies. Here we distinguish two cases: the UMLS group ontologies, and the remaining ontologies that contain classes from both the disease and the symptom set. We search for disease-symptom relationships only between classes that occur in the same ontology. Although we have two different situations, we use the same method to retrieve the relations.

    We assume that if there is a structured disease-symptom relation, that is, one indicating that a certain disease class has a certain symptom class as a symptom, it is stored in a single triple as a direct connection. Therefore, we check the triples of the form < entity1 > < predicate > < entity2 >, where either < entity1 > is a disease class and < entity2 > is a symptom class, or the other way around.
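The check on the triple form can be sketched as a single pass over the triples of one ontology. The triples, class identifiers and the function name below are hypothetical; a class contained in both sets is counted in the disease-to-symptom direction first, which is a simplification.

```python
from collections import Counter


def direct_connections(triples, diseases, symptoms):
    """Count predicates of direct disease->symptom and symptom->disease triples."""
    disease_to_symptom, symptom_to_disease = Counter(), Counter()
    for subject, predicate, obj in triples:
        if subject in diseases and obj in symptoms:
            disease_to_symptom[predicate] += 1
        elif subject in symptoms and obj in diseases:
            symptom_to_disease[predicate] += 1
    return disease_to_symptom, symptom_to_disease


# Hypothetical toy triples.
triples = [
    ("d1", "has_manifestation", "s1"),
    ("s1", "manifestation_of", "d1"),
    ("d1", "SIB", "s2"),
    ("d1", "SIB", "d2"),          # disease-disease, ignored
    ("s2", "label", "cough"),     # literal object, ignored
]
d2s, s2d = direct_connections(triples, {"d1", "d2"}, {"s1", "s2"})
print(dict(d2s), dict(s2d))
```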


    6.1 UMLS group ontologies

    To find the disease-symptom connections in the UMLS group ontologies, we iterate over those ontologies and run the SPARQL query shown in Listing 6.1 for each of them.

    Listing 6.1: Retrieval of disease-symptom connections from UMLS ontologies

    SELECT distinct ?subject ?predicate ?object
    FROM <...>
    WHERE {
        ?subject <...> <...> .
        ?object <...> <...> .
        ?subject ?predicate ?object .
    }

    As described in Chapter 5, we removed part of the classes from the disease and symptom sets. The results of the query shown in Listing 6.1 still include the classes with both semantic types, so we filter the results by removing the rows that contain deleted classes. Table 6.1 shows the resulting predicates for the direct disease-symptom connections in the triples of the UMLS group ontologies and the number of diseases that were used as subject in those triples.

    To find symptom-disease connections, we simply swap the positions of ?subject and ?object in the third constraint of the SPARQL query in Listing 6.1 and again filter the results of the queries. The predicates for the direct symptom-disease connections in the triples of the UMLS group ontologies and the number of times symptoms were used as subject of such a triple are shown in Table 6.2.
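The filtering and counting behind such a predicate table can be sketched as below. The rows and the function name are hypothetical, and counting distinct subjects per predicate is an assumption about how the counts are defined.

```python
from collections import defaultdict


def subjects_per_predicate(rows, allowed_subjects, allowed_objects):
    """Count distinct subjects per predicate over filtered query results.

    rows are (subject, predicate, object) tuples. Rows whose subject or
    object is no longer in the respective final set are dropped.
    Returns (predicate, count) pairs sorted by descending count.
    """
    subjects = defaultdict(set)
    for subject, predicate, obj in rows:
        if subject in allowed_subjects and obj in allowed_objects:
            subjects[predicate].add(subject)
    counts = {p: len(s) for p, s in subjects.items()}
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical query results; "s9" stands for a class removed from the symptom set.
rows = [
    ("d1", "SIB", "s1"),
    ("d1", "SIB", "s2"),
    ("d2", "SIB", "s1"),
    ("d3", "has_manifestation", "s1"),
    ("d4", "has_manifestation", "s9"),
]
table = subjects_per_predicate(rows, {"d1", "d2", "d3", "d4"}, {"s1", "s2"})
print(table)
```

Calling the same function with the symptom set as `allowed_subjects` and the disease set as `allowed_objects` covers the opposite direction.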

    As Tables 6.1 and 6.2 show, the predicates ending in SIB are used most often in total. The abbreviation SIB means that the classes used as subject and object of a triple have a sibling relationship in a Metathesaurus source vocabulary. RN, RB and RO denote, in this order, a narrower relationship, a broader relationship, and a relationship other than synonymous, narrower or broader. Since we are looking for relations that indicate the co-occurrence of a symptom and a disease, these predicates are not the ones we need; they rather express structural relationships between classes. Although we have tried to separate the disease set from the symptom set, these connections imply that the notions of "disease" and "symptom" are not cleanly separated and overlap.

    Another frequently used relationship is the rdfs:subClassOf predicate, which expresses the hierarchical relationship between classes in an ontology. As with the predicates mentioned above, its use here reveals the imperfect separation of "disease" and "symptom" in those ontologies. More specifically, many classes in the UMLS group ontologies have the semantic type "Disease or Syndrome" but are subclasses of a class with the semantic type "Sign or Symptom", and vice versa. As shown in Chapter 5, there are also classes with both semantic types.


    Count  Predicate
    2474   http://purl.bioontology.org/ontology/MDR/SIB
    1463   http://www.w3.org/2000/01/rdf-schema#subClassOf
    1346   http://purl.bioontology.org/ontology/RCD/SIB
    1114   http://purl.bioontology.org/ontology/OMIM/has_manifestation
    737    http://purl.bioontology.org/ontology/WHO/SIB
    398    http://purl.bioontology.org/ontology/MSH/SIB
    369    http://purl.bioontology.org/ontology/MEDLINEPLUS/SIB
    330    http://purl.bioontology.org/ontology/ICD9CM/SIB
    324    http://purl.bioontology.org/ontology/ICD10CM/SIB
    268    http://purl.bioontology.org/ontology/MDR/classified_as
    149    http://purl.bioontology.org/ontology/CSP/SIB
    146    http://purl.bioontology.org/ontology/MSH/mapped_to
    127    http://purl.bioontology.org/ontology/MDR/classifies
    79     http://purl.bioontology.org/ontology/SNOMEDCT/may_be_a
    66     http://purl.bioontology.org/ontology/WHO/RN
    48     http://purl.bioontology.org/ontology/CSP/RN
    47     http://purl.bioontology.org/ontology/MEDLINEPLUS/related_to
    31     http://purl.bioontology.org/ontology/WHO/RB
    28     http://purl.bioontology.org/ontology/MSH/RO
    21     http://purl.bioontology.org/ontology/CSP/RB
    18     http://purl.bioontology.org/ontology/CSP/RO
    15     http://purl.bioontology.org/ontology/SNOMEDCT/associated_morphology_of
    9      http://purl.bioontology.org/ontology/SNOMEDCT/same_as
    8      http://purl.bioontology.org/ontology/SNOMEDCT/is_alternative_use
    4      http://purl.bioontology.org/ontology/SNOMEDCT/replaces
    3      http://purl.bioontology.org/ontology/SNOMEDCT/cause_of
    3      http://purl.bioontology.org/ontology/SNOMEDCT/interprets
    2      http://purl.bioontology.org/ontology/SNOMEDCT/replaced_by
    2      http://purl.bioontology.org/ontology/ICPC2P/replaced_by
    2      http://purl.bioontology.org/ontology/ICPC2P/replaces
    1      http://purl.bioontology.org/ontology/SNOMEDCT/occurs_after
    1      http://purl.bioontology.org/ontology/SNOMEDCT/associated_finding_of
    1      http://purl.bioontology.org/ontology/SNOMEDCT/occurs_before

    Table 6.1: Predicates in triples of the form < disease > < predicate > < symptom > and number of diseases used as subject of the triple in UMLS group ontologies


    Count  Predicate
    1439   http://www.w3.org/2000/01/rdf-schema#subClassOf
    662    http://purl.bioontology.org/ontology/MDR/SIB
    615    http://purl.bioontology.org/ontology/RCD/SIB
    345    http://purl.bioontology.org/ontology/OMIM/manifestation_of
    277    http://purl.bioontology.org/ontology/MDR/classified_as
    192    http://purl.bioontology.org/ontology/WHO/SIB
    177    http://purl.bioontology.org/ontology/ICD10CM/SIB
    172    http://purl.bioontology.org/ontology/ICD9CM/SIB
    107    http://purl.bioontology.org/ontology/MSH/SIB
    94     http://purl.bioontology.org/ontology/MDR/classifies
    83     http://purl.bioontology.org/ontology/SNOMEDCT/may_be_a
    53     http://purl.bioontology.org/ontology/SNOMEDCT/inverse_may_be_a
    37     http://purl.bioontology.org/ontology/CSP/SIB
    35     http://purl.bioontology.org/ontology/MEDLINEPLUS/SIB
    33     http://purl.bioontology.org/ontology/MSH/mapped_from
    30     http://purl.bioontology.org/ontology/WHO/RN
    28     http://purl.bioontology.org/ontology/SNOMEDCT/has_associated_morphology
    22     http://purl.bioontology.org/ontology/CSP/RN
    19     http://purl.bioontology.org/ontology/SNOMEDCT/definitional_manifestation_of
    18     http://purl.bioontology.org/ontology/MSH/RO
    16     http://purl.bioontology.org/ontology/SNOMEDCT/associated_with
    15     http://purl.bioontology.org/ontology/SNOMEDCT/was_a
    15     http://purl.bioontology.org/ontology/SNOMEDCT/associated_morphology_of
    10     http://purl.bioontology.org/ontology/MEDLINEPLUS/related_to
    9      http://purl.bioontology.org/ontology/SNOMEDCT/same_as
    9      http://purl.bioontology.org/ontology/CSP/RB
    9      http://purl.bioontology.org/ontology/SNOMEDCT/inverse_was_a
    8      http://purl.bioontology.org/ontology/CSP/RO
    8      http://purl.bioontology.org/ontology/SNOMEDCT/alternatively_used_for
    7      http://purl.bioontology.org/ontology/WHO/RB
    4      http://purl.bioontology.org/ontology/SNOMEDCT/replaced_by
    4      http://purl.bioontology.org/ontology/SNOMEDCT/due_to
    2      http://purl.bioontology.org/ontology/SNOMEDCT/replaces
    2      http://purl.bioontology.org/ontology/SNOMEDCT/is_interpreted_by
    2      http://purl.bioontology.org/ontology/ICPC2P/replaced_by
    2      http://purl.bioontology.org/ontology/ICPC2P/replaces
    1      http://purl.bioontology.org/ontology/SNOMEDCT/cause_of
    1      http://purl.bioontology.org/ontology/SNOMEDCT/refers_to
    1      http://purl.bioontology.org/ontology/SNOMEDCT/occurs_after
    1      http://purl.bioontology.org/ontology/SNOMEDCT/has_associated_finding
    1      http://purl.bioontology.org/ontology/SNOMEDCT/has_definitional_manifestation
    1      http://purl.bioontology.org/ontology/SNOMEDCT/associated_finding_of

    Table 6.2: Predicates in triples of the form < symptom > < predicate > < disease > and number of usages of symptoms as a subject of such a triple in UMLS group ontologies


    In some ontologies, the subclass relationships are even more complex. Figure 6.1 shows these connections between classes having semantic type T047, T184, or both in the Medical Subject Headings (MeSH) ontology. Green circles represent the classes with semantic type "Sign or Symptom", red circles the classes with semantic type "Disease or Syndrome", and blue circles the classes that have both semantic types.

    Moreover, there are predicates such as classified_as, mapped_to and classifies that are not specific to disease-symptom relations but rather express structural knowledge.

    Among the predicates listed in Tables 6.1 and 6.2, we consider only has_manifestation relevant for disease-symptom relationships, and manifestation_of for symptom-disease relationships. These two properties are inverses of each other and thus connect the same classes. We include these relationships as hasSymptom connections in our data model.
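Because the two predicates are inverses, both directions can be folded into one set of (disease, symptom) pairs. The sketch below uses the two OMIM predicate URIs from Tables 6.1 and 6.2; the function name and the toy subject and object identifiers are hypothetical.

```python
HAS_MANIFESTATION = "http://purl.bioontology.org/ontology/OMIM/has_manifestation"
MANIFESTATION_OF = "http://purl.bioontology.org/ontology/OMIM/manifestation_of"


def has_symptom_pairs(triples):
    """Return the set of (disease, symptom) pairs implied by both predicates."""
    pairs = set()
    for subject, predicate, obj in triples:
        if predicate == HAS_MANIFESTATION:
            pairs.add((subject, obj))      # disease has_manifestation symptom
        elif predicate == MANIFESTATION_OF:
            pairs.add((obj, subject))      # symptom manifestation_of disease
    return pairs


# Hypothetical toy triples: the first two state the same fact in both directions.
triples = [
    ("omim:D1", HAS_MANIFESTATION, "omim:S1"),
    ("omim:S1", MANIFESTATION_OF, "omim:D1"),
    ("omim:S2", MANIFESTATION_OF, "omim:D1"),
]
pairs = has_symptom_pairs(triples)
print(sorted(pairs))
```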

    6.2 Non-UMLS ontologies

    We repeat the same process for the non-UMLS ontologies. In this case, since we do not have semantic types, we simply iterate over the ontologies and, for each ontology, look at the triples in which symptom classes are used as an object. Table 6.3 shows the predicates used in triples where a potential disease class is the subject and a potential symptom class is the object, together with the number of distinct disease classes used with these predicates.

    Figure 6.1: Subclass relationships between disease-symptom data in Medical Subject Headings, (a) before and (b) after separating the disease set from the symptom set.


    Count  Predicate
    994    http://www.w3.org/2000/01/rdf-schema#subClassOf
    96     http://bioontology.org/projects/ontologies/radlex/radlexOwlDlComponent#Is_A
    36     http://www.w3.org/2002/07/owl#disjointWith
    21     http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#A11
    15     http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#A10
    9      http://www.medicine.mcgill.ca/epidemiology/buckeridge/syndromes.owl#isRelatedTo
    1      http://bioontology.org/projects/ontologies/radlex/radlexOwlDlComponent#Has_Subtype
    1      http://www.medicine.mcgill.ca/epidemiology/buckeridge/syndromes.owl#hasRelatedConcepts
    1      http://www.medicine.mcgill.ca/epidemiology/buckeridge/syndromes.owl#isSynonymousTo
    1      http://bioontology.org/projects/ontologies/radlex/radlexOwlDlComponent#Has_Part

    Table 6.3: Predicates in triples of the form < disease > < predicate > < symptom > and number of usages of diseases as a subject of such a triple in non-UMLS ontologies

    Table 6.4 lists the predicates used in triples where a potential symptom class is the subject and a potential disease class is the object, together with the number of distinct symptom classes used with these predicates.

    As Tables 6.3 and 6.4 show, the predicate most frequently used to connect potential disease and symptom classes is rdfs:subClassOf, which, as discussed before, we do not consider a disease-symptom specific relationship. The other properties, such as is_a and disjointWith, also express structural information between classes, and thus we disregard all these properties.

    6.3 Data Model

    With all the data we have retrieved, we can create a model that subsumes disease and symptom information as well as the information about their co-occurrence. Although our method did not find much knowledge of the latter kind on BioPortal, we will try to group diseases and symptoms together to get more information connected. We describe this in more detail in Chapter 7.

    Count  Predicate
    49     http://www.w3.org/2000/01/rdf-schema#subClassOf
    12     http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#A11
    4      http://www.w3.org/2002/07/owl#disjointWith
    3      http://bioontology.org/projects/ontologies/radlex/radlexOwlDlComponent#Is_A
    2      http://www.medicine.mcgill.ca/epidemiology/buckeridge/syndromes.owl#isRelatedTo
    1      http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#A10
    1      http://www.medicine.mcgill.ca/epidemiology/buckeridge/syndromes.owl#hasRelatedConcepts
    1      http://www.medicine.mcgill.ca/epidemiology/buckeridge/syndromes.owl#hasSynonymousConcepts

    Table 6.4: Predicates in triples of the form < symptom > < predicate > < disease > and number of usages of symptoms as a subject of such a triple in non-UMLS ontologies


    First of all, we store all disease and symptom URIs as classes and indicate the type of each class with the predicate hasType. We use the predicate from for each class to show in which ontologies it occurs. A class URI might occur in one or in many different ontologies. To represent the mappings between classes, we use the mapping sources as predicates. We also define these predicates as subproperties of the predicate isMappedTo, which makes it possible to query the mappings without discriminating between their sources. For each class we also store the preferred labels as strings using the predicate skos:prefLabel. A class might have one or more preferred labels.
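The effect of the isMappedTo superproperty can be illustrated without a triple store. In the sketch below the prefixed names, the toy triples and the helper `objects` are hypothetical; only one level of subproperties is resolved, which is enough for the model described here.

```python
# Hypothetical prefixed names; only isMappedTo, hasType and the two mapping
# sources Loom and UMLS_CUI are named in the text.
SUBPROPERTY_OF = {
    "mapping:loom": "disy:isMappedTo",
    "mapping:umls_cui": "disy:isMappedTo",
}

TRIPLES = [
    ("ex:A", "mapping:loom", "ex:B"),
    ("ex:A", "mapping:umls_cui", "ex:C"),
    ("ex:A", "disy:hasType", "disy:disease"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Objects of `predicate` for `subject`, including its direct subproperties."""
    matching = {predicate} | {
        p for p, parent in SUBPROPERTY_OF.items() if parent == predicate
    }
    return {o for s, p, o in triples if s == subject and p in matching}


print(sorted(objects("ex:A", "disy:isMappedTo")))  # all mappings, any source
print(sorted(objects("ex:A", "mapping:loom")))     # one mapping source only
```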

    Together with the properties listed above, we also put the subclass information into our data model. We store the subclass relationships among disease classes, among symptom classes, and between disease and symptom classes. The last piece of data that we represent in our data model is the symptoms of diseases, for which we use the predicate hasSymptom. Listing 6.2 shows the triples for one class from the OMIM ontology.

    Listing 6.2: View from Disease-Symptom Ontology

    <...> a owl:Class ;
        mapping:umls_cui <...> , <...> ;
        disy:hasSymptom <...> ;
        disy:hasType disy:disease ;
        disy:isLocatedIn bioontology:OMIM ;
        skos:prefLabel "ichthyosiform erythroderma, corneal involvement, and deafness"^^xsd:string .


    7 Disease and Symptom Graphs

    In the previous chapters we have shown how we search for disease and symptom related data on BioPortal, and how we search for the symptoms of diseases. As mentioned in Chapter 5, there was some overlap between the disease and symptom data in the first step. After separating these data, we are left with 240264 disease and 23642 symptom classes. However, this does not imply that all of the 240264 classes describe distinct diseases or that the 23642 classes describe completely different symptoms. Many of the classes might actually be about the same disease or symptom but have synonymous labels. In fact, many classes even have exactly the same label and differ only in their URIs. Thus, in this chapter we try to group the classes for disease and symptom data. This also connects more disease and symptom classes with each other.

    7.1 Disease Graph

    In Chapter 5 we described how we use BioPortal mappings to get more data. Although there are 6 mapping sources, only 2 of them, Loom and UMLS_CUI, are mainly used with disease classes. If we consider these classes vertices and the mappings between them edges, we obtain a disease graph.

    7.1.1 Default approach

    Considering the disease data as a graph, one very natural way of grouping the classes is to cluster them by the connected components of the graph. A connected component, or simply component, of an undirected graph is a maximal subgraph in which any two vertices are connected to each other by a path. By using different mapping sources, or the combination of all of them, we can create different components and treat the class URIs in a component as classes that represent the same disease.
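Connected components can be found with a breadth-first search from every vertex that has not been visited yet. The sketch below is a generic textbook implementation on toy data, with class URIs replaced by short hypothetical identifiers; it is not the code used for the thesis.

```python
from collections import defaultdict, deque


def connected_components(vertices, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            current = queue.popleft()
            for nxt in neighbours[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


# Hypothetical toy graph: mappings a-b, b-c and d-e; f has no mapping.
components = connected_components(
    ["a", "b", "c", "d", "e", "f"],
    [("a", "b"), ("b", "c"), ("d", "e")],
)
print(sorted(len(c) for c in components))
```

Restricting `edges` to the mappings of a single source gives the per-source clusterings.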


    Figure 7.1: Histogram of connected component sizes for different mapping sources in the disease graph. Panels: (a) All, (b) Loom, (c) UMLS_CUI.

    Figure 7.1 shows histograms of the component sizes for different clusterings of the disease graph. For a clearer picture, we plot the logarithms of the sizes and counts. In Figure 7.1(a) all mappings were used as edges to create the graph. In this case we get 55685 clusters, and the biggest cluster contains 70062 disease classes. In Figure 7.1(b) only Loom mappings were used to create the graph, and we get 77427 clusters, with the biggest cluster containing 33877 disease classes. Figure 7.1(c) shows the results for the case where only UMLS_CUI mappings are used as edges. Here we have 135626 clusters, and the biggest cluster contains 33708 disease classes.

    Obviously, this is not the best way of clustering disease classes. We want to group classes about the same disease or syndrome in the same cluster, and it is highly improbable that 70062, or even 30000, distinct classes describe the same disease. In many cases classes of specific diseases have mappings to more general diseases and vice versa. In particular, many classes from the hierarchy of the same ontology have mappings among themselves.


    7.1.2 Adapted approach

    In the adapted approach to clustering, we try to avoid subclass relationships within a cluster. Therefore, while finding the connected components of the graph, we do not add a class to a component if the component already contains a subclass or a superclass of this class. Moreover, we try not to put two classes from the same ontology into the same cluster, even if they are not in a subclass-superclass relationship with each other. Here we assume that no disease or syndrome is represented by more than one class in any ontology.
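One possible reading of the adapted approach is a greedy breadth-first search that refuses to admit a class when one of the two rules is violated. The text does not fix the traversal order or what happens to rejected classes, so the sketch below makes assumptions: vertices are visited in the given order, a rejected class stays unassigned and may later seed its own component, and `superclasses` is a hypothetical dictionary from a class to its transitive superclasses.

```python
from collections import defaultdict, deque


def constrained_components(vertices, edges, ontology_of, superclasses):
    """Greedy variant of connected components with two admission rules.

    A class joins a component only if the component holds no class from the
    same ontology and no subclass or superclass of it.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    def conflicts(candidate, component):
        for member in component:
            if ontology_of[member] == ontology_of[candidate]:
                return True
            if member in superclasses.get(candidate, set()):
                return True
            if candidate in superclasses.get(member, set()):
                return True
        return False

    assigned, components = set(), []
    for start in vertices:
        if start in assigned:
            continue
        component, queue = {start}, deque([start])
        assigned.add(start)
        while queue:
            current = queue.popleft()
            for nxt in sorted(neighbours[current]):
                if nxt in assigned or conflicts(nxt, component):
                    continue
                assigned.add(nxt)
                component.add(nxt)
                queue.append(nxt)
        components.append(component)
    return components


# Hypothetical toy data: A:2 is a subclass of A:1, both from ontology A.
ontology_of = {"A:1": "A", "A:2": "A", "B:1": "B", "C:1": "C"}
edges = [("A:1", "B:1"), ("B:1", "A:2"), ("A:2", "C:1")]
clusters = constrained_components(
    ["A:1", "A:2", "B:1", "C:1"], edges, ontology_of, {"A:2": {"A:1"}}
)
print(clusters)
```

Without the admission rules all four toy classes would form a single component; with them the result depends on the visiting order, which is why the order is made explicit here.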

    Figure 7.2 shows the results for the adapted approach. Here we again plot the logarithms of the sizes and counts for a clearer picture. As the histograms show, the results differ drastically from those of the approach described in Section 7.1.1. If we use all mappings to create the graph (Figure 7.2(a)), we get 102990 clusters, the biggest of size 64. In Figure 7.2(b) only Loom mappings were used to create the graph. In this case we get 113165 clusters, the biggest of size 53. In the case of UMLS_CUI mappings (Figure 7.2(c)) the number of clusters is 167970, and the biggest cluster has size 20.

    The difference lies not only in the biggest clusters, but also in the number of clusters with sizes between 60 and 700. In Figure 7.1 there are many clusters with sizes in that range. Since the histograms depict the logarithms of sizes and counts, the actual numbers are much larger than they appear in the picture: a difference of one unit in the figures corresponds to a factor of two in size or count.
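A log-log histogram of this kind can be computed as below. The base of the logarithm and the binning are not stated explicitly; base 2 is assumed here because one unit is said to correspond to a factor of two, and the sizes are binned by the integer part of their logarithm. The function name and the toy sizes are hypothetical.

```python
import math
from collections import Counter


def log2_histogram(component_sizes):
    """Map floor(log2(size)) -> log2(number of components in that bin)."""
    bins = Counter(int(math.log2(size)) for size in component_sizes)
    return {b: math.log2(count) for b, count in sorted(bins.items())}


# Hypothetical component sizes: four singletons, sizes 2 and 3, sizes 64 and 70.
histogram = log2_histogram([1, 1, 1, 1, 2, 3, 64, 70])
print(histogram)  # {0: 2.0, 1: 1.0, 6: 1.0}
```

A bar of height 2.0 therefore stands for four components, and a bar of height 1.0 for two.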

    Although the second approach reduces the sizes of the clusters, it creates many very small clusters. When we use all mappings to create the graph, the number of clusters with only one class almost doubles with the second way of clustering. With Loom mappings we again get more clusters of size 1, but in the case of UMLS_CUI mappings the number of clusters with one class does not change as much as in the previous two cases.

    In summary, the default approach creates bigger clusters, whereas the adapted approach shrinks the clusters and yields more clusters that contain a single class.

    7.2 Symptom Graph

    We can repeat the procedure in Section 7.1 for the symptom classes as well.

    7.2.1 Default approach

    As in the case of disease classes, we can consider symptom classes vertices and BioPortal mappings edges, create a graph from them, and group the classes in each connected component of the graph into one cluster. Figure 7.3 depicts the histograms of cluster sizes for three different cases: all mappings (Figure 7.3(a)), only Loom mappings (Figure 7.3(b)) and only UMLS_CUI mappings (Figure 7.3(c)) used as the edges of the graph. Again, the histograms show the logarithms of sizes and frequencies for a clearer picture.


    Figure 7.2: Histogram of connected component sizes for different mapping sources in the disease graph, without including classes from the same ontology and from the same hierarchy. Panels: (a) All, (b) Loom, (c) UMLS_CUI.

    With all mappings we get 5028 clusters, the largest containing 5959 symptom classes. If we use only Loom mappings, we get 8685 clusters and the size of the biggest cluster decreases to 2755. With only UMLS_CUI mappings used as edges, we have 10744 connected components in the symptom graph and the largest component contains 3891 classes.

    Figure 7.3: Histogram of connected component sizes for different mapping sources in symptom graph. Panels: (a) All, (b) Loom, (c) UMLS_CUI.

    7.2.2 Adapted approach

    With the default approach to grouping the classes, we again get quite big clusters in the symptom graph. It is not only the single cluster containing too many classes that makes us reconsider the way of grouping them, but also the middle-sized connected components (60-400 classes per cluster), which together contain around 5000 classes. We repeat the adapted approach that we applied to the disease graph. In other words, when we find the connected components of the graph, we do not add a class to a cluster if the cluster already contains subclasses or superclasses of it. Likewise, if there is another class from the same ontology in the component, we avoid adding the class to that component.
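    The two exclusion rules can be sketched as a greedy breadth-first traversal. This is only one possible reading of the adapted approach: the traversal order, the decision that a rejected class stays unassigned and may later start its own cluster, and the names `ontology_of` and `related` are assumptions for illustration, not the thesis implementation.

```python
from collections import deque


def adapted_clusters(classes, mappings, ontology_of, related):
    """Greedy clustering over the mapping graph with two exclusion rules.

    ontology_of: dict mapping each class to the ontology it belongs to.
    related: set of frozenset({a, b}) pairs in a subclass/superclass relation.
    """
    neighbours = {c: [] for c in classes}
    for a, b in mappings:
        if a in neighbours and b in neighbours:
            neighbours[a].append(b)
            neighbours[b].append(a)

    assigned = set()
    clusters = []
    for seed in classes:
        if seed in assigned:
            continue
        cluster = [seed]
        assigned.add(seed)
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            for candidate in neighbours[current]:
                if candidate in assigned:
                    continue
                # Rule 1: the cluster already holds a class from the same ontology.
                same_ontology = any(ontology_of[m] == ontology_of[candidate]
                                    for m in cluster)
                # Rule 2: the cluster already holds a sub- or superclass.
                same_hierarchy = any(frozenset((m, candidate)) in related
                                     for m in cluster)
                if same_ontology or same_hierarchy:
                    continue  # rejected; stays unassigned for a later cluster
                cluster.append(candidate)
                assigned.add(candidate)
                queue.append(candidate)
        clusters.append(cluster)
    return clusters


# Hypothetical toy data: ontology A holds a class and one of its subclasses.
classes = ["A:pain", "A:chest_pain", "B:pain", "C:pain"]
ontology_of = {"A:pain": "A", "A:chest_pain": "A", "B:pain": "B", "C:pain": "C"}
mappings = [("A:pain", "B:pain"), ("B:pain", "A:chest_pain"), ("B:pain", "C:pain")]
related = {frozenset(("A:pain", "A:chest_pain"))}

clusters = adapted_clusters(classes, mappings, ontology_of, related)
print(clusters)
```

    On this toy graph the default approach would return a single component of four classes, while the sketch splits off `A:chest_pain` into a cluster of its own, which mirrors the growth in single-class clusters reported for the adapted approach.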

    The results are depicted in Figure 7.4(a), (b) and (c), corresponding to the graphs containing all mappings, only Loom mappings and only UMLS_CUI mappings as edges. In the first case the graph has 11530 connected components and the biggest one contains 57 classes. With lexical (Loom) mappings, the symptom classes fall into 13000 clusters, the biggest of size 53. The symptom graph built with only UMLS_CUI mappings as edges has 16416 clusters, and the biggest cluster contains just 18 classes.


    Figure 7.4: Histogram of connected component sizes for different mapping sources in symptom graph, without including classes from the same ontology and from the same hierarchy. Panels: (a) All, (b) Loom, (c) UMLS_CUI.


  • 8 Summary


  • Bibliography

    [1] Whetzel PL, Noy NF, Shah NH, Alexander PR, Nyulas C, Tudorache T, Musen MA. BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Res., 39(Web Server Issue):W541-W545, Jul 2011. (cited on page 1)

    [2] B. Humphreys, D. Lindberg, and A. McCray. The unified medical language system. Methods of Information in Medicine, 32(4):281-291, 1993. (cited on page 2)

    [3] International Health Terminology Standards Development Organisation. SNOMED CT, 2014. URL http://www.ihtsdo.org/snomed-ct/. (cited on page 2)

    [4] Reinhard Fescharek, Jürgen Kübler, Ulrich Elsasser, Monika Frank, and Petra Güthlein. Medical dictionary for regulatory activities (MedDRA). International Journal of Pharmaceutical Medicine, 18(5):259-269, 2004. ISSN 1364-9027. doi: 10.2165/00124363-200418050-00001. URL http://dx.doi.org/10.2165/00124363-200418050-00001. (cited on page 2)

    [5] Schriml L, Arze C, Nadendla S, Chang Y, Mazaitis M, Felix V, Feng G, Kibbe W. Disease ontology: a backbone for disease semantic integration. Nucleic Acids Research, 40(D1):D940-D946, 2012. (cited on page 2)

    [6] Carolyn E. Lipscomb. Medical subject headings (MeSH). Bull Med Libr Assoc., 88(3):265-266, 2000. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=35238. (cited on page 3)

    [7] Ségolène Aymé, Ana Rath, and Bertrand Bellet. WHO International Classification of Diseases (ICD) revision process: incorporating rare diseases into the classification scheme: state of art. Orphanet Journal of Rare Diseases, 5(Suppl 1):P1, 2010. (cited on page 3)

    [8] Sebastian Köhler, Sandra C. Doelken, Christopher J. Mungall, Sebastian Bauer, Helen V. Firth, Isabelle Bailleul-Forestier, Graeme C. M. Black, Danielle L. Brown, Michael Brudno, Jennifer Campbell, David R. FitzPatrick, Janan T. Eppig, Andrew P. Jackson, Kathleen Freson, Marta Gîrdea, Ingo Helbig, Jane A. Hurst, Johanna Jähn, Laird G. Jackson, Anne M. Kelly, David H. Ledbetter, Sahar Mansour, Christa L. Martin, Celia Moss, Andrew Mumford, Willem Ouwehand, Soo-Mi Park, Erin Rooney Riggs, Richard H. Scott, Sanjay Sisodiya, Steven Van Vooren, Ronald J. Wapner, Andrew O. M. Wilkie, Caroline F. Wright, Anneke T. Vulto-van Silfhout, Nicole de Leeuw, Bert B. A. de Vries, Nicole L. Washington, Cynthia L. Smith, Monte Westerfield, Paul N. Schofield, Barbara J. Ruef, Georgios V. Gkoutos, Melissa Haendel, Damian Smedley, Suzanna E. Lewis, and Peter N. Robinson. The human phenotype ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Research, 42(Database-Issue):966-974, 2014. doi: 10.1093/nar/gkt1026. URL http://dx.doi.org/10.1093/nar/gkt1026. (cited on page 3)

    [9] Lynn M. Schriml, Cesar Arze, Suvarna Nadendla, Anurhada Ganapathy, Victor Felix, Anup Mahurkar, Katherine H. Phillippy, Aaron Gussman, Samuel V. Angiuoli, Elodie Ghedin, Owen White, and Neil Hall. Gemina, genomic metadata for infectious agents, a geospatial surveillance pathogen database. Nucleic Acids Research, 38(Database-Issue):754-764, 2010. doi: 10.1093/nar/gkp832. URL http://dx.doi.org/10.1093/nar/gkp832. (cited on pages 3 and 16)

    [10] Maja Hadzic and Elizabeth Chang. Ontology-based multi-agent systems support human disease study and control. In Hans Czap, Rainer Unland, Cherif Branki, and Huaglory Tianfield, editors, SOAS, volume 135 of Frontiers in Artificial Intelligence and Applications, pages 129-141. IOS Press, 2005. ISBN 1-58603-577-0. URL http://dblp.uni-trier.de/db/conf/soas/soas2005.html#HadzicC05. (cited on page 3)

    [11] Mythili Thirugnanam, Mangayarkarasi Ramaiah, V. Pattabiraman, and R. Sivakumar. Ontology based disease information system. Procedia Engineering, 38(0):3235-3241, 2012. ISSN 1877-7058. doi: http://dx.doi.org/10.1016/j.proeng.2012.06.375. URL http://www.sciencedirect.com/science/article/pii/S1877705812022886. (cited on page 3)

    [12] O. Mohammed, R. Benlamri, and S. Fong. Building a diseases symptoms ontology for medical diagnosis: An integrative approach. In Future Generation Communication Technology (FGCT), 2012 International Conference on, pages 104-108, Dec 2012. doi: 10.1109/FGCT.2012.6476567. (cited on page 3)

    [13] Heiner Oberkampf, Sonja Zillner, and Bernhard Bauer. Interpreting patient data using medical background knowledge. In Ronald Cornet and Robert Stevens, editors, ICBO, volume 897 of CEUR Workshop Proceedings. CEUR-WS.org, 2012. URL http://dblp.uni-trier.de/db/conf/icbo/icbo2012.html#OberkampfZB12. (cited on page 3)

    [14] Fabian M. Suchanek, Serge Abiteboul, and Pierre Senellart. Ontology alignment at the instance and schema level. CoRR, abs/1105.5516, 2011. URL http://arxiv.org/abs/1105.5516. (cited on page 3)

    [15] John Hebeler, Matthew Fisher, Ryan Blace, Andrew Perez-Lopez, and Mike Dean. Semantic web programming. 2009. (cited on page 7)

    [16] Resource Description Framework. URL http://www.w3.org/TR/rdf11-concepts/. (cited on page 7)


    [17] Web Ontology Language. URL http://www.w3.org/TR/owl-guide/. (cited on page 8)

    [18] Web Ontology Language 2. URL http://www.w3.org/TR/owl2-overview/. (cited on page 8)

    [19] Turtle. URL http://www.w3.org/TR/turtle/. (cited on page 9)

    [20] N-triples. URL http://www.w3.org/TR/n-triples/. (cited on page 9)

    [21] RDF-XML. URL http://www.w3.org/TR/rdf-syntax-grammar/. (cited on page 9)

    [22] SPARQL endpoint. URL http://semanticweb.org/wiki/SPARQL_endpoint. (cited on page 9)

    [23] NCBO team Musen MA, Noy NF, Shah NH, Whetzel PL, Chute CG, Story MA, Smith B. The National Center for Biomedical Ontology. J Am Med Inform Assoc., 19(2):190-195, 2012 Mar-Apr. (cited on page 11)

    [24] Protein Ontology. URL http://bioportal.bioontology.org/ontologies/PR. (cited on page 11)

    [25] The OBO Ontology Editor. URL http://www.oboedit.org. (cited on page 11)

    [26] Manuel Salvadores, Matthew Horridge, Paul R. Alexander, Ray W. Fergerson, Mark A. Musen, and Natalya F. Noy. Using SPARQL to Query BioPortal Ontologies and Metadata. In Philippe Cudré-Mauroux, Jeff Heflin, Evren Sirin, Tania Tudorache, Jérôme Euzenat, Manfred Hauswirth, Josiane Xavier Parreira, Jim Hendler, Guus Schreiber, Abraham Bernstein, and Eva Blomqvist, editors, The Semantic Web - ISWC 2012, Lecture Notes in Computer Science, pages 180-195. Springer Berlin Heidelberg, 2012. doi: 10.1007/978-3-642-35173-0_12. URL http://dx.doi.org/10.1007/978-3-642-35173-0_12. (cited on page 12)

    [27] Salvadores M, Alexander PR, Musen MA, and Noy NF. BioPortal as a dataset of linked biomedical ontologies and terminologies in RDF. SWJ, 2012. (cited on pages 12, 13, and 47)

    [28] Natalya F. Noy, Michael Dorf, Nicholas Griffith, Csongor Nyulas, and Mark A. Musen. Harnessing the power of the community in a library of biomedical ontologies. In Proceedings of the Workshop on Semantic Web Applications in Scientific Discourse. Eighth International Semantic Web Conference, 2009. (cited on page 12)

    [29] Amir Ghazvinian. Creating mappings for ontologies in biomedicine: simple methods work. AMIA Annu Symp Proc 2009, pages 198-202, Nov 2009. (cited on page 13)

    [30] Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy. Gephi: An open source software for exploring and manipulating networks, 2009. URL http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/154. (cited on page 21)

    [31] Human-Disease Ontology. URL http://bioportal.bioontology.org/ontologies/DOID.


    [32] Symptom Ontology. URL http://bioportal.bioontology.org/ontologies/SYMP.

    [33] Nicholas Sioutos, Sherri de Coronado, Margaret W. Haber, Frank W. Hartel, Wen-Ling Shaiu, and Lawrence W. Wright. NCI Thesaurus: A semantic model integrating cancer-related clinical and molecular information. J. of Biomedical Informatics, 40(1):30-43, Feb 2007. ISSN 1532-0464. doi: 10.1016/j.jbi.2006.02.013. URL http://dx.doi.org/10.1016/j.jbi.2006.02.013.


  • List of Figures

    2.1 Example data from DO

    3.1 Metadata: Virtual Ontologies and Version Ontologies. [27]

    4.1 http://www.ncbi.nlm.nih.gov/books/NBK9679/figure/ch05.F3/?report=objectonly, A Portion of the UMLS Semantic Network: Relations

    5.1 Disease Ontologies Graph
    5.2 Symptom Ontologies Graph
    5.3 Data overlap between disease and symptom data

    6.1 Subclass relationships between disease-symptom data in Medical Subject Headings, before and after separating disease set from symptom set.

    7.1 Histogram of connected component sizes for different mapping sources in disease graph.
    7.2 Histogram of connected component sizes for different mapping sources in disease graph, without including classes from the same ontology and from the same hierarchy.
    7.3 Histogram of connected component sizes for different mapping sources in symptom graph.
    7.4 Histogram of connected component sizes for different mapping sources in symptom graph, without including classes from the same ontology and from the same hierarchy.


  • List of Tables

    5.1 UMLS ontologies that contain classes with semantic type T047 ("Disease or Syndrome") ("core classes"), number of core classes, number of classes mapped from core classes, number of ontologies the mapped classes are located in.
    5.2 UMLS ontologies that contain classes with semantic type T184 ("Sign or Symptom") ("core classes"), number of core classes, number of classes mapped from core classes, number of ontologies the mapped classes are located in.

    6.1 Predicates in triples of the form < disease > < predicate > and number of diseases used as subject of the triple in UMLS group ontologies
    6.2 Predicates in triples of the form < symptom > < predicate > and number of usages of symptoms as a subject of such triple in UMLS group ontologies
    6.3 Predicates in triples of the form < disease > < predicate > and number of usages of diseases as a subject of such triple in non-UMLS ontologies
    6.4 Predicates in triples of the form < symptom > < predicate > and number of usages of symptoms as a subject of such triple in non-UMLS ontologies


  • Listings

    2.1 SPARQL query example

    5.1 Retrieval of disease classes from UMLS ontologies

    6.1 Retrieval of disease-symptom connections from UMLS ontologies
    6.2 View from Disease-Symptom Ontology
