A query language for discovering semantic associations, Part I: Approach and formal definition of query primitives

In contemporary query languages, the user is responsible for navigation among semantically related data. Because of the huge amount of data and the complex structural relationships among data in modern applications, it is unrealistic to suppose that the user could know completely the content and structure of the available information. There are several query languages whose purpose is to facilitate navigation in unknown structures of databases. However, the background assumption of these languages is that the user knows how data are related to each other semantically in the structure at hand. So far little attention has been paid to how unknown semantic associations among available data can be discovered. We address this problem in this article. A semantic association between two entities can be constructed if a sequence of relationships expressed explicitly in a database can be found that connects these entities to each other. This sequence may contain several other entities through which the original entities are connected to each other indirectly. We introduce an expressive and declarative query language for discovering semantic associations. Our query language is able, for example, to discover semantic associations between entities for which only some of the characteristics are known. Further, it integrates the manipulation of semantic associations with the manipulation of documents that may contain information on entities in semantic associations.

Introduction

Users of several current information systems need, in addition to the information internal to an organization, external information related to other organizations (e.g., B2B or business-to-business applications; Jensen, Møller, & Pedersen, 2003), or public information sources reachable via the Web. Such applications are based on a specific context whose underlying information has to be extracted from several distributed information sources. Typically in the context of interest only a part of the data available in an information source is needed. The context of interest determines what data is relevant in a specific information source.

The Extensible Markup Language (XML; W3C, 2004a) has totally changed the way data is shared between applications and organizations because it offers a popular standard for a data-exchange format between heterogeneous information sources on the Web. In fact, XML has removed one of the main obstacles to integrating data: the heterogeneity of data formats. Many commercial database management systems (DBMSs) have also incorporated some support for XML publishing. Likewise many authors (e.g., Benedikt, Chan, Fan, Freire, & Rastogi, 2003; Fernández, Morishima, & Suciu, 2001; Fong, Wong, & Cheng, 2003; Shanmugasundaram et al., 2001) have considered how XML representation is produced from relational data. There are also several efforts (e.g., DeHaan, Toman, Consens, & Özsu, 2003; Fernández, Kadiyska, Suciu, Morishima, & Tan, 2002) to map XML-based queries to Structured Query Language (SQL) queries. These works can be exploited in extracting relevant information from heterogeneous information sources. It is obvious that the data related to an application involving several (possibly autonomous) information sources should be represented in one uniform way that is easily constructable from these information sources. Therefore, the starting point of our work is that all information in the application is represented as one XML-based information source.

Our XML-based application developed for some context of interest can be viewed as consisting of entities (objects) and the relationships among them, inherited from the real world. The is-a relationship, the association, and the part-of relationship are considered basic among entities in the real world (e.g., Rumbaugh, Blaha, Premerlani, Eddy, & Lorensen, 1991; Rumbaugh, Jacobson, & Booch, 1999; Wand, Storey, & Weber, 1999; Motschnig-Pitrik & Kaasbøll, 1999; Renguo, Dillon, Rahayu, Chang, & Gorla, 2000).

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 58(11):1559–1568, 2007

Timo Niemi and Janne Jämsen
Department of Computer Sciences, University of Tampere, Kanslerinrinne 1, 33014, Tampere, Finland. E-mail: {tn, janne.jamsen}@cs.uta.fi

Received June 23, 2005; revised February 19, 2007; accepted February 19, 2007

© 2007 Wiley Periodicals, Inc. • Published online 16 July 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/asi.20478

The is-a relationship (often called a specialization/generalization hierarchy or class hierarchy in object-orientation) groups together semantically similar entities (at the extensional level) or entity types (at the intensional level). In other words, it organizes entities or entity types whose degree of specialization is different. For example, the entity type bicycle is more specific than the entity type cycle, which, in turn, is more specific than vehicle. The part-of relationship is needed for modeling whole/part relationships among entities and entity types. Entities modeled by the part-of relationship consist of parts (i.e., of some other entities), which, in turn, are composed of other parts and so on. Thus all parts are not on the same level but rather have different complexity. Unlike the part-of relationship, the association expresses the relationship among independent entity types/entities. Typically it models an event, a phenomenon, or a fact of the real world. Characteristically, entity types/entities participating in an association play some role. By establishing an association between two entities of the type person and an entity of the type vehicle, we could, for example, express who (the role seller) has sold whom (the role buyer) a vehicle. In an association, the participating entity types are assumed to be conceptually on the same level (Renguo et al., 2000), whereas in the part-of relationship they depend hierarchically on each other.

Next-generation information systems (NGISs) (Li & Lochovsky, 1998; Niemi, Junkkari, & Järvelin, 2002) need to manipulate data-oriented (structural), behavioral, and deductive aspects of all the above relationships. Due to great amounts of available data, which have possibly been extracted from several heterogeneous autonomous information sources, it is impossible to assume that the user could know the available data and the relationships among them in detail. In the context of the is-a relationship this mainly means the capability of finding out the unknown super- or subentity types of a given entity type (Niemi, Christensen, & Järvelin, 2000). In the part-of relationship, by contrast, there may be several components unknown to the user both on the intensional (schema) and extensional (instance) level (Junkkari, 2005). In Niemi, Junkkari, Järvelin, and Viita (2004) we have introduced our expressive query language for manipulating part-of relationships when the user does not know their structures or contents. In this article our focus is to support the user in finding unknown semantic connections among entities in the context of associations.

According to Sheth and colleagues (Sheth et al., 2005; Anyanwu & Sheth, 2002, 2003; Aleman-Meza, Halaschek, Arpinar, & Sheth, 2003), a semantic association exists between two entities if they relate to each other via relationships and entities in a given application domain. The semantic association between two entities can be direct or indirect. In a direct semantic association two entities are related to each other via an association in which both entities participate (e.g., Professors Smith and Jones have written a joint article). In an indirect semantic association they relate transitively to each other via a sequence of other entities and associations (e.g., Professor Jones uses in his logic programming course a textbook written by a scientist belonging to the same department as Professor Smith).

The need to support the discovery of semantic associations has recently been recognized both in the Semantic Web (Anyanwu & Sheth, 2002, 2003) and relational database (Hristidis & Papakonstantinou, 2002) communities. So far research has concentrated on developing algorithms for discovering semantic associations. To our knowledge, the query language we introduce in this article is the first developed for this purpose.

The rest of the article is organized as follows. In the following section we consider research on discovering semantic associations among data. Special attention is paid to factors that are still missing in order to develop a query language for discovering semantic associations. Further, we consider why the approach based on conventional query languages cannot be used for discovering semantic associations. The basic modeling constructs in our approach are introduced in the subsequent section, followed by a presentation of our query language for discovering semantic associations. Finally, we present our conclusions. In Part II of this article (Niemi & Jämsen, in press), we give a more detailed discussion of the merits and usage of our query language and its prototype implementation, as well as potential problems and challenges for its future development.

Related Work

Contemporary database query languages, regardless of the underlying database paradigm, permit the formulation of queries in which the user knows the structure and content of the database of interest. In relational databases related data are usually stored in several relations. The semantic connections among relational data are represented by storing the same values of attributes in several relations. Therefore, in collecting semantically related data from several relations, the user has to specify many relational joins between data, for example, in the WHERE clause of SQL (Date, 2000). Unlike in relational databases, in object-oriented databases a unique identifier is assigned to each object of the database and is used to refer to the object; related data are usually expressed so that the object identifier of one object is stored as an attribute of another. In a typical object-oriented query language, such as Object Query Language (OQL; Cluet, 1998), the user has to specify through available iterators how different variables depend on each other, that is, how objects, which are instantiations of these variables, are related semantically to each other. In deductive databases, or logical databases, the existing basic semantic relationships among data are represented in the Extensional Data Base (EDB) and more complex semantic relationships among data (called the Intensional Data Base or IDB) are inferred from the EDB by specifying rules (Ullman, 1988). Both EDB and IDB are specified on the basis of logic. In specifying IDB rules the user has to know in detail how the EDB has been organized.

From the above we can draw the following conclusions. In contemporary database query languages the user is responsible for specifying semantic relationships among data and, therefore, has to understand how data are related semantically to each other. Unfortunately, these query languages do not support querying unknown semantic relationships among data.

XML represents semistructured data, that is, data with possibly irregular and incomplete structure. The starting point for many XML-based semistructured query languages (e.g., W3C, 1999, 2005; Abiteboul, Quass, McHugh, Widom, & Wiener, 1997; Buneman, Fernández, & Suciu, 2000) is that an information need can mainly be satisfied by extracting and selecting data from XML documents. The user navigates among XML data by specifying path expressions. An element in an XML document matches the path expression if it is possible to find a path starting from the root element and ending at the target element so that the elements belonging to the path satisfy all conditions specified by the user. Due to the semistructured data, an XML document may contain several different structures leading to the target element with the desired label. For example, data in some structure may be absent or may not conform to a regular structure. Therefore it is appropriate that all parts of the path need not be specified exactly. Compared to the database query languages discussed above, which require knowledge of the exact structure of the data, path-oriented query languages are a step forward in the manipulation of unknown information. Nevertheless, we share the opinion of Erdmann and Studer (2001) that these languages are strong only when the location of wanted data is known in advance or when the information seeker is at least aware of the document structure to some extent. Clearly, XML query languages do not support in any way the discovery of unknown semantic relationships among entities in XML documents.

Sheth and colleagues (Sheth et al., 2005; Anyanwu & Sheth, 2002, 2003) have developed an advanced operation, called Rho, for discovering semantic associations among data. This operation uses the Semantic Web based on the Resource Description Framework (RDF) representation (W3C, 2004b). RDF-based path-oriented query languages (e.g., Karvounarakis, Alexaki, Christophides, Plexousakis, & Scholl, 2002; Seaborne, 2004) are similar to XML-based path-oriented query languages and thus they do not support the discovery of semantic associations among data. Therefore the novel Rho operation was needed. It is able to return the answer to the question “How are entity a and entity b related?” The answer is a set of paths, called semantic paths by Sheth et al., consisting of relationships and intermediate entities that connect the entities a and b.

In conventional relational query languages the user has to know the underlying database schema exactly. There is, however, some work that supports information discovery among relational data on the basis of keywords. The user need not have any knowledge of the relational database schema at hand. Keywords can be any attribute values (e.g., the names of persons) in the database. Masermann and Vossen (2000a, 2000b) have presented this kind of framework, which is able to find tuples in a relational database that contain all keywords. Thus the user need not know the names of relations and their attributes, as in SQL query formulation. However, this framework is not able to discover relationships among keywords if they appear in different relations. With respect to this work, the DISCOVER system (Hristidis & Papakonstantinou, 2002) takes one step forward by discovering relationships among given keywords even when they are located in different relations. In DISCOVER, the solution to a query consists of those join sequences among relations that contain all the keywords. In other words, each join sequence expresses a semantic association among keywords.

When applying the Rho operation or the above discovery mechanisms developed for relational databases, the user must know entities or keywords explicitly, although it is not necessary to know how they are semantically related. Here we go yet one step (we believe an important one) further by supporting the discovery of semantic associations among unknown entities. In other words, we may know some characteristics related to the entities of interest but we do not know the entities themselves. This means that our problem is more complex because we have to find unknown semantic associations among unknown entities. In the next section, we summarize the essential features of our approach and its differences with regard to existing approaches.

One uniform information source for the user. All available information, which may have been extracted from many heterogeneous (possibly autonomous) information sources, is represented as one uniform information source for the user. The user need not know how this information has been organized. It is sufficient that the user knows that it consists of entities and associations with their types and properties (i.e., data-oriented information) as well as documents containing possibly semistructured textual information on entities (i.e., document-oriented information). We have developed, for data-oriented modeling constructs, both a formal (XML-independent) representation and an XML-based prototype representation (see Part II). In our prototype we chose the XML-based representation to promote open data exchange and interoperability as well as to provide one consistent way to represent structured as well as semistructured information.

Query language. Unlike the Rho operation or the relational discovery methods, our idea is to offer a real query language for discovering semantic associations among data. Further, it is essential from the viewpoint of expressive power that our language also finds unknown semantic associations among unknown entities. Our language has been designed so that it can exploit all the explicit information that the user is able to give. For example, if the user can express that unknown entities belong to a specific type, this can be exploited in two ways. First, the user gets in the answer only instantiations of entities that belong to the given type, instead of all possible instantiations. Therefore the answer is more compact, containing more relevant instantiations for the unknown entities. Second, explicit information speeds up processing because only a specific scope of available information must be processed.

The degree of declarativity of our query language must be essentially greater than in XML-based path-oriented query languages. Although our query language deals with XML-based data, XML-based path-oriented query languages (e.g., Abiteboul et al., 1997; Buneman et al., 2000; W3C, 1999, 2005) are easy to use only if the information need can be satisfied on the basis of extraction and selection of available data. Further, queries in these query languages produce as the result an XML element (which can be a part of an original document). Because we are interested here in discovering semantic associations between entities, we need a language that is able to produce precisely this information without intermingling content and its description as in XML. In other words, we need an approach to query formulation that differs considerably from that used in XML-based path-oriented query languages. Our query language resembles the RDF-based query language TRIPLE (Sintek & Decker, 2001) in the respect that the result of a query consists of different possible bindings of free variables in the query. In our approach the query result contains the instantiations of those variables that refer to unknown entities and semantic associations.

Source Information for Discovering Semantic Associations

The starting point of our approach is that all available information collected for some purpose (context of interest) has been constructed from several (possibly heterogeneous and autonomous) information sources. Roughly speaking, in our approach any context of interest consists of entities, their properties, the is-a hierarchy among entities, associations among entities, and documents containing information related to entities. The chosen modeling constructs resemble closely those of our Relational Deductive Object-Oriented Model (RDOOM; Niemi et al., 2002) introduced earlier. However, the deductive and behavioral aspects of this model are not considered in this article, nor are part-of relationships.

Below only the modeling constructs crucial for manipulation of semantic associations are defined formally. Based on these definitions, we also give the formalism to represent semantic associations. The formal definitions of other modeling constructs can be found in the article by Niemi et al. (2002). First we introduce the basic notational conventions used in addition to the usual set theoretic notation.

Basic Notational Conventions

(1) A tuple is denoted between angle brackets, for example $\langle a, b, c \rangle$.

(2) The length of a tuple $t$ is the number of elements in the tuple, and it is denoted by $\mathrm{len}(t)$, for example $\mathrm{len}(\langle a, b, c \rangle) = 3$.

(3) The $i$th element ($i \in \{1, \ldots, \mathrm{len}(t)\}$) of a tuple $t$ is denoted by $t[i]$. For example, if $t = \langle a, b, c \rangle$ then $t[2] = b$ and $t[\mathrm{len}(t)] = c$.

Modeling Constructs for the Context of Interest

Entities and their properties (attributes). Entities have attributes that describe their characteristics. We assume that values of some attributes of entities are used to identify entities uniquely in the context at hand, that is, these attributes act as keys of entities. In other words, our work is based on the value-oriented approach. Entities and the values of their attributes belong to the extensional level. At the intensional level, entities possessing the same attributes form an entity type with a unique name. We use the notation ext(Y) to refer to the set of entities that belong to the entity type Y.

Is-a hierarchy. In an is-a hierarchy entities/entity types with different degrees of specialization are organized into separate hierarchy levels. Typically, the specialization of an entity type is defined by giving additional attributes along with superentity types. In other words, in addition to its specific attributes, an entity type inherits the attributes of its superentity types. At the extensional level we have the following interpretation: An entity belonging to a specific entity type also belongs to each of its superentity types (i.e., more general entity types) in the is-a hierarchy at hand. Our formal set theoretic representation for is-a hierarchies can be found in the article by Niemi et al. (2002).
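The extensional interpretation of the is-a hierarchy can be illustrated with a minimal sketch. It is not the formal representation of Niemi et al. (2002); the dictionaries `SUPERTYPES` and `ENTITY_TYPE`, the entity keys `b1`, `c1`, `v1`, and the helper `supertypes` are hypothetical names introduced only for illustration. The entity types reuse the bicycle/cycle/vehicle example given earlier, and `ext` mirrors the notation ext(Y).

```python
# Hypothetical data: entity type -> its direct superentity types.
SUPERTYPES = {"bicycle": ["cycle"], "cycle": ["vehicle"], "vehicle": []}

# Hypothetical data: entity key -> the most specific entity type it belongs to.
ENTITY_TYPE = {"b1": "bicycle", "c1": "cycle", "v1": "vehicle"}


def supertypes(entity_type):
    """Return entity_type together with all of its (transitive) superentity types."""
    seen = set()
    stack = [entity_type]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUPERTYPES.get(current, []))
    return seen


def ext(entity_type):
    """Entities belonging to entity_type, directly or through a subentity type."""
    return {e for e, t in ENTITY_TYPE.items() if entity_type in supertypes(t)}
```

Under these assumptions, an entity of type bicycle is also returned by `ext("cycle")` and `ext("vehicle")`, which is the extensional reading of the hierarchy described above.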

Associations. An association type contains three kinds of information: association name, entity attribute names, and property attribute names. The entity attributes express those entities that participate in the association in some role, whereas the property attributes characterize the phenomenon behind the association. If EN is an entity attribute name in some association type, then we use the notation type(EN) to express its underlying entity type. For example, if seller is an entity attribute name in some association type, the notation type(seller) could mean the entity type person.

An individual association expresses a named connection between specific entities and related property attribute values. Similar associations are described as an association type and they are represented as a relation, that is, as a set consisting of tuples with the same structure. An association type has the schema $A(E_1, \ldots, E_n, P_1, \ldots, P_m)$ where $A$ is the name of the association type, $E_1, \ldots, E_n$ are its entity attribute names, and $P_1, \ldots, P_m$ are its property attribute names. The sets of entity and property attribute names in the association type $A$ are denoted by $EN_A$ and $PN_A$, respectively. As explained above, at the instance level an association type $A(E_1, \ldots, E_n, P_1, \ldots, P_m)$ is represented as a set $\{\langle e_{11}, \ldots, e_{1n}, p_{11}, \ldots, p_{1m} \rangle, \ldots, \langle e_{k1}, \ldots, e_{kn}, p_{k1}, \ldots, p_{km} \rangle\}$ where $e_{ij}$ ($i \in \{1, \ldots, k\}$ and $j \in \{1, \ldots, n\}$) expresses an entity belonging to the entity attribute $E_j$, whereas $p_{ij}$ ($j \in \{1, \ldots, m\}$) expresses a value of the property attribute $P_j$. In other words, $k$ expresses the cardinality of $A$, whereas $n + m$ expresses its degree. In constructing semantic associations it is necessary to find out whether two entities appear in the same association. For this we need the projection of an association type. Let $EN_r$ and $EN_t$ ($r, t \in \{1, \ldots, n\}$) be two entity attribute names in $EN_A$. Then the projection over $EN_r$ and $EN_t$ is denoted by $A[EN_r, EN_t]$ and it means the set $\{\langle e_{1r}, e_{1t} \rangle, \ldots, \langle e_{kr}, e_{kt} \rangle\}$. In our model there are several association types among which semantic connections between entities are constructed. The collection of all available association types is denoted by $AC$. Thus, if we want to find all association types in which there is an immediate connection between entities $e_s$ and $e_t$, we can express this by the following set specification:

$$\{A \mid A \in AC \wedge \exists EN_s, EN_t \in EN_A : \langle e_s, e_t \rangle \in A[EN_s, EN_t]\}.$$

Documents. The idea of integrating information in documents with information in databases is well known. For example, the integration of structured documents has been proposed both with relational (e.g., ISO, 1986) and object-oriented (e.g., Christophides, Abiteboul, Cluet, & Scholl, 1994; Yan & Annevelink, 1994) databases. In our approach the context of interest often contains regular or irregular textual information related to some information sources in the form of XML documents. Documents may contain references to entities. The documents created for the same purpose are grouped into the same document collection.

Semantic associations among entities are constructed during query processing based on the above modeling constructs. A semantic association expresses a semantic relationship between two entities. It consists of one or more associations joined together by shared entities. A semantic association is valid if it is acyclic, that is, it does not contain the same association or the same pair of shared entities twice. For the sake of simplicity, we represent it formally as a sequence consisting of entity pairs labeled by association names, that is, as a tuple with the form $\langle A_1(e_1, e_2), A_2(e_2, e_3), \ldots, A_n(e_n, e_{n+1}) \rangle$. Each labeled entity pair $A_i(e_i, e_{i+1})$ ($i \in \{1, \ldots, n\}$) is interpreted as an association of the type $A_i$ holding between the entities $e_i$ and $e_{i+1}$.

The Query Language

One of the main goals of our query language design is its ability to adjust itself to the user’s knowledge or ignorance about the context at hand. Therefore, it is important that the user can refer to unknown factors in their queries intuitively and uniformly.

From the user’s viewpoint, query formulation consists of combining query primitives containing variables. Variables are used to refer to unknown information, and they can be associated with a construct at the intensional or extensional level. The notion and notation of variable we use are similar to those of deductive databases (accordingly, a variable is denoted by a string, which may consist of letters and numbers, beginning with an uppercase letter). However, unlike in deductive databases, the user need not master how variables are instantiated during query processing. It is sufficient that the user knows the referents of the variables he specifies. The query processor is responsible for finding the values of variables satisfying the criteria given in a query. This means that the user can combine query primitives safely, that is, without fearing nonterminating processing.

Next we consider the semantics related to the primitives of the query language. Because the focus of our query language is to offer novel expressive power for manipulation of semantic associations, we specify exactly the query primitives tailored to this purpose. Other query primitives, such as condition expressions, are conventional and we do not repeat their formal definitions. For example, in the article by Niemi et al. (2002) the interested reader can find the formal definitions based on the notion of variable used in this article. In our formal specification we use set theory because it is a precise, well-known and established formalism. The exact syntax of our query language is given in the Appendix.

General Form of a Query

A query has the following form:

result where conditions.

The string where is a reserved word used to separate the two parts of a query from each other. The result consists of variables specified in conditions, or attribute expressions of the form “variable:attrname,” which are separated from each other by commas. Thus the expression “X:address” could be used to denote the value of the attribute address in an entity X, and the expression “Y:parent” to denote the entity that has the role parent in an association Y. The solutions in a result include those instantiations of variables that satisfy the criteria given in conditions. The conditions, in turn, consist of individual query primitives separated by commas or semicolons. The former stand for logical conjunction, and the latter for logical disjunction. Parentheses can be used to group conjunction or disjunction sequences if necessary.
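The two-part structure of a query can be sketched as a small splitting routine. This is only an illustration under stated assumptions, not the grammar given in the Appendix: the function name `split_query` is hypothetical, commas are assumed to bind more tightly than semicolons, parenthesized grouping is not handled, and the sample query in the usage note is composed from primitives quoted in the text purely for demonstration.

```python
def split_query(query):
    """Split 'result where conditions' into result items and grouped conditions.

    Returns a pair (result, conditions) where `conditions` is a list of
    disjuncts, each of which is a list of conjoined query primitives.
    """
    # The reserved word "where" separates the two parts of a query.
    result_part, conditions_part = query.split(" where ", 1)
    # Result items (variables or variable:attrname expressions) are comma separated.
    result = [item.strip() for item in result_part.split(",")]
    # Assumption: semicolons (disjunction) separate groups of comma-joined
    # primitives (conjunction); parentheses are ignored in this sketch.
    conditions = [
        [primitive.strip() for primitive in disjunct.split(",")]
        for disjunct in conditions_part.split(";")
    ]
    return result, conditions
```

For instance, `split_query("X, P:address where entity person P, assoc parenthood X; entity P")` separates the result items `X` and `P:address` from two alternative condition groups.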

Comparison and Negation Expressions

By using the six conventional comparison operations (equality, inequality, less than, greater than, and the two non-strict ordering comparisons) in queries, it is possible to compare values. Comparison can be specified among literals and expressions containing variables. The former include strings, integers, decimal numbers, and the names of modeling constructs (e.g., the name of an entity type or the filename of a document).

The expression “neg condition” means the negation of a condition. The scope of negation depends on the scope of the negated condition. For example, the expression “neg entity person P” (see the following section) means any entity that does not belong to the entity type person or any of its subentity types.


Primitives for Entities

In our query language the user need not declare variables. However, primitives restricting possible instantiations of a variable are needed. For example, references to entities in our language are specified by the reserved word entity. Thus, the variable E in the expressions “entity E” and “entity person E” would refer to any entity in the context at hand, or to any entity belonging to the entity type person, respectively. The type of an entity can be retrieved by replacing it with a variable, thus enabling formulation of intensional queries.

Primitives for Semantic Associations

The core of our query language is its ability to discover semantic associations among entities. An association available in the context represents the simplest form of a semantic association. The reserved word assoc is used to refer to an association in our query language. If the type of an association, say parenthood, is known, it can be expressed as follows: “assoc parenthood X.” Now each instantiation of the variable X would express a parent-child relationship between two individual entities.

In complex cases several associations have to be combined with each other in order to join two entities semantically with each other. Typically, the user does not know the exact way to connect them but, instead, wants to retrieve this information, that is, the semantic associations between them. The basic primitive for expressing a semantic association is

sem_assoc $x$ between $e_1$ and $e_2$

where $x$ denotes any sequence consisting of associations, including possibly other entities that are needed to connect the entities $e_1$ and $e_2$ with each other.

In order to illuminate our formal definition, let us consider the sample graph in Figure 1. In it we describe different paths to connect the source node $e_s$ semantically to the target node $e_t$ through associations belonging to types $A_i$. In other words, in this case the result for $x$ in the expression “sem_assoc $x$ between $e_s$ and $e_t$” would be the following set of semantic associations: $\{\langle A_1(e_s, e_t) \rangle, \langle A_2(e_s, e_t) \rangle, \langle A_3(e_s, e_5), A_4(e_5, e_t) \rangle, \langle A_3(e_s, e_1), A_1(e_1, e_4), A_4(e_4, e_t) \rangle, \langle A_3(e_s, e_1), A_1(e_1, e_4), A_6(e_4, e_t) \rangle, \langle A_3(e_s, e_1), A_1(e_1, e_2), A_2(e_2, e_3), A_5(e_3, e_t) \rangle\}$. The path $\langle A_3(e_s, e_1), A_1(e_1, e_4), A_4(e_4, e_t) \rangle$ describes one semantic association in the above result. Based on this semantic association, we illustrate the idea behind our formal definition. The construction of this path happens through different tuples in the following order: $\langle e_s \rangle$, $\langle A_3(e_s, e_1), e_1 \rangle$, $\langle A_3(e_s, e_1), A_1(e_1, e_4), e_4 \rangle$, and $\langle A_3(e_s, e_1), A_1(e_1, e_4), A_4(e_4, e_t) \rangle$. In other words, the last element in the tuple expresses the entity for which we have to find a labeled entity pair through which the target entity $e_t$ can be reached. The semantics related to the primitive “sem_assoc $x$ between $e_s$ and $e_t$” is given as $x = \mathrm{connect}(\langle e_s \rangle, e_t)$ where the function connect has been defined recursively as follows:

$$
\mathit{connect}(\mathit{path}, e_t) =
\begin{cases}
\{\mathit{path}' \mid \forall i \in \{1, \ldots, \mathit{len}(\mathit{path}) - 1\}\colon \mathit{path}'[i] = \mathit{path}[i] \,\wedge\, \mathit{path}'[\mathit{len}(\mathit{path})] = A(\mathit{path}[\mathit{len}(\mathit{path})], e_t)\} \\
\qquad \text{if } \exists A\,(\in AC),\ EN_s\,(\in ENA),\ EN_p\,(\in ENA)\colon \langle \mathit{path}[\mathit{len}(\mathit{path})], e_t \rangle \in A[EN_s, EN_p] \qquad \text{/* Exit condition */} \\[1ex]
\bigcup_{x \in \text{path-set}} \mathit{connect}(x, e_t) \\
\qquad \text{where path-set} = \{\mathit{path}' \mid \forall i \in \{1, \ldots, \mathit{len}(\mathit{path}) - 1\}\colon \mathit{path}'[i] = \mathit{path}[i] \,\wedge\, \mathit{path}'[\mathit{len}(\mathit{path})] = A(\mathit{path}[\mathit{len}(\mathit{path})], e) \,\wedge\, \mathit{path}'[\mathit{len}(\mathit{path}) + 1] = e\} \\
\qquad \text{if } \exists A\,(\in AC),\ EN_s\,(\in ENA),\ EN_p\,(\in ENA),\ e\colon \langle \mathit{path}[\mathit{len}(\mathit{path})], e \rangle \in A[EN_s, EN_p] \,\wedge\, e \neq e_t \,\wedge\, \neg\exists j \in \{1, \ldots, \mathit{len}(\mathit{path}) - 1\}\colon \mathit{path}[j] = A_j(e_j, e) \\[1ex]
\emptyset \qquad \text{otherwise}
\end{cases}
$$

The exit condition of the above definition specifies that there are one or more association types in terms of which the last element in the path constructed to that point can be replaced by a labeled entity pair that ends with the desired target entity. In the recursive definition, a path can be extended by an entity that is not the target entity, and that does not yet belong to the path (this prevents nonterminating processing). As our example in Figure 1 shows, there can be several association types in terms of which the path at hand can be extended in this way. If it is not possible to extend a path in this way, the empty set is yielded.

As explained above, our query language supports queries that discover semantic associations between unknown entities. Let us assume that we have to find semantic associations between entities of type Type1 and Type2, that is, we do not know the individual entities, only their types. We can find these semantic associations as follows:

connect(⟨e1⟩, e2) where e1 ∈ ext(Type1) and e2 ∈ ext(Type2).

As Aleman-Meza et al. (2003) have recognized, the number of semantic associations that connect two entities can be very large. Therefore, we need filtering expressions that effectively reduce their number in the result of a query.
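The recursive function connect can be sketched in a few lines of Python. This is an illustrative reading under stated assumptions, not the authors' implementation: associations are treated as directed triples, the class `Assoc`, the helper `connect_types`, and the edge list `FIGURE_1` are hypothetical names, and the exit and recursive cases are unioned so that the result matches the six paths listed for Figure 1. The sketch also keeps the start entity out of later path positions, which is slightly stricter than the formal condition.

```python
from typing import NamedTuple


class Assoc(NamedTuple):
    """One labeled entity pair A(source, target); names are illustrative."""
    atype: str
    source: str
    target: str


def connect(path, target, assocs):
    """All acyclic association sequences from the last entity of `path` to
    `target`. `path` is a tuple of Assoc steps followed by the entity reached
    so far, e.g. ("es",) or (Assoc("A3", "es", "e1"), "e1")."""
    *steps, last = path
    visited = {last} | {step.source for step in steps}
    results = []
    for assoc in assocs:
        if assoc.source != last:
            continue
        if assoc.target == target:
            # Exit condition: the trailing entity is replaced by the final pair.
            results.append((*steps, assoc))
        elif assoc.target not in visited:
            # Recursive case: extend the path by an entity not yet on it.
            results.extend(connect((*steps, assoc, assoc.target), target, assocs))
    return results


def connect_types(ext1, ext2, assocs):
    """Union of connect(<e1>, e2) over e1 in ext(Type1) and e2 in ext(Type2).
    Skipping e1 == e2 is an assumption of this sketch."""
    return [p for e1 in ext1 for e2 in ext2 if e1 != e2
            for p in connect((e1,), e2, assocs)]


# Edges of the sample graph in Figure 1, read off the paths listed in the text.
FIGURE_1 = [
    Assoc("A1", "es", "et"), Assoc("A2", "es", "et"),
    Assoc("A3", "es", "e5"), Assoc("A4", "e5", "et"),
    Assoc("A3", "es", "e1"), Assoc("A1", "e1", "e4"),
    Assoc("A4", "e4", "et"), Assoc("A6", "e4", "et"),
    Assoc("A1", "e1", "e2"), Assoc("A2", "e2", "e3"),
    Assoc("A5", "e3", "et"),
]

for p in connect(("es",), "et", FIGURE_1):
    print(", ".join(f"{a.atype}({a.source}, {a.target})" for a in p))
```

The visited-set check plays the role of the condition that an entity must not yet belong to the path, which is what guarantees termination on cyclic graphs.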


FIG. 1. A graph for connecting entity es to entity et. [Figure not reproduced: nodes es, e1, e2, e3, e4, e5, and et, joined by edges labeled with the association types A1 to A6.]

Each filtering primitive has well-defined semantics in terms of which the user can characterize semantic associations of interest. On the other hand, the use of these primitives presupposes that the user knows what entities, entity types, or association types are included in the context at hand.

Next, we list the filtering primitives of our query language. These primitives can be added to the above basic expression for semantic associations. Each of them affects only the intermediate entities or associations required to join two entities. We also give their functional definitions based on the function connect.
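Because each of the filtering primitives (1) to (7) is defined as a subset of the path set produced by connect, each one can be read as a predicate over paths. The following Python sketch illustrates that reading on the six paths of Figure 1. The path encoding, the function names, and the entity types in `TYPE_OF` are assumptions made for illustration; the sketch follows the prose reading (all intermediate entities are checked) rather than reproducing the index ranges of the formal definitions.

```python
# The six paths from es to et in Figure 1; each step is (association type, source, target).
PATHS = [
    (("A1", "es", "et"),),
    (("A2", "es", "et"),),
    (("A3", "es", "e5"), ("A4", "e5", "et")),
    (("A3", "es", "e1"), ("A1", "e1", "e4"), ("A4", "e4", "et")),
    (("A3", "es", "e1"), ("A1", "e1", "e4"), ("A6", "e4", "et")),
    (("A3", "es", "e1"), ("A1", "e1", "e2"), ("A2", "e2", "e3"), ("A5", "e3", "et")),
]
# Hypothetical entity types for the intermediate entities (not given in the article).
TYPE_OF = {"e1": "person", "e2": "company", "e3": "person",
           "e4": "person", "e5": "company"}


def intermediates(path):
    """Entities strictly between the two end points of a path."""
    return [target for (_atype, _source, target) in path[:-1]]


def via(paths, entities):
    return [p for p in paths if set(entities) <= set(intermediates(p))]


def not_via(paths, entities):
    return [p for p in paths if not set(entities) & set(intermediates(p))]


def within_entity_types(paths, types, type_of):
    return [p for p in paths if all(type_of[e] in types for e in intermediates(p))]


def without_entity_types(paths, types, type_of):
    return [p for p in paths if all(type_of[e] not in types for e in intermediates(p))]


def within_assoc_types(paths, types):
    return [p for p in paths if all(atype in types for (atype, _s, _t) in p)]


def without_assoc_types(paths, types):
    return [p for p in paths if all(atype not in types for (atype, _s, _t) in p)]


def with_max_length(paths, limit):
    return [p for p in paths if len(p) <= limit]


# Filters compose by nesting, mirroring several primitives in one sem_assoc expression.
print(with_max_length(not_via(PATHS, {"e5"}), 3))
```

Since every function maps a path set to a path set, combining primitives in a single query amounts to function composition in this sketch.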

Filtering primitives for entities. The filtering primitive

(1) via {e1, e2, … , en}

expresses that in the result, only semantic associations that contain the entities e1, e2, … , en are considered. In addition, the resulting semantic associations may also contain some other entities. Formally, x in the expression “sem_assoc x between es and et via L” is defined as the set

$$\{\mathit{path} \mid \mathit{path} \in \mathit{connect}(\langle e_s \rangle, e_t) \,\wedge\, \forall e \in L\colon \exists i \in \{1, \ldots, \mathit{len}(\mathit{path}) - 1\}\colon \mathit{path}[i] = A_i(e_i, e)\}.$$

Above, it is worth noting that connect(⟨es⟩, et) means a path set produced by this function.

The filtering primitive

(2) not_via {e1, e2, … , en}

specifies that we are not interested in those semantic associations that contain the entities e1, e2, … , en. The definition of x is analogous to that used above; in the expression “sem_assoc x between es and et not_via L”, x is evaluated as the set

$$\{\mathit{path} \mid \mathit{path} \in \mathit{connect}(\langle e_s \rangle, e_t) \,\wedge\, \forall e \in L\colon \neg\exists i \in \{1, \ldots, \mathit{len}(\mathit{path}) - 1\}\colon \mathit{path}[i] = A_i(e_i, e)\}.$$

Filtering primitives for entity and association types. The filtering primitives

(3) within_entity_types {et1, et2, … , etn}
(4) without_entity_types {et1, et2, … , etn}

express that semantic associations of interest have to contain, or must not contain, entities belonging to the entity types et1, et2, … , etn, respectively. If the filtering primitives (3) and (4) are omitted, semantic associations are considered among all entity types belonging to the context. In the expression “sem_assoc x between es and et within_entity_types L”, x stands for the set

$$\{\mathit{path} \mid \mathit{path} \in \mathit{connect}(\langle e_s \rangle, e_t) \,\wedge\, \forall i \in \{2, \ldots, \mathit{len}(\mathit{path}) - 1\}\colon \mathit{path}[i] = A_i(e_i, e_{i+1})\colon \mathit{type}(e_i) \in L \,\wedge\, \mathit{type}(e_{i+1}) \in L\}.$$

Accordingly, in the expression “sem_assoc x between es and et without_entity_types L”, x can be defined as

$$\{\mathit{path} \mid \mathit{path} \in \mathit{connect}(\langle e_s \rangle, e_t) \,\wedge\, \forall i \in \{2, \ldots, \mathit{len}(\mathit{path}) - 1\}\colon \mathit{path}[i] = A_i(e_i, e_{i+1})\colon \mathit{type}(e_i) \notin L \,\wedge\, \mathit{type}(e_{i+1}) \notin L\}.$$

The filtering primitives

(5) within_assoc_types {at1, at2, … , atn}
(6) without_assoc_types {at1, at2, … , atn}

specify that semantic associations of interest must, or must not, be constructed by using the association types at1, at2, … , atn. Formally, x included in the expression “sem_assoc x between es and et within_assoc_types L” means the set

$$\{\mathit{path} \mid \mathit{path} \in \mathit{connect}(\langle e_s \rangle, e_t) \,\wedge\, \forall i \in \{1, \ldots, \mathit{len}(\mathit{path})\}\colon \mathit{path}[i] = A_i(e_i, e_{i+1}) \,\wedge\, A_i \in L\}.$$

Analogously, x in the expression “sem_assoc x between es and et without_assoc_types L” is evaluated as the set

$$\{\mathit{path} \mid \mathit{path} \in \mathit{connect}(\langle e_s \rangle, e_t) \,\wedge\, \forall i \in \{1, \ldots, \mathit{len}(\mathit{path})\}\colon \mathit{path}[i] = A_i(e_i, e_{i+1}) \,\wedge\, A_i \notin L\}.$$

The filtering primitive for restricting the length of a semantic association. The filtering primitive

(7) with_max_length i

states that the number of associations in a semantic association is at most i (a positive integer). Thus, the filtering expression “sem_assoc x between es and et with_max_length L”, where L denotes the given upper bound i, implies x to be the set

$$\{\mathit{path} \mid \mathit{path} \in \mathit{connect}(\langle e_s \rangle, e_t) \,\wedge\, \mathit{len}(\mathit{path}) \leq L\}.$$

Primitives for Manipulating Documents

From the viewpoint of the user it is highly desirable to be able to retrieve information from documents without knowing their exact structure or content. By the reserved word doc the user refers to available documents. It can be used similarly to the previously introduced expressions for entities and associations. Thus, the expressions “doc X” and “doc collision_report Z” would refer to any document X or to any collision report Z, respectively.

In addition, the user needs a mechanism to refer to a component in the XML documents at hand. The component may be an attribute or element. If the user wants to refer to a component, say title, this can be expressed by the reserved word comp as follows: “comp title X”. If the type of the component is known, the reserved word comp can be replaced by the reserved words attr or elem.

In our language the primitive “x in d” is used to refer to any component x in any document d belonging to the context. Sometimes the user has to refer to substructures of an element in an XML document. For this our query language contains the expression “c within e in d”, where c refers to any component (an attribute or element) contained directly or transitively in an element e in a document d. For example, in the following primitive sequence: “comp title T, elem news_item N, T within N in D” the variable D would refer to those documents that contain an element with the name “news_item”, which, in turn, contains an element or an attribute with the name title directly or transitively.

In addition, the query language provides the user with primitives for searching referred entities and strings in documents. The primitive “e refers_to en” expresses that the element e has to contain an instance of the special element (see above) that refers to the entity en. The primitive “c contains s” is true if the component c contains the string s.
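The primitive sequence “comp title T, elem news_item N, T within N in D” can be illustrated with a short Python sketch based on the standard library module xml.etree.ElementTree. The sample documents, their names, and the helper functions `has_component` and `docs_with` are hypothetical; the sketch only mirrors the directly-or-transitively containment described above and says nothing about how the query language is actually evaluated.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents of a context, keyed by an illustrative document name.
DOCS = {
    "d1": "<feed><news_item><head><title>Storm</title></head></news_item></feed>",
    "d2": "<feed><news_item title='Flood'/></feed>",
    "d3": "<feed><title>Index</title><news_item><body>text</body></news_item></feed>",
}


def has_component(element, name):
    """True if `element` holds an attribute or a descendant element called
    `name`, directly or transitively."""
    for node in element.iter():          # iter() yields `element` itself first
        if name in node.attrib:
            return True
        if node is not element and node.tag == name:
            return True
    return False


def docs_with(documents, component, element):
    """Names of the documents D that satisfy "component within element in D"."""
    hits = []
    for doc_name, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        if any(has_component(e, component) for e in root.iter(element)):
            hits.append(doc_name)
    return hits


print(docs_with(DOCS, "title", "news_item"))
```

Document d3 is not matched because its title element lies outside every news_item element, whereas d1 matches through a nested element and d2 through an attribute.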


Conclusions

The number of applications, tailored to some context of interest, that consist of data collected from several heterogeneous autonomous information sources has increased considerably during the last few years. It is typical in these applications that the user does not master exactly their information content or structural relationships among data. In fact, the user is often interested in finding out how data are semantically related. So far there have not been actual query languages for this purpose. In this article we introduce a query language that is able to find the semantic associations among data in a context that may consist of entities, their properties, is-a hierarchies among entities, associations among entities, and documents containing textual information related to entities. A semantic association for connecting two entities is a sequence of associations expressed explicitly in the context of interest, and it may contain several other entities through which these two entities are connected to each other indirectly.

Unlike contemporary approaches developed for discovering semantic associations, our query language is able to find semantic associations among unknown entities or entities that are only known by some of their characteristics. We shall discuss this issue more in Part II. Our query language also integrates the manipulation of semantic associations with the manipulation of documents that may contain information on entities in semantic associations.

Acknowledgments

This research was supported by the Academy of Finland under Grant Number 1209960. We also want to thank Professor Kalervo Järvelin for his valuable comments in different phases of writing this article.

References

Abiteboul, S., Quass, D., McHugh, J., Widom, J., & Wiener, J.L. (1997). The Lorel query language for semistructured data. International Journal on Digital Libraries, 1(1), 68–88.

Aleman-Meza, B., Halaschek, C., Arpinar, I.B., & Sheth, A. (2003). Context-aware semantic association ranking. In I.F. Cruz, V. Kashyap, S. Decker, & R. Eckstein (Eds.), Proceedings of SWDB ’03, The First International Workshop on Semantic Web and Databases, Berlin, Germany (pp. 33–50). Retrieved from http://www.cs.uic.edu/~ifc/SWDB/proceedings.pdf

Anyanwu, K., & Sheth, A. (2002). The ρ Operator: Discovering and ranking associations on the Semantic Web. SIGMOD Record, 31(4), 42–47.

Anyanwu, K., & Sheth, A. (2003). ρ-Queries: Enabling querying for semantic associations on the Semantic Web. Proceedings of the 12th International World Wide Web Conference, Budapest, Hungary (pp. 690–699). Retrieved from http://www2003.org/cdrom/papers/refereed/p823/p823-anyanwu.htm

Benedikt, M., Chan, C.-Y., Fan, W., Freire, J., & Rastogi, R. (2003). Capturing both types and constraints in data integration. In Proceedings of ACM SIGMOD International Conference on Management of Data (pp. 277–288). New York: ACM.

Buneman, P., Fernández, M., & Suciu, D. (2000). UnQL: A query language and algebra for semistructured data based on structural recursion. VLDB Journal, 9(1), 1–20.

Christophides, V., Abiteboul, S., Cluet, S., & Scholl, M. (1994). From structured documents to novel query facilities. In R.T. Snodgrass & M. Winslett (Eds.), Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data (pp. 313–324). New York: ACM.

Cluet, S. (1998). Designing OQL: Allowing objects to be queried. Information Systems, 23(5), 279–305.

Date, C.J. (2000). An introduction to database systems (7th ed.). Reading, MA: Addison-Wesley.

DeHaan, D., Toman, D., Consens, M.P., & Özsu, M.T. (2003). A comprehensive XQuery to SQL translation using dynamic interval encoding. In A.Y. Halevy, Z.G. Ives, & A. Doan (Eds.), Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (pp. 623–634). New York: ACM.

Erdmann, M., & Studer, R. (2001). How to structure and access XML documents with ontologies. Data & Knowledge Engineering, 36(3), 317–335.

Fernández, M., Kadiyska, Y., Suciu, D., Morishima, A., & Tan, W.-C. (2002). SilkRoute: A framework for publishing relational data in XML. ACM Transactions on Database Systems, 27(4), 438–493.

Fernández, M., Morishima, A., & Suciu, D. (2001). Efficient evaluation of XML middle-ware queries. In W.G. Aref (Ed.), Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (pp. 103–114). New York: ACM.

Fong, J., Wong, H.K., & Cheng, Z. (2003). Converting relational database into XML documents with DOM. Information and Software Technology, 45(6), 335–355.

Hristidis, V., & Papakonstantinou, Y. (2002). DISCOVER: Keyword search in relational databases. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB) (pp. 670–681). St. Louis, MO: Morgan Kaufmann.

International Organization for Standardization (ISO). (1986). Information processing - Text and office systems: Standard generalized markup language (SGML) (ISO 8879). Geneva, Switzerland: International Organization for Standardization.

Jensen, M.R., Møller, T.H., & Pedersen, T.B. (2003). Converting XML DTDs to UML diagrams for conceptual data integration. Data & Knowledge Engineering, 44(3), 323–346.

Junkkari, M. (2005). PSE: An object-oriented representation for modeling and managing part-of relationships. Journal of Intelligent Information Systems, 25(2), 131–157.

Karvounarakis, G., Alexaki, S., Christophides, V., Plexousakis, D., & Scholl, M. (2002). RQL: A declarative query language for RDF. In Proceedings of the 11th International World Wide Web Conference, Honolulu, HI (pp. 592–603). Retrieved from http://www2002.org/CDROM/refereed/329/index.html

Li, Q., & Lochovsky, F.H. (1998). ADOME: An advanced object modeling environment. IEEE Transactions on Knowledge and Data Engineering, 10(2), 255–275.

Masermann, U., & Vossen, G. (2000a). Design and implementation of a novel approach to key-word searching in relational databases. In J. Stuller, J. Pokorny, B. Thalheim, & Y. Masunaga (Eds.), Lecture Notes in Computer Science: Vol. 1884, Proceedings of ADBIS-DASFAA Symposium (pp. 171–184). London: Springer.

Masermann, U., & Vossen, G. (2000b). SISQL: Schema independent database querying (on and off the Web). In Proceedings of the 2000 International Symposium on Database Engineering and Applications (IDEAS 2000) (pp. 55–64). Washington, DC: IEEE Computer Society.

Motschnig-Pitrik, R., & Kaasbøll, J. (1999). Part-whole relationship categories and their application in object-oriented analysis. IEEE Transactions on Knowledge and Data Engineering, 11(5), 779–797.

Niemi, T., Christensen, M., & Järvelin, K. (2000). Query language approach based on the deductive object-oriented database paradigm. Information and Software Technology, 42(11), 777–792.

Niemi, T., & Jämsen, J. (in press). A query language for discovering semantic associations, Part II: Sample queries and query evaluation. Journal of the American Society for Information Science and Technology.

Niemi, T., Junkkari, M., & Järvelin, K. (2002). Relational deductive object-oriented modeling (RDOOM) approach for finding, representing and integrating application-specific concepts. International Journal of Software Engineering and Knowledge Engineering, 12(4), 415–451.

Niemi, T., Junkkari, M., Järvelin, K., & Viita, S. (2004). Advanced query language for manipulating complex entities. Information Processing & Management, 40(6), 869–889.

Renguo, X., Dillon, T.S., Rahayu, J.W., Chang, E., & Gorla, N. (2000). An indexing structure for aggregation relationship in OODB. In M.T. Ibrahim, J. Küng, & N. Revell (Eds.), Proceedings of the 11th International Conference on Database and Expert Systems Applications (pp. 21–30). London: Springer.

Rumbaugh, J., Blaha, M., Premerlani, W., Eddy, F., & Lorensen, W. (1991). Object-oriented modeling and design. Englewood Cliffs, NJ: Prentice Hall.

Rumbaugh, J., Jacobson, I., & Booch, G. (1999). The Unified Modeling Language reference manual. Reading, MA: Addison-Wesley.

Seaborne, A. (2004). RDQL - A query language for RDF (W3C Member Submission). Retrieved April 4, 2005, from http://www.w3.org/Submission/2004/SUBM-RDQL-20040109/

Shanmugasundaram, J., Shekita, E., Barr, R., Carey, M., Lindsay, B., Pirahesh, H., et al. (2001). Efficiently publishing relational data as XML documents. The VLDB Journal, 10(2–3), 133–154.

Sheth, A., Aleman-Meza, B., Arpinar, I.B., Bertram, C., Warke, Y., Ramakrishnan, C., et al. (2005). Semantic association identification and knowledge discovery for national security applications. Journal of Database Management, 16(1), 33–53.

Sintek, M., & Decker, S. (2001). TRIPLE: An RDF query, inference, and transformation language. In Proceedings of the 14th International Conference on Applications of Prolog; Workshop on Deductive Databases and Knowledge Management (pp. 47–56). Tokyo: Prolog Association of Japan.

Ullman, J.D. (1988). Principles of database and knowledge base systems (Vol. 1). Rockville, MD: Computer Science Press.

Wand, Y., Storey, V.C., & Weber, R. (1999). An ontological analysis of the relationship construct in conceptual modeling. ACM Transactions on Database Systems, 24(4), 494–528.

World Wide Web Consortium (W3C). (1999). XML Path Language (XPath) recommendation (Version 1.0). Retrieved April 4, 2005, from http://www.w3.org/TR/xpath

World Wide Web Consortium (W3C). (2004a). Extensible Markup Language (XML) 1.1 recommendation. Retrieved April 4, 2005, from http://www.w3.org/TR/2004/REC-xml11-20040204/

World Wide Web Consortium (W3C). (2004b). RDF Primer. Retrieved April 4, 2005, from http://www.w3.org/TR/rdf-primer/

World Wide Web Consortium (W3C). (2005). XQuery 1.0: An XML query language. Retrieved April 4, 2005, from http://www.w3.org/TR/xquery/

Yan, T.W., & Annevelink, J. (1994). Integrating a structured-text retrieval system with an object-oriented database system. In J.B. Bocca, M. Jarke, & C. Zaniolo (Eds.), Proceedings of the 20th International Conference on Very Large Data Bases (VLDB) (pp. 740–749). St. Louis, MO: Morgan Kaufmann.


Appendix

A Grammar Defining the Specific Syntax of the Query Language

The notation conforms to Extended BNF syntax as specified by ISO/IEC 14977:1996(E). Accordingly, the following symbols are used:

= defining-symbol
; terminator-symbol
, concatenation
| alternative sequences
“” or ‘’ (enclosed) terminal symbol
{} (enclosed) optional repetitive sequence
[] (enclosed) optional sequence
(**) (enclosed) comment
() parentheses (used in their mathematical sense)

query = form of result, “where”, condition sequence, “.”;
form of result = value reference, {“,”, value reference};
condition sequence = conjunction sequence | disjunction sequence;
condition subsequence = condition | (“(”, condition sequence, “)”);
conjunction sequence = condition subsequence, {“,”, condition subsequence};
disjunction sequence = condition subsequence, {“;”, condition subsequence};
value reference = semantic association | association | association type | entity | entity type | document | document collection | component | component name | literal;
semantic association = variable;
association = variable;
association type = name | variable;
entity = variable;
entity type = name | variable;
document = string | variable;
document collection = name | variable;
component = variable;
component name = name | variable;
literal = string | integer | decimal number | date;
name = lowercase letter, {alphanumeric character | “_”};
variable = (uppercase letter | “_”), {alphanumeric character | “_”};
condition = negation | comparison | type declaration | query primitive for semantic associations;
negation = “neg”, condition;
comparison = value reference, “ ”, (“=” | “<” | “>” | “��”), “ ”, value reference;
type declaration = (“assoc”, [association type, “ ”], association) | (“entity”, [entity type, “ ”], entity) | (“doc”, [document collection, “ ”], document) | (“comp”, [component name, “ ”], component) | (“attr”, [component name, “ ”], component) | (“elem”, [component name, “ ”], component);

query primitive for semantic associations = “sem_assoc”, semantic association, “between”, entity, “and”, entity, [filtering primitive 1], [filtering primitive 2], [filtering primitive 3], [filtering primitive 4], [filtering primitive 5], [filtering primitive 6], [filtering primitive 7] (* filtering primitives may occur in any order *);

filtering primitive 1 = “via {”, entity, {“,”, entity}, “}”;
filtering primitive 2 = “not_via {”, entity, {“,”, entity}, “}”;
filtering primitive 3 = “within_entity_types {”, entity type, {“,”, entity type}, “}”;
filtering primitive 4 = “without_entity_types {”, entity type, {“,”, entity type}, “}”;
filtering primitive 5 = “within_assoc_types {”, association type, {“,”, association type}, “}”;
filtering primitive 6 = “without_assoc_types {”, association type, {“,”, association type}, “}”;
filtering primitive 7 = “with_max_length”, unsigned integer;
string = “’”, character, {character}, “’”;
integer = [“+” | “-”], unsigned integer;
unsigned integer = number, {number};
decimal number = integer, “.”, unsigned integer;
date = year, “-”, month, “-”, day;
year = unsigned integer;
month = unsigned integer;
day = unsigned integer;
character = lowercase letter | uppercase letter | number | special symbol;

alphanumeric character = lowercase letter | uppercase letter | number;
lowercase letter = “a” | “b” | “c” | “d” | “e” | “f” | “g” | “h” | “i” | “j” | “k” | “l” | “m” | “n” | “o” | “p” | “q” | “r” | “s” | “t” | “u” | “v” | “w” | “x” | “y” | “z”;
uppercase letter = “A” | “B” | “C” | “D” | “E” | “F” | “G” | “H” | “I” | “J” | “K” | “L” | “M” | “N” | “O” | “P” | “Q” | “R” | “S” | “T” | “U” | “V” | “W” | “X” | “Y” | “Z”;

number = “0” | “1” | “2” | “3” | “4” | “5” | “6” | “7” | “8” | “9”;
special symbol = “ ” | “!” | '“' | “#” | “$” | “%” | “&” | “(” | “)” | “*” | “+” | “,” | “-” | “.” | “/” | “:” | “;” | “<” | “=” | “>” | “?” | “@” | “[” | “\” | “]” | “^” | “{” | “|” | “}”;
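As an illustration of how the rule for the query primitive for semantic associations might be recognized in practice, the following Python sketch uses regular expressions from the standard library. It is a lenient recognizer written for this appendix, not a full parser for the grammar: the function `parse_sem_assoc` and the shape of its result are assumptions, whitespace handling is more permissive than the terminals above, and the items inside the braces are not validated against the name and variable rules.

```python
import re

# variable = (uppercase letter | "_"), {alphanumeric character | "_"};
VARIABLE = r"[A-Z_][A-Za-z0-9_]*"

SEM_ASSOC = re.compile(
    rf"sem_assoc\s+(?P<x>{VARIABLE})\s+between\s+(?P<e1>{VARIABLE})"
    rf"\s+and\s+(?P<e2>{VARIABLE})(?P<filters>.*)$"
)
# Filtering primitives 1 to 6 share the shape: keyword, "{", items, "}".
SET_FILTER = re.compile(
    r"\b(via|not_via|within_entity_types|without_entity_types|"
    r"within_assoc_types|without_assoc_types)\s*\{([^}]*)\}"
)
# Filtering primitive 7: "with_max_length", unsigned integer.
MAX_LENGTH = re.compile(r"\bwith_max_length\s+(\d+)")


def parse_sem_assoc(text):
    """Split one sem_assoc primitive into its variable, end points, and filters."""
    match = SEM_ASSOC.match(text.strip())
    if match is None:
        raise ValueError(f"not a sem_assoc primitive: {text!r}")
    filters = {}
    # The filtering primitives may occur in any order, so each is searched for.
    for keyword, body in SET_FILTER.findall(match["filters"]):
        filters[keyword] = [item.strip() for item in body.split(",")]
    length = MAX_LENGTH.search(match["filters"])
    if length:
        filters["with_max_length"] = int(length.group(1))
    return {"x": match["x"], "between": (match["e1"], match["e2"]), "filters": filters}


print(parse_sem_assoc(
    "sem_assoc X between E1 and E2 not_via {E3, E4} "
    "within_assoc_types {parenthood} with_max_length 3"
))
```

Searching for each filter independently reflects the comment in the grammar that filtering primitives may occur in any order; a stricter parser would also reject unrecognized trailing text.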