A formal basis for an abbreviated concept-based query language

  • Published on
    02-Jul-2016

  • A formal basis for an abbreviated concept-based query language

    Vesper Owei a,*, Shamkant Navathe b

    a Information and Decision Sciences Department (M/C 294), University of Illinois at Chicago, Chicago, IL 60607, USA
    b College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA

    Received 22 September 1998; received in revised form 16 December 1999; accepted 3 March 2000

    Abstract

    Concept-based query languages allow users to specify queries directly against conceptual schemas. The primary goal of their development is ease-of-use and user-friendliness. However, existing concept-based query languages require the end-user to explicitly specify query paths in totality, thereby rendering such systems not as easy to use and user-friendly as they could be. The conceptual query language (CQL) discussed in this paper also allows end-users to specify queries directly against the conceptual schemas of database applications, using concepts and constructs that are native to and exist on the schemas. Unlike other existing concept-based query languages, however, CQL queries are abbreviated, i.e., the entire path of a query does not have to be specified. CQL is, therefore, an abbreviated concept-based query language. CQL is developed with the aim of combining the ease-of-use and user-friendliness of concept-based languages with the power of formal languages. It does not require end-users to be familiar with the structure and organization of the application database, but only with the content. Therefore, it makes minimal demands on end-users' cognitive knowledge of database technology without sacrificing expressive power. In this paper, the formal semantics and the theoretical basis of CQL are presented. It is shown that, while CQL is easy to use and user-friendly, it is nonetheless more than first-order complete. A contribution of this study is the use of the semantic roles played by entities in their associations with other entities to support abbreviated conceptual queries. Although only mentioned here in passing, a prototype of CQL has been implemented as a front-end to a relational database manager. © 2001 Published by Elsevier Science B.V. All rights reserved.

    Keywords: Abbreviated query formulation; Computer-supported query formulation; Concept-based query languages; Conceptual query language; Query language expressive power

    1. Introduction

    Query tools that depend on programming skill for their effective and efficient use impose a cognitive burden that may diminish users' productivity with the tools. This underscores the need for database query languages (DBQLs) that are matched to the skills and ability of end-users,

    Data & Knowledge Engineering 36 (2001) 109–151
    www.elsevier.com/locate/datak

    * Corresponding author. Present address: Division of Management Information Systems, University of Oklahoma, 307 West Brooks, Room 306, Norman, OK 73019-4007, USA. Tel.: +1-405-325-0768; fax: +1-405-325-7482.

    E-mail addresses: vesper@uic.edu (V. Owei), sham@cc.gatech.edu (S. Navathe).

    0169-023X/01/$ - see front matter © 2001 Published by Elsevier Science B.V. All rights reserved.
    PII: S0169-023X(00)00042-2

  • necessitating a rethinking of the DBQL design. Concept-based approaches to DB querying support the direct use of conceptual schemas and constructs that are either the same or similar to those in users' mental model. Therefore, concept-based DB querying naturally tends to fit the skills and ability of typical end-users. Conceptual DB querying will be needed with ever increasing demand as we place more and more complex databases on the Web. This need for concept-based information retrieval has led to research into concept-based DBQLs.

    However, because the primary motivation for the development of concept-based query languages is ease-of-use and user-friendliness, they tend to be weak in formalism. For example, visual query languages, which are only a sub-class of concept-based query languages, are usually very weak in expressive power [53]. This paper discusses the conceptual query language (CQL) [41], which is developed with the aim of combining the ease-of-use and user-friendliness of concept-based languages with the power of formal languages. CQL allows users to formulate queries in a very intuitive way without the need for them to learn about the schema (structure) of the database or to grapple with the syntactic complexity of command-based languages. It, therefore, makes minimal demands on end-users' cognitive knowledge of DB technology without sacrificing expressive power. Experiments in [43] show that end-users perform better with CQL than with alternative languages such as SQL; they also have a better perception of CQL. Our focus in this paper is on the theoretical basis and formal semantics of CQL. We show that, while CQL is easy to use, it is nonetheless more than first-order complete.

    The rest of the paper is organized as follows: In Section 2, we give an example to illustrate the motivation for this study. In Section 3, we discuss some related studies in conceptual query formulation and semantics-based querying. We formally define CQL in Section 4. Section 5 is devoted to discussing the functionality of the different modules of CQL. We examine the claims concerning the expressive power of CQL in Section 6. The paper concludes in Section 7 with a discussion of much earlier work in the development of conceptual interfaces and an examination of other issues, e.g., intelligent interfaces, that are important in interface design. A summary of the paper and an examination of its main contributions and limitations, as well as an indication of related studies planned for the future, are also given in the concluding section.

    2. Motivation

    Query specification in linear keyword languages (LKLs) like SQL, and in visual systems patterned after or similar to query-by-example (QBE), makes use of joins defined either during data definition or during query formulation. ACCESS and PARADOX are examples of QBE systems. Recent QBE implementations, for example in ACCESS, are able to perform joins once the tables to be joined have been specified by the user. This requires the joins to have been defined as relationships during table creation. Where needed joins are not defined, possible joins can be suggested to the user. The domain types of attributes can be used for this task. The existing commercial systems are unable to select joins automatically for the user. The ability to select definite joins is tantamount to specifying a particular query path; this requires the use of meta-knowledge about the schema, in the form of the meaning of a query path, to ensure the semantic correctness of the selected path. Such meta-knowledge is lacking in existing LKL and QBE systems.


  • We therefore ask the following thematic question: Given the rich semantics of data models like the ER model, is it possible to exploit the meta-knowledge about these models to reduce the cognitive load faced by end-users and so facilitate query formulation? This question deals with the issue of further enhancements to the query formulation methods in commercially popular LKL, drag-and-drop and point-and-click query tools.

    Since the mid-1980s a number of approaches using meta-knowledge about DB schemas to enhance the facility of end-users in query formulation have been proposed (for example, [1,14,15,35,44,47]). The recent prototypical approaches in [14,15,44,47] elevate query formulation from the logical level to the conceptual schema level by supporting the direct use of concepts and abstractions on conceptual schemas in query statements. Query formulation can be further facilitated in these systems by reducing the cognitive workload entailed by their use. One way this can be achieved is through minimizing what is required to be specified by the end-user. The system can then use schema meta-knowledge to determine and select a semantically correct query. CQL is based on this approach.

    2.1. Structure of the conceptual query language

    Current commercially popular LKL and QBE systems require users to explicitly mention all the tables needed by the system to solve the problem. Furthermore, in LKL and QBE systems the user must also specify query paths. This explicit navigation is a major source of difficulty for a typical end-user. In our proposed language, CQL, this cognitive burden in formulating DB queries is reduced by migrating much of this task to the underlying DBMS. Unlike LKL and QBE systems, query formulation in CQL does not require the user to specify all the tables needed to solve a query. Also, the user does not have to specify query paths. CQL is, therefore, particularly suitable for business and administrative end-users who, generally speaking, are not programmers.

    In CQL only the entities and conditions explicitly mentioned in query statements are required to be specified in their formulations. CQL has a simple and straightforward query syntax. The basic (canonical) form of a CQL query, Q, can be expressed as

    $$\text{Query}:\quad Q(t_E;\ S_E;\ \{C_{sel};\ C_{sem}\})$$

    where $t_E$ is the set of targets (entities and attributes about which information is sought), $S_E$ the set of sources (entities and attributes about which information is given or known), $C_{sel}$ the selection criteria/conditions, and $C_{sem}$ the semantic relationships between implicit sources and implicit targets and the entities semantically adjacent to them on the application conceptual schema. An implicit source is either a source or a target entity of the query. An implicit target may be the target of the query or an intermediate entity that is neither the source nor the target of the specified query, but lies on the query path. As discussed later, the specification of intermediate entities in CQL is optional.

    In formulating a query with CQL, therefore, the end-user only needs to state $t_E$, $S_E$, $C_{sel}$ and $C_{sem}$. The formulated query is then automatically passed to the underlying DBMS to determine and select the query path. The CQL system uses semantic information about the schema to perform these tasks. This information is in the form of the semantic roles played by schema entities in their relationships with other entities.


  • 2.2. Query abbreviation in the conceptual query language

    Concept-based or conceptual query interfaces reduce the cognitive load in querying DBs by allowing users to directly use constructs from conceptual schemas [24,13,41,47]. As exemplified in [14], instead of specifying the relational condition "Where s.sno = sp.sno and sp.pno = p.pno", concept-based interfaces would allow for a more natural specification like "Where Supplier supplies Parts". The CQL approach provides additional enhancement to this. Where intermediate entities exist on the query path between Supplier and Parts, CQL uses built-in meta-knowledge about the application schema to determine and select the correct intermediate entities. Therefore, in comparison to LKL and QBE queries, conceptual queries in CQL tend to be highly abbreviated, since the user is not required to specify the entire query path. The main problem with abbreviated queries is to derive the corresponding semantically correct full queries [35]. This concern naturally carries over to CQL queries. In this section we use an illustration to explain what CQL is, what its structure is and what it is trying to achieve. The illustration is based on Fig. 1, which is a semantically constrained entity-relationship diagram (SCERD)¹ of a university department.

    Fig. 1. Semantically enhanced ER diagram of a university department schema.

    ¹ SCERD contains other constructs that are used for updates. These have been left out in Fig. 1, since they are not pertinent to the discussion here.


  • In SCERD, entity types in the schema bear explicitly named relationships, or associations, among themselves. Each relationship has a semantic meaning. Double-headed arrows are used in an SCERD to indicate that the entities at both heads of the arrows have a direct semantic relationship, and the arrow-heads are labeled with the roles, e.g., works-for, can-teach, advises, etc., played by the entities in specific relationships. The association semantics of the relationships involving entities are constrained by the roles the entities play in the particular relationship. In SCERD, the meaning of the links between entities, therefore, lies in the form of roles. CQL supports the direct use of SCERD constructs in query formulation.

    Example. Suppose the following query is posed and specified on Fig. 1:

    Query 1: What course(s) is Marshall taking from associate professor Jones?

    An abbreviated CQL formulation of this query requires the user to specify only the stated entities Student, Teacher and Course along with a set of selection predicates on these entities. The system is then required to chart one or more paths through the conceptual schema from Student and Teacher to Course. We refer to such paths as derived paths. In addition to path derivation, the system must also be capable of performing any needed operations, e.g., conjunction or disjunction, on the derived paths. In this case, the meaning of the desired query demands that the sub-paths $\mathrm{Student} \rightarrow \cdots \rightarrow \mathrm{Course}$ and $\mathrm{Teacher} \rightarrow \cdots \rightarrow \mathrm{Course}$ be derived and conjunctively combined, where $\cdots$ indicates segments of the sub-paths that must be determined by the system. Furthermore, these segments must be such that the meaning of the resulting path is the same as that of the desired query. Clearly, the sub-path $\mathrm{STD}|\text{enrolled-in} \rightarrow \mathrm{CR}|\langle P \rangle \rightarrow \mathrm{C}$ is semantically correct. In this notation $P = \{p_1, p_2, \ldots, p_n\}$ is a set of paths, and $|\text{enrolled-in}$ denotes the role played by the Student entity on that path. The path derives its meaning from the totality of the semantics of the roles played by all the entities on the path.

    An examination of Fig. 1 shows that multiple paths exist between Student and Course and also between Teacher and Course. What complicates the problem here is that not all the paths have the same meaning. For example, the semantics of $\mathrm{STD}|\text{adviced-by} \rightarrow \mathrm{T}|\text{can-teach} \rightarrow \mathrm{C}$, i.e., the sub-path leading from Student to Course via Teacher, deals with the adviser–advisee relationship, and not with students taking classes. It would be semantically incorrect for the system to include this sub-path in constructing the query path.

    The task of the system, then, is twofold: (1) To determine $P_I \subseteq P$ and $P_{II} \subseteq P$ such that for each $p_i \in P_I$ and $p_k \in P_{II}$, $\mathrm{STD}|\langle P_I \rangle \rightarrow \mathrm{C}$ and $\mathrm{T}|\langle P_{II} \rangle \rightarrow \mathrm{C}$ are semantically correct. In CQL, meta-knowledge (in the form of the semantics of roles) about the relationships that the entities participate in is used to resolve this path ambiguity problem. (2) To select a $p_i$ and a $p_k$ from all the candidate paths in (1). A modification of the path selection algorithm in [45] is used for this. In the rest of the paper, the formal basis of CQL is presented. But first, we discuss some related studies in the next section.
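    The first half of this task can be illustrated with a minimal Python sketch. It is not the CQL implementation: it enumerates cycle-free paths on a small role-labelled graph and keeps only those whose first role matches the role stated in the query, which is a deliberate simplification of deriving meaning from all the roles on a path. The roles enrolled-in, adviced-by, can-teach and teaches are quoted from the text; the edges marked "assumed", the role names on them, and all function names are hypothetical stand-ins, since Fig. 1 is not reproduced here.

```python
from collections import defaultdict

# Role-labelled edges: (entity, role played by that entity, neighbouring entity).
# Edges marked "assumed" are hypothetical stand-ins for parts of Fig. 1.
EDGES = [
    ("STD", "enrolled-in", "CR"),
    ("CR", "registers-for", "SEC"),   # assumed edge and role name
    ("SEC", "offering-of", "C"),      # assumed edge and role name
    ("STD", "adviced-by", "T"),
    ("T", "can-teach", "C"),
    ("T", "teaches", "SEC"),
]


def build_graph(edges):
    """Index the edges by the entity that plays the role."""
    graph = defaultdict(list)
    for entity, role, neighbour in edges:
        graph[entity].append((role, neighbour))
    return graph


def simple_paths(graph, start, goal, path=None):
    """Yield every cycle-free path as a list of (entity, role) steps ending at goal."""
    path = path or []
    if start == goal:
        yield path + [(goal, None)]
        return
    visited = {entity for entity, _ in path}
    for role, neighbour in graph[start]:
        if neighbour not in visited and neighbour != start:
            yield from simple_paths(graph, neighbour, goal, path + [(start, role)])


def paths_with_role(graph, source, stated_role, target):
    """Keep only the paths on which the source plays the role stated in the query."""
    return [p for p in simple_paths(graph, source, target) if p[0][1] == stated_role]


graph = build_graph(EDGES)
student_paths = paths_with_role(graph, "STD", "enrolled-in", "C")
teacher_paths = paths_with_role(graph, "T", "teaches", "C")
```

    Under these assumed edges, three paths lead from STD to C, but only the one beginning with enrolled-in survives the filter; the adviser–advisee path through T is discarded, which mirrors the ambiguity discussed above.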

    3. Previous work on conceptual query formulation and semantics based querying

    As already mentioned, the main goal in developing concept-based query languages is to provide end-users with high-level, easy to use, and user-friendly interfaces for data manipulation. As far as


  • we are aware, the universal relation (UR) interface [32,33], RIDL [20,38] and NIAM [26,40,55,58] were among the earliest efforts in that direction. An examination of existing DBQLs seems to indicate a continuing trend in this direction. We examine only a sample set of existing related studies in this section.

    3.1. Enabling data manipulation through semantic paths on conceptual schemes

    Chang and Sciore [15] propose the UR with semantic abstraction (URSA) model, which is an extension of the UR interface. Instead of demanding a universally unique role for each attribute, as in the UR approach, the URSA model requires this uniqueness of role only within a limited set, called closure, of entities. Querying in URSA is based on the UR query paradigm. Its referencing scheme therefore forces a QUEL-type and an SQL-type syntax. This may render it not suitable for the generality of end-users. Peckham et al. [44] propose a DB design paradigm that abstracts the relationship semantics of application conceptual data models and uses this as a predictor of query and update paths.

    Peckham et al. show that association roles in semantic schemas define connection paths between objects, and these connections can be used to enable data manipulation. The URSA study shows that the semantics of the association among schema entities can be used to ensure the semantic correctness of queries. CQL extends these ideas by showing that the connection paths have meanings that are derived from the semantic meaning of the association roles, and that the path-meanings can be used to determine and select the correct paths of abbreviated conceptual queries.

    The Intuitive system [46] defines a very intuitive architecture for information retrieval that comprises four main modules. The multimodal interaction manager supports request specification using speech and pointing-and-clicking; the end-user component provides users with a visual interface and functionality for using large heterogeneous DBs. The third module, the intelligent dialog manager, interprets users' requests according to the task being performed by the user, while the fourth module, the data access layer, links the different components of the system together.

    Although the Intuitive system is aimed at end-user interaction with heterogeneous DBs, it is generic enough for non-heterogeneous and single DB scenarios. Each functional aspect of CQL can be associated with a component of Intuitive. For example, the multimodal interaction manager provides the functionality of computer-supported query formulation in CQL. CQL's set-handler can be identified with the dialog manager.

    The point-and-click mode of request formulation in Intuitive presents an ER schema to the user, who can then specify a query by selecting the subschema defining the desired query-path. Interesting similarities and differences between Intuitive and CQL exist here: With Intuitive, to formulate the query on persons who appear in an interview, the user highlights the entities Person and Interview and the relationship appears_in linking the two entities. In CQL this is specified as "Person appears_in Interview". For more complex queries involving longer paths, the Intuitive user still highlights the entire path on the schema; the CQL user does not. Intuitive supports multimedia data, text retrieval from documents, and exploratory search of hypertexts. Clearly, Intuitive is a much more comprehensive system than CQL, which is currently narrowly focused on the manipulation of data in a single DB via a semantic data model.


  • 3.2. Query-path specification, construction and selection

    Query formulation imposes a certain level of cognitive burden on users and therefore enhances or degrades the ease of use of a query language [10,14,43,47,57]. As exemplified by ConQuer-II [5], the graphical interface proposed by Mannino and Shapiro [34], and Graph Model [11], the common approach to query formulation requires the user to specify each query-path in its entirety.

    ConQuer-II [5] is a commercial concept-based query language based on the object-role modeling (ORM) paradigm [24–28]. Like SCERD, ORM models applications in terms of the semantic roles played by objects and entities in relationships. While SCERD is an enhancement of the ER model, ORM is based primarily on a binary data model. ConQuer-II allows queries to be formulated via paths through the conceptual schema. The query paths are constructed from the semantic roles of objects and entities. Data manipulation in the system proposed in [34] involves finding a path from a set of starting nodes through possible intermediate nodes and edges to a set of terminating nodes. In Graph Model, entities are conceived of as nodes and link semantic types as edges in a graph. Query formulation involves graphically selecting a set of source and target nodes, then drawing a set of edges between the selected sets of nodes, and finally specifying for each node a set of data retrieval criteria. Users select each node and edge on the graph-path between the source and the target. Graphs are manually manipulated until the desired query is obtained.

    CQL adopts these basic ideas but extends them by requiring users to specify only the endpoints, i.e., the starting and terminating entities and their relationship roles. The CQL system automatically deduces the correct intermediate nodes to use on a given query-path. Therefore, although CQL also allows queries to be formulated via paths through the conceptual schema, users are not required to specify paths in their entirety.

    Vizla [6] is a visual query language interface for the information control prototyping language SF [7]. In Vizla, a database is abstracted as a collection of sets (entities) and functions that map from this collection of sets to auxiliary sets (attributes). Queries are formulated in Vizla by pointing to representations of functions, their domains and codomains, or subsets of the domains and codomains, and to various operators in a conceptual model of a database. The items selected in this way are displayed and assembled graphically in a workspace, or window.

    The workspace concept is used in Vizla to reduce the cognitive burden query formulation imposes on end-users. It achieves this by allowing users to separate querying into sequences of small steps, save intermediate results of such sequences, and combine the intermediate results into final results. Ad-hoc queries can therefore be formulated and processed in this manner. This is an approach that we feel can be adopted, with certain modifications, by CQL to facilitate query formulation. The query-formulation-by-pointing approach in Vizla could be tedious and unappealing for complex queries with long query-paths through the schema. This is because users must point to the entire paths of queries and all the functions and operators needed for computation of the paths. The abbreviated querying approach in CQL, wherein only the terminal nodes and links of interest are selected, cuts down on the number of operations that users must perform and thereby improves on the Vizla approach.

    Vizla is a full-fledged, self-standing query language. On the other hand, CQL in its current prototype stage is a front-end to an underlying full-fledged query language like SQL. In addition


  • to its use as a query language, Vizla is also designed to function as a programming language. For this reason, it is aimed at being at least as expressive as a general programming language. As expected, it is perhaps more expressive than CQL. CQL can, therefore, additionally benefit from the work on Vizla as we further develop CQL's interface design and expressive power.

    Usually multiple paths exist for a query specified against a database schema. An approach to dealing with multiple paths is through abbreviated queries. In abbreviated queries users mention only a subset of the objects/entities of interest. The system then interprets the query formulation, i.e., finds the necessary connections between the objects. In the path-finding approach in [59], the query-induced map of the database schema is first pruned to make the graph less bushy. Then the shortest path in the pruned sub-graph is selected. This idea is borrowed by CQL in its path construction and selection algorithm. In [16], all possible paths are displayed; the user then selects a particular path. CQL does not display all possible paths. Instead, from the set of all candidate paths, it selects and displays the natural language transcription of only the minimum cost path.
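    The prune-then-select idea can be sketched as follows. This is a plain Dijkstra search written for illustration, not the algorithm of [59] nor CQL's modified selection algorithm; the unit costs, the edges marked "assumed", and the names prune and min_cost_path are all hypothetical.

```python
import heapq

# (entity, role, neighbour, cost); unit costs are an assumption of this sketch.
WEIGHTED_EDGES = [
    ("STD", "enrolled-in", "CR", 1),
    ("CR", "registers-for", "SEC", 1),   # assumed edge and role name
    ("SEC", "offering-of", "C", 1),      # assumed edge and role name
    ("STD", "adviced-by", "T", 1),
    ("T", "can-teach", "C", 1),
]


def prune(edges, excluded_roles):
    """Drop edges whose role is ruled out for this query (a stand-in for pruning)."""
    return [edge for edge in edges if edge[1] not in excluded_roles]


def min_cost_path(edges, source, target):
    """Dijkstra over role-labelled edges; returns (cost, path) or None if unreachable."""
    graph = {}
    for entity, role, neighbour, cost in edges:
        graph.setdefault(entity, []).append((role, neighbour, cost))
    heap = [(0, source, [source])]
    settled = set()
    while heap:
        cost, entity, path = heapq.heappop(heap)
        if entity == target:
            return cost, path
        if entity in settled:
            continue
        settled.add(entity)
        for role, neighbour, step in graph.get(entity, []):
            if neighbour not in settled:
                heapq.heappush(heap, (cost + step, neighbour, path + [role, neighbour]))
    return None


unpruned = min_cost_path(WEIGHTED_EDGES, "STD", "C")
pruned = min_cost_path(prune(WEIGHTED_EDGES, {"adviced-by"}), "STD", "C")
```

    With these assumed costs the unpruned search returns the shorter adviser–advisee path through T, while pruning the adviced-by edge first yields the longer enrolment path. This is why pruning on meaning has to precede the cost-based choice: the cheapest path is not necessarily the semantically correct one.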

    3.3. Arguing for concept-based query languages

    A number of studies have been conducted either to motivate the development of concept-based query languages or to demonstrate their superiority over other query paradigms. The study by Welty and Stemple [57] attempted to find how well users could learn relational query languages. It was concluded that users were having considerable difficulty with relational queries, and that the problem was not limited to any particular relational language. A discussion on comparative studies arguing for concept-based query languages can be found in [14,47]. Both studies compare SQL and the concept-based DBQL called the knowledge query language (KQL) [13]. While the study in [14] was atemporal, that in [47] was temporal in that it studied the effect of time on learning. Both studies showed that users of concept-based query languages outperform SQL users: Irrespective of time, the KQL users performed better than their SQL counterparts with respect to query accuracy, query formulation time, and user confidence. Additional empirical studies suggesting the superiority of concept-based data retrieval approaches over other query approaches can be found in [24,31,43]. All of these studies point to the need for alternative query paradigms. Concept-based approaches, as in CQL, clearly offer one such alternative.

    In [43], a statistical experiment was conducted to probe end-users' reaction to using CQL, vis-à-vis SQL, as a database query language. The comparison focused on the effect of the two different database query language interfaces on user performance (as measured by query formulation time, query correctness, and users' perception) in a query writing task with varying difficulty levels. Statistically significant differences between the two query languages were found.

    The results indicate that end-users perform better with CQL and have a better perception of it than of SQL. There were significantly more accurate formulations with CQL than with SQL. Also, the groups with CQL took significantly less time than the groups with SQL. The CQL subjects perceived their query language to be easier to use than their SQL counterparts felt about SQL; they also felt more satisfied with CQL than the SQL subjects were with SQL. These differences were more pronounced when query-difficulty level was considered. The statistical significance of the differences increased with the complexity of the query. The scores indicate that


  • users are more likely to perform better with CQL than with SQL, and that they are more likely to harbor a more favorable perception of it than of SQL.

    As discussed in [9,53], existing approaches aimed at enhancing end-user facility with query tools sacrifice expressive power for ease-of-use. This limits their use and applicability. This restriction is removed in CQL, since it is designed to be both easy to use and expressively powerful. As we show in this paper, CQL is founded on a strong formal basis. CQL is formally defined in the next section.

    4. Formal definition of CQL query context²

    CQL queries are specified and submitted in an application context consisting of a conceptual schema based on a semantic data model, an implied logical schema, and an underlying DB, which physically exists. In this study we assume a SCERD. A context is therefore defined as consisting of a conceptual schema and a database:

    context :: {conceptual_schema, database}
    conceptual_schema :: {conceptual_schema_name | entity_type, semantic_role, relationship, cardinality}
    conceptual_schema_name :: {name}

    The university department SCERD in Fig. 1 shows our application conceptual schema. This is seen to consist of entity-types, and semantic roles played by related entities. The relationship cardinalities are also indicated. An entity is seen to have a name and a set of defining attributes:

    entity_type :: {entity_type_name, attribute}
    entity_type_name :: {name}
    attribute :: {attribute_name, value}
    value :: {data_type | expression}
    data_type :: {boolean | number | integer | real | character | memo}

    For Fig. 1 the entity-types are Student (STD), Course-Registration (C-R), Section (SEC), Course (C), Teacher (T) and Secretary (SEK). The attributes for each entity-type are also shown in the figure.

    The semantic role of an entity in its direct association with another entity defines the role played by the entity in the relationship. The cardinality of the association defines the multiplicities of the relationship:

    semantic_role :: {semantic_role_name | semantic_role, entity_type_name}
    semantic_role_name :: {name}
    relationship :: {relationship_name | relationship, entity_type_name}
    cardinality :: {integer}

    ² The symbols used here are defined in Appendix A.


  • As an example, Fig. 1 shows that the Student entity plays the semantic role adviced-by in its direct association with the Teacher entity. Conversely, the Teacher entity plays an advises role in the relationship.

    Our logical schema is assumed to be relational, since for this study CQL is implemented on top of a relational DB. Since the user specifies CQL queries directly to the conceptual schema, there is no actual relational schema in the context. It is instead deduced from the conceptual schema by the translation mechanism that we discuss later. For this reason we view the logical schema as being implied by the conceptual schema. Given an application conceptual schema, therefore, we assume that a database is uniquely identified:

    database :: {database_name | database, relation}
    database_name :: {name}
    relation :: {relation_name | relation, column}
    relation_name :: {name}
    column :: {column_name | column, value}
    column_name :: {name}

    All the names in the context are strings of characters:

    name :: {word | name || word}
    word :: {string}
    string :: {character | string, character}
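    As a reading aid, the conceptual-schema part of the context can be mirrored by a few Python data classes. This is only a sketch under stated assumptions: the class and field names (for example played_by and toward) are hypothetical, cardinalities and the implied relational database are left out, and the attribute names Sname and Tname are taken from the example query given later in this section.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Attribute:
    name: str
    data_type: str  # one of the data_type alternatives, e.g. "character"


@dataclass(frozen=True)
class EntityType:
    name: str
    attributes: tuple = ()


@dataclass(frozen=True)
class SemanticRole:
    name: str        # e.g. "adviced-by"
    played_by: str   # entity_type_name of the entity playing the role
    toward: str      # directly associated entity (field name is an assumption)


@dataclass
class ConceptualSchema:
    name: str
    entity_types: dict = field(default_factory=dict)
    roles: list = field(default_factory=list)

    def add_entity(self, entity):
        self.entity_types[entity.name] = entity

    def add_role(self, role):
        # Reject roles that mention entity types missing from the schema.
        for end in (role.played_by, role.toward):
            if end not in self.entity_types:
                raise KeyError(f"unknown entity type: {end}")
        self.roles.append(role)

    def roles_of(self, entity_name):
        """Names of the roles the given entity plays in its direct associations."""
        return [r.name for r in self.roles if r.played_by == entity_name]


schema = ConceptualSchema("university_department")
schema.add_entity(EntityType("STD", (Attribute("Sname", "character"),)))
schema.add_entity(EntityType("T", (Attribute("Tname", "character"),)))
schema.add_role(SemanticRole("adviced-by", played_by="STD", toward="T"))
schema.add_role(SemanticRole("advises", played_by="T", toward="STD"))
```

    Storing one role per direction reflects the description of double-headed arrows whose two heads carry different role labels for the same relationship.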

    4.1. Definition of CQL queries

    A CQL query is composed of lexical terms that are semantically meaningful in the context of an application. A query comprises five constructs: target entities, source entities, intermediate entities, semantic relationships, and selection conditions. A query term derives its meaning from the construct to which it belongs. The specification of intermediate entities is not necessary, but optional. Consequently, the canonical form of a CQL query consists of four blocks:

    query :: {target_block; source_block; [intermediate_block]; semantic_relationship_block; selection_condition_block | query}

    (The intermediate block is enclosed within [ ] to indicate that it is optional.)

    Example. According to the definition of a query, our example Query 1 can be stated as

    Query Q{C⟨Cname⟩; STD⟨Sname = marshall⟩; T⟨Tname = jones⟩; ⟨Ttitle = associate professor⟩; ; STD enrolled-in CR And T teaches SEC}.

    This query statement transcribes into English as: Find the course(s) that student Marshall is taking from teacher Jones whose title is associate professor.


  • A target is an entity whose attribute value is sought:

    target_block :: {target, attribute}
    target :: {target_entity_name}
    target_entity_name :: {name}

    Example. Query 1 consists of the single target Course, with a target attribute of Cname.

    A source is an entity having a known or given attribute value:

    source_block ::= {source, attribute}
    source ::= {source_entity_name}
    source_entity_name ::= {name}

    Example. Query 1 has the two sources Student and Teacher, with student_name and teacher_name as the respective source attributes.

    A semantic relationship represents a direct association between entities. It is defined by the semantic roles played by the associated entities. In a query formulation only the semantic relationships involving the target entities and the source entities are stated by the user. Intermediate semantic relationships lying between targets and sources are deduced by the system. As mentioned above, the specification of intermediate entities and their semantic relationships is therefore optional in CQL:

    semantic_relationship_block ::= {semantic_relationship, logical_operator}
    semantic_relationship ::= {target_entity semantic_role | source_entity semantic_role | [intermediate_entity semantic_role] | null_semantic_relationship}

    Example. The example query has the following two semantic relationships: STD enrolled-in CR And T teaches SEC.

    The selection condition of a CQL query specifies a constraint on the values of the target attributes:

    selection_condition_block ::= {selection_condition_expressions}
    expression ::= {algebra_expression | null_expression}
    null_semantic_relationship ::= {empty_relationship}
    null_expression ::= {empty_expression}

    Example. Our example query does not contain any selection criteria. Thus the selection condition is a null expression.
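    As a reading aid, the blocks defined in this section can be mirrored by a small data structure. The following is only a sketch under stated assumptions: the class and field names (CQLQuery, EntityTerm, SemanticRelationship) and the attribute spellings (C_name, S_name, T_name, T_title) are illustrative choices, not part of CQL itself.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class EntityTerm:
    """An entity with one attribute; sources also carry the known value."""
    entity: str
    attribute: str
    value: Optional[str] = None  # None for targets: the value is what is sought


@dataclass(frozen=True)
class SemanticRelationship:
    """One 'entity role entity' statement, e.g. STD enrolled-in CR."""
    left: str
    role: str
    right: str


@dataclass
class CQLQuery:
    """Hypothetical container for the blocks of a CQL query."""
    targets: list[EntityTerm]
    sources: list[EntityTerm]
    semantic_relationships: list[SemanticRelationship]
    intermediates: list[str] = field(default_factory=list)  # optional block
    selection_condition: Optional[str] = None  # None stands for the null expression


# Query 1 of the running example, with assumed attribute spellings.
query1 = CQLQuery(
    targets=[EntityTerm("C", "C_name")],
    sources=[
        EntityTerm("STD", "S_name", "marshall"),
        EntityTerm("T", "T_name", "jones"),
        EntityTerm("T", "T_title", "associate professor"),
    ],
    semantic_relationships=[
        SemanticRelationship("STD", "enrolled-in", "CR"),
        SemanticRelationship("T", "teaches", "SEC"),
    ],
)
```

    Leaving intermediates empty and selection_condition unset corresponds to the abbreviated form of Query 1: no intermediate entities are named and the selection condition is the null expression.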

    4.2. Conceptual mapping of abbreviated queries

    The mapping of a CQL query to the underlying DB is a two-phase process. The first phase is the query-to-conceptual schema mapping, which derives the conceptual path of the query in the form of a sub-schema of the application schema. We refer to the resulting sub-schema as the conceptual answer, $S_a$, of the query. It follows, from the definition of a conceptual schema, that

    S_a ::= {conceptual schema}

    To be able to formally define this mapping process, we first define what we mean by a path on a conceptual schema. Our definition assumes a linear path.

    The mix-fix notation in [51] can be used to formally describe a linear path on a conceptual schema. If $R = \{R_1, R_2, \ldots, R_m\}$ and $E = \{E_1, E_2, E_3, \ldots, E_n\}$ are sets of names, then the mix-fix notations on these sets can be used to describe different paths involving the elements of the sets. Let $r = (r_1, r_2, \ldots, r_m)$ and $e = (e_1, e_2, e_3, \ldots, e_n)$ represent ordered instances of R and E. The mix-fix expression on r and e is given by:

    $p_i = \langle \{e\}, \{r\}, \mathrm{mfix} \rangle,$

    where $\{e\}$ and $\{r\}$ are respectively sets of schema entities and semantic roles on path $p_i$.

    $p_i = \mathrm{MFix}(\{r_j\}, \{e_i\}) = \mathrm{MFix}(r_1, r_2, \ldots, r_m; e_1, e_2, e_3, \ldots, e_n) = e_1 r_1 e_2 r_2 e_3 \ldots e_{n-1} r_m e_n.$

    Suppose, for example, that R is a set of semantic roles (also referred to as semroles) played by schema entities and E a set of schema entities, such that $r_1$ = takes, $r_2$ = taught-by, $e_1$ = Student, $e_2$ = Course, and $e_3$ = Teacher. The mix-fix of [take(s), taught-by] and [Student, Course, Teacher] is:

    MFix(takes, taught-by; Student, Course, Teacher) = Student takes Course taught-by Teacher

    Note that this describes a schema path linking the entities Student and Teacher via the intermediate entity Course. Note also that the inverse (or reverse) of this path can be verbalized as Teacher teaches Course taken-by Student. Both paths are semantically equivalent.
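    The interleaving performed by the mix-fix expression, and the reversal that yields the inverse path, can be sketched in a few lines. The function names mfix and inverse and the INVERSE_ROLE lookup are assumptions made for this illustration; only the role and entity names come from the example above.

```python
def mfix(roles, entities):
    """Interleave entities and roles as e1 r1 e2 r2 ... en (needs n = m + 1)."""
    if len(entities) != len(roles) + 1:
        raise ValueError("a linear path needs exactly one more entity than roles")
    parts = [entities[0]]
    for role, entity in zip(roles, entities[1:]):
        parts += [role, entity]
    return " ".join(parts)


def inverse(roles, entities, inverse_role):
    """Reverse a path, replacing every role by its inverse from a lookup table."""
    reversed_roles = [inverse_role[r] for r in reversed(roles)]
    return mfix(reversed_roles, list(reversed(entities)))


# Assumed inverse-role lookup for the two roles of the example.
INVERSE_ROLE = {"takes": "taken-by", "taught-by": "teaches"}

roles = ["takes", "taught-by"]
entities = ["Student", "Course", "Teacher"]

print(mfix(roles, entities))
# Student takes Course taught-by Teacher
print(inverse(roles, entities, INVERSE_ROLE))
# Teacher teaches Course taken-by Student
```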

    If $\{P\}$ is a set of connected paths, i.e., $\{P\} = \{p_i, \text{ for } i = 1, 2, \ldots, n\}$, then the conceptual answer of a CQL query is given by:

    $S_a = \{ L\{p_i\} : p_i \in \{P\} \mid \{P\} \subseteq S_c \},$

    where L is a path operation on the set of paths $p_i$ and $S_c$ is the database conceptual schema.

    According to this expression, the sub-schema defining the conceptual answer of a CQL query is a set of connected paths that are a subset of the DB conceptual schema.

    Example. For Query 1 the answer sub-schema is shown in Fig. 2. The linear paths satisfying the query on this figure are expressed in mix-fix notation as:

    p1 = student enrolled-in course-registration is-enrollment-for course
    p2 = student enrolled-in course-registration belongs-to section is-section-of course
    p3 = teacher teaches section is-section-of course
    p4 = teacher teaches section has course-registration is-enrollment-for course


  • The conceptual answer therefore consists of the logical combination of p1, p2, p3 and p4. That is:

    {P} = L{p1, p2, p3, p4}.

    In the next section, we expand on this to show the exact logical combination of the paths.

    The second phase in the mapping of CQL queries to the application DB consists of mapping the conceptual answer $S_a$ to the DB, which is assumed to be relational for this study. This conceptual answer-to-logical database mapping process employs the standard ER (EER)-to-relation transformation rules found in database textbooks, e.g., [21,37]. We, therefore, do not present the rules here, but instead discuss the mechanism of the mapping.

    Path $p_i \in \{P\}$ if it is the conceptual answer or a subset of the conceptual answer. This requires $p_i$ to be mapped onto existing DB relations. From the mix-fix notation, $p_i$ has the form $p_i = e_1 r_1 e_2 r_2 e_3 \ldots e_{n-1} r_m e_n$. The mapping mechanism is as follows:
    1. Each $e_i$ is assumed to be simple and can therefore be mapped directly onto a relation.
    2. Each $e_1 r_1 e_2$ defines a binary relationship between $e_1$ and $e_2$ and a join, at the logical level, between the associated relations.
    3. Higher-order relationships are recast as a set of inter-related linear paths interlocking the participating entities.

    From the above steps, the relations pertinent to the specified query are identified and extracted.

    Fig. 2. Search space, or candidate solution paths, for Query 1.

    Example. For our Query 1, the pertinent tables are COURSE, SECTION, COURSE-REGISTRATION, STUDENT, TEACHER. In a later section on SQL statement generation, we show how these tables are input into the SQL statement generating process.

    In the next section, we discuss the main functional modules of the CQL system. These modules take as input the query formulated by the user and process it to output (1) a recast of it in natural language, and (2) an SQL statement equivalent of the query.

    5. Functional modules

    CQL is founded on set and graph theory. As Fig. 3 shows, the CQL system comprises the following three functional components: (1) input functions, (2) processing functions, and (3) output functions. These components are discussed in this section. Examples are used to illustrate the different functional concepts.

    5.1. Input functions

    Query formulation in CQL assumes the existence of a schema (like an ER diagram) based on a semantic data model, against which queries are specified. In formulating a query, the user is aided by a computer-supported query formulation system, which is discussed next.

    5.1.1. Query formulation surrogate system (QFSS)

    CQL uses a QFSS to help users tailor their queries to the target DB. QFSS provides information that enables users to reduce what Belkin [5] describes as users' anomalous state of knowledge. This is achieved by reducing the semantic gap between user query formulations and the logical and conceptual states of the DB. In using CQL, users can familiarize themselves with the conceptual aspects of the target DB through an interaction with QFSS. QFSS provides users with helpful information on the schema concepts and constructs that they must be familiar with in order to be able to formulate queries. To use QFSS, the user clicks on the item about which information is needed. A window is then opened showing information on the clicked item. Information on the following CQL constructs and concepts is provided:

    Selection operators. These are the operations and functions supported by CQL. They are used in the selection conditions of queries. Arithmetic and logical operators (;;

  • Semantic role list. The semantic roles of the schema entities are contained on this list. If a role on this list is clicked, a window opens to show the pairs of entities related through the clicked role, and the semantic roles involving the pair of entities.

    In providing the query formulation surrogate system, the intention is to provide inexperienced or novice users with help to ease the task of formulating queries. The use of this help system is not mandatory. Expert or knowledgeable users can bypass it and directly formulate their queries.

    5.1.2. Browsing facility of the QFSS

    In specifying queries in CQL, users can write the query anew, directly on a CQL input interface in a query formulation window (QFW), or workspace. In the current prototype of CQL, the interface shown in Fig. 4 is used for this mode of interaction. Alternatively, they can compose the query with the help of the QFSS, by clicking on the desired items. The chosen items are written to the input interface in the QFW. In the QFW, the query can then be used as is, if the user believes it is semantically equivalent to the desired query, or modified as necessary. To facilitate the latter mode of query specification via the QFSS, the QFSS windows are hyperlinked to support navigation between windows. In the following, we illustrate the use of the QFSS with our example query.

    Fig. 3. Functional modules of CQL.

    Fig. 4. CQL query formulation interface.


  • Fig. 5 shows an example of the CQL query screen. To use QFSS help, the user clicks the upper button of this figure. This action takes the user to the top screen of Fig. 6. The user can request help on schema information by clicking one or more of the buttons on this screen. Clicking on the Entities button, for example, pops up the listing of entities in the bottom screen of Fig. 6. For our example query one of the entities to be selected is Student. On clicking this entity, the descriptive attributes of the entity are displayed (see Fig. 6). The user then picks the desired attributes of Student. All the selected items are written to the input interface shown in Fig. 4. The rest of the form can be filled similarly.

    5.2. Processing functions

    CQL uses the meta-schema diagram in Fig. 7. This figure shows the different types of meta-schema information that support the internal processing of CQL queries. The structures in Fig. 7 are schema-dependent, but query-invariant. Hence they are created during design time, are persistent, and are defined as follows:
    1. Entity-attribute. Comprehensive list of attributes and the entities with which they are associated in the application DB.
    2. Entity-table. Comprehensive list of entities in the application DB.
    3. Link adjacency matrix (LAM). Contains information about pairs of entities that are logically adjacent, i.e., that bear direct semantic relationship on the DB schema.
    4. Connectivity. Lists pairs of entities that can be connected directly (as in the LAM) or indirectly through a set of intervening entities.
    5. Link semantic dictionary (LSD). Specifies the semantic relationships that exist between two directly connected entities.
    6. Join-matrix. Basically an extension of the LAM. It details the relationships of LAM entity-pairs by indicating the key columns used for joining entities (or tables at the virtual logical level).

    Fig. 5. CQL query screen.

    Fig. 6. QFSS help window on schema entities.


  • Furthermore, the figure shows the relationships between the meta-data items. For example, it shows that an attribute is part of an entity, etc. One use of this kind of information is the design of meta-data integrity checks and error messages. As an example, since every attribute belongs to an entity, it follows that any attribute specified in a query must exist on the schema as an attribute of an entity on the schema. This and similar checks are performed prior to the processing of queries.

    Input validation. The first step in processing a CQL query is input validation. For this purpose, the schema is internally represented by the entity table and the entity-attribute table, which are defined as follows:

    Entity_Table ::= {entity_type, data_type}
    Entity_Attribute_Table ::= {entity_type_name, attribute, data_type}

    The validation of a specified entity is done by checking the entity against the entity-table, to ensure that it actually exists on the schema. Each specified attribute is similarly validated against the entity-attribute table. Where a specified entity or an attribute does not exist on the DB schema, an error message is returned to the user. Validation is automatically done by the CQL system.
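    The two lookups described above amount to membership tests against the entity table and the entity-attribute table. A minimal sketch follows; the table contents, the attribute spellings, the function name validate and the wording of the error messages are all assumptions made for illustration, and data types are omitted.

```python
# Hypothetical contents of the two validation tables for the running example.
ENTITY_TABLE = {"STUDENT", "TEACHER", "COURSE", "SECTION", "COURSE-REGISTRATION"}
ENTITY_ATTRIBUTE_TABLE = {
    ("STUDENT", "S_name"),
    ("TEACHER", "T_name"),
    ("TEACHER", "T_title"),
    ("COURSE", "C_name"),
}


def validate(terms):
    """Check (entity, attribute) pairs; return error messages, empty if all are valid."""
    errors = []
    for entity, attribute in terms:
        if entity not in ENTITY_TABLE:
            errors.append(f"unknown entity: {entity}")
        elif (entity, attribute) not in ENTITY_ATTRIBUTE_TABLE:
            errors.append(f"unknown attribute {attribute} for entity {entity}")
    return errors


print(validate([("COURSE", "C_name"), ("STUDENT", "S_name")]))
# []
print(validate([("DEPARTMENT", "D_name"), ("STUDENT", "T_title")]))
# ['unknown entity: DEPARTMENT', 'unknown attribute T_title for entity STUDENT']
```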

    Parse input. CQL transcribes specified queries into a set-specification form. The set of sources, targets and conditions is written to a Set Handler, which is a template of semantically typed slots, or place-holders, for sources, targets, and conditions (selection and semantic). The Set Handler is an abstraction for the CQL input form; its internal representation is that of an intermediate set form. The set representation makes it amenable to set theoretical treatment. The purpose of the Set Handler is to decompose specified queries into sub-queries, which are then used to derive the corresponding paths on the DB schema. The Set Handler performs two functions: (1) set-packing and (2) set-unpacking.

    Set-packing. Set-packing extracts the specified sources, targets and conditions, and writes them to the query template. This task is performed by the set-packer.

    Fig. 7. CQL meta-schema diagram.


  • Example. For our example query, the Set Handler is packed as:

    S = Q{C⟨C_name⟩; STD⟨S_name = marshall⟩; T⟨T_name = jones⟩; ⟨T_title = associate professor⟩; _; STD enrolled-in CR And T teaches SEC}.

    According to this, the set-packer identifies and extracts Course as the target, Student and Teacher as the set of sources, and Std enrolled-in CR and T teaches SEC as the semantic relationships.

    Set-unpacking. Set-unpacking uses set-theoretic operations to fragment the query written to the set-packer into its subqueries:

    Q1 : S1 = {C⟨C_name⟩; STD⟨S_name = marshall⟩; _; Std enrolled-in CR} = S1{C; STD; _; Std_sem} and
    Q2 : S2 = {C⟨C_name⟩; T; _; T teaches SEC} = S2{C; T; _; T_sem}

    (The _ indicates an unspecified, possibly null, value.)

    Connectivity validation. In processing a CQL query, the system traverses a path between schema entities. The highly abbreviated nature of CQL queries requires the system
    1. to first unpack a query into subqueries,
    2. to possess the capability of determining the set of semantically correct adjacent pairs of entities on a query path,
    3. to ensure that at least one connected path exists between sources and targets.

    These functions are discussed here.

    Each initial subquery resulting from the unpacking of the query is further decomposed into irreducible subqueries. An irreducible subquery is equivalent to a linear path between a source and a target. The linear paths (or subqueries) are then logically combined into candidate solution paths for the query. The process is illustrated in Fig. 8. According to this figure, for each initial subquery, a set of shortest paths is constructed between the source and the target. The process can be algorithmically described as follows:
    1. For each subquery, pick a specified relationship semantic role (for brevity, we simply refer to this as semantic condition). Let the chosen semantic condition be tagged as semantic k.
    2. Determine the linear paths linking semantic k to the specified target j in the subquery (done by Procedure Tree in Fig. 8):
        2.1. Read the entities in positions 1 and 2 of Sempos_Table[k]. (Sempos_Table holds the specified semantic conditions and is defined below.)
        2.2. If the entity in position 1 (i.e., entity 1) of Sempos_Table[k] is in the first position of the LAM, generate a directed acyclic graph having the source as its root entity (performed by Procedure Select_LAM).
        2.3. Otherwise, if entity 1 of Sempos_Table[k] is not in the first position of the LAM, rearrange the LAM so that entity 1 is in position 1 of the LAM (performed by Procedure Select_DLAM).
    3. From the LAM or DLAM, generate a directed acyclic graph from semantic k to the target:
        3.1. Tag entity 1 in the LAM or DLAM, as the case may be, as read.
        3.2. Update Procedure Tree with the entity in position 2 (i.e., entity 2) of Sempos_Table[k] set equal to entity 1.
    4. Repeat step 3 until entity 2 equals the target.
    5. Write the linear path generated (the tagged entities) into the Validation_Path matrix (VP matrix), defined below.

    The query-invariant LAM is defined as:

    LAM ::= {obj1, obj2, cost}
    obj1 ::= {entity_name}
    obj2 ::= {entity_name}
    cost ::= {cost of traversing obj1 to obj2}

    Sempos_Table and the VP matrix are volatile, since they are query-variant, and are automatically generated anew for each new query. They are defined as follows:

    Semantic Position Table ::= Sempos_Table(k, $e_{k,1}$, $e_{k,2}$)

    where k is the kth row of the table, $e_{k,1}$ the entity occupying position 1 of the kth row of the table, and $e_{k,2}$ is the entity occupying position 2 of the kth row of the table.

    Fig. 8. Query path generation in CQL.


  • Entries in this table are of the form $(k, e_{k,1}, e_{k,2})$. This has the mix-fix semantics $e_{k,1}\, r_k\, e_{k,2}$, where, as defined before, $r_k$ is the semantic role associating $e_{k,1}$ and $e_{k,2}$. For example, in the semantic relationship T teaches SEC of subquery Q2 above, T = $e_{2,1}$, SEC = $e_{2,2}$ and teaches = $r_2$. Similarly, for Q1, Std = $e_{1,1}$, CR = $e_{1,2}$ and $r_1$ = enrolled-in. Therefore, from the initial unpacking of our example query, Sempos_Table is instantiated as {(1, STUDENT, COURSE-REGISTRATION), (2, TEACHER, SECTION)}.
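    The step from the packed query to the Sempos_Table can be pictured as follows. This is a sketch only: it assumes one subquery per user-specified semantic relationship, uses the entity abbreviations of the running example, and the names unpack and build_sempos_table are hypothetical.

```python
def unpack(targets, semantic_relationships):
    """Split a packed query into one subquery per specified semantic relationship."""
    return [
        {"targets": list(targets), "source": left, "relationship": (left, role, right)}
        for left, role, right in semantic_relationships
    ]


def build_sempos_table(subqueries):
    """Rows (k, e_k1, e_k2), with k numbered from 1 as in the text."""
    return [
        (k, sq["relationship"][0], sq["relationship"][2])
        for k, sq in enumerate(subqueries, start=1)
    ]


subqueries = unpack(
    targets=["C"],
    semantic_relationships=[("STD", "enrolled-in", "CR"), ("T", "teaches", "SEC")],
)
sempos_table = build_sempos_table(subqueries)
print(sempos_table)
# [(1, 'STD', 'CR'), (2, 'T', 'SEC')]
```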

    Validation_Path Matrix (VP Matrix) ::= set{p_1, p_2, ..., p_k} ³

    $p_i = \text{n-tuple}\{e_{i,1}, e_{i,2}, e_{i,3}, \ldots, e_{i,n}\} = \langle \{e\}, \{r\}, \mathrm{mfix} \rangle = \text{mix-fix}(r_1, r_2, \ldots, r_m; e_1, e_2, e_3, \ldots, e_n)$

    It is seen that $p_i$ is the ordered sequence of entities on the ith path from a source to a target. Therefore, the VP matrix holds the set of linear paths (corresponding to an irreducible subquery) from a source to a target. For our running example, the VP matrix holds the values {(C, CR, STD), (C, SEC, CR, STD), (C, SEC, T), (C, CR, SEC, T)}.
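    To make the contents of the VP matrix concrete, the sketch below enumerates the cycle-free paths between each source and the target over a small LAM. It is a plain depth-first search, not the paper's own path-finding procedure. The LAM rows assume that the example schema contains only the five entities of Query 1 with unit edge costs, and the optional limit parameter is a hypothetical stand-in for the user-supplied bound on the number of paths per source-target pair.

```python
# Assumed LAM rows (obj1, obj2, cost) for the five entities of Query 1.
LAM = [
    ("STD", "CR", 1),
    ("CR", "C", 1),
    ("CR", "SEC", 1),
    ("SEC", "C", 1),
    ("T", "SEC", 1),
]


def neighbours(lam):
    """Undirected adjacency lists built from the LAM rows."""
    adjacency = {}
    for obj1, obj2, _cost in lam:
        adjacency.setdefault(obj1, []).append(obj2)
        adjacency.setdefault(obj2, []).append(obj1)
    return adjacency


def linear_paths(lam, source, target, limit=None):
    """All simple paths from source to target, shortest first, at most `limit` of them."""
    adjacency = neighbours(lam)
    found = []

    def walk(path):
        node = path[-1]
        if node == target:
            found.append(tuple(path))
            return
        for nxt in adjacency.get(node, []):
            if nxt not in path:  # never revisit an entity: keeps the path linear
                walk(path + [nxt])

    walk([source])
    return sorted(found, key=len)[:limit]


# Listed target-first, to match the way the VP matrix values are written in the text.
vp_matrix = [
    tuple(reversed(p))
    for p in linear_paths(LAM, "STD", "C") + linear_paths(LAM, "T", "C")
]
print(vp_matrix)
# [('C', 'CR', 'STD'), ('C', 'SEC', 'CR', 'STD'), ('C', 'SEC', 'T'), ('C', 'CR', 'SEC', 'T')]
```

    Exhaustive enumeration like this grows quickly with schema size, which is one reason a bound on the number of generated paths is useful for large schemas.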

    Semantics of an irreducible linear path. As expressed above, linear path $p_i$ is $e_{i,1}\, r_{1,2}\, e_{i,2}\, r_{2,3}\, e_{i,3}\, r_{3,4} \ldots r_{n-1,n}\, e_{i,n}$ in mix-fix notation, where $r_{j,k}$ is the semantic role associating $e_{i,j}$ and $e_{i,k}$ on the path. This can be expressed alternatively as:

    $p_i = (e_{i,1}\, r_{1,2}\, e_{i,2}) \cap (e_{i,2}\, r_{2,3}\, e_{i,3}) \cap (e_{i,3}\, r_{3,4}\, e_{i,4}) \cap \ldots \cap (e_{i,n-1}\, r_{n-1,n}\, e_{i,n}) = \bigcap \{ e_{i,j}\, r_{j,k}\, e_{i,k} \}_{j = 1, 2, \ldots, n-1 \text{ and } k = j+1}.$

    Since the terminal entities $e_{i,1}$ and $e_{i,n}$ are source and target entities, respectively, the user specifies $e_{i,1}\, r_{1,2}\, e_{i,2}$ and $e_{i,n-1}\, r_{n-1,n}\, e_{i,n}$ as semantic role relationships in the query formulation. It follows that $p_i$ can be expressed as:

    $p_i = (e_{i,1}\, r_{1,2}\, e_{i,2}) \cap \{ e_{i,j}\, r_{j,k}\, e_{i,k} \}_{j = 2, \ldots, n-2 \text{ and } k = j+1}.$

    The path segment $\{ e_{i,j}\, r_{j,k}\, e_{i,k} \}_{j = 2, \ldots, n-2 \text{ and } k = j+1}$ consists of intermediate entities and semantic role relationships and is automatically determined by Algorithm Prune-First, CQL's path finding and selection algorithm given in the appendix. In automatically deducing the intermediate path segments, the value for $e_{i,j}\, r_{j,k}\, e_{i,k}$ is read from the non-volatile link semantic dictionary (LSD) defined as:

    LSD ::= {obj1, obj2, semantic_role, inverse_semantic_role}
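    A sketch of how such a dictionary could supply the roles along a path is given below. The LSD rows are assembled from the role names this paper uses for Query 1; the helper names role_lookup and verbalize are hypothetical, and a path is assumed to be a sequence of entity abbreviations.

```python
# LSD rows (obj1, obj2, semantic_role, inverse_semantic_role) for the running example.
LSD = [
    ("STD", "CR", "enrolled-in", "enrolls"),
    ("CR", "C", "is-enrollment-for", "has"),
    ("CR", "SEC", "belongs_to", "has"),
    ("SEC", "C", "is_section_of", "consists_of"),
    ("T", "SEC", "teaches", "is_taught_by"),
]


def role_lookup(lsd):
    """Map each ordered entity pair to the role read in that direction."""
    roles = {}
    for obj1, obj2, role, inverse_role in lsd:
        roles[(obj1, obj2)] = role
        roles[(obj2, obj1)] = inverse_role
    return roles


def verbalize(path, lsd):
    """Insert the LSD role between every adjacent pair of entities on the path."""
    roles = role_lookup(lsd)
    words = [path[0]]
    for left, right in zip(path, path[1:]):
        words += [roles[(left, right)], right]
    return " ".join(words)


print(verbalize(("STD", "CR", "SEC", "C"), LSD))
# STD enrolled-in CR belongs_to SEC is_section_of C
print(verbalize(("C", "SEC", "T"), LSD))
# C consists_of SEC is_taught_by T
```

    Storing the inverse role alongside the forward role is what lets the same row serve a path traversed in either direction.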

    Example. For our running example Query 1, the derived semantic paths are:

    P1 = ⟨STD enrolled-in CR is-enrollment-for C⟩ (or the inverse ⟨C has CR enrolls STD⟩)
    P2 = ⟨STD enrolled-in CR belongs_to SEC is_section_of C⟩ (or the inverse ⟨C consists_of SEC has CR enrolls STD⟩)
    P3 = ⟨T teaches SEC is_section_of C⟩ (or the inverse ⟨C consists_of SEC is_taught_by T⟩)
    P4 = ⟨T teaches SEC has CR is_enrollment_for C⟩ (or the inverse ⟨C has CR belongs_to SEC is_taught_by T⟩)

    Candidate path selection. For the purpose of query-validation, we return only one candidate solution path to the user. The underlying cost criterion we use to select a path is the total number of edges (or arcs) on the path, i.e., the path length. (We opt for this simple cost criterion for demonstration purposes only, since the issue of cost criteria is orthogonal to the efficacy of the CQL system. The cost criterion can be readily changed.) The objective is to choose the minimum-cost candidate solution path.

    ³ CQL gives users the option of specifying a value for k, thereby limiting the number of paths generated for each source-target pair. This is especially important for very large DBs (VLDBs).

    Our path selection approach forms clusters of the irreducible linear paths, with each cluster consisting of the minimum cost linear paths between a source and a target. A candidate path for a query is constructed by selecting one linear path from each cluster and then logically combining them. The minimum cost linear path clusters for our example are:

    P1 = ⟨STD enrolled-in CR is-enrollment-for C⟩ (or the inverse ⟨C has CR enrolls STD⟩)
    P3 = ⟨T teaches SEC is_section_of C⟩ (or the inverse ⟨C consists_of SEC is_taught_by T⟩)

    It can be easily shown that for a multiple-source-single-target (MSST) query, the upper bound on the cost of a candidate path is

    $C_{N,1} = \sum_{i=1}^{N} c(L_i) - \left\{ \sum_{k=2}^{N} \frac{N!}{(N-k)!\,k!}\,(k-1)\; c\!\left( \bigcap_{i=1}^{k} L_i \right) \right\},$

    where $c(L_i)$ is the cost of linear path $L_i$, $\bigcap_{i=1}^{k} L_i$ is the intersection of k linear paths and N is the number of sources. For a multiple-source-multiple-target (MSMT) query, it can be shown that the upper bound on the cost of a candidate path is

    $C_{N,M} = M \cdot C_{N,1},$

    where N is the number of sources and M the number of targets.

    For our example MSST query, N = 2 and M = 1. If $C_{j,k}$ denotes the cost of the conjunctive solution path $p_j . p_k$, then by our cost criterion, it is seen from the cost function and Fig. 9 that $C_{1,3} = 4$. While we do not pursue it further, it can be shown that other minimum-cost candidate path combinations may exist from the set of clusters. For example, in our case, $p_2 . p_3$ and $p_1 . p_4$ have costs $C_{2,3} = 4$ and $C_{1,4} = 4$, respectively, and are, therefore, minimum cost candidate paths. Furthermore, it can be shown that the candidate path chosen by our approach is guaranteed to be a member of the set of minimum cost candidate paths. ⁴
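    The costs quoted for the example can be checked by counting distinct edges. The sketch below assumes that a linear path is a sequence of entity abbreviations, that every edge has unit cost, and that the cost of a combined path is the number of distinct edges it contains; the function names are hypothetical. For N = 2 the bound reduces to c(L1) + c(L2) - c(L1 ∩ L2), which is what bound_two_sources computes.

```python
from itertools import product


def edges(path):
    """Undirected edges of a linear path given as a sequence of entities."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def combined_cost(paths):
    """Number of distinct edges in the union of the given linear paths."""
    return len(set().union(*(edges(p) for p in paths)))


def bound_two_sources(first, second):
    """The N = 2 case of the cost bound: c(L1) + c(L2) - c(L1 intersect L2)."""
    return len(edges(first)) + len(edges(second)) - len(edges(first) & edges(second))


def min_cost_cluster(paths):
    """The minimum-length linear paths among those of one source-target pair."""
    best = min(len(edges(p)) for p in paths)
    return [p for p in paths if len(edges(p)) == best]


p1 = ("STD", "CR", "C")
p2 = ("STD", "CR", "SEC", "C")
p3 = ("T", "SEC", "C")
p4 = ("T", "SEC", "CR", "C")

# One linear path is taken from each cluster and the two are combined.
chosen = (min_cost_cluster([p1, p2])[0], min_cost_cluster([p3, p4])[0])
print(chosen == (p1, p3), combined_cost(chosen))
# True 4
print(combined_cost([p2, p3]), combined_cost([p1, p4]), combined_cost([p2, p4]))
# 4 4 5
print(min(combined_cost(pair) for pair in product([p1, p2], [p3, p4])))
# 4
```

    On this example the chosen combination p1.p3 attains the minimum over all four pairings, and the shared edge in p2.p3 and p1.p4 is what brings their cost down from 5 to 4.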

    Natural language validation of CQL queries. To ensure the semantic correctness of constructed queries, user validation is essential in abbreviated queries. In CQL, the semantic roles played by entities in relationships are also used by the system to construct pseudo-natural language explanations of queries. The system-constructed explanations are returned to the user for validation. To facilitate legibility, the system does not generate unnecessarily lengthy sentences. ⁵ According to [18,23,56], this is the recommended and an effective way of obtaining a good compromise between natural language and readability, focus and relevance. The NL aspect of the CQL system is discussed in this section.

    ⁴ In an alternative path selection approach for CQL, the user specifies the maximum number of candidate paths to be generated. Therefore, if the value k is specified, the system then constructs and returns up to, but not more than, k minimum cost candidate paths. For this alternative, if the user had specified 3, or even 4, candidate path combinations P1.P3, P2.P3, and P1.P4 will be generated and returned.

    ⁵ For long query paths, CQL provides the user with the option of eliding the masked portions of the path (as in Fig. 8) from the natural language (NL) explanation, to facilitate understanding.

    The NL explanation module consists of two components. The first is a sub-path transcriber, which transcribes each irreducible linear path selected for candidate path construction into pseudo-English. The second is a logical path synthesizer which combines the transcribed linear paths into a single system-explanation of the specified query. The context-free grammar of CQL's natural language explanation of queries is described in Backus–Naur form (BNF).

    5.3. BNF for CQL natural language explainer

    The basic, or canonical, form of the pseudo-NL explanation of a CQL query is formally expressed as:

    CQL Pseudo-NL Canonical Structure ::= {search_verb, search_clause, search_predicate_clause, ⟨known_attribute_value, semantic_relationship_clause, join_condition⟩}
    search_verb ::= {FIND}
    search_clause ::= {target-attribute-comma-list}
    search_predicate_clause ::= {SUCH THAT}
    known_attribute_value ::= {source-attribute-comma-list}
    semantic_relationship_clause ::= {semantics-predicate}
    join_condition ::= {join-attribute-expression}

    Based on this definition the pattern of the NL explanation of a CQL query is:

    FIND ⟨[target-attribute-comma-list and] target-attribute⟩ SUCH THAT ⟨semantics-predicate-semi-colon-list⟩ |[, and ⟨source-attribute-comma-list and] source-attribute⟩, and ⟨[[join-attribute-expression-semi-colon-list and] join-attribute-expression]⟩|

    The symbols [ ] and |, respectively, represent multiple terms and optional expressions that may be missing from some queries. The algorithm that generates the pseudo-English explanations of queries can be found in [41] and is available upon request from the first author.

    Example. For the selected linear paths p1 and p3 of our example query, the pseudo-natural language transcription generated is shown in Fig. 10. This statement is returned to the user for validation.
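    To give a feel for the FIND ... SUCH THAT pattern, the sketch below assembles a sentence from the pieces of Query 1. It is not the explanation algorithm of [41] and is not claimed to reproduce the wording of Fig. 10: the function names, the attribute spellings, and the choice to omit the join-attribute expressions are all assumptions made for illustration.

```python
def comma_and(items):
    """Render 'a', 'a and b' or 'a, b and c' (the '[list and] item' pattern)."""
    items = list(items)
    if len(items) <= 1:
        return "".join(items)
    return ", ".join(items[:-1]) + " and " + items[-1]


def explain(target_attributes, semantic_predicates, source_conditions):
    """Build a pseudo-English sentence; the source-condition part is optional."""
    text = f"FIND {comma_and(target_attributes)} SUCH THAT {'; '.join(semantic_predicates)}"
    if source_conditions:
        text += ", and " + comma_and(source_conditions)
    return text


sentence = explain(
    target_attributes=["C_name"],
    semantic_predicates=[
        "STD enrolled-in CR is-enrollment-for C",
        "T teaches SEC is_section_of C",
    ],
    source_conditions=[
        "S_name = marshall",
        "T_name = jones",
        "T_title = associate professor",
    ],
)
print(sentence)
```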

    Query validation. Validation of CQL query formulations involves checking for semantic equivalence, or consistency, between the original query statement and the system-generated explanations. Where the user believes that the system-generated query explanation is semantically consistent with the intended query, the user validates it. It is thereafter executed. Otherwise, it is modified as necessary before execution.

    Fig. 9. Logical combination of linear paths into a solution path for Query 1.

    5.4. Automatic generation of SQL

    To execute a query whose NL explanation has been validated by the user, the conceptual answer is first translated into an SQL query (here again we are limiting the discussion to the prototype implementation of CQL on top of a relational DBMS). In the remainder of the section, we discuss and illustrate this translation process.

    From the earlier discussion on the mapping of query paths to the underlying DB relations, it is seen that the edge connecting two entities on a path is equivalent to a join. Since the conceptual answer defines the solution path of the query, the derived relations and joins are automatically converted into an SQL query, which is then specified on the underlying physical DB. Algorithm Mapping gives the main conversion logic. The conversion pseudo-code can be found in [41]. ⁶ In essence the approach is as follows:

    ⁶ We have not included the pseudo-code here because of its length. However, it is available to interested readers upon request from the first author.

    Fig. 10. CQL's natural language explanation of Query 1.


  • 1. The target attributes are written to the SQL SELECT clause.
    2. The following are written to the SQL FROM clause: (1) target tables, (2) source tables, and (3) each entity on the chosen candidate path.
    3.1. If the underlying relational DBMS performs automatic joins, then the tables in the FROM clause are automatically joined by the system.
    3.2. If the underlying DBMS does not perform automatic joins, the join conditions are read from the Join_Matrix and written to the SQL WHERE clause.
    4. The selection criteria are read from the Selection-(Theta)-Operation matrix and also written to the SQL WHERE clause.

    5.5. CQL-to-SQL conversion algorithm

    Algorithm Mapping is the main translation program for mapping non-recursive and un-nested queries from CQL to SQL. Translations for special cases, such as nested queries, recursive queries, aggregation functions, clauses, other more complex multi-block queries, etc., are handled by other algorithms in [41] which call Algorithm Mapping as a special procedure.

    Algorithm Mapping
    Function: Map CQL query formulation to SQL
    Input: Validated CQL query formulation. Entities on validated query-path.
    Output: SQL query formulation

    Algorithm.
    Step 1. Pick all the target attributes from the target segment of the CQL formulation. Write them in the SELECT clause of SQL, separating them with commas (,).
    Step 2. Pick all the target and source entities specified in the CQL formulation.
    Step 3. Pick any entity specified in the semantics (relationship) conditions block of the CQL formulation, but not already picked in Step 2.
    Step 4. Pick any entity not already picked in Steps 2 and 3, but contained in the query-path.
    Step 5. Write all the entities picked in Steps 2–4 into the SQL FROM clause, separating them with commas.
    Step 6. In the SQL WHERE clause, write each source attribute and its value. Enclose the attribute value in apostrophes ('...'). Separate the source-attribute = attribute-value pairs with an And, i.e., form a conjunction.
    Step 7. Check the selection conditions block of CQL. If any selection criteria are specified, write them into the SQL WHERE clause, separating this set from the set in Step 6 with an And.
    Step 8. If the underlying DBMS supports automatic joining of tables in the FROM clause, then do:
        Step 8.1. Write a semi-colon (;) after the last source-attribute = attribute-value pair in Step 7. Stop.
    Otherwise do:
        Step 8.2. Write an And after the last source-attribute = attribute-value pair in Step 7.
            Step 8.2.1. Extract all the join conditions from the query-path and write them after Step 8.2, separating the joins with an And.
            Step 8.2.2. Terminate the last join condition with a semi-colon (;). Stop.
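    The steps of Algorithm Mapping translate fairly directly into string assembly. The sketch below is an illustration under stated assumptions, not the authors' implementation and not a reproduction of Fig. 11: the function name map_to_sql, the table and column names (including the key columns in the join conditions) and the use of COURSE_REGISTRATION in place of COURSE-REGISTRATION are hypothetical. Values are quoted naively, as in Step 6; production code would pass them as query parameters instead.

```python
def map_to_sql(target_attrs, target_and_source_entities, relationship_entities,
               path_entities, source_conditions, selection_conditions=(),
               join_conditions=(), auto_join=False):
    """Assemble one un-nested SELECT statement following Steps 1-8."""
    # Step 1: target attributes go into the SELECT clause.
    select = "SELECT " + ", ".join(target_attrs)

    # Steps 2-5: collect entities in the order picked, skipping duplicates.
    tables = []
    for entity in [*target_and_source_entities, *relationship_entities, *path_entities]:
        if entity not in tables:
            tables.append(entity)
    from_clause = "FROM " + ", ".join(tables)

    # Step 6: source attribute = 'value' pairs.
    predicates = [f"{attr} = '{value}'" for attr, value in source_conditions]
    # Step 7: user-specified selection criteria, if any.
    predicates += list(selection_conditions)
    # Step 8: explicit join conditions unless the DBMS joins automatically.
    if not auto_join:
        predicates += list(join_conditions)

    where = "\nWHERE " + " AND ".join(predicates) if predicates else ""
    return f"{select}\n{from_clause}{where};"


sql = map_to_sql(
    target_attrs=["COURSE.C_name"],
    target_and_source_entities=["COURSE", "STUDENT", "TEACHER"],
    relationship_entities=["STUDENT", "COURSE_REGISTRATION", "TEACHER", "SECTION"],
    path_entities=["STUDENT", "COURSE_REGISTRATION", "COURSE", "TEACHER", "SECTION"],
    source_conditions=[
        ("STUDENT.S_name", "marshall"),
        ("TEACHER.T_name", "jones"),
        ("TEACHER.T_title", "associate professor"),
    ],
    join_conditions=[  # hypothetical key columns standing in for Join_Matrix entries
        "STUDENT.student_id = COURSE_REGISTRATION.student_id",
        "COURSE_REGISTRATION.course_id = COURSE.course_id",
        "TEACHER.teacher_id = SECTION.teacher_id",
        "SECTION.course_id = COURSE.course_id",
    ],
)
print(sql)
```

    The join conditions listed here correspond to the four edges of the chosen candidate path p1.p3, one condition per edge.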

    Example. The generated SQL statement corresponding to the semantics of the conceptual answer of our example Query 1 is shown in Fig. 11.


  • 6. Expressive power of CQL

    A major concern in query languages is their completeness, which is an aspect of expressive power, or simply expressiveness, which in turn is taken as the ability of the system to extract meaningful information from DBs [11]. In this section, the functional expressiveness of CQL is discussed. First, we discuss the aggregation functions, clauses, quantifiers and logical operators supported by CQL. Thereafter, we provide formal proofs of the power of CQL. Lastly, a discussion on the safety of CQL expressions is given.

    6.1. Aggregations, clauses and quantifiers in CQL

    CQL is mappable to different target database management systems (DBMSs). Therefore, CQL functions, clauses, quantifiers and logical operators are independent of any specific DBMS query language. This means that they must be tailorable to those of the underlying target DBMS query sub-language (as mentioned earlier, in the implementation discussed in this paper, CQL is tailored to an underlying relational system). Aggregation functions, clauses, quantifiers and logical operators are collectively referred to here as operators.

    As illustrated in Fig. 12, two types of tailorable functions are provided to extend the power of CQL:

    • Directly mappable operator functions (DMOFs). These are operators that are directly supported by the underlying DBMS sub-language. They can be used directly in CQL query formulation, and are directly mapped to the same operators in the underlying DBMS sub-language. Therefore, there is a one-to-one correspondence between DMOFs in CQL and the set of operators supported by the underlying DBMS.

    • Indirectly mappable operator functions (IMOFs). Some of the operators supported by CQL and required by an application may not be directly supported by the target DBMS. A body of conversion codes is written to translate this class of operators to programs that are executable by the DBMS. IMOFs, therefore, provide enhanced functionality to the DBMS sub-language.

    Fig. 11. CQL generated SQL statement for Query 1.

    Table 1 shows the operators that are currently supported in CQL and the equivalent SQL DMOFs that they are mapped to. A CQL operator for which SQL lacks an equivalent operator is an IMOF with respect to SQL. SQL IMOFs are indicated by a dash (-) in Table 1. As an example, the REPEAT(n) operator, which is an SQL IMOF, is used for recursive queries in CQL. The Match-n-In, Shunt and Mask operators are also IMOFs in SQL. These three operators are briefly explained in Appendix A.

    The set operators currently supported in CQL are UNION, INTERSECTION, DIVIDE and CARTESIAN PRODUCT. Currently, the logical operators ELSE, AND, OR, NOT and EXOR are supported in CQL. The interested reader is referred to [41] for a discussion on each of the CQL operators.
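    The DMOF/IMOF distinction can be pictured as a lookup table plus a set of conversion routines. The Python sketch below transcribes the operator pairs of Table 1; the names CQL_TO_SQL, classify and translate, and the idea of passing converters as callables, are assumptions made for illustration and do not describe the actual conversion codes.

```python
# CQL operator -> SQL operator, as listed in Table 1.
# None marks an operator with no SQL equivalent, i.e., an IMOF with respect to SQL.
CQL_TO_SQL = {
    "SUM": "SUM", "COUNT": "COUNT", "AVERAGE": "AVERAGE",
    "SORT-BY": "ORDER-BY", "GROUP-BY": "GROUP-BY", "BETWEEN": "BETWEEN",
    "BELONG-IN": "EXISTS", "FORALL": None, "REPEAT(n)": None,
    "IS-IN": "IN", "IS-HAVING": "HAVING", "IS-LIKE": "LIKE",
    "IS-n-OF": None, "MATCH-n-IN": None, "SHUNT": None, "MASK": None,
}


def classify(cql_operator):
    """Return 'DMOF' when Table 1 gives an SQL equivalent, else 'IMOF'."""
    return "IMOF" if CQL_TO_SQL[cql_operator] is None else "DMOF"


def translate(cql_operator, imof_converters):
    """Map a DMOF directly; delegate an IMOF to a caller-supplied converter."""
    sql_operator = CQL_TO_SQL[cql_operator]
    if sql_operator is not None:
        return sql_operator
    # Hypothetical hook standing in for the body of conversion codes.
    return imof_converters[cql_operator]()
```

    Retargeting CQL to another DBMS would, under this reading, amount to swapping the table and the converters.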

    6.2. Proof of the expressive power of CQL

    The expressiveness of query languages is usually gauged in terms of their relational completeness. A data manipulation language is said to be relationally complete if it is as expressive as relational algebra (or equivalently, relational calculus) [17,21].

    Fig. 12. Tailoring CQL operators.


    In discussing the expressive power of CQL, it is noted that the class of queries computable by CQL is a superset of first-order queries. More formally, if $Q(CQL)$ denotes the set of queries computable by CQL and $Q(fo)$ is the set of first-order queries, then $Q(fo) \subseteq Q(CQL)$. To prove this, it is shown that $Q(CQL) \supseteq \bigcup \{Q(fo), Q(u), Q(IMOFs)\}$. $Q(u)$ is the set of queries involving the use of the universal quantifier and $Q(IMOFs)$ is the class of CQL queries requiring the use of IMOFs with respect to SQL.

    6.2.1. Claim on CQL completeness

    CQL is more than first-order complete. To show this, it is first shown that CQL is relationally complete. Next, it is shown that CQL supports universal quantification, thus making it more functionally expressive than first-order completeness.

    6.2.2. Relational completeness of CQL

    The approach taken here is fashioned after Ullman [52] in proving the relational completeness of QUEL and QBE, and also by Date [19]. In general, language L1 is L2 complete if we can express in L1 any query that can be expressed in L2. Where L1 is CQL and L2 is SQL, the proof reduces to showing that CQL is relationally complete. To prove the relational completeness (and hence the expressive power) of a query language, it suffices to show how to apply each of the five basic relational algebra operations and store the result in a new relation [52].

    A first-order complete system is one whose class of computable queries contains the class of queries computable through the relational algebraic operators: Difference (diff), Union ($\cup$), Cartesian product ($\times$), Selection ($\sigma$) and Projection ($\Pi$) [52].

    Claim. CQL is first-order complete, i.e., $Q(fo) = Q(-, \cup, \times, \sigma, \Pi) \subseteq Q(CQL)$.

    Table 1. Mappable CQL user-defined operator functions

    CQL operators    SQL operators
    -------------    -------------
    SUM              SUM
    COUNT            COUNT
    AVERAGE          AVERAGE
    SORT-BY          ORDER-BY
    GROUP-BY         GROUP-BY
    BETWEEN          BETWEEN
    BELONG-IN        EXISTS
    FORALL           -
    REPEAT(n)        -
    IS-IN            IN
    IS-HAVING        HAVING
    IS-LIKE          LIKE
    IS-n-OF          -
    MATCH-n-IN       -
    SHUNT            -
    MASK             -

    Proof. To prove this claim, it is required to show that given any relational database (RDB) and any query in $Q(-, \cup, \times, \sigma, \Pi)$ specified on RDB, with result $A_{rb}$ on RDB, there exists an equivalent query Q expressed in terms of CQL, such that if $A_{CQL}$ is the answer of Q, then $A_{CQL} = A_{rb}$. This essentially says that the set of queries computable by CQL is a superset of the set of first-order queries.

    Let $R := R(a_1, a_2, \ldots, a_n)$ and $S := S(b_1, b_2, \ldots, b_n)$, and let $A_n$ and $B_n$ denote the attributes of R and S, respectively. $R = \{t_i^R \mid i = 1, 2, \ldots, m\}$, i.e., the set of tuples of R, and $S = \{t_j^S \mid j = 1, 2, \ldots, k\}$, i.e., the set of tuples of S. $t_i^R = (v_{i1}, v_{i2}, \ldots, v_{in})$ and $t_j^S = (v_{j1}, v_{j2}, \ldots, v_{jn})$, where $v_{ix}$ and $v_{jx}$ are the values of $a_x$ and $b_x$, respectively.

    6.2.2.1. Union operation ($\cup$). Assume $A_{rb} = {\cup}T_R = R \cup S$. (This presupposes that R and S are union compatible.)

        $\Rightarrow {\cup}T_R = \{t_i^R\} \cup \{t_j^S\}$.

    That is, ${\cup}T_R$ is a relation that includes all the tuples that are either in R or in S, or in both R and S.

    In CQL:

        $A_{CQL} = Q[target(T); source(R, S); -; C_{sel}(T = R \cup_c S)]$,

    where $\cup_c$ is the union operator supported in CQL.

    Using the set-theoretic decomposition of Q, $A_{CQL} = Q_1[target(T_1); source(R); -; C_{sel}(T_1 = R)] \cup_c Q_2[target(T_2); source(S); -; C_{sel}(T_2 = S)]$, where $Q_1$ and $Q_2$ are subqueries of Q.

        $\Rightarrow A_{CQL} = (T_1 = R) \cup_c (T_2 = S)$.

    But $\cup_c \equiv \cup$, the relational union operator.

        $\Rightarrow A_{CQL} = R \cup S = {\cup}T_R$.

    6.2.2.2. Difference operation ($-$ or MINUS). Assume $A_{rb} = {-}T_R = R - S = R$ MINUS $S$.

        $\Rightarrow {-}T_R = \{t_i^R\} - \{t_j^S\}$.

    ${-}T_R$ is a relation that includes all the tuples that are in R but not in S.

    In CQL, $A_{CQL} = Q[target(T); source(R, S); -; C_{sel}(T = R\ \mathrm{diff}\ S)]$, where diff is the difference operator in CQL.

    Set decomposition gives $A_{CQL} = Q_1[target(T_1); source(R); -; C_{sel}(T_1 = R)]\ \mathrm{diff}\ Q_2[target(T_2); source(S); -; C_{sel}(T_2 = S)]$.

        $\Rightarrow A_{CQL} = (T_1 = R)\ \mathrm{diff}\ (T_2 = S)$.

    But diff $\equiv -$ (or MINUS), the relational difference operator.

        $\Rightarrow A_{CQL} = R - S = {-}T_R$.

    6.2.2.3. Selection operation ($\sigma$). As argued by Ullman [46], "... all selections can be broken into simple selections of the form $\sigma_{X \theta Y}$".

    Assume $A_{rb} = {\sigma}T_R = \sigma_{A_I \Theta V}(R)$. ${\sigma}T_R$ is a relation consisting of only the tuples of R where the condition $A_I \Theta V$ evaluates to true. V is an attribute-value vector, $A_I$ an attribute vector and $\Theta$ the theta operation vector.

    If $\{{t'_i}^R\}$ = tuples of R such that $A_I \Theta V$ is false, and $R' = \{t_i^R\} - \{{t'_i}^R\}$, then ${\sigma}T_R = R'$.

    In CQL, $A_{CQL} = Q[target(T); source(R); -; C_{sel}(T = A_I \Theta V)]$. If $\{{t''_i}^R\}$ = tuples of R such that $A_I \Theta V$ is false and $R'' = \{t_i^R\} - \{{t''_i}^R\}$, then $T = R''$.

    But $\{{t''_i}^R\} = \{{t'_i}^R\} \Rightarrow R'' = R'$ and $T = {\sigma}T_R$.

    6.2.2.4. Projection operation ($\Pi$). Assume $A_{rb} = {\Pi}T_R = \Pi_{A_I}(R)$, such that if $a'_I \in A_I$, then $a'_I \in A_n$ (recall that $A_n$ denotes the attributes of R). ${\Pi}T_R$ is a relation whose intension consists only of attributes $A'_I = (a'_1, a'_2, \ldots, a'_k)$, such that $A'_I \subseteq A_I$.

    In CQL, $A_{CQL} = Q[target(T(A'_I)); source(R(A_I)); -; -] = Q[target(T); source(R(A_I)); -; C_{sel}(all(A'_I))]$. $all(A'_I)$ picks all the sets of distinct values of $A'_I$ and assigns them to T.

    6.2.2.5. Cartesian product operation ($\times$). Assume $A_{rb} = {\times}T_{RS} = R \times S$. ${\times}T_{RS}$ is a relation whose intension consists of the concatenation of $A_n$ and $B_m$, where m may or may not be equal to n. Thus, if $A_{\times}$ denotes the attributes of ${\times}T_{RS}$, then $A_{\times}$ is of the form $A_n B_m$. That is, $A_{\times} = (a_1, a_2, \ldots, a_n, b_1, b_2, \ldots, b_m)$. The relational product operation is equivalent to a join operation with no join restrictions. Thus, the extension of ${\times}T_{RS}$ is the set of all possible combinations of tuples from the two relations being operated on.

    In CQL, $A_{CQL} = Q[target(T(A)); source(R, S); -; C_{sel}(A_n\ \mathrm{X.}\ B_m)]$. (Instead of $A_n$ X. $B_m$, CQL allows the expression R X. S to be used.) CQL sets the intension of T to $A_n$ X. $B_m$. But X. in CQL maps to the relational $\times$ operator. It follows that the extension of T is also the set of all possible combinations of tuples from R and S.

        $\Rightarrow T = {\times}T_{RS}$.
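    For readers who prefer an operational view, the five basic operations used in the proof can be mimicked on small in-memory relations. The Python sketch below models a relation as a set of tuples; the function names and the use of attribute positions instead of attribute names are assumptions made purely for illustration and are unrelated to the CQL prototype.

```python
from itertools import product as _all_pairs


def union(r, s):
    """All tuples in r, in s, or in both (r and s assumed union compatible)."""
    return r | s


def difference(r, s):
    """All tuples that are in r but not in s."""
    return r - s


def select(r, predicate):
    """Only the tuples of r for which the condition evaluates to true."""
    return {t for t in r if predicate(t)}


def project(r, positions):
    """Distinct value combinations of the attributes at the given positions."""
    return {tuple(t[p] for p in positions) for t in r}


def cartesian_product(r, s):
    """Every concatenation of a tuple of r with a tuple of s."""
    return {tr + ts for tr, ts in _all_pairs(r, s)}
```

    For instance, with R = {(1, "a"), (2, "b")} and S = {(2, "b"), (3, "c")}, difference(R, S) keeps only (1, "a"), and cartesian_product(R, S) contains four tuples of four values each.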

    6.2.3. Universal quantification and safe expressions

    The second part of the proof of the functional expressiveness of CQL deals with showing that CQL supports universal quantification and that its expressions are safe.

    6.2.3.1. Universal quantification in CQL. The support for universal quantification in CQL is discussed and demonstrated here. The FORALL operator is used for universal quantification in CQL. We show that this operator is mappable to SQL.

    The input structure for the CQL FORALL operator is:

        Q[target(T(a(T))); source(S_i(a_i), S_u(a_u)); -; C_sel(a(I) = v and a(u) = ? and a(T) FORALL {a(u) |of a(i)|})].

    The term |of a(i)| is optional.

    The equivalent SQL statement generated by CQL is:

        SELECT a(T)
        FROM T
        WHERE NOT EXISTS (SELECT *
            FROM S S1
            WHERE a(I) = v
            And NOT EXISTS (SELECT *
                FROM S S2
                WHERE S2.a(u) = S1.a(u) And S2.a(T) = T.a(T)));

    The translation of the CQL FORALL operator to SQL code for universal quantification is achieved through Algorithm FORALL (Appendix C).

    Example. Get supplier numbers for suppliers who supply at least all those parts supplied by supplier S2. (From [19].) The assumed relations are:

        S(S#, Sname, Status, City)
        P(P#, Pname, Color, Weight, City)
        SP(S#, P#, QTY)

    CQL Formulation:

        Target attribute: ⟨S#⟩
        Target: ⟨SP⟩

        Source attribute: ⟨S#⟩
        Source attribute value: ⟨S# = S2⟩
        Source: ⟨SP⟩

        Source attribute: ⟨P#⟩
        Source: ⟨SP⟩

        Selection Conditions: ⟨S# FORALL P# of S# = S2⟩

    SQL Formulation (resulting from Algorithm FORALL):

        SELECT S#
        FROM SP
        WHERE NOT EXISTS (SELECT *
            FROM SP SP1
            WHERE S# = 'S2'
            And NOT EXISTS (SELECT *
                FROM SP SP2
                WHERE SP.S# = SP2.S#
                And SP1.P# = SP2.P#));
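    The doubly negated form can be checked on a toy instance. The Python sketch below runs the same NOT EXISTS pattern with the standard sqlite3 module. The table contents are invented for illustration, the columns are renamed sno and pno (instead of S# and P#) to avoid quoting, and DISTINCT and ORDER BY are added only to make the output deterministic; none of this is taken from the CQL prototype.

```python
import sqlite3

FORALL_SQL = """
    SELECT DISTINCT sp.sno
    FROM sp
    WHERE NOT EXISTS (
        SELECT * FROM sp AS sp1
        WHERE sp1.sno = ?
          AND NOT EXISTS (
              SELECT * FROM sp AS sp2
              WHERE sp2.sno = sp.sno AND sp2.pno = sp1.pno))
    ORDER BY sp.sno
"""


def suppliers_covering(conn, reference_supplier):
    """Suppliers that supply at least every part supplied by reference_supplier."""
    return [row[0] for row in conn.execute(FORALL_SQL, (reference_supplier,))]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sp (sno TEXT, pno TEXT, qty INTEGER)")
    # Invented shipments: S2 supplies P1 and P2.
    shipments = [
        ("S1", "P1", 300), ("S1", "P2", 200), ("S1", "P3", 400),
        ("S2", "P1", 300), ("S2", "P2", 400),
        ("S3", "P2", 200),
        ("S4", "P1", 100), ("S4", "P2", 200), ("S4", "P4", 300),
    ]
    conn.executemany("INSERT INTO sp VALUES (?, ?, ?)", shipments)
    return suppliers_covering(conn, "S2")
```

    On this sample data demo() returns S1, S2 and S4; S3 is excluded because it does not supply P1.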

    From the foregoing, if $Q(u)$ is the class of universally quantified queries and $Q(CQL)$ the class of queries supported by CQL, then $Q(u) \subseteq Q(CQL)$.

    6.2.3.2. Safety of CQL expressions. In the remainder of the section, the safety of CQL expressions is argued. According to [52], the main properties of safe formulas are:

    (a) Every safe formula must be domain independent. This ensures that data is not materialized from outside the domain.

    (b) It should be easy to tell, just by inspecting a formula, whether or not it is safe.

    (c) The formulas that are expressible in real query languages based on relational calculus are safe.

    These three properties provide a mechanism for ensuring the safety of CQL queries. In CQL, the following properties hold:

    1. A formula or expression does not materialize or reference an infinite entity (or infinite table at the logical level).

    2. A non-recursive mechanism references only a finite set of finite entities. This ensures a finite result. In this regard, it is noted that CQL recursions (using the REPEAT(n) operator) are transformed into non-recursive procedures through query-graph construction. This ensures that:
        • a CQL query-graph is a finite graph,
        • each entity on a query-graph or path is finite,
        • each referenced entity on a query-graph is finite,
        • CQL query results are materialized from a finite number of target entities, which are explicitly specified in the query formulation.

    3. Every CQL formula is expressible in SQL. SQL is a real query language based on relational calculus (and algebra), and its expressions are safe.

    Therefore, based on properties (a)–(c) above for safety, it can be concluded from properties (1)–(3) that CQL expressions are safe.

    Additionally, let
        F(SQL) = formula expressions for SQL,
        F(CQL) = formula expressions for CQL,
        $A \rightarrow B$ mean A is expressible in B.

    The claim that every CQL expression is expressible in SQL can be formally stated as: $\{F_i(CQL) \rightarrow F(SQL)\}\ \forall i$, which in conjunction with property (c) for safety implies the safety of CQL expressions.

    Corollary. [$A \nrightarrow B$ means A is not expressible in B.] If $\exists i : F_i(CQL) \nrightarrow F(SQL)$, then CQL expressions cannot be guaranteed to be safe, i.e., it is impossible to make a definite assertion as to the safety of CQL expressions. However, the discussion on mapping (that each segment in the target block, the source block, and the selection conditions block of CQL is mappable to an SQL term) and on the expressive power of CQL shows that $\neg \exists i : F_i(CQL) \nrightarrow F(SQL)$. This also implies the safety of F(CQL).

    7. Discussion and conclusion

    Certainly, the concept-based approach to query formulation is not new. Indeed, the ORM community has explored this field extensively and proposed concept-based query languages for their modeling approaches. As already mentioned, ORM itself is a generic term for a concept-based approach to data modeling in which data is modeled only in terms of entities (or objects) and the semantic roles they play in relationships with other entities. No use is made of the concept of attributes in ORM. Because of its generic nature, there is not just a single ORM model, but a set of closely related versions, all of which adhere to the binary data modeling principle stipulated by ORM. Examples include NIAM [26,40,55,58] and the predicator set model (PSM) [48].

    NIAM is a version of ORM that supports only binary relationship types. As a modeling approach, it is particularly useful as an analysis method that describes an information system in natural language. Starting from examples, which are partial descriptions of the information domain, the approach results in an information structure, or database schema. A formalization of NIAM was attempted in the predicator model (PM) [54] by extending it to allow for n-ary relationships.

    A further extension of NIAM was achieved in PSM by extending PM to support advanced modeling constructs like sequences, sets, polymorphism, power types, schema types, generalization and specialization relationships. A motivation for this extension was to support complex objects, hypermedia and office automation applications. PSM is built around the concept of predicator, which is the connection between an object and a role. Relationships are then defined in terms of the association roles played by objects, i.e., a relationship is an association between predicators. In PSM, a relationship is viewed as a set of predicators.

    ORM lends itself to different dimensions of database querying, one of which is the use of schema transformation in schema, and hence query, optimization. Different conceptual schemas of the same DB application can be mapped to different internal and logical schemas. This allows for the performance at the operational/internal level to be optimized by optimizing the conceptual schema. This requires a transformation of one conceptual schema onto another. The study in [28] proposes a formal approach to optimizing conceptual schemas by transforming a given conceptual schema onto a different but equivalent conceptual schema that exhibits a better operational efficiency at both the logical and internal schema levels. The study proposes an approach and a language based on the mix-fix notation. In essence, the approach takes as input a conceptual schema of a DB application and outputs another conceptual schema of the same application. The output conceptual schema is an optimized version of the input schema in the sense that it leads to more efficient logical and internal schemas, which in turn result in better operational characteristics.

    Both initial and optimized schemas are, however, ORM schemas. This means that they are semantic schemas. They can therefore be used as the underlying conceptual schemas for CQL. We note that any conceptual schema that is expressible in mix-fix notations of entity-types (or object-types) and semantic association roles played by the entities (or objects) can form the basis for and, therefore, support CQL queries. To illustrate this argument, we note that an ORM schema can be defined as the tuple $\langle E, R \rangle$, where

        E :: set of entity-types,
        R :: set of semantic roles played by members of E.

    An ORM schema can therefore be expressed in the mix-fix notation:

        $mFix(R; E) = mFix(r_1, r_2, \ldots, r_n; e_1, e_2, \ldots, e_m)$.

    This is precisely the notation for the path expressions of CQL queries.

    Fundamentally, the motivation behind the development of concept-based query languages is the same as for natural language query languages, namely, to provide users with query languages that are naturally close to users. This necessitates, on the one hand, that the languages are mathematically sound and unambiguous, and on the other hand that they are as natural as possible and, hence, easy to use. To the best of our knowledge, the reference and idea language (RIDL) [20,38] was the first concept-based language to aim at these goals. RIDL was a semi-natural language query language that was developed for NIAM. The language, however, suffered from certain drawbacks, which included a lack of formal definition and sound syntactic and semantic basis. Additionally, it was based on the initial but restricted binary version of NIAM. For these reasons, RIDL did not meet with widespread acceptance [49,51].

    The general approach to querying in the newer family of ORM-based query languages is illustrated by LISA-D [49,51], Conquer [8] and Conquer-II [5]. LISA-D is essentially a redesign and extension of RIDL to make it more sound and strong formally. For this reason, instead of basing it on NIAM, it is based on PSM, which, as mentioned above, is itself an extension of NIAM. LISA-D queries are formulated using information descriptors. This is because querying in LISA-D is founded on the information descriptor syntactical category. Information descriptors characterize and facilitate the disclosure of information objects in an information-base [50], which in the context of database queries would constitute the database population. An information descriptor is specified as $\mathcal{D}$: information descriptor $\times$ ENV $\rightarrow$ PE, where ENV is the environment of the database, as determined by the database population. PE is a path expression. According to this notation, an information descriptor in a given environment maps to a specific path expression. A query path is therefore expressed by information descriptors. A query path in LISA-D is therefore a concatenation of information descriptors. Indeed, path expressions in LISA-D can be verbalized via the verbalization function $\mathcal{D}$, such that if D is an information descriptor, then $\mathcal{D}[[D]]$ is equivalent to a path expression. If P is a path expression, then $P = \mathcal{D}[[D_1 D_2 D_3]]$ can be expressed as the concatenation of $\mathcal{D}[[D_1]]$, $\mathcal{D}[[D_2]]$ and $\mathcal{D}[[D_3]]$. For example, if $\mathcal{D}[[D_1]]$, $\mathcal{D}[[D_2]]$ and $\mathcal{D}[[D_3]]$ define the atomic information descriptors President, born-in and State respectively, then $P = \mathcal{D}[[D_1 D_2 D_3]]$ = President born-in State. This expression corresponds to the path connecting schema entities President and State via the semantic role born-in.

    Specified queries are matched against the characterization of information objects, i.e., against information descriptors. A LISA-D query has the general format LIST $p_1, p_2, \ldots, p_n$, P, where $p_1, p_2, \ldots, p_n$ are predicators whose values are to be evaluated on path P. For our example, this query specifies the evaluation of $p_1, p_2, \ldots, p_n$ on the expressed path P given by President born-in State.

    In terms of the CQL notation used in [42], this query can be expressed as LIST $p_1, p_2, \ldots, p_n$ # P, where # is the submersion operator used to suppress the predicates to be evaluated. Once $p_1, p_2, \ldots, p_n$ are suppressed, P remains. CQL's mix-fix expression for P then becomes explicitly clear: P = mix-fix([born-in], [President, State]) = President born-in State. This CQL path expression precisely coincides with the LISA-D query path. LISA-D is expressively very powerful, but technically not suited for end-users [8].
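    The verbalization of a linear mix-fix path can be sketched in Python as an interleaving of entities and roles. The function name mix_fix and the restriction to linear paths with exactly one role between consecutive entities are assumptions of this sketch, not part of the CQL definition.

```python
def mix_fix(roles, entities):
    """Verbalize a linear path by interleaving entities with the roles linking them."""
    if len(entities) != len(roles) + 1:
        raise ValueError("a linear path needs exactly one role between consecutive entities")
    parts = [entities[0]]
    for role, entity in zip(roles, entities[1:]):
        parts.extend([role, entity])
    return " ".join(parts)
```

    Applied to the example in the text, mix_fix(["born-in"], ["President", "State"]) yields President born-in State.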

    Conquer [8] and Conquer-II [5] are also concept-based query languages based on ORM. Queries are formulated as paths through an information space that is represented as schemas modeled in ORM. Query predicates are represented as semantic role sequences that can be expressed in mix-fix form. Queries can be expressed as outline queries, schema trees, or text. The commercially implemented versions of these languages require queries to be entered in outline form through the drag-and-drop approach discussed earlier in connection with Conquer-II. Textual verbalizations of expressed queries can be generated automatically. Queries consist of entities and predicates. (When necessary, attributes are introduced only as derived concepts.) Both linear and non-linear queries, i.e., tree-shaped queries, are expressed as sequences of conceptual joins and conceptual operations forming a series of conceptual paths through ORM schemas. Therefore, ORM queries can be readily verbalized as mix-fix statements, as illustrated by the following two Conquer/Conquer-II queries: (1) Employee lives in city and city is location of Branch. (2) Employee has salary > 90000 and Either speaks Language x Or drives Car y. The implied mix-fix notation can be clearly seen from these query expressions.

    From the foregoing discussion, it can be seen that CQL can be used with an ORM schema and, therefore, act as another ORM query language. Where the conceptual schema is an E-R schema, or a variant of it, CQL is being used as an ER model query language. However, what distinguishes CQL from other concept-based query languages is that it is an abbreviated concept-based query language. As discussed earlier in the paper, this means that, unlike currently existing concept-based query languages, the entire query path does not have to be specified by the user.

    An area in which the concept-based approach can benefit is in the incorporation of intelligent tools and techniques into query systems (a good coverage of the topic as it relates to multimedia communication interfaces can be found in [30]). In intelligent query answering the intent of a query is analyzed to provide generalized, neighborhood, or associated information relevant to the query [12,29,36]. An approach adopted in the more recent studies is to exploit the rich semantic information of knowledge-rich DBs to determine the intent of queries. Query intent analysis can be performed on query statements that are not well formulated or difficult to interpret, in order to clarify the intent of the user. Once the intent is determined, the query can be restated either automatically or cooperatively, with the help of the user, in a form that is easily interpreted. Advances in this field can be applied to facilitate the formulation of abbreviated concept-based queries. For example, we are currently investigating how to apply this to resolving ambiguous queries and queries with missing information in CQL.

    Intelligent query answering systems can also be used to provide sensible explanations of posed queries. The problem has been extensively studied in the context of designing intelligent multimedia explanations for paraphrasing and communication systems [22,36]. CQL provides this additional support to allow users to validate the system-explanations of their queries. The difficulty here is in avoiding too many or superfluous explanations. We deal with this problem in this study by returning the explanation of only the shortest query path to the user.

    Intelligent approaches can also be used to provide computer-aided query formulation systems to facilitate user formulation of abbreviated concept-based queries. This is the more common application of intelligent query answering tools in natural language query systems. In those systems where it is provided, e.g., [50], the approach is usually assistive, with the user interacting with the system to incrementally formulate the query. This usually takes the form of the user responding to prompts and cues from the system. In the natural language extension of CQL reported in [42], the user is presented with the information content of the DB. Further help is provided in the form of sample queries that can be used as is or modified and used. To the best of our knowledge, no other concept-based query language provides this extended level of assistance for query formulation.


    While our overriding motivation for the study was to reduce the cognitive load imposed on users in formulating queries, we do not expect that users will be completely devoid of all knowledge about databases. We, therefore, presume some, but not in-depth, familiarity with the concepts of entities, attributes, and relationships. These terms can easily be replaced with less arcane terms during actual production use. For example, "entity" can be replaced by terms such as "real-world object", "data type", etc. "Attribute" can be substituted with "data item", "data field", "data column", etc. Furthermore, in production use, CQL can be augmented with a facility that provides on-line explanations and examples of these terms. CQL does not require users to be familiar with the structure and organization of the application database, but only with the content. Even on this latter demand, we provide help through the query formulation surrogate system.

    In summary, the formal basis of CQL was presented in this paper. Like other concept-based query languages, CQL allows users to specify queries directly against conceptual schemas of database applications, using concepts and constructs that are native to and exist on the schemas. However, unlike other existing concept-based query languages, CQL queries are abbreviated. Hence CQL is an abbreviated concept-based query language.

    CQL is designed for ease-of-use and, thereby, aimed at reducing the cognitive burden faced by database end-users. To aid end-users in formulating queries, CQL is provided with a computer-assisted query formulation system. CQL is founded on strong set- and graph-theoretic principles. We demonstrated that it is more than first-order complete. In combining ease-of-use with expressive power, it overcomes the common weakness in concept-based query languages, i.e., that of being less than relationally complete. A prototype of CQL has been implemented as a front-end to a relational database manager.

    A contribution of this study is the use of the semantic roles played by entities in their associations with other entities to support abbreviated conceptual queries. An advantage that accrues from this main contribution is the use of relationship semantics of data models to alleviate or free the user from dealing with the syntactic complexity of query formulation. Additional advantages include the use of the roles played by entities in relationships in developing semantic graphs of conceptual queries, the use of the roles played by entities in relationships in developing pseudo-natural language explanations of queries, and the use of system-constructed semantic graphs to aid the automatic generation of SQL. The study was limited to querying a single database; database updating was not addressed.

    In future, we would like to extend this study to deal with query ambiguity and incompleteness. Missing information in a database can occur where the DB is based on the open world assumption (OWA) [39]. OWA allows a DB system to have incomplete knowledge. This implies that there may be some true propositions about the universe of discourse which are neither stored nor derivable by the system. In CQL, missing information is of two types: incompleteness and ambiguity. Both are defined with respect to the ability of the system to extract data.

    We also plan to extend CQL to support multi-dimensional queries. On-line analytical processing is based on the multi-dimensional modeling of business. In our view, it should be possible to extend CQL in a straightforward manner for querying multi-dimensional (or decision support) databases. A study of this is already in progress.

    An extension of CQL to heterogeneous and distributed databases is also slated for the future. Database querying in heterogeneous and distributed environments, such as the World Wide Web, requires knowledge of the exact location of data in the system, the organization of data, and the knowledge of the access protocols for each unit in the distributed system. Added to this, the user must know the various query languages used by each of the units. This makes DB querying in such environments time consuming, difficult and inefficient. A more user-friendly and easy-to-use approach is called for. Preliminary investigation suggests that CQL can be extended to distributed database environments to ease and facilitate query formulation and processing.

    Extension of CQL to very large database systems is also planned. In very large databases (VLDBs), non-contiguous fragments of the DB schema may reside contemporaneously in the system. This introduces additional complexities to query formulation and processing. The CQL approach can be used to define a higher level virtual schema on the VLDB system. Queries can then be specified against the virtual schema using CQL or an extended version of CQL.

    Appendix A. Definition of formal symbols

        |     logical OR
        ,     logical AND
        {}    a set of components
        ::    comprises or consists of
        ||    concatenation

    The general syntax is: ⟨left_side_of_formula⟩ comprises {⟨right_side_of_formula⟩}.

    Appendix B. Algorithm prune-first

    Function: Generates the candidate paths between sources and targets.
    Input: CQL query, Q(S; T), of sources (S) and targets (T); connectivity matrix (C-matrix); logical adjacency matrix (LAM).
    Output: All candidate paths between sources and targets.

    The following definitions are used:
        E: a set of schema entities
        E(x): entity type x in E

    Algorithm.
    Step 0: Let $S = E_s$ and $T = E_t$.
    Step 1: Read C-matrix for $v(s, t)$, the $E_s/E_t$ cell value of the LAM. If $v(s, t) = 0$, the schema is disconnected. STOP. Else (if $v(s, t) \neq 0$) continue.
    Step 2: For each successor of a source, determine if the schema semantics of its link to the source is consistent with the query semantics. {* This is achieved by comparing the relationship semantics between the pair of entities on the schema with the specified semantics in the semantics relationship condition in the CQL query formulation. The relationship semantics between schema entities are defined in the link semantic dictionary (defined earlier). *}
        Step 2.1: If it is, retain that link.
        Step 2.2: If it is not, delete that link and that successor from the successor list of the source. (These links are deleted from the paths leading away from the source.)
    Step 3: For each predecessor of a target, determine if the schema semantics of its link to the target is consistent with the query semantics.
        Step 3.1: If it is, retain that link.
        Step 3.2: If it is not, delete that link and that predecessor from the predecessor list of the target. (These links are deleted from the paths leading to the target.)
    Step 4: Scan resulting successor set of $E_s$ from the LAM.
        Step 4.1: If $E_t \in \{E_s\text{-succ.set}\}$, then pick $E_t$ and connect $E_s$ directly to $E_t$. [Query Path (QP) $= \{E_s, E_t\} = E_s \rightarrow E_t$.] {* X-succ.set is defined as the set of schema entities succeeding, i.e., adjacent to, X. *}
        Step 4.2: Pick each $E_j$ in turn, where $E_j \in \{E_s\text{-succ.set}\}$. Connect $E_s$ to $E_j$. QP $= \{\langle \{E_s, E_t\} = E_s \rightarrow E_t \rangle; \langle \{E_s, E_j\} = E_s \rightarrow E_j \rangle\}$.
            Step 4.2.1: Set j = s and go to Step 1, and skip Step 2.
            Step 4.2.2: Repeat Step 4.2.1 until $E_j = T$.
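    The overall idea of Algorithm prune-first, pruning semantically inconsistent links at the source and the target and then expanding the remaining successors into paths, can be sketched in Python. The sketch swaps in a plain depth-first enumeration of simple paths for the matrix-based Steps 1 and 4, represents the schema as an adjacency dictionary, and treats a link as consistent when its role appears in a set of roles named by the query. All names and these representations are assumptions, not the authors' implementation.

```python
def prune_first(adjacency, link_roles, source, target, query_roles):
    """Enumerate candidate paths from source to target after semantic pruning."""
    # Work on a copy so the caller's schema graph is left untouched.
    adj = {entity: list(succ) for entity, succ in adjacency.items()}

    def consistent(a, b):
        # Assumed consistency test: the link's role is one the query mentions.
        return link_roles.get((a, b)) in query_roles

    # Step 2: drop inconsistent links leading away from the source.
    adj[source] = [e for e in adj.get(source, []) if consistent(source, e)]

    # Step 3: drop inconsistent links leading into the target.
    for entity in adj:
        if target in adj[entity] and not consistent(entity, target):
            adj[entity] = [e for e in adj[entity] if e != target]

    # Step 4 (simplified): expand successors depth-first, never revisiting
    # an entity already on the current path.
    paths = []

    def extend(path):
        last = path[-1]
        if last == target:
            paths.append(path)
            return
        for successor in adj.get(last, []):
            if successor not in path:
                extend(path + [successor])

    extend([source])
    return paths
```

    An empty result corresponds to the disconnected case of Step 1: no candidate path survives between source and target.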

    Appendix C. Algorithm FORALL

    Function: Maps the CQL FORALL operator to SQL code for universal quantificationInput CQL Query FormulationOutput SQL statement for universal quantification operation

    Algorithm.
    Step 1: If an uninstantiated source attribute is specified, mark it in all its occurrences for selection in a nested SQL block.
    Step 2: Map the target segment of the CQL query to SQL as in the un-nested case:
      Step 2.1: If a FORALL quantifier is specified in the CQL selection condition, do:
        Step 2.1.1: In the WHERE clause write NOT EXISTS.
        Step 2.1.2: Select an asterisk (*) in a second SQL SELECT-FROM-WHERE block.
          Step 2.1.2.1: In the FROM clause write the specified source and a first alias of the source. Separate both with a space.
          Step 2.1.2.2: In the WHERE clause: (1) Write all the instantiated source attributes, as in the un-nested case (i.e., Algorithm Mapping), if any is specified. (2) Form a conjunction of (1) with NOT EXISTS.
            Step 2.1.2.2.1: Select an asterisk (*) in a third SELECT-FROM-WHERE block.
              Step 2.1.2.2.1.1: In the FROM clause write the specified source and a second alias of this source. Separate both with a space.
              Step 2.1.2.2.1.2: In the WHERE clause: (1) Join the first and second aliases of the specified source in Step 2.1.2.1 and Step 2.1.2.2.1.1 on the marked universally quantified source attribute (i.e., S2.a(u) = S1.a(u)). (2) Join the second alias of the specified source in Step 2.1.2.2.1.1 on the set of specified target attributes (i.e., S2.a(T) = T.a(T)). (3) Form a conjunction of (1) and (2).
            Step 2.1.2.2.2: Enclose the SQL code generated in Step 2.1.2.2.1 in parentheses.
        Step 2.1.3: Enclose the SQL code generated in Step 2.1.2 in parentheses.
    Step 3: Terminate the entire code generated in Step 2 with a semi-colon (;).
    Step 4: Stop.
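The nesting produced by these steps can be sketched as a small Python string generator. This is a minimal sketch under stated assumptions, not the CQL mapping code: the function `forall_to_sql`, its parameters, the fixed aliases `T`, `S1` and `S2`, and the `STUDENT`/`ENROLL` tables are all hypothetical. Instantiated source conditions are passed as ready-made strings such as `"term = 'F99'"`.

```python
def forall_to_sql(target, select_attrs, source, quantified_attr,
                  target_join_attrs, instantiated=()):
    """Build the doubly nested NOT EXISTS statement for a FORALL condition."""
    # Step 2.1.2.2.1: third block, joining S2 to S1 on the universally
    # quantified attribute and S2 to T on the target attributes.
    joins = [f"S2.{quantified_attr} = S1.{quantified_attr}"]
    joins += [f"S2.{a} = T.{a}" for a in target_join_attrs]
    inner = f"SELECT * FROM {source} S2 WHERE " + " AND ".join(joins)

    # Step 2.1.2.2: instantiated source conditions in conjunction
    # with NOT EXISTS over the third block.
    middle_terms = [f"S1.{cond}" for cond in instantiated]
    middle_terms.append(f"NOT EXISTS ({inner})")
    middle = f"SELECT * FROM {source} S1 WHERE " + " AND ".join(middle_terms)

    # Steps 2.1.1, 2.1.3 and 3: outer block, parentheses, semi-colon.
    cols = ", ".join(f"T.{a}" for a in select_attrs)
    return f"SELECT {cols} FROM {target} T WHERE NOT EXISTS ({middle});"


# Hypothetical query: students enrolled in every course listed in ENROLL.
sql = forall_to_sql("STUDENT", ["name"], "ENROLL", "course_no", ["student_id"])
print(sql)
```

The generated statement is the usual double-negation form of relational division: a target row qualifies when there is no source row whose quantified value the target fails to match.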

    Appendix D. Special problems with peculiar cases

    Match-n-in operator. For the first n or fewer values in a list or set that are matched by values in a database, the matching values are returned. This operator is expressed as ⟨a(t) Match-n-IN List⟩, where n is a positive integer and List is a set or list derived from a sub-query or directly specified in the query formulation.
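One possible reading of this operator, sketched in Python for illustration: the list is scanned in order and scanning stops once n matches have been found. The function name `match_n_in` and the in-memory stand-in for the database column are assumptions of the sketch.

```python
def match_n_in(db_values, candidates, n):
    """Return the first n (or fewer) candidates that occur among db_values."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    present = set(db_values)
    matches = []
    for value in candidates:
        if value in present and value not in matches:
            matches.append(value)
            if len(matches) == n:
                break  # the first n matches have been found
    return matches


# Hypothetical attribute values and list.
print(match_n_in(["red", "blue", "green"], ["blue", "black", "green", "red"], 2))
```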

    Shunt operator. In some applications, the schema may be disconnected into semantically disjoint sets. This type of disconnection may result from a lack of meaningful semantics associating any entity in one set with any entity in the other set. Although the schema is semantically disconnected, an implicit connection may exist through foreign key or common domain key relationships. It should, therefore, still be possible to query the database with the source in one set and the target in the other.

    Shunt(Ei, Ej) is an operation instructing the system to forcibly join Ei and Ej using the specified or identified join keys. For any arbitrary Ei and Ej, Shunt(Ei, Ej) establishes a direct connection between Ei and Ej, while shunting any intermediate entities between the two. This approach can be used where the two disjoint sets represent two different schemas.
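A forced join of this kind could be emitted along the following lines. The sketch assumes the join keys are supplied as pairs of attribute names; the function `shunt` and the table and key names are hypothetical.

```python
def shunt(ei, ej, join_keys):
    """Build a direct join between ei and ej, bypassing intermediate entities."""
    if not join_keys:
        raise ValueError("at least one join key pair is required")
    conditions = " AND ".join(f"{ei}.{a} = {ej}.{b}" for a, b in join_keys)
    return f"FROM {ei}, {ej} WHERE {conditions}"


# Hypothetical entities from two semantically disjoint sets.
print(shunt("EMPLOYEE", "VENDOR", [("zip_code", "zip_code")]))
```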

    Mask operator. For very long query paths, the path meaning derived through semantic links may be either lost or meaningless. In this case, a path exists between source and target, but because of the long path length the user may consider the NL transcription of the entire path awkward or of no value. Here the intermediate entities between source and target, or a subset of the intermediates, can be masked, i.e., excluded, from the NL transcription using the Mask(Ei, Et) operation, where the segment of the path between Ei and Et is to be masked.

    To mask a segment of a path, the end-user only specifies the two end points of the segment. The net effect of this operation is that the system excludes the segment from the validation NL statement returned to the end-user, but retains the corresponding tables and joins in the generated SQL statement.
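The net effect can be illustrated with a short Python sketch that splits a query path into the entities shown in the NL statement and the tables kept for SQL generation. The function `mask`, the list representation of a path, and the sample entities are assumptions of the sketch.

```python
def mask(path, start, end):
    """Return (entities for the NL statement, tables for the SQL statement)."""
    i, j = path.index(start), path.index(end)
    if i >= j:
        raise ValueError("start must precede end on the query path")
    # Keep both end points; hide only the intermediates between them.
    nl_entities = path[:i + 1] + path[j:]
    sql_tables = list(path)  # every table and join is retained
    return nl_entities, sql_tables


# Hypothetical query path.
path = ["Student", "Enrollment", "Section", "Course", "Instructor"]
nl_entities, sql_tables = mask(path, "Student", "Course")
print(nl_entities)
print(sql_tables)
```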


    Vesper Owei holds a master's degree in electrical and electronic engineering and in operations research from the Georgia Institute of Technology (Georgia Tech), Atlanta, Georgia, USA. Owei also holds a Ph.D. from Georgia Tech. He has practised as a project, design and consulting engineer. His current research interests include data management, data modeling, concept-based query languages, conceptual interfaces, knowledge systems, data warehousing, OLAP, data mining, web-based database application development, end-user interfaces for e-commerce and e-business, information systems architectures and framework for the disabled, and information technology for healthcare delivery.

    Shamkant B. Navathe is a professor and the head of the database research group at the College of Computing, Georgia Institute of Technology, Atlanta. He has been active in a variety of database areas, including database modeling, database conversion, database design, distributed database allocation, and database integration. He has worked with IBM and Siemens in their research divisions and has been a consultant to various companies including Digital, CCA, HP and Equifax. He was the General Co-chairman of the 1996 International VLDB (Very Large Data Base) conference in Bombay, India. He was also program co-chair of the ACM SIGMOD 1985 International Conference and General Co-chair of the IFIP WG 2.6 Data Semantics Workshop in 1995. He has been an associate editor of ACM Computing Surveys and IEEE Transactions on Knowledge and Data Engineering. He is also on the editorial boards of Information Systems (Pergamon Press) and Distributed and Parallel Databases (Kluwer Academic Publishers). He is an author of the book Fundamentals of Database Systems, with R. Elmasri (Addison-Wesley, Edition 3), currently the leading database textbook worldwide. He also co-authored the book Conceptual Design: An Entity Relationship Approach (Addison-Wesley, 1992) with Carlo Batini and Stefano Ceri. His current research interests include human genome data management, engineering data management, intelligent information retrieval, data mining algorithms, e-commerce applications and mobile database synchronization. Navathe holds a Ph.D. from the University of Michigan and has over 100 refereed publications.
