A formal basis for an abbreviated concept-based query language


Vesper Owei a,*, Shamkant Navathe b

a Information and Decision Sciences Department (M/C 294), University of Illinois at Chicago, Chicago, IL 60607, USA
b College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA

Received 22 September 1998; received in revised form 16 December 1999; accepted 3 March 2000

Abstract

Concept-based query languages allow users to specify queries directly against conceptual schemas. The primary goal of their development is ease-of-use and user-friendliness. However, existing concept-based query languages require the end-user to explicitly specify query paths in totality, thereby rendering such systems not as easy to use and user-friendly as they could be. The conceptual query language (CQL) discussed in this paper also allows end-users to specify queries directly against the conceptual schemas of database applications, using concepts and constructs that are native to and exist on the schemas. Unlike other existing concept-based query languages, however, CQL queries are abbreviated, i.e., the entire path of a query does not have to be specified. CQL is, therefore, an abbreviated concept-based query language. CQL is developed with the aim of combining the ease-of-use and user-friendliness of concept-based languages with the power of formal languages. It does not require end-users to be familiar with the structure and organization of the application database, but only with the content. Therefore, it makes minimal demands on end-users' cognitive knowledge of database technology without sacrificing expressive power. In this paper, the formal semantics and the theoretical basis of CQL are presented. It is shown that, while CQL is easy to use and user-friendly, it is nonetheless more than first-order complete. A contribution of this study is the use of the semantic roles played by entities in their associations with other entities to support abbreviated conceptual queries. Although only mentioned here in passing, a prototype of CQL has been implemented as a front-end to a relational database manager. © 2001 Published by Elsevier Science B.V. All rights reserved.

Keywords: Abbreviated query formulation; Computer-supported query formulation; Concept-based query languages; Conceptual query language; Query language expressive power

1. Introduction

Query tools that depend on programming skill for their effective and efficient use impose a cognitive burden that may diminish users' productivity with the tools. This underscores the need for database query languages (DBQLs) that are matched to the skills and ability of end-users, necessitating a rethinking of the DBQL design. Concept-based approaches to DB querying support the direct use of conceptual schemas and constructs that are either the same or similar to those in users' mental model. Therefore, concept-based DB querying naturally tends to fit the skills and ability of typical end-users. Conceptual DB querying will be needed with ever increasing demand as we place more and more complex databases on the Web. This need for concept-based information retrieval has led to research into concept-based DBQLs.

Data & Knowledge Engineering 36 (2001) 109–151
www.elsevier.com/locate/datak

* Corresponding author. Present address: Division of Management Information Systems, University of Oklahoma, 307 West Brooks, Room 306, Norman, OK 73019-4007, USA. Tel.: +1-405-325-0768; fax: +1-405-325-7482.

E-mail addresses: [email protected] (V. Owei), [email protected] (S. Navathe).

0169-023X/01/$ - see front matter © 2001 Published by Elsevier Science B.V. All rights reserved.

PII: S0169-023X(00)00042-2

However, because the primary motivation for the development of concept-based query languages is ease-of-use and user-friendliness, they tend to be weak in formalism. For example, visual query languages, which are only a sub-class of concept-based query languages, are usually very weak in expressive power [53]. This paper discusses the conceptual query language (CQL) [41], which is developed with the aim of combining the ease-of-use and user-friendliness of concept-based languages with the power of formal languages. CQL allows users to formulate queries in a very intuitive way without the need for them to learn about the schema (structure) of the database or to grapple with the syntactic complexity of command-based languages. It, therefore, makes minimal demands on end-users' cognitive knowledge of DB technology without sacrificing expressive power. Experiments in [43] show that end-users perform better with CQL than with alternative languages such as SQL; they also have a better perception of CQL. Our focus in this paper is on the theoretical basis and formal semantics of CQL. We show that, while CQL is easy to use, it is nonetheless more than first-order complete.

The rest of the paper is organized as follows: In Section 2, we give an example to illustrate the motivation for this study. In Section 3, we discuss some related studies in conceptual query formulation and semantics based querying. We formally define CQL in Section 4. Section 5 is devoted to discussing the functionality of the different modules of CQL. We examine the claims concerning the expressive power of CQL in Section 6. The paper concludes in Section 7 with a discussion of much earlier work in the development of conceptual interfaces and an examination of other issues, e.g., intelligent interfaces, that are important in interface design. A summary of the paper and an examination of its main contributions and limitations, as well as an indication of related studies planned for the future, are also given in the concluding section.

2. Motivation

Query specification in linear keyword languages (LKLs) like SQL and in other visual systems patterned after or similar to query-by-example (QBE) makes use of joins defined either during data definition or during query formulation. ACCESS and PARADOX are examples of QBE systems. Recent QBE implementations, for example in ACCESS, are able to perform joins once the tables to be joined have been specified by the user. This requires the joins to have been defined as ``relationships'' during table creation. Where needed joins are not defined, possible joins can be ``suggested'' to the user. The domain types of attributes can be used for this task. The existing commercial systems are unable to select joins automatically for the user. The ability to select definite joins is tantamount to specifying a particular query path; this requires the use of meta-knowledge about the schema in the form of the meaning of a query path to ensure the semantic correctness of the selected path. Such meta-knowledge is lacking in existing LKL and QBE systems.


We therefore ask the following thematic question: Given the rich semantics of data models like the E–R model, is it possible to exploit the meta-knowledge about these models to reduce the cognitive load faced by end-users to facilitate query formulation? This question deals with the issue of further enhancements to the query formulation methods in commercially popular LKL, drag-and-drop and point-and-click query tools.

Since the mid-1980s a number of approaches using meta-knowledge about DB schemas to enhance the facility of end-users in query formulation have been proposed (for example, [1,14,15,35,44,47]). The recent prototypical approaches in [14,15,44,47] elevate query formulation from the logical level to the conceptual schema level by supporting the direct use of concepts and abstractions on conceptual schemas in query statements. Query formulation can be further facilitated in these systems by reducing the cognitive workload entailed by their use. One way this can be achieved is through minimizing what is required to be specified by the end-user. The system can then use schema meta-knowledge to determine and select a semantically correct query. CQL is based on this approach.

2.1. Structure of the conceptual query language

Current commercially popular LKL and QBE systems require users to explicitly mention all the tables needed by the system to solve the problem. Furthermore, in LKL and QBE systems the user must also specify query paths. This explicit navigation is a major source of difficulty for a typical end-user. In our proposed language, CQL, this cognitive burden in formulating DB queries is reduced by migrating much of this task to the underlying DBMS. Unlike LKL and QBE systems, query formulation in CQL does not require the user to specify all the tables needed to solve a query. Also, the user does not have to specify query paths. CQL is, therefore, particularly suitable for business and administrative end-users who, generally speaking, are not programmers.

In CQL only the entities and conditions explicitly mentioned in query statements are required to be specified in their formulations. CQL has a simple and straightforward query syntax. The basic (canonical) form of a CQL query, Q, can be expressed as

$$\text{Query} := Q(t_E, S_E, \{C_{sel}, C_{sem}\})$$

where $t_E$ is the set of targets (entities and attributes about which information is sought), $S_E$ the set of sources (entities and attributes about which information is given or known), $C_{sel}$ the selection criteria/conditions, and $C_{sem}$ the semantic relationships between implicit sources and implicit targets, and the entities semantically adjacent to them on the application conceptual schema. An implicit source is either a source or a target entity of the query. An implicit target may be the target of the query or an intermediate entity that is neither the source nor the target of the specified query, but lies on the query path. As discussed later, the specification of intermediate entities in CQL is optional.

In formulating a query with CQL, therefore, the end-user only needs to state $t_E$, $S_E$, $C_{sel}$ and $C_{sem}$. The formulated query is then automatically passed to the underlying DBMS to determine and select the query path. The CQL system uses semantic information about the schema to perform these tasks. This information is in the form of the semantic roles played by schema entities in their relationships with other entities.


2.2. Query abbreviation in the conceptual query language

Concept-based or conceptual query interfaces reduce the cognitive load in querying DBs by allowing users to directly use constructs from conceptual schemas [2–4,13,41,47]. As exemplified in [14], instead of specifying the relational condition ``Where s.sno = sp.sno and sp.pno = p.pno'', concept-based interfaces would allow for a more natural specification like ``Where Supplier supplies Parts''. The CQL approach provides additional enhancement to this. Where intermediate entities exist on the query path between Supplier and Parts, CQL uses built-in meta-knowledge about the application schema to determine and select the correct intermediate entities. Therefore, in comparison to LKL and QBE queries, conceptual queries in CQL tend to be highly abbreviated, since the user is not required to specify the entire query path. The main problem with abbreviated queries is to derive the corresponding semantically correct full queries [35]. This concern naturally carries over to CQL queries. In this section we use an illustration to explain what CQL is, what its structure is and what it is trying to achieve. The illustration is based on Fig. 1, which is a semantically constrained entity-relationship diagram (SCERD)¹ of a university department.

Fig. 1. Semantically enhanced E–R diagram of a university department schema.

¹ SCERD contains other constructs that are used for updates. These have been left out in Fig. 1, since they are not pertinent to the discussion here.


In SCERD, entity types in the schema bear explicitly named relationships, or associations, among themselves. Each relationship has a semantic meaning. Double-headed arrows are used in an SCERD to indicate that the entities at both heads of the arrows have a direct semantic relationship, and the arrow-heads are labeled with the roles, e.g., works-for, can-teach, advises, etc., played by the entities in specific relationships. The association semantics of the relationships involving entities are constrained by the roles the entities play in the particular relationship. In SCERD, the meaning of the links between entities, therefore, lies in the form of roles. CQL supports the direct use of SCERD constructs in query formulation.

Example. Suppose the following query is posed and specified on Fig. 1:

Query 1: What course(s) is Marshall taking from associate professor `Jones'?

An abbreviated CQL formulation of this query requires the user to specify only the stated entities Student, Teacher and Course along with a set of selection predicates on these entities. The system is then required to chart one or more paths through the conceptual schema from Student and Teacher to Course. We refer to such paths as derived paths. In addition to path derivation, the system must also be capable of performing any needed operations, e.g., conjunction or disjunction, on the derived paths. In this case, the meaning of the desired query demands that the sub-paths $\text{Student} \to \cdots \to \text{Course}$ and $\text{Teacher} \to \cdots \to \text{Course}$ be derived and conjunctively combined, where `$\cdots$' indicates segments of the sub-paths that must be determined by the system. Furthermore, these segments must be such that the meaning of the resulting path is the same as that of the desired query. Clearly, the sub-path $STD|_{\text{enrolled-in}} \to CR|_{\langle P \rangle} \to C$ is semantically correct. In this notation $P = \{p_1, p_2, \ldots, p_n\}$ is a set of paths, and $|_{\text{enrolled-in}}$ denotes the role played by the Student entity on that path. The path derives its meaning from the totality of the semantics of the roles played by all the entities on the path.

An examination of Fig. 1 shows that multiple paths exist between Student and Course and also between Teacher and Course. What complicates the problem here is that not all the paths have the same meaning. For example, the semantics of $STD|_{\text{adviced-by}} \to T|_{\text{can-teach}} \to C$, i.e., the sub-path leading from Student to Course via Teacher, deals with the adviser–advisee relationship, and not with students taking classes. It would be semantically incorrect for the system to include this sub-path in constructing the query path.

The task of the system, then, is twofold: (1) To determine $P_I \subseteq P$ and $P_{II} \subseteq P$ such that for each $p_i \in P_I$ and $p_k \in P_{II}$, $STD|_{\langle P_I \rangle} \to C$ and $T|_{\langle P_{II} \rangle} \to C$ are semantically correct. In CQL, meta-knowledge (in the form of the semantics of roles) about the relationships that the entities participate in is used to resolve this path ambiguity problem. (2) To select a $p_i$ and a $p_k$ from all the candidate paths in (1). A modification of the path selection algorithm in [45] is used for this. In the rest of the paper, the formal basis of CQL is presented. But first, we discuss some related studies in the next section.
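The path-ambiguity problem described above can be illustrated with a small sketch. The edge list below is hypothetical: it reuses the roles named in the text (enrolled-in, adviced-by, can-teach, teaches), while the roles `for-section` and `section-of` and the direction of each edge are assumptions made only for illustration, not a transcription of Fig. 1. The role-matching filter is likewise a simplified stand-in for the role semantics used by CQL, not the algorithm of [45].

```python
# Hypothetical role-labelled schema graph: (entity, role played by entity, next entity).
EDGES = [
    ("STD", "enrolled-in", "CR"),
    ("CR", "for-section", "SEC"),   # assumed role name
    ("SEC", "section-of", "C"),     # assumed role name
    ("STD", "adviced-by", "T"),
    ("T", "can-teach", "C"),
    ("T", "teaches", "SEC"),
]


def simple_paths(edges, start, goal):
    """Enumerate all cycle-free paths as lists of (entity, role) steps."""
    adjacency = {}
    for src, role, dst in edges:
        adjacency.setdefault(src, []).append((role, dst))
    results = []

    def walk(node, visited, steps):
        if node == goal:
            results.append(steps)
            return
        for role, nxt in adjacency.get(node, []):
            if nxt not in visited:
                walk(nxt, visited | {nxt}, steps + [(node, role)])

    walk(start, {start}, [])
    return results


def render(steps, goal):
    """Print a path in an ASCII form of the STD|role -> ... -> C notation."""
    return " -> ".join(f"{entity}|{role}" for entity, role in steps) + f" -> {goal}"


def starts_with_role(steps, stated_role):
    """Keep a path only if its first role is the one stated in the query."""
    return bool(steps) and steps[0][1] == stated_role


student_paths = simple_paths(EDGES, "STD", "C")
taking = [p for p in student_paths if starts_with_role(p, "enrolled-in")]
for path in taking:
    print(render(path, "C"))
```

In this toy graph three paths lead from STD to C, but only one of them begins with the stated role enrolled-in; the two paths that pass through the adviser relationship are discarded.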

3. Previous work on conceptual query formulation and semantics based querying

As already mentioned, the main goal in developing concept-based query languages is to provide end-users with high-level, easy to use, and user-friendly interfaces for data manipulation. As far as we are aware, the universal relation (UR) interface [32,33], RIDL [20,38] and NIAM [26,40,55,58] were among the earliest efforts in that direction. An examination of existing DBQLs seems to indicate a continuing trend in this direction. We examine only a sample set of existing related studies in this section.

3.1. Enabling data manipulation through semantic paths on conceptual schemes

Chang and Sciore [15] propose the UR with semantic abstraction (URSA) model, which is an extension of the UR interface. Instead of demanding a universally unique role for each attribute, as in the UR approach, the URSA model requires this uniqueness of role only within a limited set, called closure, of entities. Querying in URSA is based on the UR query paradigm. Its referencing scheme therefore forces a QUEL-type and an SQL-type syntax. This may render it unsuitable for end-users in general. Peckham et al. [44] propose a DB design paradigm that abstracts the relationship semantics of application conceptual data models and uses this as a predictor of query and update paths.

Peckham et al. show that association roles in semantic schemas define connection paths between objects, and these connections can be used to enable data manipulation. The URSA study shows that the semantics of the association among schema entities can be used to ensure the semantic correctness of queries. CQL extends these ideas by showing that the connection paths have meanings that are derived from the semantic meaning of the association roles, and that the path-meanings can be used to determine and select the correct paths of abbreviated conceptual queries.

The Intuitive system [46] defines a very intuitive architecture for information retrieval that comprises four main modules. The multimodal interaction manager supports request specification using speech and pointing-and-clicking; the end-user component provides users with a visual interface and functionality for using large heterogeneous DBs. The third module, the intelligent dialog manager, interprets users' requests according to the task being performed by the user, while the fourth module, the data access layer, links the different components of the system together.

Although the Intuitive system is aimed at end-user interaction with heterogeneous DBs, it is generic enough for non-heterogeneous and single DB scenarios. Each functional aspect of CQL can be associated with a component of Intuitive. For example, the multimodal interaction manager provides the functionality of computer-supported query formulation in CQL. CQL's set-handler can be identified with the dialog manager.

The point-and-click mode of request formulation in Intuitive presents an ER schema to the user, who can then specify a query by selecting the subschema defining the desired query-path. Interesting similarities and differences between Intuitive and CQL exist here: With Intuitive, to formulate the query on persons who appear in an interview, the user highlights the entities ``Person'' and ``Interview'' and the relationship ``appears_in'' linking the two entities. In CQL this is specified as ``Person appears_in Interview''. For more complex queries involving longer paths, the Intuitive user still highlights the entire path on the schema, whereas the CQL user does not. Intuitive supports multimedia data, text retrieval from documents, and exploratory search of hypertexts. Clearly, Intuitive is a much more comprehensive system than CQL, which is currently narrowly focused on the manipulation of data in a single DB via a semantic data model.


3.2. Query-path specification, construction and selection

Query formulation imposes a certain level of cognitive burden on users and therefore enhances or degrades the ease of use of a query language [10,14,43,47,57]. As exemplified by ConQuer-II [5], the graphical interface proposed by Mannino and Shapiro [34], and Graph Model [11], the common approach to query formulation requires the user to specify each query-path in its entirety.

ConQuer-II [5] is a commercial concept-based query language based on the object-role modeling (ORM) paradigm [24–28]. Like SCERD, ORM models applications in terms of the semantic roles played by objects and entities in relationships. While SCERD is an enhancement of the ER model, ORM is based primarily on a binary data model. ConQuer-II allows queries to be formulated via paths through the conceptual schema. The query paths are constructed from the semantic roles of objects and entities. Data manipulation in the system proposed in [34] involves finding a path from a set of starting nodes through possible intermediate nodes and edges to a set of terminating nodes. In Graph Model, entities are conceived of as nodes and link semantic types as edges in a graph. Query formulation involves graphically selecting a set of source and target nodes, then drawing a set of edges between the selected sets of nodes, and finally specifying for each node a set of data retrieval criteria. Users select each node and edge on the graph-path between the source and the target. Graphs are manually manipulated until the desired query is obtained.

Although CQL adopts these basic ideas, it extends them by requiring users to specify only the endpoints, i.e., the starting and terminating entities and relationship roles. The CQL system automatically deduces the correct intermediate nodes to use on a given query-path. Therefore, although CQL also allows queries to be formulated via paths through the conceptual schema, users are not required to specify paths in their entirety.

Vizla [6] is a visual query language interface for the information control prototyping language SF [7]. In Vizla, a database is abstracted as a collection of sets (entities) and functions that map from this collection of sets to auxiliary sets (attributes). Queries are formulated in Vizla by pointing to representations of functions, their domains and codomains, or subsets of the domains and codomains, and to various operators in a conceptual model of a database. The items selected in this way are displayed and assembled graphically in a workspace, or window.

The workspace concept is used in Vizla to reduce the cognitive burden query formulation imposes on end-users. It achieves this by allowing users to separate querying into sequences of small steps, save intermediate results of such sequences, and combine the intermediate results into final results. Ad-hoc queries can therefore be formulated and processed in this manner. This is an approach that we feel can be adopted, with certain modifications, by CQL to facilitate query formulation. The query-formulation-by-pointing approach in Vizla could be tedious and unappealing for complex queries with long query-paths through the schema. This is because users must point to the entire paths of queries and all the functions and operators needed for computation of the paths. The abbreviated querying approach in CQL, wherein only the terminal nodes and links of interest are selected, cuts down on the number of operations that users must perform and thereby improves on the Vizla approach.

Vizla is a full-fledged, self-standing query language. On the other hand, CQL in its current prototype stage is a front-end to an underlying full-fledged query language like SQL. In addition to its use as a query language, Vizla is also designed to function as a programming language. For this reason, it is aimed at being at least as expressive as a general programming language. As expected, it is perhaps more expressive than CQL. CQL can, therefore, additionally benefit from the work on Vizla as we further develop its interface design and expressive power.

Usually multiple paths exist for a query specified against a database schema. An approach to dealing with multiple paths is through abbreviated queries. In abbreviated queries users mention only a subset of the objects/entities of interest. The system then interprets the query formulation, i.e., finds the necessary connections between the objects. In the path-finding approach in [59], the query-induced map of the database schema is first pruned to make the graph less bushy. Then the shortest path in the pruned sub-graph is selected. This idea is borrowed by CQL in its path construction and selection algorithm. In [16], all possible paths are displayed; the user then selects a particular path. CQL does not display all possible paths. Instead, from the set of all candidate paths, it selects and displays the natural language transcription of only the minimum cost path.
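The prune-then-select idea can be sketched as follows. This is only an illustration under stated assumptions, not the algorithm of [59] or of CQL: the pruning rule (dropping edges whose role is excluded), the unit edge costs, and the role names `for-section` and `section-of` are all hypothetical, and Dijkstra's algorithm is used here simply as a standard way to obtain a minimum cost path.

```python
import heapq


def prune(edges, excluded_roles):
    """Assumed pruning rule: drop every edge whose role is excluded."""
    return [edge for edge in edges if edge[1] not in excluded_roles]


def min_cost_path(edges, start, goal):
    """Dijkstra over (src, role, dst, cost) edges; returns (cost, entities) or None."""
    adjacency = {}
    for src, _role, dst, cost in edges:
        adjacency.setdefault(src, []).append((dst, cost))
    frontier = [(0, start, [start])]
    settled = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for nxt, step_cost in adjacency.get(node, []):
            if nxt not in settled:
                heapq.heappush(frontier, (cost + step_cost, nxt, path + [nxt]))
    return None


# Hypothetical schema fragment with unit costs.
WEIGHTED_EDGES = [
    ("STD", "enrolled-in", "CR", 1),
    ("CR", "for-section", "SEC", 1),   # assumed role name
    ("SEC", "section-of", "C", 1),     # assumed role name
    ("STD", "adviced-by", "T", 1),
    ("T", "can-teach", "C", 1),
]

unpruned = min_cost_path(WEIGHTED_EDGES, "STD", "C")
pruned = min_cost_path(prune(WEIGHTED_EDGES, {"adviced-by"}), "STD", "C")
print(unpruned)
print(pruned)
```

On this toy graph the unpruned search picks the shorter route through the adviser relationship, while pruning first leaves only the enrolment route, which shows why the order of the two steps matters.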

3.3. Arguing for concept-based query languages

A number of studies have been conducted either to motivate the development of concept-based query languages or to demonstrate their superiority over other query paradigms. The study by Welty and Stemple [57] attempted to find how well users could learn relational query languages. It was concluded that users were having considerable difficulty with relational queries, and that the problem was not limited to any particular relational language. A discussion on comparative studies arguing for concept-based query languages can be found in [14,47]. Both studies compare SQL and the concept-based DBQL called the knowledge query language (KQL) [13]. While the study in [14] was atemporal, that in [47] was temporal in that it studied the effect of time on learning. Both studies showed that users of concept-based query languages outperform SQL users: Irrespective of time, the KQL users performed better than their SQL counterparts, with respect to query accuracy, query formulation time, and user confidence. Additional empirical studies suggesting the superiority of concept-based data retrieval approaches over other query approaches can be found in [2–4,31,43]. All of these studies point to the need for alternative query paradigms. Concept-based approaches, as in CQL, clearly offer one such alternative.

In [43], a statistical experiment was conducted to probe end-users' reaction to using CQL, vis-à-vis SQL, as a database query language. The comparison focused on the effect of the two different database query language interfaces on user performance (as measured by query formulation time, query correctness, and users' perception) in a query writing task with varying difficulty levels. Statistically significant differences between the two query languages were found.

The results indicate that end-users perform better with CQL and have a better perception of it than of SQL. There were significantly more accurate formulations with CQL than with SQL. Also, the groups with CQL took significantly less time than the groups with SQL. The CQL subjects perceived their query language to be easier to use than their SQL counterparts felt about SQL; they also felt more satisfied with CQL than the SQL subjects were with SQL. These differences were more pronounced when query-difficulty level was considered. The statistical significance of the differences increased with the complexity of the query. The scores indicate that users are more likely to perform better with CQL than with SQL, and that they are more likely to harbor a more favorable perception of it than of SQL.

As discussed in [9,53], existing approaches aimed at enhancing end-user facility with query tools sacrifice expressive power for ease-of-use. This limits their use and applicability. This restriction is removed in CQL, since it is designed to be both easy to use and expressively powerful. As we show in this paper, CQL is founded on a strong formal basis. CQL is formally defined in the next section.

4. Formal definition of CQL query context²

CQL queries are specified and submitted in an application context consisting of a conceptual schema based on a semantic data model, an implied logical schema, and an underlying DB, which physically exists. In this study we assume a SCERD. A context is therefore defined as consisting of a conceptual schema and a database:

context ::= {conceptual_schema, database}
conceptual_schema ::= {conceptual_schema_name | entity_type, semantic_role, relationship, cardinality}
conceptual_schema_name ::= {name}

The university department SCERD in Fig. 1 shows our application conceptual schema. This is seen to consist of entity-types, and semantic roles played by related entities. The relationship cardinalities are also indicated. An entity is seen to have a name and a set of defining attributes:

entity_type ::= {entity_type_name, attribute}
entity_type_name ::= {name}
attribute ::= {attribute_name, value}
value ::= {data_type | expression}
data_type ::= {boolean | number | integer | real | character | memo}

For Fig. 1 the entity-types are Student (STD), Course-Registration (C-R), Section (SEC), Course (C), Teacher (T) and Secretary (SEK). The attributes for each entity-type are also shown in the figure.

The semantic role of an entity in its direct association with another entity defines the role played by the entity in the relationship. The cardinality of the association defines the multiplicities of the relationship:

semantic_role ::= {semantic_role_name | semantic_role, entity_type_name}
semantic_role_name ::= {name}
relationship ::= {relationship_name | relationship, entity_type_name}
cardinality ::= {integer}

² The symbols used here are defined in Appendix A.


As an example, Fig. 1 shows that the Student entity plays the semantic role adviced-by in its direct association with the Teacher entity. Conversely, the Teacher entity plays an advises role in the relationship.
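One possible in-memory reading of the conceptual-schema part of the context is sketched below. The class and method names (`ConceptualSchema`, `add_role`, `roles_played_by`) are hypothetical and chosen only for illustration; the paper defines the grammar, not an implementation. Cardinalities and the database half of the context are omitted to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Attribute:
    name: str
    data_type: str  # one of the data_type alternatives, e.g. "character"


@dataclass
class EntityType:
    name: str
    attributes: list = field(default_factory=list)


@dataclass(frozen=True)
class SemanticRole:
    name: str         # role name, e.g. "adviced-by"
    player: str       # entity type that plays the role
    counterpart: str  # entity type at the other end of the association


@dataclass
class ConceptualSchema:
    name: str
    entity_types: dict = field(default_factory=dict)
    roles: list = field(default_factory=list)

    def add_entity(self, entity):
        self.entity_types[entity.name] = entity

    def add_role(self, role):
        # Both ends of a role must already be declared entity types.
        for end in (role.player, role.counterpart):
            if end not in self.entity_types:
                raise KeyError(f"unknown entity type: {end}")
        self.roles.append(role)

    def roles_played_by(self, entity_name):
        return [r.name for r in self.roles if r.player == entity_name]


schema = ConceptualSchema("university_department")
schema.add_entity(EntityType("Student", [Attribute("S-name", "character")]))
schema.add_entity(EntityType("Teacher", [Attribute("T-name", "character")]))
schema.add_role(SemanticRole("adviced-by", "Student", "Teacher"))
schema.add_role(SemanticRole("advises", "Teacher", "Student"))
print(schema.roles_played_by("Student"))
```

Storing each role with the entity that plays it mirrors the observation that Student and Teacher play different roles (adviced-by and advises) in the same association.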

Our logical schema is assumed to be relational, since for this study CQL is implemented on top of a relational DB. Since the user specifies CQL queries directly to the conceptual schema, there is no actual relational schema in the context. It is instead deduced from the conceptual schema by the translation mechanism that we discuss later. For this reason we view the logical schema as being implied by the conceptual schema. Given an application conceptual schema, therefore, we assume that a database is uniquely identified:

database ::= {database_name | database, relation}
database_name ::= {name}
relation ::= {relation_name | relation, column}
relation_name ::= {name}
column ::= {column_name | column, value}
column_name ::= {name}

All the names in the context are strings of characters:

name ::= {word | name || word}
word ::= {string}
string ::= {character | string, character}

4.1. Definition of CQL queries

A CQL query is composed of lexical terms that are semantically meaningful in the context of an application. A query comprises five constructs: target entities, source entities, intermediate entities, semantic relationships, and selection conditions. A query term derives its meaning from the construct to which it belongs. The specification of intermediate entities is not necessary, but optional. Consequently, the canonical form of a CQL query consists of four blocks:

query ::= {target_block, source_block, [intermediate_block], semantic_relationship_block, selection_condition_block | query}

(The intermediate block is enclosed within [ ] to indicate that it is optional.)

Example. According to the definition of a query, our example Query 1 can be stated as

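$$\text{Query} = Q\{\, C\langle \text{C-name} \rangle,\ (STD\langle \text{S-name} = \text{`marshall'} \rangle,\ T(\langle \text{T-name} = \text{`jones'} \rangle, \langle \text{T-title} = \text{`associate professor'} \rangle)),\ (-,\ STD\ \text{enrolled-in}\ CR\ \text{And}\ T\ \text{teaches}\ SEC) \,\}.$$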

This query statement transcribes into English as: ``Find the course(s) that student Marshall is taking from teacher `Jones' whose title is `associate professor'.''


A target is an entity whose attribute value is sought:

target_block ::= {target, attribute}
target ::= {target_entity_name}
target_entity_name ::= {name}

Example. Query 1 consists of the single target Course, with a target attribute of C-name.

A source is an entity having a known or given attribute value:

source_block ::= {source, attribute}
source ::= {source_entity_name}
source_entity_name ::= {name}

Example. Query 1 has the two sources Student and Teacher, with student_name and teacher_name as the respective source attributes.

A semantic relationship represents a direct association between entities. It is defined by the semantic roles played by the associated entities. In a query formulation only the semantic relationships involving the target entities and the source entities are stated by the user. Intermediate semantic relationships lying between targets and sources are deduced by the system. As mentioned above, the specification of intermediate entities and their semantic relationships is therefore not necessary, but optional in CQL:

semantic_relationship_block ::= {semantic_relationship, logical_operator}
semantic_relationship ::= {target_entity semantic_role | source_entity semantic_role | [intermediate_entity semantic_role] | null_semantic_relationship}

Example. The example query has the following two semantic relationships: STD enrolled-in CR And T teaches SEC.

The selection condition of a CQL query specifies a constraint on the values of the target attributes:

selection_condition_block ::= {selection_condition expressions}
expression ::= {algebra_expression | null_expression}
null_semantic_relationship ::= {empty_relationship}
null_expression ::= {empty_expression}

Example. Our example query does not contain any selection criteria. Thus the selection condition is a null expression.
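The block structure defined above can be pictured as a small record type. The following Python sketch is only an illustration of the grammar, not part of the CQL prototype; the class and field names (`EntityTerm`, `SemanticRelationship`, `CQLQuery`) are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EntityTerm:
    """An entity plus its attributes; a value of None means the value is sought."""
    entity: str
    attributes: Dict[str, Optional[str]]


@dataclass
class SemanticRelationship:
    """One 'entity semantic_role entity' statement of the relationship block."""
    subject: str
    role: str
    obj: str


@dataclass
class CQLQuery:
    """Hypothetical container mirroring the blocks of a CQL query."""
    targets: List[EntityTerm]
    sources: List[EntityTerm]
    relationships: List[SemanticRelationship]
    intermediates: List[EntityTerm] = field(default_factory=list)  # optional block
    selection: Optional[str] = None  # None models the null expression


# Query 1: the course(s) student Marshall takes from associate professor Jones.
query1 = CQLQuery(
    targets=[EntityTerm("C", {"C-name": None})],
    sources=[
        EntityTerm("STD", {"S-name": "marshall"}),
        EntityTerm("T", {"T-name": "jones", "T-title": "associate professor"}),
    ],
    relationships=[
        SemanticRelationship("STD", "enrolled-in", "CR"),
        SemanticRelationship("T", "teaches", "SEC"),
    ],
)
```

The empty `intermediates` list and the `None` selection reflect the fact that Query 1 names no intermediate entities and has a null selection condition.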

4.2. Conceptual mapping of abbreviated queries

The mapping of a CQL query to the underlying DB is a two-phase process. First, there is the query-to-conceptual-schema mapping, which derives the conceptual path of the query in the form of a sub-schema of the application schema. We refer to the resulting sub-schema as the conceptual answer, $S_a$, of the query. It follows, from the definition of a conceptual schema, that

$$S_a ::= \{\text{conceptual schema}\}$$

To be able to formally define this mapping process, we first define what we mean by a path on a conceptual schema. Our definition assumes a linear path.

The mix-fix notation in [51] can be used to formally describe a linear path on a conceptual schema. If $R = \{R_1, R_2, \ldots, R_m\}$ and $E = \{E_1, E_2, E_3, \ldots, E_n\}$ are sets of names, then the mix-fix notations on these sets can be used to describe different paths involving the elements of the sets. Let $r = [r_1, r_2, \ldots, r_m]$ and $e = [e_1, e_2, e_3, \ldots, e_n]$ represent ordered instances of $R$ and $E$. The mix-fix expression on $r$ and $e$ is given by:

$$p_i = \langle \{e\}, \{r\}, \mathit{mfix} \rangle,$$

where $\{e\}$ and $\{r\}$ are respectively sets of schema entities and semantic roles on path $p_i$.

$$p_i = \mathrm{MFix}([\{r_j\}], [\{e_i\}]) = \mathrm{MFix}([r_1, r_2, \ldots, r_m], [e_1, e_2, e_3, \ldots, e_n]) = e_1 r_1 e_2 r_2 e_3 \ldots e_{n-1} r_m e_n.$$

Suppose, for example, that $R$ is a set of semantic roles (also referred to as semroles) played by schema entities and $E$ a set of schema entities, such that $r_1$ = take(s), $r_2$ = taught-by, $e_1$ = Student, $e_2$ = Course, and $e_3$ = Teacher. The mix-fix of [take(s), taught-by] and [Student, Course, Teacher] is:

$$\mathrm{MFix}([\text{take(s)}, \text{taught-by}], [\text{Student}, \text{Course}, \text{Teacher}]) = \text{Student take(s) Course taught-by Teacher}$$

Note that this describes a schema path linking the entities Student and Teacher via the intermediate entity Course. The inverse (or reverse) of this path can be verbalized as Teacher teaches Course taken-by Student. Both paths are semantically equivalent.
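The interleaving performed by the mix-fix operation can be made concrete with a short Python sketch. The function name `mfix` and the length check (one more entity than roles) are assumptions made for this illustration.

```python
from typing import Sequence


def mfix(roles: Sequence[str], entities: Sequence[str]) -> str:
    """Interleave entities and semantic roles: e1 r1 e2 r2 ... r_m e_n."""
    if len(entities) != len(roles) + 1:
        raise ValueError("a linear path needs exactly one more entity than roles")
    parts = [entities[0]]
    for role, entity in zip(roles, entities[1:]):
        parts.extend([role, entity])
    return " ".join(parts)


forward = mfix(["take(s)", "taught-by"], ["Student", "Course", "Teacher"])
inverse = mfix(["teaches", "taken-by"], ["Teacher", "Course", "Student"])
print(forward)  # Student take(s) Course taught-by Teacher
print(inverse)  # Teacher teaches Course taken-by Student
```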

If $\{P\}$ is a set of connected paths, i.e., $\{P\} = \{p_i, \text{ for } i = 1, 2, \ldots, n\}$, then the conceptual answer of a CQL query is given by:

$$S_a = \{ L\{p_i\} : p_i \in \{P\} \mid \{P\} \subseteq S_c \},$$

where $L$ is a path operation on the set of paths $p_i$ and $S_c$ is the database conceptual schema.

According to this expression, the sub-schema defining the conceptual answer of a CQL query is a set of connected paths that are a subset of the DB conceptual schema.

Example. For Query 1 the answer sub-schema is shown in Fig. 2. The linear paths satisfying the query on this figure are expressed in mix-fix notation as:

$p_1$ = student enrolled-in course-registration is-enrollment-for course
$p_2$ = student enrolled-in course-registration belong-to section is-section-of course
$p_3$ = teacher teaches section is-section-of course
$p_4$ = teacher teaches section has course-registration is-enrollment-for course

The conceptual answer therefore consists of the logical combination of $p_1$, $p_2$, $p_3$ and $p_4$. That is:

$$\{P\} = L\{p_1, p_2, p_3, p_4\}.$$

In the next section, we expand on this to show the exact logical combination of the paths.

The second phase in the mapping of CQL queries to the application DB consists of mapping the conceptual answer $S_a$ to the DB, which is assumed to be relational for this study. This conceptual answer-to-logical database mapping process employs standard E-R (EER)-to-relation transformation rules that are found in any good database book, e.g., [21,37]. We, therefore, do not present the rules here, but instead discuss the mechanism of the mapping.

Path $p_i \in \{P\}$ if it is the conceptual answer or a subset of the conceptual answer. This requires $p_i$ to be mapped onto existing DB relations. From the mix-fix notation, $p_i$ has the form $p_i = e_1 r_1 e_2 r_2 e_3 \ldots e_{n-1} r_m e_n$. The mapping mechanism is as follows:
1. Each $e_i$ is assumed to be simple and can therefore be mapped directly onto a relation.
2. Each $e_1 r_1 e_2$ defines a binary relationship between $e_1$ and $e_2$ and a join, at the logical level, between the associated relations.
3. Higher-order relationships are recast as a set of inter-related linear paths interlocking the participating entities.

From the above steps, the relations pertinent to the specified query are identified and extracted.

Fig. 2. Search space, or candidate solution paths, for Query 1.

Example. For our Query 1, the pertinent tables are COURSE, SECTION, COURSE-REGISTRATION, STUDENT, TEACHER. In a later section on SQL statement generation, we show how these tables are input into the SQL statement generating process.
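The first two steps of the mapping mechanism (one relation per entity, one join per adjacent entity pair) can be sketched in Python as follows. The dictionary `TABLE_OF`, which ties the abbreviated entity names to table names, and the helper names are assumptions made for this illustration; join key columns are deliberately left out.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed mapping from abbreviated entity names to relation names.
TABLE_OF: Dict[str, str] = {
    "STD": "STUDENT",
    "CR": "COURSE-REGISTRATION",
    "C": "COURSE",
    "SEC": "SECTION",
    "T": "TEACHER",
}


def path_to_relations(path: Sequence[str]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Map each entity on a linear path to a relation and each adjacent pair to a join."""
    tables = [TABLE_OF[entity] for entity in path]
    joins = list(zip(tables, tables[1:]))
    return tables, joins


def pertinent_tables(paths: Sequence[Sequence[str]]) -> List[str]:
    """Collect the distinct relations touched by a set of paths, in first-seen order."""
    seen: List[str] = []
    for path in paths:
        for table in path_to_relations(path)[0]:
            if table not in seen:
                seen.append(table)
    return seen


# The four linear paths of Query 1, written as entity sequences.
paths = [
    ["STD", "CR", "C"],
    ["STD", "CR", "SEC", "C"],
    ["T", "SEC", "C"],
    ["T", "SEC", "CR", "C"],
]
print(pertinent_tables(paths))
```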

In the next section, we discuss the main functional modules of the CQL system. These modules take as input the query formulated by the user and process it to output (1) a recast of it in natural language, and (2) an SQL statement equivalent of the query.

5. Functional modules

CQL is founded on set and graph theories. As Fig. 3 shows, the CQL system comprises the following three functional components: (1) input functions, (2) processing functions, and (3) output functions. These components are discussed in this section. Examples are used to illustrate the different functional concepts.

5.1. Input functions

Query formulation in CQL assumes the existence of a schema (like an E-R diagram) based on a semantic data model, against which queries are specified. In formulating a query, the user is aided by a computer-supported query formulation system, which is discussed next.

5.1.1. Query formulation surrogate system (QFSS)

CQL uses a QFSS to help users tailor their queries to the target DB. QFSS provides information that enables users to reduce what Belkin [5] describes as users' anomalous state of knowledge. This is achieved by reducing the semantic gap between user query formulations and the logical and conceptual states of the DB. In using CQL, users can familiarize themselves with the conceptual aspects of the target DB through an interaction with QFSS. QFSS provides users with helpful information on the schema concepts and constructs that they must be familiar with in order to be able to formulate queries. To use QFSS, the user clicks on the item about which information is needed. A window is then opened showing information on the clicked item. Information on the following CQL constructs and concepts is provided:

Selection operators. These are the operations and functions supported by CQL. They are used in the selection conditions of queries. Arithmetic and logical operators (+, -, <, =, etc.) and aggregate functions (sum, count, average, etc.) belong in this group. A more detailed coverage of this is given in a later section.

Entity item. Clicking on a schema entity opens up a window showing the attributes of the entity, other entities semantically associated with the clicked entity, and the semantic roles the entity plays in its relationships with other entities.

Attribute list. This list shows the attributes of all the schema entities. Clicking on an attribute on the list opens an attribute window showing the entities having the clicked attribute, the pairs of entities related via the attribute, and the semantic roles involving the entities.


Semantic role list. The semantic roles of the schema entities are contained on this list. If a role on this list is clicked, a window opens to show the pairs of entities related through the clicked role, and the semantic roles involving the pair of entities.

In providing the query formulation surrogate system, the intention is to provide inexperienced or novice users with help to ease the task of formulating queries. The use of this help system is not mandatory. Expert or knowledgeable users can bypass it and directly formulate their queries.

5.1.2. Browsing facility of the QFSS

In specifying queries in CQL, users can write the query directly on a CQL input interface in a query formulation window (QFW), or workspace. In the current prototype of CQL, the interface shown in Fig. 4 is used for this mode of interaction. Alternatively, they can compose the query with the help of the QFSS, by clicking on the desired items. The chosen items are written to the input interface in the QFW. In the QFW, the query can then be used as is, if the user believes it is semantically equivalent to the desired query, or modified as necessary. To facilitate the latter mode of query specification via the QFSS, the QFSS windows are hyperlinked to support navigation between windows. In the following, we illustrate the use of the QFSS with our example query.

Fig. 3. Functional modules of CQL.

Fig. 4. CQL query formulation interface.


Fig. 5 shows an example of the CQL query screen. To use QFSS help, the user clicks the upper button of this figure. This action takes the user to the top screen of Fig. 6. The user can request help on schema information by clicking one or more of the buttons on this screen. Clicking on the "Entities" button, for example, pops up the listing of entities in the bottom screen of Fig. 6. For our example query one of the entities to be selected is Student. On clicking this entity, the descriptive attributes of the entity are displayed (see Fig. 6). The user then picks the desired attributes of Student. All the selected items are written to the input interface shown in Fig. 4. The rest of the form can be filled similarly.

5.2. Processing functions

CQL uses the meta-schema diagram in Fig. 7. This figure shows the different types of meta-schema information that support the internal processing of CQL queries. The structures in Fig. 7 are schema-dependent, but query-invariant. Hence they are created during design time, are persistent, and are defined as follows:
1. Entity-attribute. Comprehensive list of attributes and the entities with which they are associated in the application DB.
2. Entity-table. Comprehensive list of entities in the application DB.
3. Link adjacency matrix (LAM). Contains information about pairs of entities that are logically adjacent, i.e., that bear a direct semantic relationship on the DB schema.
4. Connectivity. Lists pairs of entities that can be connected directly (as in the LAM) or indirectly through a set of intervening entities.
5. Link semantic dictionary (LSD). Specifies the semantic relationships that exist between two directly connected entities.
6. Join-matrix. Basically an extension of the LAM. It details the relationships of LAM entity-pairs by indicating the key columns used for joining entities (or tables at the virtual logical level).

Fig. 5. CQL query screen.

Fig. 6. QFSS help window on schema entities.


Furthermore, the figure shows the relationships between the meta-data items. For example, it shows that an attribute is part of an entity, etc. One use of this kind of information is the design of meta-data integrity checks and error messages. As an example, since every attribute belongs to an entity, it follows that any attribute specified in a query must exist on the schema as an attribute of an entity on the schema. This and similar checks are performed prior to the processing of queries.

Input validation. The first step in processing a CQL query is input validation. For this purpose, the schema is internally represented by the entity table and the entity-attribute table, which are defined as follows:

Entity_Table ::= {entity-type, data-type}
Entity_Attribute_Table ::= {entity_type_name, attribute, data_type}

The validation of a specified entity is done by checking the entity against the entity-table, to ensure that it actually exists on the schema. Each specified attribute is similarly validated against the entity-attribute table. Where a specified entity or an attribute does not exist on the DB schema, an error message is returned to the user. Validation is automatically done by the CQL system.
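A minimal Python sketch of this validation step is given below. The table contents are illustrative only (they cover the entities and attributes named in Query 1), data types are omitted, and the function name and the wording of the error messages are assumptions rather than the behaviour of the CQL prototype.

```python
from typing import List, Optional, Set, Tuple

# Illustrative stand-ins for the entity table and the entity-attribute table.
ENTITY_TABLE: Set[str] = {"STD", "T", "C", "SEC", "CR"}
ENTITY_ATTRIBUTE_TABLE: Set[Tuple[str, str]] = {
    ("STD", "S-name"),
    ("T", "T-name"),
    ("T", "T-title"),
    ("C", "C-name"),
}


def validate(terms: List[Tuple[str, Optional[str]]]) -> List[str]:
    """Return one error message per entity or attribute missing from the schema."""
    errors: List[str] = []
    for entity, attribute in terms:
        if entity not in ENTITY_TABLE:
            errors.append(f"Unknown entity: {entity}")
        elif attribute is not None and (entity, attribute) not in ENTITY_ATTRIBUTE_TABLE:
            errors.append(f"Unknown attribute {attribute} for entity {entity}")
    return errors


# Query 1 passes; a misspelled entity and a wrong attribute are reported.
ok = validate([("C", "C-name"), ("STD", "S-name"), ("T", "T-name"), ("T", "T-title")])
bad = validate([("STUDNT", "S-name"), ("C", "T-title")])
print(ok)
print(bad)
```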

Parse input. CQL transcribes specified queries into a set-specification form. The set of sources, targets and conditions is written to a Set Handler, which is a template of semantically typed slots, or place-holders, for sources, targets, and conditions (selection and semantic). The Set Handler is an abstraction for the CQL input form; its internal representation is that of an intermediate set form. The set representation makes it amenable to set theoretical treatment. The purpose of the Set Handler is to decompose specified queries into sub-queries, which are then used to derive the corresponding paths on the DB schema. The Set Handler performs two functions: (1) set-packing and (2) set-unpacking.

Set-packing. Set-packing extracts the specified sources, targets and conditions, and writes them to the query template. This task is performed by the set-packer.

Fig. 7. CQL meta-schema diagram.


Example. For our example query, the Set Handler is packed as:

S = Q{C⟨C-name⟩; [STD⟨S-name='marshall'⟩; T[⟨T-name='jones'⟩; ⟨T-title='associate professor'⟩]]; [-; STD enrolled-in CR And T teaches SEC]}.

According to this, the set-packer identifies and extracts Course as the target, Student and Teacher as the set of sources, and STD enrolled-in CR and T teaches SEC as the semantic relationships.

Set-unpacking. Set-unpacking uses set-theoretic operations to fragment the query written to the set-packer into its subqueries:

Q1 := S1{C⟨C-name⟩; STD⟨S-name='marshall'⟩; -; Std enrolled-in CR} = S1{C; STD; -; Std_sem} and
Q2 := S2{C⟨C-name⟩; T; -; T teaches SEC} = S2{C; T; -; T_sem}

(The "-" indicates an unspecified, possibly null, value.)

Connectivity validation. In processing a CQL query, the system traverses a path between schema entities. The highly abbreviated nature of CQL queries requires the system
1. to first unpack a query into subqueries,
2. to possess the capability of determining the set of semantically correct adjacent pairs of entities on a query path,
3. to ensure that at least one connected path exists between sources and targets.

These functions are discussed here.

Each initial subquery resulting from the unpacking of the query is further decomposed into irreducible subqueries. An irreducible subquery is equivalent to a linear path between a source and a target. The linear paths (or subqueries) are then logically combined into candidate solution paths for the query. The process is illustrated in Fig. 8. According to this figure, for each initial subquery, a set of shortest paths is constructed between the source and the target. The process can be algorithmically described as follows:
1. For each subquery, pick a specified relationship semantic role (for brevity, we simply refer to this as "semantic condition"). Let the chosen semantic condition be tagged as semantic k.
2. Determine the linear paths linking semantic k to the specified target j in the subquery (done by Procedure Tree in Fig. 8):
2.1. Read the entities in positions 1 and 2 of Sempos_Table[k]. (Sempos_Table holds the specified semantic conditions and is defined below.)
2.2. If the entity in position 1 (i.e., entity 1) of Sempos_Table[k] is in the first position of the LAM, generate a directed acyclic graph having the source as its root entity (performed by Procedure Select_LAM).
2.3. Otherwise, if entity 1 of Sempos_Table[k] is not in the first position of the LAM, rearrange the LAM so that entity 1 is in position 1 of the LAM (performed by Procedure Select_DLAM).
3. From the LAM or DLAM, generate a directed acyclic graph from semantic k to the target:
3.1. Tag entity 1 in the LAM or DLAM, as the case may be, as read.
3.2. Update Procedure Tree with the entity in position 2 (i.e., entity 2) of Sempos_Table[k] set equal to entity 1.
4. Repeat step 3 until entity 2 equals the target.
5. Write the linear path generated (the tagged entities) into Validation_PathMatrix (VP matrix), defined below.

The query-invariant LAM is defined as:

LAM ::= {obj1, obj2, cost}
obj1 ::= {entity_name}
obj2 ::= {entity_name}
cost ::= {cost of traversing obj1 to obj2}

Sempos_Table and the VP matrix are volatile, since they are query-variant, and are automatically generated anew for each new query. They are defined as follows:

Semantic Position Table ::= Sempos_Table$(k, e_{k,1}, e_{k,2})$

where $k$ is the $k$th row of the table, $e_{k,1}$ the entity occupying position 1 of the $k$th row of the table, and $e_{k,2}$ is the entity occupying position 2 of the $k$th row of the table.

Fig. 8. Query path generation in CQL.


Entries in this table are of the form $(k, e_{k,1}, e_{k,2})$. This has the mix-fix semantics $e_{k,1}\, r_k\, e_{k,2}$, where, as defined before, $r_k$ is the semantic role associating $e_{k,1}$ and $e_{k,2}$. For example, in the semantic relationship T teaches SEC of subquery Q2 above, T = $e_{2,1}$, SEC = $e_{2,2}$ and teaches = $r_2$. Similarly, for Q1, Std = $e_{1,1}$, CR = $e_{1,2}$ and $r_1$ = enrolled-in. Therefore, from the initial unpacking of our example query, Sempos_Table is instantiated as {(1, STUDENT, COURSE-REGISTRATION), (2, TEACHER, SECTION)}.
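As an illustration of how such a table could be instantiated, the following Python sketch splits the semantic relationship block of Query 1 into rows $(k, e_{k,1}, e_{k,2})$ and keeps the roles $r_k$ alongside. The function name, the assumption that clauses are joined by the literal connective "And", and the assumption that each clause has exactly three whitespace-separated tokens are simplifications made here.

```python
from typing import Dict, List, Tuple


def build_sempos_table(relationship_block: str) -> Tuple[List[Tuple[int, str, str]], Dict[int, str]]:
    """Split 'e1 role e2 And e1 role e2 ...' into Sempos rows and a role lookup."""
    rows: List[Tuple[int, str, str]] = []
    roles: Dict[int, str] = {}
    for k, clause in enumerate(relationship_block.split(" And "), start=1):
        entity1, role, entity2 = clause.split()  # assumes three tokens per clause
        rows.append((k, entity1, entity2))
        roles[k] = role
    return rows, roles


sempos_table, sempos_roles = build_sempos_table("STD enrolled-in CR And T teaches SEC")
print(sempos_table)  # [(1, 'STD', 'CR'), (2, 'T', 'SEC')]
print(sempos_roles)  # {1: 'enrolled-in', 2: 'teaches'}
```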

Validation_PathMatrix (VP Matrix) ::= set{$p_1, p_2, \ldots, p_k$} ³

$$p_i = \text{n-tuple}\{e_{i,1}, e_{i,2}, e_{i,3}, \ldots, e_{i,n}\} = \langle \{e\}, \{r\}, \mathit{mfix} \rangle = \text{mix-fix}([r_1, r_2, \ldots, r_m], [e_1, e_2, e_3, \ldots, e_n])$$

It is seen that $p_i$ is the ordered sequence of entities on the $i$th path from a source to a target. Therefore, the VP matrix holds the set of linear paths (corresponding to an irreducible subquery) from a source to a target. For our running example, the VP matrix holds the values {(C, CR, STD), (C, SEC, CR, STD), (C, SEC, T), (C, CR, SEC, T)}.
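The following Python sketch shows one simple way to enumerate such linear paths from an adjacency structure. It is a plain depth-first enumeration of simple paths and is not Procedure Tree or Algorithm Prune-First; the edge list `LAM_EDGES` is an assumption read off the example paths of Query 1, traversal costs are ignored, and the optional `max_paths` parameter merely mimics the user-specified limit k.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

# Assumed adjacent entity pairs for the example schema (costs omitted).
LAM_EDGES: List[Tuple[str, str]] = [
    ("STD", "CR"), ("CR", "C"), ("CR", "SEC"), ("SEC", "C"), ("T", "SEC"),
]


def build_adjacency(edges: Sequence[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Treat each LAM pair as traversable in both directions."""
    adjacency: Dict[str, Set[str]] = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def linear_paths(adjacency: Dict[str, Set[str]], source: str, target: str,
                 max_paths: Optional[int] = None) -> List[List[str]]:
    """Enumerate simple paths from source to target, shortest first."""
    found: List[List[str]] = []

    def walk(node: str, visited: List[str]) -> None:
        if node == target:
            found.append(list(visited))
            return
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour not in visited:
                visited.append(neighbour)
                walk(neighbour, visited)
                visited.pop()

    walk(source, [source])
    found.sort(key=lambda path: (len(path), path))
    return found if max_paths is None else found[:max_paths]


adjacency = build_adjacency(LAM_EDGES)
# Store each path target-first, as in the VP matrix of the running example.
vp_matrix = [tuple(reversed(path))
             for source in ("STD", "T")
             for path in linear_paths(adjacency, source, "C")]
print(vp_matrix)
```

Under these assumptions the enumeration reproduces the four entries listed for the running example.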

Semantics of an irreducible linear path. As expressed above, linear path $p_i$ is $e_{i,1}\, r_{1,2}\, e_{i,2}\, r_{2,3}\, e_{i,3}\, r_{3,4} \ldots r_{n-1,n}\, e_{i,n}$ in mix-fix notation, where $r_{j,k}$ is the semantic role associating $e_{ij}$ and $e_{ik}$ on the path. This can be expressed alternatively as:

$$p_i = (e_{i,1}\, r_{1,2}\, e_{i,2}) \cap (e_{i,2}\, r_{2,3}\, e_{i,3}) \cap (e_{i,3}\, r_{3,4}\, e_{i,4}) \cdots \cap (e_{i,n-1}\, r_{n-1,n}\, e_{i,n}) = \bigcap\, (e_{ij}\, r_{j,k}\, e_{ik})_{\{j = 1, 2, \ldots, n-1 \text{ and } k = j+1\}}.$$

Since the terminal entities $e_{i,1}$ and $e_{i,n}$ are source and target entities, respectively, the user specifies $(e_{i,1}\, r_{1,2}\, e_{i,2})$ and $(e_{i,n-1}\, r_{n-1,n}\, e_{i,n})$ as semantic role relationships in the query formulation. It follows that $p_i$ can be expressed as:

$$p_i = (e_{i,1}\, r_{1,2}\, e_{i,2}) \cap \{ (e_{ij}\, r_{j,k}\, e_{ik})_{\{j = 2, \ldots, n-2 \text{ and } k = j+1\}} \}.$$

The path segment $(e_{ij}\, r_{j,k}\, e_{ik})_{\{j = 2, \ldots, n-2 \text{ and } k = j+1\}}$ consists of intermediate entities and semantic role relationships and is automatically determined by Algorithm Prune-First, CQL's path finding and selection algorithm given in the appendix. In automatically deducing the intermediate path segments, the value for $e_{ij}\, r_{j,k}\, e_{ik}$ is read from the non-volatile link semantic dictionary (LSD) defined as:

LSD ::= {obj1, obj2, semantic role, inverse semantic role}
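A minimal sketch of such a dictionary lookup is shown below in Python. The entries of `LSD` are taken from the role names used for Query 1 in this paper; representing the LSD as a Python dictionary keyed by entity pairs and the function names `role_between` and `verbalize` are assumptions of this illustration.

```python
from typing import Dict, Sequence, Tuple

# (obj1, obj2) -> (semantic role, inverse semantic role)
LSD: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("STD", "CR"): ("enrolled-in", "enrolls"),
    ("CR", "C"): ("is-enrollment-for", "has"),
    ("CR", "SEC"): ("belongs_to", "has"),
    ("SEC", "C"): ("is_section_of", "consists_of"),
    ("T", "SEC"): ("teaches", "is_taught_by"),
}


def role_between(entity1: str, entity2: str) -> str:
    """Read the role for entity1 -> entity2, using the inverse role if needed."""
    if (entity1, entity2) in LSD:
        return LSD[(entity1, entity2)][0]
    if (entity2, entity1) in LSD:
        return LSD[(entity2, entity1)][1]
    raise KeyError(f"{entity1} and {entity2} are not directly related")


def verbalize(path: Sequence[str]) -> str:
    """Fill in the semantic roles between adjacent entities of a linear path."""
    parts = [path[0]]
    for entity1, entity2 in zip(path, path[1:]):
        parts.extend([role_between(entity1, entity2), entity2])
    return " ".join(parts)


print(verbalize(["STD", "CR", "C"]))  # STD enrolled-in CR is-enrollment-for C
print(verbalize(["C", "SEC", "T"]))   # C consists_of SEC is_taught_by T
```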

Example. For our running example Query 1, the derived semantic paths are:

P1 = ⟨STD enrolled-in CR is-enrollment-for C⟩ (or the inverse ⟨C has CR enrolls STD⟩)
P2 = ⟨STD enrolled-in CR belongs_to SEC is_section_of C⟩ (or the inverse ⟨C consists_of SEC has CR enrolls STD⟩)
P3 = ⟨T teaches SEC is_section_of C⟩ (or the inverse ⟨C consists_of SEC is_taught_by T⟩)
P4 = ⟨T teaches SEC has CR is_enrollment_for C⟩ (or the inverse ⟨C has CR belongs_to SEC is_taught_by T⟩)

³ CQL gives users the option of specifying a value for k, thereby limiting the number of paths generated for each source-target pair. This is especially important for very large DBs (VLDBs).

Candidate path selection. For the purpose of query-validation, we return only one candidate solution path to the user. The underlying cost criterion we use to select a path is the total number of edges (or arcs) on the path, i.e., the path length. (We opt for this simple cost criterion for demonstration purposes only, since the issue of cost criteria is orthogonal to the efficacy of the CQL system. The cost criterion can be readily changed.) The objective is to choose the minimum-cost candidate solution path.

Our path selection approach forms clusters of the irreducible linear paths, with each cluster consisting of the minimum cost linear paths between a source and a target. A candidate path for a query is constructed by selecting one linear path from each cluster and then logically combining them. The minimum cost linear path clusters for our example are:

P1 = ⟨STD enrolled-in CR is-enrollment-for C⟩ (or the inverse ⟨C has CR enrolls STD⟩)
P3 = ⟨T teaches SEC is_section_of C⟩ (or the inverse ⟨C consists_of SEC is_taught_by T⟩)

It can be easily shown that for a multiple-source-single-target (MSST) query, the upper bound on the cost of a candidate path is

$$C_{N,1} = \sum_{i=1}^{N} c(L_i) - \left\{ \sum_{k=2}^{N} \frac{N!}{(N-k)!\,k!}\,(k-1)\, c\!\left(\bigcap_{i=1}^{k} L_i\right) \right\},$$

where $c(L_i)$ is the cost of linear path $L_i$, $\bigcap_{i=1}^{k} L_i$ is the intersection of $k$ linear paths and $N$ is the number of sources. For a multiple-source-multiple-target (MSMT) query, it can be shown that the upper bound on the cost of a candidate path is

$$C_{N,M} = M \times C_{N,1},$$

where $N$ is the number of sources and $M$ the number of targets.

For our example MSST query, $N = 2$ and $M = 1$. If $C_{j,k}$ denotes the cost of the conjunctive solution path $p_j.p_k$, then by our cost criterion, it is seen from the cost function and Fig. 9 that $C_{1,3} = 4$. While we do not pursue it further, it can be shown that other minimum length cost candidate path combinations may exist from the set of clusters. For example, in our case, $p_2.p_3$ and $p_1.p_4$ have costs $C_{2,3} = 4$ and $C_{1,4} = 4$, respectively, and are, therefore, minimum cost candidate paths. Furthermore, it can be shown that the candidate path chosen by our approach is guaranteed to be a member of the set of minimum cost candidate paths. ⁴
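The costs quoted for the example can be checked with a few lines of Python. The sketch assumes that the cost of a combined candidate path is the number of distinct undirected edges in the union of its linear paths; for two sources this agrees with the bound above, which reduces to $c(L_1) + c(L_2) - c(L_1 \cap L_2)$. The function names are introduced here for illustration, and `min_cost_candidates` searches all path combinations rather than only the minimum cost clusters.

```python
from itertools import product
from typing import FrozenSet, List, Sequence, Set, Tuple

Path = Tuple[str, ...]


def edges(path: Path) -> Set[FrozenSet[str]]:
    """Undirected edges between consecutive entities of a linear path."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def combined_cost(*paths: Path) -> int:
    """Edge count of the union of the given linear paths (assumed cost criterion)."""
    union: Set[FrozenSet[str]] = set()
    for path in paths:
        union |= edges(path)
    return len(union)


def min_cost_candidates(clusters: Sequence[Sequence[Path]]) -> List[Tuple[Path, ...]]:
    """Pick one path per cluster and keep the combinations of minimum cost."""
    combos = list(product(*clusters))
    best = min(combined_cost(*combo) for combo in combos)
    return [combo for combo in combos if combined_cost(*combo) == best]


p1: Path = ("STD", "CR", "C")
p2: Path = ("STD", "CR", "SEC", "C")
p3: Path = ("T", "SEC", "C")
p4: Path = ("T", "SEC", "CR", "C")

print(combined_cost(p1, p3), combined_cost(p2, p3), combined_cost(p1, p4))  # 4 4 4
print(combined_cost(p2, p4))  # 5
```

With these definitions, `min_cost_candidates([[p1, p2], [p3, p4]])` returns the three combinations of cost 4 and excludes $p_2.p_4$.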

Natural language validation of CQL queries. To ensure the semantic correctness of constructed queries, user validation is essential in abbreviated queries. In CQL, the semantic roles played by entities in relationships are also used by the system to construct pseudo-natural language explanations of queries. The system-constructed explanations are returned to the user for validation. To facilitate legibility, the system does not generate unnecessarily lengthy sentences. ⁵ According to [18,23,56], this is the recommended and an effective way of obtaining a good compromise between natural language and readability, focus and relevance. The NL aspect of the CQL system is discussed in this section.

⁴ In an alternative path selection approach for CQL, the user specifies the maximum number of candidate paths to be generated. Therefore, if the value k is specified, the system then constructs and returns up to, but not more than, k minimum cost candidate paths. For this alternative, if the user had specified 3, or even 4, candidate path combinations P1.P3, P2.P3, and P1.P4 will be generated and returned.

⁵ For long query paths, CQL provides the user with the option of eliding the masked portions of the path (as in Fig. 8) from the natural language (NL) explanation, to facilitate understanding.

The NL explanation module consists of two components. The first is a sub-path transcriber, which transcribes each irreducible linear path selected for candidate path construction into pseudo-English. The second is a logical path synthesizer which combines the transcribed linear paths into a single system-explanation of the specified query. The context-free grammar of CQL's natural language explanation of queries is described in Backus-Naur form (BNF).

5.3. BNF for CQL natural language explainer

The basic, or canonical, form of the pseudo-NL explanation of a CQL query is formally expressed as:

CQL Pseudo-NL Canonical Structure ::= {search_verb, search_clause, search_predicate_clause, ⟨known_attribute_value, semantic_relationship_clause, join_condition⟩}
search_verb ::= {FIND}
search_clause ::= {target-attribute-comma-list}
search_predicate_clause ::= {SUCH THAT}
known_attribute_value ::= {source-attribute-comma-list}
semantic_relationship_clause ::= {semantics-predicate}
join_condition ::= {Join-attribute-expression}

Based on this definition the pattern of the NL explanation of a CQL query is:

FIND ⟨[target-attribute-comma-list and] target-attribute⟩ SUCH THAT ⟨semantics-predicate-semi-colon-list⟩ |, and ⟨[source-attribute-comma-list and] source-attribute⟩, and ⟨[[join-attribute-expression-semi-colon-list and] join-attribute-expression]⟩|

The symbols [ ] and |, respectively, represent multiple terms and optional expressions that may be missing from some queries. The algorithm that generates the pseudo-English explanations of queries can be found in [41] and is available upon request from the first author.
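To give a feel for the pattern, the Python sketch below assembles a FIND ... SUCH THAT sentence from lists of terms. It is only a reading of the pattern above, not the algorithm of [41], and its output is not claimed to match the wording of Fig. 10; the function names and the exact punctuation are assumptions.

```python
from typing import List, Sequence


def and_list(items: Sequence[str], separator: str) -> str:
    """Join items with the separator, placing 'and' before the last one."""
    items = list(items)
    if len(items) <= 1:
        return "".join(items)
    return separator.join(items[:-1]) + " and " + items[-1]


def explain(target_attrs: Sequence[str], predicates: Sequence[str],
            source_attrs: Sequence[str] = (), join_exprs: Sequence[str] = ()) -> str:
    """Build a pseudo-NL explanation; source and join parts are optional."""
    text = f"FIND {and_list(target_attrs, ', ')} SUCH THAT {'; '.join(predicates)}"
    if source_attrs:
        text += ", and " + and_list(source_attrs, ", ")
    if join_exprs:
        text += ", and " + and_list(join_exprs, "; ")
    return text


explanation = explain(
    ["C-name"],
    ["STD enrolled-in CR is-enrollment-for C", "T teaches SEC is_section_of C"],
    ["S-name = 'marshall'", "T-name = 'jones'", "T-title = 'associate professor'"],
)
print(explanation)
```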

Example. For the selected linear paths $p_1$ and $p_3$ of our example query, the pseudo-natural language transcription generated is shown in Fig. 10. This statement is returned to the user for validation.

Query validation. Validation of CQL query formulations involves checking for semantic equivalence, or consistency, between the original query statement and the system-generated explanations. Where the user believes that the system-generated query explanation is semantically consistent with the intended query, the user validates it. It is thereafter executed. Otherwise, it is modified as necessary before execution.

Fig. 9. Logical combination of linear paths into a solution path for Query 1.

5.4. Automatic generation of SQL

To execute a query whose NL explanation has been validated by the user, the conceptual answer is first translated into an SQL query (here again we are limiting the discussion to the prototype implementation of CQL on top of a relational DBMS). In the remainder of the section, we discuss and illustrate this translation process.

From the earlier discussion on the mapping of query paths to the underlying DB relations, it is seen that the edge connecting two entities on a path is equivalent to a join. Since the conceptual answer defines the solution path of the query, the derived relations and joins are automatically converted into an SQL query, which is then specified on the underlying physical DB. Algorithm Mapping gives the main conversion logic. The conversion pseudo-code can be found in [41]. ⁶ In essence the approach is as follows:
1. The target attributes are written to the SQL SELECT clause.
2. The following are written to the SQL FROM clause: (1) target tables, (2) source tables, and (3) each entity on the chosen candidate path.
3.1. If the underlying relational DBMS performs automatic joins, then the tables in the FROM clause are automatically joined by the system.
3.2. If the underlying DBMS does not perform automatic joins, the join conditions are read from the Join_Matrix and written to the SQL WHERE clause.
4. The selection criteria are read from the Selection-(Theta)-Operation matrix and also written to the SQL WHERE clause.

⁶ We have not included the pseudo-code here because of its length. However, it is available to interested readers upon request from the first author.

Fig. 10. CQL's natural language explanation of Query 1.

5.5. CQL-to-SQL conversion algorithm

Algorithm Mapping is the main translation program for mapping non-recursive and un-nested queries from CQL to SQL. Translations for special cases, such as nested queries, recursive queries, aggregation functions, clauses, other more complex multi-block queries, etc., are handled by other algorithms in [41] which call Algorithm Mapping as a special procedure.

Algorithm Mapping
Function: Map CQL query formulation to SQL
Input: Validated CQL query formulation. Entities on validated query-path.
Output: SQL query formulation

Algorithm.
Step 1. Pick all the target attributes from the target segment of the CQL formulation. Write them in the SELECT clause of SQL, separating them with commas (,).
Step 2. Pick all the target and source entities specified in the CQL formulation.
Step 3. Pick any entity specified in the semantics (relationship) conditions block of the CQL formulation, but not already picked in Step 2.
Step 4. Pick any entity not already picked in Steps 2 and 3, but contained in the query-path.
Step 5. Write all the entities picked in Steps 2-4 into the SQL FROM clause, separating them with commas.
Step 6. In the SQL WHERE clause, write each source attribute and its value. Enclose the attribute value in apostrophes (' . . . '). Separate the (source-attribute, attribute-value) pairs with an "And", i.e., form a conjunction.
Step 7. Check the selection conditions block of CQL. If any selection criteria are specified, write them into the SQL WHERE clause, separating this set from the set in Step 6 with an "And".
Step 8. If the underlying DBMS supports automatic joining of tables in the FROM clause, then do:
    Step 8.1. Write a semi-colon (;) after the last (source-attribute, attribute-value) pair in Step 7. Stop.
  Otherwise do:
    Step 8.2. Write an "And" after the last (source-attribute, attribute-value) pair in Step 7.
        Step 8.2.1. Extract all the join conditions from the query-path and write them after Step 8.2, separating the joins with an "And".
        Step 8.2.2. Terminate the last join condition with a semi-colon (;). Stop.
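The steps of Algorithm Mapping can be sketched in Python as below. This is a hedged illustration, not the pseudo-code of [41]: the function signature, the column names (`c_name`, `s_id`, and so on), the join conditions, and the use of COURSE_REGISTRATION (with an underscore, so that it is a plain SQL identifier) are all hypothetical, and the result is not claimed to match Fig. 11. Values are quoted naively for readability; a real system would use parameterized queries.

```python
from typing import List, Optional, Sequence, Tuple


def map_to_sql(target_attrs: Sequence[str],
               target_entities: Sequence[str],
               source_entities: Sequence[str],
               relationship_entities: Sequence[str],
               path_entities: Sequence[str],
               source_values: Sequence[Tuple[str, str]],
               join_conditions: Sequence[str],
               selection: Optional[str] = None,
               auto_join: bool = False) -> str:
    """Assemble a single-block SQL statement following Steps 1-8."""
    # Steps 2-5: collect entities once each, in the order they are picked.
    tables = list(dict.fromkeys(
        [*target_entities, *source_entities, *relationship_entities, *path_entities]))
    # Step 6: conjunction of source attribute = 'value' pairs.
    where: List[str] = [f"{attr} = '{value}'" for attr, value in source_values]
    # Step 7: optional selection criteria.
    if selection:
        where.append(selection)
    # Step 8.2: explicit join conditions when the DBMS does not join automatically.
    if not auto_join:
        where.extend(join_conditions)
    # Step 1 and final assembly.
    sql = f"SELECT {', '.join(target_attrs)} FROM {', '.join(tables)}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql + ";"


sql = map_to_sql(
    target_attrs=["COURSE.c_name"],
    target_entities=["COURSE"],
    source_entities=["STUDENT", "TEACHER"],
    relationship_entities=["STUDENT", "COURSE_REGISTRATION", "TEACHER", "SECTION"],
    path_entities=["STUDENT", "COURSE_REGISTRATION", "COURSE", "TEACHER", "SECTION"],
    source_values=[("STUDENT.s_name", "marshall"), ("TEACHER.t_name", "jones"),
                   ("TEACHER.t_title", "associate professor")],
    join_conditions=["STUDENT.s_id = COURSE_REGISTRATION.s_id",
                     "COURSE_REGISTRATION.c_id = COURSE.c_id",
                     "TEACHER.t_id = SECTION.t_id",
                     "SECTION.c_id = COURSE.c_id"],
)
print(sql)
```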

Example. The generated SQL statement corresponding to the semantics of the conceptual answer of our example Query 1 is shown in Fig. 11.


6. Expressive power of CQL

A major concern in query languages is their completeness, which is an aspect of expressive power, or simply expressiveness, which in turn is taken as the ability of the system to extract meaningful information from DBs [11]. In this section, the functional expressiveness of CQL is discussed. First, we discuss the aggregation functions, clauses, quantifiers and logical operators supported by CQL. Thereafter, we provide formal proofs of the power of CQL. Lastly, a discussion on the safety of CQL expressions is given.

6.1. Aggregations, clauses and quantifiers in CQL

CQL is mappable to different target database management systems (DBMSs). Therefore, CQL functions, clauses, quantifiers and logical operators are independent of any specific DBMS query language. This means that they must be tailorable to those of the underlying target DBMS query sub-language (as mentioned earlier, in the implementation discussed in this paper, CQL is tailored to an underlying relational system). Aggregation functions, clauses, quantifiers and logical operators are collectively referred to here as operators.

As illustrated in Fig. 12, two types of tailorable functions are provided to extend the power of CQL:
· Directly mappable operator functions (DMOFs). These are operators that are directly supported by the underlying DBMS sub-language. They can be used directly in CQL query formulation, and are directly mapped to the same operators in the underlying DBMS sub-language. Therefore, there is a one-to-one correspondence between DMOFs in CQL and the set of operators supported by the underlying DBMS.
· Indirectly mappable operator functions (IMOFs). Some of the operators supported by CQL and required by an application may not be directly supported by the target DBMS. A body of conversion codes is written to translate this class of operators to programs that are executable by the DBMS. IMOFs, therefore, provide enhanced functionality to the DBMS sub-language.

Table 1 shows the operators that are currently supported in CQL and the equivalent SQL DMOFs that they are mapped to. A CQL operator for which SQL lacks an equivalent operator is an IMOF with respect to SQL. SQL IMOFs are indicated by ")" in Table 1. As an example, the REPEAT(n) operator, which is an SQL IMOF, is used for recursive queries in CQL. The Match-n-In, Shunt and Mask operators are also IMOFs in SQL. These three operators are briefly explained in Appendix A.

Fig. 11. CQL generated SQL statement for Query 1.

The set operators currently supported in CQL are UNION, INTERSECTION, DIVIDE and CARTESIAN PRODUCT. Currently, the logical operators ELSE, AND, OR, NOT and EX-OR are supported in CQL. The interested reader is referred to [41] for a discussion on each of the CQL operators.

6.2. Proof of the expressive power of CQL

The expressiveness of query languages is usually gauged in terms of their relational completeness. A data manipulation language is said to be relationally complete if it is at least as expressive as relational algebra (or, equivalently, relational calculus) [17,21].

Fig. 12. Tailoring CQL operators.


In discussing the expressive power of CQL, it is noted that the class of queries computable by CQL is a superset of first-order queries. More formally, if Q(CQL) denotes the set of queries computable by CQL and Q(fo) is the set of first-order queries, then $Q(fo) \subseteq Q(\mathrm{CQL})$. To prove this, it is shown that $Q(\mathrm{CQL}) \supseteq \bigcup \{Q(fo), Q(u), Q(\mathrm{IMOFs})\}$. Q(u) is the set of queries involving the use of the universal quantifier and Q(IMOFs) is the class of CQL queries requiring the use of IMOFs with respect to SQL.

6.2.1. Claim on CQL completeness

CQL is more than first-order complete. To show this, it is first shown that CQL is relationally complete. Next, it is shown that CQL supports universal quantification, thus making it more functionally expressive than first-order completeness requires.

6.2.2. Relational completeness of CQL

The approach taken here is fashioned after that of Ullman [52] in proving the relational completeness of QUEL and QBE, and also that of Date [19]. In general, language L1 is L2 complete if we can express in L1 any query that can be expressed in L2. Where L1 is CQL and L2 is SQL, the proof reduces to showing that CQL is relationally complete. To prove the relational completeness (and hence the expressive power) of a query language, it suffices to show ``how to apply each of the five basic relational algebra operations and store the result in a new relation'' [52].

A first-order complete system is one whose class of computable queries contains the class of queries computable through the relational algebraic operators: Difference (diff), Union ($\cup$), Cartesian product ($\times$), Selection ($\sigma$) and Projection ($\Pi$) [52].

Claim. CQL is first-order complete, i.e., $Q(fo) = Q(-, \cup, \times, \sigma, \Pi) \subseteq Q(\mathrm{CQL})$.

Table 1
Mappable CQL user-defined operator functions

CQL operator    SQL operator
SUM             SUM
COUNT           COUNT
AVERAGE         AVERAGE
SORT-BY         ORDER-BY
GROUP-BY        GROUP-BY
BETWEEN         BETWEEN
BELONG-IN       EXISTS
FORALL          -
REPEAT(n)       -
IS-IN           IN
IS-HAVING       HAVING
IS-LIKE         LIKE
IS-n-OF         -
MATCH-n-IN      -
SHUNT           -
MASK            -

(A dash in the SQL column means that SQL has no equivalent operator, i.e., the CQL operator is an IMOF with respect to SQL.)


Proof. To prove this claim, it is required to show that, given any relational database (RDB) and any query in $Q(-, \cup, \times, \sigma, \Pi)$ specified on RDB, with result $A_{rb}$ on RDB, there exists an equivalent query Q expressed in terms of CQL, such that if $A_{CQL}$ is the answer of Q, then $A_{CQL} = A_{rb}$. This essentially says that the set of queries computable by CQL is a superset of the set of first-order queries.

Let $R := R(a_1, a_2, \ldots, a_n)$ and $S := S(b_1, b_2, \ldots, b_n)$, and let $A(n)$ and $B(n)$ denote the attributes of R and S, respectively. $R = \{t_{iR} \mid i = 1, 2, \ldots, m\}$, i.e., the set of tuples of R, and $S = \{t_{jS} \mid j = 1, 2, \ldots, k\}$, i.e., the set of tuples of S. $t_{iR} = (v_{i1}, v_{i2}, \ldots, v_{in})$ and $t_{jS} = (v_{j1}, v_{j2}, \ldots, v_{jn})$, where $v_{ix}$ and $v_{jx}$ are the values of $a_x$ and $b_x$, respectively.

6.2.2.1. Union operation ($\cup$). Assume $A_{rb} = (\cup)T_R = R \cup S$. (This presupposes that R and S are union compatible.)

$\Rightarrow (\cup)T_R = \{t_{iR}\} \cup \{t_{jS}\}$.

That is, $(\cup)T_R$ is a relation that includes all the tuples that are either in R or in S, or in both R and S.

In CQL:

$A_{CQL} = Q[\mathrm{target}(T);\ \mathrm{source}(R, S);\ -;\ \mathrm{Csel}(T = (R \cup_c S))]$,

where $\cup_c$ is the union operator supported in CQL.

Using the set-theoretic decomposition of Q, $A_{CQL} = Q_1[\mathrm{target}(T_1);\ \mathrm{source}(R);\ -;\ \mathrm{Csel}(T_1 = R)] \cup_c Q_2[\mathrm{target}(T_2);\ \mathrm{source}(S);\ -;\ \mathrm{Csel}(T_2 = S)]$, where $Q_1$ and $Q_2$ are subqueries of Q.

$\Rightarrow A_{CQL} = (T_1 = R) \cup_c (T_2 = S)$.

But $\cup_c \equiv \cup$, the relational union operator.

$\Rightarrow A_{CQL} = R \cup S = (\cup)T_R$.

6.2.2.2. Difference operation ($-$ or MINUS). Assume $A_{rb} = (-)T_R = R - S$ (= R MINUS S).

$\Rightarrow (-)T_R = \{t_{iR}\} - \{t_{jS}\}$.

$(-)T_R$ is a relation that includes all the tuples that are in R but not in S.

In CQL, $A_{CQL} = Q[\mathrm{target}(T);\ \mathrm{source}(R, S);\ -;\ \mathrm{Csel}(T = (R\ \mathrm{diff}\ S))]$, where diff is the difference operator in CQL.

Set decomposition gives $A_{CQL} = Q_1[\mathrm{target}(T_1);\ \mathrm{source}(R);\ -;\ \mathrm{Csel}(T_1 = R)]\ \mathrm{diff}\ Q_2[\mathrm{target}(T_2);\ \mathrm{source}(S);\ -;\ \mathrm{Csel}(T_2 = S)]$.

$\Rightarrow A_{CQL} = (T_1 = R)\ \mathrm{diff}\ (T_2 = S)$.

But diff $\equiv -$ (or MINUS), the relational difference operator.

$\Rightarrow A_{CQL} = R - S = (-)T_R$.

6.2.2.3. Selection operation ($\sigma$). As argued by Ullman [46], ``... all selections can be broken into simple selections of the form $\sigma_{X \theta Y}$''.

Assume $A_{rb} = (\sigma)T_R = \sigma_{A(I) \Theta V}(R)$. $(\sigma)T_R$ is a relation consisting of only the tuples of R where the condition $A(I) \Theta V$ evaluates to true. V is an attribute-value vector, $A(I)$ an attribute vector and $\Theta$ the theta operation vector.

If $t'_{iR}$ = tuples of R such that $A(I) \Theta V$ is false, and $R' = \{t_{iR}\} - \{t'_{iR}\}$, then $(\sigma)T_R = R'$.

In CQL, $A_{CQL} = Q[\mathrm{target}(T);\ \mathrm{source}(R);\ -;\ \mathrm{Csel}(T = (A(I) \Theta V))]$.

If $t''_{iR}$ = tuples of R such that $A(I) \Theta V$ is false and $R'' = \{t_{iR}\} - \{t''_{iR}\}$, then $T = R''$.

But $\{t''_{iR}\} = \{t'_{iR}\} \Rightarrow R'' = R'$ and $T = (\sigma)T_R$.

6.2.2.4. Projection operation ($\Pi$). Assume $A_{rb} = (\Pi)T_R = \Pi_{A(I)} R$, such that if $a'_I \in A(I)$, then $a'_I \in A(n)$ (recall that $A(n)$ denotes the attributes of R). $(\Pi)T_R$ is a relation whose intension consists only of attributes $A'(I) = (a'_1, a'_2, \ldots, a'_k)$, such that $A'(I) \subseteq A(I)$.

In CQL, $A_{CQL} = Q[\mathrm{target}(T(A'(I)));\ \mathrm{source}(R(A(I)));\ -;\ -] = Q[\mathrm{target}(T);\ \mathrm{source}(R(A(I)));\ -;\ \mathrm{Csel}(\mathrm{all}(A'(I)))]$. `all($A'(I)$)' picks all the sets of distinct values of $A'(I)$ and assigns them to T.

6.2.2.5. Cartesian product operation ($\times$). Assume $A_{rb} = (\times)T_{RS} = R \times S$. $(\times)T_{RS}$ is a relation whose intension consists of the concatenation of $A(n)$ and $B(m)$, where m may or may not be equal to n. Thus, if $A(\times)$ denotes the attributes of $(\times)T_{RS}$, then $A(\times)$ is of the form $A(n)B(m)$. That is, $A(\times) = (a_1, a_2, \ldots, a_n, b_1, b_2, \ldots, b_m)$. The relational product operation is equivalent to a join operation with no join restrictions. Thus, the extension of $(\times)T_{RS}$ is the set of all possible combinations of tuples from the two relations being operated on.

In CQL, $A_{CQL} = Q[\mathrm{target}(T(A));\ \mathrm{source}(R, S);\ -;\ \mathrm{Csel}(A(n)\ \mathrm{X.}\ B(m))]$. (Instead of ($A(n)$ X. $B(m)$), CQL allows the expression R X. S to be used.) CQL sets the intension of T to ($A(n)$ X. $B(m)$). But X. in CQL maps to the relational $\times$ operator. It follows that the extension of T is also the set of all possible combinations of tuples from R and S.

$\Rightarrow T = (\times)T_{RS}$.
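The five basic operations used in the proof can be made concrete with a small set-based sketch. This is an illustration of the relational algebra operations only, not of CQL itself; the representation of a relation as a pair (attribute names, set of tuples), the function names and the sample relations R and S are all assumptions introduced for the example.

```python
from itertools import product

# A relation is modelled here as (attribute names, set of value tuples).


def union(r, s):
    """R union S; both relations must be union compatible."""
    (attrs_r, rows_r), (attrs_s, rows_s) = r, s
    if len(attrs_r) != len(attrs_s):
        raise ValueError("R and S are not union compatible")
    return (attrs_r, rows_r | rows_s)


def difference(r, s):
    """R minus S: the tuples that are in R but not in S."""
    (attrs_r, rows_r), (attrs_s, rows_s) = r, s
    if len(attrs_r) != len(attrs_s):
        raise ValueError("R and S are not union compatible")
    return (attrs_r, rows_r - rows_s)


def select(r, predicate):
    """Keep only the tuples for which the condition evaluates to true."""
    attrs, rows = r
    return (attrs, {t for t in rows if predicate(dict(zip(attrs, t)))})


def project(r, keep):
    """Restrict the intension to the attributes in keep; duplicates collapse."""
    attrs, rows = r
    idx = [attrs.index(a) for a in keep]
    return (tuple(keep), {tuple(t[i] for i in idx) for t in rows})


def cartesian(r, s):
    """All combinations of tuples; the intension is the concatenation A(n)B(m)."""
    (attrs_r, rows_r), (attrs_s, rows_s) = r, s
    return (attrs_r + attrs_s, {tr + ts for tr, ts in product(rows_r, rows_s)})


# Hypothetical sample relations.
R = (("a1", "a2"), {(1, "x"), (2, "y")})
S = (("b1", "b2"), {(2, "y"), (3, "z")})
```

With these sample relations, `difference(R, S)` keeps only the tuple `(1, "x")`, and `cartesian(R, S)` has the four-attribute intension `("a1", "a2", "b1", "b2")` with one tuple per pairing of a tuple of R with a tuple of S.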

6.2.3. Universal quantification and safe expressions

The second part of the proof of the functional expressiveness of CQL deals with showing that CQL supports universal quantification and that its expressions are safe.

6.2.3.1. Universal quantification in CQL. The support for universal quantification in CQL is discussed and demonstrated here. The FORALL operator is used for universal quantification in CQL. We show that this operator is mappable to SQL.

The input structure for the CQL FORALL operator is:

$Q[\mathrm{target}(T(a_T));\ \mathrm{source}(S_i(a_i), S_u(a_u));\ -;\ \mathrm{Csel}(a_I = v \text{ and } a_u = {?} \text{ and } (a_T \text{ FORALL } \{a_u \mid \text{of } a_i \mid\}))]$.

The term ``| of $a_i$ |'' is optional.

The equivalent SQL statement generated by CQL is:

SELECT a(T)
FROM T
WHERE NOT EXISTS (SELECT *
    FROM S S1
    WHERE a(I) = v
    And NOT EXISTS (SELECT *
        FROM S S2
        WHERE S2.a(u) = S1.a(u) And S2.a(T) = T.a(T)));

The translation of the CQL FORALL operator to SQL code for universal quantification is achieved through Algorithm FORALL (Appendix C).

Example. Get supplier numbers for suppliers who supply at least all those parts supplied by supplier S2 (from [19]). The assumed relations are:

S(S#, Sname, Status, City)
P(P#, Pname, Color, Weight, City)
SP(S#, P#, QTY)

CQL Formulation:

Target attribute: ⟨S#⟩
Target: ⟨SP⟩
Source attribute: ⟨S#⟩
Source attribute value: ⟨S# = S2⟩
Source: ⟨SP⟩
Source attribute: ⟨P#⟩
Source: ⟨SP⟩
Selection Conditions: ⟨S# FORALL P# of S# = S2⟩

SQL Formulation (resulting from Algorithm FORALL):

SELECT S#
FROM SP
WHERE NOT EXISTS (SELECT *
    FROM SP SP1
    WHERE S# = 'S2'
    And NOT EXISTS (SELECT *
        FROM SP SP2
        WHERE SP.S# = SP2.S# And SP1.P# = SP2.P#));
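The double NOT EXISTS pattern of this example can be tried out on a small in-memory database. The following is a minimal sketch using Python's standard `sqlite3` module; the column names `sno` and `pno` (standing in for S# and P#), the sample shipments, the outer alias `SPX`, and the added DISTINCT and ORDER BY are assumptions made for the illustration and are not part of the statement generated by CQL.

```python
import sqlite3

# Hypothetical shipment data: S2 supplies P1 and P2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SP (sno TEXT, pno TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO SP VALUES (?, ?, ?)",
    [
        ("S1", "P1", 300), ("S1", "P2", 200), ("S1", "P3", 400),
        ("S2", "P1", 300), ("S2", "P2", 400),
        ("S3", "P2", 200),
        ("S4", "P2", 200), ("S4", "P4", 300),
    ],
)

# "There is no part supplied by S2 that this supplier does not also supply."
FORALL_SQL = """
SELECT DISTINCT SPX.sno
FROM SP SPX
WHERE NOT EXISTS (SELECT *
                  FROM SP SP1
                  WHERE SP1.sno = 'S2'
                  AND NOT EXISTS (SELECT *
                                  FROM SP SP2
                                  WHERE SPX.sno = SP2.sno
                                  AND SP1.pno = SP2.pno))
ORDER BY SPX.sno
"""

result = [row[0] for row in conn.execute(FORALL_SQL)]
```

On this sample data, `result` contains S1 and S2: both supply P1 and P2, whereas S3 and S4 each lack P1.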

From the foregoing, if Q(u) is the class of universally quantified queries and Q(CQL) the class of queries supported by CQL, then $Q(u) \subseteq Q(\mathrm{CQL})$.

6.2.3.2. Safety of CQL expressions. In the remainder of the section, the safety of CQL expressions is argued. According to [52], the main properties of safe formulas are:

(a) Every safe formula must be domain independent. This ensures that data is not materialized from outside the domain.
(b) It should be easy to tell, just by inspecting a formula, whether or not it is safe.
(c) The formulas that are expressible in real query languages based on relational calculus are safe.

These three properties provide a mechanism for ensuring the safety of CQL queries. In CQL, the following properties hold:

1. A formula or expression does not materialize or reference an infinite entity (or infinite table at the logical level).
2. A non-recursive mechanism references only a finite set of finite entities. This ensures a finite result. In this regard, it is noted that CQL recursions (using the ``REPEAT n'' operator) are transformed into non-recursive procedures through query-graph construction. This ensures that:
· a CQL query-graph is a finite graph,
· each entity on a query-graph or path is finite,
· each referenced entity on a query-graph is finite,
· CQL query results are materialized from a finite number of target entities, which are explicitly specified in the query formulation.
3. Every CQL formula is expressible in SQL. SQL is a real query language based on relational calculus (and algebra), and its expressions are safe.

Therefore, based on properties (a)-(c) above for safety, it can be concluded from properties (1)-(3) that CQL expressions are safe.

Additionally, let

F(SQL) = formula expressions for SQL,
F(CQL) = formula expressions for CQL,
$A \rightarrow B$ mean ``A is expressible in B''.

The claim that every CQL expression is expressible in SQL can be formally stated as $\{F_i(\mathrm{CQL}) \rightarrow F(\mathrm{SQL})\}\ \forall i$, which, in conjunction with property (c) for safety, implies the safety of CQL expressions.

Corollary. [$A \nrightarrow B$ means ``A is not expressible in B''.] If $\exists i : F_i(\mathrm{CQL}) \nrightarrow F(\mathrm{SQL})$, then CQL expressions cannot be guaranteed to be ``safe'', i.e., it is impossible to make a definite assertion as to the safety of CQL expressions. However, the discussion on mapping (that each segment in the target block, the source block, and the selection conditions block of CQL is mappable to an SQL term) and on the expressive power of CQL shows that $\neg \exists i : F_i(\mathrm{CQL}) \nrightarrow F(\mathrm{SQL})$. This also implies the safety of F(CQL).

7. Discussion and conclusion

Certainly, the concept-based approach to query formulation is not new. Indeed, the ORM community has explored this field extensively and proposed concept-based query languages for their modeling approaches. As already mentioned, ORM itself is a generic term for a concept-based approach to data modeling in which data is modeled only in terms of entities (or objects) and the semantic roles they play in relationships with other entities. No use is made of the concept of attributes in ORM. Because of its generic nature, there is not just a single ORM model, but a set of closely related versions, all of which adhere to the binary data modeling principle stipulated by ORM. Examples include NIAM [26,40,55,58] and the predicator set model (PSM) [48].

NIAM is a version of ORM that supports only binary relationship types. As a modeling approach, it is particularly useful as an analysis method that describes an information system in natural language. Starting from examples, which are partial descriptions of the information domain, the approach results in an information structure, or database schema. A formalization of NIAM was attempted in the predicator model (PM) [54] by extending it to allow for n-ary relationships.

A further extension of NIAM was achieved in PSM by extending PM to support advanced modeling constructs like sequences, sets, polymorphism, power types, schema types, generalization and specialization relationships. A motivation for this extension was to support complex objects, hypermedia and office automation applications. PSM is built around the concept of predicator, which is the connection between an object and a role. Relationships are then defined in terms of the association roles played by objects, i.e., a relationship is an association between predicators. In PSM, a relationship is viewed as a set of predicators.

ORM lends itself to different dimensions of database querying, one of which is the use of schema transformation in schema, and hence query, optimization. Different conceptual schemas of the same DB application can be mapped to different internal and logical schemas. This allows for the performance at the operational/internal level to be optimized by optimizing the conceptual schema. This requires a transformation of one conceptual schema onto another. The study in [28] proposes a formal approach to optimizing conceptual schemas by transforming a given conceptual schema onto a different but equivalent conceptual schema that exhibits a better operational efficiency at both the logical and internal schema levels. The study proposes an approach and a language based on the mix-fix notation. In essence, the approach takes as input a conceptual schema of a DB application and outputs another conceptual schema of the same application. The output conceptual schema is an ``optimized'' version of the input schema in the sense that it leads to more efficient logical and internal schemas, which in turn result in better operational characteristics.

Both initial and optimized schemas are, however, ORM schemas. This means that they are semantic schemas. They can therefore be used as the underlying conceptual schemas for CQL. We note that any conceptual schema that is expressible in mix-fix notations of entity-types (or object-types) and semantic association roles played by the entities (or objects) can form the basis for and, therefore, support CQL queries. To illustrate this argument, we note that an ORM schema can be defined as the tuple ⟨E, R⟩, where

E ::= set of entity-types,
R ::= set of semantic roles played by members of E.

An ORM schema can therefore be expressed in the mix-fix notation:

$\mathrm{mFix}([R], [E]) = \mathrm{mFix}([r_1, r_2, \ldots, r_n], [e_1, e_2, \ldots, e_m])$.

This is precisely the notation for the path expressions of CQL queries.

Fundamentally, the motivation behind the development of concept-based query languages is the same as for natural language query languages, namely, to provide users with query languages that are naturally close to users. This necessitates, on the one hand, that the languages are mathematically sound and unambiguous, and on the other hand that they are as natural as possible and, hence, easy to use. To the best of our knowledge, the reference and idea language (RIDL) [20,38] was the first concept-based language to aim at these goals. RIDL was a semi-natural language query language that was developed for NIAM. The language, however, suffered from certain drawbacks, which included the lack of a formal definition and of a sound syntactic and semantic basis. Additionally, it was based on the initial but restricted binary version of NIAM. For these reasons, RIDL did not meet with widespread acceptance [49,51].

The general approach to querying in the newer family of ORM-based query languages is illustrated by LISA-D [49,51], Conquer [8] and Conquer-II [5]. LISA-D is essentially a redesign and extension of RIDL to make it formally sounder and stronger. For this reason, instead of basing it on NIAM, it is based on PSM, which, as mentioned above, is itself an extension of NIAM. LISA-D queries are formulated using information descriptors. This is because querying in LISA-D is founded on the information descriptor syntactical category. Information descriptors characterize and facilitate the disclosure of information objects in an information-base [50], which in the context of database queries would constitute the database population. An information descriptor is specified as $\mathcal{D}$: information descriptor $\times$ ENV $\rightarrow$ PE, where ENV is the environment of the database, as determined by the database population, and PE is a path expression. According to this notation, an information descriptor in a given environment maps to a specific path expression. A query path in LISA-D is therefore expressed as a concatenation of information descriptors. Indeed, path expressions in LISA-D can be verbalized via the verbalization function $\mathcal{D}$, such that if $D$ is an information descriptor, then $\mathcal{D}[[D]]$ is equivalent to a path expression. If P is a path expression and $\circ$ denotes the concatenation operator, then $P = \mathcal{D}[[D_1 D_2 D_3]]$ can be expressed as $\mathcal{D}[[D_1]] \circ \mathcal{D}[[D_2]] \circ \mathcal{D}[[D_3]]$. For example, if $\mathcal{D}[[D_1]]$, $\mathcal{D}[[D_2]]$ and $\mathcal{D}[[D_3]]$ define the atomic information descriptors President, born-in and State, respectively, then $P = \mathcal{D}[[D_1 D_2 D_3]] = \mathcal{D}[[D_1]] \circ \mathcal{D}[[D_2]] \circ \mathcal{D}[[D_3]]$ = President born-in State. This expression corresponds to the path connecting schema entities President and State via the semantic role born-in.

Specified queries are matched against the characterization of information objects, i.e., against information descriptors. A LISA-D query has the general format LIST $p_1, p_2, \ldots, p_n$, P, where $p_1, p_2, \ldots, p_n$ are predicators whose values are to be evaluated on path P. For our example, this query specifies the evaluation of $p_1, p_2, \ldots, p_n$ on the expressed path P given by ``President born-in State''.

In terms of the CQL notation used in [42], this query can be expressed as LIST($p_1, p_2, \ldots, p_n$) $\downarrow$ P, where $\downarrow$ is the submersion operator used to suppress the predicates to be evaluated. Once $p_1, p_2, \ldots, p_n$ are suppressed, P remains. CQL's mix-fix expression for P then becomes explicitly clear: P = mix-fix([born-in], [President, State]) = President born-in State. This CQL path expression precisely coincides with the LISA-D query path. LISA-D is expressively very powerful, but technically not suited for end-users [8].
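The way a mix-fix expression verbalizes a path can be sketched in a few lines. The function below is hypothetical and assumes the simplest reading suggested by the President/State example, namely that a linear path over m entities carries m - 1 roles which are interleaved between consecutive entities; the paper's general notation does not state this restriction.

```python
def mix_fix(roles, entities):
    """Interleave roles between consecutive entities of a linear path.

    Assumes len(entities) == len(roles) + 1, as in the President/State example.
    """
    if len(entities) != len(roles) + 1:
        raise ValueError("a linear path over m entities needs m - 1 roles")
    parts = [entities[0]]
    for role, entity in zip(roles, entities[1:]):
        parts.extend([role, entity])
    return " ".join(parts)


path = mix_fix(["born-in"], ["President", "State"])
```

Here `path` is the string `"President born-in State"`, the same verbalization as the path expression discussed above.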

Conquer [8] and Conquer-II [5] are also concept-based query languages based on ORM. Queries are formulated as paths through an information space that is represented as schemas modeled in ORM. Query predicates are represented as semantic role sequences that can be expressed in mix-fix form. Queries can be expressed as outline queries, schema trees, or text. The commercially implemented versions of these languages require queries to be entered in outline form through the drag-and-drop approach discussed earlier in connection with Conquer-II. Textual verbalizations of expressed queries can be generated automatically. Queries consist of entities and predicates. (When necessary, attributes are introduced only as derived concepts.) Both linear and non-linear queries, i.e., tree-shaped queries, are expressed as sequences of conceptual joins and conceptual operations forming a series of conceptual paths through ORM schemas. Therefore, ORM queries can be readily verbalized as mix-fix statements, as illustrated by the following two Conquer/Conquer-II queries: (1) Employee lives in city and city is location of Branch. (2) Employee has salary > 90000 and Either speaks Language x Or drives Car y. The implied mix-fix notation can be clearly seen from these query expressions.

From the foregoing discussion, it can be seen that CQL can be used with an ORM schema and, therefore, act as another ORM query language. Where the conceptual schema is an E-R schema, or a variant of it, CQL is being used as an ER model query language. However, what distinguishes CQL from other concept-based query languages is that it is an abbreviated concept-based query language. As discussed earlier in the paper, this means that, unlike currently existing concept-based query languages, the entire query path does not have to be specified by the user.

An area in which the concept-based approach can benefit is in the incorporation of intelligent tools and techniques into query systems (a good coverage of the topic as it relates to multimedia communication interfaces can be found in [30]). In intelligent query answering, the intent of a query is analyzed to provide generalized, neighborhood, or associated information relevant to the query [12,29,36]. An approach adopted in the more recent studies is to exploit the rich semantic information of knowledge-rich DBs to determine the intent of queries. Query intent analysis can be performed on query statements that are not well formulated or are difficult to interpret, in order to clarify the intent of the user. Once the intent is determined, the query can be restated either automatically or cooperatively, with the help of the user, in a form that is easily interpreted. Advances in this field can be applied to facilitate the formulation of abbreviated concept-based queries. For example, we are currently investigating how to apply this to resolving ambiguous queries and queries with missing information in CQL.

Intelligent query answering systems can also be used to provide sensible explanations of posed queries. The problem has been extensively studied in the context of designing intelligent multimedia explanations for paraphrasing and communication systems [22,36]. CQL provides this additional support to allow users to validate the system-explanations of their queries. The difficulty here is in avoiding too many or superfluous explanations. We deal with this problem in this study by returning the explanation of only the shortest query path to the user.

Intelligent approaches can also be used to provide computer-aided query formulation systems to facilitate user formulation of abbreviated concept-based queries. This is the more common application of intelligent query answering tools in natural language query systems. In those systems where it is provided, e.g., [50], the approach is usually assistive, with the user interacting with the system to incrementally formulate the query. This usually takes the form of the user responding to prompts and cues from the system. In the natural language extension of CQL reported in [42], the user is presented with the information content of the DB. Further help is provided in the form of sample queries that can be used as is or modified and used. To the best of our knowledge, no other concept-based query language provides this extended level of assistance for query formulation.


While our overriding motivation for the study was to reduce the cognitive load imposed on users in formulating queries, we do not expect that users will be completely devoid of all knowledge about databases. We, therefore, presume some, but not in-depth, familiarity with the concepts of entities, attributes, and relationships. These terms can easily be replaced with less arcane terms during actual production use. For example, ``entity'' can be replaced by terms such as ``real-world object'', ``data type'', etc. ``Attribute'' can be substituted with ``data item'', ``data field'', ``data column'', etc. Furthermore, in production use, CQL can be augmented with a facility that provides on-line explanations and examples of these terms. CQL does not require users to be familiar with the structure and organization of the application database, but only with the content. Even on this latter demand, we provide help through the query formulation surrogate system.

In summary, the formal basis of CQL was presented in this paper. Like other concept-based query languages, CQL allows users to specify queries directly against conceptual schemas of database applications, using concepts and constructs that are native to and exist on the schemas. However, unlike other existing concept-based query languages, CQL queries are abbreviated. Hence CQL is an abbreviated concept-based query language.

CQL is designed for ease-of-use and, thereby, aimed at reducing the cognitive burden faced by database end-users. To aid end-users in formulating queries, CQL is provided with a computer-assisted query formulation system. CQL is founded on strong set- and graph-theoretic principles. We demonstrated that it is more than first-order complete. In combining ease-of-use with expressive power, it overcomes the common weakness in concept-based query languages, i.e., that of being less than relationally complete. A prototype of CQL has been implemented as a front-end to a relational database manager.

A contribution of this study is the use of the semantic roles played by entities in their associations with other entities to support abbreviated conceptual queries. An advantage that accrues from this main contribution is the use of relationship semantics of data models to free the user from dealing with the syntactic complexity of query formulation. Additional advantages include the use of the roles played by entities in relationships in developing semantic graphs of conceptual queries, the use of these roles in developing pseudo-natural language explanations of queries, and the use of system-constructed semantic graphs to aid the automatic generation of SQL. The study was limited to querying a single database; database updating was not addressed.

In future, we would like to extend this study to deal with query ambiguity and incompleteness. Missing information in a database can occur where the DB is based on the open world assumption (OWA) [39]. OWA allows a DB system to have incomplete knowledge. This implies that there may be some true propositions about the universe of discourse which are neither stored nor derivable by the system. In CQL, missing information is of two types: incompleteness and ambiguity. Both are defined with respect to the ability of the system to extract data.

We also plan to extend CQL to support multi-dimensional queries. On-line analytical processing is based on the multi-dimensional modeling of business. In our view, it should be possible to extend CQL in a straightforward manner for querying multi-dimensional (or decision support) databases. A study of this is already in progress.

An extension of CQL to heterogeneous and distributed databases is also slated for the future. Database querying in heterogeneous and distributed environments, such as the World Wide Web, requires knowledge of the exact location of data in the system, the organization of data, and the access protocols for each unit in the distributed system. Added to this, the user must know the various query languages used by each of the units. This makes DB querying in such environments time consuming, difficult and inefficient. A more user-friendly and easy-to-use approach is called for. Preliminary investigation suggests that CQL can be extended to distributed database environments to ease and facilitate query formulation and processing.

Extension of CQL to very large database systems is also planned. In very large databases (VLDBs), non-contiguous fragments of the DB schema may reside contemporaneously in the system. This introduces additional complexities to query formulation and processing. The CQL approach can be used to define a higher level ``virtual'' schema on the VLDB system. Queries can then be specified against the virtual schema using CQL or an extended version of CQL.

Appendix A. Definition of formal symbols

|     logical OR
,     logical AND
{}    a set of components
::=   comprises or consists of
||    concatenation

The general syntax is: ⟨left_side_of_formula⟩ comprises {⟨right_side_of_formula⟩}.

Appendix B. Algorithm prune-first

Function: Generates the candidate paths between sources and targets.
Input: CQL query, Q(S, T), of sources (S) and targets (T); connectivity matrix (C-matrix); logical adjacency matrix (LAM).
Output: All candidate paths between sources and targets.

The following definitions are used:

E: a set of schema entities
E(x): entity type x in E

Algorithm.
Step 0: Let S = E(s) and T = E(t).
Step 1: Read C-matrix for v(s, t), the E(s)/E(t) cell value of the LAM. If v(s, t) = 0, the schema is disconnected. STOP. Else (if $v(s, t) \neq 0$) continue.
Step 2: For each successor of a source, determine if the schema semantics of its link to the source ``is consistent'' with the query semantics. {* This is achieved by comparing the relationship semantics between the pair of entities on the schema with the specified semantics in the semantics relationship condition in the CQL query formulation. The relationship semantics between schema entities are defined in the link semantic dictionary (defined earlier). *}
    Step 2.1: If it is, retain that link.
    Step 2.2: If it is not, delete that link and that successor from the successor list of the source. (These links are deleted from the paths leading away from the source.)
Step 3: For each predecessor of a target, determine if the schema semantics of its link to the target ``is consistent'' with the query semantics.
    Step 3.1: If it is, retain that link.
    Step 3.2: If it is not, delete that link and that predecessor from the predecessor list of the target. (These links are deleted from the paths leading to the target.)
Step 4: Scan the resulting successor set of E(s) from the LAM.
    Step 4.1: If $E(t) \in$ {E(s)-succ.set}, then pick E(t) and connect E(s) directly to E(t). [Query Path (QP) = {E(s), E(t)} = $E(s) \rightarrow E(t)$] {* X-succ.set is defined as the set of schema entities succeeding, i.e., adjacent to, X. *}
    Step 4.2: Pick each E(j) in turn, where $E(j) \in$ {E(s)-succ.set}. Connect E(s) to E(j). [QP = {⟨{E(s), E(t)} = $E(s) \rightarrow E(t)$⟩, ⟨{E(s), E(j)} = $E(s) \rightarrow E(j)$⟩}]
        Step 4.2.1: Set j = s and go to Step 1, and skip Step 2.
        Step 4.2.2: Repeat Step 4.2.1 until E(j) = T.
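The pruning idea of Steps 2 to 4 can be sketched as follows. This is not the matrix-based procedure above: for brevity it replaces the C-matrix/LAM iteration with a depth-first enumeration of simple paths, and the names `candidate_paths`, `links` and `consistent`, as well as the sample schema, are hypothetical. The `consistent` callback stands in for the comparison against the link semantic dictionary.

```python
def candidate_paths(links, source, target, consistent):
    """Enumerate simple paths from source to target after semantic pruning.

    links: dict mapping an entity to a list of (role, successor) pairs.
    consistent(a, role, b): True if the link a -role-> b agrees with the
    query semantics (a stand-in for the link semantic dictionary lookup).
    """

    def allowed(a, role, b):
        # Steps 2 and 3: only links leaving the source or entering the
        # target are checked against the query semantics.
        if a == source or b == target:
            return consistent(a, role, b)
        return True

    paths = []

    def extend(path):
        here = path[-1]
        if here == target:
            paths.append(path)
            return
        for role, nxt in links.get(here, []):
            if nxt not in path and allowed(here, role, nxt):
                extend(path + [nxt])

    extend([source])
    return paths  # empty when source and target are disconnected (Step 1)


# Hypothetical schema fragment.
links = {
    "President": [("born-in", "State"), ("won", "Election")],
    "Election": [("held-in", "State")],
}
all_paths = candidate_paths(links, "President", "State", lambda a, r, b: True)
pruned = candidate_paths(links, "President", "State", lambda a, r, b: r != "won")
```

Without pruning, both the direct path and the path through Election are candidates; when the query semantics rule out the `won` link at the source, only the direct President to State path remains.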

Appendix C. Algorithm FORALL

Function: Maps the CQL FORALL operator to SQL code for universal quantification.
Input: CQL query formulation.
Output: SQL statement for universal quantification operation.

Algorithm.
Step 1: If an uninstantiated source attribute is specified, mark it in all its occurrences for selection in a nested SQL block.
Step 2: Map the target segment of the CQL query to SQL as in the un-nested case:
    Step 2.1: If a FORALL quantifier is specified in the CQL selection condition do:
        Step 2.1.1: In the WHERE clause write NOT EXISTS.
        Step 2.1.2: Select an asterisk (*) in a second SQL SELECT-FROM-WHERE block.
            Step 2.1.2.1: In the FROM clause write the specified source and a first alias of the source. Separate both with a space.
            Step 2.1.2.2: In the WHERE clause: (1) Write all the instantiated source attributes, as in the un-nested case (i.e., Algorithm Mapping), if any is specified. (2) Form a conjunction of (1) with ``NOT EXISTS''.
                Step 2.1.2.2.1: Select an asterisk (*) in a third SELECT-FROM-WHERE block.
                    Step 2.1.2.2.1.1: In the FROM clause write the specified source and a second alias of this source. Separate both with a space.
                    Step 2.1.2.2.1.2: In the WHERE clause: (1) Join the first and second aliases of the specified source in Step 2.1.2.1 and Step 2.1.2.2.1.1 on the marked universally quantified source attribute (i.e., S2.a(u) = S1.a(u)). (2) Join the second alias of the specified source in Step 2.1.2.2.1.1 on the set of specified target attributes (i.e., S2.a(T) = T.a(T)). (3) Form a conjunction of (1) and (2).
                Step 2.1.2.2.2: Enclose the SQL code generated in Step 2.1.2.2.1 in parentheses.
        Step 2.1.3: Enclose the SQL code generated in Step 2.1.2 in parentheses.
Step 3: Terminate the entire code generated in Step 2 with a semi-colon (;).
Step 4: Stop.
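The nesting produced by Steps 2 and 3 can be sketched as a small string-building function. This is only an illustration under simplifying assumptions: a single target attribute, a single instantiated source attribute and a single universally quantified attribute. The function name `forall_to_sql` and its parameters are hypothetical, and values are interpolated directly into the text, which is acceptable for a sketch but not for production use.

```python
def forall_to_sql(target_attr, target, source, cond_attr, cond_value, quantified_attr):
    """Build the nested NOT EXISTS statement following Steps 2 and 3."""
    first, second = source + "1", source + "2"  # first and second aliases

    # Step 2.1.2.2.1: third block, joining the two aliases on the quantified
    # attribute and the second alias to the target on the target attribute.
    inner = (
        f"SELECT * FROM {source} {second} "
        f"WHERE {second}.{quantified_attr} = {first}.{quantified_attr} "
        f"AND {second}.{target_attr} = {target}.{target_attr}"
    )
    # Step 2.1.2: second block, the instantiated attribute AND NOT EXISTS (...).
    middle = (
        f"SELECT * FROM {source} {first} "
        f"WHERE {first}.{cond_attr} = '{cond_value}' "
        f"AND NOT EXISTS ({inner})"
    )
    # Steps 2, 2.1.1, 2.1.3 and 3: target segment, NOT EXISTS, parentheses, semi-colon.
    return f"SELECT {target_attr} FROM {target} WHERE NOT EXISTS ({middle});"


sql = forall_to_sql("sno", "SP", "SP", "sno", "S2", "pno")
```

For the supplier example of Section 6.2.3.1 (with `sno` and `pno` standing in for S# and P#), `sql` has the same three-level SELECT structure as the SQL formulation given there.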

Appendix D. Special problems with peculiar cases

Match-n-in operator. For the ®rst n or less values in a list or set that are matched by values in adatabase, the matching values are returned. This operator is expressed as: ha(t) Match-n-IN Listi,where n is a positive integer and ``List'' is a set or list derived from a sub-query or directly speci®edin the query formulation.

Shunt operator. In some applications, the schema may be disconnected into semantically dis-joint sets. This type of disconnection may result from a lack of meaningful semantics associatingany entity in one set with any entity in the other set. Although the schema is semantically dis-connected, an implicit connection may exist through foreign key or common domain key rela-tionships. It should, therefore, still be possible to query the database with source in one set andtarget in the other.

Shunt�E�i�;E�j�� is an operation instructing the system to forcibly join E�i� and E�j� using thespeci®ed or identi®ed join keys. For any arbitrary E�i� and E�j�, Shunt�E�i�;E�j�� establishes adirect connection between E�i� and E�j�, while shunting any intermediate entities between the two.This approach can be used where the two disjoint sets represent two di�erent schemas.

Mask operator. For very long query-paths, the path-meaning derived through semantic links may be either lost or meaningless. In this case, a path exists between source and target, but because of its length the user considers that the NL transcription of the entire path would be either awkward or of no value. Here the intermediate entities between source and target, or a subset of the intermediates, can be masked, i.e., excluded, from the NL transcription using the Mask(E(i), E(t)) operation, where the segment of the path between E(i) and E(t) is to be masked.

To mask the segment of a path, the end-user only specifies the two end points of the segment. The net effect of this operation is that the system excludes the segment from the validation NL statement returned to the end-user, but retains the corresponding tables and joins in the generated SQL statement.
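The two-sided effect of masking can be sketched as follows, assuming a query path is held as an ordered list of entity names and that the two end points themselves remain visible while the entities strictly between them are hidden. The function `mask` and the example path are hypothetical.

```python
def mask(path, start, end):
    """Split a query path into the entities shown in the NL statement and those used in SQL."""
    i, t = path.index(start), path.index(end)
    if i >= t:
        raise ValueError("start must come before end on the path")
    # NL transcription: drop the entities strictly between the two end points.
    nl_entities = path[: i + 1] + path[t:]
    # SQL generation: every table on the path, and hence every join, is retained.
    sql_tables = list(path)
    return nl_entities, sql_tables


path = ["Customer", "Order", "Shipment", "Carrier", "Depot"]
nl_entities, sql_tables = mask(path, "Order", "Depot")
print(nl_entities)
print(sql_tables)
```

Here `Shipment` and `Carrier` are left out of the validation statement, while all five tables still take part in the generated SQL.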


Vesper Owei holds a master's degree in electrical and electronic engineering and in operations research from the Georgia Institute of Technology (Georgia Tech), Atlanta, Georgia, USA. Owei also holds a Ph.D. from Georgia Tech. He has practised as a project, design and consulting engineer. His current research interests include data management, data modeling, concept-based query languages, conceptual interfaces, knowledge systems, data warehousing, OLAP, data mining, web-based database application development, end-user interfaces for e-commerce and e-business, information systems architectures and framework for the disabled and information technology for healthcare delivery.

Shamkant B. Navathe is a professor and the head of the database research group at the College of Computing, Georgia Institute of Technology, Atlanta. He has been active in a variety of database areas including database modeling, database conversion, database design, distributed database allocation, and database integration. He has worked with IBM and Siemens in their research divisions and has been a consultant to various companies including Digital, CCA, HP and Equifax. He was the General Co-chairman of the 1996 International VLDB (Very Large Data Base) conference in Bombay, India. He was also program co-chair of ACM SIGMOD 1985 International Conference and General Co-chair of the IFIP WG 2.6 Data Semantics Workshop in 1995. He has been an associate editor of ACM Computing Surveys, and IEEE Transactions on Knowledge and Data Engineering. He is also on the editorial boards of Information Systems (Pergamon Press) and Distributed and Parallel Databases (Kluwer Academic Publishers). He is an author of the book, Fundamentals of Database Systems, with R. Elmasri (Addison-Wesley, Edition 3), currently the leading database textbook worldwide. He also co-authored the book "Conceptual Design: An Entity Relationship Approach" (Addison-Wesley, 1992) with Carlo Batini and Stefano Ceri. His current research interests include human genome data management, engineering data management, intelligent information retrieval, data mining algorithms, e-commerce applications and mobile database synchronization. Navathe holds a Ph.D. from the University of Michigan and has over 100 refereed publications.

