
An abbreviated concept-based query language and its exploratory evaluation

Vesper Owei a,*, Shamkant B. Navathe b,1, Hyeun-Suk Rhee c,2

a Division of Management Information Systems, University of Oklahoma, 307 West Brooks, Room 306, Norman, OK 73019-4007, USA

b College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA

c School of Management, University of Texas at Dallas, Richardson, TX 75083, USA

Received 15 September 1999; received in revised form 23 March 2000; accepted 8 August 2001

Abstract

Research on the use of conceptual information in database queries has primarily focused on semantic query optimization. Studies on the important aspects of conceptual query formulation are currently not as extensive. Only a relatively small number of works exist in this area. The existing concept-based query languages are similar in the sense that they require the user to specify the entire query path in formulating a query. In this study, we present the Conceptual Query Language (CQL), which does not require entire query paths to be specified but only their terminal points. CQL is an abbreviated concept-based query language that allows for the conceptual abstraction of database queries and exploits the rich semantics of semantic data models to ease and facilitate query formulation. CQL was developed with the aim of providing typical end-users like secretaries and administrators an easy-to-use database query interface for querying and report generation. A CQL prototype has been implemented and currently runs as a front-end to an underlying relational DBMS. A statistical experiment conducted to probe end-users’ reaction to using CQL vis-à-vis SQL as a database query language indicates that end-users perform better with CQL and have a better perception of it than of SQL. This paper discusses the design of CQL, the strategies for CQL query processing, and the comparative study between CQL and SQL.

© 2001 Published by Elsevier Science Inc.

1. Introduction

According to a recent study, end-user computing is growing at the rate of 50–90% per year (Cronan and Douglas, 1990). This rapid increase raises concerns about the suitability of database (DB) query languages (DBQLs) for present-day end-users, who are typically non-expert DB users. Query tools that depend on users’ programming skill for their effective and efficient use impose a cognitive burden that may diminish users’ productivity. This underscores the need for DBQLs that are matched to the limited ability of end-users. A rethinking of DBQL design is therefore called for. We think it is essential that DBQLs be adapted to the user and not the user to DBQLs. This requires that DBQLs use concepts that are as close as possible to those in the user’s cognitive mental model and adopt interface techniques that are suited to users’ abilities. Because conceptual schemas represent users’ real world view and the mental model of their application universe, concept-based approaches to DB querying tend to support the direct use of concepts and constructs on conceptual schemas which are either the same or similar to those in users’ mental model. Therefore, concept-based DB querying naturally tends to fit the skills and ability of typical end-users. This has led to research into concept-based DBQLs.

Research on the use of conceptual information in DB queries has, however, mainly focused on semantic query optimization, i.e., on the use of semantic information to reformulate a query more efficiently into a different but semantically equivalent form that yields correct answers (Pittges, 1995a,b). A good discussion on research efforts in semantic query optimization can be found in Pittges (1995a) and Pittges et al. (1995). Studies on the important aspects of conceptual query formulation are currently not as extensive. A few recent works in this area can be found in Bloesch and Halpin (1996, 1997), Catarci and Santucci (1988), Chan (1989), Chan et al. (1993), Chang and Sciore (1992), Halpin and Proper (1995a), Owei (1994), Owei and Higa (1994), Owei et al. (1997a) and Siau et al. (1995). In one respect, the existing concept-based query languages are similar, i.e., they require the user to specify the entire query path in formulating a query. In this study, we present the Conceptual Query Language (CQL), which does not require entire query paths to be specified but only their terminal points. CQL was developed with the aim of providing an easy-to-use DB query interface for typical end-users. CQL makes minimal demands on end-users’ cognitive knowledge of DB technology. To the best of our knowledge, there are no other concept-based query languages that employ abbreviated querying as in CQL.

The Journal of Systems and Software 63 (2002) 45–67

www.elsevier.com/locate/jss

* Current address: The George Washington University, Management Science Department (Information Systems), Monroe Hall, 2115 G Street, N.W., Washington, DC 20052, USA. Tel.: +1-202-994-4364.

E-mail addresses: [email protected] (V. Owei), [email protected]. edu (S.B. Navathe), [email protected] (H.-S. Rhee).

1 Tel.: +404-894-0537.

2 Tel.: +972-883-4459.

0164-1212/01/$ - see front matter © 2001 Published by Elsevier Science Inc.

PII: S0164-1212(01)00139-X

Because SQL has come to be taken as the de facto standard for relational languages, query languages that run on an underlying relational database management system (DBMS) tend to be benchmarked against SQL. Experimental studies have therefore been conducted to attempt to establish the superiority of existing concept-based query languages over SQL (Bell and Rowe, 1992; Chan et al., 1998, 1993; Siau et al., 1995, for example). In this paper, a relative evaluation of CQL against SQL is reported. We conducted a statistical experiment to probe end-users’ reaction to using CQL, vis-à-vis SQL, as a database query language. The comparison focused on the effect of the two different database query language interfaces on user performance (as measured by query formulation time, query correctness, and users’ perception) in a query writing task with varying difficulty levels. Statistically significant differences between the two query languages were found.

The results indicate that end-users perform better with CQL and have a better perception of it than of SQL. There were significantly more accurate formulations with CQL than with SQL. Also, the groups with CQL took significantly less time than the groups with SQL. The CQL subjects perceived their query language to be easier to use than their SQL counterparts felt about SQL; they also felt more satisfied with CQL than the SQL subjects were with SQL. These differences were more pronounced when query-difficulty level was considered. The statistical significance of the differences increased with the complexity of the query. The scores indicate that users are more likely to perform better with CQL than with SQL, and that they are more likely to harbor a more favorable perception of it than of SQL. The Cronbach alpha values for the user perception factors ranged from 0.80 to 0.93, well above the acceptable level of 0.70, considered to be adequate for behavioral research.

1.1. Focus and contribution

In developing CQL, the goal is conceptual query formulation, particularly the use of semantic information on data models to make query formulation intuitive for end-users. The CQL approach allows for the conceptual abstraction of DB queries and exploits the rich semantics of DB schemas. The design of CQL has led to the following contributions:

• use of relationship semantics of data models to alleviate or free the user from dealing with syntactic complexity of query formulation in current query languages;

• use of the roles played by entities in relationships in developing semantic graphs of conceptual queries;

• use of the roles played by entities in relationships in developing pseudo-natural language explanations of queries;

• use of system-constructed semantic graphs to aid the automatic generation of SQL.

The evaluation of CQL shows that end-users are likely to perform better with abbreviated concept-based query languages and to have a better perception of them than of SQL. The result of the experiment is pertinent to the current practice in database development and use. The current approach prescribes three steps: conceptual schema design, logical schema design, and logical query. The results here suggest that a transition to a two-step approach should be adopted: conceptual schema design and abbreviated conceptual query. This is consistent with the recommendation in Chan et al. (1998). We note, however, that since there are no other existing empirical studies on abbreviated concept-based query languages, the experimental study here should be taken only as a seminal, initial exploratory one.

In this study, we discuss the design of CQL, the strategies for CQL query processing, and the comparative study between CQL and SQL.

The rest of the paper is organized as follows: In Section 2, we perform a literature review of concept-based query languages and empirical studies involving concept-based query languages. We discuss the necessity for concept-based query languages in Section 3. Section 4 presents CQL. The syntax of CQL and query formulation in CQL are examined. The strategy for dealing with iterative concept-based queries in CQL is discussed in Section 5. The comparative study on CQL and SQL is reported in Section 6. In Section 7, the results of the experiment are reported. We devote Section 8 to a discussion on the results of the experiment. The paper concludes in Section 9.


2. Related works

The main goal in developing concept-based query languages is to provide end-users with high-level, easy-to-use, and user-friendly interfaces for data manipulation. As far as we are aware, the universal relation (UR) interface (Maier et al., 1986; Maier and Ullman, 1983) was one of the earliest efforts in that direction. An examination of existing database query languages seems to indicate a continuing trend in this direction. This trend has mainly resulted in the emergence of conceptual DBQLs that employ concepts on semantic schemas to aid query formulation. We examine only a sample set of existing concept-based query languages in this section. We also discuss a set of comparative experimental studies involving concept-based query languages.

2.1. Enabling data manipulation through semantic paths on conceptual schemas

Chang and Sciore (1992) propose the Universal Relation with Semantic Abstraction (URSA) model, which is an extension of the UR interface. Instead of demanding a universally unique role for each attribute, the URSA model requires this uniqueness of role only within a limited set, called closure, of entities. Querying in URSA is based on the UR query paradigm. Its referencing scheme therefore forces a QUEL-type and an SQL-type syntax. This may render it unsuitable for end-users in general. Peckham et al. (1996) propose a DB design paradigm that abstracts the relationship semantics of application conceptual data models and uses this as a predictor of query and update paths.

Peckham et al. show that association roles in semantic schemas define connection paths between objects, and these connections can be used to enable data manipulation. The URSA study shows that the semantics of the association among schema entities can be used to ensure the semantic correctness of queries. CQL extends these ideas by showing that the connection paths have meanings that are derived from the semantic meaning of the association roles, and that the path-meanings can be used to determine and select the correct paths of abbreviated conceptual queries.

The Intuitive System³ defines a very intuitive architecture for information retrieval. The system is aimed at end-user interaction with heterogeneous DBs, but is generic enough for non-heterogeneous and single DB scenarios. The point-and-click mode of request formulation in Intuitive presents an ER schema to the user, who can then specify a query by selecting the subschema defining the desired query path. Interesting similarities and differences between Intuitive and CQL exist here: With Intuitive, to formulate the query on persons who appear in an interview, the user highlights the entities “Person” and “Interview” and the relationship “appears_in” linking the two entities. In CQL this is specified as “Person appears_in Interview”. For more complex queries involving longer paths, the Intuitive user still highlights the entire path on the schema; the CQL user does not. Intuitive supports multimedia data, text retrieval from documents, and exploratory search of hypertexts. Intuitive is a much more comprehensive system than CQL, which is currently narrowly focused on the manipulation of data in a single DB via a semantic data model.

ConQuer-II (Bloesch and Halpin, 1997) is a commercial concept-based query language based on the Object-Role Modeling (ORM) paradigm (Halpin, 1995, 1996; Halpin and Orlowska, 1992; Halpin and Proper, 1995a,b). ORM models applications in terms of the semantic roles played by objects and entities in relationships. ConQuer-II allows queries to be formulated via paths through the conceptual schema. The query paths are constructed from the semantic roles of objects and entities. Data manipulation in the system proposed in Mannino and Shapiro (1990), i.e., the Graph Model, involves finding a path from a set of starting nodes through possible intermediate nodes and edges to a set of terminating nodes. In the Graph Model, entities are conceived of as graph nodes and link semantic types as graph edges. Query formulation involves graphically selecting a set of source and target nodes, then drawing a set of edges between the selected sets of nodes, and finally specifying for each node a set of data retrieval criteria. Users select each node and edge on the graph path between the source and the target. Graphs are manually manipulated until the desired query is obtained. Although CQL adopts these basic ideas, it extends them by requiring users to specify only the end-point (i.e., starting and terminating) entities and relationship roles. The CQL system automatically deduces the correct intermediate nodes to use on a given query path.

Vizla (Berztiss, 1993) is a visual query language interface for the information control prototyping language SF (Berztiss, 1986). In Vizla, a database is abstracted as a collection of sets (entities) and functions that map from this collection of sets to auxiliary sets (attributes). Queries are formulated in Vizla by pointing to representations of functions, their domains and codomains, or subsets of the domains and codomains, and to various operators in a conceptual model of a database. The items selected in this way are displayed and assembled graphically in a workspace, or window, similar to the query formulation workspace in CQL.

The workspace concept is used in Vizla to reduce the cognitive burden query formulation imposes on end-users. It achieves this by allowing users to separate querying into sequences of small steps, save intermediate results of such sequences, and combine the intermediate results into final results. Ad hoc queries can therefore be formulated and processed in this manner. This is an approach that we feel can be adopted, with certain modifications, by CQL to facilitate query formulation. The query-formulation-by-pointing approach in Vizla could be tedious and unappealing for complex queries with long query paths through the schema. This is because users must point to the entire paths of queries and all the functions and operators needed for computation of the paths. The abbreviated querying approach in CQL cuts down on the number of operations that users must perform and, thereby, improves on the Vizla approach.

3 P. Rosengren, P. Kool, S. Paulsson, U. Wingstedt, Available from <http://www.sisu.se/oldprojects/intuitive/intuitive.html>.

Vizla is a full-fledged, self-standing query language. On the other hand, CQL in its current prototype stage is a front-end to an underlying full-fledged query language like SQL. In addition to its use as a query language, Vizla is also designed to function as a programming language. For this reason, it is aimed at being at least as expressive as a general programming language. As expected, it is therefore more expressive than CQL. CQL can thus additionally benefit from the work on Vizla as we develop CQL further in its interface design and expressive power.

Usually, multiple paths exist for a query specified against a database schema. An approach to dealing with multiple paths is through abbreviated queries. In abbreviated queries users mention only a subset of the objects/entities of interest. The system then interprets the query formulation, i.e., finds the necessary connections between the objects. In the path-finding approach in Wu and Ichikawa (1992), the query-induced map of the database schema is first pruned to make the graph less bushy. Then the shortest path in the pruned subgraph is selected. CQL adopts a similar approach in its path construction and selection algorithm. In Czejdo et al. (1990), all possible paths are displayed; the user then selects a particular path. CQL does not display all possible paths. Instead, from the set of all candidate paths, it selects and displays the natural language transcription of only the minimum cost path.
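The idea of selecting a minimum-cost path between two schema entities can be illustrated with a small sketch. This is not the algorithm of the CQL prototype or of Wu and Ichikawa (1992): the schema fragment, the unit edge costs, and the function name `min_cost_path` are assumptions made purely for illustration, and the search used is plain Dijkstra.

```python
import heapq

# Hypothetical fragment of a conceptual schema as an undirected weighted graph.
# Entity abbreviations (STD, CR, SEC, C, T) follow the paper; the unit costs
# and the exact set of edges are assumptions for this sketch.
SCHEMA = {
    "STD": {"CR": 1, "T": 1},
    "CR": {"STD": 1, "C": 1},
    "T": {"STD": 1, "SEC": 1, "C": 1},
    "SEC": {"T": 1, "C": 1},
    "C": {"CR": 1, "SEC": 1, "T": 1},
}


def min_cost_path(graph, source, target):
    """Return (cost, path) for a cheapest path, or None if none exists."""
    queue = [(0, [source])]  # priority queue of (cost so far, path so far)
    settled = set()
    while queue:
        cost, path = heapq.heappop(queue)
        node = path[-1]
        if node == target:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbour, weight in graph[node].items():
            if neighbour not in settled:
                heapq.heappush(queue, (cost + weight, path + [neighbour]))
    return None


result = min_cost_path(SCHEMA, "STD", "SEC")
```

Cost alone does not settle which path is meant when several paths tie, which is why the role semantics discussed later in the paper are needed in addition to a cost criterion.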

2.2. Comparative studies on database query language interfaces

A number of studies have been conducted either to motivate the development of concept-based query languages or to demonstrate their superiority over other query paradigms. A discussion on comparative studies arguing for concept-based query languages can be found in Chan et al. (1993) and Siau et al. (1995). Both studies compare SQL and the concept-based DBQL Knowledge Query Language (KQL) (Chan, 1989). They show that users of concept-based query languages outperform SQL users: Irrespective of time, the KQL users performed better than their SQL counterparts, with respect to query accuracy, query formulation time, and user confidence. Additional empirical studies suggesting the superiority of concept-based data retrieval approaches over other query approaches can be found in Batra (1993), Batra et al. (1990), Batra and Sein (1994), Jarvenpaa and Machesky (1989) and Owei et al. (1997b). All of these studies point to the need for alternative query paradigms. The abbreviated concept-based approach in CQL clearly offers one such alternative.

In Chan et al. (1998) an empirical study is conducted to investigate the effect of entity-relationship versus relational models, and textual versus visual query languages for end-user database interfaces. For the relational model, SQL and QBE are used as the textual and visual query languages respectively. For the E–R model, KQL (Chan, 1989) and VKQL (Siau et al., 1991, 1992) are used as the respective textual and visual query languages. The study showed that in general users achieve better performance, in terms of accuracy, confidence and time, with a higher level or a visual query language. That is, in general KQL outperformed SQL and VKQL outperformed QBE.

In Jih et al. (1989), user performances with a conceptual data model (the ER model) and the relational data model were compared. Both groups of subjects used SQL for query formulation. The results showed that the relational group made fewer syntactic errors, but took longer time than the ER group. The difference in semantic accuracy of the queries in the two groups was not significant.

As the survey here indicates, although concept-based DBQLs directly use concepts on conceptual schemas, the few existing ones still require users to know and state in one form or other the semantics of the entire path of the query to be specified. In this respect, we believe that they do not exploit the full potential of conceptual schemas and thereby unduly limit themselves. With CQL, we remove the requirement. We believe that this represents an added reduction of the cognitive load on end-users; we expect this reduced cognitive burden to manifest in their performance with and perceptions about CQL. In a later section, we conduct an experiment to test this belief and expectation. In the following section, however, we examine the need for the abbreviated querying approach in CQL.

3. The necessity for abbreviated concept-based query languages

3.1. Problems with existing relational query languages

Current commercially popular DBMSs use linear keyword languages (LKL) like the Structured Query Language (SQL) or graphical (or visual) languages like Query-By-Example (QBE). With linear keyword languages like SQL, query formulation requires exact syntactic specification of the SQL SELECT-FROM-WHERE clauses in such a manner that the resulting specification is semantically correct. Existing graphical and visual languages like QBE are based on drag-and-drop or point-and-click approaches. These languages partly ameliorate the cognitive workload demanded by LKLs by providing user-friendly example-based visual interfaces. These languages require users to explicitly mention all the tables needed by the system to solve the problem. Furthermore, in LKL and QBE systems the user must also specify query paths. This explicit navigation is a major source of difficulty for a typical end-user. The query formulation approach in CQL does not suffer from these two problems: the user neither has to mention every table needed nor has to specify query paths. CQL is, therefore, particularly suitable for business and administrative end-users who, generally speaking, are not programmers.

Query specification in LKLs like SQL and in QBE systems is based on joins defined either during data definition or during query formulation. Partly as a result of this, query formulation with these languages becomes a cognitively demanding task for the typical DB end-user, with this burden increasing as the complexity of the required query increases. Recent QBE implementations like ACCESS are able to perform automatic joins once the tables to be joined have been specified by the user. This requires the joins to have been defined as “relationships” during table creation. Where needed joins are not defined, the system can “suggest” possible joins to the user. The domain types of attributes can be used for this task. The system is unable to select the joins for the user. The ability to select definite joins is tantamount to specifying a particular query path; this requires the use of meta-knowledge about the schema in the form of the meaning of a query path to ensure the semantic correctness of the selected path. Such meta-knowledge is lacking in existing LKL and QBE systems. In CQL, this meta-knowledge is provided by the semantics of the roles played by schema entities in relationships.

3.2. Problems with existing concept-based query languages

Concept-based or conceptual query interfaces reduce the cognitive load in querying DBs by allowing users to directly use constructs from conceptual schemas (Batra, 1993; Batra et al., 1990; Batra and Sein, 1994; Chan, 1989; Owei, 1994; Siau et al., 1995). It is our view that the following must hold for a true concept-based query language:

1. Query specification must be done against the DB conceptual schema.

2. The direct use of the semantic abstractions, such as entities, relationships or roles, on the conceptual schema must be supported.

3. Query formulation must be essentially declarative and at most require only a minimal amount of manual or mental schema navigation by the end-user.

As exemplified in Chan et al. (1993), instead of specifying the relational condition “where s.sno = sp.sno and sp.pno = p.pno”, concept-based interfaces would allow for a more natural specification like “where S supplies P”. The CQL approach provides additional enhancement to this in the form of built-in meta-knowledge that is used to automatically determine the intermediate entities between S and P.
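The contrast between the two specifications can be made concrete with a small sketch of how stored meta-knowledge might expand a conceptual predicate into join conditions. The dictionary layout and the function name `expand_predicate` are hypothetical and serve only as illustration; the table aliases and column names are those of the supplier-parts condition quoted above.

```python
# Hypothetical meta-knowledge store: each conceptual relationship is mapped to
# the join conditions that realise it at the relational level. The structure is
# an assumption for this sketch, not the CQL prototype's representation.
RELATIONSHIPS = {
    ("S", "supplies", "P"): [("s.sno", "sp.sno"), ("sp.pno", "p.pno")],
}


def expand_predicate(source, role, target):
    """Translate a conceptual predicate into a relational WHERE condition."""
    joins = RELATIONSHIPS[(source, role, target)]
    return " and ".join(f"{left} = {right}" for left, right in joins)


condition = expand_predicate("S", "supplies", "P")
```

Here the intermediate table (aliased `sp`) never appears in what the user writes; it is supplied entirely by the stored mapping.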

Since the mid-1980s, but more so in the 1990s, a number of approaches using meta-knowledge about DB schemas to enhance the facility of end-users in query formulation have been proposed (for example, Anderson et al., 1991; Chan et al., 1993; Chang and Sciore, 1992; Markowitz and Shoshani, 1989; Peckham et al., 1996; Siau et al., 1995). The recent prototypical approaches in Chan et al. (1993), Chang and Sciore (1992), Peckham et al. (1996) and Siau et al. (1995) elevate query formulation from the logical level to the conceptual schema level by supporting the direct use of concepts and abstractions on conceptual schema in query statements. Query formulation can be further facilitated in these systems by reducing the cognitive workload entailed by their use. One way this can be achieved is through minimizing what is required to be specified by the end-user and then allowing the system to use schema meta-knowledge to determine and select a semantically correct query. As mentioned above, CQL is based on this approach. We therefore state the theme of the study as follows: Given the rich semantics of semantic data models like the E–R model, we aim to exploit the semantic information in these models to support abbreviated concept-based queries, thereby reducing the cognitive load faced by end-users in query formulation and facilitating their ability to formulate queries.

3.3. Meaning of query abbreviation in CQL

In comparison to LKL and QBE queries, conceptual queries in CQL are highly abbreviated. The main problem with abbreviated queries is to derive the corresponding semantically correct full queries (Markowitz and Shoshani, 1989). This concern naturally carries over to conceptual queries. We illustrate with Fig. 1, which is a semantically constrained entity-relationship diagram⁴ (SCERD) of a university department.

4 SCERD contains other constructs that are used for updates. These have been left out in Fig. 1, since they are not pertinent to the discussion here.


In SCERD, entity types on the schema bear explicitly named relationships, or associations, among themselves. Each relationship has a semantic meaning. Double-headed arrows are used in a SCERD to indicate that the entities at both heads of the arrows have a direct semantic relationship, and the arrow-heads are labeled with the roles played, e.g., Is-a, Has or Advises, by the entities in specific relationships. The association semantics of the relationships involving entities are constrained by the roles the entities play in the particular relationship. In SCERD, the semantics of the links between entities, therefore, lie in the form of roles. CQL supports the direct use of SCERD constructs in query formulation.

For the sake of brevity (and convenience), we use the following abbreviations for our schema entities where necessary:

• STD = Student
• CR = Course-Registration
• SEC = Section
• C = Course
• T = Teacher
• SEK = Secretary

Example. Suppose the following query is posed and specified on Fig. 1.

Query 1: What course(s) is student Marshall taking such that some sections are taught by associate professor Jones?

An abbreviated CQL formulation of this query would require the user to specify only the stated entities Student, Teacher and Course, along with a set of selection predicates on these entities. The system is then required to chart one or more paths through the conceptual schema from Student and Teacher to Course. We refer to such paths as derived paths. In addition to path derivation, the system must also be capable of performing any needed operations, e.g., conjunction or disjunction, on the derived paths. In this case, the meaning of the desired query demands that the sub-paths $\mathrm{Student} \rightarrow \cdots \rightarrow \mathrm{Course}$ and $\mathrm{Teacher} \rightarrow \cdots \rightarrow \mathrm{Course}$ be derived and conjunctively combined, where “$\cdots$” indicates segments of the sub-paths that must be determined by the system. Furthermore, these segments must be such that the meaning of the resulting path is the same as that of the desired query. Clearly, the sub-path $\mathrm{STD}|_{\text{enrolled-in}} \rightarrow \mathrm{CR}|_{\text{is-enrollment-for}} \rightarrow \mathrm{C}$ is semantically correct. In this notation $|_{\text{enrolled-in}}$ and $|_{\text{is-enrollment-for}}$ denote the semantic roles, respectively, played by the Student and the Course-Registration entities on that path. The path derives its semantics from the totality of the semantics of the roles played by all the entities on the path.

Fig. 1. Semantically constrained E–R diagram of a University Department Schema.

An examination of Fig. 1 shows that multiple paths exist between Student and Course and also between Teacher and Course. What complicates the problem here is that not all the paths have the same meaning. For example, the sub-path $\mathrm{STD}|_{\text{advised-by}} \rightarrow \mathrm{T}|_{\text{can-teach}} \rightarrow \mathrm{C}$, i.e., the sub-path leading from Student to Course via Teacher, deals with the advisor–advisee relationship, and not with students taking classes. It would be semantically incorrect for the system to include this sub-path in constructing the query path.
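The path ambiguity just described can be reproduced with a short sketch that enumerates role-labelled paths and keeps only those beginning with a stated source role. Only edges named in the text are listed, so this is not the full Fig. 1 schema; the role name `is-section-of` and both function names are hypothetical, and the filtering rule is a simplification for illustration rather than the CQL system's actual procedure.

```python
# Role-labelled edges as (entity, role, entity). Only edges mentioned in the
# text are included. "is-section-of" is an assumed role name for the
# Section-to-Course direction.
EDGES = [
    ("STD", "enrolled-in", "CR"),
    ("CR", "is-enrollment-for", "C"),
    ("STD", "advised-by", "T"),
    ("T", "can-teach", "C"),
    ("T", "teaches", "SEC"),
    ("SEC", "is-section-of", "C"),
]


def candidate_paths(source, target, visited=()):
    """Yield every simple path from source to target as a list of edges."""
    if source == target:
        yield []
        return
    for edge in EDGES:
        start, _role, end = edge
        if start == source and end not in visited:
            for rest in candidate_paths(end, target, visited + (source,)):
                yield [edge] + rest


def paths_with_source_role(source, role, target):
    """Keep only candidate paths whose first edge carries the given role."""
    return [p for p in candidate_paths(source, target) if p and p[0][1] == role]


all_paths = list(candidate_paths("STD", "C"))
enrolment_paths = paths_with_source_role("STD", "enrolled-in", "C")
```

With these edges, three paths lead from Student to Course, but only one of them starts with the `enrolled-in` role, so the advisor-based paths are excluded.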

The task of the system, then, is twofold: (1) To determine $P_I \subseteq P$ and $P_{II} \subseteq P$ such that for each $p_i \in P_I$ and $p_k \in P_{II}$, $\mathrm{STD} \rightarrow \{P_I\} \rightarrow \mathrm{C}$ and $\mathrm{T} \rightarrow \{P_{II}\} \rightarrow \mathrm{C}$ are semantically correct. In this notation $P$ is the query path and $\{P_I\}$ and $\{P_{II}\}$ are subquery paths. In CQL, meta-knowledge (in the form of the semantics of roles) about the relationships entities participate in is used to resolve this path ambiguity problem. (2) To select, from all the possible, i.e., candidate, paths in (1), a $p_i$ and a $p_k$. We modify and use the path selection algorithm in Puranik (1988) for this.

4. The conceptual query language (CQL)

In CQL, the cognitive burden end-users experience in formulating DB queries is reduced by migrating much of this task to the underlying DBMS. The CQL approach requires only the entities and conditions explicitly mentioned in the query statement to be specified in query formulations.

4.1. CQL syntax

CQL has a simple and straightforward query syntax. A Backus–Naur form (BNF) specification of CQL is given in Owei (1994) and is available upon request. The basic, or canonical, form of a CQL query, $Q$, can be expressed as: $\mathrm{Query} := Q(t_E;\ s_E;\ \{C_{sel};\ C_{sem}\})$.

• $t_E$: the set of targets,

• $s_E$: the set of sources,

• $C_{sel}$: the selection criteria/conditions,

• $C_{sem}$: the semantic relationships (or association roles) between implicit sources and the entities semantically adjacent to them on the conceptual schema.

An implicit source is either a source or a target entity of the query. An implicit target may be the target of the query or an intermediate entity that is neither the source nor the target of the specified query, but lies on the query path.

4.1.1. Query formulation options in CQL

CQL requires at least one semantic relationship to be specified for each source entity. Whether the specification of semantic relationships for target entities is necessary depends on the query being formulated. The two possibilities are illustrated next.

4.1.1.1. Specification of only source semantic roles; no specification of target roles. A user may want to specify only the roles of sources for any of the following reasons:

1. Multiple roles for the target entity exist, but the user believes that all of them are semantically equivalent with respect to the query. In this case, it is irrelevant which one is used.

2. Multiple roles for the target entity exist, and the user wishes to consider all of them, in order to pick the path that is consistent with the meaning of the intended query.

In both of these cases, the different candidate solutions are composed and their natural language explanations are returned to the user, who picks one.

Example. For the specified sources (Student (STD) and Teacher (T)), target (Course (C)), and conditions of Query 1, the CQL formulation in terms of the canonical form is: $\mathrm{Query} = Q\{\mathrm{C};\ (\mathrm{T};\ \mathrm{STD});\ (C_{sel};\ C_{sem})\}$. When the attributes and values of interest are indicated, the expression becomes:

$$\mathrm{Query} = Q\{\mathrm{C}\langle\text{C-name}\rangle;\ (\mathrm{STD}\langle\text{S-name} = \text{‘marshall’}\rangle;\ \mathrm{T}(\langle\text{T-name} = \text{‘jones’}\rangle;\ \langle\text{T-title} = \text{‘associate professor’}\rangle));\ (\text{-};\ \text{STD enrolled-in CR and T teaches SEC})\}$$

In this expanded form, “STD enrolled-in CR” and “T teaches SEC” are the pertinent semantic relationships for the correct query paths.

Note that there is no requirement that Marshall must be enrolled in Jones’ section. Therefore, the edges Course-Registration – Course and Section – Course need to be considered by the system in computing the solution. For this reason, no semantic roles involving the target, Course, have been or need to be specified. As discussed below, under the second option for query formulation, a specification of a semantic role for a target forces the system to converge on the target entity only through the specified edge and to ignore any other edge terminating in the target.


4.1.1.2. Specification of source and target semantic roles. This is the case where the user specifies the target role that is consistent with the intended query. Only the chosen target edge is considered in computing the solution. This is a special case of the first option, where a qualification criterion is imposed on the multiple edges to select only a subset of the edges. In this case, all the composed candidate solutions contain the specified target roles. Here, too, the natural language explanations of the solutions are returned to the user to pick one.

Example. Suppose Query 1 is modified as in Query 2 and then specified on Fig. 1.

Query 2: What course(s) is student Marshall taking from associate professor Jones?

The CQL formulation is: Query = Q{C⟨C-name⟩, (STD⟨S-name = ‘marshall’⟩, T(⟨T-name = ‘jones’⟩, ⟨T-title = ‘associate professor’⟩)), (-, STD enrolled-in CR and T teaches SEC, Course consists-of Section)}.

Because Course consists-of Section is specified in this case, only the solutions containing this edge are considered by the system. Forcing all the paths to the target to contain this edge achieves the necessary condition that student Marshall is registered in the section of the course taught by Jones.
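The difference between the two formulation options can be made concrete with a small data-structure sketch. The code below is only an illustration under stated assumptions: the class names `EntityRef` and `CQLQuery`, the field names, and the encoding of a semantic relationship as a (subject, role, object) triple are hypothetical and are not part of CQL. The entities, attributes and roles are those of Query 1 and Query 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityRef:
    """Hypothetical holder for an entity named in a query."""
    name: str
    display: tuple = ()      # attributes to be returned, e.g. ("C-name",)
    conditions: tuple = ()   # (attribute, value) selection pairs


@dataclass(frozen=True)
class CQLQuery:
    """Hypothetical encoding of the canonical form Q{targets, sources, (Csel, Csem)}."""
    targets: tuple
    sources: tuple
    c_sel: tuple = ()        # extra selection conditions ("-" in the examples)
    c_sem: tuple = ()        # semantic relationships as (subject, role, object)

    def target_roles(self):
        """Return the semantic relationships that involve a target entity."""
        target_names = {t.name for t in self.targets}
        return [r for r in self.c_sem
                if r[0] in target_names or r[2] in target_names]


STUDENT = EntityRef("Student", conditions=(("S-name", "marshall"),))
TEACHER = EntityRef("Teacher", conditions=(("T-name", "jones"),
                                           ("T-title", "associate professor")))
COURSE = EntityRef("Course", display=("C-name",))

SOURCE_ROLES = (("Student", "enrolled-in", "Course-Registration"),
                ("Teacher", "teaches", "Section"))

# Query 1: only source roles are specified (first option).
query1 = CQLQuery((COURSE,), (STUDENT, TEACHER), c_sem=SOURCE_ROLES)

# Query 2: a target role is added (second option).
query2 = CQLQuery((COURSE,), (STUDENT, TEACHER),
                  c_sem=SOURCE_ROLES + (("Course", "consists-of", "Section"),))
```

In this sketch, an empty result from `target_roles()` corresponds to the first option, in which every edge terminating in the target remains a candidate; a non-empty result corresponds to the second option, in which only the named target edge is considered.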

The CQL formulation is then passed to the underlying DBMS to determine the set of candidate paths and to select a path. As already mentioned, in performing these tasks, the CQL system uses semantic information about the schema. This information is in the form of the semantic roles played by schema entities in their relationships with other entities. Next, we describe how users specify queries and how they are aided in this task by the system.

4.2. CQL user interface

The notation Query := Q(tE, sE, {Csel, Csem}) is an abstraction of the CQL query input interface presented to the end-user for query formulation. In formulating a query, the user is aided by a computer-supported query formulation system, which we illustrate next.

4.2.1. Query formulation surrogate system (QFSS)

CQL uses a query formulation surrogate system (QFSS) to help users tailor their queries to the target DB. QFSS provides information that enables users to reduce what Batra (1993) describes as users’ anomalous state of knowledge. This is achieved by reducing the semantic gap between user query formulations and the logical and conceptual states of the DB. In using CQL, users can familiarize themselves with the conceptual aspects of the target DB through an interaction with QFSS. QFSS provides users with helpful information on the schema concepts and constructs that they must be familiar with in order to be able to formulate queries. To use QFSS, the user clicks on the item about which information is needed. A window is then opened showing information on the clicked item. Information on the following CQL constructs and concepts is provided.

Selection operators: These are the operations and functions supported by CQL. They are used in the selection conditions of queries. Arithmetic and logical operators (+, −, <, =, etc.) and aggregate functions (sum, count, average, etc.) belong in this group.

Entity item: Clicking on a schema entity opens up a window showing the attributes of the entity, other entities semantically associated with the clicked entity, and the semantic roles the entity plays in its relationships with other entities.

Attribute list: This list shows the attributes of all the schema entities. Clicking on an attribute on the list opens an attribute window showing the entities having the clicked attribute, the pairs of entities related via the attribute, and the semantic relationships involving the entity.

Semantic role list: The semantic roles of the schema entities are contained on this list. If a role on this list is clicked, a window opens to show the pairs of entities related through the clicked role, and the semantic roles involving the pair of entities.

Fig. 2. CQL query screen.

Fig. 3. QFSS help window on schema entities.

In providing the query formulation surrogate system, the intention is to provide inexperienced or novice users with help to ease the task of formulating queries. The use of this help system is not mandatory. Expert or knowledgeable users can bypass it and directly formulate their queries. In the rest of the section, we provide a brief illustration of the use of the QFSS in query formulation.
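The kind of lookup behind the QFSS entity and semantic-role windows can be sketched as follows. This is not the QFSS implementation; the dictionaries, the function names `entity_window` and `role_window`, and the choice of schema fragment are assumptions made for illustration. Only the entities, attributes and roles named in the example queries are included, and the attribute lists for Course-Registration and Section are left empty because they are not given here.

```python
# Hypothetical schema metadata, limited to what the example queries mention.
SCHEMA_ATTRIBUTES = {
    "Student": ["S-name"],
    "Teacher": ["T-name", "T-title"],
    "Course": ["C-name"],
    "Course-Registration": [],
    "Section": [],
}

# Semantic roles as (subject entity, role, object entity) triples.
SEMANTIC_ROLES = [
    ("Student", "enrolled-in", "Course-Registration"),
    ("Teacher", "teaches", "Section"),
    ("Course", "consists-of", "Section"),
]


def entity_window(entity):
    """Collect what an entity window would show for the clicked entity."""
    roles = [r for r in SEMANTIC_ROLES if entity in (r[0], r[2])]
    associated = sorted({r[2] if r[0] == entity else r[0] for r in roles})
    return {
        "attributes": list(SCHEMA_ATTRIBUTES.get(entity, [])),
        "associated_entities": associated,
        "roles": roles,
    }


def role_window(role_name):
    """Return the pairs of entities related through the clicked role."""
    return [(subj, obj) for subj, role, obj in SEMANTIC_ROLES
            if role == role_name]
```

For instance, `entity_window("Section")` lists Course and Teacher as associated entities, which is the information a novice user would need before naming Section in a semantic relationship.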

4.2.2. Browsing facility of the QFSS

In specifying queries in CQL, users can directly write the query anew on a CQL input interface in a query formulation window (QFW), or workspace. For this mode of interaction, users directly fill the form in Fig. 4. Alternatively, they can compose the query with the help of the QFSS, by clicking on the desired items. The chosen items are written to the input interface in the QFW. In the QFW, the query can then be used as is, if the user believes it is semantically equivalent to the desired query, or modified as necessary. To facilitate the latter mode of query specification via the QFSS, the QFSS windows are hyperlinked to support navigation between windows. In the following, we illustrate the use of the QFSS with our example query.

Fig. 2 shows an example of the CQL query screen.⁵ To use the help of the QFSS, the user clicks the upper button in Fig. 2. This action takes the user to the top screen of Fig. 3. The user can request help on schema information by clicking one or more of the buttons on this screen. Clicking on the “Entities” button, for example, pops up the listing of entities in the bottom screen of Fig. 3. For our example query, one of the entities to be selected is Student. On clicking this entity, the descriptive attributes of the entity are displayed (see Fig. 3). The user then picks the desired attributes of Student. All the selected items are written to the input interface shown in Fig. 4. The rest of the form can be filled similarly. After query formulation, the query is processed as we describe next.

Fig. 4. CQL query formulation interface.

⁵ All the screen dumps illustrated in this paper are initial simulation versions. The screens are currently undergoing re-design along with CQL for production use.

4.3. Query processing strategy in CQL

Fig. 5 shows the conceptual query processing strategy in CQL. Each module of this figure is an abstraction for a particular concept in CQL. The main modules are briefly discussed in this section.

Request specification: Queries are input through the CQL input interface as already described.

Fig. 5. Conceptual design of CQL.

Fig. 6. Sub-query paths from the unpacking of Query 1: (a) sub-query path S1; (b) sub-query path S2; (c) sub-query path S3; (d) sub-query path S4.

Fig. 7. Logical combination of sub-query paths into solution schema for Query 1: (a) candidate solution Schema 1 (S2.S4); (b) candidate solution Schema 2 (S1.S3); (c) candidate solution Schema 3 (S2.S3); (d) candidate solution Schema 4 (S1.S4).

Fig. 8. Natural language transcription of Query 1 using a template filling technique.

4.3.1. CQL query processing architecture

The portion of Fig. 5 after request specification constitutes the conceptual architecture for query processing in CQL. It is here that specified queries are interpreted. These processing stages are illustrated with Query 1.

4.3.1.1. Set Handler. CQL transcribes specified queries into a set-specification form. The set of sources, targets and conditions is written to templates consisting of a group of semantically typed slots or place-holders for sources, targets, and conditions (selection and semantic). The Set Handler is an abstraction for the CQL input form; its internal representation is that of an intermediate set form. The set representation makes it amenable to set theoretical treatment. The purpose of the Set Handler is to decompose specified queries into sub-queries, which are then used to derive sub-query paths. Because the Set Handler deals more with the internal aspects of CQL query processing, we provide only a brief discussion of it here. The interested reader is referred to Owei (1994) for details.

The Set Handler performs two functions: (1) set-packing and (2) set-unpacking.

4.3.1.2. Set-packing. Set-packing extracts the specified sources, targets and conditions, and writes them to the query template. This task is performed by the set-packer.

Example. For Query 1, we already showed that the template is instantiated as: S = Q{C⟨C-name⟩, (STD⟨S-name = ‘marshall’⟩, T(⟨T-name = ‘jones’⟩, ⟨T-title = ‘associate professor’⟩)), (-, STD enrolled-in CR and T teaches SEC)}. According to this, the set-packer identifies and extracts Course as the target, Student and Teacher as the set of sources, and STD enrolled-in CR and T teaches SEC as the semantic relationships.

Set-unpacking: Set-unpacking uses set-theoretic operations to fragment the query written to the set-packer into its subqueries.
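The two Set Handler functions can be sketched as below. The actual set-packing and set-unpacking operations are documented in Owei (1994) and are not reproduced here, so this code rests on an assumption: that unpacking yields one sub-query per (source, target) pair, each carrying only the conditions and semantic relationships that mention its source. The function names and the tuple encodings are likewise hypothetical.

```python
def set_pack(targets, sources, c_sel, c_sem):
    """Write the parts of a query into typed slots held as sets."""
    return {
        "targets": frozenset(targets),
        "sources": frozenset(sources),
        "c_sel": frozenset(c_sel),   # (entity, attribute, value) triples
        "c_sem": frozenset(c_sem),   # (subject, role, object) triples
    }


def set_unpack(packed):
    """Fragment a packed query into sub-queries (assumed: one per source/target pair)."""
    subqueries = []
    for source in sorted(packed["sources"]):
        for target in sorted(packed["targets"]):
            subqueries.append({
                "source": source,
                "target": target,
                # Keep only the selection conditions placed on this source.
                "c_sel": frozenset(c for c in packed["c_sel"] if c[0] == source),
                # Keep only the semantic relationships involving this source.
                "c_sem": frozenset(r for r in packed["c_sem"]
                                   if source in (r[0], r[2])),
            })
    return subqueries


QUERY1 = set_pack(
    targets=["Course"],
    sources=["Student", "Teacher"],
    c_sel=[("Student", "S-name", "marshall"),
           ("Teacher", "T-name", "jones"),
           ("Teacher", "T-title", "associate professor")],
    c_sem=[("Student", "enrolled-in", "Course-Registration"),
           ("Teacher", "teaches", "Section")],
)
SUBQUERIES = set_unpack(QUERY1)
```

Under this assumption Query 1 yields two sub-queries, Student to Course and Teacher to Course, each of which the Pathfinder can then expand into alternative sub-paths.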

4.3.1.3. Pathfinder. This component is used in CQL to weed from the schema those paths that are not semantically equivalent to the specified query. Computationally, this effectively prunes the search space prior to query path construction. The function of the Pathfinder is twofold: (1) to construct the graphical path corresponding to each subquery, and (2) to combine the sub-paths into a set of candidate solution schema.

Path construction: The sub-paths constructed for example Query 1 are S1, S2, S3 and S4, shown in Fig. 6. These sub-paths are combined conjunctively to form each of the four candidate solution schema shown in Fig. 7. (The portions of the solution schema encased in masks in Fig. 7 correspond to those elements that are not specified in the query formulation, but semantically belong to the query paths, as already discussed.)

Path selection: This function selects a set of candidate solution schema as the chosen solution schema. We use a simple path selection algorithm for demonstration purposes. The objective is to choose, starting with the minimum-cost candidate solution schema, a maximum of N solution schema, where N is an integer whose value can be preset during application implementation or selected by the user during query formulation. For prototype purposes, we adopt the former option and set N to 6. The cost criterion used in our prototype is the total number of edges (or arcs) in a solution schema. If $C_{j,k}$ denotes the cost of candidate solution schema Sj.Sk, then by our cost criterion, it is seen from Fig. 7 that $C_{1,3} = 4$, $C_{2,3} = 4$, $C_{1,4} = 4$ and $C_{2,4} = 6$. Each of the solution schema S1.S3, S2.S3 and S1.S4, therefore, qualifies as a minimum cost solution schema. Because the four candidates do not exceed N = 6, the system chooses all of them for Query 1: S1.S3, S1.S4, S2.S3 and S2.S4.

For Query 2, only S2 and S3 satisfy the restriction imposed by the ⟨Course consists-of Section⟩ semantic relationship. Therefore, the only candidate solution schema generated for this query is S2.S3.
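The path selection rule just described can be sketched in a few lines. The edge counts are those read from Fig. 7 and N = 6 is the prototype setting; the function name `select_solutions`, the dictionary encoding of candidates, and the tie-breaking by sub-path index are assumptions of this sketch rather than details of the CQL prototype.

```python
def select_solutions(costs, n_max=6, allowed_subpaths=None):
    """Pick up to n_max candidate solution schema, cheapest first.

    costs maps a pair (j, k), standing for Sj.Sk, to its number of edges.
    allowed_subpaths, if given, keeps only candidates built entirely from
    those sub-paths (used here to mimic a target-role restriction).
    """
    candidates = list(costs.items())
    if allowed_subpaths is not None:
        allowed = set(allowed_subpaths)
        candidates = [(pair, cost) for pair, cost in candidates
                      if set(pair) <= allowed]
    # Sort by cost, then by sub-path indices so ties are ordered predictably.
    ranked = sorted(candidates, key=lambda item: (item[1], item[0]))
    return [pair for pair, _cost in ranked[:n_max]]


# Edge counts of the four candidates for Query 1, as read from Fig. 7.
QUERY1_COSTS = {(1, 3): 4, (2, 3): 4, (1, 4): 4, (2, 4): 6}

query1_choice = select_solutions(QUERY1_COSTS)
query2_choice = select_solutions(QUERY1_COSTS, allowed_subpaths={2, 3})
```

With these inputs, `query1_choice` contains all four candidates, because four is below the limit of six, and `query2_choice` contains only S2.S3. Lowering `n_max` to 1 would return a single minimum-cost schema.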

4.3.1.4. Pseudo-natural language explainer. User validation is essential in abbreviated queries, to ensure the semantic correctness of constructed queries. In CQL, the semantic roles played by entities in relationships are not only used for unambiguous query formulation, they are also used by the system to construct pseudo-natural language explanations of queries. The system-constructed explanations are returned to the user for validation. As suggested in Bono and Ficorilli (1992) and Dalianis (1997), to facilitate legibility, the system does not generate lengthy sentences.⁶ According to Wald and Sorenson (1990), Gulla (1996) and Dalianis (1997), this is the recommended and an effective way of obtaining a good compromise between natural language and readability, focus and relevance. The form of the pseudo-natural language statements constructed by the CQL system, therefore, supports ease of understanding. A BNF for CQL’s natural language system and the natural language generation algorithm are given in Owei (1994) and are available upon request. For the selected solution schema of Query 1 and Query 2,⁷ Figs. 8 and 9 show the system-generated natural language transcriptions. These system explanations are then returned to the user for validation.

Query validation: Validation of CQL query formulations involves checking for semantic equivalence, or consistency, between the formulated query and the system-generated explanation. The user simply reads the system-generated explanation and, where it is believed that the explanation is semantically consistent with the intended query, the user validates it. It is thereafter executed. Otherwise, it is modified as necessary before execution.

Fig. 9. Natural language transcription of Query 2.

Fig. 10. SQL statement for Query 1.

⁶ To facilitate understanding of long query paths, CQL can theoretically provide the user with the option of eliding the masked portions of the path (as in Fig. 7) from the NL explanation.

⁷ Since only one NL explanation is returned, the user is not asked to pick an option.

4.3.1.5. Request execution. For the validated pseudo-NL query restatement, an SQL statement is generated by the system. The SQL statement is executed by an underlying relational DBMS (in the current implementation of CQL) and the answer returned to the user. The algorithm for mapping CQL to SQL can be found in Owei (1994) and is also available upon request.

Example. For Query 1, the generated SQL statement corresponding to the semantics of the query explanation of Fig. 8 is shown in Fig. 10.

Beyond linear path queries like the example discussed above, we allow for the support of other types of queries, e.g., iterative queries. We next discuss the strategy for handling iterative queries in CQL.

5. Iteration, image maps and the ancestor problem in CQL

An issue in abbreviated concept-based queries deals with the ability of the system to handle iterative queries. In an iterative query, a single entity is repeatedly accessed to retrieve some data.

When a query is specified as iterative in CQL, the mechanism operates as follows: Each time the entity $E_i$ is iteratively referenced, it is renamed as a new entity $E_{i,m}$ by a replication and renaming operation, where $m = 1, 2, 3, \ldots$ and $E_{i,m}$ is the $m$th reference of $E_i$. For example, if $n$ iterative references to $E_i$ on path $P_r$ are made, then $P_r$ is generated as the path $E_S \to \cdots \to E_{i,1} \to E_{i,2} \to \cdots \to E_{i,(n-1)} \to E_{i,n} \to \cdots \to E_T$.

Iteration in CQL is achieved through the REPEAT(n) function. The number of iterations to be performed on the specified operation is given by n. When n is not specified or is set to zero (0), no iteration occurs. This is a default case. In an iterative query, the source entity is the same as the target entity. Given the entity PERSON(P_name, P_child, P_parent, ...), suppose the following query is specified.

Example. Find the name of the third ancestor of Person X.

This query can be formulated in CQL as follows:

Target attribute: ⟨P-parent⟩

Target: ⟨PERSON⟩

Source attribute: ⟨P-name⟩

Source attribute value: ⟨P-name = PER X⟩

Source: ⟨PERSON⟩

Selection conditions: ⟨REPEAT(3) PERSON⟩

Semantic relationship conditions: nil

Fig. 11 shows the path for this formulation of the query. Note that the PERSON entity is replicated four times in the graph. However, the first “iteration” corresponds to n = 0, as indicated in the figure. But, as discussed above, this is not an actual iteration. Therefore, there are actually three iterations, namely, for n = 1, n = 2 and n = 3. In Fig. 11, note that at iteration level n = 1, Person X.P_parent = Person X-Parent.P_name, and similarly for n = 2 and n = 3.

Algorithm Ancestor/within-Entity in Owei (1994) translates CQL iterative queries into SQL join query formulations. For our ancestor example above, the resulting SQL statement is

SELECT T3.P_name
FROM PERSON T1 T2 T3
WHERE PERSON.P_name = T1.P_parent
AND T1.P_name = T2.P_parent
AND T2.P_name = T3.P_parent
AND PERSON.P_name = 'Per X';
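A runnable version of a three-level ancestor self-join can be tried with an in-memory SQLite database, as sketched below. Several things here are assumptions made for the sketch and are not the output of Algorithm Ancestor/within-Entity: the FROM clause names every alias explicitly so that SQLite accepts it, the join direction follows the relationship stated for Fig. 11 (a person's P_parent equals the parent's P_name), each person is assumed to have a single recorded parent, and the sample rows are invented.

```python
import sqlite3

# Three-level self-join; T1, T2 and T3 are the first, second and third ancestor.
THIRD_ANCESTOR_SQL = """
SELECT T3.P_name
FROM PERSON P, PERSON T1, PERSON T2, PERSON T3
WHERE P.P_parent = T1.P_name
  AND T1.P_parent = T2.P_name
  AND T2.P_parent = T3.P_name
  AND P.P_name = ?
"""

# Invented sample data: Per X -> Per A -> Per B -> Per C.
SAMPLE_ROWS = [
    ("Per X", "Per A"),
    ("Per A", "Per B"),
    ("Per B", "Per C"),
    ("Per C", None),
]


def third_ancestor(rows, person):
    """Load rows into a throwaway PERSON table and run the self-join."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE PERSON (P_name TEXT PRIMARY KEY, P_parent TEXT)")
        conn.executemany("INSERT INTO PERSON VALUES (?, ?)", rows)
        result = conn.execute(THIRD_ANCESTOR_SQL, (person,)).fetchall()
    finally:
        conn.close()
    return [row[0] for row in result]
```

Calling `third_ancestor(SAMPLE_ROWS, "Per X")` walks three parent links from Per X. A person with fewer than three recorded ancestors produces an empty result, because the inner joins find no matching row.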

Suppose we wish to formulate the query “Find all the ancestors (up to the fourth ancestor) of Person X.” The formulation is essentially the same as the one above, with the following difference in the selection statement: Selection conditions: ⟨REPEAT(n) PERSON for n = 1, 2, 3, 4⟩. This is equivalent to the statement: Selection conditions: ⟨REPEAT(n) PERSON for all 0 < n ≤ 4⟩.

5.1. A naïve strategy to generalized recursion

By applying other constraints on the iteration variable n, we can deal with a very special case of recursive queries. For example, the COUNT aggregate function can be used for cases where there is need to iterate to all levels, but the deepest level is not known. This problem is handled in two steps in CQL: (1) first determine the deepest level, and (2) set the iteration variable to the determined number of levels and then iterate. The selection condition for this case is: ⟨REPEAT(n) PERSON, where n = COUNT(PERSON.P_name)⟩. COUNT(PERSON.P_name) returns a value equal to the number of tuples with non-empty key values for PERSON. REPEAT(n) therefore iterates to all levels.

Fig. 11. Derived query path of an iterative query in CQL (shows three levels).

The mapping to the logical level produces the following SQL statement:

SELECT Tn.P_name
FROM PERSON T1 T2 ... Tn
WHERE PERSON.P_name = T1.P_parent
AND T1.P_name = T2.P_parent
AND T2.P_name = T3.P_parent
AND ...
AND Tn−1.P_name = Tn.P_parent
AND PERSON.P_name = 'Per X';

The algorithm for this mapping is not supported in the current implementation of the CQL prototype.

Clearly, the count expression in the REPEAT(n) function only gives the total number of distinct values, and does not measure the true deepest nesting level. The value of n therefore represents an upper bound on the possible levels of nesting, and this may be too high for the desired depth of nesting. Additionally, since our mapping strategy considers n copies of Table T, we realize that it may be too naïve for very large databases with deep nesting.
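The gap between the COUNT-based upper bound and the true nesting depth can be seen in a small experiment. The sketch below is an assumption-laden illustration, not the unsupported mapping algorithm: `build_ancestor_sql` is a hypothetical generator of an n-level self-join (with explicit aliases and the parent-to-name join direction described for Fig. 11), the data are invented, and the loop simply tries every level from 1 up to COUNT(P_name).

```python
import sqlite3


def build_ancestor_sql(n):
    """Generate an n-level self-join; T0 is the starting person."""
    if n < 1:
        raise ValueError("n must be at least 1")
    from_clause = ", ".join(f"PERSON T{i}" for i in range(n + 1))
    joins = [f"T{i}.P_parent = T{i + 1}.P_name" for i in range(n)]
    where = " AND ".join(joins + ["T0.P_name = ?"])
    return f"SELECT T{n}.P_name FROM {from_clause} WHERE {where}"


def ancestors_by_level(conn, person):
    """Step 1: take COUNT as the bound on n. Step 2: iterate up to it."""
    upper = conn.execute("SELECT COUNT(P_name) FROM PERSON").fetchone()[0]
    found = {}
    for n in range(1, upper + 1):
        rows = conn.execute(build_ancestor_sql(n), (person,)).fetchall()
        if rows:
            found[n] = rows[0][0]
    return upper, found


def demo():
    """Run the two steps on invented data with one unrelated person."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE PERSON (P_name TEXT PRIMARY KEY, P_parent TEXT)")
    conn.executemany("INSERT INTO PERSON VALUES (?, ?)", [
        ("Per X", "Per A"), ("Per A", "Per B"), ("Per B", "Per C"),
        ("Per C", None), ("Per D", None),
    ])
    result = ancestors_by_level(conn, "Per X")
    conn.close()
    return result
```

In this example the bound is 5 while only three levels return an ancestor, so two of the generated joins do no useful work. Each level also adds one more copy of the table to the join, which is the cost that makes the strategy unattractive for large, deeply nested data.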

Finally, for any finite-size schema, CQL query Q derives a finite length query path. Therefore, every path in the resulting schema of solution paths is finite. Furthermore, each entity is finite in terms of number of attributes. This characteristic of CQL also guarantees that every iterative CQL query is well-behaved and safe.

CQL is capable of far more complex queries than we have illustrated, as the discussion in Owei and Navathe (2001) on the expressive power of CQL suggests. Our concern in this paper, however, is not to present a complete coverage of CQL, but rather an exposition of the CQL approach and its rationale. In the rest of the paper we turn to an experimental evaluation of CQL by end-users.

6. Experimental comparison between CQL and SQL⁸

The research model for the experiment is shown in Fig. 12. According to this model, end-user performance and perception are important factors for evaluating query language interfaces (the independent variables). The model also indicates that these factors are moderated by users’ experience with databases. For this reason, we designed the experiment by controlling for database experience. In our model, the performance variables are accuracy of query formulation and query formulation time, and the perception variables are ease-of-use and user satisfaction.

6.1. Research hypotheses

To compare the performance and perception of end-users with the two different query language interfaces, this study proposes the following research hypotheses on query correctness, query formulation time, and end-user perceptions.

Hypothesis 1 (Query correctness). The group using CQL will formulate queries more accurately than the group using SQL.

As our earlier discussion suggests, CQL makes a lower cognitive demand on users than does SQL. With CQL, users simply state what they seek and what they know. With SQL, there is the additional burden on the users to mentally or manually navigate the DB schema in order not to violate the mandatory order of sequencing of concepts and constructs. The higher cognitive load imposed by SQL translates into greater difficulty. We, therefore, expect the CQL subjects to perform better than the SQL subjects.

Fig. 12. End-user performance and perception research model.

⁸ This is only the first of a set of experiments we hope to conduct. In a future study, we hope to compare CQL to other concept-based query languages to test the claim that CQL provides further gain on ease-of-use over these other languages. We also plan to compare CQL to QBE, which is also commonly supported by commercial PC-based tools.

Hypothesis 2 (Query formulation time). The subjects using CQL will take less time to formulate queries than those using SQL.

The time taken to perform a given task is indicative of any of the following: the difficulty in using the tool, the degree of friendliness of the tool used, and the ability of the user to use the tool. All of these speak to the tedium in using the tool, which in turn is affected by the cognitive workload in using the tool. Since we have already argued that the interfaces impose different cognitive workloads on users, we expect the query writing times to be different.

Hypothesis 3 (End-user perception).

• H3a: The subjects will perceive CQL as easier to use than SQL.

• H3b: The subjects using CQL will be more satisfied with their query language than the subjects using SQL.

Our rationale for these hypotheses is as follows: A system that demands greater technical knowledge and higher cognitive skills is more difficult to use than one that is less taxing cognitively.

There are different aspects that make the question of user satisfaction an important one. First, we believe that a statement about their satisfaction with the task is an indirect statement about their satisfaction with the tool itself. Second, a sense of satisfaction with a task favorably disposes the user towards using the tool for other similar tasks, and vice versa. We expect that the users’ higher performance and better perception of ease-of-use with CQL will result in their also scoring CQL higher than SQL in overall satisfaction.

6.2. Experimental design

A simple randomized design was used for the experiment. Subjects were randomly assigned to one of the DBQL interface types. Then, each group was exposed to all levels of the query writing tasks.

6.2.1. Subjects

A total of 33 subjects (17 for CQL and 16 for SQL) were recruited for this study from a required undergraduate introductory computer-based information systems course. All participation was voluntary.

6.2.2. Variables

There were three sets of variables in the experiment. These were the independent variables, the control variables and the dependent variables.

6.2.2.1. Independent variables. Since the purpose of the experiment was to compare CQL and SQL, these two query languages represented the set of independent variables. SQL was chosen since it is the underlying query language supported by the generality of implemented relational database management systems. Furthermore, it has become the generally accepted standard for benchmarking relational and other query languages in terms of functionality. The American National Standards Institute (ANSI) views it as the standard language for relational DBMSs. CQL is the proposed query language which is tested against SQL. Indeed, we consider the CQL-SQL comparison apropos. Since CQL currently runs as a front-end to an underlying relational DBMS, the comparisons make the relative advantages of CQL vis-à-vis SQL readily appreciable.

6.2.2.2. Dependent variables. Accuracy of formulation: The main performance variable was the accuracy, or correctness, of the query formulation. It is defined and determined here as the semantic gap between the answer resulting from the formulation and the true, or correct, answer, i.e., the correspondence between the two. The following three possibilities exist here: (1) The resulting answer is completely different from the correct answer. (2) The correct answer is only a part of the resulting answer. In this case, the resulting answer contains other values besides the correct one. (3) The resulting answer and the correct one are one and the same. Only possibility (3) was considered in this experiment as a correct formulation.

Query formulation time: This is the time taken by a subject to formulate a query. Before a subject started working on a given query, the beginning time was recorded. The moment the subject stopped working on the query, the stop time was recorded. For the query, the formulation time was the difference between the stop time and the start time.

Measures of end-user perception: A 5-point Likert scale questionnaire was used to measure end-user perceptions on the ease-of-use of and user-satisfaction with a given query language. Ease-of-use is a measure of users’ perception of the ease with which a given query language interface can be used to formulate queries. User-satisfaction is a measure of the overall satisfaction of the subjects with the given query language.
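The two performance measures lend themselves to a compact sketch. The code below is an illustration only: the function names are hypothetical, answers are treated as sets of values, and any outcome outside the three listed possibilities (for example, a result that misses part of the correct answer) is assigned to possibility (1), which is an assumption of this sketch.

```python
def classify_answer(result, correct):
    """Return 1, 2 or 3 for the three possibilities of formulation accuracy."""
    result_set, correct_set = set(result), set(correct)
    if result_set == correct_set:
        return 3   # the resulting answer and the correct one coincide
    if correct_set < result_set:
        return 2   # the correct answer is only a part of the resulting answer
    return 1       # treated as different from the correct answer (assumption)


def is_correct_formulation(result, correct):
    """Only possibility (3) counts as a correct formulation."""
    return classify_answer(result, correct) == 3


def formulation_time(start_time, stop_time):
    """Formulation time is the stop time minus the start time."""
    if stop_time < start_time:
        raise ValueError("stop_time must not precede start_time")
    return stop_time - start_time
```

Scoring under this rule is strict: a formulation that returns the correct rows plus extra ones falls under possibility (2) and is still counted as incorrect.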


6.2.2.3. Control variables. A single control variable was used in this experiment: experience with databases. The subjects were selected based on their database experience, especially with exposure to query languages. Since this study focuses on ascertaining end-users’ relative performance with and perception of CQL and SQL, we used subjects who had little or no prior database and query language experience. With end-user computing so prevalent in organizations today, the profile of a typical DB user in organizations is likely to fit these characteristics. A pre-test questionnaire was used to screen the subjects.

6.2.3. Task

The queries to be formulated by the subjects were of three difficulty levels: low-difficulty, medium-difficulty and high-difficulty. Task difficulty was determined as per Shaw (1981) and Gallupe et al. (1988) by the degree of cognitive load, and Bell and Rowe (1992) by the complexity of the joins, i.e., by the number of database tables to be joined, and the complexity of the selection criteria. The following classification scheme, which is similar to that in Brosey and Shneiderman (1978) and Bell and Rowe (1992), was used:

• Low-difficulty: One or two tables to be joined and simple selection criteria expression.

• Medium-difficulty: One or two tables to be joined and complex selection criteria expression.

• High-difficulty: More than two tables to be joined and complex selection criteria expression.
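The classification scheme above can be written as a small decision function. This is a sketch: the function name and the boolean flag for selection-criteria complexity are hypothetical, and the combination the scheme does not list (more than two tables with a simple selection expression) is returned as `None` rather than guessed.

```python
def classify_difficulty(tables_joined, complex_selection):
    """Map a query to "low", "medium" or "high" per the scheme above.

    tables_joined: number of database tables the query must join.
    complex_selection: True if the selection criteria expression is complex.
    Returns None for the combination the scheme does not cover.
    """
    if tables_joined < 1:
        raise ValueError("a query involves at least one table")
    if tables_joined <= 2:
        return "medium" if complex_selection else "low"
    if complex_selection:
        return "high"
    return None   # more than two tables with simple selection: not listed
```

How many tables each experimental query touches depends on the relational schema given to the subjects, so the inputs to this function would have to be read off that schema.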

Three queries were used in the main experiment. The queries are:

• Query 1: Find the names of secretaries who currently do not work for Professor Erickson.

• Query 2: List the names and student-numbers of the students who are taking all the courses having the name “Database”.

• Query 3: List the names and student-numbers of the students who have taken either operations research courses taught by teachers whose titles are higher than assistant professor or management science courses taught by a professor.

These queries cover all three levels of complexity. Typically a few more queries will be used. For example, in Chan et al. (1998) eight queries were used and in Bell and Rowe (1992) seven were used. However, since our study was meant as an initial and exploratory one, we limited the number to three. We also took into consideration the fact that our subjects were composed of business majors who are more likely to pose only a few queries on application DBs at a given time and are likely to be easily fatigued by having to perform many queries in a continuous, non-intermittent session.

6.2.4. Procedure

To ensure that the subjects acquired an acceptable working level of competence in the use of the query languages, training, comprising lectures and worked-out examples, was given to the subjects in each group. The examples were the same for the two query language groups. As part of the training for each group, practice problems were assigned. Thus, training across the query languages was tightly controlled and held constant. Each subject was trained only in the query language to which he or she was assigned. To ensure that the subjects practiced on the homework problems, the assigned practice problems were collected, graded and discussed. Training was continued until all the subjects were fully trained. This is a common approach found in other experimental studies in the literature (e.g., Batra et al., 1990; Chan et al., 1998; Greenblatt and Waxman, 1978; Jarvenpaa and Machesky, 1986). As part of the training, the subjects were trained in the use of the query formulation interfaces to which they were assigned. The experiment was conducted thereafter.

Although the subjects worked independently and at individual pace, they all worked simultaneously at the same session. Each subject worked on only one query at a time. The queries were assigned in a sequential order, i.e., the subjects started with Query 1, then Query 2, and finally Query 3. Only when a subject indicated that he or she had completed the formulation of a particular query was the next query in line presented to him or her. Each subject was given a hard copy of the DB schema. Since conceptual queries are specified against conceptual schemas, the CQL subjects used the SCERD in Fig. 1. Logical queries are specified against logical schemas. Thus the SQL subjects were given the relational schema in Fig. 13, which is a logical transformation of the SCERD in Fig. 1 into relations.

The performance and perception variables were measured as explained in Section 6.2.2.

7. Results

A t-test was used to determine the significance of the findings on each hypothesis. A series of post hoc analyses were also performed to gain further insight into CQL and SQL.

Fig. 13. Relational equivalent of Fig. 1.

7.1. Hypotheses testing

Hypothesis 1 (Query correctness). The group using CQL will formulate queries more accurately than the group using SQL.

Table 1 shows the average number of correct answers formulated by the subjects. The overall scores show that there were more correct formulations in the CQL group than in the SQL group, but it was found that the difference was not statistically significant (t = 1.114, p = 0.137). When we considered query difficulty level, however, a statistically significant difference was obtained between CQL and SQL for high difficulty level queries. For these queries the groups with CQL outperformed the groups with SQL (t = 3.31, p < 0.001). The mean score of the CQL subjects was 0.41 and that of the SQL subjects was 0.02. The difference for low and medium difficulty level queries is seen to be insignificant.

These results are consistent with what we would logically expect. Less difficult queries demand lower levels of cognitive skills to perform the task. Therefore the performance difference between the two groups for such tasks is not significant. For more complex queries, the cognitive workload becomes more demanding. However, as we argued earlier, CQL shields the user from much of this complexity more than SQL does. It is, therefore, consistent with our argument that the CQL group did significantly better than the SQL group. This suggests that for typical end-users, CQL is a better interface than SQL for complex queries. The finding here is consistent with that in Chan et al. (1998).

Hypothesis 2 (Query formulation time). The subjects using CQL will take less time to formulate queries than those using SQL.

As shown in Table 2, irrespective of the difficulty level of the query, groups with CQL took significantly less time than groups with SQL. Thus, this hypothesis was supported. Given our earlier argument on the relationship between time spent on a task and fatigue, as well as that between tedium and the willingness and ability of users to use a tool, these results suggest that the use of CQL is less likely to induce fatigue and tedium in end-users. It can, therefore, be expected that end-users would be more willing and able to use CQL than SQL.

Hypothesis 3 (End-user perception). Cook and Campbell (1979) state that instrument validation should precede other core empirical validation. Nine questions on the survey questionnaire probed respondents’ reaction to using CQL (SQL) as a database query language. Principal component analysis indicated that two underlying components, ease-of-use and satisfaction, can be derived from the nine items measuring end-user perception. A summary of the responses associated with these nine items appears in Tables 3 and 4. The reliability of the measuring instrument was also assessed. The Cronbach alpha values for these factors ranged from 0.80 to 0.93, well above the 0.70 level that is considered adequate for behavioral research (Nunnally, 1978). For our study, the questions used in the questionnaire were adapted from previous studies (Batra et al., 1990; Higa, 1988).
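For readers unfamiliar with the reliability measure, the following is a minimal sketch of how Cronbach's alpha is computed from item-level responses, using the standard formula alpha = k/(k - 1) * (1 - sum of item variances / variance of total scores). The response matrix is made up for illustration and is not the questionnaire data; the function name `cronbach_alpha` is hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])  # number of items on the scale
    # Sample variance of each item across respondents.
    item_vars = [variance([r[i] for r in responses]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(r) for r in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Made-up 5-point Likert responses: 4 respondents x 3 items.
# These are NOT the study's data.
sample = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 4, 3],
]
alpha = cronbach_alpha(sample)
print(f"alpha = {alpha:.3f}")
```

Values closer to 1 indicate that the items vary together, which is the sense in which the reported values of 0.80 to 0.93 exceed the 0.70 threshold.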

• H3a: The subjects will perceive CQL as easier to use than SQL.

• H3b: The subjects using CQL will be more satisfied with their query language than the subjects using SQL.

For each variable, the aggregated mean scores were used for the statistical testing of hypotheses H3a and H3b. As shown in Table 5, there were statistically significant differences between the two database query languages. The CQL subjects perceived their query language to be easier to use than their SQL counterparts perceived SQL to be; they also felt more satisfied with CQL than the SQL subjects were with SQL. These differences were more pronounced when query-difficulty level was considered. For queries of high difficulty, there was a wider gap in the perceptions of ease-of-use between the two subject groups.

The scores indicate that users are more likely to be satisfied with CQL than with SQL. Given the findings in McKeen et al. (1994) that user influence is positively related to user satisfaction, and in Zmud (1979) that consistently “positive associations have been observed between MIS usage and MIS satisfaction”, it can be argued that users would be more likely to use CQL than SQL.

Table 1

Mean scores of query correctness

| Query difficulty level | Database query language: CQL | Database query language: SQL |
|---|---|---|
| Low | 0.45 | 0.46 |
| Medium | 0.32 | 0.38 |
| High* | 0.41 | 0.02 |
| Overall mean | 0.393 | 0.278 |

* p < 0.001.

Table 2

Mean scores of query writing time (minutes)

| Query difficulty level | Database query language: CQL | Database query language: SQL |
|---|---|---|
| Low* | 3.51 | 5.84 |
| Medium* | 4.62 | 6.22 |
| High* | 6.21 | 8.56 |
| Overall mean* | 4.78 | 6.87 |

* p < 0.05.

8. Discussion

A further analysis was conducted to reveal the error types the subjects committed in their query formulations. This analysis provided further insight into the sources of difficulty in the use of these languages.

A large proportion of SQL users committed errors related to identifying intermediate or join tables. This suggests that the users may have had difficulty navigating the schema to identify the tables needed for queries. A large proportion of these subjects also experienced problems with the syntax of SQL join statements. A high fraction of users committed one or more forms of error related to formulating nested sub-queries. This indicates a major difficulty users have in using SQL to formulate complex queries.
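The two SQL error sources named here, intermediate join tables and nested sub-queries, can be illustrated with a small sketch. The three-table schema (`teacher`, `course`, `teaches`) and its rows are hypothetical and are not the experimental database; the example uses Python's standard `sqlite3` module only so that the queries can be run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE teacher (teacher_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT);
    -- Intermediate (join) table that links teachers to courses.
    CREATE TABLE teaches (teacher_id INTEGER, course_id INTEGER);
    INSERT INTO teacher VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edgar');
    INSERT INTO course  VALUES (10, 'Databases'), (20, 'Compilers');
    INSERT INTO teaches VALUES (1, 10), (2, 10), (2, 20);
""")

# Two-step join: the user must know that `teaches` sits between the
# two entity tables and must spell out both join conditions.
taught_by = [row[0] for row in conn.execute("""
    SELECT t.name
    FROM teacher AS t
    JOIN teaches AS x ON x.teacher_id = t.teacher_id
    JOIN course  AS c ON c.course_id  = x.course_id
    WHERE c.title = 'Databases'
    ORDER BY t.name
""")]

# Nested sub-query: teachers who teach no course at all.
idle = [row[0] for row in conn.execute("""
    SELECT name FROM teacher
    WHERE teacher_id NOT IN (SELECT teacher_id FROM teaches)
""")]

print(taught_by, idle)
```

In both queries the user has to supply navigation details (the linking table, the join keys, the nesting) that are not part of the question being asked, which is consistent with the error types observed.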

A main point of the analysis of error types was to illuminate, and thereby elicit, the types of difficulty users might have with abbreviated concept-based query languages. The major problem the subjects had with CQL was in identifying and specifying semantic relationships (such as Course is-taught-by Teacher) and selection conditions. The existence of these difficulties reveals a potential source of cognitive load on the class of end-users. While the cause of these difficulties was not ascertained by the experiment, it would be interesting and insightful to ascertain whether the problems would persist as users become more familiar with conceptual schema constructs and concepts, either through more extensive training or more frequent use. This notwithstanding, the realization of this potential difficulty should inform, and be incorporated into, the development of abbreviated concept-based query languages. Future studies should therefore be conducted to study this problem more thoroughly.

Table 3

Items measuring perceptions on ease-of-use

| Question | Mean (S.D.), SQL | Mean (S.D.), CQL | Factor loading, SQL | Factor loading, CQL |
|---|---|---|---|---|
| I found CQL (SQL) cumbersome to use (1 = strongly agree, 5 = strongly disagree) | 1.25 (0.17) | 2.18 (0.18) | 0.81 | 0.64 |
| Using CQL (SQL) was frustrating | 1.25 (0.19) | 2.12 (0.23) | 0.72 | 0.85 |
| Using CQL (SQL) required a lot of mental effort | 1.25 (0.19) | 2.00 (0.19) | 0.72 | 0.69 |
| CQL (SQL) was clear and understandable to me (1 = strongly disagree, 5 = strongly agree) | 2.50 (0.20) | 2.71 (0.22) | 0.84 | 0.59 |
| Overall, I found CQL (SQL) easy to use | 2.44 (0.22) | 3.00 (0.19) | 0.80 | 0.84 |
| Based on your experience, CQL (SQL) was (1 = very difficult to learn, 5 = very easy to learn) | 2.94 (0.36) | 4.29 (0.34) | 0.85 | 0.64 |
| % Variance explained | | | 79.4 | 70.7 |
| Eigenvalue | | | 4.76 | 4.24 |
| Cronbach alpha | | | 0.93 | 0.90 |

Table 4

Items measuring satisfaction

| Question | Mean (S.D.), SQL | Mean (S.D.), CQL | Factor loading, SQL | Factor loading, CQL |
|---|---|---|---|---|
| I feel confident about using CQL (SQL) for database querying | 2.37 (1.26) | 3.65 (1.37) | 0.88 | 0.76 |
| If faced with a similar task in the future, would you use CQL (SQL)? (1 = absolutely no, 5 = absolutely yes) | 2.44 (1.29) | 4.29 (1.40) | 0.53 | 0.69 |
| How satisfied were you with CQL (SQL)? (1 = extremely unsatisfied, 5 = extremely satisfied) | 2.16 (0.89) | 3.35 (1.14) | 0.82 | 0.82 |
| % Variance explained | | | 74.1 | 75.7 |
| Eigenvalue | | | 2.22 | 2.27 |
| Cronbach alpha | | | 0.80 | 0.84 |

Table 5

Scores of end-user perception

| Measure | SQL, mean (S.D.) | CQL, mean (S.D.) |
|---|---|---|
| Ease-of-use* | 1.94 (0.79) | 2.72 (0.77) |
| Satisfaction* | 2.32 (1.24) | 3.76 (1.46) |

* p < 0.01.
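As a reading aid for Tables 3 and 4, the following sketch checks how the reported eigenvalues relate to the reported percentages of variance explained. It assumes the principal component analysis was run on standardized items, so that the total variance equals the number of items; the text does not state this, and the function name `variance_explained` is hypothetical.

```python
def variance_explained(eigenvalue, n_items):
    """Percent of total variance captured by one principal component.

    Assumes standardized items, so total variance equals n_items.
    """
    return 100.0 * eigenvalue / n_items


# (eigenvalue, number of items, reported % variance) from Tables 3 and 4.
reported = {
    "ease-of-use, SQL": (4.76, 6, 79.4),
    "ease-of-use, CQL": (4.24, 6, 70.7),
    "satisfaction, SQL": (2.22, 3, 74.1),
    "satisfaction, CQL": (2.27, 3, 75.7),
}

for label, (eig, k, pct) in reported.items():
    computed = variance_explained(eig, k)
    print(f"{label}: computed {computed:.1f}%, reported {pct}%")
```

Under that assumption the computed and reported percentages agree to within about 0.1 percentage point, which is the size of discrepancy expected from eigenvalues rounded to two decimals.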

An adjunct but important observation from the error-type comparisons is the realization that a query considered difficult to formulate in SQL is not necessarily so in CQL, and vice versa. We believe that this difference in difficulty is related to the functionality supported by each query language and the extent to which the language eases the cognitive load on the user. This suggests that the classification of queries into difficulty levels may be a function of the query language as well as of the query itself. This is an issue that we also consider worthy of further study by query language developers.

9. Conclusion

This study proposed an abbreviated concept-based query language that allows for the conceptual abstraction of database (DB) queries and exploits the rich semantics of semantic data models to ease and facilitate query formulation. The proposed CQL is aimed at making minimal demands on end-users’ cognitive knowledge of DB technology. CQL uses the roles played by entities in semantic data models to render transparent the technical complexity of existing DB query languages. Semantic roles are also used to automatically construct query graphs and pseudo-natural language explanations of queries, and to generate SQL code. The whole approach is fully implemented and statistically validated.

In this study, CQL was compared with SQL as a suitable database query language for non-expert end-users. The CQL subjects, on average, formulated the queries more accurately than their SQL counterparts. The superiority of CQL over SQL for end-users was more pronounced the greater the complexity of the queries. This appeared to have induced a more favorable feeling of ease-of-use and of satisfaction on the part of the CQL subjects than on the part of the SQL subjects. The results of the main experiment provided empirical support for the claim that CQL is better suited than SQL as a query language interface for non-expert database users. However, as we mentioned above, the results here should be viewed as preliminary because of the exploratory nature of the study. Further studies using a larger number of queries are needed to confirm and consolidate the findings here.

The main experimental results called into question the classification of queries into difficulty levels. If the difficulty level of a query is determined by how well users can formulate the query in a given query language, then a lack of uniformity in classification is to be expected. A query that is considered difficult in SQL may not necessarily be so in CQL, and vice versa. It therefore seems that a query’s level of difficulty is dependent not only on the query, but also on the query language used. We would like to investigate this issue further in future studies.

Issues of concern that need to be considered in the design of a query language for the class of typical end-users were highlighted by the study. The results showed that users may not perform well with, or view favorably, languages that tax their cognitive skills. Query language interfaces must render transparent to the end-user what is technically arcane.

The study also provided valuable insights not solely for the further development of CQL, but for other concept-based query languages as well. Since the use of such languages requires knowledge of conceptual schema constructs, the difficulty our subjects had with expressing semantic relationships and selection conditions with CQL may well serve to alert developers of such languages to take cognizance of this in the design of the languages.

We have implemented a CQL prototype, which currently runs as a front-end to an underlying relational DBMS. Although CQL was originally developed to be used by secretaries and administrators for DB querying and report generation in the Administration Information Management Systems at the Georgia Institute of Technology,⁹ it is sufficiently powerful to support expert users (see Owei and Navathe, 2001 for a discussion on the expressive power of CQL).

References

Anderson, M., Shin, Doug-Guk, 1991. Integrating an intelligent interface with a relational database for two way man–machine communication. In: Proc. IEEE/ACM Int’l Conf. on Developing and Managing Expert System Programs, Washington, DC, September 30 – October 2, 1991, pp. 4–11.

Batra, D., 1993. A framework for studying human error behavior in conceptual database modeling. Information and Management 25, 121–131.

Batra, D., Hoffer, J.A., Bostrom, R.P., 1990. Comparing representations with relational and EER models. CACM 33 (2), 126–139.

⁹ We acknowledge and are grateful to Art Vandenberg and Christopher Smith of the Office of Information Technology at the Georgia Institute of Technology for their guidance on and contribution to the prototype implementation of CQL. Their participation was indispensable.


Batra, D., Sein, M.K., 1994. Improving conceptual database design through feedback. International Journal of Human-Computer Studies 40, 653–676.

Bell, J., Rowe, L., 1992. An exploratory study of ad hoc query languages to databases. In: Proc. 8th Int’l Conf. on Data Engineering, February 3–7, 1992, pp. 606–613.

Berztiss, A.T., 1986. Data abstraction in the specification of information systems. In: Proc. IFIP World Congress 86, pp. 83–90.

Berztiss, A.T., 1993. The query language vizla. IEEE TKDE 5 (5), 813–825.

Bloesch, A.C., Halpin, T.A., 1996. ConQuer: a conceptual query language. In: Conceptual Modeling – ER’96. Lecture Notes in Computer Science, vol. 1157. Springer, Berlin, pp. 121–133.

Bloesch, A.C., Halpin, T.A., 1997. Conceptual queries using ConQuer-II. In: Proc. ER97: 16th Int’l Conf. on Conceptual Modeling. Springer, Los Angeles, pp. 112–126.

Bono, G., Ficorilli, P., 1992. Natural language restatement of queries expressed in a graphical language. In: Proc. 11th Int’l Conf. on the E–R Approach. Karlsruhe, Germany, October 1992, pp. 357–374.

Brosey, M., Shneiderman, B., 1978. Two experimental comparisons of relational and hierarchical database models. International Journal of Man–Machine Studies 10, 625–637.

Catarci, T., Santucci, G., 1988. Query by diagram: a graphic query system. In: Proc. 7th Int’l Conf. on The E-R Approach, Rome, Italy, November 16–18, 1988, pp. 157–174.

Chan, H.C., 1989. A knowledge level user interface using the ER model. Ph.D. Dissertation, The University of British Columbia, Vancouver, BC.

Chan, H., Siau, K., Wei, K., 1998. The effect of data model, system and task characteristics on user query performance – an empirical study. Data Base 29 (1), 31–49.

Chan, H.C., Wei, K.K., Siau, K.L., 1993. User-database interface: the effect of abstraction levels on query performance. MIS Quarterly 17 (4), 441–464.

Chang, T., Sciore, E., 1992. A universal relation data model with semantic abstractions. IEEE TKDE 4 (1), 23–33.

Cook, T.D., Campbell, D.T., 1979. Quasi-experimental design and analysis issues for field settings. Houghton Mifflin, Boston.

Cronan, T., Douglas, D., 1990. End user training and computing effectiveness in public agencies: an empirical study. JMIS 6 (4), 21–40.

Czejdo, B., Elmasri, R., Embley, D.W., Rusinkiewicz, M., 1990. A graphical data manipulation language for an extended entity-relationship model. Computer 23 (March), 26–36.

Dalianis, H., 1997. Explaining conceptual models – an architecture and design principles. In: Conceptual Modeling ER-97, pp. 214–228.

Gallupe, R., DeSanctis, G., Dickson, G.W., 1988. Computer-based support for group problem-finding: an experimental investigation. MIS Quarterly (June), 277–296.

Greenblatt, D., Waxman, J., 1978. A study of three database query languages. In: Shneiderman, B. (Ed.), Databases: Improving Usability and Representativeness. Academic Press, New York.

Gulla, J.A., 1996. A general explanation component for conceptual modelling in CASE environments. ACM TOIS 14 (2), 297–329.

Halpin, T.A., 1995. Conceptual Schema and Relational Database Design, second ed. Prentice-Hall, Sydney, Australia.

Halpin, T.A., 1996. Business rules and object-role modeling. Database Program and Design 9 (10), 66–72.

Halpin, T.A., Orlowska, M.E., 1992. Fact-oriented modelling for data analysis. Journal of Information Systems 2 (2), 1–23.

Halpin, T.A., Proper, H.A., 1995a. Subtyping and polymorphism in object-role modeling. Data and Knowledge Engineering 15, 251–281.

Halpin, T.A., Proper, H.A., 1995b. Database schema transformation and optimization. In: OOER’95: Object-Oriented and Entity-Relationship Modeling. Lecture Notes in Computer Science, vol. 1021. Springer, Berlin, pp. 191–203.

Higa, K., 1988. End-user logical database design: the structured entity model approach. Ph.D. Thesis, Submitted to the Committee on Business Admin., Graduate College, The University of Arizona.

Jarvenpaa, S.L., Machesky, J.J., 1986. End user learning behavior in data analysis and data modeling tools. In: Proc. 17th Int’l Conf. on Information Systems, San Diego, CA, 1986, pp. 152–167.

Jarvenpaa, S.L., Machesky, J.J., 1989. Data analysis and learning: an experimental study of data modeling tools. International Journal of Man–Machine Studies 31, 367–391.

Jih, W.J., Bradbard, D.A., Snyder, C.A., Thompson, N.G.A., 1989. The effects of relational and ER data models on query performance of end-users. International Journal of Man–Machine Studies 31, 257–267.

Maier, D., Rozenshtein, D., Warren, D.S., 1986. Window Functions. In: Kanellakis, P. (Ed.), Advances in Computing Research. JAI Press, pp. 213–246.

Maier, D., Ullman, J.D., 1983. Maximal object and the semantics of universal relation databases. ACM TODS 8 (1), 1–14.

Mannino, M.V., Shapiro, L.D., 1990. Extensions to query languages for graph traversal problems. IEEE TKDE 2 (3), 353–363.

Markowitz, Shoshani, 1989. Abbreviated query interpretation in EER oriented databases. In: Proc. 8th Int’l. Conf. on E–R Approach, October 18–20, 1989, pp. 325–344.

McKeen, J.D., Guimaraes, T., Wetherbe, J.C., 1994. The relationship between user participation and user satisfaction: an investigation of four contingency factors. MIS Quarterly 18 (4), 427–451.

Nunnally, J.C., 1978. Psychometric Theory, second ed. McGraw-Hill, New York.

Owei, V., 1994. Framework for a conceptual query language for capturing relationship semantics in databases. Ph.D. Dissertation, The Georgia Institute of Technology.

Owei, V., Higa, K., 1994. A paradigm for natural language explanation of database queries: a semantic data model approach. Journal of Database Management (Winter).

Owei, V., Navathe, S.B., 2001. A formal basis for an abbreviated concept-based query language. Data and Knowledge Engineering 36 (2), 109–151.

Owei, V., Navathe, S.B., Rhee, H., 1997a. Natural language query filtration in the conceptual query language. In: Proc. of 30th Hawaii International Conference on System Sciences, Maui, Hawaii, USA, January 1997.

Owei, V., Rhee, H., Navathe, S.B., 1997b. Statistical validation of the CQL approach. Technical Report Georgia Tech-UIC CQL-TR-1.

Peckham, J., Maryanski, F., Demurjian, S., 1996. Towards the correctness and consistency of update semantics in semantic databases. IEEE TKDE 8 (3), 503–507.

Pittges, J., 1995a. Maintaining instance-based constraints for semantic query optimization. In: Proc. 6th IFIP TC-2 Working Conf. on Data Semantics (DS-6), Stone Mountain, Georgia, May 1995.

Pittges, J., 1995b. Metadata view graphs: a framework for query optimization and metadata management. Ph.D. Thesis, Georgia Institute of Technology.

Pittges, J., Mark, L., Navathe, S., 1995. Maintaining semantic and structural metadata in the metadata view graph. In: Proc. 7th Int’l. Conf. on Management of Data (COMAD’95), Pune, India, December 1995.

Puranik, S., 1988. A data definition language for the object-oriented semantic association model and algorithms for intelligent query processing. Master of Science Thesis, Presented to the Graduate School of the University of Florida.

Shaw, M.E., 1981. Group Dynamics: The Psychology of Small Group Behavior, third ed. McGraw-Hill, New York.

Siau, K.L., Chan, H.C., Tan, K.P., 1991. Visual knowledge query language as a front-end to relational systems. In: Proc. 15th Annual Int’l Computer Software and Applications Conf., Tokyo, 1991, pp. 373–378.


Siau, K.L., Chan, H.C., Tan, K.P., 1992. Visual knowledge query language. IEICE Transactions on Information and Systems E75-D (5), 697–703.

Siau, K.L., Chan, H.C., Wei, K.K., 1995. The effects of conceptual and logical interfaces on visual query performance of end users. In: Proc. ICIS, Amsterdam, The Netherlands, December 10–13, 1995, pp. 225–235.

Wald, J., Sorenson, P., 1990. Explaining ambiguity in a formal query language. ACM TODS 15 (2), 125–161.

Wu, X., Ichikawa, T., 1992. KDA: a knowledge-based database assistant with a query guiding facility. IEEE TKDE 4 (5).

Zmud, R.W., 1979. Individual differences and MIS success: a review of the empirical literature. Management Science 25 (10), 966–979.